Silicon Dale

Distributions

Much of statistics - and indeed much of geostatistics - rests on bodies of theory that rely on data following one of a small number of theoretical probability distributions. The best known and most widely used of these is the gaussian or ‘normal' distribution. This is the familiar symmetrical bell- shaped curve, with positive and negative tails stretching to infinity in each direction. The next most commonly used in geology is the lognormal distribution, in which the logarithms of data values have a normal distribution: thus the data values themselves have a skewed distribution in which - in the density plot - the negative tail goes to zero at the origin (since the logarithm value - corresponds to a data value of 0) while the positive tail again stretches to infinity. Probability distributions themselves are ideal models fitted to observed frequency distributions obtained from sample data and commonly expressed in the form of histograms or cumulative frequency plots.

But do such statistical distribution models have any place in the earth sciences ? For a probability distribution to be a meaningful model of the data, we must be sure that we have a suitable data set. We need first to define the homogeneous population which we are sampling. We then need a set of representative independent observations drawn from that population. The type of distribution model which we fit must be appropriate to the known properties of the population.

Let us take a simple example: a set of iron analyses of samples from a high-grade iron-ore deposit. We shall not complicate it at this stage by considering spatial properties: let us assume that the samples are taken purely at random from a homogeneous part of the deposit. If we plot a histogram of grades we find that it is symmetrical and bell-shaped, with a peak at say 55% Fe. Conventional wisdom tells us that this is a normal (gaussian) distribution, and indeed we can obtain a good fit from a gaussian model. However, we need to look a little more closely at the geology. A normal distribution has tails which extend to infinity in both directions. The iron grades however are constrained. You cannot have less than 0% of iron in a sample. Nor can you have greater than 100%. In fact the upper limit is lower than that, because in a haematite deposit there are always three oxygen atoms to every two iron atoms, so the maximum possible iron grade is therefore 69.94%. As long as the variation in the deposit is fairly low (the histogram peak is narrow, the tails are low, and the computed standard deviation is small) the normal distribution remains quite a good model to use. But if the grade is very variable - or if the mean grade is high enough to be close to the upper limit of 69.94% - the model is no longer viable. We are indeed left without any well-known standard distribution model to fall back on.

Let us look at another example. Here we have a gold deposit in which the grades vary from a low background in most parts to patchy high values in a central region. If we plot a histogram of all the grades we may find we have a nice positively skewed plot that looks a bit like the lognormal distribution. If we plot a cumulative frequency curve on log-probability paper we may find that this match is ‘confirmed' by a straight line plot. Or we may find that by adding an extra parameter to the data before plotting we can obtain a nice straight line. Does this mean that the data are lognormally distributed ? Well, possibly. But if we look more closely at the population, we shall usually see that it is far from homogeneous. There are clearly identifiable ‘background' areas and other ‘patchy' or ‘high-grade' areas. If we take data locally from such different areas, we are likely to find that local histograms look nothing like the nice global histogram. Indeed, if we look more closely at the global data set, more often than not we spot breaks in slope in the cumulative curve. These spoil the nice model of lognormality. So what do we do ? One very common answer is to cut the highest grades - above the break in slope - and force them back to the ‘ideal' lognormal line. I discussed that in a recent issue of ESCA (Procrustes and the Golden Fleece, ESCA vol.16 no.9, May 2001) and showed why this practice is to be deplored. Another answer often proposed by geostatisticians is to divide the data set into homogeneous zones. Really this should be done on a geological basis (different rock types, fault-bounded blocks, or whatever) - and if such subdivision can be done then of course it should be, in the interest of obtaining homogeneous populations. If zones are defined purely on the basis of grade, however, as often seems to be done (for example ‘background', ‘low grade', ‘high grade') the result is not the desired homogeneous populations but a set of arbitrary populations each of which contains truncated data distributions - not at all the desired result.

If a physical surface is sampled - for example topographic elevation, or a geological boundary surface - then the resultant data set will inevitably be bounded both upwards and downwards. Let us assume that the surface is sampled regularly (obtaining histograms from clustered samples of autocorrelated data is a recipe for disaster!). If the surface has no regional trend or drift, then its peaks and troughs define the bounds of the distribution, and the actual form of the distribution will then depend crucially on the form of the surface. For example, a surface consisting of rounded hills and V-shaped valleys will give an elevation histogram which appears negatively skewed because more data points will be near hill-top elevations than in the valleys. If the surface has a trend, then the bounds could well be dependent simply on the data points near the outer boundary of the sampled area. In any case, the distribution as shown by a histogram will have a direct relationship to the data but will probably be nothing like any standard statistical model.

Thus we can see that the introduction of spatial considerations actually makes the situation very much more complex. The acceptance in geostatistics that there is correlation between adjacent observations means that you cannot obtain a set of mutually independent observations. It also means that unless you have strict stationarity you cannot define a distribution model for the global population - and local distributions are likely to vary in both type and parameters: the distribution of an area of background values may be lognormal or even approach normal, while the distribution of an area containing high grade samples is likely to have a very high positive skewness and be very poorly fitted even by a lognormal model.

In some cases statistical methods used may be robust to moderate deviations from ideal normal distributions. However, this is not always the case. For example, it is known that the use of many statistics based on the lognormal model (including lognormal kriging) is highly sensitive even to small discrepancies from the model. The real data distributions in geology are in any case likely to be very different from the idealised normal and lognormal models, and in fact, because of the constraints imposed by geological processes, may not match any of the classical statistical distribution models. Perhaps the best way to identify suitable distribution models is to carry out geological process simulation (a very different thing from geostatistical conditional simulation) and obtain notional distributions subject to the natural geological constraints.

Another option - accepting that real geological data distributions are likely to be complex and non-ideal - is to use robust, distribution-free, and nonparametric statistical methods wherever possible.

Stephen Henley
Matlock, England
steve@silicondale.co.uk
www.SiliconDale.com

Copyright © 2001 Stephen Henley
Distributions: Earth Science Computer Applications,v.17,no.2,p.1-2