

cating a perfect match of data to theory. Nor do we expect χ² values that are extremely large, indicating a huge mismatch between the observed and predicted distributions.

The χ² test, like a histogram, can use any data units and almost any binning interval, with the same proviso that a fine binning interval is most appropriate when N is large. Yet some χ² tests are much easier than others, because of the need to calculate a predicted number of points for each interval. Here we will take the preliminary step of standardizing the data. Standardization transforms each measurement xᵢ into a unitless measurement which we will call zᵢ, where zᵢ = (xᵢ − X)/σ. Standardized data have a mean of zero and a standard deviation of one, and any standardized array of approximately normally distributed data can be plotted on the same histogram. If we use a binning interval of 0.5σ, then the following table of areas under a normal distribution gives us the expected frequency [N f(n) = N × area] in each interval.
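The standardization step can be sketched in a few lines of Python. This is only an illustration: the function name `standardize` and the sample measurements are made up, and the use of the sample standard deviation (N − 1 denominator) is an assumption, since the text does not say which form of σ is intended.

```python
import statistics

def standardize(xs):
    """Return z_i = (x_i - mean) / sd for each measurement x_i."""
    mean = statistics.fmean(xs)
    # Assumption: sigma is the sample standard deviation (N - 1 denominator).
    sd = statistics.stdev(xs)
    return [(x - mean) / sd for x in xs]

# Made-up measurements, in arbitrary units.
data = [9.8, 10.1, 10.4, 9.9, 10.3, 10.0, 9.7, 10.2]
z = standardize(data)

# The standardized values are unitless, with mean 0 and standard deviation 1.
print(round(statistics.fmean(z), 6), round(statistics.stdev(z), 6))
```

Because the z values are unitless, the same bin edges (in units of σ) can then be reused for any data set.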

Table 2: Areas of intervals of the normal distribution [Dixon and Massey, 1969]
σ interval (negative)   σ interval (positive)   Area
<-3                     >3                      0.0013
-3 to -2.5              2.5 to 3                0.0049
-2.5 to -2              2 to 2.5                0.0166
-2 to -1.5              1.5 to 2                0.0440
-1.5 to -1              1 to 1.5                0.0919
-1 to -0.5              0.5 to 1                0.1498
-0.5 to 0.0             0.0 to 0.5              0.1915
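The areas in Table 2 can be checked against the cumulative normal distribution. The sketch below uses the standard identity Φ(z) = ½[1 + erf(z/√2)]; the names `normal_cdf`, `edges`, and `areas` are illustrative. The computed areas agree with the tabulated ones to within about 0.0001, the small differences reflecting rounding in the printed table.

```python
import math

def normal_cdf(z):
    """Cumulative standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Edges of the negative-side intervals, in units of sigma.
edges = [-3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.0]

# First entry is the tail below -3; the rest are differences of the CDF.
areas = [normal_cdf(edges[0])]
areas += [normal_cdf(b) - normal_cdf(a) for a, b in zip(edges, edges[1:])]

table_2 = [0.0013, 0.0049, 0.0166, 0.0440, 0.0919, 0.1498, 0.1915]
for computed, tabulated in zip(areas, table_2):
    print(f"{computed:.4f}  {tabulated:.4f}")
```

By symmetry the same seven areas apply to the positive intervals, so the 14 areas sum to 1.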

Equation 3 is applied to these 14 intervals, comparing the expected frequencies to the observed frequencies of the standardized data. Note that the intervals can be of unequal width. If the number of data points is small (e.g., N<20), one should reduce the 14 intervals (n=14) to 8 intervals by combining adjacent intervals of Table 2 [e.g., f(n) for 2σ to 3σ is 0.0166 + 0.0049 = 0.0215]. The following table shows the probabilities of obtaining a value of χ² larger than the indicated amounts, for n=14 or n=8. Most statistics books have much more extensive tables of χ² values for a variety of ‘degrees of freedom’ (df). When such tables are used to compare a sample distribution to a Gaussian distribution that is estimated from the data rather than known independently, df = n − 2, as in Table 3.
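A minimal sketch of this procedure follows. Equation 3 is not reproduced on this page, so the code assumes it is the usual form χ² = Σ (observed − expected)² / expected; all function and variable names are hypothetical, and the demonstration data are randomly generated rather than real measurements. The `merge_to_8` helper is one reading of the 14-to-8 reduction that is consistent with the 2σ to 3σ example: the two tails are kept and the twelve 0.5σ intervals are paired into six 1σ intervals.

```python
import bisect
import random

# Table 2 areas from the < -3 tail up to 0, mirrored for the positive side.
HALF_AREAS = [0.0013, 0.0049, 0.0166, 0.0440, 0.0919, 0.1498, 0.1915]
AREAS_14 = HALF_AREAS + HALF_AREAS[::-1]

# Thirteen edges (-3.0, -2.5, ..., 3.0) separate the 14 intervals.
EDGES = [-3.0 + 0.5 * k for k in range(13)]

def observed_counts(z_values):
    """Count standardized values falling in each of the 14 intervals."""
    counts = [0] * 14
    for z in z_values:
        counts[bisect.bisect_right(EDGES, z)] += 1
    return counts

def chi_square(observed, expected):
    """Assumed form of equation 3: sum of (O - E)^2 / E over intervals."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def merge_to_8(values):
    """Keep the two tails; pair the twelve inner intervals into six."""
    inner = values[1:-1]
    paired = [inner[i] + inner[i + 1] for i in range(0, len(inner), 2)]
    return [values[0]] + paired + [values[-1]]

# Demonstration with made-up, randomly generated standardized values.
rng = random.Random(0)
z_demo = [rng.gauss(0.0, 1.0) for _ in range(200)]
obs = observed_counts(z_demo)
exp = [len(z_demo) * a for a in AREAS_14]
chi2_demo = chi_square(obs, exp)

# For small N, the same comparison on 8 intervals.
chi2_demo_8 = chi_square(merge_to_8(obs), merge_to_8(exp))
```

The same `merge_to_8` call is applied to both the observed counts and the expected frequencies, so the two lists stay aligned interval by interval.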

Table 3. Maximum values of χ² that are expected from a normal distribution for different numbers of binning intervals (n) at various probability levels (P) [Fisher and Yates, 1963].
n     P80     P90     P95     P97.5   P99     P99.5
8     8.56    10.64   12.59   14.45   16.81   18.55
14    15.81   18.55   21.03   23.34   26.22   28.30

For example, for n=14 intervals a χ² value of 22 (calculated from equation 3) would allow one to reject the hypothesis of a normal distribution at the 95% confidence level but not at 97.5% confidence (21.03<22<23.34).
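The worked example can be checked numerically without a table. For an even number of degrees of freedom, df = 2k, the probability of exceeding a given χ² value has the closed form exp(−x/2) Σ (x/2)^j / j!, summed over j = 0 … k−1. The sketch below relies on that identity and on df = n − 2 as stated above; the function name `chi2_exceedance` is hypothetical, and the shortcut does not apply to odd df.

```python
import math

def chi2_exceedance(x, df):
    """P(chi-square > x) for an even number of degrees of freedom."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    terms = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * terms

n = 14                              # number of binning intervals
p = chi2_exceedance(22.0, n - 2)    # df = n - 2 = 12

# p lies between 0.025 and 0.05, so normality is rejected at the 95%
# level but not at the 97.5% level.
print(f"P(chi-square > 22) with df = 12: {p:.4f}")
```

The same function reproduces the P95 entries of the table of maximum χ² values: the exceedance probability is close to 0.05 at 12.59 for n=8 and at 21.03 for n=14.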

A χ² value large enough to indicate non-normality can result from a single histogram bin that has an immense difference between predicted and observed value; it can also result from a consistent pattern of relatively small differences between predicted and observed values. Thus the χ² test only determines whether, not how, the distribution may differ from a normal distribution.
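A small made-up illustration of this point: the two hypothetical sets of observed counts below differ from the same expected counts in very different ways, yet give the same χ². Ten equal bins are used only to keep the arithmetic simple, and the statistic is again assumed to be the usual Σ (observed − expected)² / expected.

```python
def chi_square(observed, expected):
    """Assumed form of the statistic: sum of (O - E)^2 / E over bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

expected = [10] * 10

# Case A: one bin is far too full; the other nine are each one short.
one_bad_bin = [19, 9, 9, 9, 9, 9, 9, 9, 9, 9]

# Case B: every bin is off by three, alternating high and low.
many_small = [13, 7, 13, 7, 13, 7, 13, 7, 13, 7]

chi2_a = chi_square(one_bad_bin, expected)
chi2_b = chi_square(many_small, expected)
print(chi2_a, chi2_b)
```

Both cases give a χ² of about 9, so the statistic alone cannot distinguish them; inspecting the histogram is needed to see how the distribution departs from the prediction.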

Skewness is a measure of how symmetric the data distribution is about its mean. A distribution is positively skewed, or skewed to the right, if data extend substantially farther to the right of the peak than they do to the left. Conversely, a distribution is negatively skewed if data extend substan-