Fitting a linear regression does not imply that the obtained trend is significant. The correlation coefficient (R) measures the degree to which two variables are linearly correlated. We have seen above how to calculate the slope m of what is called the regression of Y on X: Y=mX+b. Conversely, we could calculate the slope m' of the regression of X on Y: X=m'Y+b'. Note that in doing so we abandon the assumption that all of the errors must be in the Yi. If X and Y are not correlated, then m=0 (a horizontal line) and m'=0 (a vertical line), so the product mm'=0. If the correlation is perfect, then m=1/m', or mm'=1. Thus the product mm' provides a unitless measure of the strength of correlation between two variables [Young, 1962]. The correlation coefficient (R) is:

R = Σ(Xi - X̄)(Yi - Ȳ) / √[Σ(Xi - X̄)² · Σ(Yi - Ȳ)²],

where X̄ and Ȳ are the means of the Xi and Yi. Equivalently, R² = mm', with R taking the common sign of m and m'.
The correlation coefficient is always between -1 and 1: R=0 for no correlation, R=-1 for a perfect inverse correlation (i.e., increasing X decreases Y), and R=1 for a perfect positive correlation. What proportion of the total variance in Y is accounted for by the influence of X? R², a number between 0 and 1, gives that fraction.
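These relations are easy to check numerically. The following minimal Python sketch (the data are invented purely for illustration) computes m, m', and R from the definitions above and confirms that mm' = R²:

    import numpy as np

    # Illustrative data: a noisy linear relation between X and Y.
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 10.0, 30)
    y = 2.5 * x + 1.0 + rng.normal(scale=3.0, size=x.size)

    # Sums of squares and cross-products about the means.
    sxx = np.sum((x - x.mean()) ** 2)
    syy = np.sum((y - y.mean()) ** 2)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))

    m = sxy / sxx                  # slope of the regression of Y on X
    m_prime = sxy / syy            # slope of the regression of X on Y
    r = sxy / np.sqrt(sxx * syy)   # correlation coefficient R

    print(f"m = {m:.3f}, m' = {m_prime:.4f}")
    print(f"mm' = {m * m_prime:.4f}, R^2 = {r ** 2:.4f}")  # identical
    print(f"R = {r:.4f}")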

Whether or not the value of R indicates a significant, or non-chance, correlation depends both on R and on N. Table 7 gives 95% and 99% confidence levels for significance of the correlation coefficient. The test is a two-tailed test, in that it indicates how unlikely it is that uncorrelated variables would yield either a positive or a negative R whose absolute value is larger than the tabulated value. For example, linear regression of federal budget deficits versus time gives a high correlation coefficient of R=0.76 (Figure 9C). This pattern of steadily increasing federal budget deficits is significant at >99% confidence; for N=30, the correlation coefficient need only exceed 0.463 to reach the 99% significance level (Table 7).

Table 7: 95% and 99% confidence levels for significance of the correlation coefficient [Fisher and Yates, 1963].
N:        3      4      5      6      7      8      9     10     11     12
R95:  0.997  0.950  0.878  0.811  0.754  0.707  0.666  0.632  0.602  0.576
R99:  1.000  0.990  0.959  0.917  0.874  0.834  0.798  0.765  0.735  0.708

N:       13     14     15     16     17     18     20     22     24     26
R95:  0.553  0.532  0.514  0.497  0.482  0.468  0.444  0.423  0.404  0.388
R99:  0.684  0.661  0.641  0.623  0.606  0.590  0.561  0.537  0.515  0.496

N:       28     30     40     50     60     80    100    250    500   1000
R95:  0.374  0.361  0.312  0.279  0.254  0.220  0.196  0.124  0.088  0.062
R99:  0.479  0.463  0.402  0.361  0.330  0.286  0.256  0.163  0.115  0.081
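The tabulated values follow from the standard two-tailed t test for a correlation coefficient, t = R√(N-2)/√(1-R²) with N-2 degrees of freedom; assuming that is how the Fisher and Yates values were computed, the short Python sketch below (the helper name critical_r is ours) reproduces the entries of Table 7:

    from scipy.stats import t

    def critical_r(n, confidence):
        """Two-tailed critical value of R for n (X, Y) pairs."""
        df = n - 2
        alpha = 1.0 - confidence
        t_crit = t.ppf(1.0 - alpha / 2.0, df)   # critical t value
        # Invert t = R*sqrt(df)/sqrt(1 - R^2) to solve for R.
        return t_crit / (t_crit ** 2 + df) ** 0.5

    for n in (3, 10, 30, 100, 1000):
        print(f"N={n:5d}  R95={critical_r(n, 0.95):.3f}  "
              f"R99={critical_r(n, 0.99):.3f}")

For N=30 this gives 0.361 and 0.463, matching the table.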

Table 7 exhibits two surprising features. First, although we have already seen that N=2 gives us no basis for separating signal from noise, we would expect that N=3 or 4 should permit us to determine whether two variables are significantly correlated. Yet if N=3 or 4 we cannot be confident that the two variables are significantly correlated unless we find an almost perfectly linear correlation and thus an R of almost 1 or -1. Second, although we might accept that more pairs of (Xi, Yi) points would permit detection of subtler correlations, it is still remarkable that with N>200 a cor-