Most spreadsheet and graphics programs include a linear regression option. None, however, mentions the implicit assumptions discussed above.

Linear regression fits the line that minimizes the sum of the squared residuals, the deviations of the Yi from the line. This concept is illustrated in Figure 15a, which shows a linear regression of leading National League batting averages for the years 1901-1920. Minimizing the squares of the Yi deviations is very important to remember as one uses linear regression, for it accounts for several characteristics of the method.
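
As a minimal sketch of this criterion (assuming Python with NumPy, and made-up values rather than the actual batting averages of Figure 15a), the slope and intercept that minimize the squared vertical deviations can be computed in closed form:

```python
import numpy as np

# Hypothetical (x, y) values standing in for a short data series; the actual
# batting averages of Figure 15a are not reproduced here.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1, 6.3, 6.8, 8.2])

# Closed-form least-squares estimates: the slope and intercept that minimize
# the sum of squared vertical (Y) deviations from the fitted line.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

residuals = y - (intercept + slope * x)   # the quantities being squared and minimized
print(f"fitted line: y = {intercept:.3f} + {slope:.3f} x")
print("sum of squared residuals:", np.sum(residuals ** 2))
```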

First, we now understand the assumption that only the Yi have errors and that these errors are random, for it is these errors, or discrepancies from the trend, that we are minimizing. If instead the errors were all in the Xi, then we should minimize the Xi deviations instead (or, much more easily, simply rename the variables so that Y becomes the one with the errors).
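
A small sketch can illustrate why the choice matters (again assuming Python with NumPy and invented data): regressing Y on X minimizes the Y deviations, while regressing X on Y, which is what one would do if the errors were in the Xi, generally yields a different line:

```python
import numpy as np

# Hypothetical correlated data; which regression is appropriate depends on
# which variable is assumed to carry the random errors.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

def ols_slope(a, b):
    """Least-squares slope of b regressed on a (errors assumed to be in b)."""
    return np.sum((a - a.mean()) * (b - b.mean())) / np.sum((a - a.mean()) ** 2)

slope_y_on_x = ols_slope(x, y)          # errors assumed in Y
slope_x_on_y = 1.0 / ols_slope(y, x)    # errors assumed in X, re-expressed as a Y-versus-X slope

print(f"slope, Y regressed on X: {slope_y_on_x:.3f}")
print(f"slope, X regressed on Y: {slope_x_on_y:.3f}")   # generally steeper
```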

Second, minimizing the square of the deviation gives greatest weighting to extreme values, in the same way that extreme values dominate a standard deviation. Thus, the researcher needs to investigate the possibility that one or two extreme values are controlling the regression. One approach is to examine the regression line on the same plot as the data. Even better, plot the regression residuals, the differences between individual Yi and the predicted value of Y at each Xi, represented by the vertical line segments in Figure 15a. Regression residuals can be plotted either as a function of Xi (Figure 15b) or as a histogram.
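
A sketch of such a residual check (assuming Python with NumPy and Matplotlib, and invented data that include one deliberately extreme point) might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data with one extreme point, to show how residual plots
# reveal values that may be controlling the regression.
x = np.arange(1, 11, dtype=float)
y = np.array([1.1, 2.0, 2.8, 4.2, 5.1, 5.8, 7.2, 7.9, 9.1, 15.0])

slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
residuals = y - (intercept + slope * x)   # vertical deviations from the fitted line

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(x, residuals)                 # residuals as a function of Xi (cf. Figure 15b)
ax1.axhline(0.0, linestyle="--")
ax1.set_xlabel("X")
ax1.set_ylabel("residual")
ax2.hist(residuals, bins=5)               # residuals as a histogram
ax2.set_xlabel("residual")
plt.tight_layout()
plt.show()
```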

Third, the use of vertical deviations accounts for the name linear regression, rather than a name such as linear fit. If one were to fit a trend by eye through two correlated variables, the line would be steeper than that determined by regression. The best-fit line regresses from the true line toward a horizontal no-fit line as the random errors of Y increase. This corollary is little known but noteworthy; it predicts that if two laboratories make the same type of measurements of (Xi, Yi), they will obtain different linear regression results if their measurement errors differ.
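
One way to see this flattening numerically (a hedged sketch, assuming Python with NumPy and simulated data rather than any particular laboratory's measurements) is to standardize both variables before fitting; the least-squares slope then equals the correlation coefficient r, which shrinks toward zero, that is toward a horizontal line, as the random errors in Y grow:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)

def standardized_slope(noise_sd):
    """OLS slope after standardizing both variables; this equals the correlation r."""
    y = 1.0 + 0.5 * x + rng.normal(scale=noise_sd, size=x.size)
    xs = (x - x.mean()) / x.std()
    ys = (y - y.mean()) / y.std()
    return np.sum(xs * ys) / np.sum(xs ** 2)

# Larger random errors in Y pull the standardized slope toward zero (horizontal).
for noise_sd in (0.1, 1.0, 3.0):
    print(f"Y error sd = {noise_sd}: standardized regression slope = {standardized_slope(noise_sd):.2f}")
```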