of an exploding shell so far as to know the distance of each mark
measured (from an origin) along a right line, say the line of an
extended fortification, and it was known that the shell was fired
perpendicular to the fortification from a distant ridge parallel to the
fortification, and that the shell was of a kind of which the fragments
are scattered according to a normal law^{[1]} with a known coefficient
of dispersion; the question is at what position on the distant ridge
was the enemy's gun probably placed? By received principles
the probability, say P, that the given set of observations should
have resulted from measuring (or aiming at) an object of which the
real position was between *x* and *x* + ∆*x* is

∆*x* J exp − [(*x* − *x*_{1})^{2} + (*x* − *x*_{2})^{2} + &c.]/*c*^{2};

where J is a constant obtained by equating to unity the integral of P
taken over all values of *x* (since the given set of observations must
have resulted from some
position on the axis of *x*). The value of *x*, from which the given
set of observations *most probably* resulted, is obtained by making P
a *maximum*. Putting *d*P/*dx* = 0, we have for the maximum
(*d*^{2}P/*dx*^{2} being negative for this value) the arithmetic mean of the
given observations. The accuracy of the determination is measured
by a probability-curve with modulus *c*/√*n*. Thus in the course of a
very long siege, if every case in which the given group of shell-marks
*x*_{1}, *x*_{2}, . . . *x*_{n} was presented could be investigated, it would be
found that the enemy's cannon was fired from the position *x*′, the
(point right opposite to the) arithmetic mean of *x*_{1}, *x*_{2}, &c., *x*_{n}, with
a frequency assigned by the equation

*z* = (√*n*/√π*c*) exp − *n*(*x* − *x*′)^{2}/*c*^{2}.

The reasoning is applicable without material modification to the
case in which the data and the *quaesitum* are not absolute quantities,
but proportions; for instance, given the percentage of white balls
in several large batches drawn at random from an immense urn containing
black and white balls, to find the percentage of white balls
in the urn—the inverse problem associated with the name of Bayes.
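To make the argument concrete, here is a minimal numerical sketch (the modulus *c*, the gun's position, and the number of shell-marks are assumed values for illustration). Maximizing P over a grid of candidate positions recovers the arithmetic mean of the marks, as stated above:

```python
import math
import random

random.seed(1)

c = 2.0          # known modulus of the normal law (so sigma = c / sqrt(2))
true_x = 5.0     # assumed position of the gun, unknown to the observer
n = 100

# Shell-mark positions measured along the line of the fortification.
marks = [random.gauss(true_x, c / math.sqrt(2)) for _ in range(n)]

def log_p(x):
    """Log of P up to a constant: -sum((x - x_i)^2) / c^2."""
    return -sum((x - xi) ** 2 for xi in marks) / c ** 2

# The position from which the marks most probably resulted.
grid = [i / 1000 for i in range(10000)]
best = max(grid, key=log_p)
mean = sum(marks) / n
print(best, mean)  # agree to the precision of the grid
```

The accuracy of the determination can likewise be checked empirically: repeating the experiment many times, the estimates scatter about `true_x` with a modulus of about *c*/√*n*.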

131. Simple as this solution is, it is not the one which has most
recommended itself to Laplace. He envisages the *quaesitum* not so
much as that point which is *most probably* the real one, as that point
which may *most advantageously* be put for the real one. In our
illustration it is as if it were required to discover from a number
of shot-marks not the point^{[2]} which in the course of a long siege
would be most frequently the position of the cannon which had
scattered the observed fragments but the point which it would
be best to treat as that position—to fire at, say, with a view of
silencing the enemy's gun—having regard not so much to the frequency
with which the direction adopted is right, as to the extent
to which it is wrong in the long run. As the measure of the detriment
of error, Laplace^{[3]} takes “la Valeur moyenne de l'erreur à
craindre,” the mean first power of the errors taken positively on
each side of the real point. The mean square of errors is proposed
by Gauss as the criterion.^{[4]} *Any* mean power indeed, the integral
of any function which increases in absolute magnitude with the
increase of its variable, taken as the measure of the detriment, will
lead to the same conclusion, if the normal law prevails.^{[5]}
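The indifference of the conclusion to the measure of detriment can be illustrated by a sketch (sample size and grid are assumed; under a non-normal law the two criteria would part company):

```python
import random
import statistics

random.seed(2)
# Errors following the normal law about a true point at 0.
sample = [random.gauss(0.0, 1.0) for _ in range(2001)]

grid = [i / 50 - 3 for i in range(301)]   # candidate points to put for the real one
# Laplace's criterion: the mean first power of the errors taken positively.
best_abs = min(grid, key=lambda x: sum(abs(x - s) for s in sample))
# Gauss's criterion: the mean square of errors.
best_sq = min(grid, key=lambda x: sum((x - s) ** 2 for s in sample))

print(best_abs, best_sq, statistics.mean(sample))
```

Under the normal law both criteria point to practically the same value, the arithmetic mean; the minimizer of the absolute criterion is in general the median, which for a symmetrical law coincides with the mean.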

132. Yet another speculative difficulty occurs in the simplest, and
recurs in the more complicated inverse problem. In putting P as
the probability, deduced from the observations that the real point
for which they stand is *x* (between *x* and *x* + ∆*x*), it is tacitly
assumed that prior to observation one value of *x* is as probable as
another. In our illustration it must be assumed that the enemy's
gun was as likely to be at one point as another of (a certain tract of)
the ridge from which it was fired. If, apart from the evidence of
the shell-marks, there was any reason for thinking that the gun was
situated at one point rather than another, the formula would require
to be modified. This a priori probability is sometimes grounded on
our *ignorance*; according to another view, the procedure is justified
by a rough general knowledge that over a tract of *x* for which P is
sensible one value of *x* occurs about as often as another.^{[6]}
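The effect of such a modification can be sketched numerically (the prior centred at 0 and all numerical values are assumptions for illustration): with a uniform a priori probability the most probable position is the arithmetic mean; with a prior favouring one neighbourhood, the mode of the resulting probability is pulled towards it.

```python
import math
import random

random.seed(3)
c = 2.0
# Five shell-marks from a gun assumed to stand at x = 6.
marks = [random.gauss(6.0, c / math.sqrt(2)) for _ in range(5)]
mean = sum(marks) / len(marks)

def log_likelihood(x):
    return -sum((x - m) ** 2 for m in marks) / c ** 2

# Hypothetical a priori ground for thinking the gun near x = 0.
def log_prior(x):
    return -(x ** 2) / 8.0

grid = [i / 1000 for i in range(-2000, 12000)]
flat_mode = max(grid, key=log_likelihood)  # uniform prior: the arithmetic mean
informed_mode = max(grid, key=lambda x: log_likelihood(x) + log_prior(x))
print(flat_mode, informed_mode)  # the informed mode lies nearer to 0
```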

133. Subject to similar speculative difficulties, the solution which
has been obtained may be extended to the analogous problem in
which the *quaesitum* is not the real value of an observed magnitude,
but the mean to which a series of statistics indefinitely prolonged
converges.^{[7]}

134. Next, let the modulus, still supposed given, not be the same
for all the observations, but *c*_{1} for *x*_{1}, *c*_{2} for *x*_{2}, &c. Then P becomes
proportional to

exp − [(*x* − *x*_{1})^{2}/*c*_{1}^{2} + (*x* − *x*_{2})^{2}/*c*_{2}^{2} + &c.].

And the value of *x* which is both the most probable and the “most
advantageous” is (*x*_{1}/*c*_{1}^{2} + *x*_{2}/*c*_{2}^{2} + &c.)/(1/*c*_{1}^{2} + 1/*c*_{2}^{2} + &c.);
each observation being weighted with the inverse
mean square of observations made under similar
conditions.^{[8]} This is the rule prescribed by the “method
of least squares”; but as the rule in this case has been deduced
by genuine inverse probability, the problem does not exemplify
what is most characteristic in that method, namely, that a rule
deducible from the hypothesis that the errors of observations obey
the normal law of error is employed in cases where the normal law
is not known, or even is known not, to hold good. For example,
let the curve of error for each observation be of the form of

*z* = (1/√π*c*) exp[−*x*^{2}/*c*^{2} − 2*j*(*x*/*c* − 2*x*^{3}/3*c*^{3})],

where *j* is a small fraction, so that *z* may equally well be equated to
(1/√π*c*)[1 - 2*j*(*x*/*c* - 2*x*^{3}/3*c*^{3})] exp − *x*^{2}/*c*^{2}, a law which is actually
very prevalent. Then, according to the genuine inverse method,
the most probable value of *x* is given by the quadratic equation
*d*/*dx* log P = 0, where log P = const. − ∑(*x* − *x*_{r})^{2}/*c*_{r}^{2} − ∑2*j*[(*x* − *x*_{r})/*c*_{r} − 2(*x* − *x*_{r})^{3}/3*c*_{r}^{3}],
∑ denoting summation over all the observations.
According to the “method of least squares,” the solution is the
weighted arithmetic mean of the observations, the weight of any
observation being inversely proportional to the corresponding
mean square, *i.e.* *c*_{r}^{2}/2 (the terms of the integral which involve *j*
vanishing), which would be the solution if the *j*'s were all zero. We
put for the solution of the given case what is known to be the solution
of an essentially different case. How can this paradox be justified?
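The size of the discrepancy can be gauged numerically. In the sketch below (observations, moduli, and the value of *j* are assumed for illustration), the root of *d*/*dx* log P = 0 is found by maximizing log P over a grid and compared with the weighted arithmetic mean prescribed by the method of least squares:

```python
# Observations x_r with their moduli c_r (assumed values).
obs = [(3.1, 1.0), (2.7, 2.0), (3.4, 1.5), (2.9, 1.0)]
j = 0.05  # small skewness parameter of the quasi-normal law

def log_p(x):
    # log P = const - sum (x-x_r)^2/c_r^2 - sum 2j[(x-x_r)/c_r - 2(x-x_r)^3/(3 c_r^3)]
    total = 0.0
    for xr, cr in obs:
        u = (x - xr) / cr
        total += -u ** 2 - 2 * j * (u - 2 * u ** 3 / 3)
    return total

# Genuine inverse method: maximize log P.
grid = [i / 10000 for i in range(20000, 40000)]
inverse_sol = max(grid, key=log_p)

# Method of least squares: weighted mean with weights 1/c_r^2.
lsq = sum(xr / cr ** 2 for xr, cr in obs) / sum(1 / cr ** 2 for xr, cr in obs)
print(inverse_sol, lsq)
```

For small *j* the two determinations differ only slightly, which is the practical ground on which the paradox is commonly excused.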

135. Many of the answers which have been given to this question
seem to come to this. When the data are unmanageable, it is legitimate
to attend to a part thereof, and to determine the most probable
(or the “most advantageous”) value of the *quaesitum*, and the
degree of its accuracy, from the selected portion of the data as if it
formed the whole. This throwing overboard of part of the data in
order to utilize the remainder has often to be resorted to in the
rough course of applied probabilities. Thus an insurance office
only takes account of the age and some other simple attributes of
its customers, though a better bargain might be made in particular
cases by taking into account all available details. The nature of
the method is particularly clear in the case where the given set of
observations consists of several batches, the observations in any
batch ranging under the same law of frequency with mean *x*′_{r}
and mean square of error *k*_{r}, the function and the constants different
for different batches; then if we confine our attention to those parts
of the data which are of the type *x*′_{r} and *k*_{r}—ignoring what else may
be given as to the laws of error—we may treat the *x*′_{r}'s as so many
observations, each ranging under the normal law of error with its
coefficient of dispersion; and apply the rules proper to the normal
law. Those rules applied to the data, considered as a set of derivative
observations each formed by a batch of the original observations
averaged, give as the most probable (and also the most advantageous)
combination of the observations the arithmetic mean weighted
according to the inverse mean square pertaining to each observation,
and for the law of the error to which the determination is liable
the normal law with standard deviation^{[9]} √(∑*k*/*n*)—the very rules
that are prescribed by the method of least squares.
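The procedure admits a simple numerical sketch (batch sizes, error laws, and the true point are assumptions for illustration): each batch, whatever its law of frequency, is reduced to its mean and its mean square of error, and the normal-law rule of weighting by inverse mean square is then applied to the batch means.

```python
import random
import statistics

random.seed(5)
true_x = 10.0

# Three batches of observations, each under a different law of frequency.
batches = [
    [true_x + random.uniform(-3, 3) for _ in range(200)],        # rectangular law
    [true_x + random.gauss(0, 1) for _ in range(200)],           # normal law
    [true_x + random.expovariate(1) - 1 for _ in range(200)],    # skew law
]

# Reduce each batch to a derivative observation: (mean, mean square of that mean).
derived = []
for b in batches:
    m = statistics.mean(b)
    k = statistics.pvariance(b)  # mean square of deviation within the batch
    derived.append((m, k / len(b)))

# Weight each batch mean by the inverse mean square pertaining to it.
est = sum(m / v for m, v in derived) / sum(1 / v for m, v in derived)
print(est)
```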

136. The principle involved might be illustrated by the proposal
to make the economy of data a little less rigid: to utilize, not indeed
all, but a little more of our materials—not only the mean
square of error for each batch, but also the mean cube of error. To
begin with the simple case of a single homogeneous batch: suppose
that in our example the fragments of the shell are no longer scattered
according to the normal law. By the method of least squares it
would still be proper to put the arithmetic mean of the given observations
for the true point required, and to measure the accuracy of
that determination by a probability-curve of which the modulus is
√(2*k*), where *k* is the mean square of deviation (of fragments from
their mean). If it is thought desirable to utilize more of the data,
there is available the proposition that the arithmetic mean of a

- ↑ If normally in any direction indifferently according to the two-
or three-dimensioned law of error, then normally in one dimension
when collected and distributed in *belts* perpendicular to a horizontal
right line, as in the example cited below, par. 155.
- ↑ Or small interval (cf. preceding section).
- ↑ “Toute erreur soit positive soit négative doit être considérée
comme un désavantage ou une perte réelle à un jeu quelconque”
(“every error, whether positive or negative, must be regarded as a
disadvantage or a real loss at some game”), *Théorie analytique*,
art. 20 seq., especially art. 25. As to which it is acutely remarked
by Bravais (*op. cit.* p. 258), “Cette règle simple laisse à désirer
une démonstration rigoureuse, car l'analogie du cas actuel avec celui
des jeux de hasard est loin d'être complète” (“this simple rule still
wants a rigorous demonstration, for the analogy of the present case
with that of games of chance is far from complete”).
- ↑ *Theoria combinationis*, pt. i. § 6. Simon Newcomb is conspicuous
by walking in the way of Laplace and Gauss in his preference of the
*most advantageous* to the *most probable* determinations. With Gauss
he postulates that “the evil of an error is proportioned to the square
of its magnitude” (*American Journal of Mathematics*, vol. viii. No. 4).
- ↑ As argued by the present writer, *Camb. Phil. Trans.* (1885),
vol. xiv. pt. ii. p. 161. Cf. Glaisher, *Mem. Astronom. Soc.* xxxix. 108.
- ↑ The view taken by the present writer on the “Philosophy of
Chance,” in *Mind* (1880; approved by Professor Pearson,
*Grammar of Science*, 2nd ed. p. 146). See also “A priori Probabilities,”
*Phil. Mag.* (Sept. 1884), and *Camb. Phil. Trans.* (1885), vol. xiv.
pt. ii. p. 147 seq.
- ↑ Above, pars. 6, 7.
- ↑ The mean square .
- ↑ The standard deviation pertaining to a set of (*n*/*r*) composite
observations, each derived from the original *n* observations by
averaging a batch thereof numbering *r*, is √(*k*/*r*)/√(*n*/*r*) = √(*k*/*n*),
when the given observations are all of the same weight; *mutatis
mutandis* when the weights differ.