Popular Science Monthly/Volume 60/December 1901/A Mechanical Solution of a Literary Problem

THE title given to this paper, chosen after much hesitation and with no little reluctance, is not to be looked upon as an assumption of the definite and final solution of the principal problem to which attention has been directed. As a matter of fact I have hoped to conceal, for at least a page or two, the identity of this principal problem, in order that no well-intentioned and good-natured reader might be driven away by what is a very general, not altogether reasonable, but quite natural, prejudice. Whatever may be thought of the problem or of the importance of its solution, it is believed that the method here suggested and applied will be found to be of interest and, possibly, of considerable value in certain linguistic studies.

Nearly twenty years ago I devised a method for exhibiting graphically such peculiarities of style in composition as seemed to be almost purely mechanical and of which an author would usually be absolutely unconscious. The chief merit of the method consisted in the fact that its application required no exercise of judgment, accurate enumeration being all that was necessary, and that, by displaying one or more phases of the mere mechanism of composition, characteristics might be revealed which the author could make no attempt to conceal, being himself unaware of their existence. It was further assumed that, owing to the well-known persistence of unconscious habit, personal peculiarities in the construction of sentences, in the use of long or short words, in the number of words in a sentence, etc., will in the long run manifest themselves with such regularity that their graphic representation may become a means of identification, at least by exclusion. In the present consideration the application of the method has been restricted to a study of the relative frequencies of the use of words of different lengths.

The method of procedure is simple and will be best explained by an example. One thousand words in 'Vanity Fair,' taken in consecutive order of course, were counted and classified as to the number of letters in each with the following result:

Letters    1    2    3    4    5    6    7    8    9   10   11   12   13   14
Words     25  169  232  187  109   78   79   48   28   20   10   10    2    3
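
For a reader who wishes to repeat such an enumeration today, the counting step can be sketched in a few lines of Python. This is only a minimal sketch: the function name `word_length_counts` is hypothetical, and the rule that a word is an unbroken run of letters (so that apostrophes and hyphens divide words) is an assumption, since the article does not say how such cases were treated.

```python
import re
from collections import Counter

def word_length_counts(text, group_size=1000):
    """Tally the first `group_size` consecutive words of `text` by letter count."""
    # Assumption: a word is an unbroken run of letters; the article does not
    # state how apostrophes, hyphens or numerals were handled.
    words = re.findall(r"[A-Za-z]+", text)[:group_size]
    return Counter(len(word) for word in words)

sample = "The method of procedure is simple and will be best explained by an example"
counts = word_length_counts(sample)
for letters in sorted(counts):
    print(letters, counts[letters])
```

Applied to a thousand consecutive words of a book, the same function would yield a row of numbers of the kind tabulated above.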

The graphic exhibition of this result is made by the well-known method of rectangular coordinates, using the number of letters in a word as the abscissa and the corresponding number of words in a thousand as the ordinate. On a sheet of 'squared' paper the numbers showing letters in each word, 1, 2, 3, 4, etc., are placed along the horizontal line and on the vertical above each of these is put a point whose distance from the base shows the number of corresponding words in every thousand, according to the scale shown at the left. These points are then joined by straight lines and the whole broken line may be called the 'word spectrum' or 'characteristic curve' of the author as derived from the group of words considered. The group of 1,000 words from 'Vanity Fair' enumerated above is thus graphically represented by the continuous line in Fig. 1, and the method of constructing the characteristic curve will be readily understood by comparing this with the numbers given. As a thousand is a very small number in a problem of this kind, the curve representing any single group of that number of words is practically certain to differ more or less from that of any other such group. In Fig. 1 the dotted line represents a group of 1,000 words, immediately following that already referred to. Perhaps the most astonishing thing about these two lines is not that they differ, but that they agree as well as they do. It is really remarkable that any marked peculiarity in the use of words is almost sure to be revealed in this way, even in comparatively small groups. In the two diagrams of Fig. 1 it is interesting to note their general sameness, especially as shown in a tendency to equality of words of six and seven letters and also in words of eleven and twelve letters.

Fig. 1. Two groups, 1,000 words each, 'Vanity Fair.'
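
The construction of the points of such a curve may be illustrated as follows, using the 'Vanity Fair' numbers quoted in the text. The sketch is an illustration only: the names `curve_points` and `text_plot` are hypothetical, and the rows of marks are a rough substitute for the joined straight lines of the printed diagram, chosen so that no plotting library is assumed.

```python
# Counts for the group of 1,000 words from 'Vanity Fair' quoted in the text,
# for words of 1, 2, 3, ... 14 letters.
vanity_fair = [25, 169, 232, 187, 109, 78, 79, 48, 28, 20, 10, 10, 2, 3]

def curve_points(counts):
    """Return (abscissa, ordinate) pairs: letters per word, words per thousand."""
    total = sum(counts)
    return [(i + 1, 1000 * c / total) for i, c in enumerate(counts)]

def text_plot(points, scale=5):
    """Draw one row per word length; each '#' stands for `scale` words per thousand."""
    rows = []
    for letters, per_thousand in points:
        bar = "#" * round(per_thousand / scale)
        rows.append(f"{letters:2d} | {bar} {per_thousand:.0f}")
    return "\n".join(rows)

points = curve_points(vanity_fair)
print(text_plot(points))
```

Because the ordinates are reduced to words per thousand, groups of unequal size can be laid upon the same scale.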

When the number of words in each group is increased there is, of course, closer agreement of their diagrams, and this became so evident in the earlier stages of the investigation that the conclusion was soon reached that if a diagram be made representing a very large number of words from a given author, it would not differ sensibly from any other diagram representing an equally large number of words from the same author. Such a diagram would then reflect the persistent peculiarities of this author in the use of words of different lengths and might be called the characteristic curve of his composition. Curves similarly formed from anything that he had ever written could not differ materially from this, although curves of other authors might possibly, but would not probably, agree closely with his.

Thus, if this principle were established, the method might be useful as a means of identification of authorship, and it might be relied upon with great confidence to show that a certain author did not write a certain composition.
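
In the article the agreement of two curves is judged by the eye. Should a numerical measure be wanted, one possibility, offered here purely as an assumption and not as the author's procedure, is to sum the absolute differences of the ordinates (words per thousand) at each word length. In the sketch below the function names are hypothetical, and the second group of counts is invented for illustration; only the 'Vanity Fair' row comes from the text.

```python
from itertools import zip_longest

def per_thousand(counts):
    """Scale raw counts so that they sum to 1,000."""
    total = sum(counts)
    return [1000 * c / total for c in counts]

def curve_distance(counts_a, counts_b):
    """Sum of absolute ordinate differences; 0 means identical curves.

    Assumption: this measure is not in the article, which compares
    diagrams visually.
    """
    a, b = per_thousand(counts_a), per_thousand(counts_b)
    return sum(abs(x - y) for x, y in zip_longest(a, b, fillvalue=0.0))

# From the text: 1,000 words of 'Vanity Fair'.
vanity_fair = [25, 169, 232, 187, 109, 78, 79, 48, 28, 20, 10, 10, 2, 3]
# Invented for illustration only; not a count of any real text.
invented_group = [30, 160, 225, 195, 112, 80, 75, 50, 26, 22, 11, 9, 3, 2]

print(curve_distance(vanity_fair, vanity_fair))
print(curve_distance(vanity_fair, invented_group))
```

A small value would correspond to curves that nearly coincide; what value ought to count as 'striking difference' is left open here, as the article fixes no numerical threshold.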

In the earlier application of the method many interesting facts were brought out, some of which are worth mentioning here, although a full account of the preliminary work was published in 'Science' of March 11, 1887. It was soon discovered that among writers of English the three-letter word occurred much more frequently than any other. Indeed in the earlier investigation only one exception to this rule was found and that was in the writings of John Stuart Mill, who uses two-letter words more often than any other. This was surprising at first, especially in view of the large average word-length of Mill's composition, which is considerably in excess of that of any other author thus far examined, but it is easily explained by the very frequent appearance of prepositional phrases, necessitating the use of such two-letter words as in, on, to, of, etc., to an extent unapproached by other writers. Mill's writings furnished an opportunity for comparing the curves representing two different periods of an author's life. A comparison of two groups of 5,000 words each from his 'Political Economy' and his 'Essay on Liberty' showed the presence of the same peculiarities in word choosing, and in every thousand of the ten examined the two-letter word was in excess. No other writer of English has been found to use two-letter words oftener than any other, but it is not at all improbable that there may be such.
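
The two quantities spoken of here, the word-length of greatest frequency and the average word-length, follow directly from a table of counts. As a worked illustration, using only the 'Vanity Fair' group of 1,000 words given earlier (the variable names are hypothetical):

```python
# Counts for words of 1, 2, 3, ... 14 letters in the 'Vanity Fair' group.
vanity_fair = [25, 169, 232, 187, 109, 78, 79, 48, 28, 20, 10, 10, 2, 3]

lengths = range(1, len(vanity_fair) + 1)

# Word-length of greatest frequency (the peak of the curve).
modal_length = max(lengths, key=lambda n: vanity_fair[n - 1])

# Average word-length: total letters divided by total words.
total_letters = sum(n * count for n, count in zip(lengths, vanity_fair))
mean_length = total_letters / sum(vanity_fair)

print(modal_length)          # 3
print(f"{mean_length:.3f}")  # 4.507
```

For this single group the three-letter word is the most frequent, in keeping with the rule stated above, and the average comes to a little over four and a half letters.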

Through the interest of Mr. Edward Atkinson, it became possible to give a partial answer to the question: Can an author purposely avoid the peculiarities of style that belong to his normal composition? Mr. Atkinson, having addressed a body of college alumni on a certain topic, afterward gave what he meant to be the same address to a body of workingmen, but in the latter instance he made a special effort to use simple, short words and sentences of the simplest and plainest construction. Although relating to the same topic the two addresses 'read' very differently, but their diagrams are strikingly alike in their main feature.

As a matter of interest 'counts' were made of groups of about 5,000 words each from various languages other than English. The characteristic curves thus derived for Italian, Spanish, French, German, Latin and Greek are shown in Figs. 2 and 3, and, for convenience in comparison, that of Dickens's English is added. Many of these 'counts' were furnished by friends who became interested in the matter, and an incident of no little interest was the receipt of a column of numbers on a strip of paper with nothing to indicate its origin or meaning. Suspecting, however, that it might be a 'word count,' its diagram was constructed and it was instantly and beyond all reasonable doubt identified as coming from the Latin of Caesar.

Fig. 2.

The original published description referred to above concludes as follows:

From the examinations thus far made I am convinced that 100,000 words will be necessary and sufficient to furnish the characteristic curve of a writer; that is to say, if a curve is constructed from 100,000 words of a writer, taken from any one of his productions, then a second curve from another 100,000 words would be practically identical with the first, and that this curve would, in general, differ from that formed in the same way from another writer, to such an extent that one could always be distinguished from another. To demonstrate the existence of such a curve would require the enumeration of the letters of several hundred thousand words from each of a number of writers. Should its existence be established the method might then be applied to cases of disputed authorship. If striking differences are found between the curves of known and suspected compositions of any writer, the evidence against identity of authorship would be quite conclusive. If the two compositions should produce curves which are practically identical, the proof of a common origin would be less convincing; for it is possible, although not probable, that two writers might show identical characteristic curves.

With this conclusion the matter remained for more than ten years. On innumerable occasions it was suggested that the process ought to be applied to an examination of the writings of Bacon and Shakespeare with a view of forever settling a controversy which will doubtless forever remain unsettled. This, of course, had been all along in view, but it involved an expenditure of time and labor in letter and word counting quite beyond what might be expected from individual enthusiasm. The operation is not one of thrilling interest, and volunteer assistance could not be depended upon when the number of things to be counted and classified grew into millions.

Fig. 3.

That the method has been applied at last to this most curious and yet most interesting question is entirely due to the liberality of Mr. Augustus Heminway, of Boston, who kindly offered to defray the expenses of the work, that is, to employ persons to count and classify nearly two millions of words. Besides expressing my indebtedness to Mr. Heminway, I wish to make grateful acknowledgment of the excellent and entirely satisfactory manner in which the heavy task of counting was performed by the ladies who undertook it, Mrs. Richard Mitchell and Miss Amy C. Whitman, of Worcester, Massachusetts. Their intelligent interest in the problem itself, together with their excellent knowledge of the various authors under examination and familiarity with the literature of the Shakespearean period, contributed greatly to the easy accomplishment of the work.

The operation of counting was greatly facilitated by the construction of a simple counting machine by which a registration of a word of any given number of letters was made by touching a button marked with that number. One of the counters, with book in hand, called off 'five,' 'two,' 'three,' etc., as rapidly as possible, counting the letters in each word carefully and taking the words in their consecutive order, the other registering, as called, by pressing the proper buttons. Practice enabled the counters to do the work with remarkable rapidity, so that, although they were occupied for several months, the total time required was really only about one-quarter of the original estimate. The work was very exhausting, however, and could not be kept up satisfactorily more than three to five hours each day.

After some preliminary work the counting of Shakespeare was seriously begun, and the result from the start with the first group of a thousand words was a decided surprise. Two things appeared from the beginning: Shakespeare's vocabulary consisted of words whose average length was a trifle below four letters, less than that of any writer of English before studied; and his word of greatest frequency was the four-letter word, a thing never met with before. His preference for the four-letter word may be said, indeed, to constitute the striking characteristic of his composition. At first it was thought that it might be a general characteristic of the English of his time, but that was found to be not the case. Its appearance in the composition of one or two of his contemporaries will be considered presently.

Altogether about 400,000 words of Shakespeare were counted and classified, including, in whole or in part, nearly all of his most famous plays. His 'characteristic curve' is most persistent, that based on the first 50,000 words differing very little from that of the whole count. Two groups have been formed by combining alternate small groups (single plays or parts of plays) in a purely mechanical way, so as to include as nearly as may be the same number of words in each. The curves corresponding to them are plotted in Fig. 4, where, however, the differences have been of necessity somewhat exaggerated in order to make them show at all. The practical identity of these curves must be regarded as convincing evidence of the soundness of the original assumption. Not all of the Shakespeare count was completed at one time; other authors were taken up, and it is worth noting that the counters declared their ability to recognize Shakespeare by the mere 'run of words' without knowing what book or author was in hand, more especially on account of the exceptional excess of four-letter words.
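
The mechanical combination of alternate small groups into two large ones can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `combine_alternate` is hypothetical, the simple odd-and-even alternation is one reading of 'a purely mechanical way,' and the toy tallies are invented, the actual counts for the separate plays not being given in the article.

```python
def combine_alternate(groups):
    """Sum the tallies of alternate small groups into two combined tallies.

    Each group is a list of counts for words of 1, 2, 3, ... letters.
    Groups 1, 3, 5, ... go into the first total; 2, 4, 6, ... into the second.
    """
    width = max(len(tally) for tally in groups)
    halves = [[0] * width, [0] * width]
    for index, tally in enumerate(groups):
        target = halves[index % 2]
        for position, count in enumerate(tally):
            target[position] += count
    return halves

# Invented toy tallies (words of 1 to 4 letters) standing in for four plays.
groups = [[1, 4, 5, 6], [2, 3, 6, 5], [1, 5, 4, 7], [0, 4, 6, 6]]
first, second = combine_alternate(groups)
print(first, second)
```

Each of the two totals may then be reduced to words per thousand and plotted, so that the two halves of the count can be laid one upon the other.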

Fig. 4. Shakespeare: two groups, about 200,000 words each.

The characteristic curve of Bacon was developed along with that of Shakespeare and was based on his 'Henry VII,' the 'Advancement of Learning' and a large number of his shorter essays, the total number of words being nearly 200,000.

Besides these, extensive counting was done from the writings of Ben Jonson, Addison, Milton, Beaumont and Fletcher, Christopher Marlowe, Goldsmith and Lord Lytton and small groups from a few more modern authors. It is possible, here, to give only general conclusions and to exhibit the diagrams of the more important and interesting results.

One of the first questions likely to be raised is, when an author writes both prose and poetry, will the two styles of composition follow the same general law and show the same characteristic curves? Unfortunately it is not possible to answer this as completely as could be desired, as no one has written enough in two or more different styles, as prose, poetry, history, essay, drama, etc., to produce normal characteristic diagrams. Several of the authors above named were examined with this point in view, and while some of them exhibited somewhat different curves in play writing and in essay or serious prose composition, in every case any marked peculiarity found in one style was also found in the other. A good example of this is shown in the two Shakespeare curves of Fig. 5. The continuous line is based on his 'Rape of Lucrece' and 'Venus and Adonis,' while the broken line is his normal curve in play writing.

Fig. 5. Shakespeare: continuous line, poetry; dotted line, prose.

It will be noted that the Shakespearean peculiarity of an excessive use of four-letter words is shown in the same degree in both and that while there are apparent differences of considerable magnitude the curves are really strikingly alike, every bend in one having a corresponding flexure in the other. This is typical of all comparisons of different styles of composition by the same author. Undoubtedly there will always be found differences in the graphic representations of serious prose compositions and those of a higher vein, poetry or play, by the same writer, but the evidence at hand goes to show that the leading personal peculiarities of composition will invariably be found in both.

Fig. 6. Two groups, Ben Jonson.

Fig. 6 shows the curves of two groups of about 75,000 words each from the plays of Ben Jonson, the most notable literary contemporary of Shakespeare. Their close agreement is another very satisfactory confirmation of the fundamental principle and their difference from the Shakespearean curve is striking. It will be observed that Jonson follows the usual practice of making use of the three-letter word most frequently.

Fig. 7 shows the characteristic curves of Bacon and Shakespeare side by side and may be regarded, perhaps, as the objective point of the entire investigation. The reader is at liberty to draw any conclusions he pleases from this diagram.

Should he conclude that, in view of the extraordinary differences in these lines, it is clear that Bacon could not have written the things ordinarily attributed to Shakespeare, he may yet, possibly, be willing to admit that, in Mr. Heminway's own words, 'the question still remains, who did?' Assuming this question to be a reasonable one, the method now under consideration can never do more than direct inquiry or suspicion.

Fig. 7. Continuous line, Bacon; dotted line, Shakespeare.

During the progress of the count it seemed as if the Shakespearean peculiarity of the excessive use of words of four letters was unique, that no other writer would be found with this characteristic. On working out the results of a very extensive count of the plays of Beaumont and Fletcher, however, it was found that on the final average the number of four-letter words was slightly greater than that of three letters, although the excess was by no means so persistent in small groups. The curve of their composition is, on the whole, quite like that of Shakespeare. The lack of persistency of form among small groups may be accounted for by the fact that the work is in a large, though unknown, degree a joint product. The comparison with Shakespeare is shown in Fig. 8.

Fig. 8. Continuous line, Beaumont and Fletcher; dotted line, Shakespeare.

It was in the counting and plotting of the plays of Christopher Marlowe, however, that something akin to a sensation was produced among those actually engaged in the work. Here was a man to whom, it has always been acknowledged, Shakespeare was deeply indebted; one of whom able critics have declared that he 'might have written the plays of Shakespeare.' Indeed a book has been only recently published to prove that he did write them. Even this did not lessen the interest with which it was discovered that in the characteristic curve of his plays Christopher Marlowe agrees with Shakespeare about as well as Shakespeare agrees with himself, as is shown in Fig. 9.

Fig. 9. Continuous line, Marlowe; dotted line, Shakespeare.

Finally, an interesting incident developed in an examination of a bit of dramatic composition by Professor Shaler, of Harvard University, entitled 'Armada Days.' It was a brochure of only about twenty thousand words, printed for private circulation, in which the author had endeavored to compose in the spirit and style of the Elizabethan Age. Although too small to produce anything like a 'normal' curve it was counted and plotted, and the diagram indicated that Professor Shaler had not only caught the spirit of the literature of the time, but that he had also unconsciously adopted the mechanism which seems to characterize it. In the excess of the four-letter word and in other respects the curve was rather decidedly Shakespearean, although it was written before its author knew anything of such an analysis as this.