User:CharlesSpencer/sandbox/HowTo

From Wikisource

How to...

How I generated the data files for (semi-)automated processing of the Private Acts of the 58th Congress:

Base data

My base data came from the List of Private Acts and Resolutions, which is found in Volume I of the work. It has names, grant/increase and page numbers, but it is missing chapter numbers, marginal notes, the amount of the pension, the quality of the pensioner and any provisos. It is, however, not a bad place to start!

On a sandbox page, I added subst: to each of a long series of page transclusions of the form {{Page:United_States_Statutes_at_Large_Volume_33_Part_2.djvu/[target]}}, which generated a single, very long page of text. I then published the page and copied and pasted its entire contents into Word.
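For anyone who would rather not type the long series of calls by hand, the list could be generated with a short script. The following is a minimal Python sketch, not part of the process described above. The function name subst_lines and the page range 10 to 12 are hypothetical, chosen only for illustration; the real range depends on which pages of the index hold the list.

```python
# Index file name as it appears in the transclusion above.
INDEX = "United_States_Statutes_at_Large_Volume_33_Part_2.djvu"


def subst_lines(first_page, last_page, index=INDEX):
    """Return one {{subst:Page:...}} call per page in the inclusive range."""
    return [
        "{{subst:Page:%s/%d}}" % (index, number)
        for number in range(first_page, last_page + 1)
    ]


if __name__ == "__main__":
    # Hypothetical page range, for illustration only.
    print("\n".join(subst_lines(10, 12)))
```

The printed lines can be pasted into the sandbox page; when the page is saved, each call is replaced by the text of the corresponding page.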

Refinement phase

Here I used Word's quite sophisticated search and replace functionality to prepare the data for transfer to Excel. Anybody who understands regex could probably do what I did in seconds, but essentially I searched and replaced as appropriate to create lines of the form Name <Tab> Increase/Grant <Tab> Name <Tab> Date <Tab> Page number <C/R>.
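As a rough idea of what the regex route might look like, here is a minimal Python sketch. It is not the Word procedure used above, and it assumes a line layout that is purely hypothetical: "Name, increase/grant [of] pension to Name, Month D, YYYY ..... page". The real list entries may well be laid out differently, in which case the pattern would need adjusting. The names LINE_RE and to_tsv, and the sample entries, are made up for illustration.

```python
import re

# ASSUMED layout (hypothetical, not taken from the source list):
#   "John Doe, increase pension to John Doe, April 1, 1904 ..... 1234"
LINE_RE = re.compile(
    r"^(?P<name1>.+?),\s*"                                  # first name entry
    r"(?P<kind>grant|increase)\w*\s+(?:of\s+)?pension\s+to\s+"
    r"(?P<name2>.+?),\s*"                                   # second name entry
    r"(?P<date>[A-Z][a-z]+\.?\s+\d{1,2},\s+\d{4})"          # e.g. April 1, 1904
    r"[\s.]*"                                               # leader dots/spaces
    r"(?P<page>\d+)\s*$",                                   # page number
    re.IGNORECASE,
)


def to_tsv(line):
    """Return a tab-separated record, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # leave unmatched lines for manual review
    return "\t".join([
        match.group("name1"),
        match.group("kind").capitalize(),
        match.group("name2"),
        match.group("date"),
        match.group("page"),
    ])


if __name__ == "__main__":
    # Made-up sample entry, for illustration only.
    print(to_tsv("John Doe, increase pension to John Doe, April 1, 1904 ..... 1234"))
```

Lines for which to_tsv returns None are best collected and fixed by hand rather than forced through a looser pattern, since OCR damage tends to be irregular.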

Once this was in reasonable shape, I copied it into Excel and used IF formulas to compare the two name cells. The initial version of the name is in italics in the original and therefore tends to OCR less well. However, there is very little point in checking lines where both name entries are the same: I haven't found a single one yet where the same error occurred in both cells!
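The same comparison can be sketched outside Excel. The following Python is a minimal illustration, not the spreadsheet used above. It assumes tab-separated lines in the column order given earlier (Name, Increase/Grant, Name, Date, Page number), and the function name mismatched_rows and the sample rows are made up.

```python
def mismatched_rows(tsv_lines):
    """Return (row number, first name, second name) for rows needing a check.

    Rows whose two name fields agree are skipped, on the reasoning above
    that the same OCR error is unlikely to occur in both.
    """
    flagged = []
    for number, line in enumerate(tsv_lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            # Malformed row: flag it with the raw text so it is not lost.
            flagged.append((number, line.strip(), ""))
            continue
        # Collapse runs of whitespace so stray spaces do not count as errors.
        first = " ".join(fields[0].split())
        second = " ".join(fields[2].split())
        if first != second:
            flagged.append((number, first, second))
    return flagged


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    rows = [
        "John Doe\tIncrease\tJohn Doe\tApril 1, 1904\t1234",
        "Jane Roc\tGrant\tJane Roe\tMay 2, 1904\t1235",
    ]
    for number, first, second in mismatched_rows(rows):
        print(number, first, second, sep="\t")
```

Only the flagged rows then need to be checked against the page scans.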