Improving the proofreading rate
An increasing number of indexed works are not proofread and are effectively "dead": there were 2,200 last year and there are now 3,300. In the last year nearly 1,400 new index files were added, but only 106 were completely checked.
Using different techniques including match and split, the basic time for preparing a work can come down to a few tens of seconds per page, but proofreading a page still needs 2-5 minutes or more, depending on the size and difficulty of the page.
Changing this situation needs a lot more proofreaders. At the moment the goal is not much more than one book a month. That needs to become hundreds of books if the index file system is not to become a disappointment.
Yet these index pages are effectively invisible to the newcomer. The only list that is presented is of those books that are "done". The category list is worse than useless: it's a turnoff.
So we need to set up a proofreading project.
A look at the current state of play shows over 1,700 single volume index files not yet proofread:
|Up to 2010||898||329||707||76|
|Total left to do||1,767||2,165|
|Have other problems||86||113|
These stats are broad-brush, mostly derived from Hesperion's lists. Multi-volume works have been separated out: these are works with 3 or more similar titles that differ mainly in numbering, e.g. 700 copies of the Scientific American. Since they now dominate in numbers, with only around 100 different titles for 2,000 index files, we need to treat them separately so that they do not overwhelm the other categories. Their proofreading record is also noticeably worse than that of single volume works. I've also excluded around 300 non-DJVU/PDF files from this analysis.
To attract new proofreaders, we need clear lists of works that need proofreading and validation, kept up to date by at least semi-automatic methods. We need to cater for a wide variety of tastes: different people will be attracted to poetry, history, novels, science, law reports, government documents, etc. Categorising the index files would help with this.
The main method of rewarding effort has been through marks placed on users' pages concerning "proofread of the month". Would that scale up by a factor of 10 or 100? Doubtful. Another approach would be to have a "progress page" where recent advances are recorded. This has to be able to deal with both short and long works, whether 5 or 500 pages. Milestones are important: confirmed transitions from "To be proofread" to "To be validated" to "Done".
The top 10 in the multiple category, with the number of index files and total pages to be proofed for each, are:
|Index file name||Files||Av pages||Total pages|
|Index:United States Statutes at Large Volume 1.djvu||199||1296||257,869|
|Index:Notes and Queries - Series 1 - General Index.djvu||152||570||86,625|
|Index:Popular Science Monthly Volume 1.djvu||92||806||74,159|
|Index:All the Year Round - Series 1 - Volume 1.djvu||59||644||37,969|
|Index:Dictionary of National Biography volume 01.djvu||63||464||29,228|
|Index:Sacred Books of the East - Volume 1.djvu||51||495||25,233|
|Index:Title 3 CFR 1936-1938 Compilation.djvu||38||581||22,065|
|Index:Confederate Veteran volume 01.djvu||31||597||18,506|
|Index:Federal Cases, Volume 15.djvu||14||1286||18,008|
|Index:Philosophical Transactions - Volume 001.djvu||39||433||16,870|
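To give a feel for the scale, the following is a minimal sketch that adds up the "Total pages" column of the table above and converts it to proofreading hours, assuming the 2-5 minutes per page quoted earlier. The shortened titles and the function name `proofreading_hours` are illustrative only, and since a page can take longer than 5 minutes the upper figure is not a true maximum.

```python
# Total pages still to be proofed, copied from the table above.
# Titles are shortened from the index file names for readability.
TOP_TEN_PAGES = {
    "United States Statutes at Large": 257_869,
    "Notes and Queries": 86_625,
    "Popular Science Monthly": 74_159,
    "All the Year Round": 37_969,
    "Dictionary of National Biography": 29_228,
    "Sacred Books of the East": 25_233,
    "Title 3 CFR": 22_065,
    "Confederate Veteran": 18_506,
    "Federal Cases": 18_008,
    "Philosophical Transactions": 16_870,
}


def proofreading_hours(pages, minutes_per_page):
    """Hours needed to proofread `pages` at a fixed rate per page."""
    return pages * minutes_per_page / 60


total_pages = sum(TOP_TEN_PAGES.values())
low = proofreading_hours(total_pages, 2)   # assumed faster rate: 2 minutes per page
high = proofreading_hours(total_pages, 5)  # assumed slower rate: 5 minutes per page
print(f"{total_pages:,} pages: roughly {low:,.0f} to {high:,.0f} hours")
```

Even at the faster rate, the ten works alone come to many thousands of hours of proofreading.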
If these are ever to be completed, a very different strategy is needed. Here is a summary of the size of the task for multi-volume files.
- To 23 December
- 23 December to 25 August
- 25 August to 31 July
Proofreading single volume works
|Pages||No. files||Total pages||notproofed|
There is a huge range of document sizes as shown by the table on the right:
Remembering that the amount of work is basically the number of pages, one can see that much of the work is in documents between 250 and 700 pages and these are very challenging tasks. We propose three sizes of task based on the number of pages to be proofed.
|Range||Files||Total Pages||notproofed||Percent done|
|Up to 50||197||4,087||3,177||22.3|
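The "Percent done" column appears to be the share of pages no longer in the "notproofed" state, and a file's size class depends on the number of pages still to be proofed. A minimal sketch of both calculations follows. The 50-page limit comes from the table; the 250-page limit for the longest class is only an assumed placeholder, as are the function names.

```python
def percent_done(total_pages, not_proofed):
    """Share of pages already proofread, as a percentage."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return 100 * (total_pages - not_proofed) / total_pages


def size_class(pages_to_proof, short_limit=50, long_limit=250):
    """Assign a task size from the number of pages still to be proofed.

    short_limit matches the "Up to 50" row; long_limit is a hypothetical
    threshold chosen purely for illustration.
    """
    if pages_to_proof <= short_limit:
        return "short"
    if pages_to_proof <= long_limit:
        return "intermediate"
    return "long"


# The "Up to 50" row: 4,087 pages in total, 3,177 not yet proofed.
print(round(percent_done(4087, 3177), 1))  # 22.3
```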
The first category currently includes quite a few works which need just a few pages tidied up to move on into the validation class. Here is an (almost) current listing. One thing that is obvious from these is that many simply require pictures to be extracted from the djvu files. But these shorter works may also appeal to people who like to see completion of a task.
The intermediate length files are much more of a challenge. It's a much longer list and needs breaking up into smaller groups to be presentable. It's here that categorising would be of most use.
The longer works need a much more sustained approach, or else a cooperative one. They are more similar to the multi-volume works.
To make this more attractive it would help to be able to break down these long lists by categories: novels, histories, legislation, science etc. This would require the index files to be categorised, which is not present policy but would have significant benefits in the long run. Not only would these be categorised much earlier, but transclusion could be used to copy the categories into the mainspace, including the individual chapters if this is thought suitable.
The list of works to be validated is currently much shorter than the list needing proofreading, for the not very good reason that not many books have yet reached that stage. There aren't yet clear guidelines for validation. For example, if someone finds a problem which they can't fix and marks a page as problematic, many editors assume that the work needs to be moved back to the "To be proofread" stage. But is this necessarily sensible? "To be validated" does not imply that proofreading removed all problems, or there wouldn't be a need for a second stage. Maybe we need a second "blue" stage.
It is tempting to assume that validation can be given to newcomers as "there won't be so many problems". But in some ways validation needs more experience, to ensure that the work keeps to desirable Wikisource guidelines. Much clearer guidelines are needed.
On the other hand, validation should be a more pleasant reading experience and we need to consider how we can encourage people who come to Wikisource purely to read a work to stay and contribute to the validation process. It could be an excellent way of recruiting proofreaders.
The toolserver currently produces many useful statistics that not many people look at. But to encourage proofreading we do need some more specialised reports, such as:
- The number of pages advanced a stage (to proofed, validated)
- The number of index files advanced from "To be proofread" to "To be validated" and "Done".
- More focused reports on user activity.
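The first two reports in this list could be sketched as follows. Everything here is hypothetical: the file names, the shape of the change log, and in particular the rule that an index file counts as "To be validated" once every page is at least proofread. A real report would need to read page statuses from the site and handle cases such as pages without text.

```python
from collections import Counter

# Hypothetical log of status changes: (index file, page number, old, new).
page_changes = [
    ("Index:Example A.djvu", 1, "not proofread", "proofread"),
    ("Index:Example A.djvu", 2, "not proofread", "proofread"),
    ("Index:Example B.djvu", 7, "proofread", "validated"),
]

# Hypothetical current status of every page in each index file.
page_status = {
    "Index:Example A.djvu": ["proofread", "proofread"],
    "Index:Example B.djvu": ["validated", "proofread", "not proofread"],
}


def pages_advanced(changes):
    """Count how many pages moved into each new stage."""
    return Counter(new for _file, _page, _old, new in changes)


def index_stage(statuses):
    """Derive the stage of an index file from its page statuses (assumed rule)."""
    if all(s == "validated" for s in statuses):
        return "Done"
    if all(s in ("proofread", "validated") for s in statuses):
        return "To be validated"
    return "To be proofread"


print(pages_advanced(page_changes))
for name, statuses in page_status.items():
    print(name, "->", index_stage(statuses))
```

Comparing the output of `index_stage` between two runs would give the number of index files that advanced a stage in that period.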
The reports linked earlier have been produced by a mix of toolserver tools and manual collation but there's no reason why they can't all be produced on a regular basis automatically. We may need others.
There is already an established mechanism in WikiProject for handling big projects and one can identify projects for more than half a dozen of the bigger works in the list. It's noticeable that many of the other bigger works have had little done on them.
It would seem sensible to encourage the formation of new wikiprojects to coordinate the proofreading of these other big projects and to send interested people to the project. In reporting progress it would help to concentrate on the current most active volume(s).