User:Chris55/Essay2

Improving the proofreading rate

An increasing number of indexed works have not been proofread and are effectively "dead": there were 2,200 last year and there are now 3,300. In the last year nearly 1,400 new index files were added, but only 106 were completely checked.

Using different techniques including match and split, the basic time for preparing a work can come down to a few tens of seconds per page, but proofreading a page still needs 2-5 minutes or more, depending on the size and difficulty of the page.

Changing this situation needs a lot more proofreaders. At the moment the goal is not much more than one book a month. That needs to become hundreds of books if the index file system is not to become a disappointment.

Yet these index pages are effectively invisible to the newcomer. The only list that is presented is of those books that are "done". The category list is worse than useless: it's a turnoff.

So we need to set up a proofreading project.

A look at the current state of play shows over 1,700 single volume index files still to be finished, of which about 1,300 have not yet been proofread:

State	Single volume		Multi-volume
	Total/New	Done	Total/New	Done
Up to 2010^[1]	898	329	707	76
2010-11^[2]	512	42	519	11
2011-12^[3]	357	95	939	7
Total left to do	1,767		2,165
Need proofreading	1,318		2,003
Need validation	363		48
Have other problems	86		113

These stats are broad-brush, mostly derived from Hesperion's lists. Multi-volume works have been separated out: these are works with three or more similar titles that differ mainly in numbering, e.g. 700 copies of the Scientific American. Since they now dominate in numbers, they need to be treated separately so that they do not overwhelm the other categories: there are only around 100 different titles for 2,000 index files. Their proofreading record is also noticeably worse than that of single volume works. I've also excluded around 300 non-DJVU/PDF files from this analysis.
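The separation rule described above (three or more similar titles differing mainly in numbering) could be approximated mechanically. The following is only a sketch of one possible heuristic, not the method actually used for these stats: it assumes that replacing every run of digits with a placeholder is enough to make sibling volumes compare equal. The function names and the "Example" titles are hypothetical.

```python
import re
from collections import defaultdict

def title_key(title: str) -> str:
    """Collapse each run of digits to '#' so that volume numbers are ignored."""
    return re.sub(r"\d+", "#", title).strip().lower()

def split_multi_volume(titles, min_group=3):
    """Return (multi, single): groups of min_group or more similar titles, and the rest."""
    groups = defaultdict(list)
    for title in titles:
        groups[title_key(title)].append(title)
    multi = {key: members for key, members in groups.items() if len(members) >= min_group}
    single = [t for members in groups.values() if len(members) < min_group for t in members]
    return multi, single

titles = [
    "Index:Popular Science Monthly Volume 1.djvu",
    "Index:Popular Science Monthly Volume 2.djvu",
    "Index:Popular Science Monthly Volume 3.djvu",
    "Index:Example Novel.djvu",             # hypothetical title
    "Index:Example History Volume 1.djvu",  # hypothetical title
    "Index:Example History Volume 2.djvu",  # hypothetical title
]
multi, single = split_multi_volume(titles)
for key, members in multi.items():
    print(key, len(members))
```

A heuristic like this would miss volumes numbered with Roman numerals or words, so its output would still need a manual check.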

To attract new proofreaders, we need clear lists of works that need proofreading and validation, kept up to date by at least semi-automatic methods. We need to cater for a wide variety of tastes: different people will be attracted to poetry, history, novels, science, law reports, government documents etc. Categorising the index files would help this.

The main method of rewarding effort has been through marks placed on users' pages concerning "proofread of the month". Would that scale up by a factor of 10 or 100? Doubtful. Another approach would be to have a "progress page" where recent advances are recorded. This has to be able to deal with short and long works, whether 5 or 500 pages. Milestones are important: confirmed transitions from "To be proofread" to "To be validated" to "Done".

The top 10 in the multiple category, with the number of index files and total pages to be proofed for each, are:

Index file name	Files	Av pages	Total pages
Index:United States Statutes at Large Volume 1.djvu	199	1296	257,869
Index:Notes and Queries - Series 1 - General Index.djvu	152	570	86,625
Index:Popular Science Monthly Volume 1.djvu	92	806	74,159
Index:All the Year Round - Series 1 - Volume 1.djvu	59	644	37,969
Index:Dictionary of National Biography volume 01.djvu	63	464	29,228
Index:Sacred Books of the East - Volume 1.djvu	51	495	25,233
Index:Title 3 CFR 1936-1938 Compilation.djvu	38	581	22,065
Index:Confederate Veteran volume 01.djvu	31	597	18,506
Index:Federal Cases, Volume 15.djvu	14	1286	18,008
Index:Philosophical Transactions - Volume 001.djvu	39	433	16,870
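To give a sense of scale, the page totals in this table can be combined with the 2-5 minutes per page quoted earlier. This is a rough sketch only: it assumes every page takes between 2 and 5 minutes and ignores validation time; the variable and function names are illustrative.

```python
# Pages still to be proofed in the ten largest multi-volume works (from the table above).
PAGES_TO_PROOF = {
    "Index:United States Statutes at Large Volume 1.djvu": 257_869,
    "Index:Notes and Queries - Series 1 - General Index.djvu": 86_625,
    "Index:Popular Science Monthly Volume 1.djvu": 74_159,
    "Index:All the Year Round - Series 1 - Volume 1.djvu": 37_969,
    "Index:Dictionary of National Biography volume 01.djvu": 29_228,
    "Index:Sacred Books of the East - Volume 1.djvu": 25_233,
    "Index:Title 3 CFR 1936-1938 Compilation.djvu": 22_065,
    "Index:Confederate Veteran volume 01.djvu": 18_506,
    "Index:Federal Cases, Volume 15.djvu": 18_008,
    "Index:Philosophical Transactions - Volume 001.djvu": 16_870,
}
MINUTES_PER_PAGE = (2, 5)  # assumed range, taken from the estimate earlier in this essay

def hours_needed(pages: int, minutes_per_page: float) -> float:
    """Convert a page count into proofreading hours at a fixed per-page rate."""
    return pages * minutes_per_page / 60

total_pages = sum(PAGES_TO_PROOF.values())
low, high = (hours_needed(total_pages, m) for m in MINUTES_PER_PAGE)
print(f"{total_pages:,} pages: roughly {low:,.0f} to {high:,.0f} hours")
```

On these assumptions the top ten alone come to somewhere between about 19,500 and 48,900 hours of proofreading.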

If these are ever to be completed, a very different strategy is needed. Here is a summary of the size of the task for multi-volume files.


Proofreading single volume works

Pages (up to)	No. files	Total pages	Not proofread
50	197	4,087	3,177
100	94	6,992	4,918
150	67	8,370	6,301
200	83	14,498	11,266
250	85	19,123	14,726
300	108	29,911	22,325
350	110	35,742	27,007
400	102	38,331	32,174
450	107	45,304	38,420
500	84	39,795	34,357
550	78	40,928	35,736
600	53	30,439	25,614
650	44	27,589	23,357
700	29	19,725	17,656
750	20	14,526	12,472
800	16	12,448	11,563
850	8	6,564	6,022
900	12	10,388	8,891
950	3	2,784	2,753
1000	5	4,893	4,760
1050	6	6,139	5,820
1100	5	5,350	5,272
1150	6	6,703	5,699
1200	3	3,580	3,299
1250	3	3,668	3,475
1300	1	1,267	1,197
1350	1	1,337	1,337
1400	1	1,388	1,353
1600	1	1,572	1,154
2050	2	4,056	3,734

There is a huge range of document sizes, as the table above shows.

Since the amount of work is essentially the number of pages, one can see that much of the work is in documents of between 250 and 700 pages, and these are very challenging tasks. We propose three sizes of task based on the number of pages to be proofed.

Range	Files	Total pages	Not proofread	Percent done
Up to 50	197	4,087	3,177	22.3
51-500	840	238,066	191,494	19.6
Over 500	297	205,344	181,164	11.8
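The three-way summary can be reproduced from the 50-page bands of the larger table. This is a minimal sketch, assuming each band is labelled by its upper page limit (so the "50" row is the "Up to 50" range); the names BANDS, RANGES and summarise are illustrative.

```python
# (upper page limit of band, files, total pages, pages not yet proofread)
BANDS = [
    (50, 197, 4087, 3177), (100, 94, 6992, 4918), (150, 67, 8370, 6301),
    (200, 83, 14498, 11266), (250, 85, 19123, 14726), (300, 108, 29911, 22325),
    (350, 110, 35742, 27007), (400, 102, 38331, 32174), (450, 107, 45304, 38420),
    (500, 84, 39795, 34357), (550, 78, 40928, 35736), (600, 53, 30439, 25614),
    (650, 44, 27589, 23357), (700, 29, 19725, 17656), (750, 20, 14526, 12472),
    (800, 16, 12448, 11563), (850, 8, 6564, 6022), (900, 12, 10388, 8891),
    (950, 3, 2784, 2753), (1000, 5, 4893, 4760), (1050, 6, 6139, 5820),
    (1100, 5, 5350, 5272), (1150, 6, 6703, 5699), (1200, 3, 3580, 3299),
    (1250, 3, 3668, 3475), (1300, 1, 1267, 1197), (1350, 1, 1337, 1337),
    (1400, 1, 1388, 1353), (1600, 1, 1572, 1154), (2050, 2, 4056, 3734),
]

RANGES = {
    "Up to 50": (0, 50),
    "51-500": (50, 500),
    "Over 500": (500, float("inf")),
}

def summarise(bands, low, high):
    """Sum the bands whose upper limit lies in (low, high] and compute percent done."""
    files = total = todo = 0
    for upper, n_files, pages, unproofed in bands:
        if low < upper <= high:
            files += n_files
            total += pages
            todo += unproofed
    percent_done = round(100 * (total - todo) / total, 1)
    return files, total, todo, percent_done

summary = {name: summarise(BANDS, low, high) for name, (low, high) in RANGES.items()}
for name, row in summary.items():
    print(name, *row, sep="\t")
```

Keeping the bands as data means the thresholds (50 and 500 pages) can be moved without recounting by hand, should different task sizes prove more attractive.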

The first category currently includes quite a few works which need just a few pages tidied up to move on into the validation class. Here is an (almost) current listing. One thing that is obvious from these is that many simply require pictures to be extracted from the djvu files. But these shorter works may also appeal to people who like to see completion of a task.

The intermediate length files are much more of a challenge. It's a much longer list and needs breaking up into smaller groups to be presentable. It's here where categorising would be of most use.

The longer works need a much more sustained approach, or else a cooperative one. They are more similar to the multi-volume works.

To make this more attractive it would help to be able to break down these long lists by categories: novels, histories, legislation, science etc. This would require the index files to be categorised, which is not present policy but would have significant benefits in the long run. Not only would these be categorised much earlier, but transclusion could be used to copy the categories into the mainspace, including the individual chapters if this is thought suitable.

Validation

The list of works to be validated is currently much shorter than the list needing proofreading, for the unfortunate reason that few books have yet reached that stage. There aren't yet clear guidelines for validation. For example, if someone finds a problem which they can't fix and marks a page as problematic, many editors assume that the work needs to be moved back to the "To be proofread" stage. But is this necessarily sensible? "To be validated" does not imply that proofreading removed all problems, or there would be no need for a second stage. Maybe we need a second "blue" stage.

It is tempting to assume that validation can be given to newcomers as "there won't be so many problems". But in some ways it needs more experience to ensure that it keeps to desirable Wikisource guidelines. Much clearer guidelines are needed.

On the other hand, validation should be a more pleasant reading experience and we need to consider how we can encourage people who come to Wikisource purely to read a work to stay and contribute to the validation process. It could be an excellent way of recruiting proofreaders.

Reporting progress

The toolserver currently produces many useful statistics that not many people look at. But to encourage proofreading we do need some more specialised reports, such as:

The number of pages advanced a stage (to proofread or to validated).
The number of index files advanced from "To be proofread" to "To be validated" and to "Done".
More focused reports on user activity.
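The first two reports could be derived by comparing two snapshots of per-index page counts. The sketch below is hypothetical throughout: the Snapshot structure, the stage rule (an index is "To be validated" once no page is left unproofread, and "Done" once every page is validated) and the example index names are assumptions for illustration, not an existing toolserver format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    """Page counts for one index file at one point in time (hypothetical structure)."""
    not_proofread: int
    proofread: int
    validated: int

def stage(s: Snapshot) -> str:
    """Assumed rule for the stage of an index file, based on its page counts."""
    if s.not_proofread > 0:
        return "To be proofread"
    if s.proofread > 0:
        return "To be validated"
    return "Done"

def progress(before: dict, after: dict) -> dict:
    """Count pages advanced and index files that changed stage between two snapshots."""
    report = {"pages_proofread": 0, "pages_validated": 0, "index_transitions": []}
    for name, new in after.items():
        old = before.get(name)
        if old is None:
            continue  # index file added since the first snapshot; nothing to compare
        report["pages_proofread"] += max(0, old.not_proofread - new.not_proofread)
        report["pages_validated"] += max(0, new.validated - old.validated)
        if stage(old) != stage(new):
            report["index_transitions"].append((name, stage(old), stage(new)))
    return report

before = {
    "Index:Example A.djvu": Snapshot(10, 40, 0),
    "Index:Example B.djvu": Snapshot(0, 5, 20),
    "Index:Example C.djvu": Snapshot(100, 0, 0),
}
after = {
    "Index:Example A.djvu": Snapshot(0, 50, 0),
    "Index:Example B.djvu": Snapshot(0, 0, 25),
    "Index:Example C.djvu": Snapshot(90, 8, 2),
}
report = progress(before, after)
print(report)
```

Because it works from counts rather than individual pages, this approach cannot attribute progress to particular users; the third report would need per-edit data.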

The reports linked earlier have been produced by a mix of toolserver tools and manual collation but there's no reason why they can't all be produced on a regular basis automatically. We may need others.

Multi-volume Works

There is already an established mechanism in WikiProjects for handling big projects, and one can identify projects for more than half a dozen of the bigger works in the list. It's noticeable that many of the other bigger works have had little done on them.

It would seem sensible to encourage the formation of new WikiProjects to coordinate the proofreading of these other big works and to send interested people to the relevant project. In reporting progress it would help to concentrate on the current most active volume(s).

[1] To 23 December

[2] 23 December to 25 August

[3] 25 August to 31 July
