Help talk:Proofread

other language wikis

How can this process be adopted on other language wikis? Please explain. -- 21:51, 29 September 2006 (UTC)

You have to have developers activate the ProofreadPage extension on your wiki. ThomasV 11:28, 30 September 2006 (UTC)
Small note to the anon: currently the extension works only in the Page: prefix. See bugzilla:7534. Lugusto 555 18:54, 9 October 2006 (UTC)
Thanks ThomasV and Lugusto --Vyzasatya 16:55, 15 October 2006 (UTC)

Example with screenshots

This - to me, a wikinewsie - seems an unusual and interesting extension. We periodically end up with documents that are scanned and turned into PDFs or only available as PDFs. However, the content of the corresponding help page is a bit greek (or is that geek?) to me.

A worked example in a sandbox page that people could do themselves would be good, with screenshots of what you'll see as you work through the proofing process. --Brianmc 16:18, 25 June 2007 (UTC)

Hi. I'm not too sure what you're looking for, but you can find an example here. In fact, doing a search in Special:Allpages for everything in the Page: namespace will show you working examples of how this extension operates.
Two things you have to note if you are going to use it:
  1. You must have the Page: namespace installed on whatever wiki you're working on.
  2. In order to get the side-by-side view, the name of the page in the "Page:" namespace must be identical to the image name on Commons or your local wiki. For example, the page I referenced above is called "Page:A Treatise on Electricity and Magnetism Volume 1 020.jpg". This means that there is an image called "Image:A Treatise on Electricity and Magnetism Volume 1 020.jpg" on Commons that the software will call up and display in the side-by-side view.
If you have any other questions or need me to clarify, please ask.—Zhaladshar (Talk) 16:56, 25 June 2007 (UTC)
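The naming rule in point 2 can be sketched as a small helper. This is only an illustration in Python: the function name is hypothetical and not part of the extension, and it simply swaps the "Page:" prefix for "Image:" as described above.

```python
def image_title_for_page(page_title: str) -> str:
    """Return the image title that a Page: title is expected to match.

    Hypothetical helper for illustration only; it mirrors the rule that the
    part after "Page:" must be identical to the image name.
    """
    prefix = "Page:"
    if not page_title.startswith(prefix):
        raise ValueError("expected a title in the Page: namespace")
    # Keep everything after the prefix unchanged, including the file extension.
    return "Image:" + page_title[len(prefix):]


print(image_title_for_page(
    "Page:A Treatise on Electricity and Magnetism Volume 1 020.jpg"
))
```

If the two names differ in any character, the side-by-side view has nothing to pair the page with.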
I've had a play with this - it didn't quite sink in there was a "Page:" namespace that the extension was embedded in. A few followup questions...
  • Does it work with PDFs?
  • Do you need OCR software for it?
  • If it works with PDFs, what can you use to scrape text content out of them?
Thanks for pointing me to where it'd been used. --Brianmc 16:36, 26 June 2007 (UTC)

About PDFs: no, currently this extension does not work with PDFs, only image files. I don't know technically what goes into this extension, but you can try contacting the extension's creator ThomasV (on his French WS page) about it if it is something at all possible to do. If it is just not possible, I can think of two other solutions to the PDF problem: convert the PDF to DjVu format (this extension now works with DjVu files) or--a more time-consuming process--manually take snapshots of each page of the PDF, turn them into PNG images, and upload them to the wiki.

And, no, you also don't need OCR software to use this, but it really is recommended. It takes a ton of time to manually type a page whereas with OCRing all you'd have to do is spend a few minutes double checking the scan results.—Zhaladshar (Talk) 17:14, 26 June 2007 (UTC)


  • Should display the image next to the edit window for the page you are editing, and also show the image during preview.
  • The page navigation tabs should be present on any document that has pages. — Omegatron 03:30, 9 September 2007 (UTC)

Examples in English

As this is the English Wikisource, could we get examples in English rather than French (fr:Page:Essais-Livre 1-0076.jpg)? Jeepday 12:50, 19 June 2008 (UTC)

Done. Yann 14:23, 19 June 2008 (UTC)
Very cool, thank you :) Jeepday 20:09, 19 June 2008 (UTC)

Update Color

As I have been working on Index:Personal Recollections of Joan of Arc.djvu I have figured out that the hyperlinks to pages are color coded according to their status (see Help:Page Status):

  • Clear = Not yet built
  • Purple = Problematic
  • Red = Not proofread
  • Yellow = Has been proofread
  • Green = Validated

But I have not figured out how to make it update. I have created up to page 334 but the index is only showing up to page 221 as built. A few days ago when User:Sbh was proofreading some of the pages it updated a few times, now it has not updated in a while. How can I force it to update? It is a great tool for tracking progress, but only if it updates as progress occurs. Jeepday (talk) 12:54, 5 July 2008 (UTC)

I just tried something that seemed to work: on Index:Personal Recollections of Joan of Arc.djvu, I clicked the edit tab, then saved without making any changes. No new version was created in the history, but it did update the colors on the index immediately and correctly. Jeepday (talk) 21:05, 6 July 2008 (UTC)
There is a bug with index pages: for djvu files, the update is not automatic. The workaround is as follows: you can force the update by clicking on the 'Pages' link (with action=purge in the url). ThomasV 21:21, 6 July 2008 (UTC)
  • I was not able to make your solution work, I tried several different interpretations of your directions. Jeepday (talk) 23:08, 6 July 2008 (UTC)

Solution: The solution is given at Wikisource:Scriptorium#Adding purge link to Index pages. There is a link on every DjVu page to refresh (purge); near the top of the page you see this list:


Pages is a blue link that refreshes, purges or updates (however you want to look at it) the page status colors. Jeepday (talk) 11:22, 20 July 2008 (UTC)
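For anyone who would rather build the purge link by hand, here is a minimal sketch in Python. It assumes the standard MediaWiki `index.php?title=...&action=purge` URL form and the en.wikisource.org host; the function name is made up for illustration.

```python
from urllib.parse import urlencode

# Assumed base URL for English Wikisource; adjust for another wiki.
BASE_URL = "https://en.wikisource.org/w/index.php"


def purge_url(title: str) -> str:
    """Build a URL that asks the wiki to purge its cached copy of a page."""
    # MediaWiki treats underscores and spaces in titles as equivalent.
    params = {"title": title.replace(" ", "_"), "action": "purge"}
    return BASE_URL + "?" + urlencode(params)


print(purge_url("Index:Personal Recollections of Joan of Arc.djvu"))
```

Opening the resulting URL should have the same effect as clicking the Pages link, although the wiki may ask for confirmation first.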

Partial transclusion

I am trying to follow the directions at Help:Side by side image view for proofreading#Partial transclusion on the article page The Pilgrim Cook Book/Soups and cannot make it work. Can anyone fix it so I can see how it should work? Then I can do it on the rest of the book. Jeepday (talk) 11:11, 14 September 2008 (UTC)

Thank you, User:Psychless for fixing it at 13:23 on 14 September. Diff Jeepday (talk) 22:20, 14 September 2008 (UTC)

Alternate means

A discussion in Scriptorium alerted me to a means of invoking a section that differs from the one in the instructions. The two are:

  • as per the instructions
{{#section:Page:List of Carthusians 1800-1879.djvu/213|R}}
  • the alternate
{{Page|List of Carthusians 1800-1879.djvu/213|section=R}}

The former does not produce the [page] link, whereas the latter does. For simplicity's sake, I would like to see our instructions modified to demonstrate the latter. I also think that it is more understandable.

Also from the same discussion in Scriptorium, it would be worthwhile for us to note that each page transclusion is better placed on its own line. I would like the instructions to reflect that as well. -- billinghurst (talk) 00:48, 1 February 2009 (UTC)

Not heard any screams of Nooooo! I will wait a few more days and then update the page according to my suggestions. -- billinghurst (talk) 05:31, 12 February 2009 (UTC)
I guess the docs just lag behind the current practice a bit, but now that the Page template is in such wide use, it seems like an odd omission. So it's probably at least worth a mention that the template calls the parser function and adds that link. -Steve Sanbeg (talk) 18:51, 12 February 2009 (UTC)
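To illustrate the two suggestions in this thread together (the {{Page}} form, and one transclusion per line), here is a rough Python sketch that generates the wikitext. The function name and its parameters are hypothetical; only the template call format is taken from the example above.

```python
def page_calls(index_file, pages, section=None):
    """Return one {{Page|...}} call per line for the given page numbers."""
    lines = []
    for number in pages:
        call = "{{Page|" + f"{index_file}/{number}"
        if section is not None:
            # Same "section=" parameter as in the example above.
            call += f"|section={section}"
        call += "}}"
        lines.append(call)
    # One call per line, as suggested in the Scriptorium discussion.
    return "\n".join(lines)


print(page_calls("List of Carthusians 1800-1879.djvu", [213], section="R"))
```

Passing several page numbers gives one line per page, which is easier to scan and edit than a single long line.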

GIF files

Does proofread page not work with gifs, or am I doing something else wrong? See Index:Littell's Living Age. --T. Mazzei (talk) 05:47, 9 December 2008 (UTC)

It's been a while since you asked, but those pages look fine to me. The red links go to images of the page and you can create and add the text just like a djvu project. Forward and back buttons don't quite seem to work exactly correctly, but even though the image isn't displayed you can click on create and it pops up for you. --Mkoyle (talk) 22:18, 28 March 2009 (UTC)

Blank page template

I deleted the text referring to use of the {{blank page}} template, which was basically eliminated a few weeks ago. I wouldn't have noticed, except that I check my watchlist every once in a while and saw a huge number of edits to pages deleting this. Does anyone know anything about the new 'without text' page status? I assume something ought to be noted here, but I can't think what to include. --Mkoyle (talk) 18:56, 26 June 2009 (UTC)

This help page

This help page has gotten a bit ungainly. Much needed information is here, but with the addition of my DjVu how-to it looks cluttered... I am trying to think how to clean this up for the benefit of new users... any comments / suggestions / help is appreciated. --Mkoyle (talk) 21:58, 26 June 2009 (UTC)

Hidden headers - default to open?

Thanks to sDrewth on IRC, I've just discovered the headers on the proofreading pages, which were hidden because "Show header and footer fields when editing in the Page namespace" in my preferences was unselected. I assume that's the default setting as I haven't touched my preferences here before. Could this be changed so that not hiding the headers is the default for new users? Speaking as an experienced wikimedian (primarily on wikipedia), this option is very non-obvious at present. Presumably this requires a Bugzilla entry after consensus is reached here? Mike Peel (talk) 12:12, 2 January 2010 (UTC)

I think that we have done this. If it is not the case, then please state so. — billinghurst sDrewth 01:18, 19 May 2010 (UTC)

Ex Libris / faceplate pages

It would be useful to add a note that Ex Libris / faceplate pages are not part of the book and do not need to be marked with {{use page image}}, so they can simply be marked as NO TEXT before moving on. This probably also goes for the pages with borrowing slips and other gumph at the back of books. — billinghurst sDrewth 03:52, 3 November 2010 (UTC)

Pages difference

In Gesenius' Hebrew Grammar, and probably in many other DjVu and PDF files the technical page number and the logical page number in the book are not the same. On Index pages it is marked using <pagelist>. Is there a way to have the <pages> tag understand this automatically without having to calculate the pages difference every time? --Amir E. Aharoni (talk) 14:14, 5 November 2010 (UTC)

Sadly, no. The software wouldn't exactly know how to distinguish a page "xxv" from a page 4 (since the page number might not even be listed and the location of the numbers differ from book to book). All it can tell is the technical order, so we're stuck still having to calculate the difference.—Zhaladshar (Talk) 14:22, 5 November 2010 (UTC)
It may not be implemented now, but it doesn't seem completely impossible.
Here, for example the source tag says 'from="94" to="99"', but the page numbers on the left show the correct logical numbers: 70 to 75.
Can i request it as an enhancement to an extension? Which extension is it exactly - ProofreadPage? --Amir E. Aharoni (talk) 14:35, 5 November 2010 (UTC)
Many of the files display the correct page when viewed at I expect that information is contained in many djvus, getting the software to recognise that pagination would be a boon. Compensating for page skips, i.e. unnumbered pages, is particularly tedious. I think I have seen roman displayed in carefully made djvu files, others just restart the numbering; having this indexing will greatly assist our page listing, even if it is not automatic. cygnis insignis 16:01, 5 November 2010 (UTC)
I think that the DjVu with which i am working doesn't have it in itself, but the index page for it here defines the correct numbers. The software displays them correctly, but i need to write them manually in the source, which is quite a pain.
Currently this book uses {{page}} through a special template i created for it, which does this calculation internally ({{GHGpage-transclude}} and {{GHGpage-calc}}). Having this done automatically would be nice.
I'll ask at oldwikisource, i guess. --Amir E. Aharoni (talk) 20:56, 5 November 2010 (UTC)
Definitely something for oldwikisource:Wikisource talk:ProofreadPage.

If you know the offset and it is standard for a book, I would have thought that you would be able to script the calculation. I would suggest utilisation of the {{#tag:pages ...}}. I have done some of that sort of coding at {{Authority reference}} and run a little in User:Billinghurst/monobook.js that pulls the BASEPAGENAME for which work I am playing. — billinghurst sDrewth 07:30, 6 November 2010 (UTC)
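The offset calculation discussed in this thread can be sketched as follows. This is only an illustration in Python, assuming a constant offset for the whole book. The offset of 24 is taken from the example above (technical 94 corresponds to logical 70). The index name "Example.djvu" and the function names are placeholders, not real pages or extension features.

```python
def technical_page(logical_page: int, offset: int) -> int:
    """Convert a printed (logical) page number to the file's page position."""
    return logical_page + offset


def pages_tag_for_logical_range(index, first_logical, last_logical, offset):
    """Build a <pages> tag from logical page numbers and a constant offset."""
    first = technical_page(first_logical, offset)
    last = technical_page(last_logical, offset)
    return f'<pages index="{index}" from="{first}" to="{last}" />'


# From the example in this thread: logical page 70 is technical page 94.
OFFSET = 94 - 70

print(pages_tag_for_logical_range("Example.djvu", 70, 75, OFFSET))
# prints: <pages index="Example.djvu" from="94" to="99" />
```

A single offset breaks down as soon as the book has unnumbered plates or a separately numbered roman section, which is the tedious case mentioned above. Those would need a per-range table rather than one number.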

fromsection / tosection

The fromsection and tosection attributes often have an identical value. Is there a shortcut for that? --Amir E. Aharoni (talk) 14:50, 5 November 2010 (UTC)

That may be a logical option for when they are, but maybe check and ask around oldwikisource:Wikisource:ProofreadPage to see if there is reason not to. I always use s1, s2 as the label, so does the developer I think, that allows me to know what to call without editing the Page. cygnis insignis 16:18, 5 November 2010 (UTC)
No, so when I code, eg. {{DNBset}} I just call the same parameter into both places. For the biographical works I have been trying to utilise the section name as the SUBPAGENAME, and that way it pulls it as the parameter. Utilising s1/s2 is less obvious on pages with multiple sections, especially when a named section is sometimes more obvious. — billinghurst sDrewth 07:36, 6 November 2010 (UTC)
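The approach of passing the same parameter into both places can be sketched like this. It is a hypothetical Python helper that writes out the tag text, not a feature of the extension: when only one section label is given, it is used for both fromsection and tosection.

```python
def pages_tag(index, first, last, section=None, to_section=None):
    """Build a <pages> tag; a single section label fills both attributes."""
    attrs = [f'index="{index}"', f'from="{first}"', f'to="{last}"']
    if section is not None:
        # Reuse the starting label unless a different end label is supplied.
        end_label = to_section if to_section is not None else section
        attrs.append(f'fromsection="{section}"')
        attrs.append(f'tosection="{end_label}"')
    return "<pages " + " ".join(attrs) + " />"


print(pages_tag("Example.djvu", 12, 12, section="s1"))
```

"Example.djvu" and the label "s1" are placeholders. A named section works the same way.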

Don't we have a search here?

Hey there, I'm quite new to this thing. This is where I came from. AFAICS, there is no way to search for titles to be proofread. Plus, I am unable to find any of the to-be-proofread ones via the general search. Everything is so confusing, let alone hardly self-explanatory and Heaven knows what else. Finally, I found the page where I wanted to go to (instead of pressing next 200 several hundred times) by studying the URL format and hacking it. (Since I was sick and tired of searching around any more.) So much for "intuitive UI"! -andy 19:20, 20 December 2010 (UTC)

Unfortunately there is no easy way to get a list of works that need to be proofread. The best ways to find something you are interested in are to browse, talk to someone who edits works similar to what you are interested in, or add your own djvu files to proofread. —Spangineer (háblame) 21:14, 20 December 2010 (UTC)
The pages to be proofread can be found here: Category:Index Not-Proofread, and/or here: Category:Not proofread. I will add a TOC to those pages to aid in navigation. --Eliyak T·C 22:18, 20 December 2010 (UTC)
I forgot about Category:Index Not-Proofread; excellent suggestion. —Spangineer (háblame) 22:32, 20 December 2010 (UTC)
Plus we also have Wikisource:Proofread of the Month which is always active and a great place to learn, and always check its talk page. — billinghurst sDrewth 10:50, 21 December 2010 (UTC)

PDF vs DjVu

I've removed this text:

This format is preferable to pdf because it is a free format, and more to the point, because Wikimedia Commons does not allow uploads of .pdf files.

Neither of these is true anymore: PDF is now free and Commons now accepts it. So, besides the higher compression, why do we prefer DjVu? --Doug.(talk contribs) 19:40, 21 December 2010 (UTC)

Djvu is optimised for text, and well supported here, size is an issue with large volumes of this. The improved image support of pdf is not usually needed, and I'm not sure the text layer is available in the 'Page: namespace'. (I think that transcribing would be a low priority for some pdf files, eg. modern texts not available in djvu, when they are already well presented, search-able, and freely accessible.) cygnis insignis 04:54, 22 December 2010 (UTC)
This just seems contrary to the general rule that we prefer the highest quality image available. We also have contradictory information about illustrations: here we tell people to zoom the DjVu to 100%, whereas at Help:DjVu_files we tell people not to unless there's nothing else available. FYI, I'm being Devil's advocate here, trying to understand our logic so we can improve our help pages. I don't know enough about this stuff to actually argue a change, but we should be able to explain it to someone as dense as me. ;-) --Doug.(talk contribs) 06:26, 22 December 2010 (UTC)
The general rule would be the best image for the presentation, the scan text needs only to be adequate, that version serves our transcript. The page explains how to zoom in on a page, is that what you mean? The help pages do need simplifying, I agree, cygnis insignis 07:13, 22 December 2010 (UTC)
Sorry, I was referring to another page, not this one, my mistake. On Help:Adding_images we tell people to zoom them to 100%. This point belongs there not here, but it's part of the overall, why we prefer DjVu. I still wonder if there isn't a difference, maybe even a big one, between saying "we prefer DjVu" and saying "DjVu is adequate". Really for proofreading, particularly for older or low quality originals, the image quality is quite important.--Doug.(talk contribs) 18:30, 22 December 2010 (UTC)
It is important that image quality for text is adequate for a transcription, so higher where necessary. An image that is intended to be displayed, such as an engraving or other illustration, should be obtained at the highest quality that is reasonably possible. Without looking, because I wrote it, zooming into an online scanner refers to obtaining a jpeg from the site that provides the scan, nothing to do with proofreading. Maybe that can be more explicit. PDF is okay, djvu is poor to passable quality, for the presentation of illustrations. cygnis insignis 19:32, 22 December 2010 (UTC)
Ahah! I thought it *might* be talking about jpeg and not DjVu. I think it could be clarified. Chatting with others on IRC, the point came up that apparently text layer extraction for PDF has never been developed on WS, you can generate text with the OCR button but an existing text layer apparently won't be recognized. This may just be because at the time the ability was developed in 09 for DjVu, PDF had only just become free. I guess we'll have to ask ThomasV and see if that's the case and if he's working on it for PDF.--Doug.(talk contribs) 21:37, 22 December 2010 (UTC)

"Remove end-of-line hyphens and line breaks."

The above quote is plain and simple, but I've not been following it. Specifically I have not been removing line breaks until I recently noticed the last line break in a Page: is seen as a paragraph break.

The reason why I didn't remove line breaks is because it makes it easier to locate text in the original document for comparisons, etc.

I have noticed that another reason for (re)moving line breaks is that some formatting and templates don't work across new lines, especially italics!

While editing this note I've checked the last line break issue with the sandbox and it happens when there is no blank line starting the footer, and only affects the Page: display, not what is seen when the pages are combined.

So I'd suggest line breaks be not removed, just moved to fix end-of-line hyphens and to ensure formatting and templates work. Mark Hurd (talk) 16:20, 14 November 2011 (UTC)

I always (try) to remove line breaks as the last step of proofreading. The primary reason being that different applications will handle the line breaks differently, I refer mostly to usage outside of Wikisource. The secondary reason being that it just looks cleaner. JeepdaySock (talk) 17:01, 14 November 2011 (UTC)
External example W:Help:Books. JeepdaySock (talk) 17:05, 14 November 2011 (UTC)
I did a 'quick' test on en.wp and didn't notice a difference. Do you have a more specific citing of an issue? Mark Hurd (talk) 02:19, 15 November 2011 (UTC)
Re: It being "easier to locate text in the original document for comparisons" as one of your reasons for not removing end-of-line breaks: My opinion on that is that a proofreader should be going word-by-word, line-by-line anyway, and that manually removing line breaks actually helps you keep your place in the proofreading process (I also recommend following along with the cursor/right arrow key or whatever)... And say you have to answer the phone or the door while in the middle of proofreading a page,—it becomes easier to locate where you left off by there being a distinction between no line breaks and line breaks. And yes, italics & other formatting are affected by line breaks as you stated. Failing to remove end-of-line breaks as a habit leaves a greater likelihood that more formatting errors will result. I feel that you should err on the side of being meticulous—even if it might be inconvenient. Londonjackbooks (talk) 17:30, 14 November 2011 (UTC)
Fair enough, but I do suggest this step be left for the Validating editor. Mark Hurd (talk) 02:19, 15 November 2011 (UTC)
The reason I always remove line breaks in prose is that we are reproducing the text rather than the page. Doing so means that it doesn't matter what size window someone is working in, the text will wrap nicely. For me it's all about the end user. With respect to the suggestion of leaving it to the Validating editor, texts are frequently left unvalidated for long periods of time. The result would be that the usability of the text in the meantime is reduced. Beeswaxcandle (talk) 06:52, 15 November 2011 (UTC)
I also always remove line breaks, and I've been looking for a way of getting them to actually display (akin to what most text editors can do), so I don't get confused about what's the end of the editing window's line, and what's actually a newline character. I know Project Gutenberg don't remove them, and I've always found it an annoyance when using their texts. As beeswaxcandle said, we're reproducing the text, rather than the book. —— Sam Wilson ( TalkContribs ) … 07:09, 15 November 2011 (UTC)
Likewise, I always remove linebreaks from prose for the same reasons Beeswaxcandle mentions. Leaving the line breaks in is a very bad idea. I generally apply my clean up script to the page as my first action on my initial edit (removing all line breaks and hyphens), though sometimes it is my last action before I save. I do not see a lot of value in keeping the lines the same (though that's not always true when I'm working on other subdomains where I am not fluent). In over-under view it can be helpful, but I started out here in side-by-side, where things don't line up anyway. In a few cases I removed the linebreaks via a bot job at the beginning.--Doug.(talk contribs) 08:10, 15 November 2011 (UTC)
The "pudding's" in the proofread. I "kick" myself when someone validates a page I have proofread, and they find errors that I overlooked... I can tell by my own work—and by another editor's correction (validation) of my work—whether I was distracted that day or whether my heart wasn't really in the piece I was working on (ditto for the validating editor!)... That reminds me... I have to check on something! :) Londonjackbooks (talk) 10:55, 17 November 2011 (UTC)
In addition, Re: "I do suggest [line break removal] be left for the Validating editor": Why I believe it shouldn't be left for the validating editor: Because a page is "eligible" for transclusion once proofread, it is important that it be proofread sufficiently enough that minimal-to-no mistakes would crop up. As we already know, failure to remove end-of-line breaks "corrupts" certain formatting (italics, etc.), so why would we want to wait for a page to be validated (which could take years) before we eliminate the possibility of error? As it is, even validated pages still have mistakes on them—even despite an editor's best effort. Both proofreading and validating should be done with a fine-toothed comb. Another thing that should be avoided: assuming that a page you are validating is already "up to par" as a result of "knowing" the proofreader. We all have "bad days!" So many more reasons for being meticulous at every step! :) Londonjackbooks (talk) 13:38, 17 November 2011 (UTC)
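For readers wondering what a clean-up step like the one mentioned in this thread might look like, here is a minimal sketch in Python. It is not anyone's actual script. It assumes that a blank line marks a paragraph break and that every hyphen at the end of a line splits a single word, which is not always true: a genuinely hyphenated compound broken at its hyphen would be wrongly joined and still needs a human check.

```python
import re


def unwrap(text: str) -> str:
    """Remove end-of-line hyphens and line breaks within each paragraph."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    cleaned = []
    for para in paragraphs:
        # Rejoin a word split across lines: "jum-\nped" becomes "jumped".
        para = re.sub(r"(\w)-\n(\w)", r"\1\2", para)
        # Turn every remaining line break into a single space.
        para = re.sub(r"\s*\n\s*", " ", para)
        cleaned.append(para)
    # Keep paragraph breaks as blank lines.
    return "\n\n".join(cleaned)


sample = "The quick brown fox jum-\nped over the\nlazy dog.\n\nNew para-\ngraph here."
print(unwrap(sample))
```

Because the line breaks inside a paragraph are gone afterwards, formatting such as italics no longer straddles a newline.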


I am going to start breaking this off into a separate page and trying to fully document it, leaving a summary and link on this page, unless there are strong objections. Probably will start on this tomorrow.--BirgitteSB 00:07, 22 July 2012 (UTC)

Makes sense to me. Jeepday (talk) 10:02, 22 July 2012 (UTC)
And to me. I'm concerned by the fact that the number of those active on English Wikisource has declined in the last couple of years—we need to encourage more people to start contributing. But not to load more books—proofreading and verification are the big shortcomings. But transclusion and loading of index files are quite complicated, so I think we should keep the proofreading help as simple as we can so that people can cut their teeth here and maybe stay here: they can read the books and correct them and know they are contributing, but without too many complications. Chris55 (talk) 20:36, 23 July 2012 (UTC)
I've started on it and am nowhere close to what I would like it to be. I just don't think this is something I can spend 4 hours on at one go and produce anything worthwhile. I plan to do more tomorrow. Feel free to poke me if you think I am leaving a mess about for too long.--BirgitteSB 01:28, 24 July 2012 (UTC)

Purging a file

Probably worth a note somewhere that if working on a djvu file with a text layer, and the text layer is not appearing, then the file should be purged at Commons. Plus, presuming that my edit to the Index page template remains, I have added a refresh icon (File:View-refresh.svg) to that template which will link through. — billinghurst sDrewth 12:23, 7 December 2012 (UTC)

More details for those of us who do not regularly work with the djvu files? Jeepday (talk) 11:35, 8 December 2012 (UTC)
The "caching coma" issue affects both PDF & DjVu files. This issue seems to affect files uploaded ~3 years ago on the norm, and the larger the file, the more likely it seems to suffer from it. For reasons yet to be ascertained, files with perfectly good text-layers no longer appear to automatically dump that layer to the Page: namespace when a user goes to create a Page for the first time. Purging the base File: page a couple of times seems to undo the log-jam, refresh the cache and allow the text-layer to come through here on en.WS as expected. We suffered through a recent rash of missing text-layers with similar characteristics to "caching coma" thanks to a core upgrade, which was solved in Bugzilla: 42466. The affected files also need to be refreshed, and that is done through purging the File: (cache). On Commons, this is done by selecting Purge from the Actions menu (move, delete, protect, etc. under Vector Skin) or the "*" tab (monobook skin). I find Purging 2x while there saves an extra trip back to Commons more often than not. -- George Orwell III (talk) 20:51, 8 December 2012 (UTC)
Sometimes the text layer is not imported when you start a Page: ns file, or sometimes there is a file change, eg. djvu updated. — billinghurst sDrewth 13:43, 9 December 2012 (UTC)
Are there contraindications for purging? (Or can bad things happen if you purge the file for something like Black Beauty, a fully validated and featured work?) Jeepday (talk) 13:52, 9 December 2012 (UTC)