Wikisource:WikiProject OCR
This project is for users to request that scans be OCRed for various Wikisource-related projects.
Instructions
The participants listed below are users who have access to some kind of OCR software and are willing to extract text from scanned documents.
Users who want a text OCRed should place their request under the Requests section in the following format:
[[Title of the book]] (year published) - Author. # of pages. [source where pages can be found]
Note: "year published" should be when it was published in the U.S. as this will make determining the copyright status easier.
These are the general instructions for requesting OCR of a work; individual participants may have more specific instructions for the projects they take on.
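As an illustration of the request format above, the following minimal sketch builds one request line from its parts. The function name `format_request` and the sample values are hypothetical and used only for illustration; they are not part of the project's tooling.

```python
def format_request(title, year, author, pages, source_url):
    """Build one request line in the format described above.

    All parameter names are assumptions made for this sketch.
    """
    return f"[[{title}]] ({year}) - {author}. {pages} pages. [{source_url}]"


# Hypothetical sample values, not a real request.
print(format_request("Example Title", 1900, "A. N. Author", 120,
                     "http://example.org/scans"))
```

The year passed in should be the U.S. publication year, as noted above.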
Participants
Zhaladshar
Instructions
Preference given to:
- Smaller requests
- Requests where obtaining the scans is easier (such as downloading a ZIP file instead of having to access each scan and download them all individually)
- Works that are hard to find in text form elsewhere on the Internet
- Works that I do not proofread
I will only work on two large projects at a time (first come, first served) and will fit smaller projects in as I make time for them.
Current projects
| Title | Year published | Author | Pages | Source | Completion |
|---|---|---|---|---|---|
| Historical Library | 1814 | Diodorus Siculus (trans. G. Booth) | 677 | | < 5% |
Benn Newman
Instructions
Preference given to:
- Smaller requests
- Requests where obtaining the scans is easier (such as downloading a ZIP file instead of having to access each scan and download them all individually)
- Works that are hard to find in text form elsewhere on the Internet
- Works that I have not proofread
Current projects
User:Inductiveload
Instructions
Preference given to:
- Larger or non-standard requests, or where image batch-processing or DjVu conversion is needed
- English requests
- Requests where obtaining the scans is hard (batch-downloading is my favourite bot activity)
- Works that are hard to find in text form elsewhere on the Internet
- Works that are likely to be proofread soon
- Large reference works which, even if not proofread soon, provide a valuable reference resource.
Current projects
Requests
Done
- Single European Act (on Wikipedia), a European Union treaty of 1986. It's quite short (29 pages) and available in scanned PDF form. I've been looking for a text version for a while, but have never managed to find one. [1] Blue-Haired Lawyer (talk) 18:07, 21 December 2008 (UTC)
- Vlas Mikhaĭlovich Doroshevich (w:ru:Дорошевич, Влас Михайлович) "The Way of the Cross" (translation by Stephen Graham, probably w:Stephen Graham (author)). Original Russian text in public domain (Doroshevich died in 1922). Book is public domain in USA (printed in 1916). --EugeneZelenko (talk) 03:41, 23 July 2009 (UTC)
- Index:The Way of the Cross, Doroshevich, tr. Graham, 1916.djvu. Inductiveload—talk/contribs 17:41, 8 June 2011 (UTC)
- Cyclopaedia, or Universal Dictionary of Arts and Sciences (on Wikipedia) (1728) - Ephraim Chambers. Seems to be about 1430 pages, according to the TOC. [2] --Rory096 02:59, 23 November 2006 (UTC)
- Done via the Internet Archive. Index:Cyclopaedia, Chambers - Volume 1.djvu, Index:Cyclopaedia, Chambers - Volume 2.djvu, Index:Cyclopaedia, Chambers - Supplement, Volume 1.djvu, Index:Cyclopaedia, Chambers - Supplement, Volume 2.djvu - over 4000 pages in all! Inductiveload—talk/contribs 17:12, 19 November 2011 (UTC)
OCR bot
There is an automatic tool for OCRing a single page at a time, which is useful for repairing text on pages where it is missing or incomplete. It is available through the editing toolbar in the Page: namespace and is accessed by clicking the OCR button. The edit box goes grey while the server processes the image, and the OCR text appears in the edit box within a few seconds (larger pages with more text take longer). You can check the status at http://toolserver.org/~phe/ocr.php. The tool also automatically OCRs the next page whenever a page is retrieved, so that page's text should be ready by the time you edit it.
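The next-page behaviour can be pictured as a small prefetching cache. The sketch below is only an illustration of that idea, not the tool's actual implementation: the class name `PrefetchingOcr` and the `ocr_page` callable are hypothetical, and the prefetch here runs synchronously, whereas a real service would presumably do it in the background.

```python
class PrefetchingOcr:
    """Cache OCR results and prepare the following page on each request."""

    def __init__(self, ocr_page):
        # ocr_page is a hypothetical callable: page number -> OCR text.
        self._ocr_page = ocr_page
        self._cache = {}

    def _ensure(self, page):
        # Run OCR only if this page has not been processed yet.
        if page not in self._cache:
            self._cache[page] = self._ocr_page(page)

    def get(self, page):
        self._ensure(page)
        text = self._cache[page]
        # Prepare the next page so it is ready when the editor moves on.
        self._ensure(page + 1)
        return text
```

With this arrangement, requesting page 1 runs OCR on pages 1 and 2, and a later request for page 2 is served from the cache while page 3 is prepared.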