Wikisource:WikiProject OCR

WikiProject OCR
Shortcut:
WS:OCR
This project lets users request that scans be OCRed for various Wikisource-related projects.

Instructions

The participants listed below are users who have access to some kind of OCR software and are willing to extract text from scanned documents.

Users who want a text OCRed should place their request under the Requests section in the following format:

[[Title of the book]] (year published) - Author. # of pages. [source where pages can be found]

Note: "year published" should be the year the work was published in the U.S., as this makes determining the copyright status easier.

These are the general instructions for requesting OCR. Individual participants may list more specific instructions for the projects they take on.
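
As a rough illustration, a request line in the format above could be assembled with a small helper like the following sketch. The function name `format_request` and its parameter names are hypothetical and not part of any Wikisource tool, the URL is a placeholder, and rendering "# of pages" as "N pages" is an assumption about how the template is meant to be filled in.

```python
def format_request(title, year, author, pages, source_url):
    """Build one request line in the format described above.

    All names here are illustrative; "N pages" is an assumed
    rendering of the "# of pages" slot in the template.
    """
    return f"[[{title}]] ({year}) - {author}. {pages} pages. [{source_url}]"


if __name__ == "__main__":
    # Placeholder URL; replace with the page where the scans can be found.
    print(format_request(
        "Historical Library",
        1814,
        "Diodorus Siculus (trans. G. Booth)",
        677,
        "https://example.org/scans",
    ))
```

The year passed in should follow the note above, i.e. the U.S. publication year.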

Requested uploads to Internet Archive

Uploading a scan from an external website to the Internet Archive (IA) saves the trouble of extracting the OCR text and converting to DjVu. To request an upload to IA, follow the instructions at Help:Internet Archive/Requested uploads.

Participants

Zhaladshar

Instructions

Preference given to:

  1. Smaller requests
  2. Requests where obtaining the scans is easier (such as downloading a ZIP file instead of having to access each scan and download them all individually)
  3. Works that are hard to find in text form elsewhere on the Internet
  4. Works that I do not proofread

I will only work on two large projects at a time (first come, first served) and will fit smaller projects in as I make time for them.

Current projects

Title | Year published | Author | Pages | Source | Completion
Historical Library | 1814 | Diodorus Siculus (trans. G. Booth) | 677 | | < 5%

Benn Newman

Instructions

Preference given to:

  1. Smaller requests
  2. Requests where obtaining the scans is easier (such as downloading a ZIP file instead of having to access each scan and download them all individually)
  3. Works that are hard to find in text form elsewhere on the Internet
  4. Works that I have not proofread

Current projects

World Revolution


User:Inductiveload

Instructions

Preference given to:

  1. Larger or non-standard requests, or where image batch-processing or DjVu conversion is needed
  2. English requests
  3. Requests where obtaining the scans is hard (batch-downloading is my favourite bot activity)
  4. Works that are hard to find in text form elsewhere on the Internet
  5. Works that are likely to be proofread soon
  6. Large reference works that, even if not proofread soon, provide a valuable resource

Current projects

Requests

  • Artabanzanus (1896) William M. Ferrar. pages 162/314 [1] novel in pdf format with doubled pages, scan seems fairly good otherwise. Thx in advance Misarxist (talk) 14:10, 20 March 2012 (UTC)

Done

Done, but the noisy text with long-s has not OCR'd very cleanly. Is it sufficient? Inductiveloadtalk/contribs 09:23, 22 February 2012 (UTC)
 :( I don't think so. Thanks for trying. Moondyne (talk) 13:33, 22 February 2012 (UTC)
But I just found this. I might be OK after all. Moondyne (talk) 13:41, 22 February 2012 (UTC)
OK, sorry that didn't turn out so well. The OCR generated by Tesseract from that kind of scan is generally only useful for match and split, since the noise and old-fashioned font work against clean OCR. Google has much more powerful and well-tuned software for the job, but I don't know exactly what it is. Inductiveloadtalk/contribs 18:40, 22 February 2012 (UTC)

OCR bot

There is an automatic tool for OCRing a single page at a time, which is useful for repairing text on pages where it is missing or incomplete. It is available through the editing toolbar in the Page: namespace and is accessed by clicking the OCR button. The edit box goes grey while the server processes the image, and the OCR text appears in the edit box within a few seconds (larger pages with more text take longer). You can check the status at http://tools.wmflabs.org/phetools/ocr.php. The tool also OCRs the following page automatically whenever a page is retrieved, so the text for the next page should be ready by the time you edit it.

Requested uploads to Internet Archive

  • English Books from British India: a collection of 18 books from Savifa Virtual Library South Asia, University of Heidelberg. Solomon7968 (talk) 09:29, 4 February 2014 (UTC)
    • I can do the upload, but most of the work here is metadata. Please prepare 1) a list of URLs of the books to download, 2) a CSV table with title, creator, date, description, sponsor (digitising institution) etc. I can help you revise the table if you're unsure how to name the fields, but I don't have time for the data entry. --Nemo 09:46, 4 February 2014 (UTC)
    • @Solomon7968: This set is small, so I am fine working on it. I can do the metadata sorting and the upload process, but only during the next week. Is that timeframe OK for you? I can also give you the resulting CSV for further requests (i.e., to help you write new CSV files for new or bigger requests). Lugusto 19:03, 7 February 2014 (UTC)
      • Thanks for the offer! Many books in the collection are rare and not available elsewhere. I also sent an email to their contact person, Nicole Merkel, a month or so ago, but have had no response so far. Solomon7968 (talk) 03:31, 8 February 2014 (UTC)
        • The scans provided by SavifaDok for Transactions of the Agricultural and Horticultural Society of India v. II are of very low quality. Fortunately I found a better-quality scan on GBS, so I picked that one, did the OCR and DjVu processing locally on my machine (with ABBYY 11) and uploaded it to Commons as File:Transactions of the Agricultural and Horticultural Society of India - Vol 2.djvu. Please note that Transactions of the Agricultural and Horticultural Society of India isn't exactly a book but a journal; GBS also has more issues of this title. I will check and work on the 17 remaining files from SavifaDok tomorrow. Lugusto 19:46, 10 February 2014 (UTC)
          • @Solomon7968: So sorry for the delay. Apparently all scans from SavifaDok are of very low quality. Again I found a better-quality scan on GBS (although both versions have issues on exactly the same pages...), this time for A rapid sketch of the life of Raja Radhakanta Deva Bahadur with some notices of his ancestors, and testimonials of his character and learning. Done as with the previous book: File:A rapid sketch of the life of Raja Radhakanta Deva Bahadur.djvu. I will process more of the 16 remaining books as soon as possible. Lugusto 02:47, 15 February 2014 (UTC)
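
For anyone preparing the metadata table requested in the thread above, the following is a minimal sketch of writing such a CSV with Python's standard `csv` module. The column names are taken from the fields listed in the request (title, creator, date, description, sponsor); whether the uploader expects exactly these header names is an assumption, and the sample row is a placeholder, not real catalogue data.

```python
import csv
import io

# Column names follow the fields named in the request above; the exact
# header spelling the uploader expects is an assumption.
FIELDS = ["title", "creator", "date", "description", "sponsor"]


def build_metadata_csv(rows):
    """Return CSV text with one header line and one line per book.

    `rows` is a list of dicts keyed by FIELDS; missing keys are
    written as empty cells.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder row for illustration only.
    sample = [{
        "title": "Example title",
        "creator": "Example, Author",
        "date": "1850",
        "description": "Placeholder description",
        "sponsor": "Example Library",
    }]
    print(build_metadata_csv(sample))
```

Using `csv.DictWriter` rather than joining strings by hand means that values containing commas, such as "Example, Author", are quoted automatically.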

Google Books

PDF scans derived from Google Books contain a warning that needs to be stripped before the scan is added to IA to facilitate proofreading on Wikisource. This is normally done by the user/bot "tpb" (not affiliated with the Internet Archive). We would like a way to suggest books to tpb; in the meantime we can start collecting Google Books URLs here, and tpb may fetch them at some point.

Also see this Scriptorium thread opened by Yann. Solomon7968 (talk) 10:37, 4 February 2014 (UTC)

  • The work tpb started seems to have ended years ago, but I'm not sure. In the meantime the original GBS collection has grown considerably. Maybe we need a tool that does direct search on GBS + warning page removal + IA upload instead? Lugusto 19:03, 7 February 2014 (UTC)
    • Many editors here who have software to remove the warnings and watermarks replace the existing IA-derived file on Commons with a clean one. This is especially troublesome for large files. An automated system for uploading to IA would certainly help. Solomon7968 (talk) 03:31, 8 February 2014 (UTC)


See also