User:HesperianBot

From Wikisource

This account is Hesperian's bot account. At present it is uploading raw page scan images for pages tagged with {{missing image}} or {{raw image}}. Community approval for it to undertake this task can be found at [1].

How it works

Many of our works are illustrated. Compared to proofreading of text, providing good images is a relatively onerous and time-consuming task. This bot is designed to make it a bit easier by doing a lot of the work.

Most of our scans come from the Internet Archive, and for many of those scans the Internet Archive makes available a zip file of high resolution jp2 page scans. This bot undertakes to locate, download and upload the Internet Archive page scans for DjVu file pages that have illustrations.

1. Tagging missing images

When a proofreader encounters an image, rather than providing that image immediately, they have the option of flagging the image as missing and moving on. They may either return later to provide the image, or leave the image to be provided by another member of the Wikisource community.

We have two templates for tagging missing images. If the full page scan is a suitable placeholder for the missing image (that is, the image takes up the full page and is correctly oriented), then you can use {{raw image}} to display the full page scan. The idea is that it is often better to display a rough-and-ready image than to display an obtrusive message box.

If the full page scan isn't a suitable placeholder for the missing image, then use {{missing image}} to display a message box.

Don't use these templates to flag content that can be coded, such as missing tables, musical scores or chess diagrams; use {{missing table}}, {{missing score}} and {{missing chess diagram}} respectively.

Don't use these templates to flag an image that is missing from the scan itself; use {{bad image scan}} instead.
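The tagging rules above can be summed up as a small decision function. This is only an illustrative sketch: the function name, its parameters and the order of the checks are assumptions made for the example, not part of any Wikisource tool. Only the template names come from this page.

```python
# Hypothetical helper: picks a flagging template from the rules on this page.
# The parameter names and the order of the checks are assumptions.
CODED_CONTENT = {
    "table": "missing table",
    "score": "missing score",
    "chess diagram": "missing chess diagram",
}


def choose_template(kind="image", in_scan=True, fills_page=False, upright=False):
    """Return the name of the template to place on the page."""
    if kind in CODED_CONTENT:
        # Content that can be coded has its own template.
        return CODED_CONTENT[kind]
    if not in_scan:
        # The image is absent from the scan itself.
        return "bad image scan"
    if fills_page and upright:
        # The full page scan is a suitable placeholder.
        return "raw image"
    return "missing image"
```

For example, a correctly oriented full-page plate gives "raw image", while a small figure set within page text gives "missing image".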

2. The bot's bit

From time to time, HesperianBot will go through all the DjVu file pages that are tagged as missing an image. It will check the File: page for the source of the file, and for those File: pages that provide a URL link to an Internet Archive document, it will go over to the Internet Archive and look for a zip file of jp2 page scans. If it finds one, it will download it and then attempt to identify the specific pages that need to be extracted. If successful, it will extract them and convert them to PNG format. Yes, that's a lot of ifs; it is surprising how often it works.
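Two of those "ifs" can be sketched in a few lines of Python. This is not HesperianBot's actual code. It assumes that the File: page links to the item as archive.org/details/IDENTIFIER, and that the jp2 members of the zip file have names ending in an underscore and a zero-padded scan index (for example example_0004.jp2). The function names and the offset parameter are hypothetical. Downloading the zip file and converting jp2 to PNG are left out, since the conversion needs an imaging library.

```python
import re
import zipfile

# Assumption: the source link has the form archive.org/details/<identifier>.
_IA_DETAILS = re.compile(r"https?://(?:www\.)?archive\.org/details/([^/?#\s]+)")

# Assumption: jp2 member names end in _NNNN.jp2, NNNN being a scan index.
_JP2_INDEX = re.compile(r"_(\d+)\.jp2$")


def ia_identifier(url):
    """Return the Internet Archive item identifier found in url, or None."""
    match = _IA_DETAILS.search(url)
    return match.group(1) if match else None


def find_page_members(zip_source, pages, offset=0):
    """Map each wanted DjVu page number to the name of a jp2 zip member.

    zip_source is a path or a file-like object holding the zip file.
    offset is added to a page number to get the scan index; it is a
    hypothetical parameter, because the two numberings need not agree.
    Pages with no matching member are left out of the result.
    """
    with zipfile.ZipFile(zip_source) as archive:
        names = archive.namelist()
    by_index = {}
    for name in names:
        match = _JP2_INDEX.search(name)
        if match:
            by_index[int(match.group(1))] = name
    return {page: by_index[page + offset]
            for page in pages if page + offset in by_index}
```

A page that cannot be matched is dropped from the result rather than raising an error, which mirrors the "if successful" nature of the process described above.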

3. Hesperian's bit

At this point, Hesperian casts an eye over the images, and if there are trivial things that can be done to improve them, such as 90 degree rotations, or cropping an image out of page text, he may (and usually does) perform them.

4. Upload and display

Another script is then run to upload the images to Wikisource. They are uploaded to a file name in the format "DJVUFILENAME-PAGE.png". The missing image flagging templates recognise files that use that naming convention and preferentially display the hi-res raw page scan if available, rather than the DjVu page scan. The template text also changes to advise people that a hi-res raw page scan is available.
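The naming convention is easy to build and to recognise in code. The sketch below assumes that DJVUFILENAME includes the .djvu extension and that PAGE is a plain decimal page number; both function names are hypothetical.

```python
import re

# Assumption: DJVUFILENAME keeps its .djvu extension, PAGE is a decimal number.
_RAW_SCAN = re.compile(r"^(?P<djvu>.+\.djvu)-(?P<page>\d+)\.png$")


def raw_scan_filename(djvu_filename, page):
    """Build the upload name for the raw scan of one page of a DjVu file."""
    return f"{djvu_filename}-{page}.png"


def parse_raw_scan_filename(filename):
    """Return (djvu_filename, page) if filename follows the convention, else None."""
    match = _RAW_SCAN.match(filename)
    if match is None:
        return None
    return match.group("djvu"), int(match.group("page"))
```

Under these assumptions, page 45 of "Example Book.djvu" would be uploaded as "Example Book.djvu-45.png", and a restored image named some other way would not be mistaken for a raw scan.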

As of October 2013, there are about 8000 raw page scans available online for restoration.

5. Your bit

HesperianBot's task is to get these images out of jp2 format in a zip file on the Internet Archive, and to put them at your fingertips, so you can easily download them and restore them.

It's your job to download the raw images, restore them, and re-upload them.

Things you might do to restore an image:

  1. Crop
  2. Rotate it so that verticals are really vertical and horizontals are really horizontal
  3. Eliminate yellow paper background
  4. Colour-balance

If you perform an incremental improvement that leaves the image needing more work, then feel free to upload your improved image over the top of HesperianBot's image.

If you fully restore an image, then upload it to Wikimedia Commons, using a naming format different from the one that HesperianBot uses. When you clobber HesperianBot's namespace by hijacking its file naming convention, you make Hesperian cry.

Don't claim copyright in your image restoration efforts: use the same licence as that of the DjVu file.

Once you've uploaded your restored image to Commons, insert the image into the Wikisource page, and remove the missing image template.

6. Cleanup

From time to time, HesperianBot will generate a list of raw page scans that are no longer needed, and Hesperian will go through and delete them all.
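This page does not say how that list is worked out. One plausible approach, shown here purely as an assumption, is to treat a raw page scan as no longer needed once its page has stopped carrying a missing image template. The function name and the data shapes are hypothetical.

```python
def obsolete_raw_scans(uploaded, still_tagged):
    """List the raw scan files whose pages are no longer tagged.

    uploaded and still_tagged are iterables of (djvu_filename, page) pairs:
    the pages that have a raw scan, and the pages that still carry a
    missing image template. Assumption: an untagged page has been restored.
    """
    leftover = set(uploaded) - set(still_tagged)
    # File names follow the "DJVUFILENAME-PAGE.png" convention.
    return [f"{djvu}-{page}.png" for djvu, page in sorted(leftover)]
```

The result is a sorted list of file names that a person can then review before deleting anything.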

Also, if incremental improvements render a raw page scan suitable for use as an image placeholder, Hesperian may from time to time go through and "upgrade" some image templates from {{missing image}} to {{raw image}}... but probably not very often.

As of December 2013, HesperianBot has over 1100 deleted contributions, suggesting that about this many raw page scan images have been restored.