Help:Digitising texts and images for Wikisource
The material on Wikisource should ideally be proofread from a scan of the original, physical text: a book, magazine, newspaper, etc. The first step in this process is therefore scanning and digitising the text. If no scan is already available, the work will need to be scanned by a Wikisource volunteer. These instructions refer to a book being scanned, but apply equally to other print media.
- 1 First steps
- 2 Scanning
- 3 Processing
- 4 Images and illustrations
- 5 Uploading
- 6 Notes
- 7 See also
- 8 External links
This help page assumes that you have access to a complete copy of the original work and that you have checked its copyright status to ensure that it is lawful to scan and upload the work to Wikisource. If you have not already done so, please check this now; otherwise the end product of all your efforts may be barred from being hosted here for failing to comply with copyright law or with Wikisource policies and practices. This process is described in Help:Adding texts and Help:Adding images.
Scanning works can be done in one of several different ways, using different equipment.
The scanning of bound books can be difficult due to the binding. A book is an irregularly-shaped object and does not fit neatly into normal scanning devices. Care must also be taken not to damage books in the process of scanning them, unless destructive scanning is used.
The best means of scanning a book is a special scanner with a V-shaped cradle. This supports the book in a natural reading position, keeping the pages flat without damaging the book's spine. It is also very fast, as pages can simply be flipped as normal. Commercial book scanners of this type can be very expensive. Amateur and custom-made versions can be much cheaper, but need to be built from scratch.
A DIY version involves a cradle and one or two digital cameras to take the individual page scans. The cradle can be made of any material, from cardboard to wood to metal; it must hold the book open at a 90° angle, with each side of the book at 45° to the vertical. The camera must be pointed directly at the page and aligned properly; if not, the scan will appear skewed. Depending on the size of the book, the cradle may need to be adjustable to maintain the same angle with the camera as the scan proceeds (the thickness of the book will transfer from one side to the other, altering the centre position of the pages with respect to the cradle and the camera and causing gradual skewing of the output). A glass pane (either bought specially or adapted from a common picture frame) will be necessary to hold the pages flat during scanning. Lighting should be diffuse so that the pages are evenly lit. While the human eye can adjust to different levels of lighting, uneven lighting will be especially noticeable to computer software at the processing stage and will interfere with optical character recognition. Direct lighting may also cause glare on the glass pane holding the pages flat.
Flat-bed scanners are not as good as a v-cradle for scanning a book but they are the next best choice and different versions exist. These devices can also be expensive to purchase.
One version is a special flat-bed device where the scan goes right up to the very edge of the device, allowing one side of a book to be laid flat with the hinge side of the binding at the outside edge of the scanner.
The other version is an overhead scanner. The book is laid open beneath the scanner, which takes an image of either one page or both visible pages together. There will be some distortion in the scan towards the hinge of the book, as the pages bend inwards rather than lie uniformly flat.
Alternatively, instead of using a special flat-bed book scanner, books can be pressed against a standard flat-bed scanner. The same distortion described for the overhead scanner will also occur using this method. Pressing a book flat in this manner may also damage the book and its binding.
Flat-bed scanners have a limited scanning area due to the size of the machine. These devices are usually in A4 format and will take up to a quarto (approx 10in x 8in) book page size. Bigger pages than that need an A3 scanner. An alternative is to use a photocopier to reduce bigger pages to A4 format and scan the photocopies.
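The size limit above can be checked with a little arithmetic. The following is a minimal sketch, assuming the scanning area is exactly the ISO paper size (real scanner beds may differ slightly); the function name `smallest_bed_for` is hypothetical.

```python
# Scanner bed sizes in millimetres, assumed equal to the ISO 216 paper sizes.
BED_SIZES_MM = {"A4": (210, 297), "A3": (297, 420)}

MM_PER_INCH = 25.4


def smallest_bed_for(page_width_in, page_height_in):
    """Return the smallest bed ("A4" or "A3") that fits the page, or None."""
    # Sort so the comparison works whichever way round the page is described.
    short_mm, long_mm = sorted(
        (page_width_in * MM_PER_INCH, page_height_in * MM_PER_INCH)
    )
    for name in ("A4", "A3"):
        bed_short, bed_long = BED_SIZES_MM[name]
        if short_mm <= bed_short and long_mm <= bed_long:
            return name
    return None


print(smallest_bed_for(8, 10))   # quarto page: A4
print(smallest_bed_for(11, 15))  # larger page: A3
```

A result of `None` means the page is too large even for A3, in which case reducing it on a photocopier or photographing it may be the only options.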
Photocopiers and multi-functional devices
Some modern office equipment includes a scanning function and processing software. The limitations described for flat-bed scanners apply here as well.
Although not as reliably high quality as scanning, simply taking photos of documents is a perfectly viable means of digitising. It's generally quicker and easier, especially as a camera is often permitted or possible where a scanner would not be. For an example of a document prepared using direct hand-held photography, see Base Facilities Report.
NB: If using a tripod, monopod or other stand, the function of a v-cradle or overhead flat-bed book scanner can be replicated using a normal digital camera.
Destructive scanning is not recommended, but it should be mentioned for completeness. This method avoids the problems in scanning irregularly shaped books as described above. As the name implies, this destroys the physical book as part of the scanning.
Destructive scanning means taking the book apart. This may involve cutting the pages free from the binding or removing staples, stitching or other parts of the book. The result will be a stack of loose pages instead of a bound book. These pages can then be laid flat on a scanning device and can even make use of an automatic feeder.
This is faster and easier than any other form of scanning, but, again, it will destroy the book.
Once you have your scans, you will need to process them into a single file. A scanned text should be a single file in the DjVu (or PDF) container file format. Some scanners may be able to output your scans in one of these formats. Many, however, will produce a series of individual page scans, probably in JPEG or JPEG2000 format. These need to be converted to the container format.
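For those who prefer to do the conversion locally, one possible route is the DjVuLibre command-line tools. The sketch below only builds the command lines and does not run them. It assumes that DjVuLibre is installed, that the page scans are JPEG files (not JPEG2000), and that `c44` and `djvm -c` are used to encode and bundle the pages; the function name `djvu_commands` and the file names are hypothetical. The resulting file would have no OCR text layer.

```python
from pathlib import Path


def djvu_commands(page_files, output="book.djvu"):
    """Build DjVuLibre command lines for a list of page scans (not executed)."""
    # One intermediate single-page .djvu file per scan, next to the original.
    page_djvus = [str(Path(p).with_suffix(".djvu")) for p in page_files]
    # c44 encodes one image into one single-page DjVu file.
    encode = [["c44", str(p), d] for p, d in zip(page_files, page_djvus)]
    # djvm -c bundles the single-page files, in order, into one document.
    bundle = ["djvm", "-c", output, *page_djvus]
    return encode + [bundle]


for command in djvu_commands(["Name000.jpg", "Name001.jpg"], "Name.djvu"):
    print(" ".join(command))
```

Each command could then be passed to `subprocess.run(command, check=True)` once the output looks right.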
Before creating the single file, it is a good idea to make copies of, or otherwise set aside, any page scan with an illustration or other image. These will need to be extracted and uploaded separately so they can be added to the final work during proofreading. Images should be extracted from the raw, unprocessed scans whenever possible. Any processing may result in lower image quality, especially if certain image file formats get repeatedly saved and compressed. Additionally, the process of combining page scans into a single file involves some compression; PDF uses less compression than DjVu, but either will result in slightly inferior image quality. Images should therefore not be extracted from the single file unless no other option exists. The original images are likely to be the best quality available to you.
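Setting the illustrated pages aside can be as simple as copying them, untouched, into a separate folder. This is a minimal sketch using only the Python standard library; the function name `set_aside` and the folder name `illustrations` are hypothetical.

```python
import shutil
from pathlib import Path


def set_aside(scan_dir, page_names, dest_name="illustrations"):
    """Copy the named raw scans into a subfolder, leaving the originals as they are."""
    scan_dir = Path(scan_dir)
    dest = scan_dir / dest_name
    dest.mkdir(exist_ok=True)
    copied = []
    for name in page_names:
        # copy2 copies the file byte for byte (no recompression) and keeps timestamps.
        shutil.copy2(scan_dir / name, dest / name)
        copied.append(dest / name)
    return copied
```

For example, `set_aside("scans", ["Name012.jpg", "Name047.jpg"])` would leave unprocessed copies of those two pages in `scans/illustrations`.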
Before creating the single file you may wish to alter the page images. Depending on the scanning method and circumstances, some or all pages may be skewed. They may need to be rotated, cropped, deskewed or otherwise manipulated. If the scans combine two pages into one image file they will need to be split into separate files. The goal is for each page scan to be an accurate image of a single, flat page from the original work.
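When a scan combines two pages in one image, the split comes down to two crop rectangles. The sketch below computes them, assuming the gutter lies exactly at the horizontal centre of the image (which should be checked by eye for real scans). The boxes use the (left, upper, right, lower) pixel convention that image libraries such as Pillow accept for cropping; the function name `split_boxes` and its `gutter` parameter are hypothetical.

```python
def split_boxes(width, height, gutter=0):
    """Return (left_box, right_box) crop rectangles for a two-page scan.

    Each box is (left, upper, right, lower) in pixels. `gutter` is the number
    of pixels to trim on each side of the centre line, where the binding shadow is.
    """
    middle = width // 2
    left_box = (0, 0, middle - gutter, height)
    right_box = (middle + gutter, 0, width, height)
    return left_box, right_box


left, right = split_boxes(3000, 2000, gutter=20)
print(left)   # (0, 0, 1480, 2000)
print(right)  # (1520, 0, 3000, 2000)
```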
Individual scans may need to be renamed. Some processing requires that the pages are in the correct order when sorted alphabetically. Using a filename such as "Name000", where Name is an identifier for the work and 000 is an incrementing page number, is a common way of achieving this. Some scanning methods may produce two sets of scans, separating the scans of the left- and right-hand pages, which will need to be recombined. In this case, it is best to rename each set separately, using an increment of 2 for each, so that when they are copied into the same folder they are already in the correct order. (If one set was scanned from the back of the book to the front, it will need to start at the highest page number and increment by -2 in order to fit with the other set.) The program IrfanView can perform batch renaming.
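The renaming scheme described above can be sketched in a few lines. This example only generates the new names, in the order the files were scanned; it assumes three-digit zero padding, a `.jpg` extension, and that the left-hand set holds the even numbers. The function name `interleaved_names` is hypothetical.

```python
def interleaved_names(prefix, count, start, step=2, digits=3, ext=".jpg"):
    """Return `count` filenames numbered start, start + step, ... with zero padding."""
    return [f"{prefix}{start + i * step:0{digits}d}{ext}" for i in range(count)]


# Left-hand pages, scanned front to back: 0, 2, 4.
left = interleaved_names("Name", 3, start=0)
# Right-hand pages, scanned back to front: the first file scanned is the last page.
right = interleaved_names("Name", 3, start=5, step=-2)

print(left)   # ['Name000.jpg', 'Name002.jpg', 'Name004.jpg']
print(right)  # ['Name005.jpg', 'Name003.jpg', 'Name001.jpg']
print(sorted(left + right))
```

Sorting the combined list alphabetically gives Name000 to Name005 in page order, which is the property the processing tools rely on. Pairing each generated name with the corresponding original file (in scanning order) gives the rename plan.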
Some people choose to desaturate page images, creating a black and white image instead of colour. This is not recommended. It will reduce the final file size, but this is no longer an important consideration with modern technology. Colour page scans contain more information than monochrome versions; for example, brown stains and other discolouration over black text may leave the text legible in colour but completely obscured in black and white. In addition, stark white pages can cause eye strain for some users during proofreading.
The easiest method of processing scans is to upload them to the Internet Archive, which will perform this operation for you. To do so, create a zip file containing the single-page scans and upload it to the Internet Archive. After some time, a DjVu file with an OCR text layer will be generated.
See Help:DjVu files#The Internet Archive for details.
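Creating the zip file of single-page scans can be done with any archiving tool. As an illustration, here is a minimal sketch using Python's standard `zipfile` module; it assumes the scans are `.jpg` files in one folder and already sort into page order by name. The function name `zip_page_scans` is hypothetical, and any naming requirements for the archive should be taken from the Internet Archive's own instructions.

```python
import zipfile
from pathlib import Path


def zip_page_scans(scan_dir, zip_path, pattern="*.jpg"):
    """Write all matching page scans into one zip file, in sorted name order."""
    pages = sorted(Path(scan_dir).glob(pattern))
    # ZIP_STORED adds the files without recompressing them; JPEG scans are
    # already compressed, so deflating them again would gain very little.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for page in pages:
            # arcname keeps only the file name, so the archive has no folder prefix.
            archive.write(page, arcname=page.name)
    return [page.name for page in pages]
```

Calling `zip_page_scans("scans", "book_scans.zip")` returns the list of names that were added, which is a quick way to confirm that no page is missing or out of order before uploading.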
Images and illustrations
All illustrations and other images from your work need separate image files. They cannot be transferred directly from the scan file to the finished proofread transcription.
You should have saved your original page scans or set aside those with illustrations during the pre-processing stage. The images need to be extracted from these into a form usable by Wikisource and any re-users of our works. This will at least involve cropping the pages, but may require more processing (including but not limited to rotation, de-skewing, colour and level adjustment, addition of an alpha channel (transparency) and more). The free image processing software GIMP is useful for this.
When saving, please choose the most appropriate format for each individual image. The JPEG file format is best for photographs and detailed colour illustrations. The PNG file format is best for diagrams or simple monochrome illustrations.
Once you have created your file and extracted any illustrations, they should be uploaded to Wikimedia Commons.
If you have used a website to process the scan, and it has a stable URL, you can transfer it directly to Commons using the URL2Commons tool (see Help:URL2Commons). Otherwise you will need to download the file to your computer first and upload it the normal way. In the latter case, or if you created the file yourself, you should follow the normal uploading instructions at Wikimedia Commons.
If there is more than one file involved (for example, if there are illustrations) it can be useful to create a special category for the files. This should hold the scan and all related files in one place. This will be useful for you, or anyone else, when finding a specific file and for administrative purposes such as movement, renaming and recategorisation.
In a few cases, Wikimedia Commons will not accept some files. This is due to additional policies at Commons on top of the minimum legal requirement (works still under copyright in their home country but in the public domain in the United States are not allowed on Commons). If this is the case, the files can be uploaded directly to Wikisource. All other advice still applies.
- Scans can often be found on sites such as the Internet Archive or Google Books.
- The Plustek OpticBook 3600 is an example of a special flat-bed scanner for book scanning.
- Help:DjVu files
- Help:Beginner's guide to Index: files
- Scanning guide on Commons
- mul:Wikisource:Google OCR — a Wikisource toolbar gadget, and standalone tool, for OCRing single images with the Google Cloud Vision service.
- IrfanView, a graphic viewer with batch-processing features.
- GIMP (GNU Image Manipulation Program), image editing software.
- Scan Tailor, a useful post-processing tool for scanned pages which uses unpaper (example page). A basic unpaper GUI exists as well.
- PDF creation guides: from JPG (with pdfbeads) or from TIFF (with tiff2pdf, tesseract etc.) or from any image.
- DIY Book Scanner
- The $20 DIY Book Scanner, wired.com
- MobileRead Forum — General Discussion, often has information about all parts of this process
- Ubuntu OCR tools
- (2006) Google's Tesseract OCR engine is a quantum leap forward