Help:Image extraction

From Wikisource
Shortcut: H:EXTRACT
This page gives some guidance on extracting images from documents and upstream scan sources such as the Internet Archive, Google Books, etc.

Images in books can be difficult to extract from the source documents for various reasons:

  • The scan may be very noisy or have very strong foxing or page discoloration
  • Text from the other side may bleed through the page
  • The source document may be damaged, either physically or through heavy image compression
  • The scanning process may result in blurred or obscured images

Careful choice of image source and careful extraction can ameliorate some of these issues.

Image sources

Comparison of DjVu-derived (left and top) and "Read online" (right and bottom) images from the Internet Archive. The huge loss in quality is evident as blurring, blocking and loss of most detail.

It is tempting to take images you need directly out of PDF or DjVu files, but some PDF and DjVu documents are created with lossy compression that is optimised for text, not images, so it is not always the right way to go.

If the quality is not satisfactory and there is no other source, then extract it from the PDF/DjVu and tag the file with {{bad extraction}} at Commons. Otherwise, please use a better source, such as JPG/PNG/TIFF page scans.

It is always preferred to work from the largest and best quality source images. Screenshots, whether from a DjVu, PDF or webpage, are a method of last resort. Unless the image is displayed at exactly the original resolution, any rescaling process, up or down, results in image degradation.
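The cost of rescaling can be seen even on a single row of pixels. The sketch below is an illustration only, not a description of what any particular viewer does: it assumes simple 2:1 averaging for downscaling and pixel duplication for upscaling, and the function names are made up for this example.

```python
def downscale_2x(row):
    """Halve a row of grey values by averaging each pair of neighbours."""
    return [(row[i] + row[i + 1]) // 2 for i in range(0, len(row) - 1, 2)]


def upscale_2x(row):
    """Double a row of grey values by repeating each pixel."""
    return [value for value in row for _ in range(2)]


# Fine engraving lines: alternating black (0) and white (255) pixels.
original = [0, 255, 0, 255, 0, 255, 0, 255]

# Scale down, then back up to the original width.
round_trip = upscale_2x(downscale_2x(original))

print(original)    # the line pattern
print(round_trip)  # uniform grey: the lines are gone
```

Real resampling filters are more sophisticated, but the detail lost when scaling down cannot be recovered by scaling back up.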

Usually, the book file at Commons or Wikisource will provide a link to the original scan source (for example, the Internet Archive). Sometimes the same work has a better scan elsewhere: see Wikisource:Sources for more information about possible alternative scan sources. Many sources provide access to individual page images (in formats such as JPG or JP2), as well as complete documents (in PDF or DjVu formats).

Extracting images from PDF files

Manually

Some PDF reader software lets you save the embedded page images by right-clicking them. Take care to save only in a lossless format such as PNG, BMP or TIFF. JPG compression is not suitable and will introduce degradation:

Saving a page image from a PDF in a lossless format (TIFF, left) does not introduce new compression damage. Extracting in a lossy format (JPG, right) does introduce compression damage (the speckled noise around the letters was not present in the PDF).

pdfimages

Another software tool to extract images is pdfimages:

$ pdfimages -j -png input.pdf 'prefix'

This will produce PNG files, with JPEGs for images that were originally JPEG in the PDF, all beginning with the prefix prefix-.
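If you only need images from a few pages, the same call can be limited to a page range. The sketch below wraps the command above in Python and adds the -f (first page) and -l (last page) options of pdfimages. The function names are made up for this example, and it assumes pdfimages is installed and on your PATH.

```python
import subprocess


def pdfimages_command(pdf_path, prefix, first_page=None, last_page=None):
    """Build the argument list for the pdfimages call shown above."""
    command = ["pdfimages", "-j", "-png"]
    if first_page is not None:
        command += ["-f", str(first_page)]
    if last_page is not None:
        command += ["-l", str(last_page)]
    return command + [pdf_path, prefix]


def extract_images(pdf_path, prefix, first_page=None, last_page=None):
    """Run pdfimages; raises CalledProcessError if the tool reports failure."""
    subprocess.run(
        pdfimages_command(pdf_path, prefix, first_page, last_page),
        check=True,
    )


# Show the command that would be run to extract images from page 12 only.
print(pdfimages_command("input.pdf", "prefix", first_page=12, last_page=12))
```

Calling extract_images("input.pdf", "prefix", 12, 12) would then write only the images found on page 12.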

Scans from the Internet Archive

Many works from the Internet Archive have high-quality images available.

From JP2 page images

This is the preferred way to obtain page images when possible.

Some works are scanned by the Internet Archive themselves. In this case, you can access high-quality photographs of the pages:

  • Go to the Internet Archive details page for the work (e.g. https://archive.org/download/naturalhistorymo00goss)
  • Go to the file list ("Show all" under the list on the right), then find the "_orig_jp2.tar" archive.
  • You can either download this entire archive file (large file size, but useful if the work has a lot of pages with images), or click "View online" and select the individual pages that you want.
    • For example, this is the original JP2 file for the example on the left: 0018.jp2.
  • Download these full-sized images and use these as the basis for image extraction.
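If you have downloaded the whole "_orig_jp2.tar" archive, you do not need to unpack every page. The following is a minimal sketch using Python's standard tarfile module. The function name is made up for this example, and matching pages by the end of the file name (such as "0018.jp2") is an assumption about how the files inside the archive are named, so check the file list first.

```python
import tarfile


def extract_pages(tar_path, page_suffixes, dest_dir):
    """Extract only the files whose names end with one of the given suffixes.

    Returns the names of the archive members that were extracted.
    """
    wanted = tuple(page_suffixes)
    extracted = []
    with tarfile.open(tar_path) as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.endswith(wanted):
                archive.extract(member, path=dest_dir)
                extracted.append(member.name)
    return extracted
```

For example, extract_pages("naturalhistorymo00goss_orig_jp2.tar", ["0018.jp2"], "pages") would unpack just that page into a "pages" folder, assuming the archive has been saved under that name in the current folder.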

From PDFs

Comparison of image from DjVu (left) and PDF (right) from the same Google Book scan, via the Internet Archive. Notice that the version extracted from the DjVu is very badly damaged by the compression.

Other works are imported from scans done by other companies, often Google Books. These scans do not have the original page photographs and so the Internet Archive generates the DjVu file from the PDF, which creates additional compression damage due to the text-centric method the Internet Archive uses to create DjVu files. The example on the right is a good-quality Google scan, and you can see the damage done by subsequent DjVu conversion.

If the DjVu was made from a Google Books scan, the Internet Archive extracts and archives the page images, which are usually bi-tonal TIFF files. These can be accessed online:

  • Go to the Internet Archive details page for the work (e.g. https://archive.org/details/mammalsminnesot00herrgoog)
  • Go to the file list ("Show all" under the list on the right), then find the "_tif.zip" archive.
  • As above, you can download the whole archive or just what you need on a page-by-page basis.

You can also save the images directly from the PDF if your software supports this (the TIF files are the same). Do not save the file in a lossy format (see above); use PNG or TIF.

Many Google Books scans are very low quality bi-tonal images, and scans from other sources are preferred for image extraction (look for colour preview images at the Internet Archive and suffixes such as "uoft", "rich" and "ala"). Digital Library of India (DLI) scans, often found at the Internet Archive, are also generally low-quality for the purposes of image extraction.

For example, the image on the right is now also available in a scan that the Internet Archive did themselves, which contains the original, colour page photographs: p12 of https://archive.org/details/mammalsofminneso00herrrich.

Hathi Trust

Scans at the Hathi Trust are presented as page images (unless you have institutional access, you cannot download complete PDFs).

  • Use the following URL scheme:
  • Change {ID} to match the book's ID
  • Change {N} to the page number you want. This is shown in the Hathi web viewer in the bottom right corner as #N / TOTAL.
  • Ensure size is a very large number (say, 10000). This is not a pixel size. No matter how large this number is, Hathi will not give an image larger than the original scan image. If it's set to 0 or omitted you will not get a full-size image.
    • In this case size=10000 provides the full image of 3,490px × 5,370px.
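When you need several pages from the same book, the substitutions described in the list can be scripted. The sketch below only shows the placeholder filling. The template in it is a made-up stand-in on example.org, not the real Hathi URL scheme, and the {SIZE} placeholder and the function name are likewise assumptions for this example.

```python
def fill_page_url(template, book_id, page, size=10000):
    """Substitute the {ID}, {N} and {SIZE} placeholders in a URL template."""
    return (
        template.replace("{ID}", book_id)
        .replace("{N}", str(page))
        .replace("{SIZE}", str(size))
    )


# Placeholder template for illustration only; this is NOT the real Hathi URL.
EXAMPLE_TEMPLATE = "https://example.org/image?id={ID}&seq={N}&size={SIZE}"

# Build the URLs for pages 12 to 14 of a hypothetical book ID.
urls = [fill_page_url(EXAMPLE_TEMPLATE, "book.123", n) for n in range(12, 15)]
for url in urls:
    print(url)
```

The default of size=10000 follows the advice above: a number larger than any scan, so that the full-size image is returned.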

You can also download the page-by-page PDFs and extract the image that way (again, take care to only save losslessly). These images are identical to those obtained as above.

There is a "Data API", which can be used too, but this requires a free account to acquire the API keys.

Google Books

The Google Books web viewer doesn't make it easy to access page images directly. Some books can be downloaded directly from Google Books as PDF files (hover over the "Ebook - Free" button and then choose "Download: PDF"), and then you can extract images from there. Do not screenshot from the webpage: the original images are usually about 3000 × 5000px and are scaled down and re-compressed when shown in the web viewer.

If the Internet Archive has imported this scan, the PDF is the same (in which case, the Internet Archive allows you to access the page images directly, see above).

Google Books scans are often missing images, embed images as almost-useless thumbnail-quality images, have torn or folded pages, or can include the hands or fingers of the scanner operator blocking the page. Generally, a Google Books scan is the worst-quality scan available, so find alternatives if possible.

Below is the same image from a Google Books scan and from an Internet Archive scan (no further processing on either image has been done):

Sometimes, the same scan is available at Hathi with vastly higher quality. It seems Google's public PDFs are heavily compressed, but other users like Hathi provide them at higher qualities. For example, the following two files are scans of the same physical book from the New York Public Library. As you can see, the whole image has been totally removed by the compression technique at Google Books.

British Library Access

Items available at British Library Access have both full-page JPG files available and whole-book PDFs. As usual, the JPG files are best to work with for the purposes of image extraction.

Not all works are scanned and available online. If the work has a "Digital Content: Go" button under the "I want this" tab in the catalogue, the scan is available. Click the "Go" button to open the viewer.

The files are downloaded using the "Download" button in the lower left. Navigate to the page you wish to extract from and click the button. If you are given a choice of different image sizes, choose the largest size available.

For example, for Hudibras (1780 edition):

Image processing

Extraction

Extracting images from the original scan is quite a tricky task, and the process varies with the image and the source. One of the most common tasks is extracting a black-and-white image from a page with yellow-brown paper.

See Help:Image extraction/With GIMP for one method of doing that using GIMP.

The original image
The finished image
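One common way to turn yellow-brown paper into clean white is to desaturate the image and then apply a levels adjustment. The sketch below shows the arithmetic on two sample pixels. It is not necessarily the method described on the GIMP subpage, and the pixel colours, the black and white points, and the function names are all made up for this example.

```python
def desaturate(rgb):
    """Convert an (R, G, B) pixel to grey using the HSV value (largest channel)."""
    return max(rgb)


def apply_levels(grey, black_point, white_point):
    """Stretch black_point..white_point to 0..255, clamping values outside."""
    scaled = (grey - black_point) * 255 / (white_point - black_point)
    return max(0, min(255, round(scaled)))


# Hypothetical sample pixels from a scan of aged paper.
paper = (222, 201, 160)  # yellow-brown page background
ink = (52, 44, 38)       # dark brown-black engraved line

for name, pixel in (("paper", paper), ("ink", ink)):
    grey = desaturate(pixel)
    print(name, grey, apply_levels(grey, black_point=60, white_point=210))
```

With these sample values the paper is pushed to pure white (255) and the ink to pure black (0). On a real scan the two points have to be chosen by eye so that fine lines are not lost.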

Moiré

Many illustrations are engraved, often using closely spaced lines. This can produce moiré effects when scanned, which can be tricky to remove without damaging the image.

Sometimes, it's possible to reduce the moiré at the desaturation step. Coloured moiré is often "invisible" in the HSV value channel, so desaturating by that channel, as opposed to luminance or luma, can help. In the following example, the orange and blue bands (e.g. in the forehead area) are visible as light and dark bands after desaturation by luminance, but are mostly removed by desaturation by HSV value:

To do this in GIMP, choose Colours > Desaturate > Desaturate... and select the Mode from the drop-down.
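The reason the choice of mode matters can be shown with two pixels. The sketch below compares the HSV value (the largest of the three channels) with a weighted luma. The Rec. 601 weights used here are an assumption and may differ from what GIMP's modes use, and the two band colours are made up for this example.

```python
def hsv_value(rgb):
    """HSV value: the largest of the red, green and blue channels."""
    return max(rgb)


def luma(rgb):
    """Weighted sum of the channels, using Rec. 601 weights (an assumption)."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


# Hypothetical moiré bands on what should be plain, light paper.
orange_band = (255, 200, 150)
blue_band = (150, 200, 255)

print("value:", hsv_value(orange_band), hsv_value(blue_band))  # identical
print("luma: ", luma(orange_band), luma(blue_band))            # different
```

Both bands have the same HSV value, so they merge into one flat tone, while their luma differs and the banding survives as light and dark stripes.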

File types

When uploading the images to Commons, there are two main choices for the file type: JPG or PNG. JPG is a lossy compression format designed for photos. PNG is a lossless format that's generally better suited to diagrams.

If the image is already compressed with a lossy format such as JPG or JP2, and you have not processed it to remove compression noise, use JPG. Using a lossless format here will result in large files because the PNG format will expend a lot of bytes on slavishly encoding the pseudo-random compression noise of the original image.
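The effect of noise on lossless compression can be demonstrated without any image software. PNG compresses its data with DEFLATE, which Python exposes as the zlib module. The sketch below compresses one flat row of pixels and the same row with slight random noise added. It ignores PNG's filtering step, and the noise is random rather than real JPG noise, so treat it as a rough illustration only.

```python
import random
import zlib

random.seed(0)  # fixed seed so the result is repeatable
WIDTH = 10_000

# A flat light-grey area, as in a cleaned-up diagram.
clean = bytes([200] * WIDTH)

# The same area with small pseudo-random variations of up to 3 levels.
noisy = bytes(200 + random.randint(-3, 3) for _ in range(WIDTH))

clean_size = len(zlib.compress(clean, 9))
noisy_size = len(zlib.compress(noisy, 9))

print("clean row:", clean_size, "bytes")
print("noisy row:", noisy_size, "bytes")
```

The noisy row is barely distinguishable by eye, yet it compresses far worse because every small variation has to be stored exactly.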

If the image has a limited colour palette and/or contains very rapid changes from dark to light, use PNG. This includes most diagrams and engravings. This is because the sudden colour changes represent "high-frequency" image data, which causes image artifacts in lossy schemes.

For example, the red pixels below show where JPG compression of a PNG diagram introduces new noise:

April 01-40N-2100-Fieldbook of Stars-025 - JPEG noise.png

Also note that even if the original file is large at full-size, when it is resized to a few hundred pixels for use on a Wikisource page, it will likely be much smaller.

Image channels

Some formats allow you to specify how many channels an image has. For example, a greyscale image has only one channel (the lightness of each pixel) and a colour image has three (red, green and blue). If the image has an alpha channel (transparency), that adds one more.

When saving as PNG, if your file is a greyscale image, saving as a greyscale PNG (GREY or GREYA, depending on whether you are using an alpha channel) rather than as RGB or RGBA can substantially reduce the file size. Sometimes (not always), depending on the file, a greyscale PNG can be about the same size as a JPG, because the reduced number of channels outweighs JPG's compression savings.
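The saving comes from the amount of raw data that has to be compressed in the first place. The sketch below works out the uncompressed size for each channel layout, assuming 8 bits per channel and using the 3,490 × 5,370 pixel page from the Hathi example as the image size. The function name is made up for this example.

```python
CHANNELS = {"GREY": 1, "GREYA": 2, "RGB": 3, "RGBA": 4}


def raw_size_bytes(width, height, mode, bits_per_channel=8):
    """Uncompressed size of the pixel data for the given channel layout."""
    return width * height * CHANNELS[mode] * bits_per_channel // 8


for mode in CHANNELS:
    print(mode, raw_size_bytes(3490, 5370, mode))
# GREY 18741300
# GREYA 37482600
# RGB 56223900
# RGBA 74965200
```

The final file size depends on how well the data compresses, but a greyscale file starts with a third of the data of an RGB one.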

JPG offers less choice: an image is stored either as greyscale or as full colour, and there is no alpha channel.