Help:Image extraction

Shortcut:
H:EXTRACT

This page gives some guidance on extracting images from documents and upstream scan sources such as the Internet Archive, Google Books, etc.

Images in books can be difficult to extract from the source documents for various reasons:

  • The scan may be very noisy or have very strong foxing or page discoloration
  • Text from the other side may bleed through the page
  • The source document may be damaged, either physically or through heavy image compression
  • The scanning process may result in blurred or obscured images

Careful choice of image source and careful extraction can ameliorate some of these issues.

Image sources

Comparison of DjVu-derived (left and top) and "Read online" (right and bottom) images from the Internet Archive. The huge loss in quality is evident as blurring, blocking and loss of most detail.

It is tempting to take images you need directly out of PDF or DjVu files, but some PDF and DjVu documents are created with lossy compression that is optimised for text, not images, so it is not always the right way to go.

If the quality is not satisfactory and there is no other source, then extract it from the PDF/DjVu and tag the file with {{bad extraction}} at Commons. Otherwise, please use a better source, such as JPG/PNG/TIFF page scans.

It is always preferred to work from the largest and best quality source images. Screenshots, whether from a DjVu, PDF or webpage, are a method of last resort. Unless the image is displayed at exactly the original resolution, any rescaling process, up or down, results in image degradation.
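
The cost of rescaling can be seen in a toy one-dimensional sketch. It is an illustration only: real viewers use more sophisticated resampling filters, and the function names are hypothetical. A row of alternating black and white pixels, much like fine engraving lines, is halved by averaging pixel pairs and then doubled again by repeating pixels:

```python
def downscale_by_two(row):
    """Halve a row of grey values by averaging each pair of neighbours."""
    return [(row[i] + row[i + 1]) // 2 for i in range(0, len(row) - 1, 2)]


def upscale_by_two(row):
    """Double a row of grey values by repeating every pixel."""
    return [value for value in row for _ in range(2)]


# Fine detail: alternating black (0) and white (255) pixels.
original = [0, 255] * 4
round_trip = upscale_by_two(downscale_by_two(original))

print(original)
print(round_trip)
```

The round trip returns a row of the same length, but every pixel is now the same mid-grey: the line detail cannot be recovered.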

Usually, the book file at Commons or Wikisource will provide a link to the original scan source (for example, the Internet Archive). Sometimes the same work has a better scan elsewhere: see Wikisource:Sources for more information about possible alternative scan sources. Many sources provide access to individual page images (in formats such as JPG or JP2), as well as complete documents (in PDF or DjVu formats).

Extracting images from PDF files

Manually

You can save the embedded page images from some PDF reader software by right clicking the page images. Take care to only save this as a lossless format such as PNG, BMP or TIFF. JPG compression is not suitable and will introduce degradation:

Saving a page image from a PDF as a lossless format (TIFF, left) does not introduce new compression damage. Extracting as a lossy format (JPG, right), does introduce compression damage (the speckled noise around the letters was not present in the PDF).

pdfimages

Another software tool to extract images is pdfimages. Usage is as follows:

$ pdfimages -j -png input.pdf 'prefix'

This will produce PNG files, with JPEGs for images that were originally JPEG in the PDF, with the filenames prefixed by prefix-.
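
If you need to run this repeatedly, the command can be wrapped in a small script. The following is a minimal sketch, assuming the Poppler version of pdfimages is installed and that its -f and -l options (first and last page to scan) are available; the function names are hypothetical.

```python
import subprocess


def build_pdfimages_command(pdf_path, prefix, first_page=None, last_page=None):
    """Build the pdfimages argument list used on this page (-j -png)."""
    command = ["pdfimages", "-j", "-png"]
    # Assumption: -f and -l limit the page range, as in Poppler's pdfimages.
    if first_page is not None:
        command += ["-f", str(first_page)]
    if last_page is not None:
        command += ["-l", str(last_page)]
    return command + [pdf_path, prefix]


def extract_images(pdf_path, prefix, first_page=None, last_page=None):
    """Run pdfimages; raises CalledProcessError if the tool reports failure."""
    command = build_pdfimages_command(pdf_path, prefix, first_page, last_page)
    subprocess.run(command, check=True)


print(build_pdfimages_command("input.pdf", "prefix", first_page=12, last_page=12))
```

Restricting the page range is useful for large books where only one or two pages carry illustrations.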

High-res image service

List of images including high-res source images for a Page namespace scan

The image service used by the high-resolution image loader also provides direct links to high-resolution images for several image sources, where they are available. This makes it easy to get the high-resolution image for a given Page-namespace page if the file source supports it:

  • Internet Archive
  • Hathi Trust
  • Several other sources

However, you can also manually download images in most cases. The method for this will depend on the source.

Scans from the Internet Archive

Many works from the Internet Archive have high-quality images available.

From JP2 page images

This is the preferred way to obtain page images when possible.

Some works are scanned by the Internet Archive themselves. In this case, you can access high-quality photographs of the pages:

  • Go to the Internet Archive details page for the work (e.g. https://archive.org/download/naturalhistorymo00goss)
  • Go to the file list ("Show all" under the list on the right), then find the "_orig_jp2.tar" archive.
  • You can either download this entire archive file (large file size, but useful if the work has a lot of pages with images), or click "View online" and select the individual pages that you want.
    • For example, this is the original JP2 file for the example on the left: 0018.jp2.
  • Download these full-sized images and use these as the basis for image extraction.
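
If you downloaded the whole archive, you may only want a handful of pages from it. The following is a minimal sketch using Python's standard tarfile module. It assumes that the page files inside the archive have names ending in an underscore, a zero-padded page number and ".jp2" (as in "0018.jp2" above); the function name and the local file name in the usage note are hypothetical.

```python
import re
import tarfile
from pathlib import Path


def extract_pages(archive_path, page_numbers, out_dir):
    """Copy only the wanted JP2 page files out of a downloaded tar archive.

    Assumes member names end in "_<page number>.jp2"; adjust the pattern
    if the files in your archive are named differently.
    """
    wanted = set(page_numbers)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with tarfile.open(archive_path) as archive:
        for member in archive.getmembers():
            match = re.search(r"_(\d+)\.jp2$", member.name)
            if not member.isfile() or not match or int(match.group(1)) not in wanted:
                continue
            target = out_dir / Path(member.name).name
            # Copy the bytes unchanged: nothing is decoded, so no quality is lost.
            with archive.extractfile(member) as source:
                target.write_bytes(source.read())
            saved.append(target)
    return saved
```

For example, extract_pages("book_orig_jp2.tar", [18], "pages") would copy only page 18 into a "pages" directory, leaving the JP2 data untouched.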

From PDFs

Comparison of image from DjVu (left) and PDF (right) from the same Google Book scan, via the Internet Archive. Notice that the version extracted from the DjVu is very badly damaged by the compression.

Other works are imported from scans done by other companies, often Google Books. These scans do not have the original page photographs and so the Internet Archive generates the DjVu file from the PDF, which creates additional compression damage due to the text-centric method the Internet Archive uses to create DjVu files. The example on the right is a good-quality Google scan, and you can see the damage done by subsequent DjVu conversion.

If the DjVu was made from a Google Books scan, the Internet Archive extracts and archives the page images; these are usually bi-tonal TIFF files. They can be accessed online:

  • Go to the Internet Archive details page for the work (e.g. https://archive.org/details/mammalsminnesot00herrgoog)
  • Go to the file list ("Show all" under the list on the right), then find the "_tif.zip" archive.
  • As above, you can download the whole archive or just what you need on a page-by-page basis.

You can also save the images directly from the PDF if your software supports this (the TIF files are the same). Do not save the file in a lossy format (see above); use PNG or TIF.

Many Google Books scans are very low quality bi-tonal images, and scans from other sources are preferred for image extraction (look for colour preview images at the Internet Archive and suffixes such as "uoft", "rich" and "ala"). Digital Library of India (DLI) scans, often found at the Internet Archive, are also generally low-quality for the purposes of image extraction.

For example, the image on the right is now also available in a scan that the Internet Archive made themselves, which contains the original colour page photographs: p12 of https://archive.org/details/mammalsofminneso00herrrich.

Hathi Trust

Scans at the Hathi Trust are presented as page images (unless you have institutional access, you cannot download complete PDFs).

  • Use the following URL scheme:
  • Change {ID} to match the book's ID
  • Change {N} to the page number you want. This is shown in the Hathi web viewer in the bottom right corner as #N / TOTAL.
  • Ensure size is a very large number (say, 10000). This is not a pixel size. No matter how large this number is, Hathi will not give an image larger than the original scan image. If it's set to 0 or omitted you will not get a full-size image.
    • In this case size=10000 provides the full image of 3,490px × 5,370px.
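
The substitution steps above can be scripted when you need many pages. The following is a minimal sketch: the scheme string is supplied by the caller, and the example.org address below is a placeholder for illustration only, not the real Hathi Trust URL. The {SIZE} placeholder and the function name are also assumptions made for this example.

```python
def fill_url_scheme(scheme, book_id, page, size=10000):
    """Substitute {ID}, {N} and, if present, {SIZE} in a URL scheme string."""
    if size <= 0:
        # A size of 0 (or none at all) does not return the full-size image.
        raise ValueError("size must be a large positive number")
    return (scheme.replace("{ID}", book_id)
                  .replace("{N}", str(page))
                  .replace("{SIZE}", str(size)))


# Placeholder scheme for illustration only; it is NOT the real Hathi Trust URL.
EXAMPLE_SCHEME = "https://example.org/image?id={ID}&page={N}&size={SIZE}"

for page in (41, 42):
    print(fill_url_scheme(EXAMPLE_SCHEME, "hypothetical.id123", page))
```

The sketch does not URL-encode the book ID, so check the result if your ID contains unusual characters.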

You can also download the page-by-page PDFs and extract the image that way (again, take care to only save losslessly). These images are identical to those obtained as above.

There is a "Data API", which can be used too, but this requires a free account to acquire the API keys.

Google Books

The Google Books web viewer doesn't make it easy to access page images directly. Some books can be downloaded directly from Google Books as PDF files (hover over the "Ebook - Free" button and then choose "Download: PDF"), and then you can extract images from there. Do not take screenshots of the webpage: the original images are usually about 3000 × 5000px and are scaled down and re-compressed when shown in the web viewer.

If the Internet Archive has imported this scan, the PDF is the same (in which case, the Internet Archive allows you to access the page images directly, see above).

Google Books scans are often missing images, embed images at almost-useless thumbnail quality, have torn or folded pages, or include the hands or fingers of the scanner operator blocking the page. Generally, a Google Books scan is the worst-quality scan available; find alternatives if possible.

Below is the same image from a Google Books scan and from an Internet Archive scan (no further processing on either image has been done):

Sometimes, the same scan is available at Hathi with vastly higher quality. It seems Google's public PDFs are heavily compressed, but other users like Hathi provide them at higher qualities. For example, the following two files are scans of the same physical book from the New York Public Library. As you can see, the whole image has been totally removed by the compression technique at Google Books.

British Library Access

Items available at British Library Access have both full-page JPG files available and whole-book PDFs. As usual, the JPG files are best to work with for the purposes of image extraction.

Not all works are scanned and available online. If the work has a "Digital Content: Go" button under the "I want this" tab in the catalogue, the scan is available. Click the "Go" button to open the viewer.

The files are downloaded using the "Download" button in the lower left. Navigate to the page you wish to extract from and click the button. If you are given a choice of different image sizes, choose the largest size available.

For example, for Hudibras (1780 edition):

Downloading images

Usually, you can download the images manually with your web browser.

Directly into GIMP

If you are using an image application such as GIMP, you can load the image straight into the application from its URL using File > Open Location.

Image processing

Extraction

Extraction of images from the original scan is quite a tricky task and the process to do so varies based on the image and the source. One of the most common extraction types to do is extracting a black-and-white image from a page with yellow-brown paper.

See Help:Image extraction/With GIMP for one method of doing that using GIMP.

The original image
The finished image

Automatic processing with ImageMagick

If you have ImageMagick (all operating systems), you can remove a paper texture automatically from a black-and-white image by adjusting the colour levels as follows:

convert foo.jpg -level-colors 'rgb(40,40,40),rgb(180,180,160)' -type Grayscale foo2.jpg

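In essence, this sets any channel value below 40 (out of 255) to black and any value above 180 to white, and rescales everything in between. The white point for the blue channel is 160 in the command above, so that yellowish paper, which is weaker in blue, still reaches white. This is a fairly aggressive process and is only suitable for very even page colours and very clear, dark printing.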
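
The arithmetic behind this mapping can be sketched in a few lines. This is a model of the level adjustment described above, not ImageMagick's own code, and it leaves out the later conversion to greyscale; the function names are hypothetical.

```python
def apply_levels(value, black, white):
    """Map one 0-255 channel value: <= black gives 0, >= white gives 255."""
    if value <= black:
        return 0
    if value >= white:
        return 255
    # Linear stretch of the range between the two points onto 0-255.
    return int(255 * (value - black) / (white - black) + 0.5)


def level_colors(pixel, black_point=(40, 40, 40), white_point=(180, 180, 160)):
    """Apply the per-channel stretch to an (R, G, B) pixel."""
    return tuple(apply_levels(v, b, w)
                 for v, b, w in zip(pixel, black_point, white_point))


print(level_colors((30, 30, 25)))     # dark ink
print(level_colors((200, 190, 165)))  # yellowish paper
print(level_colors((110, 110, 100)))  # mid-grey
```

If your page is darker or the ink is fainter than in this example, the black and white points are the values to adjust.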

Original image cropped from a page scan
Image processed with ImageMagick

If you need a transparent background (for example for a drop initial over a coloured background), you can then use the following command:

convert foo2.jpg -negate -background transparent -alpha Shape foo.png

Moiré

Many illustrations are engraved, often using closely spaced lines. This can produce moiré effects when scanned, which can be tricky to remove without damaging the image.

Sometimes, it's possible to reduce the moiré at the desaturation step. Coloured moiré is often "invisible" in the HSV value channel, so desaturating by that channel, as opposed to luminance or luma, can help. In the following example, the orange and blue bands (e.g. in the forehead area) are visible as light and dark bands after desaturation by luminance, but are mostly removed by desaturation by HSV value:

To do this in GIMP, choose Colours > Desaturate > Desaturate... and select the Mode from the drop-down.
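
To see why the choice of mode matters, compare the two calculations on a pair of made-up band colours. This is a rough sketch: the pixel values are hypothetical, the luma weights are the common Rec. 601 ones (an assumption), and GIMP's own modes may weight or linearise the channels differently.

```python
def hsv_value(pixel):
    """HSV value of an (R, G, B) pixel: simply its largest channel."""
    return max(pixel)


def luma(pixel):
    """Weighted sum of the channels (Rec. 601 weights, an assumption here)."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


# Hypothetical moiré bands with opposite tints but the same largest channel.
orange_band = (255, 200, 150)
blue_band = (150, 200, 255)

for name, pixel in (("orange", orange_band), ("blue", blue_band)):
    print(name, "value:", hsv_value(pixel), "luma:", round(luma(pixel)))
```

Both bands have the same HSV value, so they merge into one grey, whereas their luma differs and the banding would survive desaturation.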

File types

When uploading the images to Commons, there are two main choices for the file type: JPG or PNG. JPG is a lossy compression format designed for photos. PNG is a lossless format that's generally better suited to diagrams.

Using a lossy format

If the image is already compressed with a lossy format such as JPG or JP2, and you have not processed it to remove compression noise, you can use JPG. Using a lossless format here will result in large files because the PNG format will expend a lot of bytes on slavishly encoding the pseudo-random compression noise of the original image.
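
The effect of noise on lossless compression can be demonstrated with DEFLATE, the compression method that PNG uses. This is a simplified sketch: it compresses raw bytes directly and ignores PNG's filtering step, and the noise is generated rather than taken from a real scan.

```python
import random
import zlib


def deflated_size(data):
    """Size in bytes after DEFLATE, the compression method PNG uses."""
    return len(zlib.compress(data, 9))


rng = random.Random(0)  # fixed seed so the run is repeatable
pixel_count = 100_000

# A perfectly flat mid-grey area versus the same area with +/-3 levels of noise.
flat = bytes([128] * pixel_count)
noisy = bytes(128 + rng.randint(-3, 3) for _ in range(pixel_count))

print("flat: ", deflated_size(flat), "bytes")
print("noisy:", deflated_size(noisy), "bytes")
```

The noise is barely visible to the eye, yet the noisy area needs far more bytes than the flat one.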

Other images that should use JPG are photographs or paintings. For example, the following painting does not benefit much from lossless compression, because it does not have very sharp edges that trigger substantial JPEG noise:

JPG image (Q=95)

Zoom of region, showing in red the pixels changed by more than 1% by that Q=95 JPG compression.

Although there are a few changed pixels, they are spread randomly in the image and are "hidden" amongst the existing image variations (as is the intention with JPG compression). Compare with the star map example below, which has much stronger clustering of noise around edges and, because the noise is on a flat background, is much more visible.

Some images of photographs or paintings contain substantial, highly-regular, background "noise" caused by the printing process. Usually, unless you have scanned the image yourself, this noise has already been heavily damaged by the compression applied by the scanning organisation. Very often, the repeating nature of this noise interacts with the block-based nature of the image compressor to produce very noticeable noise in the image at a much larger physical scale than the original noise[1]. For example, this zoomed-in portion of an image has large "blocks" of noise, even though the original printing texture was only a few pixels across:

If you have an image with such background texture (and it doesn't already have such artifacts), you should take great care when compressing with a lossy format. You at least should use a high JPEG quality factor (e.g. 95) to avoid introducing too much noise. Using a lossless format will result in a high file size, but it will provide the most accurate reproduction of the original image.

If you have an image that already has such artifacts, there is no point in using a lossless format to "preserve" the fine detail of the noise - it's already been swamped by the existing noise. The best thing you can do at this point is to carefully try to remove the noise with a blur tool, de-screen tool or similar, though you will have to be extremely careful around genuine sharp edges in the image. For example, the image below was manually cleaned, but the necklace in particular was not blurred. This means that the noise still exists in that area, but because the necklace has strong high-frequency details, it's not clearly visible. The noise isn't completely gone, but it's much reduced (see the original for comparison).

When removing noise from a compressed image, note that it is physically impossible to recover the original image quality - that information is lost forever in the compression step. So you have to use your judgement to find a balance between noise removal and damage to the underlying image (which usually manifests as blurring). What is appropriate strongly depends on the image and the noise type and amount.

Using a lossless format

If the image has a limited colour palette and/or contains very rapid changes from dark to light, use PNG. This includes most diagrams and engravings. This is because the sudden colour changes represent "high-frequency" image data, which causes image artifacts in lossy schemes.

This means that PNG is very often more suitable for:

  • Line drawings and diagrams
  • Images with sharp transitions from light to dark (e.g. text with clean edges, borders and printed decorations)
  • Engravings, which combine sharp edges with fine repetitive detail that magnifies JPG noise
  • Images with large regions of "flat" colour (PNG can efficiently compress flat regions of exactly the same colour)
  • Screenshots of software (these have both flat colours and sharp edges)

For example, the red pixels below show where JPG compression of a PNG diagram introduces new noise:

Image channels

Some formats allow you to specify how many channels an image has. For example, a greyscale image has only one channel (the lightness of each pixel) and a colour image has three (red, green and blue). If the image has an alpha channel (transparency), that adds one more.

When saving as PNG, if your file is a greyscale image, saving as a greyscale PNG (GREY or GREYA, depending on whether you are using an alpha channel) rather than as RGB or RGBA can substantially reduce the file size. Sometimes (not always), depending on the file, a greyscale PNG can be about the same size as a JPG, because the reduced channel count outweighs JPG's compression savings.

JPG offers less choice: it can store single-channel greyscale or three-channel colour images, but it has no alpha channel.
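
The effect of the channel count on the amount of data to be compressed is simple arithmetic. The sketch below computes the uncompressed size for each PNG mode, using the 3,490px × 5,370px page from the Hathi Trust section as an example and assuming 8 bits per channel; the final file size after compression will of course be smaller and depends on the image.

```python
CHANNELS = {"GREY": 1, "GREYA": 2, "RGB": 3, "RGBA": 4}


def raw_size_bytes(width, height, mode, bits_per_channel=8):
    """Uncompressed pixel data size for an image of the given mode."""
    return width * height * CHANNELS[mode] * bits_per_channel // 8


# Page dimensions taken from the Hathi Trust example earlier on this page.
for mode in CHANNELS:
    size = raw_size_bytes(3490, 5370, mode)
    print(f"{mode:5} {size / 1_000_000:.1f} MB before compression")
```

A greyscale page therefore starts from a third of the data of the same page stored as RGB.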

File sizes

Do not worry about file sizes. Even if the original file is large at full-size, when it is resized to a few hundred pixels for use on a Wikisource page, it will be much smaller. As an example, the star map above is ~500kB at full size, but ~120kB at the size presented above. The Wikimedia servers have plenty of space, and preserving the images at a high quality is more important than saving this space.

However, this doesn't mean you need to save images losslessly when they are already highly compressed - particularly if that image is not likely to be used as a basis for further work. If you are saving an existing JPG image, there is no benefit from converting to PNG first - the information is already lost.

Decision tree

  • My image is a line diagram, engraving or has large blocks of a constant colour
    • You probably should be aiming for a lossless format to preserve sharp line edges.
    • I got the image from a source where it was already "clean" and saved as PNG:
      • → Keep it as PNG. Compressing as a JPG will re-introduce noise - see the star map example above.
    • I got the image from a source where it was already compressed as a JPG or JP2:
      • I wish to get the highest-quality result:
        • → Remove the noise with an image processing tool and save as PNG
      • I don't want to have to process the image
        • → Upload the JPG as you have it - saving as PNG will not provide much benefit, as the image is already damaged. You should still link to the original source so someone wishing to extract the image properly can start from the "most original" image.
  • My image is a photograph or a painting
    • You are probably aiming for a JPG result
    • My image does not have a background printing texture (e.g. a modern photograph or other non-dot-based printing process)
      • → Upload as a high-quality JPG
      • → PNG can be used too if you are keen to provide a "master" copy for further manipulation, but will likely not provide much benefit to readers
    • My image has background printing texture
      • My image is already suffering from JPG compression of this texture
        • I wish to get the highest-quality result:
          • → Remove the noise with an image processing tool and save as a high-quality JPG
        • I don't want to have to process the image
          • → Upload the JPG as you have it - saving as PNG will not provide much benefit, as the image is already damaged. You should still link to the original source so someone wishing to extract the image properly can start from the "most original" image.
      • My image doesn't have noticeable compression damage to this texture
        • → Upload the image you have if possible, else:
        • → Upload a lossless format of the image, and/or:
        • → Upload a JPG of the image at a high enough quality that you don't start to degrade this texture (probably around 95, but it depends on the image)
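
The decision tree above can also be written as a small function, which may be handy if you are processing a batch of images. This is one possible encoding of it, with hypothetical parameter names; the returned strings are summaries of the advice, not exact instructions.

```python
def recommend(kind, source_is_lossy=False, will_clean=False,
              has_print_texture=False, texture_damaged=False):
    """Return a suggested upload format following the decision tree above.

    kind is "line" (diagram, engraving, flat colour) or "photo"
    (photograph or painting).
    """
    if kind == "line":
        if not source_is_lossy:
            return "keep as PNG"
        return "clean, then save as PNG" if will_clean else "upload the JPG as-is"
    if kind == "photo":
        if not has_print_texture:
            return "high-quality JPG"
        if texture_damaged:
            return ("clean, then save as high-quality JPG" if will_clean
                    else "upload the JPG as-is")
        return "upload as obtained, or lossless, or JPG at about Q=95"
    raise ValueError("kind must be 'line' or 'photo'")


print(recommend("line", source_is_lossy=True, will_clean=True))
print(recommend("photo", has_print_texture=True, texture_damaged=True))
```

Whichever branch applies, still link to the original source so that others can start from the "most original" image.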
  1. This is, conceptually, a form of moiré, but with the block size of the compressor being one of the "grids".