Help:DjVu files

From Wikisource
(Redirected from Help:Djvu files)
Jump to: navigation, search
DjVu files
Shortcut:
H:DV
This page explains how to create, use, and upload files in the DjVu format, which groups scanned images into a single container format.

Image extraction[edit]

Shortcut:
H:DJVUIMG
Comparison of DjVu-derived (left and top) and "Read online" (right and bottom) images from the Internet Archive. The huge loss in quality is evident as blurring, blocking and loss of most detail. However the reason for this is unclear, because the original author haven't described how the DjVu version was created.
Comparison of image from DjVu (left) and PDF (right) from the same Google Book scan, via the Internet Archive. Notice that the DjVu version is very badly damaged by the compression. However the reason for this is unclear, because the original author haven't described how the DjVu version was created.

It is tempting to take images you need directly out of the DJVU files, but some DjVu documents are created with lossy compression, so not always it is the right way to go. If the quality is not satisfactory and there is no other source, then extract from the DJVU and tag the file with {{bad extraction}} at Commons. Otherwise, please use a better source, such as JPG/PNG/TIFF scans of the text.

If the DJVU came from the Internet Archive, there are often high-quality JPG files that are viewable online (go to the IA details page, choose "read online", and from there you can increase the size of the image, then right click and save an image.

If the DJVU was made from a Google Books scan, the Google books PDF document can be used to good effect, as the DjVu is derived from the PDF in most cases, leading to additional compression damage. Many Google Books scans are very low quality bi-tonal images, and scans from other sources (look for colour preview images at the IA and suffixes such as "uoft" and "ala") are preferred for image extraction. In addition, the Google Books scans are often missing images, embed images as almost-useless thumbnail-quality images, have torn or folded pages, can include the hands or fingers of the scanner operator blocking the page. The example on the left is a good-quality Google scan, and you can see the huge damage done by subsequent DjVu conversion.

Conversion[edit]

Images to DjVu[edit]

Windows[edit]

DjvuToy is a software which provides different functionalities:

  • make a Djvu
  • merge Djvu files
  • split Djvu files
  • edit Djvu files
  • generate a bundled file
  • export from Djvu to another file
  • extract text from Djvu
  • download Djvu file structure info (eg. OCR coordinates)
Images → virtual printer → DjVu[edit]

If the page scans are made available as a PDF file, e.g. Google Books scans, then this can be directly converted into a DjVu file using one of the following:

  • The free Any2DjVu online service; this can also OCR the text and embed it in the .djvu file.
  • The freeware Pdf To Djvu GUI. Note that this requires the installation of the Cygwin environment as a prerequisite to its own installation.
  • The free software command-line pdf2djvu (available in repositories, also for Linux), which is usually as simple as pdf2djvu -o output.djvu input.pdf. There's also a GUI available.
  • If you need to crop the PDF document, you can use pdfcrop.pl (see below) for black margins or freeware Govert's PDF Cropper for white margins (it requires Ghostscript and .Net 2.0).

If the scanned images are made available as individual images, then the easiest option is to print them to a PDF document via one of the many "virtual printer" tools, such as the free PDFCreator; then convert the PDF document to DjVu as described above.

Note that there are many other options for converting pages to .djvu. One could convert using PostScript or multipage TIFF as the intermediate format, rather than PDF, but this would of course require different conversion tools. It is also possible to convert from .pdf or .ps to .djvu with the DjVuLibre software and its GSDjVu plug-in but due to licensing restrictions installing the plug-in is a fairly intricate process that involves compiling a patched version of Ghostscript.

Another free Windows tool that can come in handy for the images-to-pdf-to-djvu process is ConcatPDF, a GUI tool that permits easy splitting and merging of PDF files. An example of how ConcatPDF might be used is: if a 100-page document has previously been scanned and converted to .djvu and the single page #42 needs to be re-scanned, ConcatPDF would allow that one page to be inserted into the intermediate .pdf file without tracking down the other page images and re-composing the entire document. Installing ConcatPDF version 1.1 requires as prerequisites that the free Microsoft program libraries Microsoft .NET Framework Version 1 and the corresponding Visual J# .NET Redistributable Package be installed beforehand.

Images directly to DjVu[edit]

However, a far higher quality document can be achieved using the DjVuLibre software library. Jpeg images can be directly encoded into individual DjVu pages using the c44 encoder. Images in lossless formats such as PNG should be converted to PPM (for colour scans) or PGM (for greyscale scans), then encoded using c44. For bitonal (i.e. black-and-white) scans, such as most page text images, a smaller DjVu file can be obtained by converting the page images to the monochrome PBM format, then encoding to DjVu using the cjb2 encoder. All of these image format conversions can be performed by the freeware ImageMagick library (in batch, with mogrify). Individual DjVu pages can be aggregated into a multi-page DjVu using the djvm program; this program can also be used to insert or delete pages from a djvu file.

An important caveat of this process is that high quality scans come at the cost of larger files, and there is currently a 100 Mb limit on uploads to commons. The size can be substantially reduced by applying foreground/background separation with didjvu and/or minidjvu.

Scripting djVuLibre[edit]

This script allows you to take a whole directory of image files (JPG, PNG, GIF, TIFF, and any file than Imagemagick can convert to PPM) images and convert and collate them automatically into a DJVU file. Currently this script is for Windows, but it can be easily converted for Linux. To use it, you will need Python, Imagemagick and DjvuLibre.

Linux[edit]

See also: User:GrafZahl/How to digitalise works for Wikisource
Method 0 - converting graphic files with foreground/background separation[edit]

Just use didjvu.

You may consider preprocessing the scans with Scan Tailor.

Method 1 - page at a time with DjVuLibre[edit]

You need the djvu software, which includes a viewer, and some tools for creating and handling DJVU files. You will probably also need the Imagemagick software for converting scans from one format to another. The tool cjb2 is used to creating a DJVU file from a PBM or TIFF file. Therefore you need to convert your scans if they are not already in one of these formats.

  • Conversion from PNG format to PBM format with the tool convert from Imagemagick
convert rig_veda-000.png rig_veda-000.pbm
  • Depending on the quality of the original scans, you may find it useful to process them with the unpaper utility, which deletes black borders around the pages and aligns the scanned text squarely on the page. Unpaper is also capable of extracting two separate page images where facing pages of a book have been scanned into a single image. Another utility is mkbitmap, another pdfcrop.pl (Perl-based and free software, it requires Ghostscript and texlive-extra-utils on Ubuntu; it uses BoundingBox; it can crop a whole multipage PDF document in just one passage). PDFCrop (another one!) deletes white margins.
  • Creation of a DJVU file from a PBM file
cjb2 -clean rig_veda-000.pbm rig_veda-000.djvu
  • Adding the DJVU file to the final document
djvm -i rig_veda.djvu rig_veda-000.djvu

You need to repeat these steps with a script for each page of the book. Example:

#!/bin/bash
for n in `seq 1 9`
do
        i="rig_veda-$n.png"
        j=`basename $i .png`
        convert $i $j.pbm
        cjb2 -clean $j.pbm $j.djvu
        djvm -i rig_veda.djvu $j.djvu
done

There is also another way to add all the *.djvu parts into one:

djvm -c rig_veda.djvu rig_veda-000.djvu rig_veda-001.djvu rig_veda-002.djvu

See the following section for an automated process for multiple pages.

Method 2 - PDF to DjVu bash script[edit]

Use this script, which converts PDF document (multiple or single page) into images, automagically crop them with ImageMagick, convert them in DjVu and bundle them. This is very slow (huge PDF document can require days) but a little more efficient than the following method.

The resulting PDF document is quite big and low-quality, probably because of poor font recognition, which may be fixed by newer versions of poppler (the used library): the version available in repositories is usually several months old.[1]

You can also remove the pdftoppm part and use the script to convert multiple images directly in a multiple page PDF document. If images are not in pbm format, you can convert them with a single command using mogrify from ImageMagick.

Method 3 - pdf2djvu[edit]

Simply download the pdf2djvu tool from your repository to directly convert PDF document (single or multiple pages) into DjVu.

If the document contains the results of OCR (as is the case e.g. with FineReader output) then they are preserved in the DjVu document as the hidden text layer. Some other properties of the source document, including metadata, are also preserved. The quality and the size of the output depends primarily on the features of the source document but can also be controlled with several program parameters, such the resolution of foreground and background. The program is capable to use several threads to speed up the conversion.

The original author of this page made the following recommendation which does not seem valid:

Moreover, you need to crop directly the pdf before the conversion. On Linux this is quite difficult. You could use ImageMagick convert -crop, but attention: with multiple page big PDF document, this can take several GB of memory (the limit is 16 TB!) and kill your computer if you don't use the -limit area 1 option directly after -crop. This make the convertion very long.

The resulting PDF document is increased in size and reduced in quality because of rastering.[2]

See other crop tools above.

Method 4 - DjVuDigital[edit]

Use djvudigital,[3] which like pdf2djvu converts pdf directly in DjVu.[4] There are licensing problems, because the GSDjVu library has a different license, then you'll need to compile it by yourself; the included utils make this step quite easy, but still long (about 1 hour) and a bit annoying.[5]

But, then you can convert PDF document into DjVu with a single command (see the previous section for crop). The conversion is slow (I find it will complete a 300 page PDF document in about 30-40 minutes). The resulting DjVu is of higher quality and lower filesize compared to both the previous two methods.[1] Additionally, DjVuDigital can handle JPEG2000 (aka JPX) files embedded in PDF documents, which is a feature of many Google books. pdf2djvu, Any2Djvu and Internet Archive conversions all fail to convert these files, leaving blank pages in the output.

DjVuDigital has many advanced options to improve results, but they can be difficult to master.[6] In general, altering the --dpi option can give you a qucik reduction in filesize without too much fiddling.

Online ([almost] all systems)[edit]

Any2Djvu[edit]

Another method to convert the images to djvu is to zip them and use the Any2Djvu site to create the djvu file. The Any2Djvu will extract the images in the zip and create a OCRed djvu. OCR functions will only with English text.

Any2Djvu cannot handle huge files. Big files are best dealt with if you upload them by URL (e.g. by entering a link like ftp://ftp.bnf.fr/005/N0051165_PDF_1_-1DM.pdf). Conversion can take several hours. Any2Djvu will sometimes run out of memory on large or highly-detailed files and fail. It will also not convert "JPX" images embedded into PDF documents, which are common in Google Books scans.

The Internet Archive[edit]

Another method is to upload PDF document to the Internet Archive. You need to login (don't use OpenId, it won't function[7]).

Uploading[edit]

Click "Upload" at the top-right corner. The flash upload (standard "Share" button) won't function with Firefox (use Opera or Internet Explorer instead[8]) or Linux. You can use the standard JavaScript non-flash method (although there's a file size limit of 2 GB with Firefox, but not with Chromium); FTP upload is deprecated because it's slower and crashy but is the only easy to learn possibility if you have to upload many files (which shouldn't be the case here).

OCR tricks[edit]

When the upload has been completed, the Internet Archive will start the "derive" work: OCR to create an XML document of detected text based on the uploaded PDF file, then conversion of that to a DjVu file with embedded text, creation of plain text-only dump file, among others.[9]

Don't forget to set the correct language in the metadata before starting the derive (which is run automatically after upload if there's something to derive), otherwise the OCR language will be set to English and results will be poor for works based in any other language. It's not possible to set multiple OCR languages, but you're invited to upload the same book twice with the two languages to have two OCRs.[10] The length of processing time depends on the size and complexity of your file, as well as the current Internet Archive backlog of conversion tests.[11] You can check your progress in the queue here and more detailed information about jobs you submitted here (must be logged in).

The Internet Archive uses a professional, proprietary, commercial ABBYY software[12] with a quite good images and OCR output in many languages and fonts and an aggressive compression[13] which mantains an high quality of the final DjVu file.[1] However, the Internet Archive sometimes produces over-compressed DjVu files with poor quality. If this happens, you can often download a PDF document and convert manually. You can reduce the resolution the derivation aims at, which is normally set automatically by some "guessing", via the fixed-ppi field, setting it to 300 (dpi) or lower to reduce sizes, processing time and (sometimes) errors.

Images formats[edit]

Book scans split into several tiff, jpg, jp2 format images (other formats are not accepted) are converted ("derived") as well, if you put them in a properly created tar or zip archive.[14] It's usually better to upload uncompressed scans or JPEGs; the jp2 files produced in the derivation process are compressed in a way you won't be able to emulate without a lot of effort.

Troubleshooting[edit]

If you have severe problems with your deriving process and you need admin intervention (tasks shown in red in your tasks list), ask help at infoAt sign.svgarchive.org, they're usually amazingly helpful. General requests for help should be placed in the forums though, don't bother them for nothing!

DjVu to text[edit]

OCR via Any2DjVu[edit]

The OCR option available at the free conversion service Any2DjVu does do an OCR of the scanned image but the resulting text is embedded within the .djvu file itself and must be extracted so it can be used on Wikisource.

One way to do this is to use the DjVuLibre software to extract the text, via a command like

djvused myfile.djvu -e 'print-pure-txt' > myfile.txt

or

 djvutxt myfile.djvu > myfile-ocr.txt 

JVbot can automatically upload the text layer of a DJVU to the pages on Wikisource. For example, Robert the Bruce and the struggle for Scottish independence - 1909.

OCR via the Internet Archive[edit]

See above: if you upload a DjVu file, the derive process will OCR it.

OCR with Tesseract[edit]

OCR can be done with Tesseract, a free OCR software, and a script:

OCR with Tesseract 3.x and other free OCR engines[edit]

Use ocrodjvu.

DjVu to Images[edit]

Linux[edit]

To extract images from a DjVu file, you can use ddjvu

ddjvu -page=8 -format=tiff myfile.djvu myfile.tif

If you done all the pages (without -page=**) you can split the multi-page tiff into single pages png (or any other format)

convert -limit area 1 myfile.tif myfile.png

Manipulating[edit]

There's some advice about manipulating DjVu files or images to be used to generate DjVu elsewhere:

Splitting DjVu files[edit]

The DjVu documents come in two flavours: bundled and unbundled (indirect); the latter format stores every page in a separate file. The comment below made by the original author concerns only bundled documents, which should be avoided.

Large works can not be uploaded onto Wikimedia servers which have a 100 MB upload limit. To split the DjVu, use DjVuLibre "Save as", and specify a page range which will produce a file small enough to be uploaded. Some trial and error may be necessary.

djvused can do this to :

 djvused myfile.djvu -e 'select 10; save-page-with p10.djvu'

This can be done for every page.

To know the number of page of the file :

 djvused myfile.djvu -e 'n'

Removing a copyright page[edit]

Many of the already-created djvu files available at archive.org and other sites have the Google copyright page attached to the front of the document. Wikimedia policy, based on an analysis of the underlying law, does not accept that copyright is established on a public domain work simply by scanning or copying it or taking a two-dimensional photograph that faithfully represents its subject. See Wikimedia Commons for more information about scans, artwork and the position of the WMF.

Such copyright pages and other extraneous material can be removed with DjVuLibre, an open source program maintained by the inventors of djvu under the GNU Public License. Binaries are available for Windows, Mac, Linux, Solaris, and IRIX. It includes djvm.exe, which is run as a command-line utility. If you cannot figure out how to do this, you can message Mkoyle (talk), and he will do it for your file and email the file to you for upload. The command line to delete (-d) the first page (1) is as follows:

djvm -d filename.djvu 1

Displaying a particular page[edit]

Emily Dickinson Poems (1890).djvu

The [[Image:...]] link tag accepts a named parameter "page" so that, for example, this wiki code displays the image of page 164 of the file Emily Dickinson Poems (1890).djvu on the right, 150 pixels wide (the rear cover of the book, containing no text):

[[Image:Emily Dickinson Poems (1890).djvu|right|150px|page=164]]



The page image can be displayed in the DjVu in place of text as in Page:Personal Recollections of Joan of Arc.djvu/9 using:

{{use page image|caption=JOAN'S VISION}}


The page image can be displayed in the books Wikisource main space as with Personal Recollections of Joan of Arc/Book I/Chapter 2 using:

[[Image:Personal_Recollections_of_Joan_of_Arc.djvu|page=27|right|thumbnail|200px|THE FAIRY TREE]]

Notes[edit]

  1. 1.0 1.1 1.2 Example: this 205 MB PDF document of a 1691 book from Gallica is converted by pdf2djvu.sh script in a hardly readable 382.4 MB djvu, in a little better readable 316.7 MB djvu by djvudigital and in a better quality 51.3 MB djvu by the Internet Archive.
  2. For instance, this 55 MB PDF document when cropped with ImageMagick gives a 100 MB PDF document which converted with pdf2djvu gives a 86.2 MB djvu, while the Internet Archive gives directly a 10.1 MB djvu of better quality.
  3. Man page.
  4. A comparison here.
  5. Complete instructions here.
  6. Moreover, they can require the proprietary msepdjvu libray instead of csepdjvu: see superhero pres: is it independently reproducible?.
  7. See forums: Authentication error; not a valid OpenID, Login problems when I click "Share" .
  8. See forum.
  9. If your original PDF has no embedded text-layer, the derive process will automatically create a second, text-rich, PDF for you by applying the same previously detected OCR generated text to create one.

    Please, note, however, if your PDF comes from GoogleBooks and has a first-page disclaimer notice, the derive process will detect the disclaimer page's hidden text-layer, assume the rest of the pages in the PDF also have embedded hidden text-layers too when they never do and skip the automatic creation of the second PDF file altogeher. Keeping the disclaimer page but stripping it of all hidden text is the optimal approach for reasons having to do with the complimentary creation of a DjVu file at the same time - swapping it with a suitable null or blank page will do just as well and of course the last resort is deletion of the disclaimer page.

  10. See forum.
  11. Example: Vocabolario degli accademici della Crusca, 1691, took 5.1 days to derive.
  12. Version 9.0 since 2013.
  13. In the example, dimension is 1/6 compared to djvudigital output.
  14. FAQ; documentation of the format to use. Remember: put extensions in lowercase everywhere, use tif with a single f, put the ppi value of the images in the metadata. If your archive of images is not recognized as such, it may help to edit the metadata and set its format as "Single Page Processed TIFF ZIP" (even if it's a TAR) and so on. You should probably first the _images.zip archive format.

See also[edit]