User:GrafZahl/How to digitalise works for Wikisource

From Wikisource

This is a quick rundown of how I usually create a Wikisource-ready DjVu scan of an old public domain or otherwise free work.

  • Often, I do not own the relevant works myself, so I visit a library which has them. I'm most interested in old mathematics journal articles, and most libraries provide those only for reference, not for borrowing. So I have to use a local photocopier. Normally, I let the machine send the copies directly to my private e-mail address as a 300 dpi or 400 dpi TIFF or PDF file. Not only are the fees much lower than for hard copies, but this also saves an additional analog-to-digital (A/D) conversion step, leading to higher output quality. Plus, library photocopiers are much faster than my personal scanner. Unfortunately, this option is not available in small libraries with old photocopiers.
  • When I work with the raw scans, I use the PBM file format (that's the format created by my personal scanner).
    • To convert TIFF files to PBM, I first create a subdirectory called tifdir and split the original TIFF into its individual pages with the tiffsplit program from libtiff:
       tiffsplit scan.tif tifdir/article
will create files named articleaaa.tif, articleaab.tif, and so on in the tifdir subdirectory. Then I use the convert program from ImageMagick to convert each file to PBM (and possibly rotate it in the process). For example
       cd tifdir/
       for file in *.tif; do convert -rotate 90 "$file" "$file".pbm; done;
will create rotated PBM files. (Warning to mathematicians: a positive angle rotates clockwise, that is, in the mathematically negative direction.)
    • To convert PDF files to PBM, I use the pdfimages utility from the Xpdf suite. The output file format depends on the format of the image embedded in the PDF. If it's not already PBM, you can use convert as above to convert the files to PBM.
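A minimal sketch of the PDF route, under some assumptions: scan.pdf, pdfdir and the "page" prefix are placeholder names, the functions pbm_name and pdf_to_pbm are hypothetical helpers, and pdfimages is assumed to write PBM, PGM or PPM files (its default when no format option is given). The exact numbering of the output files varies between versions.

```shell
# Sketch only: scan.pdf, pdfdir and the "page" prefix are placeholder names.

# Print the .pbm name that corresponds to an extracted image file.
pbm_name() {
    printf '%s\n' "${1%.*}.pbm"
}

# Extract every embedded image of a PDF into a directory, then convert
# any PGM or PPM output to PBM with ImageMagick's convert.
pdf_to_pbm() {
    pdf=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    pdfimages "$pdf" "$outdir/page" || return 1
    for file in "$outdir"/*.pgm "$outdir"/*.ppm; do
        [ -e "$file" ] || continue   # skip patterns that matched nothing
        convert "$file" "$(pbm_name "$file")" || return 1
    done
}
```

It would be called as pdf_to_pbm scan.pdf pdfdir. For greyscale or colour images, convert has to reduce the image to two levels; adding an explicit -threshold option may give cleaner results than the default, depending on the scan.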
  • Sometimes, the PBMs need to be cropped before they are converted to DjVu. For that I use a quick-and-dirty home-brewed program, pbmextract, which lets you specify the coordinates of the extraction rectangle (so you can read them off directly from an image manipulation program like The GIMP). The reason I don't do the cropping itself in off-the-shelf image manipulation software is that such programs are often not sufficiently capable of handling bitonal files.
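Since pbmextract is not a published tool, here is a rough sketch of the same cropping step using pamcut from Netpbm instead. This is a swapped-in substitute, not the program described above; the helper names crop_args and crop_pbm and the file names in the usage line are assumptions for illustration.

```shell
# Sketch only: Netpbm's pamcut stands in for pbmextract,
# whose interface is not documented here.

# Build the pamcut options for a rectangle given as left, top, width, height.
crop_args() {
    printf '%s\n' "-left $1 -top $2 -width $3 -height $4"
}

# Crop one PBM file to the given rectangle and write the result to a new file.
crop_pbm() {
    src=$1
    dst=$2
    shift 2
    # Word splitting of the option string is intentional here.
    pamcut $(crop_args "$@") "$src" > "$dst"
}
```

For example, crop_pbm page.pbm cropped.pbm 100 200 2000 3000 would keep a 2000 by 3000 pixel rectangle whose top-left corner is at column 100, row 200.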
  • Once I have the PBM files ready, I convert them to DjVu using the cjb2 and djvm programs from the DjVuLibre suite:
       for file in *.pbm; do cjb2 -dpi 400 -clean "$file" "$file".djvu; done;
       djvm -c finished_work.djvu *.djvu
Obviously, you may have to change the -dpi option depending on your situation. The -clean option removes "flyspecks", leftover artefacts from the scanning process. Of course, this also means the compression is no longer lossless, so depending on your source material you may want to omit this option.
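
The two commands above can also be wrapped into one function. The following is only a sketch under stated assumptions: the names page_name and pbm_to_djvu are hypothetical, the per-page DjVu files are written to a temporary directory so that the final *.djvu glob cannot pick up an older finished_work.djvu, and the -clean option is left out to keep the compression lossless.

```shell
# Sketch only: function names and the temporary-directory layout are assumptions.

# Print the per-page DjVu file name for a PBM file (directory part removed).
page_name() {
    basename "${1%.pbm}.djvu"
}

# Usage: pbm_to_djvu DPI OUTPUT.djvu PAGE.pbm...
# Encodes each page with cjb2 and bundles the pages with djvm.
pbm_to_djvu() {
    dpi=$1
    out=$2
    shift 2
    tmpdir=$(mktemp -d) || return 1
    for file in "$@"; do
        cjb2 -dpi "$dpi" "$file" "$tmpdir/$(page_name "$file")" || {
            rm -rf "$tmpdir"
            return 1
        }
    done
    # The glob sorts the pages alphabetically, matching the tiffsplit naming.
    djvm -c "$out" "$tmpdir"/*.djvu
    status=$?
    rm -rf "$tmpdir"
    return $status
}
```

A call such as pbm_to_djvu 400 finished_work.djvu tifdir/*.pbm corresponds to the loop above. Pages from different directories that share a base name would overwrite each other in this sketch, so it assumes all pages come from one directory.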