Wikisource:Scriptorium/Archives/2009-10/Concerns about fidelity of Internet Archive DjVu files

From Wikisource
Jump to: navigation, search

Concerns about fidelity of Internet Archive DjVu files[edit]

Following on from my "mystery symbol" discussion above, I now have some serious concerns about the nature of the DjVu encoding used by the Internet Archive, and whether the results can be considered faithful scans.

Here is an image of a paragraph taken from the raw tif files provided to the Internet Archive by Google Books:

GB paragraph.png

And here is the same paragraph after the Internet Archive has encoded it into a DjVu file:

IA paragraph.png

If you look closely you will see that

  1. The R and E of "GREVILLEA" look quite different in the Google Books scan, but have been converted to exactly the same glyph in the Internet Archive DjVu file; and
  2. The u's in "frutices" and "aemulis" have both been converted to what look like small-caps N's.

What worries me is that I can't see how this would have happened unless the Internet Archive's DjVu encoder knows something about what kind of glyphs to expect to find on a page, and is willing to take a guess as to which one is correct—a process tantamount to low-level OCR. If it is the case that the Internet Archive's DjVu processing is guessing glyphs rather than faithfully reproducing whatever it sees, then this casts serious doubt upon how we do our work here. What is the point of using scans to ensure fidelity, if the scans themselves lack fidelity?

Hesperian 01:29, 10 September 2009 (UTC)

Come to think of it, the encoder need not know about particular glyphs in advance. This output is just as easily explained by the encoder assuming that there are a relatively small number of glyphs, and trying to cluster the glyph instances that it finds into that number of glyph classes. But this is largely irrelevant; infidelity is infidelity whatever the cause. Hesperian 01:53, 10 September 2009 (UTC)

Looks bad Concern Arlen22 (talk) 01:36, 10 September 2009 (UTC)

Notice however, that it is correct where the same word appears at the bottom. Strange If in doubt, throw it out Arlen22 (talk) 01:39, 10 September 2009 (UTC)

Cygnis insignis has pointed out that w:JBIG2 probably explains what is happening here:

"Textual regions are compressed as follows: the foreground pixels in the regions are grouped into symbols. A dictionary of symbols is then created and encoded, typically also using context-dependent arithmetic coding, and the regions are encoded by describing which symbols appear where. Typically, a symbol will correspond to a character of text, but this is not required by the compression method. For lossy compression the difference between similar symbols (e.g., slightly different impressions of the same letter) can be neglected."

I suppose the issue for us is are we going to cop that? Hesperian 02:52, 10 September 2009 (UTC)

"The key to the compression method [JB2] is a method for making use of the information in previously encountered characters (marks) without risking the introduction of character substitution errors that is inherent in the use of OCR [1]. The marks are clustered hierarchically. Some marks are compressed and coded directly using arithmetic coding (this is similar to the JBIG1 standard). Others marks are compressed and coded indirectly based on previously coded marks, also using a statistical model and arithmetic coding. The previously coded mark used to help in coding a given mark may have been coded directly or indirectly."
— DjVu: Analyzing and Compressing Scanned Documents for Internet Distribution.[1] Haffner, et al. AT&T Labs-Research

— "So it goes", Vonnegut.
— Sigh, Cygnis insignis (talk) 03:52, 10 September 2009 (UTC)

Apparently the upshot of this is that this issue is inherent to DjVu, rather than specifically to the Internet Archive encoder. This is only the Internet Archive's fault inasmuch as they use very lossy compression. This is bad news all round. :-( Hesperian 04:10, 10 September 2009 (UTC)

It’s important to emphasize that this is a consequence of the specific compression settings IA has chosen for their djvu encoder. More reasonable settings can produce better results. I grabbed the topmost png image from this post, converted to PAM, and ran it through the c44 djvu encoder at the default settings. The result was:
Not perfect by any means—it would surely have been better to begin with the source TIFF, rather than a PNG; and tweaking the compression settings or using masks to isolate the foreground text could have produced a smaller file with comparable image quality. But a big improvement over IA’s scan, I think. I suppose the lesson here is to do our own djvu conversions whenever possible. Tarmstro99 (talk) 13:16, 11 September 2009 (UTC)
I'm not much involved in this project, but would it behoove us to make our own DJVUs for these works? If we can't even proof the scans, they're not much use to us.—Zhaladshar (Talk) 16:21, 11 September 2009 (UTC)
support Arlen22 (talk) 18:00, 11 September 2009 (UTC)

could you provide a link to the raw tiff and to the djvu at IA, where you spotted this ? ThomasV (talk) 18:52, 11 September 2009 (UTC)
I see that there are actually two versions of the file online at Commons. The newer one (31 August 2009) is IA’s and contains the compression errors discussed above. The older one (10 August 2008) is GB’s and, at least on the page referenced above, is error-free. Perhaps rather than re-djvu from TIFFs, we could simply revert to the older, error-free version of the document that is already online? Tarmstro99 (talk) 18:59, 12 September 2009 (UTC)
Once I've taken full advantage of the OCR, I'll manually generate a smik DjVu from the jp2 images, and upload over the top. Hesperian 01:53, 18 September 2009 (UTC)
  • I think I have profected a process of taking the zip file of tiffs, uncompress the zip, converting the imagines to an uncompressed format (neccessary for the next step) and converting it to a nice high quality djvu file using gscan2pdf. The djvu is a small size with a higher quality than off of Can anyone give me a text that is really bad or better yet, can we start making a list of text need replacement? --Mattwj2002 (talk) 14:16, 26 September 2009 (UTC)
    • What's the process you use? I'm interested, because I'm trying to create DJVU files that are high quality but lower size. Right now I'm only getting pretty high-sized results.—Zhaladshar (Talk) 14:18, 26 September 2009 (UTC)
      • My process is pretty easy and involves using linux. In my setup I use Ubuntu. The first step is to download the zip files from the Internet Archive. :) Once the download is complete. You'll have unzip the file. This can be done either with the GUI or through the unzip console program. Then I use the following script to convert the tiffs to an uncompressed format (please excuse the messy coding):
ls -1 *.tif | while read line; do convert +compress $line $i.tiff; echo $i; let i++; done
mkdir tiff
mv *.tiff tiff/

Then I take files in the tiff directory and use gscan2pdf to make a djvu file. The djvu appear to be roughly the same quality as the original tiffs and a good size. I hope this helps. --Mattwj2002 (talk) 18:42, 26 September 2009 (UTC)

        • One other point, some of the tifs are also bad quality (from the Internet Archive). A good source might be pdf's from directly from Google. If you go that route, I recommend the following commands (please bare in mind it takes a lot of ram and time):
pdf2djvu -d 1200 -o file.djvu file.pdf

This can be done using Windows or Linux. I hope this helps. --Mattwj2002 (talk) 09:27, 27 September 2009 (UTC)

Another way[edit]

I do mine using ImageMagick and DjvuLibre, both freeware.

For typical black-text-on-white-paper pages, use ImageMagick to convert the tif/jp2/whatever to pbm format. PBM format is bitonal - every pixel is either fully black or fully white. Thus converting the bulk of a scan to this format gives you huge compression. Generally "convert page1.tif page1.pbm" gives you a sensible result, though you can fiddle around with manual thresholding if you want. It all depends on how much effort you are willing to invest in learning ImageMagick. DjVuLibre's cjb2 encoder will convert a PBM image into a DjVu file for you.

For pages with illustrations, convert to PGM for greytone images, or PPM for colour images. Then use DjVuLibre's c44 encoder to encode to DjVu.

Finally, use DjVuLibre's djvm to compile all the single-page djvu files into a single multi-page djvu. I find that listing all the files at once under the -c option doesn't work. You need to append one page at a time.

As for how to manage it all, rather than scripting, I find I get much more control and much more flexibility by enumerating the pages in a spreadsheet, and using formulae to construct the desired commands. e.g you can easily specify which pages should be treated as bitonal, which greyscale, and which colour, and define your formulae to produce the desired command for each case. Having done that, it is just a matter of copying a column of commands, and pasting it to the command line. It is a bit lowbrow, but it really does work well.

Hesperian 11:51, 27 September 2009 (UTC)

What's the quality of the DjVu you get when you a bi-tonal input? I've been using pdf2djvu because it doesn't reduce the colors of the PDF images when converting to DJVU and I get a nice, smooth looking result. Do DJVUs from bi-tonal images look good (or at least decent and not choppy) when all is said and done?—Zhaladshar (Talk) 13:01, 27 September 2009 (UTC)
Make up your own mind:
  • Pages from the Internet Archive version of An introduction to physiological and systematical botany typically look like this.
  • The IA version was missing a couple of pages, which I obtained elsewhere and shoe-horned in, having converted then to bitonal then encoded into DjVu using the bitonal encoder. Those pages look like this and this.
Hesperian 13:17, 27 September 2009 (UTC)
That answers my question. Thanks.  :) The quality isn't bad at all.—Zhaladshar (Talk) 13:24, 27 September 2009 (UTC)
To give an idea of what can be achieved, I managed to fit File:History of West Australia.djvu into 69Mb—not bad for 652 physically large pages at 200 dpi, including about forty plates that had to be retained in greyscale. It works out to about 250 pixels per bit; that's some serious compression. Hesperian 13:43, 27 September 2009 (UTC)