Help talk:DjVu files
Update for PDFtoDjVuGUI
Simplify the message
Shouldn't this file basically explain what a DjVu file is and why we like them? We could then have a subpage that talks about builds, extracts and so on. — billinghurst sDrewth 15:33, 21 December 2010 (UTC)
- I quite agree on this! I just want to learn some elementary things about DjVu: how it is organized, how I can handle it offline, and things like that. It's really difficult to get good information about that. I'm willing to contribute to this project, but without a little help on how things work, that's quite difficult. Dick Bos (talk) 22:50, 13 November 2011 (UTC)
OCR via Any2DjVu
FWIW... something has changed at http://any2djvu.djvuzone.org/. Options and resolutions have been expanded a bit - don't know if this is an improvement or not. — George Orwell III (talk) 22:22, 27 February 2011 (UTC)
Should we improve this section a bit, e.g. by adding how to trim borders and split images which contain multiple pages or columns, or is this information elsewhere (I remember reading something)? I'm in quite a hurry now so I'll just leave some links I found useful:     . Nemo 13:11, 12 March 2011 (UTC)
File size limitation and actions taken
The file size limit is a well-known technical limitation of the Wikimedia projects and a major obstacle to uploading complete content in its full glory. I believe openly available primary sources are the way forward, but it seems that some authors have taken the approach of downgrading the quality of the source, to the point of being almost indiscernible gibberish, to meet the current 100MB limit. I believe these choices of inferior-quality alternatives are poor decisions and will have long-lasting negative effects on Wikisource. I, for one, intend to upload the California Statutes, and I intend to split the files into multiple parts.
As this seems to be the best place for such advice (it seems to be the main help page mentioning "100MB"), I propose that a section be added on the subject, including the various methods editors have used to work around the technical limitation, and the pros and cons of such choices. Given the relatively intensive nature of the work involved, I would like to benefit from other editors' experiences, and I would like other editors to benefit from mine, before such items are uploaded. This will lessen the need for items to be re-uploaded en masse in the future should future editors find other editors' decisions substandard.
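As a rough illustration of the "split into multiple parts" approach, here is a minimal Python sketch that groups consecutive pages into parts that each stay under a size limit. The function name `plan_parts`, the per-page size list, and the reading of "100MB" as 100 × 1024 × 1024 bytes are all assumptions made for the example, not anything defined by the Wikimedia software. A bundled file also carries some overhead beyond the sum of its pages, so in practice a lower limit leaves some headroom.

```python
from typing import List

# Assumption: "100MB" is read here as 100 * 1024 * 1024 bytes.
LIMIT_BYTES = 100 * 1024 * 1024


def plan_parts(page_sizes: List[int], limit: int = LIMIT_BYTES) -> List[List[int]]:
    """Group consecutive page indexes into parts whose summed size stays within `limit`."""
    parts: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(page_sizes):
        if size > limit:
            raise ValueError(f"page {index} alone exceeds the limit")
        # Start a new part when adding this page would overflow the current one.
        if current and current_size + size > limit:
            parts.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        parts.append(current)
    return parts


if __name__ == "__main__":
    # Four hypothetical pages of 40 units each, with a limit of 100 units.
    print(plan_parts([40, 40, 40, 40], limit=100))
```

Keeping pages in order and splitting only at page boundaries means each part can still be proofread on its own, at the cost of parts that are not all the same size.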
- The problem of resolution vs. file size has more to do with the source file (i.e. the PDF) than it does with the DJVU file specification or the conversion process IMHO. I've come to learn that the best results come from manipulating the PDF first (cropping, centering, optimizing, etc.) and then converting it to DJVU, especially when we are talking about content that mostly consists of plain old text with the occasional gray-scale or line-art image peppered in here and there. Unfortunately, most default settings, both for manipulation and for conversion, are set to handle large high-quality color images just in case such images exist within the document in question, and therefore "treat" simple black and white text as such an image (bloating the resulting file segment in the process by "thinking" the entire page is relevant rather than just the text content).
- Here's a recent example of my cropping an 8.5in x 11.0in original PDF to 5.5in x 9.0in PDF before conversion to djvu file format:
- Not only did cropping improve the rendering for ease-of-reading in the above, I could keep the DPI at 600 for the 1999 volume while keeping the benefits of the DJVU file spec's smaller file size at the same time. -- George Orwell III (talk) 01:05, 7 October 2011 (UTC)
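To put rough numbers on the cropping example above, this small Python sketch compares the pixel count per page before and after the crop at 600 DPI. The helper name `pixel_count` is made up for the illustration, and pixel count is only a proxy: blank margins compress well, so the actual saving in file size will not match the ratio exactly.

```python
def pixel_count(width_in: float, height_in: float, dpi: int) -> int:
    """Number of pixels in a page of the given size (inches) at the given DPI."""
    return round(width_in * dpi) * round(height_in * dpi)


original = pixel_count(8.5, 11.0, 600)  # 5100 x 6600 pixels
cropped = pixel_count(5.5, 9.0, 600)    # 3300 x 5400 pixels
ratio = cropped / original

if __name__ == "__main__":
    print(original, cropped, round(ratio, 4))
```

The cropped page has 17,820,000 pixels against 33,660,000 for the original, so at the same DPI the encoder has a little over half as many pixels to deal with.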
- Isn't one of the basic problems/strengths of DjVu that it has the compression of a JPG image, which makes it smaller but also lossy? [From a person who uses but doesn't dig into the technical side of djvu.] A couple of years ago in Scriptorium we looked at comparison examples and problems with the image compression, and this is why we do NOT use the djvu images directly or as a source for File:s. So it works well for text layers, and for a known place on a page, but it has weaknesses for the images themselves. We also see that in the difference between the black and white scans and the tonal scans such as those from the University of Toronto. Such is life. — billinghurst sDrewth 03:10, 7 October 2011 (UTC)
- I beg to differ somewhat - djvu is excellent all around when the content is primarily black text on a white background. The only thing that djvu deals with poorly, as mentioned, is when images are present along with the text. The problem is that, rather than directly targeting those handful of pages containing images separately from the majority of plain text pages, most sites and/or utilities are set up for the opposite situation, where the assumption is that the content has an equal or greater amount of images than it has pages of nothing but plain text. The resulting end product is consistently bloated/corrupted because it "treats" everything as if it were an image even when in reality only text is present. What we typically wind up with here on en.WS is a so-so 450 page document optimized for the 2 dozen pages that happen to have had an image on them, instead of an entirely optimized 450 page document with the 20 or so pages with images optimized separately from the bulk prior to, during or after conversion.
- The "tonal" backgrounds are also overkill - a feature used in some PDFs known as the background-layer option... and applied after-the-fact simply to mimic the "aging" of the so-called "original". This too bloats the file by thinking every background is an image unto itself if not removed/disabled prior to the DJVU conversion. The proper way to "ease white page eye-strain" or "preserve scanned paper aging" is to apply a single color-shade after the file has been converted to a djvu - not bring each background image along from the source file in conversion. You might as well use the PDF instead of its converted djvu cousin at that point. -- George Orwell III (talk) 04:59, 7 October 2011 (UTC)
The DjVu format supports a range of encodings, and you can use different encodings on different pages within a file. For pure text, the best compression comes from an encoder that matches glyphs against a glyph dictionary. This is proprietary and you have to pay big money for it, so in general it is safe to assume that Wikisourcerers have no access. It is what the Internet Archive uses. Unfortunately they tend to push a little too hard for compression, use too few glyph classes, and end up folding characters into each other, which is a very very bad thing for us, who are trusting the image to proof against. Heaven knows what any2djvu uses, but they don't make it possible to specify encodings.
In my experience, the best way to build a good, highly-compressed DjVu file for free is to identify which pages need full colour encoding, which grey-scale encoding, and which, if any, bitonal encoding (sometimes text can safely be encoded bitonally; sometimes it is too blocky that way). Then individually encode each page, before bringing them together in a single file. For pages that can be encoded as bitonal or grey-scale, first use an image processing package like ImageMagick to convert from colour, and threshold out as much of the background noise as possible. Then encode to DjVu. Once you have a complete file, hopefully you can cajole Any2DjVu to embed OCR in it; but Any2DjVu is notoriously flaky with big files. Hesperian 01:13, 9 October 2011 (UTC)
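The per-page workflow described above could be sketched roughly as follows. This is only an illustration under several assumptions that are not in the post: that the free DjVuLibre tools are used (`c44` for colour and grey-scale pages, `cjb2` for bitonal pages, `djvm -c` to bundle), that ImageMagick is invoked as `convert`, that each source page is a PNM image such as `p001.ppm`, and that a 60% threshold and 300 DPI are reasonable starting values. The function names and file names are hypothetical. The script only builds the command lines; it does not run them.

```python
from pathlib import Path
from typing import Dict, List


def page_commands(src: str, mode: str, dpi: int = 300,
                  threshold: str = "60%") -> List[List[str]]:
    """Return the commands that turn one page image into a one-page DjVu file."""
    stem = Path(src).stem
    out = f"{stem}.djvu"
    if mode == "colour":
        # Colour pages go straight to the wavelet encoder.
        return [["c44", "-dpi", str(dpi), src, out]]
    if mode == "grey":
        grey = f"{stem}.pgm"
        return [["convert", src, "-colorspace", "Gray", grey],
                ["c44", "-dpi", str(dpi), grey, out]]
    if mode == "bitonal":
        mono = f"{stem}.pbm"
        # Thresholding drops background noise; the value is an assumed starting point.
        return [["convert", src, "-colorspace", "Gray", "-threshold", threshold, mono],
                ["cjb2", "-dpi", str(dpi), mono, out]]
    raise ValueError(f"unknown mode: {mode}")


def build_plan(pages: Dict[str, str], bundle: str = "book.djvu") -> List[List[str]]:
    """Build every per-page command, then one final command that bundles the pages."""
    commands: List[List[str]] = []
    outputs: List[str] = []
    for src, mode in pages.items():
        commands.extend(page_commands(src, mode))
        outputs.append(f"{Path(src).stem}.djvu")
    commands.append(["djvm", "-c", bundle, *outputs])
    return commands


if __name__ == "__main__":
    for command in build_plan({"p001.ppm": "colour", "p002.ppm": "bitonal"}):
        print(" ".join(command))
```

Each command list could be handed to `subprocess.run` once the output has been checked by eye. Choosing the mode for each page remains a manual judgement, since, as noted above, bitonal encoding is sometimes too blocky for text.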