Help talk:DjVu files

From Wikisource

On Windows, the PDFill Free tools are a more flexible way to convert images to a PDF. John Vandenberg (chat) 07:00, 17 May 2008 (UTC)

Update for PDFtoDjVuGUI

Found that the updated PDFtoDjVuGUI, with its internal updates, works better than the previous version. -- billinghurst (talk) 12:42, 19 July 2009 (UTC)


Is there some method to get a non-bitonal DjVu file on Linux? I dislike black and white.--KRLS (talk) 23:00, 19 April 2010 (UTC)

Simplify the message

Shouldn't this file basically explain what a DjVu file is and why we like them? We could then have a subpage that talks about builds/extracts/whatevers. — billinghurst sDrewth 15:33, 21 December 2010 (UTC)

I quite agree with this! I just want to learn some elementary things about DjVu: how it is organized, how I can handle it (offline), and things like that. It's really difficult to find good information about that. I'm willing to contribute to this project, but without a little bit of help on how things work, that's quite difficult. Dick Bos (talk) 22:50, 13 November 2011 (UTC)

OCR via Any2DjVu

FWIW... something has changed at Options and resolutions have been expanded a bit - don't know if this is an improvement or not. — George Orwell III (talk) 22:22, 27 February 2011 (UTC)


Should we improve this section a bit, e.g. by adding how to trim borders and split images which contain multiple pages or columns, or is this information elsewhere (I remember reading something)? I'm in quite a hurry now so I'll just leave some links I found useful: [1] [2] [3] [4] [5]. Nemo 13:11, 12 March 2011 (UTC)

File size limitation and actions taken

The file size limit of the Wikimedia projects is a well-known technical limitation and a major obstacle to uploading complete content in its full glory. I believe openly available primary sources are the way forward, but it would seem that some authors have taken the approach of downgrading the quality of the source, to the point of being almost indiscernible gibberish, to meet the current 100MB limit. I believe these choices of inferior-quality alternatives are poor decisions, and will have long-lasting negative effects on Wikisource. I, for one, intend to upload the California Statutes, and I intend to split the files into multiple parts.

As this seems to be the best place for such advice (it seems to be the main help page mentioning "100MB"), I propose that a section be added on the subject, including the various methods editors have used to work around the technical limitation, and the pros and cons of such choices. Given the relatively intensive nature of the work involved, I would like to benefit from other editors' experiences, and I would like other editors to benefit from my experience, before such items are uploaded. This will lessen the need for items to be re-uploaded en masse in the future should future editors find other editors' decisions substandard.

What say you all? Int21h (talk) 16:30, 6 October 2011 (UTC)

The problem of resolution vs. file size has more to do with the source file (i.e. the PDF) than it does with the DJVU file specification or the conversion process IMHO. I've come to learn that the best results come from manipulating the PDF first (cropping, centering, optimizing, etc.) and then converting it to DJVU, especially when we are talking about content that mostly consists of plain old text with the occasional gray-scale or line-art image peppered in here and there. Unfortunately, most default settings, both for manipulation and for conversion, are set to handle large high-quality color images just in case such images exist within the document in question, and therefore "treat" simple black and white text as such an image (bloating the resulting file segment in the process by "thinking" the entire page is relevant rather than just the text content).
Here's a recent example of my cropping an 8.5in x 11.0in original PDF to a 5.5in x 9.0in PDF before conversion to the djvu file format:
  • 2008 without trimming whitespace first
  • 1999 with the trimmed whitespace
Not only did cropping improve the rendering for ease-of-reading in the above, I could keep the DPI at 600 for the 1999 volume while keeping the benefits of the DJVU file spec's smaller file size at the same time. -- George Orwell III (talk) 01:05, 7 October 2011 (UTC)
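As a rough illustration of why the crop above helps, here is a small back-of-the-envelope sketch of how many pixels each page holds at 600 DPI. It only counts pixels and assumes a uniform raster; the function name `page_pixels` is made up for this note, and real file sizes depend on the encoder and the page content, so treat the ratio as an indication rather than a prediction.

```python
def page_pixels(width_in, height_in, dpi):
    """Number of pixels in a page of the given size (inches) at the given DPI."""
    return round(width_in * dpi) * round(height_in * dpi)

# Page sizes taken from the example above; 600 DPI as in the 1999 volume.
full = page_pixels(8.5, 11.0, 600)     # 5100 x 6600 pixels
cropped = page_pixels(5.5, 9.0, 600)   # 3300 x 5400 pixels

print(full)                        # 33660000
print(cropped)                     # 17820000
print(round(cropped / full, 3))    # 0.529
```

In other words, the cropped page carries a little over half the pixels of the uncropped one at the same DPI, before any compression is applied.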
Isn't one of the basic problems/strengths of DjVu that it uses compression like that of a JPG image, which makes the file smaller but also makes it lossy? [From a person who uses but doesn't dig into the technical side of djvu.] A couple of years ago in Scriptorium we looked at comparison examples and problems with the image compression, and this is why we do NOT use the djvu images directly or as a source for File:s. So it works well for text layers, and for a known place on a page, but it has weaknesses for the images themselves. We also see that in the difference between the black and white scans and the tonal scans such as those from the University of Toronto. Such is life. — billinghurst sDrewth 03:10, 7 October 2011 (UTC)
I beg to differ somewhat - djvu is excellent all around when the content is primarily black text on a white background. The only case that djvu deals with poorly, as mentioned, is when images are present along with the text. The problem is that rather than directly targeting those handful of pages containing images separately from the majority of plain text pages, most sites and/or utilities are set up for the opposite situation, where the assumption is that the content has an equal or greater number of images than it has pages of nothing but plain text. The resulting end product is consistently bloated/corrupted because it "treats" everything as if it were an image even when in reality only text is present. What we typically wind up with here on en.WS is a so-so 450 page document optimized for the 2 dozen pages that happen to have had an image on them, instead of an entirely optimized 450 page document with the 20 or so pages with images optimized separately from the bulk prior/during/after conversion.
The "tonal" backgrounds are also overkill - a feature used in some PDFs known as the background-layer option... and applied after-the-fact simply to mimic the "aging" of the so-called "original". This too bloats the file by thinking every background is an image unto itself if not removed/disabled prior to the DJVU conversion. The proper way to "ease white page eye-strain" or "preserve scanned paper aging" is to apply a single color-shade after the file has been converted to a djvu - not bring each background image along from the source file in conversion. You might as well use the PDF instead of its converted djvu cousin at that point. -- George Orwell III (talk) 04:59, 7 October 2011 (UTC)

The DjVu format supports a range of encodings, and you can use different encodings on different pages within a file. For pure text, the best compression comes from an encoder that matches glyphs against a glyph dictionary. This encoder is proprietary and you have to pay big money for it, so in general it is safe to assume that Wikisourcerers have no access. It is what the Internet Archive uses. Unfortunately they tend to push a little too hard for compression, use too few glyph classes, and end up folding characters into each other, which is a very very bad thing for us, who are trusting the image to proof against. Heaven knows what any2djvu uses, but they don't make it possible to specify encodings.
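To make the "too few glyph classes" problem concrete, here is a toy sketch. It is not the proprietary encoder or any real DjVu code: the 5x5 bitmaps, the function names and the pixel-difference threshold `max_diff` are all invented for illustration. Each glyph is assigned to the first dictionary class whose representative differs from it by at most `max_diff` pixels.

```python
def hamming(a, b):
    """Count the pixels at which two equal-sized bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))

def classify(glyphs, max_diff):
    """Assign each glyph to the first class within max_diff pixels of it."""
    classes = []      # one representative bitmap per class
    assignment = []   # class index chosen for each input glyph
    for glyph in glyphs:
        for index, representative in enumerate(classes):
            if hamming(glyph, representative) <= max_diff:
                assignment.append(index)
                break
        else:
            classes.append(glyph)
            assignment.append(len(classes) - 1)
    return classes, assignment

# Hypothetical 5x5 bitmaps, flattened row by row.
C = "".join([".###.", "#....", "#....", "#....", ".###."])
E = "".join([".###.", "#...#", "#####", "#....", ".###."])
O = "".join([".###.", "#...#", "#...#", "#...#", ".###."])
page = [C, E, C, O]

strict_classes, strict = classify(page, max_diff=0)
loose_classes, loose = classify(page, max_diff=5)
print(strict)  # [0, 1, 0, 2]
print(loose)   # [0, 0, 0, 0]
```

With the strict setting the three letters keep three classes; with the loose setting the "e" and the "o" are both drawn as the stored "c", which is the kind of folding that makes a scan untrustworthy for proofreading.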

In my experience, the best way to build a good, highly-compressed DjVu file for free is to identify which pages need full colour encoding, which grey-scale encoding, and which, if any, bitonal encoding (sometimes text can safely be encoded bitonally; sometimes it is too blocky that way). Then individually encode each page, before bringing them together in a single file. For pages that can be encoded as bi-tonal or grey-scale, first use an image processing package like ImageMagick to convert from colour, and threshold out as much of the background noise as possible. Then encode to DjVu. Once you have a complete file, hopefully you can cajole Any2DjVu to embed OCR in it; but Any2DjVu is notoriously flaky with big files. Hesperian 01:13, 9 October 2011 (UTC)
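The per-page workflow above might be sketched roughly as follows. This is only an outline under assumptions: it supposes the free DjVuLibre tools (`c44`, `cjb2`, `djvm`) and ImageMagick's `convert` are the encoders in use (on ImageMagick 7 the command is `magick`), the file names are hypothetical, and the threshold percentages and DPI are placeholders to be tuned per scan. The function only builds the command lines; it does not run them.

```python
def encode_commands(pages, dpi=300, output="book.djvu"):
    """Build per-page encoding commands for (scan_file, mode) pairs.

    mode is "colour", "grey" or "bitonal". Returns a list of argument lists.
    """
    commands = []
    page_files = []
    for number, (scan, mode) in enumerate(pages, start=1):
        stem = f"page{number:04d}"
        out = f"{stem}.djvu"
        if mode == "colour":
            # Full colour: convert to PPM, then wavelet-encode with c44.
            commands.append(["convert", scan, f"{stem}.ppm"])
            commands.append(["c44", "-dpi", str(dpi), f"{stem}.ppm", out])
        elif mode == "grey":
            # Grey-scale: drop colour and push a light background to white.
            commands.append(["convert", scan, "-colorspace", "Gray",
                             "-white-threshold", "85%", f"{stem}.pgm"])
            commands.append(["c44", "-dpi", str(dpi), f"{stem}.pgm", out])
        elif mode == "bitonal":
            # Bitonal: threshold to pure black and white, then encode with cjb2.
            commands.append(["convert", scan, "-colorspace", "Gray",
                             "-threshold", "60%", f"{stem}.pbm"])
            commands.append(["cjb2", "-dpi", str(dpi), f"{stem}.pbm", out])
        else:
            raise ValueError(f"unknown mode: {mode}")
        page_files.append(out)
    # Bundle the single-page files into one multi-page document.
    commands.append(["djvm", "-c", output, *page_files])
    return commands

# Hypothetical two-page book: a colour plate followed by a plain text page.
plan = encode_commands([("scan1.png", "colour"), ("scan2.png", "bitonal")])
for command in plan:
    print(" ".join(command))
```

Each list can be passed to `subprocess.run` once the per-page modes have been checked by eye. Embedding OCR is a separate step and is not covered here.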

Splitting DjVu files

Is there any way to split some pages from a DjVu file into a new file? djvused can only extract one page at a time.--維基小霸王 (talk) 16:45, 3 March 2018 (UTC)
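One possible workaround, sketched under the assumption that the DjVuLibre tools are installed: extract each page of the range with `djvused` (one page per call, using its `save-page-with` command) and then bundle the single-page files with `djvm -c`. The function name and the file names below are hypothetical, and the function only builds the command lines rather than running them.

```python
def split_commands(source, first, last, output):
    """Build commands that copy pages first..last of source into output."""
    commands = []
    parts = []
    for page in range(first, last + 1):
        part = f"part{page:04d}.djvu"
        # Save the selected page, with the files it includes, as its own file.
        commands.append(["djvused", source, "-e",
                         f"select {page}; save-page-with {part}"])
        parts.append(part)
    # Bundle the extracted pages into a new multi-page document.
    commands.append(["djvm", "-c", output, *parts])
    return commands

# Hypothetical example: pages 10 to 12 of volume.djvu into book2.djvu.
for command in split_commands("volume.djvu", 10, 12, "book2.djvu"):
    print(command)
```

The lists can be run one by one with `subprocess.run`. Whether the text layer and annotations survive as wanted should be checked on the result.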

@維基小霸王: My question would be more why would we want to split out pages from an existing file? At enWS we try to remain true to a published work, so why would we not upload the whole work? This is more in line with WS:WWI. If you have only particular interest in part of the work, then work on that part, the rest will be there for someone else to do in time. — billinghurst sDrewth 01:04, 4 March 2018 (UTC)
@Billinghurst: commons:Category:Wenyuange Siku Quanshu is a photocopy publication by the Taiwanese Commercial Press from the 1980s. Sometimes more than one book of w:Siku Quanshu is contained in one volume of it. It is better to upload them separately for two reasons:
  1. To separate unrelated books. Commercial Press seems to have combined and split books only to control the size of each volume, so they combined books that have no connection to each other. There is no need to do so on Wikimedia Commons. Uploading the books separately will make each upload more true to the original work rather than to the modern publication, and will also make it easier for users to download a particular book.
  2. Copyright reasons. I don't know if the press gets some kind of copyright by combining books, but they surely have the copyright for the picture of the Forbidden City on page 2 of each volume. Uploading the books separately with the copyrighted picture removed will make the upload more clearly public domain. --維基小霸王 (talk) 04:12, 4 March 2018 (UTC)
Be wary. While the text of a work may be out of copyright and able to be reproduced, a reproduction will probably have copyright for any production values and decisions, so the scans are likely not out of copyright. About the only time that they are is where there is an exact facsimile. — billinghurst sDrewth 06:10, 4 March 2018 (UTC)
They have printed two pairs of facing pages on one page. They also added page numbers on the bottom. Is that copyrighted? Should I contact the lawyer of WMF?--維基小霸王 (talk) 08:41, 4 March 2018 (UTC)