Wikisource:Scan Lab/Archives/2023-03

Warning Please do not post any new comments on this page.
This is a discussion archive first created in , although the comments contained were likely posted before and after this date.
See current discussion or the archives index.

c:Category:Book no. 6, Banguê Book Collection

This section was archived on a request by: Mpaa (talk) 22:31, 3 March 2023 (UTC)

Notifying all members of Scan Lab (more info · opt out): (User:Inductiveload, User:Xover, User:Mpaa) Could someone convert the images in this category into a DJVU file? I believe it's something simple, but I couldn't find a practical way to do it. Albertoleoncio (talk) 14:38, 1 December 2022 (UTC)

@Albertoleoncio In your defence: it's actually not that simple (each step is not that hard, but all together it's a bit of a pain). It's complicated by the fact that the images aren't named "lexicographically", so they don't naturally sort in the right order. You also have to download them all in the first place, which is cumbersome (I use a script; AFAIK there is no "official" solution provided by Commons). Then they need to be split using a tool like Scan Tailor, which is complicated by the pages not all being split in exactly the same place. And then you can use tools like djvm to convert them to DjVu pages and glue them all together.
In this specific case, the tight binding also makes it impossible to see all the text, as a lot of it has vanished into the binding, and the flat-bed scanning style hasn't helped (a v-cradle scanner opens the book less and thus doesn't hide so much of the text around the curve of the other pages).
However, I have split the pages as best I can: File:Banguê Book, number 6.djvu. I have also used a rather high image size to avoid over-compressing the delicate writing. I did not attempt to OCR it as it would just be junk. Hopefully it works for you!

Also, please could you review and fill in the information template on that file page, as I have done only a cursory first draft. As for the license, while the JPGs have been claimed to be CC-BY-SA, I think this work and photos of the pages of it are more properly licensed as Public Domain. Inductiveloadtalk/contribs 00:03, 5 December 2022 (UTC)
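As an illustration of the sorting problem described above, here is a minimal sketch of a "natural" sort that orders numbered filenames by their numeric value rather than character by character. The filenames are hypothetical and are not the actual names used in the Commons category.

```python
import re


def natural_key(name: str):
    """Split a filename into text and integer chunks so 'page 10' sorts after 'page 9'."""
    # re.split with a capturing group keeps the digit runs; text chunks land
    # at even positions and digit chunks at odd positions, so types never clash.
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]


# Hypothetical filenames (assumption, for illustration only).
names = ["Book page 10.jpg", "Book page 2.jpg", "Book page 1.jpg"]
ordered = sorted(names, key=natural_key)
print(ordered)
```

A plain `sorted(names)` would put "page 10" before "page 2", which is the ordering problem mentioned in the comment.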

@Inductiveload: I have no words to thank you for all this work. Downloading all the images was even easy using WLD (in German, but ok) and IDM (paid, but I already had a license for other reasons), but I had such a hard time with djvm (either it's outdated or it's not very clear) that I gave up and tried it in pdf, but I also gave up because I couldn't find any reliable tool that works offline and didn't charge me for the service.
If you know any manuals or step-by-step guides on how to do this process, that would be wonderful. In addition, I have already adjusted the license tags. Again, thank you very much! Albertoleoncio (talk) 00:55, 5 December 2022 (UTC)
@Albertoleoncio I'm glad you like it! There aren't any step-by-step guides that I know of, as there's not really any "one" process to do it. I have a completely horrible (and public) script at https://github.com/inductiveload/wstools/blob/master/wstools/make_document.py which is what I usually use. It's not a secret, but it's not really designed to work for anyone but me, as a completely bulletproof program would take too much time to write! There are two real "tricks" it uses: one is to convert the images to intermediate formats that can then be shifted to DjVu, and the other is taking the Tesseract OCR output as an hOCR file and then inserting it into the DjVu file in the right place. Neither is conceptually hard, but they're also not completely straightforward in practice.
You can obviously always ask here. If you have just a few works to do, you can also upload the images in a zip to the Internet Archive and they'll make a PDF with OCR for you automatically, but it'll take a few hours to generate. Inductiveloadtalk/contribs 16:21, 29 December 2022 (UTC)

Page:The Works of John Locke - 1823 - vol 01.djvu/76

This section was archived on a request by: Mpaa (talk) 22:31, 3 March 2023 (UTC)

Should be replaced with page 80 from [1]. MarkLSteadman (talk) 15:14, 1 January 2023 (UTC)

@MarkLSteadman done. Mpaa (talk) 21:11, 1 January 2023 (UTC)
@Mpaa Thank you for your help! Unfortunately it seems the text and images following the insertion are now off by one. Can you please shift the text forward one page starting at Page:The Works of John Locke - 1823 - vol 01.djvu/77 through Page:The Works of John Locke - 1823 - vol 01.djvu/341 (e.g. the text on 77 should be on 78)? MarkLSteadman (talk) 21:34, 1 January 2023 (UTC)
@MarkLSteadman My bad, I inserted the page and forgot to delete the bad one. On its way ... Mpaa (talk) 21:47, 1 January 2023 (UTC)
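For readers unfamiliar with this kind of request, a bulk shift like the one asked for above can be planned as a list of (source, target) page moves. This is only a sketch of the bookkeeping, not the tool actually used here; the function name `shift_plan` is hypothetical. The one real subtlety is ordering: when shifting forward, the moves must start from the last page so that no move lands on a page that has not been moved yet.

```python
def shift_plan(first: int, last: int, offset: int = 1):
    """Return (source, target) page moves, ordered so no move overwrites an unmoved page."""
    pages = range(first, last + 1)
    # Shifting forward: start from the last page. Shifting backward: start from the first.
    order = reversed(pages) if offset > 0 else pages
    return [(page, page + offset) for page in order]


# Page range taken from the request above: /77 through /341, shifted forward by one.
plan = shift_plan(77, 341)
print(plan[0], plan[-1], len(plan))
```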


Index:Historic highways of America (Volume 12).djvu

This section was archived on a request by: Mpaa (talk) 22:31, 3 March 2023 (UTC)

Another case of two pages missing, pages 113 and 114 in this case. Note the jump from 112 to 115 at the moment, so no pages have been added twice in their places, they are just plain old missing. As always, thanks for the continued help with this series, TeysaKarlov (talk) 21:14, 11 February 2023 (UTC)
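A gap like the jump from 112 to 115 can also be found mechanically when the printed page numbers are available as a list. The following is a minimal sketch; the input list is made up for illustration and the function name is hypothetical.

```python
def missing_pages(labels):
    """Return the page numbers skipped between consecutive numeric page labels."""
    gaps = []
    for prev, cur in zip(labels, labels[1:]):
        # Anything strictly between two neighbouring labels is missing.
        gaps.extend(range(prev + 1, cur))
    return gaps


# Illustrative labels around the gap described above (assumed, not read from the scan).
missing = missing_pages([110, 111, 112, 115, 116])
print(missing)
```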

@TeysaKarlov can you please link images for replacement? Mpaa (talk) 21:58, 21 February 2023 (UTC)
Hi,
Sorry, I didn't realise I should do this (@Languageseeker usually seems to handle that sort of thing). Please add just the page including the large map from here https://archive.org/details/cu31924088422716/page/n115/mode/2up on page 113 of the Wikisource index and then a page without text on page 114 (any of the other blank pages in the current scan will do, if you need to "physically" put a page in).
Thanks,
TeysaKarlov (talk) 19:25, 22 February 2023 (UTC)
@TeysaKarlov: done. Mpaa (talk) 18:20, 24 February 2023 (UTC)
@Mpaa Thanks! TeysaKarlov (talk) 19:55, 24 February 2023 (UTC)

Index:المختصر في حساب الجبر والمقابلة.pdf

This section was archived on a request by: Mpaa (talk) 22:31, 3 March 2023 (UTC)

A user has uploaded this Arabic dual-language work, but the English-language pages are in right-to-left order, so transclusion is going to be an issue, as right-to-left page order isn't going to work on a base English-language website. I think that we are going to need to extract all the English-language pages, feed them in in the reverse order, and create a new file, then rebuild in either PDF or DjVu. I would suggest renaming the file to "The Algebra of Mohammed Ben Musa" to match the title. After that is done, we can move the existing pages. — billinghurst sDrewth 01:12, 23 February 2023 (UTC)

@Billinghurst: done, see Index:The Algebra of Mohammed Ben Musa (1831).djvu.
Mpaa (talk) 21:53, 24 February 2023 (UTC)
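The reordering proposed in the request can be sketched as follows. This shows only the page-selection logic, under the assumption that each page of the uploaded file has been tagged by language; the page list is hypothetical and does not reflect the real file's layout, and it is not necessarily how the fix was actually carried out.

```python
# Hypothetical page list: (position in the uploaded PDF, language of the page).
# The real file's layout is not described in the thread; this is an assumption.
pdf_pages = [(1, "ar"), (2, "ar"), (3, "en"), (4, "en"), (5, "en")]


def english_in_reading_order(pages):
    """Pick out the English pages and reverse them into left-to-right reading order."""
    english = [pos for pos, lang in pages if lang == "en"]
    return list(reversed(english))


new_order = english_in_reading_order(pdf_pages)
print(new_order)
```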

Index:The church, the schools and evolution.djvu

Notifying all members of Scan Lab (more info · opt out): (User:Inductiveload, User:Xover, User:Mpaa) IA Upload generated a DJVU file from the JP2 scans, but the images are offset from the OCR. Can you please fix the DJVU file? —Beleg Tâl (talk) 15:21, 30 March 2023 (UTC)

@Beleg Tâl: Done Xover (talk) 17:50, 30 March 2023 (UTC)
Thank you! You're the best :) —Beleg Tâl (talk) 17:54, 30 March 2023 (UTC)
This section was archived on a request by: --Xover (talk) 20:17, 31 March 2023 (UTC)

Eginton glass

Please can someone split File:Reference to some of the works executed in stained glass - William Raphael Eginton.pdf, into its original seven pages? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 20:45, 22 March 2023 (UTC)

@Pigsonthewing: File:Reference to some of the works executed in stained glass - William Raphael Eginton.djvu. Note that you'll have odd pagination due to not scanning the blank page after the title (it looks like the pagination is 1:title, 2:blank, 3:first text page, and then numbered pp. 4–8). I generally also recommend scanning all parts of the physical work: outside and inside covers, blank pages, etc. It's more faithful to the published artefact, and it can sometimes help a lot with orientation and quality control when doing things like this (or setting up the pagelist, or...). Xover (talk) 10:04, 2 April 2023 (UTC)
Thank you. I didn't scan this work. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 16:23, 2 April 2023 (UTC)
This section was archived on a request by: --Xover (talk) 18:12, 2 April 2023 (UTC)

Urbiztondo Ordinance No. 8 (2012)

Notifying all members of Scan Lab (more info · opt out): (User:Inductiveload, User:Xover, User:Mpaa) Three files on Commons (1, 2, 3) need to be concatenated. There are some moves and deletions to accomplish after the fact, but those aren’t necessary now. Should be File:Urbiztondo Ordinance No. 8 (2012).pdf. TE(æ)A,ea. (talk) 02:03, 24 March 2023 (UTC)

@TE(æ)A,ea.: Will File:Urbiztondo Ordinance No. 8 (2012).djvu do? Xover (talk) 09:31, 2 April 2023 (UTC)
This section was archived on a request by: TE(æ)A,ea. (talk) 16:42, 3 April 2023 (UTC)

Scalo marittimo and poor scan quality

Notifying all members of Scan Lab (more info · opt out): (User:Inductiveload, User:Xover, User:Mpaa) Hi! I scanned my first book directly to a PDF file, but the quality is horrible. The scanner was set to 200x200px/in and "shades of grey". When I view the file on my computer, with Document viewer or Evince, the text is clear and with a good resolution, but on Commons (and on Wikisource) the quality is horrible. See for yourself: Scalo marittimo - Raffaele Viviani - 10 commedie.pdf. What should I do? Why is there an issue with the uploaded file? Thanks --Ruthven (talk) 13:57, 30 March 2023 (UTC)

@Ruthven: That looks like generational loss and multiple rounds of too aggressive lossy compression. In particular, your original scan was of relatively low resolution, lost fidelity when a colour scan was crushed into a grayscale colourspace, and then most likely had a lossy compression algorithm applied. On the Wikimedia side, this then gets extracted from the PDF by having ghostscript "print" the page to a JPEG, with a new round of lossy compression, with output size picked by dumbly looking at the first physical page in the PDF, and then rescaled by Thumbor to the different size "thumbnails" that are needed (including the size used for the page name on Wikisource), and with yet another lossy decompression/compression step.
That is, there's plenty of blame to go around: MediaWiki's handling of PDFs is rather poor; the PDF format is actually also very poor for this kind of job (unless you put massive effort into it, and even then it's kinda crap); and your scan choices didn't give us the best starting point.
So to get better results you can try to tweak all of these factors as far as you're reasonably able. Start with scanning in full colour, and in as high a resolution as your computing hardware will conveniently support (300dpi is minimum; 1200dpi and 2400dpi are not unheard of). Scan to TIFF (or PPM) or, in a pinch, to JPEG, and make sure you either use a lossless compression algorithm or use the highest quality settings. I strongly recommend DjVu over PDF as the container document format, but for most people DjVu means you'll have to request converting images to DjVu here (which I'm happy to do, but is presumably a lot less convenient for you) since there are few user-friendly tools for this. If you go with PDF here, make sure whatever software you use does not apply any lossy compression or scale down the page images (as most consumer tools do by default). Treat the raw scan files as masters, and the PDF as a generated product.
With the above you've already given us a much better starting point that should let MediaWiki produce better output despite still having poor PDF handling. The final way we can affect that is making sure the first physical page of the PDF has roughly the same resolution (x/y and dpi) as the majority of other pages in the PDF. MediaWiki is dumb, so if the first page is much larger or much smaller than the rest, you'll get loops of upscaling and downscaling that affect the final displayed image badly.
If your computing hardware (mostly storage space, usually) forces you to make tradeoffs, the priorities should be 1) colour scans, 2) no lossy compression, 3) raw resolution. Or put another way, treat colour scans and lossless compression as the minimum and then crank up the scan resolution as much as you practically can. If your output images are somewhere in the range 2000–4000px by 3000–6000px (i.e. from 2000x3000 to 4000x6000) you're in the typical range from "average" to "high quality". Below that and you start being more susceptible to quality loss from other sources; above that range and the quality returns are diminishing (there's only so much information to be extracted from a printed book). Xover (talk) 09:16, 2 April 2023 (UTC)
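To make the numbers above concrete, here is a small sketch that converts a scan resolution into pixel dimensions and compares the result with the 2000x3000 to 4000x6000 range quoted in the comment. The 6 x 9 inch page size is an assumption chosen for illustration, and the function names are hypothetical.

```python
def pixel_size(width_in: float, height_in: float, dpi: int):
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def rate(width_px: int, height_px: int) -> str:
    """Compare pixel dimensions with the 2000x3000 to 4000x6000 range quoted above."""
    if width_px < 2000 or height_px < 3000:
        return "below typical range"
    if width_px > 4000 or height_px > 6000:
        return "above typical range"
    return "within typical range"


# Assumed page size of 6 x 9 inches (not taken from the thread).
results = {}
for dpi in (200, 400, 1200):
    size = pixel_size(6, 9, dpi)
    results[dpi] = (size, rate(*size))
    print(dpi, size, results[dpi][1])
```

Under that assumed page size, a 200dpi scan falls well short of the quoted range, which is consistent with the advice to rescan at a higher resolution.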
@Xover Thank you for the very clear answer. I'll try to do the following: rescan at higher resolution to an image format (I'll see what the scanner handles), and then I'll try to convert these files into a DjVu (but I don't know how to add the text layer). However, I've seen that pdf2djvu exists. If I have a correct PDF at the beginning, does MediaWiki handle the DjVu format better?
The Yogyakarta Wikisource group is using CamScanner from a mobile app, and the results are better than mine: Category:Hibah Dana Wiki Komunitas Yogyakarta 2022. Do you think that's a valid path to follow? The process seems easier and quicker. Ruthven (talk) 12:36, 2 April 2023 (UTC)
@Ruthven: Yeah, the need to add OCR is why I say there aren't any user-friendly tools for end users to create DjVu files from scratch. I have some custom tooling to do it, but it isn't really suitable for normal people to use. But as mentioned, if you upload a zip file of the raw images somewhere I'd be happy to do it for you (just can't promise response time).
Scanning to a PDF and then converting that to a DjVu is a bad idea, as each conversion is going to involve recompressing and reencoding, and each time you do that you lose some fidelity. It's possible that the net result you see on Wikisource will be slightly better in some cases due to MediaWiki's problems with PDFs, but that's going to be edge cases and you'll still have reduced the quality of the file (it just won't be immediately obvious that that's what's happened).
I'm not familiar with CamScanner, but since most current iPhone cameras produce roughly 3000x4000 pixel images it should be possible to get decent resolution with a phone. For the so-called 48MP phones the resolution of the sensor should certainly be enough. But phone cameras use a lot of computational photography techniques that may not necessarily be what you want. The CamScanner website suggests it can be made to scan to image files (I think, it's not entirely clear) so that might be an option to explore. With sufficient resolution and lossless compression settings the PDF produced might also be good enough that MediaWiki's derived thumbnails are of acceptable quality.
If you can get good enough results and the workflow is convenient for you, CamScanner certainly sounds like an option you should consider. Xover (talk) 13:21, 2 April 2023 (UTC)
This section was archived on a request by: --Xover (talk) 10:53, 3 April 2023 (UTC)

Bulandshahr

Notifying all members of Scan Lab (more info · opt out): (User:Inductiveload, User:Xover, User:Mpaa) Could a DJVU be generated of [[2]] for inclusion in the April MC? Languageseeker (talk) 22:50, 30 March 2023 (UTC)

@Languageseeker: Done File:Bulandshahr (1884).djvu Xover (talk) 09:32, 1 April 2023 (UTC)
This section was archived on a request by: --Xover (talk) 11:43, 4 April 2023 (UTC)

Index:Historic highways of America (Volume 13).djvu

Hello, and Notifying all members of Scan Lab (more info · opt out): (User:Inductiveload, User:Xover, User:Mpaa) page 81 of this index is missing, and page 82 marked without text, although I expect the image in the original text crossed two pages. The closest to complete version of the image I have yet found is here: https://archive.org/details/in.ernet.dli.2015.87573/page/n83/mode/2up, although a little is still cut off, e.g. https://babel.hathitrust.org/cgi/pt?id=mdp.49015002227636&view=1up&seq=85 or https://archive.org/details/cu31924088422724/page/n83/mode/2up. Perhaps add the first archive link as p. 81, and the first hathi link as p. 82, and hopefully with some image editing magic, it will be all good once transcluded. Thanks, TeysaKarlov (talk) 19:42, 13 March 2023 (UTC)

@TeysaKarlov: Done. New version uploaded with the missing map patched in from the IA link. Xover (talk) 15:34, 3 April 2023 (UTC)
@Xover Many thanks! TeysaKarlov (talk) 20:04, 3 April 2023 (UTC)
This section was archived on a request by: --Xover (talk) 14:44, 4 April 2023 (UTC)

The Midland Naturalist

each contain scans of two volumes; could someone split them into separate files (at the indicated pages of the PDF), please? No need to keep the originals, so far as I am concerned, and you might want to lose the Google Books cover sheets at the start and end, at the same time.

Additionally, volume 12 has some duplicated pages; see pp 341- and pp.353-. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 15:28, 12 March 2023 (UTC)

@Pigsonthewing:
Please check that they are ok. They're generated from alternate scans so there's no guarantee all pages are present etc. I'll keep the source files around for a short while before freeing up the disk space, and can fairly easily tweak the files in that span (move pages around, add missing pages, remove extra pages, etc.). --Xover (talk) 12:29, 2 April 2023 (UTC)
Thank you. I will check them as soon as I get chance. I have a transcription started for vol 1, and there is a plate (which should be page 189) missing, before or after this blank page. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 18:59, 3 April 2023 (UTC)
This section was archived on a request by: --Xover (talk) 07:01, 6 April 2023 (UTC)

Great Qing Legal Code

Notifying all members of Scan Lab (more info · opt out): (User:Inductiveload, User:Xover, User:Mpaa) The scan I've been working on, Index:Ta Tsing Leu Lee; Being, The Fundamental Laws, and a Selections from the Supplementary Statutes, of the Penal Code of China.djvu, is missing page 41 of the work. Another scan that was also started, Index:Ta Tsing Leu Lee (1810).pdf, has that page, along with other differences, mostly in layout. It would be helpful if page 41 could be added from that scan into the first one, and maybe the two files' transcribed text merged. The only page I can remember that I know can be merged from the 2nd scan is lxiii, which is a title page. Reboot01 (talk) 03:05, 23 March 2023 (UTC)

@Reboot01: Look closer. A cursory scan shows p. xv is also missing. We can patch it up, but we'll need a full check of pages missing and then identify specific sources for alternate pages. Xover (talk) 09:37, 2 April 2023 (UTC)
Page xv does appear to be there, it's just been misplaced, it was labeled as xvii, and was in the wrong location. I've modified the index to show this Reboot01 (talk) 17:59, 2 April 2023 (UTC)
@Reboot01: I've patched it up for you, which was a heck of a job since in addition to the misplaced p. xv and missing p. 41, this scan was missing blank pages at DjVu index 39, 65, 67, 127, 247, 271, 347, and 555 that were all part of the pagination (didn't you notice that the page numbers were off?). Please check scans in more detail before making requests here so that we can more efficiently fix scans. Xover (talk) 10:37, 3 April 2023 (UTC)
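For anyone checking a scan before filing a request, the effect of missing blank pages on the page order can be sketched as below. This is an illustration of the bookkeeping only, not the tooling used for the repair. It assumes the listed positions are 1-based indexes in the repaired file, which the comment above does not state explicitly; the short page list is made up.

```python
def insert_blanks(pages, blank_positions, blank="blank"):
    """Insert a placeholder so that it sits at each 1-based position of the result."""
    result = list(pages)
    # Ascending order keeps earlier insertions from disturbing later target positions.
    for pos in sorted(blank_positions):
        result.insert(pos - 1, blank)
    return result


# Tiny made-up example: four scanned pages, blanks wanted at positions 2 and 4.
scan = ["a", "b", "c", "d"]
fixed = insert_blanks(scan, [2, 4])
print(fixed)
```

Each missing blank shifts every later page by one, which is why page numbers drift further off the deeper into the scan one looks.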
This section was archived on a request by: --Xover (talk) 07:01, 6 April 2023 (UTC)