User talk:Inductiveload




WELCOME to my user talk page. Feel free to leave me a message if there is a problem, if you would like my help, or for anything else.

I am also active on Commons. If you would like help with a file I uploaded or would like me to make a file for you, please ask at my user talk page there. If the request is Wikisource-centred, ask here.

Anything you write on this page will be archived, so please be polite (I will be more amenable then) and don't write anything you will regret later! My purpose here is to make interesting and useful documents open to the public. I am never trying to make trouble, and any problems can almost certainly be resolved quickly and easily if everyone stays calm.

Please sign your posts by typing four tildes (~~~~) after your post, and continue conversations where they start. This helps to keep discussions coherent for future readers! If I leave a message on your page, please continue the conversation there. My replies to messages on this page will be here.

Archives
Older discussions from this page are stored in archives. Especially interesting or detailed topics are kept in sub-archives. Below is a list of all archives and sub-archives:


TUSC token 918678eb806b340b1408614584aaa822

I am now the proud owner of a TUSC account!

Problematic overlapping sidenotes

What do you suggest to avoid the overlap? Page:The_Laws_of_the_Stannaries_of_Cornwall.djvu/122 ShakespeareFan00 (talk) 20:44, 15 August 2020 (UTC)

Scan

Hello. If commons:File:The Building News and Engineering Journal, Volume 22, 1872.djvu has a text layer (and I do not know whether it does), it does not load in the editing window in the page namespace (which I accessed by previewing the index page). I also get a message saying "ws_ocr_daemon robot is not running" when I try to use the OCR button. I do not know what to do about this. I would be grateful if someone could assist. James500 (talk) 06:09, 6 September 2020 (UTC)

@James500: It is likely that what you observe is due to a bug in MediaWiki's DjVu handling that is triggered by certain kinds of invalid or pathological text data in a DjVu file. I have regenerated the file in question from the source scans and uploaded the new file over the old one. Try again and see if you get the OCR text loaded now.
Regarding the OCR button in the wikitext editor, this is a known issue and is unlikely to be resolved soon. If you often use that function, I recommend turning on the Google OCR gadget in your preferences: it works the same, only using Google's internal OCR software, but is generally robust and available. --Xover (talk) 15:20, 6 September 2020 (UTC)
Thank you. The OCR is loading now. James500 (talk) 21:09, 6 September 2020 (UTC)
Thank you @Xover: for the help! Do you have a script to repair borked DjVu files, perhaps by nuking the text layer on invalid pages? It's something I have never done myself. For reference, the DjVu text-layer bug is (I think) phab:T219376 (reported by Xover).
Even when the OCR tool is working, it can still be useful to have the Google OCR on hand, as sometimes one of them works better than the other, especially for text in columns. Inductiveloadtalk/contribs 11:09, 7 September 2020 (UTC)
I grab the original scan .jp2 files and manually convert them to JPEG with GraphicsMagick, and then have a custom script that puts them in the right order, runs tesseract on each page to generate hOCR structured text, converts the page JPEG to DjVu, converts the hOCR to sexpr, adds the sexpr to the page DjVu, and then compiles the page DjVus into a new DjVu book. I also have some related utilities to redact individual whole pages of an existing .djvu file, and some premade images and .djvu components to aid in manually inserting placeholder pages or redacting parts of pages.
None of this is very user friendly or documented (you basically need to be me to easily use it), but I'm happy to share the code on request. I have a long-term todo to set up an interactive web frontend for manipulating DjVu files, but my todo is already waaaay too long. (I also want to figure out a way to take full advantage of the DjVu format's features to optimize file size as well, but…)
PS. Just for reference, I didn't go for nuking the existing text layer for a couple of reasons. One is that some manipulations would then work on multiply-encoded image data, and would compound the problem when re-encoding it afterwards (and most DjVus from IA etc. are already very aggressively compressed). By starting from "pristine" sources the end result, both image and OCR, will be better. The other is that the text layer is fragile and MediaWiki's extraction even more so, so it's safer to generate it from scratch. My script that converts from hOCR to sexpr is reinforced against certain classes of bug in tesseract and tries to guarantee that the resulting sexpr data is valid. phab:T219376 is just one bug, which manifests as OCR being offset relative to the scan images; the main culprit here is phab:T240562, where MW fails to extract the text layer at all. --Xover (talk) 13:05, 7 September 2020 (UTC)
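
(For reference, a minimal sketch of that kind of per-page pipeline, driving stock tesseract and djvulibre tools from Python. This is not Xover's script: the filenames, dpi value and glob pattern are illustrative, and hocr_to_sexpr() is a hypothetical helper; one possible shape for it is sketched further down this thread.)

import subprocess
from pathlib import Path

def build_djvu(jpeg_dir: str, out: str = "output.djvu", dpi: int = 400) -> None:
    """Per page: OCR with tesseract (hOCR output), encode the image with c44,
    attach the text layer with djvused, then bundle everything with djvm."""
    page_files = []
    for jpeg in sorted(Path(jpeg_dir).glob("*.jpeg")):
        base = str(jpeg)[: -len(jpeg.suffix)]           # path minus the ".jpeg"
        # structured OCR text; tesseract writes <base>.hocr
        subprocess.run(["tesseract", str(jpeg), base, "hocr"], check=True)
        # lossy IW44 encoding of the page image
        page_djvu = base + ".djvu"
        subprocess.run(["c44", "-dpi", str(dpi), str(jpeg), page_djvu], check=True)
        # hocr_to_sexpr() stands in for the hOCR -> DjVu sexpr conversion
        sexpr = hocr_to_sexpr(Path(base + ".hocr").read_text())
        if sexpr:                                       # skip pages with no text
            sexpr_file = base + ".sexpr"
            Path(sexpr_file).write_text(sexpr)
            subprocess.run(["djvused", page_djvu,
                            "-e", f"select 1; set-txt {sexpr_file}", "-s"], check=True)
        page_files.append(page_djvu)
    # compile the single-page files into one bundled book
    subprocess.run(["djvm", "-c", out] + page_files, check=True)
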
@Xover: ah right, I thought maybe you were just hot-fixing the file rather than regenerating from scratch. In that case I understand that the script would be pretty complex. Thanks for the reference to the correct issue. Inductiveloadtalk/contribs 07:54, 8 September 2020 (UTC)
It's not really that complex; it's just not very polished and lacks documentation. It's not user friendly, even for technical people, is what I'm saying. But it's not rocket science by any means. --Xover (talk) 08:08, 8 September 2020 (UTC)
@Xover: I'd be grateful if I could take a look, as I have a set of JPGs I could do with turning into a DjVu and I don't currently have a handy script to do that (plus I haven't got a battle-hardened OCR layer mechanism). Inductiveloadtalk/contribs 14:57, 30 September 2020 (UTC)
I'll find somewhere to dump it and write up some kind of guidance. Might not be until this weekend though. --Xover (talk) 16:33, 30 September 2020 (UTC)
Thanks! It doesn't have to work or even be very well documented, so don't feel it needs to be too detailed. I have half a system; I just need to figure out the hOCR->DjVu OCR step, and I don't know the gotchas like you do. Inductiveloadtalk/contribs 17:37, 30 September 2020 (UTC)
Also, do you do the fancy IA text/background separation? (I think that's how they get their massive compression: a bitonal text layer over a lower-res page background.) Just a straight c44 isn't particularly impressive in terms of text quality vs. file size. Inductiveloadtalk/contribs 10:33, 1 October 2020 (UTC)

┌──────────────────────────┘
Code is now in my sandbox.

I convert .jp2 to either JPEG or PBM manually using GraphicsMagick (gm mogrify -format jpeg '*.jp2') (PBM for scans without images that Google has already crushed, because the loss of fidelity doesn't matter and the size savings are worth it; but not all inputs will produce usable PBM output, so always check these). Then I run this script as "hocr2sexpr *.jpeg", which spits out a file called "output.djvu". I haven't bothered with any real command-line argument handling, so I hardcode the output filename and unconditionally leave behind all the temporary files for easier debugging. Changing the text language also requires modifying the code just now. All this will get command-line switches once I get around to it.

If you want to run this directly you'll probably have to install some dependencies (all four of the modules up top are non-core I think, and they have additional deps in turn). On macOS all of them are available through Homebrew, and they should be available in most package managers on Linux. I have no idea of the state of things on Windows.

I've commented the code so you should be able to navigate it reasonably well if you just want to grab the core hOCR/sexpr logic. It presupposes that you're using a push-parser for the HTML, and reusability will be much lower if you're using a pull or pseudo-DOM parser. Feel free to ping me if I can help with anything.

Regarding the separated DjVuDocument files, that is indeed what IA does (well, did). I've looked a bit at it, but there are no finished command-line tools for working with these, so you'd need to partially implement support for the component file formats. Given the relatively low resolution of current scans it is also hard to automatically extract the text image without making it unreadable for manual analysis (several cases of completely unreadable text here seem to be the result of non-optimal DjVu compression from IA). However, with all those caveats in mind, supporting this would do wonders for our file size and interactive performance, and I see no inherent reason we shouldn't be able to come up with settings that address both concerns (possibly with manual per-scan hinting; there's usually a high degree of conformity between pages within a single scan).
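
(For reference, a minimal sketch of that separation using only stock djvulibre tools: cjb2 for the bitonal mask, c44 with -mask for the background, and djvumake to assemble the page; Pillow is assumed only for reading the page size. The slice numbers are placeholders, and the hard part, producing a good bitonal text mask as a PBM, is assumed to have been done already.)

import subprocess
from PIL import Image

def build_separated_page(jpeg: str, mask_pbm: str, out_djvu: str, dpi: int = 400) -> None:
    """Bitonal text mask (JB2) over a more heavily compressed IW44 background.
    mask_pbm must be a PBM the size of the page with the text in black."""
    width, height = Image.open(jpeg).size
    # encode the text mask as a JB2 layer
    subprocess.run(["cjb2", "-dpi", str(dpi), mask_pbm, "sjbz.djvu"], check=True)
    # encode the page image as the background, ignoring pixels under the mask
    # and spending fewer slices on it, since the fine detail lives in the mask
    subprocess.run(["c44", "-dpi", str(dpi), "-mask", mask_pbm,
                    "-slice", "74+10+9", jpeg, "bg.djvu"], check=True)
    # assemble a compound page: the mask painted solid black over the background
    subprocess.run(["djvumake", out_djvu, f"INFO={width},{height},{dpi}",
                    "Sjbz=sjbz.djvu", "FGbz=#000000", "BG44=bg.djvu"], check=True)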

Other fancy stuff I want to look at "some day" is a tool to automatically straighten crooked pages, and intelligently crop them. Possibly even to split double-page spreads automatically. The algorithms for these should be pretty straightforward (if we ignore pathological cases), except that they'll need a lot of knobs to adjust due to differences between scans.
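
(The straightening part, at least, can be approximated with a dumb projection-profile search; a sketch assuming Pillow and numpy, with the angle range, threshold and thumbnail size being exactly the kind of knobs mentioned above:)

import numpy as np
from PIL import Image

def estimate_skew(path: str, max_angle: float = 3.0, step: float = 0.1) -> float:
    """Brute-force skew estimate: try small rotations of a binarised thumbnail
    and keep the angle whose horizontal projection profile has the highest
    variance, i.e. where the text lines are most level."""
    img = Image.open(path).convert("L")
    img.thumbnail((1200, 1200))                              # downscale for speed
    ink = (np.asarray(img) < 128).astype(np.uint8) * 255     # white ink on black
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step / 2, step):
        rows = (np.asarray(Image.fromarray(ink).rotate(angle)) > 0).sum(axis=1)
        score = float(rows.var())
        if score > best_score:
            best_angle, best_score = float(angle), score
    return best_angle

# applying it: Image.open(path).rotate(estimate_skew(path), expand=True, fillcolor="white")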

Oh, and PS., I have a toy webservice set up (on WMCS) to interactively run Tesseract on a page here like the Phe and Google OCR gadgets. It's pretty hacky just now, and it'll break regularly as I mess with it, but if you want to play with that just let me know (or poke through my user .js etc. on noWS where I've been testing it for multi-lingual support). I'm hoping to get it good enough to replace Phe's OCR gadget, and add a few nice-to-have features like automatically preserving paragraphs and unwrapping hard-wrapped text. It can also conceivably be a good vehicle for attaching various OCR fixup scripts, but I haven't gotten to the point of looking at those yet. Mainly I'm stalled on integrating the fancy OOUI-based user interface stuff with the purely OCR-related code, which will balloon the number of lines of code disgustingly but probably won't really be difficult so much as fiddly.
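
(The unwrapping part is the more tractable bit; a naive sketch, which joins the lines of each paragraph and rejoins end-of-line hyphens, and which would mangle genuine hyphenated compounds without a dictionary check:)

import re

def unwrap(text: str) -> str:
    """Join hard-wrapped OCR lines into paragraphs. Blank lines separate
    paragraphs; a trailing hyphen is treated as a line-break hyphen."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        joined = ""
        for line in lines:
            if joined.endswith("-"):
                joined = joined[:-1] + line      # re-join the hyphenated word
            elif joined:
                joined += " " + line
            else:
                joined = line
        paragraphs.append(joined)
    return "\n\n".join(paragraphs)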

Mentioning it in case anything catches your interest. Most of my capacity for attention is already oversubscribed IRL so I probably won't be able to give any of this any sustained attention any time soon. Happy to share any ideas and code though. --Xover (talk) 12:21, 3 October 2020 (UTC)

@Xover: Thank you very much! I will have a look and see what I can learn. My attempt so far is User:Inductiveload/make_djvu.py, which seems to have worked "OK" in regenerating the scans below and a few other British Library/Hathi works for which I only have image scans. The main issue is that it tends to produce quite large files; it has a command-line parameter to set a maximum file size, which is effective enough, but it produces mediocre image quality due to the lack of text-layer separation. The hocr->sexp step is a bit of a hack but appears to work, believe it or not. The biggest problem I encountered was detecting empty cols/paras/lines, which djvused rejects.
One more thing I think one could detect and mitigate is the "bad pages" mentioned below, which should stand out like a sore thumb to any kind of OpenCV-style algorithm due to the heavy dark border around the page on three sides. Inductiveloadtalk/contribs 17:35, 3 October 2020 (UTC)
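
(A heuristic along those lines is cheap even without OpenCV; a sketch assuming Pillow and numpy, with margin width and darkness thresholds that are pure guesses and would need tuning per batch:)

import numpy as np
from PIL import Image

def looks_like_junk_page(path: str, margin_frac: float = 0.08,
                         dark_level: int = 80, dark_frac: float = 0.6) -> bool:
    """Flag pages with heavy dark borders on three or more sides, the
    tell-tale of the scanner 'junk' pages mentioned above."""
    img = np.asarray(Image.open(path).convert("L"))
    h, w = img.shape
    mh, mw = max(1, int(h * margin_frac)), max(1, int(w * margin_frac))
    strips = [img[:mh, :], img[-mh:, :], img[:, :mw], img[:, -mw:]]  # top, bottom, left, right
    dark_sides = sum((strip < dark_level).mean() > dark_frac for strip in strips)
    return dark_sides >= 3
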
Apart from being written in Python (*hack*, *spit*), it looks good to me. The only thing is that using XPath queries to pull out what you want is fragile in the face of changing input (hOCR is a "living standard" in constant evolution, and Tesseract's implementation has changed several times since 4.0 was released). With a push parser you'll get fed everything (including classes and elements we haven't seen before) and can devise a sensible strategy for dealing with it (debug logging unexpected input, say). --Xover (talk) 19:56, 3 October 2020 (UTC)
@Xover: heretic :-p. It's definitely the weakest part of the chain; I might consider a more robust version if it were on a webserver, as opposed to on demand with a -v flag spewing debug output. Looks like I'm also missing ocr_textfloat and ocr_caption. But it does use every thread on hand, so it'll keep the room warm in winter.
It's a shame the IA's derivation script isn't available; I'd like to see it (or maybe it is and I just haven't found it). Inductiveloadtalk/contribs 20:13, 3 October 2020 (UTC)
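
(Since the hOCR-to-sexpr step keeps coming up in this thread: a compact sketch of one way to do it with Python's standard-library push parser. It flips the y axis, since DjVu text coordinates have their origin at the bottom-left corner while hOCR's are at the top-left; keeps ocr_textfloat and ocr_caption alongside ocr_line; and drops the empty words and lines that djvused rejects. It is not the script linked above, just an illustration.)

from html.parser import HTMLParser

def _esc(text):
    return text.replace("\\", "\\\\").replace('"', '\\"')

class _Hocr2Sexpr(HTMLParser):
    """Keep only pages, line-ish elements and words from tesseract hOCR and
    emit the DjVu hidden-text s-expression for the page."""

    LINEISH = {"ocr_line", "ocr_header", "ocr_textfloat", "ocr_caption"}

    def __init__(self):
        super().__init__()
        self.page_w = self.page_h = 0
        self.lines = []            # [(line bbox, [(word bbox, word text), ...])]
        self.words = None
        self.word_bbox, self.word_text, self.word_depth = None, [], 0

    @staticmethod
    def _bbox(attrs):
        for part in dict(attrs).get("title", "").split(";"):
            fields = part.split()
            if fields and fields[0] == "bbox":
                return tuple(int(v) for v in fields[1:5])
        return None

    def handle_starttag(self, tag, attrs):
        if self.word_depth:                      # nested markup inside a word
            self.word_depth += 1
            return
        classes = set(dict(attrs).get("class", "").split())
        box = self._bbox(attrs)
        if not box:
            return
        if "ocr_page" in classes:
            self.page_w, self.page_h = box[2], box[3]
        elif classes & self.LINEISH:
            self.words = []
            self.lines.append((box, self.words))
        elif "ocrx_word" in classes and self.words is not None:
            self.word_bbox, self.word_text, self.word_depth = box, [], 1

    def handle_data(self, data):
        if self.word_depth:
            self.word_text.append(data)

    def handle_endtag(self, tag):
        if not self.word_depth:
            return
        self.word_depth -= 1
        if self.word_depth == 0:
            text = "".join(self.word_text).strip()
            if text:                             # drop empty words
                self.words.append((self.word_bbox, text))

    def _flip(self, box):
        # hOCR y grows downward from the top left; DjVu's origin is bottom left
        x0, y0, x1, y1 = box
        return x0, self.page_h - y1, x1, self.page_h - y0

    def sexpr(self):
        parts = []
        for lbox, words in self.lines:
            if not words:                        # drop empty lines: djvused rejects them
                continue
            wparts = []
            for wbox, text in words:
                x0, y0, x1, y1 = self._flip(wbox)
                wparts.append(f'(word {x0} {y0} {x1} {y1} "{_esc(text)}")')
            x0, y0, x1, y1 = self._flip(lbox)
            parts.append(f'(line {x0} {y0} {x1} {y1} {" ".join(wparts)})')
        if not parts:
            return ""                            # blank page: set no text at all
        return f'(page 0 0 {self.page_w} {self.page_h} {" ".join(parts)})'

def hocr_to_sexpr(hocr_html: str) -> str:
    parser = _Hocr2Sexpr()
    parser.feed(hocr_html)
    return parser.sexpr()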

Other scans

I have had similar problems with commons:File:The Register of Pennsylvania, Volume 1.djvu, commons:File:Hazard, The Register of Pennsylvania, Volume 3.djvu, commons:File:Hazard, The Register of Pennsylvania, Volume 1.djvu, and commons:File:Hazard's United States Commercial and Statistical Register, Volume 5, 1841.djvu. James500 (talk) 08:24, 24 September 2020 (UTC)

@Xover: hmm, do you think this could be related to the IA item having junk pages like this one in the "processed" JP2 archive? Not sure there's much one can do about it, other than regenerate the file offline. Inductiveloadtalk/contribs 10:19, 1 October 2020 (UTC)
The junk scan images are one major triggering factor for this, yes. IA seems to be maintaining per-image info somewhere else that lets it ignore these images (possibly that JSON file you found), but not in the XML file that ia-upload uses (iirc). --Xover (talk) 12:23, 3 October 2020 (UTC)
@James500: files regenerated from JP2s. Drop any more you need here. Inductiveloadtalk/contribs 15:11, 1 October 2020 (UTC)

Fancified borders...

I did up Page:Little Ellie and Other Tales (1850).djvu/168 mostly as an experiment to figure out the techy bits for myself. When you have a moment I'd appreciate it if you could take a quick look and comment on what you think in terms of the technical approaches, suitability for e-readers, and so forth. I mainly just cribbed your code for the fancy border so there shouldn't be anything particularly new or innovative lurking in there. --Xover (talk) 14:07, 11 September 2020 (UTC)

It looks perfectly serviceable to me. Probably the only nitpicky graphical "defect" I can see is that the edge width is a few pixels too narrow, so it clips the inner leaves slightly. Where it repeats (halfway down the edge) there is a small alignment defect, but it's a tricky one to get exactly right without very careful manipulation, and you wouldn't notice it unless you were looking for it. Newes of the Dead was easier as it was a repeating element to start with.
WRT e-readers, it will not work, because the CSS refers to an online resource by URL and e-readers generally do not fetch them. The solutions I can think of are:
  • Use a "normal" border as a simple fallback and live with that (Newes does this)
  • Modify ws-export to rewrite TemplateStyles URLs, either as base64 data or by bundling the resource like any other image and changing the URL to a local file: phab:T256780
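
(For the second option, the base64 flavour boils down to rewriting each url(...) in the stylesheet as a data: URI, along the lines of the sketch below; this is only an illustration of the idea, not ws-export code.)

import base64
import re
from urllib.request import urlopen

def inline_css_urls(css: str) -> str:
    """Replace each url(...) reference in a stylesheet with a base64 data: URI
    so the styles keep working offline (e.g. inside an EPUB)."""
    def replace(match):
        url = match.group(1).strip("'\"")
        if url.startswith("data:"):
            return match.group(0)                    # already inlined
        with urlopen(url) as resp:
            mime = resp.headers.get_content_type()
            payload = base64.b64encode(resp.read()).decode("ascii")
        return f'url("data:{mime};base64,{payload}")'
    return re.sub(r"url\(\s*([^)]+?)\s*\)", replace, css)
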
The other issue I can see with this technique is that the CSS usage of the files doesn't show up at Commons, so it's a little vulnerable to silent breakage if the files are changed or deleted; at the least, I think making a note on the file's description page is a good idea. Inductiveloadtalk/contribs 17:04, 11 September 2020 (UTC)
Thanks! Border width fixed, file tagged, and Phab subscribed to. Regarding the fallback: I'm using effectively the same TS stylesheet as Newes, including the non-fancified border there. Did I miss something? --Xover (talk) 17:44, 11 September 2020 (UTC)
The fallback is the border colour: a thick grey border in Newes:
	/* need to set this width first to provide space for image */
	/* ereaders (and other offline devices) will see only this, as they don't
	   have access to the CSS url source */
	border: 50px solid LightGrey;
If you'd like a thin border like 1px solid black, you probably need to add padding to make up the shortfall between "1px" and the border image size. The "nominal border" runs through the centre of the image slices, so if the padding plus half the border width is less than the actual image border width, the content can overlap the image. Inductiveloadtalk/contribs 18:33, 11 September 2020 (UTC)

Amplify

Why wouldn't it be OK to make a soft redirect on Amplify to w:Samplify? It makes sense; see Q: Are We Not Men? A: We Are Devo! Some users could type in the correct title (s:amplify) and end up here. The redirect would help them find their way back. Gioguch (talk) 21:37, 27 September 2020 (UTC)

@Gioguch: because it is not the purpose of Wikisource to provide soft redirects to Wikipedia in its main namespace. There are numerous cases of those sorts of issues, and these are the quirks that we face. What you should be doing is lodging a phabricator: ticket to fix the issue that, if a page exists, the interwiki map link is not followed. — billinghurst sDrewth 22:22, 27 September 2020 (UTC)

Match & Split

…now that the Match & Split bot is down (hopefully just until someone gives it a kick) it reminds me that we need to start thinking about systematically replacing all of Phe's tools. I've made a start at the OCR gadget, but that's frankly just because I had some code sitting around and it was easy; for most folks the Google OCR is plenty good enough here (and Community Tech may replace both with something better anyway).

But Match & Split is critical for ever making any appreciable dent in our ever-growing non-scan-backed backlog, and it desperately needs a few quality-of-life and user-friendliness improvements. I've cast only fleeting glances at the code and not really understood it (not just because it's in Python: Phe has structured the code around some kind of internal pseudo-SOA architecture that I've not cracked yet), but maybe you will have better luck there.

Incidentally (and I don't think you have the free time or necessarily the inclination for it), Phe's code is public and can be forked, and the existing phetools.toolforge.org can be usurped since Phe is non-responsive. If you should be so inclined, the option is there. I'd be happy to help there, but my Python-fu being what it is, I can't in good conscience take the lead on that.

In any case… I think we need to work systematically towards making sure enWS (and, ideally, all the Wikisourcen) have access to the set of functionality Phe's tools provide today (when they're not broken). My first iteration on such a list would be:

  • Interactive per-page OCR
  • Match & Split
  • Per-project statistics (so we can see what we're doing)
  • Cross-project / comparable statistics (so we can see how we're doing relative to others)

And of these Match & Split is the short/medium-term highest priority in my mind. --Xover (talk) 08:02, 18 October 2020 (UTC)

@Xover: Agreed that we should try to get some of these maintained/able, and M&S is maybe a priority.
With the new shiny Toolforge/Kubernetes stuff, is there any advantage to all this being one huge phetools tool, or would it be better to have much more granular tools, e.g. match_and_split and ocr? Inductiveloadtalk/contribs 19:57, 24 October 2020 (UTC)
I think the current phetools is one tool because it shares a lot of the plumbing between functionalities, but that may of course have been a design shaped by the infrastructure available at the time. Going forward I would default to having separate tools (on Toolforge) or even separate services (on WMCS) for each distinct functional bit. There are better ways to share code and "separation of concerns" is IMO a strong principle. If we go the WMCS route (may be overkill / have drawbacks) it will also permit separate resource allocation for each tool. --Xover (talk) 20:07, 24 October 2020 (UTC)