User talk:Xover


Hathi scan request

Would you be able to grab Al Aaraaf for me, in order to scan back Al Aaraaf, Tamerlane and Minor Poems ? —Beleg Tâl (talk) 15:52, 31 January 2020 (UTC)

@Beleg Tâl: What's the copyright status of the front matter? PD-US-no-notice? --Xover (talk) 17:14, 31 January 2020 (UTC)
I don't see a notice, so {{PD-US-no-notice}} appears correct. —Beleg Tâl (talk) 17:19, 31 January 2020 (UTC)
@Beleg Tâl: I also find no registration in 1933 or 1934, nor a renewal in 1960 or 1961, so we should be good either way. --Xover (talk) 17:26, 31 January 2020 (UTC)
@Beleg Tâl: Done: File:Al Aaraaf (1933).djvu. I've done minimal checking. Ping me if it's borked in some way. --Xover (talk) 17:51, 31 January 2020 (UTC)

Request to convert PDF

May I ask you to convert File:Guide through Carlsbad and its environs.pdf into djvu for me? I used to convert PDF files into djvu using some online converter, but recently I have been receiving very bad results from them (they often leave some pages blank during the conversion). I also tried downloading a converter, but its results were even worse. Thanks! --Jan Kameníček (talk) 22:46, 2 February 2020 (UTC)

@Jan.Kamenicek: File:Guide through Carlsbad and its environs.djvu --Xover (talk) 08:58, 3 February 2020 (UTC)
Great, thanks very much! It is really bad that contributors are still forced to choose between struggling with bad PDF extraction in Mediawiki and struggling with DJVU conversion, without any Wikimedia help with either of these two problems. Not only does it slow the work down, but it makes it so difficult that it is IMO one of the biggest obstacles to getting new contributors here. --Jan Kameníček (talk) 09:10, 3 February 2020 (UTC)



3 styles... If you can advise further let me know.. 22:43, 5 February 2020 (UTC)

Controversial request...

Do we have partial blocks on English Wikisource yet?

Owing to some concerns I have about my ability to effectively code certain templates, I was going to ask for some kind of limitation on my account, so that I HAVE to request changes to templates via talk pages, rather than edit the templates directly.

In effect, this would be a self-block request from Template namespace (but not Template Talk:), until I feel able to not make typos and syntax failures, which have tragically led to the kinds of unreasonable behaviour in "edit-comments" and on Scriptorium.

I can of course try to stop editing Templates manually, but ....

ShakespeareFan00 (talk) 11:08, 6 February 2020 (UTC)

@ShakespeareFan00: Partial blocks were deployed to enWS in June last year; we just haven't updated our policies to reflect it.
But I am hesitant to implement this request: when you know you should not do something, technical measures should not be needed to enforce it. In this case, it is by your own judgement that you should not edit in the Template: namespace so refraining from doing so should be easy to adhere to. If it will help you, I can add a strong admonishment to refrain from editing there: not because I've actually seen you make any edits that would merit that, but because it clearly causes you great frustration. We have endless backlogs of all sorts so there should be plenty of other things to do that will bring you pleasure rather than frustration. --Xover (talk) 19:21, 6 February 2020 (UTC)


This seems to be another instance of P wrapping weirdness. There should be normal paragraph spacing between the end of the paragraph at the top of the page and the continuation paragraph following it. There is apparently a reduced spacing, more like that of a normal line (even though I'm not changing any top or bottom margins). Can you possibly sanity check what I'm doing in the template code underlying this to make sure I am not overlooking some blindingly obvious logic failure or typo? ShakespeareFan00 (talk) 22:33, 7 February 2020 (UTC)

These may be related to the long-standing dolevels and P wrapping glitch that's on Phabricator (see T134469 ) ShakespeareFan00 (talk) 22:50, 7 February 2020 (UTC)
Wow.. I took a VERY careful look at my code, and after moving where the anchors were placed... no spacing issues. Next question: how to insert the required leading, given that P and DIV have different initial margins. I really really need to check my old code more often. ShakespeareFan00 (talk) 23:26, 7 February 2020 (UTC)
@ShakespeareFan00: The tendency of MediaWiki to aggressively insert P tags combined with very detailed styling will make that very difficult to get consistent. It will very quickly devolve to needing a "whole page style" (to reset things like margins) that templates and TemplateStyles are a poor fit for.
I also have to note, I took a quick look at the templates in use here and ran away screaming. This is very complicated code for what is, superficially, relatively simple visual formatting. My hunch (and I could very well be wrong here) is that this is indicative of another area where you're fighting limitations of both MediaWiki and what HTML/CSS supports. That's partially coloured by a previous look I had at what would be the "theoretically correct and semantic" way to mark up this kind of content and concluding that HTML/CSS just doesn't offer the facilities to do that properly.
I'm thus not really sure I can offer any useful advice on this. This kind of text is on my big and unsorted "todo" list of things to try to figure out at some point, but I very rarely work on this kind of content so it's likely to get pushed back for other projects. --Xover (talk) 09:40, 8 February 2020 (UTC)
It's certainly feasible to do it in CSS. It's just that the MediaWiki overheads add additional conflicts/concerns. If it wasn't for the paged nature of the content, I would have suggested looking into CSS counters so that numbering becomes a content attribute for a P or DIV as part of a ::before rule, assuming that's something MediaWiki actually supports in TemplateStyles. ShakespeareFan00 (talk) 11:36, 8 February 2020 (UTC)
(Sigh) I clearly can't write rules that work... In this page the text-indentation default now overrides the rule I set to make it zero for a continuation paragraph. Having to write several different classes for starts vs continuations says this template's design is fundamentally flawed somehow. I can reformat the existing usages. There aren't that many, but you've potentially lost me as a long term contributor to this project, if well-intentioned (and in this instance actually thought through) approaches are going to continually have to fight limitations in the tools. ShakespeareFan00 (talk) 12:11, 8 February 2020 (UTC)
The tools are what they are. We can do what we can within the limits of what they allow, or we can keep fighting them and being perennially frustrated. I heartily recommend the former approach. There are plenty of things to do here that do not involve templates or numbered paragraphs at all. --Xover (talk) 13:53, 8 February 2020 (UTC)
BTW My reasons for redoing this is so that eventually I don't have to implement a new template for each 'different' level, I just add the relevant classes to the stylesheet. (BTW there isn't technically anything stopping someone else "classing" the section titles into floats, with suitable margin shifts ...)ShakespeareFan00 (talk) 14:53, 8 February 2020 (UTC)


So something is, as I said previously, being mis-expanded, but despite the conversion to Lua code, I can't figure out where. You'd paused on the deletion of this to await feedback from another user, which still hasn't arrived. Either the glitch can be tracked down in the code (time consuming) or the template needs to be completely re-written so the 'train-wreck' is conclusively resolved. ShakespeareFan00 (talk) 07:36, 8 February 2020 (UTC)

Regarding Index talk:History of Oregon volume 1.djvu

To keep relevant discussions clear, I've responded there only to the decision that was already reached. But your general point is well taken, and next time I'm starting work on a new transcription with page-spanning footnotes, I'll probably adopt your convention. (We're too deep into this work for an easy transition, unless somebody wants to get clever with AWB or something.) -Pete (talk) 20:23, 9 February 2020 (UTC)

@Peteforsyth: Ah. I was wondering why the page contained only fragments of a discussion. :)
I just read that bit as a question and wanted to clarify the (lack of) policy point and threw in the advice while I was at it in case it was still relevant. If you want the page to function more as a style guide then feel free to refactor my comment (or remove it entirely). --Xover (talk) 05:18, 10 February 2020 (UTC)
Others responded to me on my user talk page and, I think, in an edit I can see how it would come across that way. If there's a recommended format for documenting conventions for a particular work, I'd be happy to see it and emulate it...I appreciate your validation of our approach, and the advice. -Pete (talk) 05:28, 10 February 2020 (UTC)

Quote templates (from User talk:Chrisguise)

Moved because OT…

@Levana Taylor: Would it be useful to have {{“ ‘}} and {{’ ”}} (like {{" '}}) to emit these combinations with hair spacing? Slightly easier to type and ditto logical. --Xover (talk) 20:09, 9 February 2020 (UTC)
There was some back-and-forth at the time of changing quote policy but nothing decided; one opinion (which I share, I think) was that it would be preferable to have a single template that works with any pair of quotes. In any case, {{sp}} is just a stopgap which can be changed by bot when we settle on something. Levana Taylor (talk) 20:42, 9 February 2020 (UTC)
@Levana Taylor: I meant: Would you like me to create those two templates for you? We can't have such templates for every combination of quotation mark out there, but these two are very common and should have wide applicability. If you use those combinations a lot on OAW and having these templates would help a little then I would be happy to make them for you. The "savings" relative to just using {{sp}} may not be worth the effort of incorporating new ones into your muscle memory, but I just wanted to note the option should you prefer that approach. --Xover (talk) 05:12, 10 February 2020 (UTC)
Yes, actually it’d be a good idea to have those three (not two) templates including {{“ ’}}. I’ll bet some people have been wondering why they don't exist. Levana Taylor (talk) 06:22, 10 February 2020 (UTC)
I come back to the guidance amendment which was not meant to encourage users to move towards more extensive use. People may wonder why policy change discussions can get bitter, when the boundaries are moved, and stretched and moved again. Remind me next time that this is what is going to happen, and I will be vociferous about opposing such initial changes as it seems that the defence needs to be made at the very beginning. — billinghurst sDrewth 09:54, 10 February 2020 (UTC)
@Billinghurst: I don't think I catch your meaning here? --Xover (talk) 10:00, 10 February 2020 (UTC)

Index talk:Blenheim Column of Victory

Sorry to bother you about something, but this has seemingly sat for 6 months. I noticed it when doing some Lint error repair work...ShakespeareFan00 (talk) 19:00, 10 February 2020 (UTC)


thank you unsigned comment by Anayguy (talk) 15:02, 11 February 2020‎ (UTC).

You're welcome. :) --Xover (talk) 16:12, 11 February 2020 (UTC)

DjVu from Google

Are you able to pull a scan from Google and generate a DjVu for upload to Commons? With the disappearance of Wikilivres, I've found that The Poems of Sappho copy that we had was (ironically) a small fragment of the complete text. It would be very valuable to have The Poems of Sappho hosted here, and a scan would be the obvious first step. --EncycloPetey (talk) 16:50, 15 February 2020 (UTC)

@EncycloPetey: Nope, sorry. Google does not provide access to this book, at least not here. HathiTrust has it, but with limited access. Not sure about the status because I find some comments to the effect that it is downloadable in the US, so that is probably worth checking. --Xover (talk) 18:09, 15 February 2020 (UTC)
It looks as though I have the option to download a PDF. Would that be a sufficient starting point for you? I could upload it here as a temporary file if that would be enough for you to work from. --EncycloPetey (talk) 18:40, 15 February 2020 (UTC)
@EncycloPetey: I can work from PDF, yes. This will slightly degrade the image quality due to double encoding—which may be an issue with heavily compressed low-resolution scans—but usually works fine. --Xover (talk) 18:52, 15 February 2020 (UTC)
I expect problems anyway because of the quoted Greek passages, but will upload to File:Poems of Sappho (Cox 1924).pdf The only additional work required is that the first page is a Google notice that should be removed (without substitution) since it alters the odd/even page numbering. With the notice, the title page is an even page, but it should be an odd page. The converted DjVu can be uploaded to Commons, and the local pdf deleted.--EncycloPetey (talk) 18:56, 15 February 2020 (UTC)
@EncycloPetey: Done. File:The Poems of Sappho (1924).djvu and Index:The Poems of Sappho (1924).djvu (do, of course, feel free to tweak the Index to your preference: it's set up as a convenience, not an expression of opinion :)). On the Greek, this was the best I could do. It's not my area, but it looks to be roughly as good as the English text, or at worst a little bit worse. It will certainly need a competent hand (i.e. not mine!) to correct it in any case.
Please let me know if you find anything that needs tweaking. I can swap around images, or insert placeholders etc., and regenerate the DjVu fairly easily now while I have the source files sitting around. There are also some knobs I can tweak on the OCR to try to get better results if there are specific problems, but absent anything pathological the current version is probably within spitting distance of how good we can get it. --Xover (talk) 06:58, 16 February 2020 (UTC)
Thanks! --EncycloPetey (talk) 14:52, 16 February 2020 (UTC)

Re images: I see that pages 56 & 57 of the text (DjVu pages 62 & 63) are facsimile printings of a specific first edition text. These would be worth having as images. --EncycloPetey (talk) 16:51, 16 February 2020 (UTC)

@EncycloPetey: File:The Poems of Sappho (1924), p.56.png, File:The Poems of Sappho (1924), p.57.png. --Xover (talk) 20:07, 16 February 2020 (UTC)

template:do not move to Commons

Hi. Noticed that you have changed the parameter of this template on some works (eg. [1]) from expiry to expires … nada. Also to note that it is configured for the year of movement, rather than the year of expiry, ie. aimed at the 1 January date rather than the 31st Dec date. Yes it is different <shrug> we survive. — billinghurst sDrewth 06:07, 22 February 2020 (UTC)

@Billinghurst: Thanks. I always struggle to keep that parameter name straight, mostly because |expiry= is both bad grammar and reads awkwardly in a mnemonic sense. In the edit you link I would guess I was doing general cleanup and at some point changed my mind about some modification to the Commons template, "restoring" it by hand and ending up using my flawed recollection of what the parameter name is. I see |expires= isn't even a valid alias for |expiry= so that's not just pointless but actually broken too.
However, on the right year to use I'm confused. The docs clearly say (and the template code reflects) to use the last year of the copyright term in that parameter, but here you seem to be saying to use the first year after the copyright has expired? --Xover (talk) 06:36, 22 February 2020 (UTC)
I have later boot prints on the template, primarily around being able to have a parent category, and subsidiary works. I have nothing on the original nature of its design. Yes, it is done so the expiry shows in the new year when you can move it—so it is +1—which makes sense for its "voila!" moment, though confusing against the YoD for PD templates. <shrug> — billinghurst sDrewth 10:25, 22 February 2020 (UTC)
@Billinghurst: I'm sorry, but I'm not following. Are you saying you prefer to use the template in contravention of its documentation and the semantics of that parameter in its code? If so, what is the effect you are trying to achieve? --Xover (talk) 13:50, 22 February 2020 (UTC)
I am explaining what the template does, if the documentation does not match the action then the documentation is wrong. The template takes the year of expiry, not the last year of life of copyright as the PD-old-nn series takes. And not exactly enchanted with how you worded your statement. — billinghurst sDrewth 10:22, 23 February 2020 (UTC)
@Billinghurst: If my message caused offence then I apologise: that was certainly not my intention! I merely intended to indicate that I do not understand your preceding message, and to ascertain your intended meaning.
The documentation is unequivocal: it says {{Do not move to Commons|expiry=_last year of copyright_}}. But regardless of the documentation, if you look at the code of the template it is also clearly intended to be used with the last year of the copyright term. The parameter is used to display the "do not move banner" and to place the page into a "not suitable for commons year" category, both of which are removed or hidden once the copyright has expired (once current year is larger than the year given in |expiry=) and replaced by the category "media now suitable for commons".
So what I'm trying to figure out is what effect you're trying to achieve, partly because, unless what you want is in direct conflict with its design, it seems likely that the template can be modified such that it gives you that effect without abusing (and, I stress, I here use that term in a purely technical sense!) the semantics of the parameter. --Xover (talk) 11:00, 23 February 2020 (UTC)
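To make the two readings of |expiry= concrete, here is a minimal Python sketch of the rule as described in the comment above: the banner and the year category apply until the current year is larger than the year given. The function name and the exact category strings are hypothetical stand-ins for illustration, not the template's actual code.

```python
def commons_status(expiry_year: int, current_year: int) -> dict:
    """Model the rule described above: the restriction lapses once the
    current year is larger than the year given in |expiry=."""
    expired = current_year > expiry_year
    if expired:
        category = "media now suitable for commons"
    else:
        # Hypothetical category name, for illustration only.
        category = f"not suitable for commons {expiry_year}"
    return {"show_banner": not expired, "category": category}


# Reading 1: |expiry= is the last year of the copyright term (here 2020).
print(commons_status(2020, 2020))
print(commons_status(2020, 2021))

# Reading 2: |expiry= is the year the file can be moved (here 2021).
print(commons_status(2021, 2021))
```

Under this model, giving the last year of copyright drops the banner on 1 January of the following year, while giving the year of movement keeps the banner up for one additional year.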

Index:Public School History of England and Canada

Found some scans of a possibly identical edition:-

Is there a process for doing a replacement? ShakespeareFan00 (talk) 12:07, 22 February 2020 (UTC)

@ShakespeareFan00: No particular process, especially since this index is for individual image files rather than a DjVu and PDF. But if we can determine that it is the same edition I can generate a DjVu file with OCR and move the Index over. --Xover (talk) 13:53, 22 February 2020 (UTC)
It is the same edition, but it's not a simple replacement, as the copy is missing the title page, which is present in the JPEG scans currently on Wikisource. Any djvu file might need to be manually patched for the title pages... ShakespeareFan00 (talk) 10:00, 23 February 2020 (UTC)
Also - which IS also identical apart from the rear cover (and has the title pages) ShakespeareFan00 (talk) 10:00, 23 February 2020 (UTC) - Amongst these editions there must be one that's identical. ShakespeareFan00 (talk) 10:00, 23 February 2020 (UTC)

Index:Cox - Sappho and the Sapphic Metre in English, 1916.djvu

Page realignments needed.. Thanks... ShakespeareFan00 (talk) 20:55, 28 February 2020 (UTC) Now Done ShakespeareFan00 (talk) 22:59, 28 February 2020 (UTC)

Poet Lore, volume 4

Hello. I would like to ask you for help with File:Poet Lore, volume 4, 1892.pdf. There are three pages missing, which I have extracted from some other copy (which has different pages missing). Two of them are here and they come before the title page with the poem by Tennyson (which is currently the 5th page of the file and should move to become 7th page of the file). The third missing page is frontispiece and is here. It should come before the first page of No. 1 (with the text A Modern Bohemian Novelist…). Could you then also convert the file into djvu, please?

Thanks very much. --Jan Kameníček (talk) 09:35, 4 April 2020 (UTC)

Only now I have noticed the notification above. No problem, nowiki duties have priority. Hope you are fine. --Jan Kameníček (talk) 12:47, 4 April 2020 (UTC)

Hyphenation with italics...


Based on some concerns I had I wrote some minimal test cases...

None of the current approaches is ideal, because of how the tags around the start get interpreted, they collapse into a bold with a stray ', which is clearly not what is typically desired.

The parser here IS working as designed, but the combined italics over a Page gap might be something that needs to be looked at again. (The other concern arises within follow refs as well.) ShakespeareFan00 (talk) 11:27, 10 May 2020 (UTC)

@ShakespeareFan00: I don't think I'm understanding correctly what problem you are trying to address here? Why can't the simple approach that I just added to your test cases be used? --Xover (talk) 17:25, 10 May 2020 (UTC)
That worked :) It's always the small things.. It means I can update the Help: page accordingly. {{hwe}} {{hws}} are not now needed so I can update accordingly..ShakespeareFan00 (talk) 19:38, 10 May 2020 (UTC)

List with a DIV..


This is getting silly. If Mediawiki can't handle very simple things like placing a DIV based template inside a list, then certain parts of the code responsible for handling that kind of markup need to be completely re-thought. This isn't a new issue, it is PRECISELY the issue that was reported about 2-3 years ago. (And it has been known about in a related form since at least 2007!) - Head-meeting desk repeatedly (sigh). ShakespeareFan00 (talk) 01:00, 11 May 2020 (UTC)

@ShakespeareFan00: Again I'm not sure what the specific problem you're seeing is (your testcase page contains lots of things that may or may not be the issue you're concerned about). But for div inside list items, the most obvious issue isn't the div as such, but rather extraneous newlines in the template. Due to html whitespace rules these are usually effectively ignored, but in certain contexts newlines have semantics for the MediaWiki parser. Lists being one such context: list items in MediaWiki cannot contain raw newlines due to the simplified list syntax relative to html lists. I've removed the extraneous newlines in the EB1911 template you used as a test case just as a demonstration.
Also, in general, it often isn't an issue of MediaWiki being unable to handle whatever the issue is; but rather that when you have an extremely simplified syntax like wikimarkup, that's used to generate relatively complex things like full-blown html, you're going to run into limitations and tradeoffs. The lack of end tags in wikimarkup makes it impossible for the parser to function without inference, and when inference rules start stacking you can't easily tweak one without knock-on effects for others. This stuff is hard, in addition to suffering under Wikimedia's lack of resources and Wikipedia blinders. --Xover (talk) 05:39, 11 May 2020 (UTC)
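As a rough illustration of the newline point above, here is a deliberately simplified Python model of line-based list parsing. It is an assumption-laden toy, not MediaWiki's parser: it only assumes that consecutive lines starting with `*` belong to one list and that any other line closes it. The function names are hypothetical.

```python
from typing import List


def split_lists(wikitext: str) -> List[List[str]]:
    """Toy model: consecutive '*' lines form one list; any other line closes it."""
    lists: List[List[str]] = []
    current: List[str] = []
    for line in wikitext.split("\n"):
        if line.startswith("*"):
            current.append(line.lstrip("*").strip())
        elif current:
            # A raw non-list line terminates the list that was open.
            lists.append(current)
            current = []
    if current:
        lists.append(current)
    return lists


def page_with(template_output: str) -> str:
    """Place some (hypothetical) template output inside the second list item."""
    return f"* first\n* {template_output}\n* third"


print(split_lists(page_with("<div>x</div>")))
print(split_lists(page_with("<div>\nx\n</div>")))
```

In this model the single-line output leaves one list of three items, while the output containing raw newlines cuts the second item short and splits the rest into a separate list, which is the kind of breakage that removing extraneous newlines from a template avoids.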
Which in respect of that specific template was the problem I was trying to solve. ShakespeareFan00 (talk) 08:10, 11 May 2020 (UTC)
The issue about line breaks inside list items is mentioned at w:Help:List#Nested_blocks_inside_list_items, but do normal people actually read (or know to look for) documentation which may not be on the same wiki as the one they are editing on? As I've said in the past, it would be nice if the inference rules were formally documented, so that I and other contributors are not relying on a specific line in a Help: page (which may not even be on the same wiki) to find what they MIGHT be, as opposed to what they actually are. The notes there also don't take into account the DIV SPAN SPAN DIV whitespace handling issues that have caused confusion elsewhere. I am aware that this is a long standing issue.

(ASIDE: Converting the PRE block to a SYNTAXHIGHLIGHT resolves the issue of line feeds in respect of source code examples. Generally if it's multi line source code it should be using the latter not the former now. Where would be the best place to document this?) ShakespeareFan00 (talk) 08:25, 11 May 2020 (UTC)

The second issue concerns the generation of extra markers.. The formatting should (ideally) be the same for the internal conversion vs that placed inline ? ( I am wondering if what's being generated isn't the same code.) ShakespeareFan00 (talk) 08:46, 11 May 2020 (UTC)
And that proved to be correct. :)
*  Item 
** Sublist.

Doesn't open a new list for the sub list. Subtle, but easy once it's understood. ShakespeareFan00 (talk) 08:58, 11 May 2020 (UTC)

Long term, UKSI formatting...


Requesting a review of the approach here, and the template family concerned. (The intent is to EVENTUALLY replace the need to have direct numbering except for proofreading and do it all with CSS counters if they are supported.)

Not urgent, but the approach here should allow for some simplification of the mess that some of the higher level templates are.

It may, and I say may also be possible to use something like this to make {{numbered div}} more usable, or even ultimately rescue the {{cl-act}} family. ShakespeareFan00 (talk) 22:01, 12 May 2020 (UTC)

@ShakespeareFan00: The approach looks generally fine, though quite complex. You need to keep in mind that you can end up in a situation where the solution has so many nuances and complexities that it ends up being just an extra level of abstraction and complexity. I don't know the source material this is intended for very well, and so I don't really have any good sense of whether that's a risk here, but it's one factor to keep in mind when designing such things.
I would also strongly caution against relying on CSS counters or any other algorithmic way to generate content when the details of that content matters (as paragraph numbering in legal texts does). Every single time that content is rendered the numbering is generated anew, and thus every single time there is a potential that something can go wrong (changes in MediaWiki's parser, changes in web browser CSS engines, different web browsers, etc. etc.). It also means it can change over time due to changes in the standards that define those algorithms. And algorithms are inherently more complex than hardcoding the content in the first place. CSS counters are close to programming complexity, but even HTML numbered lists share a lot of this type of fragility.
In addition, if we by some method generate part of the content of the page, we hide that content from those editing the page (which may be someone doing maintenance or looking to reuse parts elsewhere). That may be a worthwhile tradeoff if we gain a lot of value from it (which it looks like the UKSI family may well do), but it's another factor to keep in mind. I recall with horror the template—was it modern? I can't recall—that wrapped almost every other word in a template with multiple arguments, resulting in the whole page being just a soup of markup. If you've ever seen raw PostScript data… That's a perfect (extreme) example of why this is a problem.
That doesn't mean we can't use these approaches at all, but it does mean we need to be careful to not fall into the trap of making the solution so fancy that it defeats the purpose. --Xover (talk) 06:13, 13 May 2020 (UTC)

Play by Synge

Could you please create a DjVu for J. M. Synge's play The Playboy of the Western World from (external scan)? It was published in 1907 (Dublin) [1912 reprint] and the author died in 1909, so there should be no issues uploading to Commons. We have a dearth of works by Irish authors. --EncycloPetey (talk) 21:02, 14 May 2020 (UTC)

Since Xover is on semi-wikibreak, I queued this using the IA Upload tool. It should show up soon here: File:The Playboy of the Western World.djvu If there are quality issues, Xover may know better than me how to correct them, but this should at least get things started. -Pete (talk) 21:07, 14 May 2020 (UTC)
Will that work if there is no DjVu available at IA? --EncycloPetey (talk) 21:13, 14 May 2020 (UTC)
Yes, the IA Upload tool has the ability to generate a DJVU based on the JP2 files at Internet Archive. It takes a bit longer to process (a few hours, I'd guess). By the way, I noticed that Synge's work "Riders to the Sea" appears to be quite significant as well, so I also uploaded that one and began match & split. I'm not 100% sure the transcription is for the edition it claims, though, as the transcription has a list of "Characters" where the original has a list of "Persons". Anyway, hopefully any differences are minimal and easily detected. -Pete (talk) 21:26, 14 May 2020 (UTC)
@EncycloPetey, @Peteforsyth: For simple cases with little need for manual page-fiddling and such, the computer does most of the work (I just remove the extra scan reference images at the beginning and end, and any botched images interspersed in the image series). I can usually find the time to grab the download and set it processing even when I'm otherwise busy. Case in point, I've just set the computer to crunching this scan so it should have a DjVu ready for upload by the time I'll be sufficiently caffeinated tomorrow morning. Let me know if you want it or if you prefer the ia-upload version (ia-upload grabs the OCR text from IA, which uses ABBYY FineReader, which some people prefer to the results Tesseract produces). --Xover (talk) 21:42, 14 May 2020 (UTC)
OK, thanks for the background, and sorry if I jumped the gun. I'll not interfere further on this one. -Pete (talk) 21:54, 14 May 2020 (UTC)
This one looks like a simple case; just the extra scan reference images from front and end to be removed. I couldn't say which OCR is to be preferred. --EncycloPetey (talk) 22:19, 14 May 2020 (UTC)
@EncycloPetey: I should have used the IA upload tool to remove page 1. I will remove the first and last pages tomorrow and upload a new version. -Pete (talk) 02:42, 15 May 2020 (UTC)
@Peteforsyth: If I've understood Xover's comment above, then he's already generating a replacement DjVu. --EncycloPetey (talk) 03:24, 15 May 2020 (UTC)
@Peteforsyth: Never worry about simply trying to be helpful! That's always going to be appreciated, and if well-meaning assistance should ever mess anything up I'll be sure to let you know (as I hope you will for me too). In this particular instance I very much doubt EncycloPetey minds having multiple options to choose from, and, as mentioned, it cost me very little effort so it would hardly be a waste worth mentioning even if it ultimately went unused.
@EncycloPetey: (and Pete) I took the liberty of uploading the DjVu I generated over the one Pete generated with ia-upload (vs. separately), because on checking I found that the ia-upload one had that darned annoying text layer offset problem. It's caused by a really annoying interaction between the way ia-upload generates these DjVus and the really rather shockingly approximate API at the Internet Archive, and is almost impossible for software to correct for (I know Sam has looked at it). Essentially, IA is returning plain incorrect information about page numbers and ia-upload relies on that page order being correct in order to extract OCR text from the XML file at IA and associate it with the right pages in the DjVu. Once one page is incorrect every subsequent page will be too, and multiple such errors will compound.
Regarding the OCR engines… ABBYY FineReader (a commercial product that IA uses) used to generate better quality OCR than Tesseract 3.x (open source tool used by Phe's tools) so some people prefer the OCR text from IA. In my experience, Tesseract 4.x (which is a major new rewrite with a completely new engine), ABBYY FineReader 8.x, and Google Vision (what the Google OCR gadget uses) have comparable quality results. Each has strengths and weaknesses, and some of them handle certain languages better, but I find them essentially interchangeable in terms of OCR results.
In any case, Pete uploaded the DjVu at File:The Playboy of the Western World.djvu and I've uploaded my version over that. Please do let me know if there's a problem with it or if it needs tweaking. And I'm happy to do these DjVus, so never hesitate to ask if you have need of that. --Xover (talk) 07:02, 15 May 2020 (UTC)
It's true, ia-upload has some annoying bugs like that! :-( It sounds like this file's all sorted now, but sometimes I find that the IA PDF is of good enough quality, and has a text layer, so the whole question of DjVu can be sidestepped. Sam Wilson 07:17, 15 May 2020 (UTC)
@Samwilson: It's my understanding that DjVu is strongly preferred, which is why I've been trying to hone my skills in generating them. But I'm not fully familiar with the reasons. (I know it's a more open format, which may be reason enough.) @Xover: Thanks for the explanations. If IA is generating info that is just plain wrong, do you know if anybody has informed them of that? I'd be happy to reach out if it wouldn't be redundant. -Pete (talk) 18:32, 15 May 2020 (UTC)
@Peteforsyth: So far as I know, nobody has talked to IA about this. If you're looking for a programming project I'm sure Sam would appreciate the help on ia-upload!
The text layer offset is discussed in phab:T194861. ia-upload is (I think) relying on the pre-generated XML file at IA combined with information from the API for information about the pages. As Mpaa comments there, the XML file does not always contain all the page images found in the .zip, and ia-upload's algorithm (processing them sequentially) compounds the problem. It's been a while since I looked at it, and my code uses a different approach, but as I recall what I found was that the scan reference images at the start and end, and any mis-scans in the middle, are not correctly reflected. I suspect they manually correct for these in their book reader. In any case, if you're trying to process the .zip files automatically you'll run into trouble with this.
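The compounding effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the page labels are made up, and `pair_sequentially` is a hypothetical stand-in for any importer that matches page images to OCR records purely by position. It is not ia-upload's actual code, nor real IA data.

```python
# Toy model of the text layer offset. All names and data are hypothetical.

def pair_sequentially(image_pages, ocr_pages):
    """Pair the i-th page image with the i-th OCR record, by position only."""
    return list(zip(image_pages, ocr_pages))


def count_mismatches(pairs):
    """Count pairs where the OCR record belongs to a different page."""
    return sum(1 for image, ocr in pairs if image != ocr)


# Page images as found in the scan archive.
images = ["cover", "title", "p1", "p2", "p3", "p4"]

# Assumption: the OCR XML silently omits the record for the "title" image.
ocr = ["cover", "p1", "p2", "p3", "p4"]

pairs = pair_sequentially(images, ocr)
mismatches = count_mismatches(pairs)

for image, text in pairs:
    marker = "ok" if image == text else "OFFSET"
    print(f"{image:>6} <- text of {text:<6} {marker}")
```

With a single missing record, every page after the gap gets its neighbour's text; a second gap further on would shift the remaining pages by two.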
My code avoids this problem because it generates new OCR rather than trying to import IA's OCR, but this, obviously, has the downside that you don't get IA's OCR (which may have been the very thing you wanted). It also means there are unavoidable manual steps to prepare the scan images before processing, making it unusable as a general use tool, unlike ia-upload which is just about as user friendly as it's possible to get this kind of tool. (I'm toying with the idea of making my tool available as an interactive tool at Toolforge/Labs, but I have limited time and I'm not sure there's all that much interest. The current commandline version is too hacky to be useful to anyone but the most techy.)
As for DjVu vs. PDF… There are lots of reasons. Ironically, the biggest reason the community tends to prefer DjVu is that MediaWiki's extraction of OCR text from PDF files is atrocious and much worse than its extraction of the same text from DjVu files. You can literally open the same PDF file in Acrobat and copy the text out and get better results than MediaWiki's. But this may be at least partly due to the biggest issue for me: there's a definite dearth of even semi-decent tools for working with PDF files, especially in an automated way. My guess is that this is because PDF is a wholly visually oriented format, so there is very little structure or sense to PDF files that a tool could manipulate.
DjVu by comparison has an extremely structured system with levels for a whole document, referenced sub-documents (you can even reuse binary chunks between pages for hyper-optimization!), pages divided into areas, and OCR text with regions, columns, paragraphs, lines, words, and characters; all with positions and extents. The DjVuLibre tools are designed to let you manipulate all this from the command line (I haven't tried doing it as a library, but it does support that), and are fairly decently scriptable. That DjVu(Libre) is free in all the ways that matter (licence, patents, open source, open specification, free-as-in-beer, etc.) where PDF has uncomfortable caveats on several points is a secondary but not unimportant concern.
And, ultimately, from what I can tell DjVu—because it is designed specifically for our use case—is much better suited for our needs at the format level. For example the ability to separate a page image into multiple layers, where a single-color solid background layer can be encoded efficiently as essentially a single pixel; areas that will be occluded by a higher layer can be ditto compressed away; advanced wavelet compression for "photographic" (anything but the simplest black and white stuff) layers, but scaling down to simple bitonal (and thus highly space efficient) encoding. In one recent test a file went from several hundred MB to 3.5MB because I decided to optimize for size rather than fidelity (it was a badly crushed B&W Google scan with lots of noise that the "photographic" compression didn't do well on, and which didn't suffer markedly by bitonal encoding).
But I suspect that ultimately the community's preference for DjVu actually boils down to that poor OCR text extraction from PDFs in MediaWiki. The rest are somewhat too esoteric issues for most people here. --Xover (talk) 13:02, 16 May 2020 (UTC)
@Peteforsyth: IA has additional problems with scans in its library. Many of them are imported from Google, and quite often no one checks the scans for basic quality control. I have not infrequently found scans where some of the pages are upside-down, or scans where pages were missing (or duplicated), or scans where the corner text of every page was obscured by the thumb of the person doing the scanning, and many other issues. Given that these visually obvious issues occur in IA scans, it is unlikely that text layer offset (which cannot be seen easily) will be caught and corrected. --EncycloPetey (talk) 22:02, 16 May 2020 (UTC)

Whitespace (again)[edit]

Page:The record interpreter- a collection of abbreviations.djvu/449

Unless you leave 2 lines between plain-lists, the parser backend seems to collapse things.

Normally between 'grouped' items you would leave a single line feed.

One consistent rule to apply, all the time, would be nice. ShakespeareFan00 (talk) 14:24, 17 May 2020 (UTC)

Solved by adding additional parameters to {{plainlist/s}}. Review requested. Why would it not be possible to style the UL element directly? ShakespeareFan00 (talk) 19:57, 17 May 2020 (UTC)
@ShakespeareFan00: Presumably because the ul doesn't actually appear anywhere: mediawiki's wikimarkup infers the ul from the presence of a list item. But {{plainlist}} could of course just emit the raw html directly instead of relying on wikimarkup. That would probably solve the extra margin too, as I'm pretty sure the default mediawiki stylesheet adds a 1em margin for all div elements. Then again, you don't really need list markup for Page:The record interpreter- a collection of abbreviations.djvu/449: it would make just as much sense to terminate each line with a <br /> and then use the normal whitespace rules to separate groups. --Xover (talk) 20:18, 17 May 2020 (UTC)


And why it should be deleted or re-written from scratch completely.:-

The exact problem was that:

{{cl-act-p/1|s1=1|text={{cl-act-h||Test heading}}{{lorem ipsum}}}} 

{{cl-act-p/1|s1=1|{{cl-act-h||Test heading}}{{lorem ipsum}}}}

Do not generate the SAME output: in the latter it's trying to put the ENTIRETY of what should be the 'text' inside the ID field of the DIV it's generating. This is a mistake somewhere in the Lua/markup combination that's generating it. It's too complex for me to understand what the code is doing, and thus it can't be fixed on a reasonable time scale; therefore it's time someone else started from scratch, based on the test cases and details provided.

(I've also reverted cl-act-t back to an earlier version because of some completely screwed up margin handling in more recent revisions. If something isn't working, go back to a version that's broken in a way that WILL hopefully be understood better.) ShakespeareFan00 (talk) 15:17, 17 May 2020 (UTC)

At the very least, can you check that the Lua code responsible for decoding arguments is decoding things correctly? (i.e. the embedded template should not be parsed separately but as part of parameter 1 (unnamed).) ShakespeareFan00 (talk) 15:39, 17 May 2020 (UTC)
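For reference, the difference between the two calls above can be sketched with a simplified argument splitter. This is a rough model only: `split_template_args` is a hypothetical helper, not MediaWiki's preprocessor and not the Lua module under discussion. It assumes arguments are separated at top-level pipes, that nested {{...}} calls stay inside their enclosing argument, and that an argument is named only when an "=" appears before any nested template.

```python
# Simplified model of template argument splitting. Hypothetical helper,
# NOT MediaWiki's actual parser or the cl-act-p Lua code.

def split_template_args(wikitext):
    """Split one {{...}} call into (name, positional, named)."""
    if not (wikitext.startswith("{{") and wikitext.endswith("}}")):
        raise ValueError("expected a single template call")
    inner = wikitext[2:-2]

    parts, current, depth, i = [], [], 0, 0
    while i < len(inner):
        two = inner[i:i + 2]
        if two == "{{":            # entering a nested template
            depth += 1
            current.append(two)
            i += 2
        elif two == "}}":          # leaving a nested template
            depth -= 1
            current.append(two)
            i += 2
        elif inner[i] == "|" and depth == 0:
            parts.append("".join(current))   # top-level separator
            current = []
            i += 1
        else:
            current.append(inner[i])
            i += 1
    parts.append("".join(current))

    name, positional, named = parts[0], [], {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        # Named only if "=" precedes any nested template (an assumption).
        if sep and "{{" not in key:
            named[key.strip()] = value.strip()
        else:
            positional.append(part)
    return name, positional, named


with_name = "{{cl-act-p/1|s1=1|text={{cl-act-h||Test heading}}{{lorem ipsum}}}}"
without_name = "{{cl-act-p/1|s1=1|{{cl-act-h||Test heading}}{{lorem ipsum}}}}"

print(split_template_args(with_name))
print(split_template_args(without_name))
```

Under this model both calls carry the same content with the nested templates intact, but the first delivers it as `text` and the second as the first unnamed parameter. If the template reads the first unnamed parameter for something other than the text, the outputs would differ; that is an assumption about the template, not a diagnosis of it.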

Well, after a LOT of headaches I got this working again. However, I still think it needs a rethink, as it was a pain to find where it broke down. It still needs a rewrite or more extensive documentation. ShakespeareFan00 (talk) 23:45, 17 May 2020 (UTC)