User talk:Inductiveload




WELCOME to my user talk page. Feel free to leave me a message if there is a problem or you would like my help, or anything else.

I am also active on Commons. If you would like help with a file I uploaded or would like me to make a file for you, please ask at my user talk page there. If the request is Wikisource-centred, ask here.

Anything you write on this page will be archived, so please be polite and don't write anything you will regret later! My purpose here is to make interesting and useful documents open to the public. I am never trying to make trouble, and any problems can almost certainly be resolved quickly and easily if everyone stays calm.

Please sign your posts by typing four tildes (~~~~) after your post, and continue conversations where they start. This helps to keep discussions coherent for future readers. If I leave a message on your page, then please reply there. My replies to messages left on this page will be here.



Bot Job for Index:A dictionarie of the French and English tongues - Cotgrave - 1611.djvu

Could you replace all instances of " " with a new line and a 1em indent? Also, could you remove all poem tags? Languageseeker (talk) 02:22, 4 November 2021 (UTC)[reply]

@Languageseeker I can do it, but it would have been ~900 times easier if that had been done before splitting. As usual, my recommendation is to slooooow dowwwwwn and think things over before rushing into the first action you think of. Inductiveloadtalk/contribs 07:29, 4 November 2021 (UTC)[reply]
Actually, I do not think this is correct. There are lots of multiple spaces, and not all of them are new lines:
Abandonnemént. ''at randome, dissolutely, licenciously,   profusely,with libertie.''

Abandonner: ''to abandon, quit, forsake, forgoe, waiue   or give ouer, shake or cast off, lay open, leaue at randome,   prostitute vnto, make common for, others; also,   to outlaw.''   Abadonner la vie de tel au premier qui le   pourra tuer. ''to proscribe a man; (is ever to be vnderstood   of a Soveraigne, or such a one as, next vnder   God, hath absolute and vncontrowlable power ouer   his life.''   s'Abandonner à plaisirs. ''sensually to yeeld, or become   a slave, vnto pleasure; wholy to captiuat, or deuote,   his thoughts to delights.''   Fille qui donne s'abandonne: Pro. ''A maid that   giveth yeeldeth.''   Il commence bien à mourir qui abandonne son   desir; Pro. ''he truly begins to die that quits his chiefe   desires.''
Ideally we want to find a transform that will allow us to leverage the MediaWiki definition list markup like this:
; Abandonnemént.
: ''at randome, dissolutely, licenciously, profusely,with libertie.''

; Abandonner:
: ''to abandon, quit, forsake, forgoe, waiue   or give ouer, shake or cast off, lay open, leaue at randome, prostitute vnto, make common for, others; also,   to outlaw.''
:; Abadonner la vie de tel au premier qui le pourra tuer.
:: ''to proscribe a man; (is ever to be vnderstood of a Soveraigne, or such a one as, next vnder God, hath absolute and vncontrowlable power ouer   his life.''
:; s'Abandonner à plaisirs.
::''sensually to yeeld, or become   a slave, vnto pleasure; wholy to captiuat, or deuote, his thoughts to delights.''
:; Fille qui donne s'abandonne: Pro.
:: ''A maid that   giveth yeeldeth.''
:; Il commence bien à mourir qui abandonne son desir; Pro.
:: ''he truly begins to die that quits his chiefe desires.''
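For illustration, a minimal sketch of one such transform (assuming runs of three or more spaces separate the sub-entries and that a headword ends where the first ''italicised'' definition begins; it flattens rather than nests the sub-entries, and the function name and input are hypothetical):

import re

def to_definition_list(entry_block):
    """Rough conversion of one Cotgrave-style entry block into
    MediaWiki definition list markup. Runs of 3+ spaces are taken
    as the separators between sub-entries."""
    lines = []
    for chunk in re.split(r" {3,}", entry_block.strip()):
        chunk = re.sub(r"\s+", " ", chunk)            # normalise remaining whitespace
        match = re.match(r"(.+?)\s*('{2}.*)$", chunk, re.DOTALL)
        if match:                                      # headword + italic definition
            lines.append("; " + match.group(1).strip())
            lines.append(": " + match.group(2).strip())
        else:                                          # no italics found: keep as a term
            lines.append("; " + chunk)
    return "\n".join(lines)
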
Inductiveloadtalk/contribs 12:15, 4 November 2021 (UTC)[reply]

Defaults...

Module:New texts currently defaults to Template:New texts/data.json with predictable results:

{{#invoke:New texts|new_texts|offset=9|limit=12}}

Wouldn't it make more sense to default to the current year when no |year= is given? (cf. Template:New texts/sandbox) Xover (talk) 08:44, 8 November 2021 (UTC)[reply]

Sure, it was just a holdover from when the current data was just at "data.json". diff. Inductiveloadtalk/contribs 08:49, 8 November 2021 (UTC)[reply]

Adding border option for {{FI}}

Would it be possible to add a parameter to set the border in {{FI}}? Some books have black borders around images that this template cannot handle, requiring custom CSS code; see Index:Negro poets and their poems (IA negropoetstheirp00kerl).pdf. Languageseeker (talk) 16:34, 8 November 2021 (UTC)[reply]

@Languageseeker Hmm, that would need an extra <span>, since the [[File:...]] markup doesn't accept a style parameter. Looks like the imgstyle parameter was an attempt to do that and I failed at it.
As you seem to have found, index-based CSS will do this happily, and there is both a cclass and imgclass to assist in targeting the CSS if needed. Inductiveloadtalk/contribs 17:16, 8 November 2021 (UTC)[reply]
I see, thank you. Languageseeker (talk) 17:33, 8 November 2021 (UTC)[reply]
@Languageseeker: Unless there was a change in templates, we are talking about two templates. {{FI}} is a <div></div>-based template and {{FIS}} is a <span></span>-based template. The difference was to allow text to flow around the frame unbroken. Which one are you referring to? — Ineuw (talk) 07:45, 13 November 2021 (UTC)[reply]
I created a sample using both templates User:Ineuw/Sandbox4.— Ineuw (talk) 08:04, 13 November 2021 (UTC)[reply]
The templates surround both the image and the caption with a border. The desire is to have an option that would just surround the image. Languageseeker (talk) 03:32, 15 November 2021 (UTC)[reply]

Option to Export Text Layer of an Index

I was wondering if there's an option to export the entire text layer of an Index, similar to how PGDP can export the concatenated text file. The output would be something like this:

====Page:1====
Status: N (Not Proofread) B (No Text) P (Proofread) V (Validated)
<header>
header text

<body text>
body text

<footer>
footer text

====Page:2====

This would be a great way to search for common problems in the Index and also to keep a copy of the raw wikicode for an Index. Languageseeker (talk) 17:33, 8 November 2021 (UTC)[reply]

@Languageseeker hmm, interesting. It's probably pretty easy to do with Python.
https://book2scroll.toolforge.org/ is similar, but without the "save as file" and wikitext options. Inductiveloadtalk/contribs 17:36, 8 November 2021 (UTC)[reply]
Is there any way to avoid Python? That seems to add quite a layer of complexity for most users. Languageseeker (talk) 17:51, 8 November 2021 (UTC)[reply]
Well, once the Python exists, it can be deployed on Toolforge. 100x easier than getting it into the extension (for one, it would need a formal format to be defined). Inductiveloadtalk/contribs 17:58, 8 November 2021 (UTC)[reply]
I was thinking that maybe ProofreadPage might actually be the best place to add this code. Ideally, there should be an option to export/import a project. Right now, there's no easy way to export the data from a project or recreate it in another space. Ideally, this would generate a zip file containing a text file that starts with information about the file and its metadata, followed by the text from all the Pages, as well as the original scan and any media files. Languageseeker (talk) 21:50, 8 November 2021 (UTC)[reply]
That could still be done on Toolforge. Building it into the extension is a good idea, but 1) you need a very well-defined format to fix and 2) it takes an enormous amount of effort and an even larger amount of time to get non-trivial patches accepted in the extensions. Easily an order of magnitude lower velocity than a Toolforge tool. Inductiveloadtalk/contribs 22:34, 8 November 2021 (UTC)[reply]
I see. It's a practicality issue. Could you add it to your already too long list of things to do? This will be important to users who wish to proofread and also to those who wish to have a complete archive of an Index, to use either on a different wiki instance or in whatever capacity they wish. It's also key to fulfilling the open-access philosophy/promise of Wikisource. Languageseeker (talk) 00:42, 9 November 2021 (UTC)[reply]
Ideally, there would be two levels of backup:
  1. A pure textual one consisting of a concatenated text file.
  2. A full backup that could be imported into a clean wiki-setup. This would include
    1. The Scan (PDF, DJVU, Images)
    2. The Metadata for the Scan
    3. The CSS File
    4. The concatenated text files
    5. Any media files and their metadata
    6. Any Templates and their documentation used
    7. Any pages where the scan is transcluded.
    8. Any pages that link to the transclusion. This will probably be Author pages or Versions pages. Languageseeker (talk) 02:34, 9 November 2021 (UTC)[reply]
Forgot to ask the all-important question. Do you think that this is something that you can do, or do you not have the bandwidth/time for it? Languageseeker (talk) 15:24, 9 November 2021 (UTC)[reply]
@Languageseeker Honestly, I do not think there's a lot of benefit to building this into the PHP. Any format would be extremely specific and not generally useful. All the information you need is explicitly available on the API. I think you should really be thinking about what you want to achieve here. You use the word backup, so this makes me think that you're thinking of some kind of archival purpose rather than any proofreading-related purpose. Database dumps of the whole of Wikisource are made every month or so, so if you're after an archive, maybe you can check them out.
In short, I do not really have time or inclination for any involved tool without a pretty solid "business case". On the other hand, a quick hack-up of "get the wikitext of every page in an index" is not very hard. Inductiveloadtalk/contribs 15:32, 9 November 2021 (UTC)[reply]
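For illustration, a minimal sketch of that kind of hack-up (a thin wrapper over the standard query API via the requests library; the function name and output format are just examples, and continuation handling for very long indexes is omitted here):

import requests

API = "https://en.wikisource.org/w/api.php"

def dump_index_wikitext(index_filename, out_path):
    """Concatenate the wikitext of every Page: subpage of an index
    into one text file, in page order."""
    params = {
        "action": "query", "format": "json", "formatversion": 2,
        "generator": "prefixsearch", "gpssearch": index_filename,
        "gpsnamespace": 104, "gpslimit": 500,    # Page: namespace on enWS
        "prop": "revisions", "rvprop": "content", "rvslots": "main",
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    pages.sort(key=lambda p: int(p["title"].rsplit("/", 1)[-1]))
    with open(out_path, "w", encoding="utf-8") as f:
        for page in pages:
            revisions = page.get("revisions")
            if not revisions:
                continue    # content not returned in this batch
            f.write("==== {} ====\n".format(page["title"]))
            f.write(revisions[0]["slots"]["main"]["content"] + "\n\n")

# e.g. dump_index_wikitext("The Strand Magazine (Volume 1).djvu", "strand1.txt")
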
For me, there is both a use for proofreading and for backup.
I think the business case for backup would be to make it possible for users/institutions to get their work out of this project. I can imagine many cases where institutions or users may want, or be willing, to use the Wikisource platform to proofread as long as they are able to get the work out in an easy way. This weekend, on LinguaLibre, there was a similar case where a user was willing to contribute because they were expecting that they could download their pronunciations easily. As it turned out, there is no such way, which caused some embarrassment and led to the team downloading every pronunciation manually to avoid losing the user. I think that WS faces the same issue. Say the NLS would like to download their chapbooks. How would this be possible? Think about how many Indexes on enWS have images or repaired files. I've imported quite a few works from PGDP, and one of the constant challenges that I face is that the text file does not correspond to the actual file on IA or HT. Keeping the text file with the image files/scan will make it possible to actually back up the work.
For proofreading, a system to import/export an Index will have several benefits. First, it will enable users with slower internet connection to contribute without having to worry about long load times or losing data. Second, it will also enable users to proofread an entire text or search for common errors. Finally, it will also make it easier to locate a specific error.
I think that "a quick hack-up of "get the wikitext of every page in an index"" is a great start and would be a wonderful thing to have. Would at least that be possible? Languageseeker (talk) 15:48, 9 November 2021 (UTC)[reply]
@Languageseeker Right, but what's a pile of wikitext, Lua and images going to achieve? You'd have to import it into a near-perfect clone of Wikisource as it was at the time of export. For what purpose? In case Wikisource gets nuked? WS Export already provides HTML export, as well as PDF, ePub, MOBI, text and RTF. Wikitext export is already completely possible via the API or DB dumps and can easily be done, but the format it ends up in will be wikitext and essentially completely useless except for feeding back into a wiki and more suitable for some kind of offline match-and-split-like workflow that feeds back to Wikisource itself.
Anything more than a straight wikitext dump of the pages in an index is weeks of work and an ongoing maintenance burden, so you really need to explain what it's for, other than "man, wouldn't it be fun if".
And if you do want to feed back into a wiki locally, then we already have Special:Export (probably with Special:PrefixIndex assuming people have done the Right Thing and used subpages properly), as well as the aforementioned DB dumps, API access and the Wikisource-dedicated export tooling. Inductiveloadtalk/contribs 16:03, 9 November 2021 (UTC)[reply]
I don't think that it's going to be anywhere near a perfect or easy process to import the files into another system. However, it would be possible. Creating an export for an index will enable users to do what they wish with the data in an easy and convenient manner. That is why the three most important aspects to export are the text layer, the scan, and the images. The other features are nice to have (especially transclusion ranges), but are not strictly necessary. For me, this is a central pillar of a commitment to maintaining open access to the information produced. Anyone should be free to take the raw data produced on enWS and do with it as they please. Languageseeker (talk) 16:29, 9 November 2021 (UTC)[reply]
But it's possible now. Getting the relevant data from the API and/or a DB dump is no harder than getting the data out of some special-sauce WS-specific package format. In fact, it's probably easier, because there's probably already tooling for handling DB dumps in whatever language the user wants (certainly Python and PHP).
There's still no concrete use case beyond "sounds fun". You need to find a client for this feature and make sure what you're proposing actually works for them. Bulk archiving is already provided by the software. Yes, Special:Export is missing the images, but that's a defect in the core (phab:T15827, 13 years old), and should "just" be fixed (ahahahaha, I crack myself up) there rather than getting me to do more of the WMF's homework and piling on more external tools to paper over lack of upstream interest. Tl;dr go and complain at them.
A way to generate a Special:Export package for all the pages in an index without having to use Special:PrefixIndex may also make sense (i.e. leverage the tools we already have)
A dump of wikitext in one big file I can understand, because then you can use a text editor to do various fixes without needing to bot them in "live" (though you'll still need a bot to upload at the end). Inductiveloadtalk/contribs 16:46, 9 November 2021 (UTC)[reply]
Alright, then could just the feature to export all the Pages in an Index to a txt file be added? Languageseeker (talk) 17:00, 9 November 2021 (UTC)[reply]

For such an edge case, what is wrong with Special:Export and let the users work out how they manage it. I sometimes wonder why we are trying to replicate rarely utilised functionality when we have needed improvements. If it is something important, stick it into phabricator: with all the other TO DOs. — billinghurst sDrewth 22:34, 9 November 2021 (UTC)[reply]

I don't think it's an edge case. I've been thinking about this more, and here are what I think are some real scenarios in which this can help.
  1. Checking the formatting of an entire Index. For example, say that an Index has plates that should all be 500px. Right now, if you want to verify that all the images are in fact 500px, you would need to open every page, then click on edit, and then check. (Also, hope that the pages with images are actually marked.) This can be quite time-consuming. If you had all the pages as a single text file, then you could just use find to check them all (see the sketch after this list). This case can be generalized.
  2. Finding an error in a book. When I read a WS text on my Kobo, sometimes I notice obvious scannos like "1t." On the Kobo, I can highlight the text which saves it to an annotation file. However, Kobo will save this as "Chapter 10: LETTER VII." So, if I want to find this error, I need to go to the transcluded work, find the right chapter, search in the chapter for the text, and then click on the page. It's a huge time waste. It gets worse when there are multiple Letter VII.
  3. The ability to import the text would also make it easier to correct common errors such as "— " or curly quotes.
  4. The ability to export/import images would make it much easier to replace poor quality images. Recently, I worked on replacing all 174 images in Index:The Adventures of Huckleberry Finn (1884).pdf because the existing images were cropped from the DJVU. The ability to export them with an accompanying XLS file would save a ton of time when it comes to reuploading them. As long as there is a reason column, there should be no technical barrier to using a script to overwrite all 174. That is far faster than manually reuploading 174 files.
  5. It could also become possible to generate metadata for missing images. It would generate the metadata for all the missing images in an XLS file. Once the images are added to the folder, the script could upload them without the user having to manually create the metadata. This would greatly speed up the adding of images.
  6. In the long run, a proper system for importing/exporting text would enable the creation of an offline proofreading interface similar to AWB. There are many cases in which users might have a slow connection or just loading images from PDF/DJVU is simply a slow process. Creating a way to download/upload Indexes and individual pages would greatly speed up the work. Languageseeker (talk) 01:54, 10 November 2021 (UTC)[reply]
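As a rough illustration of scenario 1, a sketch that scans a concatenated wikitext dump (for example, one produced as in the snippet earlier in this thread) for [[File:...]] links whose explicit width is not the expected 500px (the function name and dump path are hypothetical):

import re

def check_image_widths(dump_path, expected="500px"):
    """Report any [[File:...]] invocation in the concatenated dump
    whose explicit width differs from the expected one."""
    text = open(dump_path, encoding="utf-8").read()
    for match in re.finditer(r"\[\[File:([^|\]]+)\|([^\]]*)\]\]", text):
        name, options = match.groups()
        widths = re.findall(r"(\d+)px", options)
        if widths and widths[0] + "px" != expected:
            print("{}: {}px".format(name, widths[0]))

# e.g. check_image_widths("strand1.txt")
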
@Languageseeker: Re I need to go to the transcluded work, find the right chapter, search in the chapter for the text, and then click on the page. It's a huge time waste. It gets worse when there are multiple Letter VII. As a semi-tangent, you may appreciate the replace tool in User:Inductiveload/maintain, which allows you to highlight the text 1t and replace it directly in the Page namespace, if possible (usually it is).
Re correct common errors such as "— " or curly quotes: functionally, AWB, JWB or PWB (perhaps with User:Inductiveload/quick pwb) are existing tools that can do this already.
You can already get access to images in a given index, e.g. https://en.wikisource.org/w/api.php?action=query&format=json&formatversion=2&prop=images&generator=prefixsearch&gpssearch=The%20Strand%20Magazine%20(Volume%201).djvu&gpsnamespace=104&gpslimit=2000
Ditto for the content: https://en.wikisource.org/w/api.php?action=query&format=json&prop=revisions&list=&generator=prefixsearch&formatversion=2&rvprop=ids%7Ctimestamp%7Cflags%7Ccomment%7Cuser%7Ccontent&rvslots=main&gpssearch=The%20Strand%20Magazine%20(Volume%201).djvu&gpsnamespace=104&gpslimit=2000 In fact, a thin wrapper over this interface is all the putative Toolforge "exporter" would be anyway. If you're already using scripts, you should just hit that API yourself and then you get more control anyway.
Most of what you're asking is already possible, and if you're already using a custom script, the normal API is much more reliable, available, tested and stable than any Toolforge tool would ever be. It still sounds to me like you are coming up with solutions to problems before you've actually worked out a workflow that has the problems. Inductiveloadtalk/contribs 21:58, 10 November 2021 (UTC)[reply]
Wow, I did not realize how amazing the API was. However, when I try to get the raw content for Mansfield Park or frWS, it seems that it does not show the content for all the pages and the pages are out of order. Is there any way to show the content for all the pages in order? Languageseeker (talk) 01:48, 11 November 2021 (UTC)[reply]
At some point you're going to need to process the data anyway. Sorting that array is a one-liner in Python: pages.sort(key=lambda page: int(page['title'].split('/')[-1])).
I see 308 pages there, which looks right by Index:Austen - Mansfield Park, vol. II, 1814.djvu? Remember the JSON array is 0-indexed, but the page numbers are 1-indexed.
When the generator for index pages is ready (currently work-in-progress), that will be the better option. Inductiveloadtalk/contribs 07:46, 11 November 2021 (UTC)[reply]
While it correctly identifies all the pages, at some point, it stops outputting the revisions field which contains the content, see User:Languageseeker/sandbox3. Also, which generator are you discussing? Sorry, if I've missed something obvious. Languageseeker (talk) 12:31, 11 November 2021 (UTC)[reply]
@Languageseeker the generator is the one that will be implemented in phab:T291490. I'm halfway through doing it. Deployment will be when it will be. I have to finish it, and then shepherd it through code review.
For the data there, that's because you are not logged in, so you have lower API limits (50 vs 500). You will either need to make the API query from some logged-in session, or handle the continue.rvcontinue field correctly. Note that, since some books are over 500 pages long, you need to handle the continue data anyway. Inductiveloadtalk/contribs 13:06, 11 November 2021 (UTC)[reply]
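For illustration, a minimal sketch of that continuation handling (same assumptions as the earlier snippet: the standard query API via requests and the Page: namespace 104; the function name is made up):

import requests

API = "https://en.wikisource.org/w/api.php"

def fetch_all_page_content(index_filename):
    """Collect wikitext for every Page: subpage of an index,
    following continue/rvcontinue until the API has no more batches."""
    params = {
        "action": "query", "format": "json", "formatversion": 2,
        "generator": "prefixsearch", "gpssearch": index_filename,
        "gpsnamespace": 104, "gpslimit": 500,
        "prop": "revisions", "rvprop": "content", "rvslots": "main",
    }
    content_by_title = {}
    while True:
        data = requests.get(API, params=params).json()
        for page in data["query"]["pages"]:
            revisions = page.get("revisions")
            if revisions:
                content_by_title[page["title"]] = revisions[0]["slots"]["main"]["content"]
        if "continue" not in data:
            break
        params.update(data["continue"])    # carries rvcontinue, gpsoffset, etc.
    return content_by_title
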
Thank you for your wonderful and detailed explanation. I'm looking forwards to seeing the generator when it is done. It sounds very cool. Languageseeker (talk) 01:04, 12 November 2021 (UTC)[reply]

That iffy feeling…

The one that says, maybe I don't want to open the lid on that mystery container at the back of the fridge because who the heck knows what will come crawling out. I've been having that feeling for a while regarding the magical mystery black box that is phetools. But since the PWB thing forced my hand I've had to start opening lids. Let me illustrate by the pseudocode version of the algorithm that makes the Phe OCR gadget so fast:

titles = SELECT page_title FROM <Index: namespace on enws>;
for title in titles
    if not exists ocr_cache[title] then
        generate_ocr(title)

Because the flip side of the fridge horror above is the feeling you get after fixing a bot that's been dead for a while and discover it's decided to download every single PDF and DjVu file on commons to warm its OCR cache. Having to do emergency database surgery to excise the ~70k jobs already queued up in its internal grid engine manager before the Toolforge admins come `round to have a wee bit of a chat is… Well, I don't recommend it as a habit.

This thing is so clearly an eldritch horror poking its icy cold tentacles through a weak spot in the skein between dimensions. Maybe not Cthulhu itself, but surely Th'rygh, The God-Beast or Sho-Gath, The God in the Box. Xover (talk) 18:32, 8 November 2021 (UTC)[reply]

Ha, well, technically using a nuclear weapon is still "warming", even if people standing nearby get a bit grumpy. Inductiveloadtalk/contribs 18:45, 8 November 2021 (UTC)[reply]

Hws & hwe

I just saw a note you left with another user. Just what? How are we to know when changes like this happen? I used to get changes on the Scriptorium page on my Watchlist which I check but it doesn’t seem to be showing up anymore. It seems a major change to me, as a proofreader and quite distressing to be oblivious of it happening. I am working on a Beginners’ proofreading guide. Can you tell me of any other changes that I may be unaware of? I’ve noticed you seem to have your finger on the pulse. I’d appreciate the support. Cheers, Zoeannl (talk) 23:30, 8 November 2021 (UTC)[reply]

@Zoeannl: I'm using a phone, so I can't get too wordy, but the hyphenation thing was introduced in September 2018: Wikisource:Scriptorium/Archives/2018-09#Words_hyphenated_across_pages_in_Wikisource_are_now_joined.
Probably the other biggest recent change is H:Page styles, which allows index-specific CSS. I can go into more detail tomorrow if you like. Inductiveloadtalk/contribs 23:59, 8 November 2021 (UTC)[reply]
Zoeannl, it doesn't mean that you need to stop using HWE: the community has not deprecated its use and it is still supported; there is just now an alternative. There are still situations where HWE has to be used. Note that we did have a conversation more recently about needing to get better with our announcements when changes take place. — billinghurst sDrewth 23:00, 9 November 2021 (UTC)[reply]

Pop goes the… Extension?

In case you're not aware: mw:Extension:Popups. Xover (talk) 08:59, 10 November 2021 (UTC)[reply]

@Xover I have seen this, but I counter with: Popups Reloaded is wayyyyy betterer, and does lots of cool stuff ner-ner ner-ner. Inductiveloadtalk/contribs 09:04, 10 November 2021 (UTC)[reply]
Yeah, I haven't looked at it; I just ran across the link and figured it might be relevant due to Reloaded (which I haven't looked at either). Incidentally, I'm cross-loading the enwp upstream of Popups instead of our locally-ported copy, and it is much nicer. "Good enough" rather than "Great", but everything is relative. Xover (talk) 09:58, 10 November 2021 (UTC)[reply]
@Xover reloaded is far from done, but even now 1) it's got lots of fun WS'y features (page image on hover anyone?) and 2) it's designed to allow pluggable extra modules (though the API for that isn't baked yet, so caveat implementor). Inductiveloadtalk/contribs 10:03, 10 November 2021 (UTC)[reply]

IME and ULS and Beta Code (and interested in testing?)

Awhile ago (5 Sept) I mentioned I was writing a user script to do easy keyboard entry of Ancient Greek polytonic script, using w:Beta Code, which was also the basis for your template experiment {{betacode}}.

And it was <!*wonderful*!> with multiple options for visual feedback and all the easiness of Beta Code. I used it a lot at EL on a bit of ambition.

But it felt a bit hacky and was a lot of work, working around Wikimedia and doing all my own UI visual displays. And I remembered your comments about ULS and jquery.ime.


After several days of discovery and coding I've convinced jquery.ime to do Beta Code, with both strict rules and deliciously loose rules. It isn't as beautiful, but still wonderful to use, and the Wikimedia people will not actually hate it.

But it *is* rather hard to demonstrate 'live'. I've worked out a way to bootstrap the development copy of the new rules into online wikisource, but it requires a localhost HTTP server and many (>10) steps to force it into the live wikisource IME for use and testing.

Thing is, a Beta Code implementation using jquery.ime's very basic tools was complicated. I had to write programs to generate all the rules. Even the strict rules set is 200 rules of magic (~275 total). The loose rules - very kind to users - is 1200 rules of magic (~1275 total). The previous largest jquery.ime rules set was 179 rules. The wikimedia people might have heart attacks?


Soon I'll be ready to submit a pull request at Github, but I understand they are kind of slow(?) merging pulls into that project, and then merging jquery.ime updates into Wikimedia. So figuring out how to excite people is a goal?

So I did it using ULS and jquery.ime. Any advice on getting this into wikisource this year?  :-)  Shenme (talk) 06:38, 15 November 2021 (UTC)[reply]

@Shenme this looks amazing. I don't have much advice for getting things into the code base other than "be very, very patient" and then "be more patient" and then "don't kick up a fuss and just take the beating when you have followed existing code and documentation but still get told it's wrong and go round the review process dozens of times", because the process can take months and months and can also be incredibly frustrating, to the point that I sometimes seriously consider kicking a dustbin across the room and giving up on anything that needs a merge. "Fortunately" the deployment pattern for userscripts and gadgets is so broken that I keep going back to merges as a way to get any code deployed anywhere near sanely. You have probably also noticed this by trying to deploy something locally.
I will try to dig into it, but my initial feeling is that using combining diacritics will allow a substantial saving in the rule lists.
Even if we cannot get it into the upstream, we can first deploy here as a gadget, and if the ULS people still can't be persuaded of its utility, we might also be able to squeeze it into the Wikisource extension. Inductiveloadtalk/contribs 07:50, 15 November 2021 (UTC)[reply]
Still in progress, but first you fix other people's problems? Oh well, useful social credit. Shenme (talk) 04:52, 23 November 2021 (UTC)[reply]

Transcluding

I am trying to transition from using {{page}} to using <pages/>. With your help, this has gone well, but I face a new problem where the book I am working on is missing two pages (discovered late in the game unfortunately). I handled this in my usual fashion which is to import (via JPEGs) the two pages needed from another copy of the book. The pages in question are pp. 412-413 (see Index:The Reminiscences of Carl Schurz (Volume Two).djvu). My problem is I don't know how to transclude the patch pages except by using {{page}}. I have done this for The Reminiscences of Carl Schurz (book)/Volume Two/Chapter 8, but the transition between pp. 411 and 412 is poor (the one between pp. 413 and 414 worked fine since p. 413 ends with a complete paragraph). How do I make this work smoothly without using {{page}}? Thanks for any suggestions. Bob Burkhardt (talk) 17:08, 15 November 2021 (UTC)[reply]

@Bob Burkhardt the best thing to do here is to repair the scan by inserting the missing pages, then you can keep it normal. I did this using those two files and then moved the pages into their new homes and adjusted the transclusion (obviously this is easier when it's the last chapter of a book!). Wikisource:Scan Lab exists for this kind of repair - if you notice that a book has a defect, you can get it fixed there and hopefully it'll be done before you get to the pages in question. Inductiveloadtalk/contribs 17:44, 15 November 2021 (UTC)[reply]

Thank you for bringing the new resource to my attention. There is still a problem: Page:The Reminiscences of Carl Schurz (Volume Two).djvu/484 has an image for p. 413, rather than p. 412 as it is supposed to. Can you fix this for me or have it fixed? Bob Burkhardt (talk) 18:12, 15 November 2021 (UTC)[reply]

@Bob Burkhardt Sorry, that was my fault, I uploaded the wrong "fixed" file. Should be OK now. Thanks for checking. Inductiveloadtalk/contribs 18:22, 15 November 2021 (UTC)[reply]

Looks good. Thank you for grappling with this. Bob Burkhardt (talk) 18:29, 15 November 2021 (UTC)[reply]

Stash Bug and Batch Upload

With the stash bug fixed, is it possible to do batch uploads of periodicals again? I think that The Dial, Volume 75 is a good example of how having scans enables users to scan-back works published in periodicals. Languageseeker (talk) 14:54, 16 November 2021 (UTC)[reply]

Unpurgeable stale thumbnails

cf. T215558. You don't happen to have any current examples of files with stale thumbnails that can't be purged? Xover (talk) 19:03, 17 November 2021 (UTC)[reply]

@Xover can't think of any, sorry! Inductiveloadtalk/contribs 19:27, 17 November 2021 (UTC)[reply]

some files

I found a book that did not have a printed copyright notice, that is probably from 1931 but maybe from 1915, and I scanned it. I scanned it to JP2, converted that to PNG, and uploaded the files here.

If necessary, I will enter it into a process to get approval or not, although that didn't go so well here, so maybe at Commons; some steering in the right direction would be appreciated.

As I see it, worst case, the scans are not approved and go with the files to be released in 2029. That is not so bad (it's not great though). So, the files are at the Commons. Included with them is an advertisement for this book from another book which is here. There is an ad for the other book also in that cat; I really thought I saw the image that is on that ad in this book but have failed to find it.

A couple of other things. If you prefer JP2, I can manage that (I found a place to upload). Also, if you would like a set of duplicates that work really well with Tesseract, I can make those (I just need to dust my script off) -- I don't have Tesseract on this computer, so I cannot provide the text files.

Also, thanks for cleaning up that toc! I should have looked at that also because all of the other multi-page things needed tweaking also.--RaboKarbakian (talk) 03:44, 18 November 2021 (UTC)[reply]

((also, I scanned the blank pages because I scanned all of the odd pages first and the even pages second and I can juggle, but maybe not so well. The next book is a 2 volume set from 1896!! They are some of the most beautiful books I have ever had my hands on!!))
OK, well I can make a DjVu out of that easily enough (is that the question?) Some hints for the next scan:
  • For the images, do try to get into the spine a bit more, because the scanner has a very low depth-of-field and loses focus in the gutter: phab:F34753564. This is hard on a flatbed, but if you're hoping to scan a lot, you may consider a book scanner with an "edge bed" like an OpticBook (I don't know if that's actually any good in terms of image, I just know the 3600 model is cheap on eBay, I don't have one myself). Alternatively, the time-honoured DIY method is a (good) camera on a tripod and a sheet of glass to flatten the page. This is slow and fiddly, but gets excellent results, and if your camera, lens and lighting is good it is probably better in terms of colour reproduction than whatever manky electronics they shove into a consumer-grade scanner. I wouldn't want to scan a whole book like that, but maybe it's practical for only the images. The next step up is a v-cradle scanner like the IA themselves use, but that's some serious DIY unless you're really scanning a lot of books. The actual optical setup is still a real camera + glass sheet, it's just a question of throughput at that stage.
  • Do try to rotate the files before uploading. If you're already batch-converting JP2→PNG with ImageMagick, it's in the same command: mogrify -format png -rotate 90 *.jp2
  • You do not need special versions for Tesseract - it has a built-in binarisation step that will handle these images perfectly well. It's when the image has poor contrast between text and background (e.g. dark paper, light print, bleed-though, bad scanning or something like that) that you might consider a pre-OCR processing step.
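  For that rare poor-contrast case, a minimal pre-OCR sketch using Pillow (the threshold value is just an example and would need tuning per scan):

from PIL import Image, ImageOps

def binarise_for_ocr(in_path, out_path, threshold=160):
    """Boost contrast and binarise a page image before OCR.
    Only worth doing when Tesseract's own binarisation struggles
    (dark paper, faint print, bleed-through)."""
    img = Image.open(in_path).convert("L")    # greyscale
    img = ImageOps.autocontrast(img)          # stretch the histogram
    img = img.point(lambda p: 255 if p > threshold else 0)
    img.save(out_path)

# e.g. binarise_for_ocr("page_017.png", "page_017_bw.png")
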
That said, except for the gutter, the scans are very very good indeed, right down to the printing "dots". For "copyright clearance", the best forum is probably WS:CV. Inductiveloadtalk/contribs 07:52, 18 November 2021 (UTC)[reply]
I have a great camera, (I worry sometimes that it is worth more than me!) and I saw the howto at IA but maybe I should get my brother onto the hardware. My inter-library loans will end soon though.
Just for your growing ability to look at images and figure out what happened between inception and delivery: I blocked off an area of the scanner bed (using the dividers from a box of tea bags, actually) because the edges of the glass don't scan, even if their rules make it look like they do. I had the scanner software rotate the even numbered scans as I used the same area for both sides of the open book. My conversion script is just a format conversion, nothing more. So, half of them were scanner rotated only. The covers were apparently on the scanner bed in the right direction. To clarify: the only rotation done by me was via the scanner, rotating the even numbered scans.
My script for preparing the scans for Tesseract was the Gutenberg recipe. I was reminded of it by the scans at Hathi, which have been very clearly posterized in a truly harmful way to what had been beautiful little line drawings (some sniffles, a couple of tears shed, the suppression of a hatred for how this world works, etc.)
There is an interesting thing about this book. I have two copies: one is a ninth edition; the other, less old and less with blown reds...., is from Lippencott. Both from libraries. The unnumbered pages -- the pages in each are in a different order. All the worse because it is a poem I kind of know. So, </whine> and thanks for all the information and analysis!--RaboKarbakian (talk) 14:25, 18 November 2021 (UTC)[reply]
Yes, I can see you haven't rotated them, because they're (nearly) all sideways. My point is you can fix that in a few seconds with ImageMagick: mogrify -rotate 90 *.png, or even do it in the same command as the format-shift to PNG.
Generally, I'd say don't use any processing software that comes with a scanner, because it's pretty universally junk. Scanners are for getting the images onto a computer where you can handle them with real tools.
You do not need to process these particular images at all to feed them into Tesseract - that's only needed if the default binarisation fails. tesseract image.png - -l eng works just fine. I'd say you could try it for yourself with the OCR tool like this, but since the image is sideways, it won't work. Inductiveloadtalk/contribs 14:40, 18 November 2021 (UTC)[reply]
Also, should 15,16 and 33 be in that category, or are they just gaps in the scan image numbering? Inductiveloadtalk/contribs 18:04, 18 November 2021 (UTC)[reply]
I mentioned that the two books did not match and that there are no page numbers. I compared the books and the book I scanned was clearly out of order. So, I reuploaded into the namespace and fixed it -- however, for whatever reason, the correctly ordered book was two pages less than the book I scanned. The whole process was very disturbing. The files can be renumbered if necessary (I started again from the end when I got a little mess up....) I think that the one black and white plate is actually an end page, and if I were to organize them, I would have put the last color image before the last text page, but this is the one book matching the other.--RaboKarbakian (talk) 20:35, 18 November 2021 (UTC)[reply]
There was never a 33. That is an actual mistake I made.--RaboKarbakian (talk) 20:39, 18 November 2021 (UTC)[reply]
The question is: are the files on Commons a complete set and in the correct order if sorted numerically? Inductiveloadtalk/contribs 22:30, 18 November 2021 (UTC)[reply]
To the best of my knowledge, Yes. Also, thank you so much for your time and patience and your sharing of knowledge thoughout this endeavor of mine.--RaboKarbakian (talk) 00:33, 19 November 2021 (UTC)[reply]
Here ya go! Index:The Night Before Christmas - 1915 - Moore.djvu. Inductiveloadtalk/contribs 10:40, 22 November 2021 (UTC)[reply]

I completely missed this!! I had to get slapped around at the commons before I saw it, even. I hate it when I am the one who sucks. So, thank you so much! I got uploading to do, if they would stop slapping me around at commons.--RaboKarbakian (talk) 17:03, 29 November 2021 (UTC)[reply]

out of order books

I don't know where to take this so, please allow me to just spew here. I had those two Night Before Christmas books, and the one I scanned was out of order by the words. I think it was in order, however, for the pictures. It is early in my day and I am not together enough to look at it.

The pictures in The Night Before Christmas (Rackham) don't match the words. I am thinking about redoing it, in my User space, not scan backed, so the pictures can go with the right words. It is a little jarring to my sensibility as it is.

Also, post transcribing, I think the 1915 version must have been beautiful, where this old (probably 1931) scan I used looks like a reassembled thing. This scan I made, is just another fuzzy proof of the earlier edition.</spew>--RaboKarbakian (talk) 13:46, 30 November 2021 (UTC)[reply]

@RaboKarbakian I am unclear if there is any action you'd like me to take. Inductiveloadtalk/contribs 13:47, 30 November 2021 (UTC)[reply]
I don't know the reason I needed to type words about this, but I did. This was the best place to type them. I should have typed "thanks for the djvu", etc. The good djvu got me thinking. No action from you required. Thanks for the consideration.--RaboKarbakian (talk) 13:53, 30 November 2021 (UTC)[reply]

Error Message on MC

I've frequently been getting this message "The time allocated for running scripts has expired." on the main MC challenge page and none of the books are showing. Could you please take a look? Languageseeker (talk) 23:39, 18 November 2021 (UTC)[reply]

@Languageseeker hmm, I guess the accretion of new indexes into such a large MC is pushing the limit for the Lua stuff. Inductiveloadtalk/contribs 14:08, 19 November 2021 (UTC)[reply]
@Languageseeker FYI this is (hopefully) fixed by phab:T296092 and a backport deployment of the fix will be done tomorrow. Inductiveloadtalk/contribs 18:41, 21 November 2021 (UTC)[reply]
Wow, that's so fast. Languageseeker (talk) 22:20, 21 November 2021 (UTC)[reply]
@Languageseeker well...let's see if it works first! ^_^ Inductiveloadtalk/contribs 22:34, 21 November 2021 (UTC)[reply]
@Languageseeker it is now deployed. The MC pages are still not blazingly fast, but they seem substantially better, and I think we shouldn't have issues with challenges around the 50 index mark any more. Inductiveloadtalk/contribs 12:54, 22 November 2021 (UTC)[reply]
Definitely feels faster! Languageseeker (talk) 21:19, 22 November 2021 (UTC)[reply]

Hanging indent

Hi. Do you have any suggestion on how to manage the indentation e.g. in the first 2 lines of Page:Dictionary_of_National_Biography._Errata_(1904).djvu/296? The first line could be managed with {{hi}} but what about the second? If there are no existing templates that can be simply combined without being too hacky, do you have any suggestions for a custom template? Also considering that it would be used all over the place. Thanks Mpaa (talk) 22:11, 19 November 2021 (UTC)[reply]

@Mpaa hanging indents are fundamentally a hack of a negative text-indent and a padding or margin to give the first line a space to "hang" into. Usually the two are the same (but one is negative). So what it looks like you need there is a padding greater in magnitude than the negative text-indent:
<div style="border: 1px solid green; padding-left:4em; text-indent:-2em;">
{{lorem ipsum}}
</div>


Inductiveloadtalk/contribs 18:39, 21 November 2021 (UTC)[reply]
Thanks. A change in {{tl|hi}} to set "text-indent" independently would do but I can't see a way of keeping a nice interface and compatibility. Maybe a dedicated template would be better then. Mpaa (talk) 20:32, 21 November 2021 (UTC)[reply]

Switching from px to em

I didn't dismiss your advice about using "em" instead of "px". At first when I implemented it, the images were smaller than what I was used to. Then you reminded me about adding "!important" which made a difference, but still the image is 8 pixels smaller when converting pixels to em and measured with a pixel ruler. Could you please look at this page with the two images and tell me what I am doing wrong? Same images two sizes.— Ineuw (talk) 04:30, 21 November 2021 (UTC)[reply]

It looks like the font size is actually set to 14px because there's a global CSS rule: .vector-body { font-size: calc(1em * 0.875); }. 14px * 32 = 448px, which is indeed the width of the box.
Exact sizing on the order of 10% is not incredibly important anyway, because it strongly depends on the user-agent. You might see it as 14px = 1em in this exact case, but that presupposes a "base" ratio of 1rem = 16px, which is only a common default on desktop browsers and may be wildly different elsewhere (even on desktop, if someone has changed their font size). Inductiveloadtalk/contribs 18:36, 21 November 2021 (UTC)[reply]

ppoem and left overfloat

Module:Ppoem uses %S to match the text before a <<<, meaning you can't overfloat a string with spaces in it. Is that deliberate? Xover (talk) 18:25, 21 November 2021 (UTC)[reply]

@Xover no, I don't think it was. Just didn't think of that. Inductiveloadtalk/contribs 18:27, 21 November 2021 (UTC)[reply]
Looks like you quickly run into trouble with the 2em left/right gutters when you stuff arbitrary strings (i.e. wider than 2em) into these. Not sure whether that's "Don't do that then.", adding template params to control the gutter widths, or pointing the issue to index CSS. Possibly we could guesstimate the needed width based on the string length and handle it automagically, but that sounds… hacky. The one I ran into is a once-off that can squeak by as is, and I haven't seen that in any of the other works I tested ppoem with, so I'm going to leave it to simmer for a bit. Xover (talk) 20:36, 21 November 2021 (UTC)[reply]
Note before adding template params: you can also control the gutters with CSS using ws-poem-left-gutter and ws-poem-right-gutter. Maybe params are better, but like you say, let's see what boils over! Inductiveloadtalk/contribs 22:24, 21 November 2021 (UTC)[reply]

Add OCR to Index:An American dilemma the Negro problem and modern democracy (First Edition).pdf

Could you add an OCR layer to this text? It's slated for the December MC. Thanks. Languageseeker (talk) 04:40, 26 November 2021 (UTC)[reply]

@Languageseeker I can do it, but could you please do the page list first? DLI books are often pretty poor scans, so I'd rather not convert it and only then find that there are pages missing or something. Also the easiest thing to do for these is to re-import as a DJVU with https://ia-upload.wmcloud.org, which will also do Tesseract OCR on the way through, so the result will be the same as if I did it. Inductiveloadtalk/contribs 17:06, 29 November 2021 (UTC)[reply]
Hmm, OK so actually it does not look like that import is going well at all! The DLI books are such a mess: they import the PDF and then the IA extracts to JP2, but because they're going "backwards" from PDF to JP2, the files end up encoding a ton of compression noise resulting in a >2GB tarball. I did think the IA import would work though, if it doesn't, that should be fixed. I'm downloading the PDF now (taking a while at 125kB/s): I'll convert/OCR it and upload if the IA-Upload does indeed fall over. But I'd still like a pagelist if you could :-) Inductiveloadtalk/contribs 17:56, 29 November 2021 (UTC)[reply]
I decided to scrap the idea. It seems like too much work for a poor quality scan. HT also has scans of the same edition, but they are behind a protection wall. I've requested that they remove the protection. Let's see how that goes. Sorry for the bother. Languageseeker (talk) 21:40, 29 November 2021 (UTC)[reply]
@User:Languageseeker it's not especially hard to convert, just takes time to download. I don't think HT usually respond to such requests (oddly enough, Google are good about that) but do let me know if they do. In the meantime, I'll happily upload a DjVu from one of the DLI scans if you can do the pagelist and let me know if it's complete. Inductiveloadtalk/contribs 21:48, 29 November 2021 (UTC)[reply]
Looks like the scan is missing some pages. It's probably best to let this idea rest. Languageseeker (talk) 22:00, 29 November 2021 (UTC)[reply]
OK, give me a ping if you get hold of a complete scan (or a set of scans that can be patched into a complete set) and I'll see what I can do. Inductiveloadtalk/contribs 22:16, 29 November 2021 (UTC)[reply]
It turns out it was the opposite problem: four duplicate pages. I corrected the page list. I haven't flipped all 1,500+ pages, but a sampling indicates that the scan is complete. Languageseeker (talk) 05:30, 30 November 2021 (UTC)[reply]
OK, dupes removed and OCR added. File now at Index:An American dilemma the Negro problem and modern democracy (First Edition).djvu.
You do not usually need to flip every page - you can generally tell with good confidence that the scan is complete if the page numbering is correct at the end of the file. If pages are missing or duplicated, the page numbers will be out of step. In this case, you can tell the pages are probably correct because Page:An American dilemma the Negro problem and modern democracy (First Edition).djvu/1541 is correctly numbered as 1483. Of course, pages could still be jumbled or duplicates balance out missing pages, but it's very very rare for that to happen "perfectly" so that the pages still line up after the defects. Inductiveloadtalk/contribs 10:14, 30 November 2021 (UTC)[reply]
Thank you! I might run it for Jan 2022 because December is quite crowded already. Thank you for the information about the pages. I'll keep it in mind for the future. Languageseeker (talk) 22:20, 30 November 2021 (UTC)[reply]

Greek template and serif font display

Hi, any idea why Greek fonts (using the {Greek} template) no longer display with serifs? It started happening a few days ago. DivermanAU (talk) 03:53, 30 November 2021 (UTC)[reply]

The serifs disappeared for me a month ago, when the template was altered, and returned when I corrected the template to what it was before. Since they're now displaying for me, but not for you, then I would first suggest clearing your browser cache. --EncycloPetey (talk) 04:43, 30 November 2021 (UTC)[reply]
It's also sans for me if I remove my personal CSS, because the first in the old Template:Greek/fonts.css list that I have is "DejaVu Sans" (I imagine I share this with all Linux users, Windows users without special fonts installed will probably get Arial Unicode MS, but I'm not sure). As usual, a knee-jerk reversion as a first act is not particularly constructive. A constructive thing to have done here would have been to say what font your browser was actually using from the "styles.css" CSS and we could have addressed it properly.
@DivermanAU, please liaise directly with @EncycloPetey to find a font ordering that works for you both and please also bear in mind that most of the fonts in the list are not installed by most users. I have my own CSS anyway, so Works For Me (TM) whatever you do. Inductiveloadtalk/contribs 10:04, 30 November 2021 (UTC)[reply]
@EncycloPetey do you intend to address @DivermanAU's problem? Reverting something implies to me that you are willing to take some level of responsibility for it, and I don't want to get in your way if you feel you have a better solution. Inductiveloadtalk/contribs 08:45, 2 December 2021 (UTC)[reply]

We need some stats, stat!

Well, or not so "stat". But somewhere (MC summary? Some diff I saw somewhere in any case) you referenced the phetools page stats as a point of reference for the MC stats. So before the whole matter drops from my frazzled mind, I thought I'd mention that I have on my plan doing some work on the stats code in phetools at some point in the not too distant future (maybe). Prime mover is improving the graphs in various ways, and secondary is cleaning up the way the stats are persisted (it's currently dumping a stringified Python datastructure to a text file, and reading and exec()ing it on next run). But once I go digging there may be opportunities for other improvements, such as anything the MC might need. If you give me a wishlist I can try to keep it in mind whenever I get around to that project (over the Christmas hols at the earliest, and absolutely no promises on anything). Doing all the MC stats in phetools would probably require more "on-wiki knowledge" than is sane to implement there, but anything generic / cross-project-applicable that would help or remove friction is fair game. Xover (talk) 06:41, 2 December 2021 (UTC)[reply]

@User:Xover that's a kind offer. I don't think the MC actually needs much in the way of stats support from Phetools, the bot is chugging along happily enough.*
What I do actively miss is the ability to get the "uplifts" for the whole wiki on a year-by-month and month-by-day basis. For example, if I want to check the figures for November, I have to check on 1st Dec (and even then, that's only actually accurate for 30-day months).
* The change-tag-based progress history API will make it much easier in future to get change histories for sets of pages, but that's stuck in review/deployment hell, so who knows.
As for getting the sets of pages, the API for getting all pages in an index and using them as a generator exists now (docs here), but hasn't been deployed this week as expected because the RelEng folks are "distracted" and nothing is being deployed. Inductiveloadtalk/contribs 08:59, 2 December 2021 (UTC)[reply]
This is nice, it will simplify this a lot: https://github.com/wikimedia/pywikibot/blob/master/pywikibot/proofreadpage.py#L910, after adding support for the new API in pywikibot. Maybe during Xmas holidays ... :-) Mpaa (talk) 22:51, 2 December 2021 (UTC)[reply]
@Mpaa yep, that was part of the motivation. Also PWB can now access index fields as JSON which might help too (mw:Extension:ProofreadPage/Index data API). Inductiveloadtalk/contribs 23:02, 2 December 2021 (UTC)[reply]
Ok, more work then! Backward compatibility, especially for the first one, will be needed to support old wikis. Mpaa (talk) 23:07, 2 December 2021 (UTC)[reply]
@Mpaa Out of interest, who is using ProofreadPage outside of the WMF deployment zone (and therefore are on older versions)? Also, if you have smart ideas about useful API stuff, there's a whole column on Phab for it, and I'll be happy to try to make a dream come true if I can. Inductiveloadtalk/contribs 23:12, 2 December 2021 (UTC)[reply]
In practice I guess no one, but in my experience I always got comments about compatibility when adding stuff to PWB, as PWB supports a certain range of wmf-versions. Sure, I will keep in mind the API stuff. Happy to see the Extension has some more people to help Tpt. Mpaa (talk) 23:25, 2 December 2021 (UTC)[reply]

Casing in {{smallcaps}}

Hello,

Thank you for the tip, it is not always easy to know the best way to make the text look good with all those templates lying around. I am actually pretty proud of myself for remembering the existence of {{fraktur}}! ^_^ Ælfgar (talk) 21:11, 2 December 2021 (UTC)[reply]

@Ælfgar the learning curve is pretty vertical, isn't it? Just thought I'd let you know sooner rather than later.
BTW, normally you should reply to messages where they are left, otherwise it's just confusing. In this case, just reply on your talk page and I'll see it in my watchlist. Or you can ping me with @[[User:Inductiveload]] and I'll get a notification (just like you will get when I save this, because I pinged you at the start of the reply). Inductiveloadtalk/contribs 21:17, 2 December 2021 (UTC)[reply]