User talk:Samwilson



[OMG a nude page, let me help resolve that]

I just saw the Lua extension mw.wikibase.getEntityIdForTitle announced, and if I am not mistaken it could be a joyous little bundle of helpfulness for us.

Here I am thinking where we have an author page, and a related biographical page in main ns, and working out whether we can poke a wikipedia = parameter on the respective main ns page, or maybe automating a link; similarly, I see the potential for us to more readily get some bot action to better apply "main subject (P921)" at Wikidata for our biographical works. Am I reading the function properly? — billinghurst sDrewth 22:37, 16 April 2018 (UTC)

@Billinghurst: Interesting! So you mean create a link from the NS0 page of e.g. a biography chapter to the Author NS of the bio's subject? If the bio has a P921, couldn't we link via that (i.e. bio page → sitelink → P921 → Qxx → sitelink → Author page)? I'm not quite getting when we'd need to do a page title look-up... or do you mean, as a means to find unlinked articles? That must be it. So we'd do a getEntityIdForTitle('NS0 Page Name') and see if it comes up with an instance of person, and if it does we'd add some thing to alert editors here to the fact? Sam Wilson 06:27, 17 April 2018 (UTC)

Books & Bytes - Issue 27

The Wikipedia Library


Books & Bytes
Issue 27, February – March 2018

  • #1Lib1Ref
  • New collections
    • Alexander Street (expansion)
    • Cambridge University Press (expansion)
  • User Group
  • Global branches update
    • Wiki Indaba Wikipedia + Library Discussions
  • Spotlight: Using librarianship to create a more equitable internet: LGBTQ+ advocacy as a wiki-librarian
  • Bytes in brief

Arabic, Chinese and French versions of Books & Bytes are now available in meta!
Read the full newsletter

Sent by MediaWiki message delivery on behalf of The Wikipedia Library team --MediaWiki message delivery (talk) 14:49, 18 April 2018 (UTC)

Your feedback matters: Final reminder to take the global Wikimedia survey

WMF Surveys, 00:43, 20 April 2018 (UTC)

Unused files as a list?

Do you know a way to manipulate Special:UnusedFiles so I can get it as an easy list? There are a string of files there that I know that I can straight out delete, though how to get it as a list to easily manipulate in bite size chunks is just not obvious. It is not even obvious that you can pull it from the API, not that I can generate simple text lists from the API anyway — billinghurst sDrewth 04:06, 16 May 2018 (UTC)

@Billinghurst: It doesn't look like it. That special page isn't transcludable even, and it's constructing the database query itself so I suspect the same query isn't done anywhere else (or we'd be reusing it). Also it's the only place mw:Manual:$wgCountCategorizedImagesAsUsed is used. What sort of list are you trying to build? It probably wouldn't be too hard to add transcluding support, if that'd help. Sam Wilson 04:38, 16 May 2018 (UTC)
There are works there that have been completed where the original image has been cleaned/gleaned/screened and uploaded to Commons. So we have the residue images to cleanse, and getting these url by url is a PITA. Getting a list, checking the work completion, and zapping more collectively is bettererer. Noting that prefix lists are unreliable in case one/some aren't done. — billinghurst sDrewth 05:23, 16 May 2018 (UTC)
Dropped the problem into phab:T194865. — billinghurst sDrewth 01:44, 17 May 2018 (UTC)
Note that a file linked via {{raw image}} is still considered 'unused'. In pywikibot: python scripts/ -unusedfiles.— Mpaa (talk) 17:36, 19 May 2018 (UTC)


Hi. Just in case you have not been notified about this: . It is happening quite often recently. Bye— Mpaa (talk) 20:35, 18 May 2018 (UTC)

Books & Bytes – Issue 28

The Wikipedia Library


Books & Bytes
Issue 28, April – May 2018

  • #1Bib1Ref
  • New partners
  • User Group update
  • Global branches update
    • Wikipedia Library global coordinators' meeting
  • Spotlight: What are the ten most cited sources on Wikipedia? Let's ask the data
  • Bytes in brief

Arabic, Chinese, Hindi, Italian and French versions of Books & Bytes are now available in meta!
Read the full newsletter

Sent by MediaWiki message delivery on behalf of The Wikipedia Library team --MediaWiki message delivery (talk) 19:33, 20 June 2018 (UTC)

Meeting followup

Hi Sam, Thanks for being there today. Lots of stuff half heard, half understood, to try to follow up on. One thing you mentioned was some form of mapping using wikipedia when data have been uploaded to the commons. I was curious about this as I dislike my Rgooglemaps: they are too fuzzy. Nor am I mad about my Australian outline maps (produced using SAS), so another technique would be good.... MargaretRDonald (talk) 13:27, 27 June 2018 (UTC)

@MargaretRDonald: There's a new thing called Kartographer that can show data on maps pretty easily. For example, at right is the Cuscuta australis data we were looking at yesterday. The colours and styles and things can all be customised, and the data doesn't have to live in the wiki page (as I've done in this example). —Sam Wilson 01:25, 28 June 2018 (UTC)
@Samwilson: Thanks for this. (Only just spotted...) MargaretRDonald (talk) 02:07, 6 July 2018 (UTC)

@Samwilson: Sorry to be so thick. But here in your text you have listed all the co-ordinates... and of course, the map is embedded in the page.. Writing code to generate the mark-up looks a smidgin ugly. So I am not quite sure how this is easier, or conceptually better from a wikipedian point of view (?) MargaretRDonald (talk) 02:14, 6 July 2018 (UTC)
No, the idea would be to include the coordinates (in KML format) in a template in the manner of e.g. wikipedia:Template:Attached KML/High Street, Fremantle. Then, to update the range map, only that template would need to be changed and the article map would update automatically from there. I'm not sure if it is easier, but it does make the map zoomable, and perhaps is quicker than creating separate raster map files and uploading them. Just an idea though! :) —Sam Wilson 06:50, 6 July 2018 (UTC)
@Samwilson: Thanks for the explanation, Sam. MargaretRDonald (talk) 16:54, 20 January 2020 (UTC)

seeing other wikisources

@Samwilson: Hi, Sam. It would be very nice if one could see all the corresponding wikisource things on the left as one can in wikipedia or as one can in wikidata. I am constantly seeking other language sources for botanical stuff and would be nice to be able to navigate (relatively) easily to other language sources..... Any thoughts? MargaretRDonald (talk) 02:04, 6 July 2018 (UTC)

@MargaretRDonald: Yes, this is a definitely wanted thing, and is being worked on as Phabricator:T180303. The trouble with Wikisource interlinking, compared to other projects, is that works in different languages don't get directly linked to the same Wikidata item, but rather each get their own (which has a 'edition or translation of' property that links to the unifying work-level item). —Sam Wilson 06:53, 6 July 2018 (UTC)
@Samwilson: Hmmm. (I see) I look forward to all those clever persons making it happen sometime.... Cheers, MargaretRDonald (talk) 07:14, 6 July 2018 (UTC)

Living authors category again

Hi Sam. First I would like to thank you a lot for handling the floruit problem at the template {{Author}} and thus solving partly the Living people category.

There are also some authors who do not have the floruit property filled at Wikidata, because they are not known because of a one-date event, but who were known for a longer time. Such people can have Wikidata properties "work period (start)" and "work period (end)" instead. An example of this is Author:Mordach Mackenzie (Q56612310) whose birth and death dates are unknown and who is known for his work between 1746 and 1764. Do you think it would be possible that a) the author's page at Wikisource could take these dates from Wikidata and display them as "fl. 1746–1764" and b) remove the authors whose "work period (end)" was more than e.g. 90 or 110 years ago from the Living people category too?

I am writing you because you are the only one here I know that can handle such things (though I believe there are more people like that). However, it is not of the highest importance, so if you do not have enough time, it can wait. Thanks. --Jan Kameníček (talk) 11:30, 18 September 2018 (UTC)

Yes, that sounds like a great idea! I did see your comment on that other page; sorry I didn't reply yet. I'm keen to help, not sure when I'll find time, but it's conceptually the same thing we're already doing but just with a different property, so it shouldn't be too hard. There are currently 7 failing tests that I want to fix up before embarking on any new features though, so I might try to do them first. Will keep you posted! Sam Wilson 03:15, 19 September 2018 (UTC)
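The two behaviours requested above can be sketched roughly as follows. This is only an illustration in Python, not the code of the {{Author}} template or its module: the function names, the 110-year default, and the choice to keep an author in the category when no end year is known are all assumptions made for the example.

```python
from typing import Optional


def floruit_label(work_start: Optional[int], work_end: Optional[int]) -> str:
    """Build a 'fl.' label from work period years (hypothetical helper)."""
    if work_start is not None and work_end is not None:
        if work_start == work_end:
            return f"fl. {work_start}"
        return f"fl. {work_start}–{work_end}"
    if work_start is not None:
        return f"fl. {work_start}"
    if work_end is not None:
        return f"fl. {work_end}"
    return ""  # nothing known, so show no floruit at all


def may_be_living(work_end: Optional[int], current_year: int,
                  cutoff_years: int = 110) -> bool:
    """Decide whether an author with no death date stays in 'Living people'.

    Assumption: with no known end of work period there is no evidence
    either way, so the author is left in the category.
    """
    if work_end is None:
        return True
    return current_year - work_end <= cutoff_years


print(floruit_label(1746, 1764))
print(may_be_living(1764, current_year=2018))
```

For the Mordach Mackenzie example, the work period 1746 to 1764 gives the label "fl. 1746–1764", and an end year that far back falls outside either a 90-year or a 110-year cutoff.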

PageCleanUp feature request


Just a note to make a record of our recent conversation about my feature request for your very useful PageCleanUp.js tool:

If a full stop (period) is followed by a lower-case letter:

Some text. then some more

then it should probably be a comma:

Some text, then some more

If a comma is followed by a capital letter:

Some text, Different text

then (proper names notwithstanding) it should probably be a full stop:

Some text. Different text

If this is not a major issue for most OCRd text, perhaps a separate script would be better. What do you think? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 10:31, 3 October 2018 (UTC)

Also, perhaps the script could fix ligatures, like the "fi" and "fl" in "magnificent power of flight"? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 16:14, 5 October 2018 (UTC)
  • @Pigsonthewing: dots and commas done, good idea! As for ligatures, Wikisource:Style_guide/Orthography#Ligatures suggests that we not use them as search engines struggle. I suspect that's wildly out of date. We do avoid e.g. the long 's', because it's "just" orthography and so not relevant to the text. Also, there are ligatures (e.g. st) that don't exist in many fonts at all. Sam Wilson 22:58, 9 October 2018 (UTC)
Sorry if I wasn't clear; I meant the script could change from ligatures generated by OCR to regular letter pairs. Thanks for the punctuation feature. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 00:25, 10 October 2018 (UTC)
@Pigsonthewing: Oh! Ha, yes I see now. Done! :) Sam Wilson 05:23, 10 October 2018 (UTC)
That is going to save me a lot of dull drudgery. Thank you! Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 10:42, 10 October 2018 (UTC)
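For readers curious how the punctuation and ligature rules from this thread might look in code, here is a minimal sketch. It is written in Python for illustration and is not the PageCleanUp.js source; the function names and the exact regular expressions are assumptions. As noted in the request, proper names after a comma and abbreviations such as "e.g." will produce false positives, so the output still needs a human check.

```python
import re

# OCR ligature code points mapped to plain letter sequences
# (Unicode Alphabetic Presentation Forms, U+FB00 to U+FB04).
LIGATURES = {
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
}


def fix_punctuation(text: str) -> str:
    """Apply the two heuristics discussed above."""
    # Full stop followed by spaces and a lower-case letter: probably a comma.
    text = re.sub(r"\.( +)(?=[a-z])", r",\1", text)
    # Comma followed by spaces and a capital letter: probably a full stop.
    text = re.sub(r",( +)(?=[A-Z])", r".\1", text)
    return text


def expand_ligatures(text: str) -> str:
    """Replace ligature characters with regular letter pairs."""
    for ligature, letters in LIGATURES.items():
        text = text.replace(ligature, letters)
    return text


print(fix_punctuation("Some text. then some more"))
print(fix_punctuation("Some text, Different text"))
print(expand_ligatures("magni\ufb01cent power of \ufb02ight"))
```

The two substitutions cannot undo each other, because the first only produces a comma before a lower-case letter and the second only fires before a capital.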

mediawiki-feeds 503?

Hi Sam, thanks for all the stuff you do!

Realized I wasn't subscribed to the Signpost anywhere and when I tried the RSS feed, Feedly said it couldn't reach, and clicking the feed got a 503 from Toolforge and it said you're the mediawiki-feeds maintainer.

Thought you might like to know. Hopefully it's just an old link or something else simple. John Abbe (talk) 17:32, 2 June 2019 (UTC)

@John Abbe: Thanks for telling me about this! It made me realise that that tool isn't in my list of monitored tools, so I hadn't seen that it was down. That's fixed now, and so is the bug that was causing it to fail on the Signpost feed, and the tool is back online. See how it fares, and ping me with any dramas. :) Thanks! Sam Wilson 02:18, 3 June 2019 (UTC)
Sweet! And thx for the quick fix. John Abbe (talk) 05:46, 5 June 2019 (UTC)

Requesting import of "Links count" gadget

Any chance you could help with: Wikisource:Scriptorium#Requesting import of "Links count" gadget, please? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 19:23, 20 June 2019 (UTC)

curly quotes script

Hi -- thanks for the very useful-looking script. How do I install it at my common.js? I tried adding importScript('User:Samwilson/CurlyQuotes.js'); but that didn’t work. Thanks for help Levana Taylor (talk) 16:05, 30 August 2019 (UTC)

@Levana Taylor: Use this in your common.js:
And let me know if you find any bugs with it! :-) It's adapted from […]. Sam Wilson 01:58, 31 August 2019 (UTC)
Hmm… nice interface, mostly works, but I’ve already found some issues. Most notably, it’s not always correctly noticing paired apostrophes to leave straight: try it on this page, for example. Also it doesn't know to leave double-quotes alone if they’re inside angle brackets (I use <section begin="s1" /> for section breaks, for example). And, I’m sure this isn't the only case it doesn’t get right, but when you have italics inside double quotes ("Really?!") it doesn’t alter the double quotes. Here’s a thought for long-term development: since you’ll never be able to get it to work absolutely perfectly, the thing that’d make it really easy to check the results is if it highlights quotes in 3 different colors after finishing work, one for left, one for right, one for straight. Single and double quotes could be the same color: no need to have six! Levana Taylor (talk) 05:05, 31 August 2019 (UTC)
Hey, sorry if that last comment was too negative! I’ve been using the script a lot and finding it extremely useful. I do have a list of stuff it isn't catching, though. Lemme know if you want it. Levana Taylor (talk) 04:53, 5 September 2019 (UTC)
@Levana Taylor: Oh cool! Yes, please. I'm sure there are going to be a bunch of things we can't handle, but it'd be good to try. :) The quotes thing I'm looking at now, but I think the highlighting colour thing might be a bit harder. Or do you just mean in the preview, not the editing box? That might be easier. Anyway, thanks for the feedback! I'll try to improve the script. Sam Wilson 05:04, 5 September 2019 (UTC)
@Levana Taylor: Have you found Wikisource:WikisourceMono by the way? That helps with more easily seeing the different characters while editing. Sam Wilson 05:55, 5 September 2019 (UTC)
Yes, highlighting in preview would be helpful, even if it's not possible in the edit box. I find the font I'm using plenty readable, but the point is that you want the left-right pairs, or absence thereof, to jump out at you so your mind doesn't overlook them.
Anyhow, I was wondering why you don’t have some simple rules like double quote at start of line is left, at end is right; single quote between two letters of the alphabet is right. Must be a reason I’m sure! As for suggestions:
  1. My biggest suggestion for improvement would be to drop all the things the script does with dashes and stick to just quotes. I never find the dashes useful and they are constantly messing things up, like pagenames that contain hyphens, and the <!-- comment markup (though I guess you must be finding the dash alterations useful since you put them in!)
  2. Bug noted: paired apostrophes before s and d are not being correctly interpreted
  3. The French d’ and l’ are major list items not yet being recognized. Then there’s ’twould - ’twill - ’twere - ’tisn’t - ’twasn’t (etc.) - ’midst - ’neath - ’bout - ’fraid - ’nother - ’uns - People in the novel I’m reading keep saying ’Gad! and ’Pon my honour! Levana Taylor (talk) 06:48, 5 September 2019 (UTC)
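The simple rules suggested in this thread can be sketched as below. This is an illustrative Python sketch and not the CurlyQuotes.js implementation: the word list, the function name, and the regular expressions are assumptions. It covers only the cases raised here: a double quote at the start or end of a line, an apostrophe between two letters (which also handles the French d’ and l’), a leading apostrophe on a short list of elided words, and leaving alone both paired wiki-italic apostrophes and anything inside angle brackets.

```python
import re

# Hypothetical list of elided words that take a leading apostrophe.
ELIDED = ["tis", "tisn", "twas", "twasn", "twould", "twill", "twere",
          "midst", "neath", "bout", "fraid", "nother", "uns", "gad", "pon"]

TAG = re.compile(r"(<[^>]*>)")
# Apostrophe with a letter on both sides: don't, l'été, d'Artagnan.
INNER_APOSTROPHE = re.compile(r"(?<=[^\W\d_])'(?=[^\W\d_])")
# Apostrophe that opens an elided word and is not part of a '' pair.
ELISION = re.compile(
    r"(?<![^\W\d_])(?<!')'(?=(?:" + "|".join(ELIDED) + r")\b)",
    re.IGNORECASE,
)


def curl_line(line: str) -> str:
    """Curl only the quotes that the simple rules can decide safely."""
    pieces = TAG.split(line)
    for i, piece in enumerate(pieces):
        if i % 2 == 1:
            continue  # odd indexes are the captured <...> tags: untouched
        piece = INNER_APOSTROPHE.sub("’", piece)
        pieces[i] = ELISION.sub("’", piece)
    line = "".join(pieces)
    if line.startswith('"'):
        line = "“" + line[1:]
    if line.endswith('"'):
        line = line[:-1] + "”"
    return line


print(curl_line('"Don\'t," he said, "\'twould be odd."'))
print(curl_line('<section begin="s1" />'))
print(curl_line("''italic'' text"))
```

Quotes in the middle of a line are deliberately left straight here, since deciding their direction needs more context than these rules carry; highlighting whatever remains straight, as proposed above, would make those easy to review by hand.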

Cloud Services and Toolforge

You wouldn't happen to be familiar with Cloud Services (WMCS, CVPS), Toolforge, and related infrastructure bits? I have some kinda hacky local tooling for working with DjVu files and Tesseract and was toying with the idea of trying to set up some related utilities of possibly general usefulness. But right now I'm banging my head against the wall of insufficient documentation for the stuff the WMF provides for hosting such things. In other words, I'm looking for someone familiar with the setup that's willing to answer dumb questions and provide some hand-holding. --Xover (talk) 15:28, 17 September 2019 (UTC)

@Xover: We do have the shared account user:Wikisource-bot for bits, and there is a range of documentation at wikitech: and places. Mailing list is pretty good for support, and IRC can be useful (though you have to be in a good time zone). I am totally useless as a coder, though have managed to find handholders to allow me to bumble through. — billinghurst sDrewth 23:33, 17 September 2019 (UTC)
@Xover: Yes sure, I'd be happy to help. I'm reasonably familiar with Toolforge. What issues are you having? Sam Wilson 03:37, 18 September 2019 (UTC)
Well, mostly stupidity and documentation that seems to be written for a different audience than myself.
My vague ideas to begin with are a replacement (possibly temporary) for Phe's OCR gadget, and possibly a tool that'll take images from some source (IA id, URL, zip file, etc.) and spit out a DjVu with a text layer. There's some related stuff that might be relevant, like an easy way to add and remove pages from a DjVu, or to redact pages from a DjVu (typically for copyright reasons). Not sure what all would make for tools that are 1) a reasonable amount of effort to get working and 2) of sufficiently general utility to be worth it. A short term alternative for the OCR gadget is the primary motivation as that seems to be pretty critical for several contributors.
Right now I'm trying to figure out where and how it'd make sense to host something like that—Cloud VPS, Toolforge, or a third party general web host somewhere—and I'm just not finding the documentation that'll tell me what CVPS and Toolforge actually look like in terms of facilities, hosting environment, and so forth. As I said, the docs seem to be addressing different questions and for a different audience than me. So my first set of dumb questions / need for hand-holding is to figure out that.
What I'd need is:
  • A sufficiently Unix-y hosting environment. Fedora would be perfect, and RHEL or CentOS good second choices. Any good modern and not too esoteric Linux distro would probably be good enough, but experience tells me there are crucial differences in relevant sysadmin tools and package distribution/management systems and availability of packaged third-party software. Depending on what comes ready out of the box that may or may not be an issue.
  • A non-ancient version of Perl 5, with a reasonable set of standard modules installed. Given the rate of change in perl-land, I don't imagine any OS the infrastructure team are willing to host would contain a version that's too old. I don't currently need a lot of esoteric perl modules, but by experience I expect to need to be able to install at least some. For example, I believe HTML::Parser was dropped from the core modules so that's something that would need to be available through some method.
  • A not-too-esoteric CGI hosting environment. My experience is with Apache, possibly with mod_perl, but anything that supports Perl and can be tweaked for interactive performance would probably work.
  • Tesseract 4.1 installed and functioning.
  • GraphicsMagick in some recentish version. ImageMagick will probably do in a pinch, but my experience with GraphicsMagick is better.
  • The ability to have such software updated in some reasonable timeframe. Whether that's done by the sysadmins as part of the platform, or whether that's something I'd sysadmin myself using the package tools, isn't all that important.
  • I'm way past the age where I want to waste time compiling software from source, so I really really would prefer that can be handled through some kind of package system.
  • I'd need a moderately large amount of disk space to play with, and moderately performant too (for OCR, disk IO quickly becomes a bottleneck). Several tens of gigs at least for temporary stuff: purged, depending on the case, on the timeframe of hours up to weeks. A gig or so per DjVu file, and room for at least 10 jobs' worth of files sitting around, would be the minimum reasonable. More is better.
  • For the batch OCR stuff I'd eat all the CPU I could get too, but for the per-page stuff anything that isn't completely choked would probably work well enough.
  • It would be a bonus if I had easy read-only access to files from Commons and local files on enWS (and possibly the other WSes if anyone should want it). A virtual filesystem or something, preferably more performant than having to download a copy of the file over HTTP. If any writing to a wiki is eventually needed that'd go through the API, most likely using OAuth, so read-only would be fine. Not a requirement by any means, but it'd make the DjVu manipulation stuff much more elegant and efficient if I ever get around to it.
  • Network access to Commons and enWS for downloading files, and for talking to the API if it becomes relevant.
  • General internet access to download stuff from IA, Hathi, etc. if that becomes relevant.
  • No immediate need for access to database dumps or similar: everything I have in mind is just crunching File: files.
  • No immediate need for database facilities: I might eventually need something to track batch jobs or whatever, but I'd get by with file-based solutions for a good long while before that became an issue.
  • If there is a batch system with oodles of compute resources that responds to "run this code on that data over there and notify me with the results once you're done" that'd be neat, but I'm not entirely sure that'd be more performant than doing it on whatever is hosting me directly. Possibly it'd enable parallel execution of large batch OCR/DjVu jobs if the tool is used a lot, that would otherwise need to be serialised, but I'm not sure the volume would be there to make that worth the effort.
  • If anyone actually started using a utility here I would want to add extra maintainers with OS-level access.
  • For source control and such I'd probably use Github (Gerrit looks completely impenetrable to me), and issue tracking either there or Phabricator, so no special needs related to that.
That's a rough braindump of the requirements. Given your knowledge of the facilities, what should I be looking at for hosting? By my guesses here, Toolforge might be both too constricting for the needs, and at the same time its advantages aren't all that relevant (I think I've grasped that Toolforge has DB dumps and similar already available, but as I don't need those…). If my vague understanding is anywhere near right, Toolforge is essentially a shared web host with some Wikimedia-specific facilities while Cloud VPS is just server hosting (that happens to be para-virtualised rather than bare-metal); but from there to the details that'd let me assess them against the requirements is a bit steep a climb right now. Anything I haven't thought of? Am I completely off my rocker? Should I go away and stop bothering you? :) Any help and hand-holding would be very much appreciated! --Xover (talk) 08:28, 18 September 2019 (UTC)
@Xover: Yep, your understanding of the shared-hosting or VPS is right. You could certainly do all you need on a VPS, but I think it sounds like you'll be fine with a Toolforge tool (although I'm not 100% certain off the top of my head of version specifics etc.). I recommend creating a tool account and seeing how you go. Lots of people use Github, so no worries there. Probably the most confusing thing for new toolforge users is the cronjob setup: basically, your cronjobs don't actually run things themselves, they just add a job to the 'grid', where it runs. In practice this just means a command has to use jsub. Anyway, I recommend a) creating a tool, b) trying to run what you need, and c) asking me when you hit an issue. Sam Wilson 23:16, 25 September 2019 (UTC)
Well, apparently Toolforge is a no-go. Is there any point requesting a dedicated VPS so I could do that stuff myself? I have no idea what the criteria are for getting one or whether the stuff I have in mind is even remotely what the CVPS infrastructure is intended for. Any suggestions or pointers would be much appreciated! --Xover (talk) 18:32, 6 January 2020 (UTC)
@Xover: Yes, I think if you've demonstrated that your requirements aren't met by Toolforge, then you should be able to request a VPS. Then you'll be able to install whatever you need. (Not that I'm completely familiar with the whole process, but that's my understanding.) Sam Wilson 00:03, 7 January 2020 (UTC)

Little problem on nl-ws

Hi Sam,

may I ask you to take a look at some strange problem, that happens on nl-ws?

It does happen only in one book, for instance at this page: s:nl:Pagina:Heemskerck op Nova Zembla.djvu/101. As you can see the header does not outline correctly. I have been trying all kinds of things. Nothing seems to help. It only happens in this book. In all other books on nl-wiki where we use the RH-template, it works fine, see e.g. s:nl:Pagina:De voeding der planten (1886).djvu/46. Can you explain why this happens to this book?

Many greetings, and looking forward to your answer, --Dick Bos (talk) 10:47, 2 October 2019 (UTC)

@Dick Bos: It looks to me like there are odd block-level elements being introduced where they shouldn't be. For example, the following part of s:nl:Sjabloon:RunningHeader (and the same applies to the right side):
Currently:
    -->|{{#if:{{{1|{{{left|}}}}}} |
<span style="float: left; display: block;">
Should be:
    -->|{{#if:{{{1|{{{left|}}}}}} |<!--
--><span style="float: left; display: block;"><!--
And also that there isn't a default value for the centre component (in {{RunningHeader}} here, it's a &nbsp;).
Actually, that template could be rewritten with block-level components and using flexbox, but that's another story! :-)
Sam Wilson 09:48, 5 October 2019 (UTC)
You were right! I usually copy this kind of templates from en-ws (I really don't understand a word of the code, to be honest), and apparently there had been an update of the template. Now that I copied the newest version to nl-ws, it is running perfectly! Hurray.....
We need someone with some technical knowledge to do this kind of maintenance work on the Dutch Wikisource! But unfortunately, activity on nl-ws is very low. Thanks for helping us! --Dick Bos (talk) 16:29, 7 October 2019 (UTC)
@Dick Bos: Oh good, I'm glad it works. There isn't really a good system yet for keeping imported templates up to date. One possible way could be that we add some of this functionality to the new Wikisource extension, because then it'd be on all Wikisources. That's going to take some more work though. For now, it's export/import and keep an eye on things. Sam Wilson 03:55, 8 October 2019 (UTC)
@Dick Bos: Can I recommend that you special:import templates (select "en" from dropdown) rather than copy and paste. 1) it brings a history and can actually bring other required components; 2) it allows, in future times, others to track and find what you were doing and reproduce at nl:special:log/import. — billinghurst sDrewth 23:33, 6 January 2020 (UTC)

Plain sister updates

Hi SW. [Happy early cricket season, hope your weather is better than mine at the moment] I am wondering whether you had been able to look at my thoughts on template talk:plain sister for an update to Module:Plain sister to automatically link articles to enWP biographies. I know that I lack the skills to make those changes, and wondered whether you had the skills for such a change, or whether we are needing to go searching outside. — billinghurst sDrewth 10:09, 9 November 2019 (UTC)

@Billinghurst: I've replied over there. I've resurrected some work I did last year on that, and it's now functioning. See what you think. I'm happy to make the changes and monitor things closely. Sam Wilson 21:28, 10 November 2019 (UTC)

validated index count discrepancy

Hi. Hope all is well out west.

  • Your tool says "This page presents the categorisation of the 3172 works on the EN Wikisource"
  • Category:Index Validated says "... pages are in this category, out of 3,442 total"

Which is correct? What sort of discrepancies would we need to identify to resolve the 270 gap? I cannot work out what to do with a json list (my uselessness) to make any comparisons within AWB or petscan:.

Noting that when I compare Category:Index Validated with Category:Indexes validated by date (3443) there are some discrepancies to resolve between those two, so the numbers above will probably have bumped around a little by the time you see this. — billinghurst sDrewth 02:51, 6 January 2020 (UTC)

  • @Billinghurst: Hello! That's interesting. :( My first thought is that the missing ones are not categorized (i.e. their index pages are in Category:Index Validated but their mainspace pages are not categorized). It could also be that the tool can't figure out what their mainspace pages are (it looks for links to a top-level mainspace page from the Index page, with a query a bit like this one).

    Sam Wilson 03:10, 6 January 2020 (UTC)

    Thanks. For the incompetent, would you be able to generate a wikipage or a petscan query (preferred as regeneratable), and I will take it from there to determine the issue. Some of this list will then be works not transcluded, and I will explore the remainder. I know that there are plenty without title links, it is one of Esme's traits. We should document that we wish for titles to be linked, as I don't always do it for other's works. — billinghurst sDrewth 03:30, 6 January 2020 (UTC)
  • @Billinghurst: I've been trying, but have not yet figured out a simple way. Will keep looking at it! The ws-cat-browser is also due for a rejigging I think, because we can now determine validated mainspace pages via Category:Validated texts, so it no longer really needs to go via the index page at all. Although, maybe it's good to keep it as-is, for helping to find discrepancies like this. Sam Wilson 00:45, 7 January 2020 (UTC)
    A SPARQL query in Petscan doesn't function?

    I would disagree that "category:validated texts" is a reasonable match. That category does not have a one-to-one relationship with Index: ns—DNB to works, or volumes of DNB to works, are one to many. Plus, it is grossly underpopulated. There is no easy means to populate it from either the work side or the index side; even then the root page and the index: page are usually not one-to-one either.

    Last time that I asked about flag addition I was told that there was no ready means to bot populate the flag via the available tools. blah blah blah blah... <sigh> — billinghurst sDrewth 04:30, 7 January 2020 (UTC)
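On the earlier question of what to do with a JSON list: once both lists refer to the same kind of page, finding the gap is a set difference. The sketch below is in Python and rests on assumptions, since the actual shape of the tool's JSON is not shown in this thread: it supposes a flat JSON array of page titles from the tool and a plain-text export of the category with one title per line. The function names are hypothetical.

```python
import json
from typing import List


def normalise(title: str) -> str:
    """Treat underscores and spaces as the same, as MediaWiki titles do."""
    return title.replace("_", " ").strip()


def missing_from_tool(tool_json: str, category_text: str) -> List[str]:
    """Return category members that the tool's list does not contain.

    Assumptions: tool_json is a JSON array of titles, and category_text
    holds one title per line.
    """
    tool_titles = {normalise(t) for t in json.loads(tool_json)}
    category_titles = {
        normalise(line) for line in category_text.splitlines() if line.strip()
    }
    return sorted(category_titles - tool_titles)


tool_list = '["Index:A.djvu", "Index:B_C.djvu"]'
category_list = "Index:A.djvu\nIndex:B C.djvu\nIndex:D.djvu\n"
print(missing_from_tool(tool_list, category_list))
```

Because the tool counts works while the category counts Index pages, the two lists would first need mapping onto one another (for example via the title link on each Index page) before a comparison like this says anything about the 270 gap.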

Index:East Anglia in the twentieth century.djvu extra image page, now text off by one

Hi SW. IA-upload bot generated the above work, and it seems to have inserted that random image page as the lead, the work shows the image on page 2 at IA. Now the text and scans are out by one. What is the best way to address/resolve? — billinghurst sDrewth 01:42, 20 January 2020 (UTC)

@Billinghurst: This seems to be some discrepancy with the book viewer at IA, because the imagecount attribute in the work's metadata says 672, but the book viewer is only showing 670 (there are non-book scans at front and back). I can't find anything in the metadata that explains how book viewer is making this decision; I guess it's in there somewhere, and ia-upload could use the same logic to exclude these pages. I don't have time right now to dig into it though. :( It looks like there's only a few pages proofread so far, so I guess it's a matter of resolving it manually. Sam Wilson 03:12, 20 January 2020 (UTC)
okay. FWIW The display for IA-upload bot, just showed the second scan page as the first page as the page to exclude. — billinghurst sDrewth 03:33, 20 January 2020 (UTC)
@Billinghurst: oh, hmm yeah that's annoying. It's because it uses the bookreader thumbnails and numbering system to get that image. :( I'll open an issue. Sam Wilson 03:40, 20 January 2020 (UTC)
Okay, what is your trick to get around 100MB? The PDF -> DJVU conversion pushed it over the upper size, though you clearly have a sneak means through. If you can apply the corrected file it is at toollabs:wikisource-bot/East Anglia in the twentieth century.djvu. — billinghurst sDrewth 12:10, 22 January 2020 (UTC)
@Billinghurst: Done. It's the chunked upload protocol (which is what the UploadWizard uses behind the scenes), which allows up to 2GB per upload. The last safety valve is server-side upload, which can be performed by some WMF staff (sysadmin, not dev, iirc) and bypasses all size restrictions, but that's most suitable for things like massive donations from some archive or library and not individual files. --Xover (talk) 12:29, 22 January 2020 (UTC)
Culled the file, two times misaligned by different processes. Will await other fixes, it isn't an urgent work. — billinghurst sDrewth 07:36, 23 January 2020 (UTC)
@Billinghurst: Want me to regenerate the file from the source scans? --Xover (talk) 08:13, 23 January 2020 (UTC)
I don't mind either way, it was poked up for Charles, and it is not urgent. It can wait until there is a fix, or the need for one to demonstrate a fix. — billinghurst sDrewth 10:43, 23 January 2020 (UTC)
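
For anyone wanting to try the chunked upload route mentioned above, here is a minimal sketch of the bookkeeping involved. It only plans the chunks and builds the request parameters for the MediaWiki `action=upload` API (`stash`, `filesize`, `offset`, `filekey`); it sends nothing. The chunk size, the helper names and the example file name are assumptions made for illustration, and a real upload would also need a CSRF `token` and the chunk bytes posted as the `chunk` field.

```python
from typing import Dict, Iterator, Optional, Tuple

# Assumed chunk size (5 MiB); this figure is illustrative, not from the discussion.
CHUNK_SIZE = 5 * 1024 * 1024


def plan_chunks(filesize: int, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs that cover the whole file exactly once."""
    if filesize <= 0 or chunk_size <= 0:
        raise ValueError("filesize and chunk_size must be positive")
    offset = 0
    while offset < filesize:
        length = min(chunk_size, filesize - offset)
        yield offset, length
        offset += length


def chunk_params(filename: str, filesize: int, offset: int,
                 filekey: Optional[str] = None) -> Dict[str, str]:
    """Build the form fields for one stashed chunk (token and chunk bytes omitted)."""
    params = {
        "action": "upload",
        "format": "json",
        "stash": "1",            # keep chunks in the upload stash until the final commit
        "filename": filename,
        "filesize": str(filesize),
        "offset": str(offset),
    }
    if filekey is not None:
        # The server returns a filekey after the first chunk; later chunks pass it back.
        params["filekey"] = filekey
    return params


def final_params(filename: str, filekey: str, comment: str) -> Dict[str, str]:
    """Build the form fields that publish the assembled file from the stash."""
    return {
        "action": "upload",
        "format": "json",
        "filename": filename,
        "filekey": filekey,
        "comment": comment,
    }
```

Each chunk is posted in order, carrying the `filekey` returned by the previous response, and one last request without `stash` publishes the file.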

Author:George Spearing

work written 1803, born 1824. Needs a fix, or we have a miracle! — billinghurst sDrewth 03:50, 17 April 2020 (UTC)

  • @Billinghurst: Ha! Yes, oops. I've fixed it to be (as @Annalang13 correctly had it) his death date. Also added his birth date based on the fact that he turned 41 while down the hole. Sam Wilson 04:05, 17 April 2020 (UTC)

Using Google OCR for old English text

Hi. I'm running a project to upload 3,000 chapbooks from the National Library of Scotland's digitised collections and we're interested in using the Google OCR function instead of Tesseract because it identifies the long s letter (ſ) really well. I noticed you've been quite heavily involved in the discussion around Google OCR - even though it's discouraged to use Google OCR with English, do you think this would be an acceptable use? Gweduni (talk) 14:51, 4 May 2020 (UTC)

@Gweduni: I think it definitely would be okay. The only reason it's at all discouraged (and it should be a less strong word, I think) is that we have a quota with Google. However, the quota is always renewed, and Google are (I think!) very happy for us to use their Cloud Vision API. We're also going to be doing some improvements soon (although I guess still a couple of months away) that will hopefully increase the quality of the text returned (Google gives us lots more structure about the OCR text than we're currently using, so we could do things with e.g. automatically improving punctuation, or even adding wiki templates where they're unambiguous). For updates, follow the phabricator:tag/wikisource_ocr project. Sam Wilson 22:32, 4 May 2020 (UTC)
@Samwilson: Great, that's really good news. We'll move over to using Google OCR on our project from now on, and I'll have a look at the wikisource ocr project you mentioned. Excited to hear new developments are on the way! Gweduni (talk) 12:12, 5 May 2020 (UTC)

Long S


I saw that you were involved in some of the "long s" discussion many years back. I've been trying to find good info here on how to approach long s in proofreading, but haven't found a clear guideline one way or the other. When I tried the template someone had created, it didn't seem to work, and also seems rather tedious unless I'm missing something (likely). Thanks for any clarification or advice you might have! Grillo7 (talk) 16:46, 3 June 2020 (UTC)

@Grillo7: I think the basic guidance is not to use them at all, but of course if you're consistent within a work then it's fine. I used the {{ls}} template, but I think that's now set to only display the long S in the Page namespace (and a normal S in the mainspace). If you definitely want a long S in every situation, then you can just use the ſ character (probably copying and pasting it is the easiest way, or remembering its key shortcut). —Sam Wilson 23:10, 3 June 2020 (UTC)
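
To make the behaviour described above concrete, here is a rough sketch in Python. It is not the code of {{ls}}; the function names are hypothetical, and the namespace rule simply mirrors the description (long s in the Page namespace, ordinary s elsewhere), which was itself stated with some uncertainty.

```python
LONG_S = "\u017f"  # the character "ſ" (LATIN SMALL LETTER LONG S)


def render_ls(namespace: str) -> str:
    """Hypothetical stand-in for the template: long s on Page:, plain s elsewhere."""
    return LONG_S if namespace == "Page" else "s"


def modernise_long_s(text: str) -> str:
    """Replace every long s in transcribed or OCR text with an ordinary s."""
    return text.replace(LONG_S, "s")


def count_long_s(text: str) -> int:
    """Count the long s characters, e.g. to check a page is consistent before saving."""
    return text.count(LONG_S)
```

`modernise_long_s` is the sort of clean-up you might run on OCR output if a work is being proofread without long s; typing the ſ character directly, as suggested above, keeps it in every namespace.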



Please could you bring {{Person}} more into line with {{Author}}, in particular with regard to pulling in data (and images) from Wikidata? No doubt they can use the same Lua module. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 10:01, 19 October 2020 (UTC)