User talk:Mpaa/Archives/2023


Shakespeare's sonnets

Hello. I have reverted an old edit by MpaaBot, which I guess made sense at the time it was made, but later the situation with various redirects and version pages changed. I did the same here as well, but then I realized it would be much better to employ the bot again. Do you think you could do it? -- Jan Kameníček (talk) 18:12, 13 February 2023 (UTC)

@Jan.Kamenicek done. Mpaa (talk) 21:11, 17 February 2023 (UTC)

How's your SQL-fu?

Out of around 600k texts on enWS, around 200k are not scan-backed and the number is not really decreasing (we're adding new non-scan-backed texts as fast as we scan-back old texts). See this graph.

I have a vague long-term ambition to get this down to somewhere around zero, probably by packaging the task up in some kind of structure that can be attacked like any other large maintenance backlog: something the community can be encouraged to divide and conquer, and chip away at over time.

But right now the only real tool we have for that is Special:PagesWithoutScans (provided by PRP). This only shows 5k at a pop, doesn't filter out sub-pages, can't be sorted, updates only once a week, etc. So I think we need something better.

I haven't figured out the optimal form of that "better thing" yet. It could be a dedicated Toolforge tool with a sortable and filterable list, maybe letting you copy it as wikitext for on-wiki tracking and supporting multiple Wikisourcen (if the other sisters would like the same). Or it could be a huge table on-wiki that's updated by a bot on some interval.

But as a first approach I'm thinking the easiest and most useful (best value for the effort put in) would be to simply bot-add a category to all top-level mainspace pages that are not scan-backed. That'd be our "these need to be checked by a human" backlog category; and then we could build a process on top of it with per-step templates to indicate progress, like "there's a scan at IA for this", "scan is on Commons", "Index: set up", "text Matched&Split but needs proofreading", etc.
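To make that concrete, a minimal pywikibot sketch of the tagging step might look something like this (untested; the category name and edit summary are placeholders, and it assumes we already have the list of titles from somewhere):

import pywikibot

site = pywikibot.Site('en', 'wikisource')
# Placeholder name; the real category would need community sign-off.
CATEGORY = '[[Category:Works needing scan backing]]'

def tag(title):
    """Append the tracking category to a page if it is not already there."""
    page = pywikibot.Page(site, title)
    if CATEGORY not in page.text:
        page.text += '\n' + CATEGORY
        page.save(summary='Bot: adding scan-backing tracking category')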

I asked Samwilson for tips and he pointed me in the direction of checking the templatelinks table for pages that have no transclusions of pages in the Page: namespace. Special:PagesWithoutScans does roughly that in SpecialPagesWithoutScans.php.
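For what it's worth, the same check seems to be exposed through the Action API too, since prop=templates reads templatelinks. So a single page could be sanity-tested with something like this, assuming Page: is namespace 104 on enWS (and that I'm reading the pywikibot docs right):

import pywikibot

site = pywikibot.Site('en', 'wikisource')

def is_scan_backed(title):
    # prop=templates reads the templatelinks table, so restricting it
    # to the Page: namespace (104 on enWS) mirrors the SQL-side test.
    page = pywikibot.Page(site, title)
    return any(site.pagetemplates(page, namespaces=[104], total=1))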

But my Python-fu is barely toddler-level, and my SQL-fu is even worse, so I'm kinda daunted by the thought of trying to hack something like that up myself. I could probably do it, but whether I'd finish before the heat death of the universe is less certain. :)

So… Is this something you'd be interested in helping out with? Does PWB have existing facilities for interrogating these tables (i.e. is the info exposed by the Action API, I guess)? Alternatively, is your SQL-fu strong enough to make this reasonable / trivial to tackle? And do you have any thoughts on how best to approach the problem?

No rush—it's a long term / backburner / not fully formed thing—and as sketched out above I'm not entirely sure what the best approach is yet. I'm just fishing for pointers, recruiting <s>unwitting victims</s> generous volunteers, etc. :) Xover (talk) 07:40, 31 August 2023 (UTC)

@Xover sorry for the late response, but I haven't spent much time here lately. Anyhow, as quick feedback for now:
1. pywikibot supports special pages as a pagegenerator (there is no 'official' generator available via the CLI, but it is easy to make one):
import pywikibot

# Iterate Special:PagesWithoutScans via the QueryPage API.
site = pywikibot.Site('en', 'wikisource')
gen = site.querypage('PagesWithoutScans', 10)  # first 10 entries
list(gen)
(if you find the right SQL query, pywikibot can also run SQL queries directly; see the sketch after point 2)
2. My SQL is pretty basic; what I would do is try to mimic the query in SpecialPagesWithoutScans.php in Quarry.
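For example, something along these lines. Untested, and it assumes Toolforge replica access, the current templatelinks/linktarget schema, and that Page: is namespace 104 on enWS; the SELECT itself should also work pasted into Quarry:

from pywikibot.data import mysql

# Rough equivalent of SpecialPagesWithoutScans.php: mainspace,
# non-redirect, top-level pages with no templatelinks row pointing
# into the Page: namespace (104 on enWS).
QUERY = """
SELECT page_title
FROM page
WHERE page_namespace = 0
  AND page_is_redirect = 0
  AND page_title NOT LIKE '%/%'  -- skip sub-pages
  AND NOT EXISTS (
      SELECT 1
      FROM templatelinks
      JOIN linktarget ON lt_id = tl_target_id
      WHERE tl_from = page_id
        AND lt_namespace = 104
  )
LIMIT 100
"""

# dbname may need adjusting to your setup (e.g. 'enwikisource' plus
# db_name_format = '{0}_p' in user-config.py).
for (title,) in mysql.mysql_query(QUERY, dbname='enwikisource_p'):
    print(title.decode())  # page_title comes back as bytes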
That's all for now; if I have the chance I will dig a bit deeper. Mpaa (talk) 16:16, 23 September 2023 (UTC)