User talk:Samwilson

mw.wikibase.getEntityIdForTitle

[OMG a nude page, let me help resolve that]

I just saw the announcement of the Lua function mw.wikibase.getEntityIdForTitle, and if I am not mistaken it could be a joyous little bundle of helpfulness for us.

Here I am thinking of where we have an author page and a related biographical page in the main namespace, and working out whether we can poke a wikipedia = parameter onto the respective main-ns page, or maybe automate a link; similarly, I see the potential for us to more readily get some bot action to better apply "main subject" (P921) at Wikidata for our biographical works. Am I reading the function properly? — billinghurst sDrewth 22:37, 16 April 2018 (UTC)

@Billinghurst: Interesting! So you mean create a link from the NS0 page of e.g. a biography chapter to the Author NS of the bio's subject? If the bio has a P921, couldn't we link via that (i.e. bio page → sitelink → P921 → Qxx → sitelink → Author page)? I'm not quite getting when we'd need to do a page title look-up... or do you mean, as a means to find unlinked articles? That must be it. So we'd do a getEntityIdForTitle('NS0 Page Name') and see if it comes up with an instance of person, and if it does we'd add something to alert editors here to the fact? Sam Wilson 06:27, 17 April 2018 (UTC)
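A rough sketch of the check being discussed, done through the action API rather than the Lua function itself (the page title is hypothetical): find the Wikidata item connected to a mainspace title, then ask whether it is an instance of human (Q5), which is essentially what a getEntityIdForTitle-based module or a bot pass for P921 would also do.

// Sketch only; run from a gadget or the browser console on the wiki.
var api = new mw.Api();
api.get( {
    action: 'query',
    prop: 'pageprops',
    ppprop: 'wikibase_item',
    titles: 'Some Biographical Article' // hypothetical NS0 page name
} ).then( function ( data ) {
    var page = Object.values( data.query.pages )[ 0 ],
        qid = page.pageprops && page.pageprops.wikibase_item;
    if ( !qid ) {
        console.log( 'No Wikidata item linked to that title.' );
        return;
    }
    // Ask Wikidata whether the item is an instance of (P31) human (Q5).
    $.getJSON( 'https://www.wikidata.org/w/api.php?origin=*', {
        action: 'wbgetclaims',
        entity: qid,
        property: 'P31',
        format: 'json'
    } ).then( function ( result ) {
        var isPerson = ( result.claims.P31 || [] ).some( function ( claim ) {
            return claim.mainsnak.datavalue.value.id === 'Q5';
        } );
        console.log( qid, isPerson ? 'looks like a person' : 'is not a person' );
    } );
} );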

Books & Bytes - Issue 27

The Wikipedia Library

Books & Bytes
Issue 27, February – March 2018

  • #1Lib1Ref
  • New collections
    • Alexander Street (expansion)
    • Cambridge University Press (expansion)
  • User Group
  • Global branches update
    • Wiki Indaba Wikipedia + Library Discussions
  • Spotlight: Using librarianship to create a more equitable internet: LGBTQ+ advocacy as a wiki-librarian
  • Bytes in brief

Arabic, Chinese and French versions of Books & Bytes are now available on Meta!
Read the full newsletter

Sent by MediaWiki message delivery on behalf of The Wikipedia Library team --MediaWiki message delivery (talk) 14:49, 18 April 2018 (UTC)

Your feedback matters: Final reminder to take the global Wikimedia survey

WMF Surveys, 00:43, 20 April 2018 (UTC)

Unused files as a list?

Do you know a way to manipulate Special:UnusedFiles so I can get it as an easy list? There is a string of files there that I know I can straight-out delete, though how to get them as a list to easily manipulate in bite-size chunks is just not obvious. It is not even obvious that you can pull it from the API, not that I can generate simple text lists from the API anyway — billinghurst sDrewth 04:06, 16 May 2018 (UTC)

@Billinghurst: It doesn't look like it. That special page isn't transcludable even, and it's constructing the database query itself so I suspect the same query isn't done anywhere else (or we'd be reusing it). Also it's the only place mw:Manual:$wgCountCategorizedImagesAsUsed is used. What sort of list are you trying to build? It probably wouldn't be too hard to add transcluding support, if that'd help. Sam Wilson 04:38, 16 May 2018 (UTC)
There are works there that have been completed where the original image has been cleaned/gleaned/screened and uploaded to Commons. So we have the residue images to cleanse, and getting these URL by URL is a PITA. Getting a list, checking the work completion, and zapping more collectively is bettererer. Noting that prefix lists are unreliable in case one/some aren't done. — billinghurst sDrewth 05:23, 16 May 2018 (UTC)
Dropped the problem into phab:T194865 — billinghurst sDrewth 01:44, 17 May 2018 (UTC)
Note that files linked via {{raw image}} are still considered 'unused'. In pywikibot: python scripts/listpages.py -unusedfiles. — Mpaa (talk) 17:36, 19 May 2018 (UTC)
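For reference, the same list appears to be exposed through the QueryPage API module (presumably what that pywikibot generator reads under the hood), so a rough sketch like the following, run from the browser console on-wiki, can also pull it as plain titles:

// Sketch only: fetch the first batch of unused-file titles via the API.
new mw.Api().get( {
    action: 'query',
    list: 'querypage',
    qppage: 'Unusedimages',
    qpplimit: 50
} ).then( function ( data ) {
    data.query.querypage.results.forEach( function ( row ) {
        console.log( row.title );
    } );
} );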

Ping

Hi. Just in case you have not been notified about this: https://phabricator.wikimedia.org/T194861. It has been happening quite often recently. Bye — Mpaa (talk) 20:35, 18 May 2018 (UTC)

Books & Bytes – Issue 28

The Wikipedia Library

Books & Bytes
Issue 28, April – May 2018

  • #1Bib1Ref
  • New partners
  • User Group update
  • Global branches update
    • Wikipedia Library global coordinators' meeting
  • Spotlight: What are the ten most cited sources on Wikipedia? Let's ask the data
  • Bytes in brief

Arabic, Chinese, Hindi, Italian and French versions of Books & Bytes are now available on Meta!
Read the full newsletter

Sent by MediaWiki message delivery on behalf of The Wikipedia Library team --MediaWiki message delivery (talk) 19:33, 20 June 2018 (UTC)

Meeting followup

Hi Sam, Thanks for being there today. Lots of stuff half heard, half understood, to try to follow up on. One thing you mentioned was some form of mapping using Wikipedia when data have been uploaded to Commons. I was curious about this as I dislike my Rgooglemaps: they are too fuzzy. Nor am I mad about my Australian outline maps (produced using SAS), so another technique would be good.... MargaretRDonald (talk) 13:27, 27 June 2018 (UTC)

[Map: Cuscuta australis]
@MargaretRDonald: There's a new thing called Kartographer that can show data on maps pretty easily. For example, at right is the Cuscuta australis data we were looking at yesterday. The colours and styles and things can all be customised, and the data doesn't have to live in the wiki page (as I've done in this example). —Sam Wilson 01:25, 28 June 2018 (UTC)
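A minimal sketch of what such an embedded map looks like in wikitext, assuming the data has been uploaded as a (hypothetical) Data page on Commons rather than pasted inline; Kartographer also accepts raw GeoJSON between the tags:

<mapframe text="Cuscuta australis" width="400" height="300" zoom="3" latitude="-25" longitude="134">
{
    "type": "ExternalData",
    "service": "page",
    "title": "Cuscuta australis.map"
}
</mapframe>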

@Samwilson: Thanks for this. (Only just spotted...) MargaretRDonald (talk) 02:07, 6 July 2018 (UTC)

@Samwilson: Sorry to be so thick. But here in your text you have listed all the co-ordinates... and of course, the map is embedded in the page... Writing code to generate the mark-up looks a smidgen ugly. So I am not quite sure how this is easier, or conceptually better, from a Wikipedian point of view (?) MargaretRDonald (talk) 02:14, 6 July 2018 (UTC)
No, the idea would be to include the coordinates (in KML format) in a template in the manner of e.g. wikipedia:Template:Attached KML/High Street, Fremantle. Then, to update the range map, only that template would need to be changed and the article map would update automatically from there. I'm not sure if it is easier, but it does make the map zoomable, and perhaps is quicker than creating separate raster map files and uploading them. Just an idea though! :) —Sam Wilson 06:50, 6 July 2018 (UTC)

seeing other wikisources

@Samwilson: Hi, Sam. It would be very nice if one could see all the corresponding Wikisource things on the left, as one can in Wikipedia or in Wikidata. I am constantly seeking other-language sources for botanical stuff, and it would be nice to be able to navigate (relatively) easily to other language sources..... Any thoughts? MargaretRDonald (talk) 02:04, 6 July 2018 (UTC)

@MargaretRDonald: Yes, this is a definitely wanted thing, and is being worked on as Phabricator:T180303. The trouble with Wikisource interlinking, compared to other projects, is that works in different languages don't get directly linked to the same Wikidata item, but rather each get their own (which has an 'edition or translation of' property that links to the unifying work-level item). —Sam Wilson 06:53, 6 July 2018 (UTC)
@Samwilson: Hmmm. (I see) I look forward to all those clever persons making it happen sometime.... Cheers, MargaretRDonald (talk) 07:14, 6 July 2018 (UTC)

Living authors category again

Hi Sam. First I would like to thank you a lot for handling the floruit problem at the template {{Author}} and thus partly solving the Living people category problem.

There are also some authors who do not have the floruit property filled in at Wikidata, because they are not known from a single dated event but were active over a longer period. Such people can have the Wikidata properties "work period (start)" and "work period (end)" instead. An example of this is Author:Mordach Mackenzie (Q56612310), whose birth and death dates are unknown and who is known for his work between 1746 and 1764. Do you think it would be possible that a) the author's page at Wikisource could take these dates from Wikidata and display them as "fl. 1746–1764", and b) authors whose "work period (end)" was more than e.g. 90 or 110 years ago could be removed from the Living people category too?

I am writing you because you are the only one here I know that can handle such things (though I believe there are more people like that). However, it is not of the highest importance, so if you do not have enough time, it can wait. Thanks. --Jan Kameníček (talk) 11:30, 18 September 2018 (UTC)

Yes, that sounds like a great idea! I did see your comment on that other page; sorry I didn't reply yet. I'm keen to help, not sure when I'll find time, but it's conceptually the same thing we're already doing but just with a different property, so it shouldn't be too hard. There are currently 7 failing tests that I want to fix up before embarking on any new features though, so I might try to do them first. Will keep you posted! Sam Wilson 03:15, 19 September 2018 (UTC)
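Purely as an illustration of the logic (the real change would live in the Author template's Wikidata handling, not in a standalone script, and the 100-year cutoff is arbitrary here), the two requests boil down to something like:

// Illustrative only: build a floruit string from work period start/end years,
// and decide whether the author could still belong in "Living people".
function floruitFromWorkPeriod( startYear, endYear, cutoffYears ) {
    var yearsSinceEnd = new Date().getFullYear() - endYear;
    return {
        floruit: 'fl. ' + startYear + '\u2013' + endYear,
        possiblyLiving: yearsSinceEnd < ( cutoffYears || 100 )
    };
}
console.log( floruitFromWorkPeriod( 1746, 1764 ) );
// → { floruit: "fl. 1746–1764", possiblyLiving: false }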

PageCleanUp feature request

Hi,

Just a note to make a record of our recent conversation about my feature request for your very useful PageCleanUp.js tool:

If a full stop (period) is followed by a lower-case letter:

Some text. then some more

then it should probably be a comma:

Some text, then some more

If a comma is followed by a capital letter:

Some text, Different text

then (proper names notwithstanding) it should probably be a full stop:

Some text. Different text

If this is not a major issue for most OCRd text, perhaps a separate script would be better. What do you think? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 10:31, 3 October 2018 (UTC)
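A rough sketch of the two substitutions described above (not the actual PageCleanUp.js code; both rules are deliberately blunt and still need a human eye for abbreviations and proper names):

// Sketch only: apply the two punctuation rules to the edit box text.
var text = $( '#wpTextbox1' ).val();
// Full stop followed by a lower-case letter is probably a mis-OCRd comma.
text = text.replace( /\. ([a-z])/g, ', $1' );
// Comma followed by a capital letter is probably a mis-OCRd full stop.
text = text.replace( /, ([A-Z])/g, '. $1' );
$( '#wpTextbox1' ).val( text );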

Also, perhaps the script could fix ligatures, like the "fi" and "fl" in "magnificent power of flight"? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 16:14, 5 October 2018 (UTC)
  • @Pigsonthewing: dots and commas done, good idea! As for ligatures, Wikisource:Style_guide/Orthography#Ligatures suggests that we not use them as search engines struggle. I suspect that's wildly out of date. We do avoid e.g. the long 's', because it's "just" orthography and so not relevant to the text. Also, there are ligatures (e.g. st) that don't exist in many fonts at all. Sam Wilson 22:58, 9 October 2018 (UTC)
Sorry if I wasn't clear; I meant the script could change from ligatures generated by OCR to regular letter pairs. Thanks for the punctuation feature. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 00:25, 10 October 2018 (UTC)
@Pigsonthewing: Oh! Ha, yes I see now. Done! :) Sam Wilson 05:23, 10 October 2018 (UTC)
That is going to save me a lot of dull drudgery. Thank you! Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 10:42, 10 October 2018 (UTC)

mediawiki-feeds 503?

Hi Sam, thanks for all the stuff you do!

Realized I wasn't subscribed to the Signpost anywhere, and when I tried the RSS feed, Feedly said it couldn't reach it; clicking the feed got a 503 from Toolforge, which said you're the mediawiki-feeds maintainer.

Thought you might like to know. Hopefully it's just an old link or something else simple. John Abbe (talk) 17:32, 2 June 2019 (UTC)

@John Abbe: Thanks for telling me about this! It made me realise that that tool isn't in my list of monitored tools, so I hadn't seen that it was down. That's fixed now, and so is the bug that was causing it to fail on the Signpost feed, and the tool is back online. See how it fares, and ping me with any dramas. :) Thanks! Sam Wilson 02:18, 3 June 2019 (UTC)
Sweet! And thx for the quick fix. John Abbe (talk) 05:46, 5 June 2019 (UTC)

Requesting import of "Links count" gadget

Any chance you could help with: Wikisource:Scriptorium#Requesting import of "Links count" gadget, please? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 19:23, 20 June 2019 (UTC)

curly quotes script

Hi -- thanks for the very useful-looking script. How do I install it in my common.js? I tried adding importScript('User:Samwilson/CurlyQuotes.js'); but that didn’t work. Thanks for your help. Levana Taylor (talk) 16:05, 30 August 2019 (UTC)

@Levana Taylor: Use this in your common.js:
mw.loader.load('//en.wikisource.org/w/index.php?title=User:Samwilson/CurlyQuotes.js&action=raw&ctype=text/javascript');
And let me know if you find any bugs with it! :-) It's adapted from https://github.com/gitenberg-dev/punctuation-cleanup — Sam Wilson 01:58, 31 August 2019 (UTC)
Hmm… nice interface, mostly works, but I’ve already found some issues. Most notably, it’s not always correctly noticing paired apostrophes to leave straight: try it on this page, for example. Also it doesn't know to leave double-quotes alone if they’re inside angle brackets (I use <section begin="s1" /> for section breaks, for example). And, I’m sure this isn't the only case it doesn’t get right, but when you have italics inside double quotes ("Really?!") it doesn’t alter the double quotes. Here’s a thought for long-term development: since you’ll never be able to get it to work absolutely perfectly, the thing that’d make it really easy to check the results is if it highlights quotes in 3 different colors after finishing work, one for left, one for right, one for straight. Single and double quotes could be the same color: no need to have six! Levana Taylor (talk) 05:05, 31 August 2019 (UTC)
Hey, sorry if that last comment was too negative! I’ve been using the script a lot and finding it extremely useful. I do have a list of stuff it isn't catching, though. Lemme know if you want it. Levana Taylor (talk) 04:53, 5 September 2019 (UTC)
@Levana Taylor: Oh cool! Yes, please. I'm sure there are going to be a bunch of things we can't handle, but it'd be good to try. :) The quotes thing I'm looking at now, but I think the highlighting colour thing might be a bit harder. Or do you just mean in the preview, not the editing box? That might be easier. Anyway, thanks for the feedback! I'll try to improve the script. Sam Wilson 05:04, 5 September 2019 (UTC)
@Levana Taylor: Have you found Wikisource:WikisourceMono by the way? That helps with more easily seeing the different characters while editing. Sam Wilson 05:55, 5 September 2019 (UTC)
Yes, highlighting in preview would be helpful, even if it's not possible in the edit box. I find the font I'm using plenty readable, but the point is that you want the left-right pairs, or absence thereof, to jump out at you so your mind doesn't overlook them.
Anyhow, I was wondering why you don’t have some simple rules like double quote at start of line is left, at end is right; single quote between two letters of the alphabet is right. Must be a reason I’m sure! As for suggestions:
  1. My biggest suggestion for improvement would be to drop all the things the script does with dashes and stick to just quotes. I never find the dashes useful and they are constantly messing things up, like pagenames that contain hyphens, and the <!-- comment markup (though I guess you must be finding the dash alterations useful since you put them in!)
  2. Bug noted: paired apostrophes before s and d are not being correctly interpreted.
  3. The French d’ and l’ are major list items not yet being recognized. Then there’s ’twould - ’twill - ’twere - ’tisn’t - ’twasn’t (etc.) - ’midst - ’neath - ’bout - ’fraid - ’nother - ’uns - People in the novel I’m reading keep saying ’Gad! and ’Pon my honour! Levana Taylor (talk) 06:48, 5 September 2019 (UTC)
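For what it's worth, the simple rules suggested above are easy to express as regular expressions; a sketch (not the CurlyQuotes.js implementation) might look like:

// Sketch only: the three simple heuristics suggested above.
function curlify( text ) {
    return text
        // A double quote at the start of a line opens...
        .replace( /^"/gm, '\u201C' )
        // ...and one at the end of a line closes.
        .replace( /"$/gm, '\u201D' )
        // A single quote between two letters is an apostrophe.
        .replace( /([A-Za-z])'([A-Za-z])/g, '$1\u2019$2' );
}
console.log( curlify( '"It\'s done."' ) ); // → “It’s done.”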

Cloud Services and Toolforge

You wouldn't happen to be familiar with Cloud Services (WMCS, CVPS), Toolforge, and related infrastructure bits? I have some kinda hacky local tooling for working with DjVu files and Tesseract and was toying with the idea of trying to set up some related utilities of possibly general usefulness. But right now I'm banging my head against the wall of insufficient documentation for the stuff the WMF provides for hosting such things. In other words, I'm looking for someone familiar with the setup that's willing to answer dumb questions and provide some hand-holding. --Xover (talk) 15:28, 17 September 2019 (UTC)

@Xover: We do have the shared account user:Wikisource-bot for bits, and there is a range of documentation at wikitech: and places. Mailing list is pretty good for support, and IRC can be useful (though you have to be in a good time zone). I am totally useless as a coder, though have managed to find handholders to allow me to bumble through. — billinghurst sDrewth 23:33, 17 September 2019 (UTC)
@Xover: Yes sure, I'd be happy to help. I'm reasonably familiar with Toolforge. What issues are you having? Sam Wilson 03:37, 18 September 2019 (UTC)
Well, mostly stupidity and documentation that seems to be written for a different audience than myself.
My vague ideas to begin with are a replacement (possibly temporary) for Phe's OCR gadget, and possibly a tool that'll take images from some source (IA id, URL, zip file, etc.) and spit out a DjVu with a text layer. There's some related stuff that might be relevant, like an easy way to add and remove pages from a DjVu, or to redact pages from a DjVu (typically for copyright reasons). Not sure what all would make for tools that are 1) a reasonable amount of effort to get working and 2) of sufficiently general utility to be worth it. A short term alternative for the OCR gadget is the primary motivation as that seems to be pretty critical for several contributors.
Right now I'm trying to figure out where and how it'd make sense to host something like that—Cloud VPS, Toolforge, or a third party general web host somewhere—and I'm just not finding the documentation that'll tell me what CVPS and Toolforge actually look like in terms of facilities, hosting environment, and so forth. As I said, the docs seem to be addressing different questions and for a different audience than me. So my first set of dumb questions / need for hand-holding is to figure out that.
What I'd need is:
  • A sufficiently Unix-y hosting environment. Fedora would be perfect, and RHEL or CentOS good second choices. Any good modern and not too esoteric Linux distro would probably be good enough, but experience tells me there are crucial differences in relevant sysadmin tools, package distribution/management systems, and availability of packaged third-party software. Depending on what comes ready out of the box that may or may not be an issue.
  • A non-ancient version of Perl 5, with a reasonable set of standard modules installed. Given the rate of change in perl-land, I don't imagine any OS the infrastructure team are willing to host would contain a version that's too old. I don't currently need a lot of esoteric perl modules, but by experience I expect to need to be able to install at least some. For example, I believe HTML::Parser was dropped from the core modules so that's something that would need to be available through some method.
  • A not-too-esoteric CGI hosting environment. My experience is with Apache, possibly with mod_perl, but anything that supports Perl and can be tweaked for interactive performance would probably work.
  • Tesseract 4.1 installed and functioning.
  • GraphicsMagick in some recentish version. ImageMagick will probably do in a pinch, but my experience with GraphicsMagick is better.
  • The ability to have such software updated in some reasonable timeframe. Whether that's done by the sysadmins as part of the platform, or whether that's something I'd sysadmin myself using the package tools, isn't all that important.
  • I'm way past the age where I want to waste time compiling software from source, so I really really would prefer that can be handled through some kind of package system.
  • I'd need a moderately large amount of disk space to play with, and moderately performant too (for OCR, disk IO quickly becomes a bottleneck). Several tens of gigs at least for temporary stuff: purged, depending on the case, on the timeframe of hours up to weeks. A gig or so per DjVu file, and room for at least 10 jobs' worth of files sitting around, would be the minimum reasonable. More is better.
  • For the batch OCR stuff I'd eat all the CPU I could get too, but for the per-page stuff anything that isn't completely choked would probably work well enough.
  • It would be a bonus if I had easy read-only access to files from Commons and local files on enWS (and possibly the other WSes if anyone should want it). A virtual filesystem or something, preferably more performant than having to download a copy of the file over HTTP. If any writing to a wiki is eventually needed that'd go through the API, most likely using OAuth, so read-only would be fine. Not a requirement by any means, but it'd make the DjVu manipulation stuff much more elegant and efficient if I ever get around to it.
  • Network access to Commons and enWS for downloading files, and for talking to the API if it becomes relevant.
  • General internet access to download stuff from IA, Hathi, etc. if that becomes relevant.
  • No immediate need for access to database dumps or similar: everything I have in mind is just crunching File: files.
  • No immediate need for database facilities: I might eventually need something to track batch jobs or whatever, but I'd get by with file-based solutions for a good long while before that became an issue.
  • If there is a batch system with oodles of compute resources that responds to "run this code on that data over there and notify me with the results once you're done" that'd be neat, but I'm not entirely sure that'd be more performant than doing it on whatever is hosting me directly. Possibly it'd enable parallel execution of large batch OCR/DjVu jobs if the tool is used a lot, that would otherwise need to be serialised, but I'm not sure the volume would be there to make that worth the effort.
  • If anyone actually started using a utility here I would want to add extra maintainers with OS-level access.
  • For source control and such I'd probably use Github (Gerrit looks completely impenetrable to me), and issue tracking either there or Phabricator, so no special needs related to that.
That's a rough braindump of the requirements. Given your knowledge of the facilities, what should I be looking at for hosting? By my guesses here, Toolforge might be both too constricting for the needs, and at the same time its advantages aren't all that relevant (I think I've grasped that Toolforge has DB dumps and similar already available, but as I don't need those…). If my vague understanding is anywhere near right, Toolforge is essentially a shared web host with some Wikimedia-specific facilities while Cloud VPS is just server hosting (that happens to be para-virtualised rather than bare-metal); but from there to the details that'd let me assess them against the requirements is a bit steep a climb right now. Anything I haven't thought of? Am I completely off my rocker? Should I go away and stop bothering you? :) Any help and hand-holding would be very much appreciated! --Xover (talk) 08:28, 18 September 2019 (UTC)
@Xover: Yep, your understanding of the shared-hosting vs. VPS distinction is right. You could certainly do all you need on a VPS, but I think it sounds like you'll be fine with a Toolforge tool (although I'm not 100% certain off the top of my head of version specifics etc.). I recommend creating a tool account and seeing how you go. Lots of people use Github, so no worries there. Probably the most confusing thing for new Toolforge users is the cronjob setup: basically, your cronjobs don't actually run things themselves, they just add a job to the 'grid', where it runs. In practice this just means a cron command has to use jsub. Anyway, I recommend a) creating a tool, b) trying to run what you need, and c) asking me when you hit an issue. Sam Wilson 23:16, 25 September 2019 (UTC)
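A hypothetical example of that cron setup (job name and script path invented for illustration): the crontab line only submits the work to the grid via jsub, and the script itself then runs on a grid node.

# Hypothetical Toolforge crontab entry: submit the nightly job to the grid.
0 3 * * * jsub -once -N nightly-ocr $HOME/bin/process-queue.sh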

Little problem on nl-ws

Hi Sam,

may I ask you to take a look at some strange problem, that happens on nl-ws?

It does happen only in one book, for instance at this page: s:nl:Pagina:Heemskerck op Nova Zembla.djvu/101. As you can see, the header is not laid out correctly. I have been trying all kinds of things. Nothing seems to help. It only happens in this book. In all other books on nl-wiki where we use the RH-template, it works fine, see e.g. s:nl:Pagina:De voeding der planten (1886).djvu/46. Can you explain why this happens in this book?

Many greetings, and looking forward to your answer, --Dick Bos (talk) 10:47, 2 October 2019 (UTC)

@Dick Bos: It looks to me like there are odd block-level elements being introduced where they shouldn't be. For example, the following part of s:nl:Sjabloon:RunningHeader (and the same applies to the right side):
Currently:
    -->|{{#if:{{{1|{{{left|}}}}}} |
<span style="float: left; display: block;">
{{{1|{{{left}}}}}}
</span>}}<!--

Should be:
    -->|{{#if:{{{1|{{{left|}}}}}} |<!--
--><span style="float: left; display: block;"><!--
-->{{{1|{{{left}}}}}}<!--
--></span>}}<!--
And also that there isn't a default value for the centre component (in {{RunningHeader}} here, it's a &nbsp;).
Actually, that template could be rewritten with block-level components and using flexbox, but that's another story! :-)
Sam Wilson 09:48, 5 October 2019 (UTC)
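For what it's worth, a rough sketch of the flexbox idea (parameter names illustrative only; a real template would still need its positional-parameter and default handling):

<div style="display: flex;">
  <div style="flex: 1; text-align: left;">{{{left|}}}</div>
  <div style="flex: 1; text-align: center;">{{{center|}}}</div>
  <div style="flex: 1; text-align: right;">{{{right|}}}</div>
</div>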
You were right! I usually copy this kind of template from en-ws (I really don't understand a word of the code, to be honest), and apparently there had been an update of the template. Now that I have copied the newest version to nl-ws, it is running perfectly! Hurray.....
We need someone with some technical knowledge to do this kind of maintenance work on the Dutch Wikisource! But unfortunately, activity on nl-ws is very low. Thanks for helping us! --Dick Bos (talk) 16:29, 7 October 2019 (UTC)
@Dick Bos: Oh good, I'm glad it works. There isn't really a good system yet for keeping imported templates up to date. One possible way could be that we add some of this functionality to the new Wikisource extension, because then it'd be on all Wikisources. That's going to take some more work though. For now, it's export/import and keep an eye on things. Sam Wilson 03:55, 8 October 2019 (UTC)