Wikisource:Scriptorium/Archives/2020-12

Warning Please do not post any new comments on this page.
This is a discussion archive first created in , although the comments contained were likely posted before and after this date.
See current discussion or the archives index.

Feedback requested on November update for Wikisource ebook export project

Hello, everyone! The Community Tech team is requesting your feedback on the recently posted November update for the Wikisource ebook export improvement project. Your feedback is very important to us. We want to know what you think of some work we have recently completed to improve the reliability of WS-Export and font support in various languages. Additionally, we want to know what you think of our proposed mockups to improve the download user experience. In that case, please do check out the updates, if you can, and share your feedback on the project talk page. Thank you! --IFried (WMF) (talk) 18:23, 2 December 2020 (UTC)

Algiers Accords - PD?

Do we think this document is in the public domain? It is an agreement (though short of a treaty) between the United States and Iran, brokered by the Algerian government, with the document officially having been produced by the Algerian government. BD2412 T 06:45, 2 December 2020 (UTC)

As slight background information, Commons Algeria copyright discussion says "The protection period for a collective, pseudonymous, anonymous, posthumous or audiovisual work is 50 years from the end of the Gregorian year during which the work has been legally published for the first time." This would presumably be a collective and anonymous work. If not, it would still fall under the "State works, legally made available for public use in non-profit generating purposes, may be freely used subject to maintaining the work wellbeing and highlighting its source. State works, within the context of this article, shall mean works produced and published by various state institutions, local groups and public establishments of administrative nature.[Law of 2003, Art.9] Approved protection of copyrights provided for herein shall not be granted to administrative laws, regulations, resolutions and administrative contracts issued by the state institutions, local groups, justice rulings and the official translation of these texts" clause. I would say it's a likely yes. Peace.salam.shalom (talk) 02:20, 3 December 2020 (UTC)
Our own rules for faithfully reproducing works and denoting their origin would clearly abide by the listed requirements even if it was protected, but the next clause would seem to withhold copyright from this document. I'm good with that. BD2412 T 07:23, 3 December 2020 (UTC)

Hey slackers, get to work!

I'm joking of course, but I did think it might help if I listed the following here to get backlogs cleared up; it would probably be wise if a sidebar somewhere actually listed the links and numbers. I understand that Category:Texts with missing musical scores just keeps growing larger until we get some tool/person/bothOfTheAbove to get it done, but there's not much excuse for other categories with backlogs that have never been cleared in fifteen years. Category:Pages to be split has 15 works dating back to at least 2004, Category:Works with no header template similarly has only 21 works but again they date back to at least 2005. Category:Subpages with no header template has only 58 works, but they date back to at least 2010. The very important Category:Works with no license template has 250 works and they haven't been cleaned up since at least 2008. Category:Empty ref tag has 92 pages but seems like the sort of thing that should be on a bot's regular monthly schedule. Category:Texts with page numbers has only 45 works, but again since 2007. And then my personal favorite is Category:Proofread works with index pages that need linking which appears to have been a 2013 creation of Billingshurst that only ever got used a single time and so the one work still sits in it, sad and alone. I feel like maybe the "Maintenance Templates" category should be given a once-over like this every December and have a chinwag about how to clean it up? Cheers. Peace.salam.shalom (talk) 23:12, 3 December 2020 (UTC)

  • I would like to help with musical scores, (to some degree at least,) but the Score extension has been destroyed for some months now, so it is very difficult to create and proofread scores. For some of the smaller categories, they could be cleaned out, partially or wholly, now, and cleaned out periodically as required every year. For some of the larger categories, a few dedicated editors could work on one or more of the categories, but these will (generally) take a longer period of time to do, as creating images, tables, musical scores, &c., is more time-consuming than merely proofreading around these concerns. (For musical scores, more so than the other categories, there are oftentimes many pages within one work missing musical scores, and these could be completed by a single editor who wishes to proofread the whole work.) A greater use and popularity of {{backlog}} could also help this situation. (Category:Proofread works with index pages that need linking seems like a helpful category for a bot, but I don’t know much about that sort of thing.) TE(æ)A,ea. (talk) 00:39, 4 December 2020 (UTC).
we should put a redesign of score editing on wishlist. if they could reverse engineer lilypond, it would remove a major roadblock. Slowking4Rama's revenge 02:24, 4 December 2020 (UTC)
(slightly off-topic, is that for proposing things like that the OCR should automatically ad-
-just for routine page brea-
-ks when reading, and auto-correct blankspace before a semi-colon on the OCR engine? Peace.salam.shalom (talk) 03:42, 4 December 2020 (UTC))
No, the wishlist is a place to make yearly requests to Wikimedia’s technology team. The OCR engine used on Wikisource isn’t really maintained by them, so requests like that should be made locally. However, there are (and have been) wishlist requests relating to the Wikisource OCR engine; I don’t know how they are progressing. TE(æ)A,ea. (talk) 03:52, 4 December 2020 (UTC).
@Peace.salam.shalom, @TE(æ)A,ea.: The OCR improvement project is coming soon (I hope!). We (I'm on the CommTech team) are going to start on it as soon as we're done with the WS Export project, so maybe in a few weeks or something like that. Regarding better reflowing of text and fixing of common scannos, there's a heap that we're going to be able to do because the Google OCR (at least; perhaps Tesseract too?) provides much more information than what we're currently using to prepare the text. It looks like it'll be possible to have language-specific rules for punctuation etc., and even include wikitext like templates and whatnot. phab:T250185 is a task. —Sam Wilson 09:05, 4 December 2020 (UTC)
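For illustration only, the two fix-ups raised in this thread (rejoining words split at a line break, and removing the blank space OCR leaves before a semicolon) might look like the sketch below. The function name `fix_common_scannos` and both regular expressions are assumptions made for this example; they are not the rules used by the Wikisource OCR tool or tracked in phab:T250185.

```python
import re


def fix_common_scannos(ocr_text: str) -> str:
    """Apply two illustrative clean-up rules to raw OCR text (hypothetical helper)."""
    # Rejoin a word hyphenated across a line break: "ad-\njust" -> "adjust".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", ocr_text)
    # Drop spaces or tabs that the OCR left before a semicolon or colon.
    text = re.sub(r"[ \t]+([;:])", r"\1", text)
    return text
```

Both rules are deliberately naive. The first would also merge a genuinely hyphenated compound that happens to break at the hyphen, and the second is wrong for languages such as French that put a space before a semicolon, which is one reason language-specific rules are attractive.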
That is really impressive, and glad to hear it's not on some distant wish-list but an up and coming to-do list. I always like to see phrases like "kaldari renamed this task from Can Wikisource-OCR handle paragraphs better? to Make Wikisource-OCR handle paragraphs better.Apr 21 2020, 7:53 PM", lol. It won't help the backlog issues so much, other than all the non-OCRed pages when I click "Random Transcription" which doesn't even seem to tackle the Faebot additions that sit on Commons without Indices. Peace.salam.shalom (talk) 14:40, 4 December 2020 (UTC)
@Samwilson: Good to hear the OCR stuff is having thought: those ideas sound really exciting! On a practical note, can we please have phab:T230415 reviewed? Not using the existing paragraph separators in existing OCR layers is extremely annoying and a huge waste of human brainpower to manually insert them again. Inductiveloadtalk/contribs 14:46, 4 December 2020 (UTC)
@Inductiveload: I went looking, and got confused because it's already merged. But only four hours ago! Looks like @Kaldari got there first. :-) Is there more with that ticket that needs to be worked on? —Sam Wilson 22:51, 4 December 2020 (UTC)
@Samwilson: thank you and @Kaldari:, it's a small thing but it's going to really improve things. @Xover: that's your patch, AFAIK that's all that's needed? Inductiveloadtalk/contribs 23:05, 4 December 2020 (UTC)
@Inductiveload: That should be all that's needed, yes. But with the caveat that this part of the stack is really hard to test without a complete environment, so there may be some factor I didn't account for that makes further tweaks necessary. We should maybe start thinking about getting a representative "Test Wikisource" set up on the test cluster if we're contemplating anything more than this exceedingly trivial patch. --Xover (talk) 08:26, 5 December 2020 (UTC)
@Samwilson: Tesseract hOCR gives you structured information on geometry down to the character box level. It doesn't have any appreciable support for font styles (and struggles recognising italic text sometimes), but for coarse text features (regions, columns, paragraphs, etc.) the only difference with Google Vision I'm aware of is accuracy (which I haven't studied systematically, but believe is comparable). Any transforms (fixups) designed to work on an abstract canonical representation should be applicable to data from both. --Xover (talk) 21:16, 4 December 2020 (UTC)
@Xover: That's great. Certainly, we'll do all we can to make anything work for whatever backend OCR is used; hopefully we can make it generic, but I guess there's not actually a standard to follow with the abstract structure is there? (Whenever working with anything Google, I always have a feeling in the back of my mind that they're going to throw it all away at some point! Good not to be too tightly tied to 'em.) —Sam Wilson 22:55, 4 December 2020 (UTC)
@Samwilson: I would suggest starting with hOCR (enwp). I haven't tried deserializing it into an in-memory model (my parser is streaming and converts to sexpr as it goes, with no intermediary representation to speak of), but as an information model it looks reasonable for the purpose. --Xover (talk) 08:36, 5 December 2020 (UTC)
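As a rough sketch of what deserializing hOCR into an in-memory model could look like, the example below reads the `ocr_par` and `ocrx_word` elements and their `bbox` properties using only the Python standard library. The class names `Word`, `Paragraph` and `HocrParser` and the helper functions are hypothetical, chosen for this example; this is neither Xover's streaming parser nor anything the OCR tool is known to use.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Word:
    text: str
    bbox: tuple  # (x0, y0, x1, y1) as given in the hOCR title attribute


@dataclass
class Paragraph:
    words: list = field(default_factory=list)


def parse_bbox(title):
    """Extract the 'bbox x0 y0 x1 y1' property from an hOCR title attribute."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:5])
    return None


class HocrParser(HTMLParser):
    """Collect words grouped by paragraph (hypothetical minimal model)."""

    TRACKED = {"div", "p", "span"}  # container tags whose nesting we follow

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._stack = []   # one entry per open tracked tag: "par", "word" or None
        self._word = None  # word currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag not in self.TRACKED:
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        kind = None
        if "ocr_par" in classes:
            kind = "par"
            self.paragraphs.append(Paragraph())
        elif "ocrx_word" in classes:
            kind = "word"
            self._word = Word("", parse_bbox(attr.get("title") or ""))
        self._stack.append(kind)

    def handle_data(self, data):
        if self._word is not None:
            self._word.text += data

    def handle_endtag(self, tag):
        if tag not in self.TRACKED or not self._stack:
            return
        kind = self._stack.pop()
        if kind == "word" and self._word is not None:
            if not self.paragraphs:  # word outside any ocr_par
                self.paragraphs.append(Paragraph())
            self.paragraphs[-1].words.append(self._word)
            self._word = None


def hocr_to_paragraphs(hocr: str) -> list:
    parser = HocrParser()
    parser.feed(hocr)
    parser.close()
    return parser.paragraphs


def to_plain_text(paragraphs) -> str:
    """Join words with spaces and paragraphs with a blank line."""
    return "\n\n".join(" ".join(w.text for w in p.words) for p in paragraphs)
```

Because the paragraph boundaries come straight from the OCR layer, `to_plain_text` keeps them as blank lines instead of leaving them to be reinserted by hand. Any fix-up written against `Paragraph` and `Word` would not need to know which OCR backend produced the data.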

 Comment Hmmm. Some maintenance is more important than others, like fixing immature author pages that omit the basic components.

Noting that some of us spend hundreds of hours a year doing maintenance, for multiple years, so excuse me for not getting excited by the post. Happy for you to join in doing maintenance. — billinghurst sDrewth 10:00, 4 December 2020 (UTC)

No need to get offended, my tone is meant to be jovial. I've been trying to do maintenance since I joined two weeks ago, but I'll remind you that I almost walked away from Wikisource in that first week because you went on my talk page and told me to STOP DOING MAINTENANCE and instead just discuss the stuff on Scriptorium. So now I'm here (I still do try Maintenance, I was recently shown Match/Split and I've been trying to work on that backlog) and trying to discuss it, even recognizing that a couple other editors have been super patient and welcoming in guiding my efforts to tackle such mundane tasks (it was only yesterday it was pointed out that I shouldn't be using {{lh}} and {{rh}} as left-header and right-header...since LH is actually completely unrelated to headers, whoops) and you're at least coming across as dissatisfied and upset if I'm reading it correctly. I understand people spend hundreds of hours a year for multiple years, so I'm just confused why for 15 years a maintenance category with only 12 works hasn't been cleaned out even once...presumably the system could use improvement, which I believe is the Wiki Wiki Way, right? The claim "it's because some maintenance is more important than others" doesn't work very well, it would appear to me, since "The very important Category:Works with no license template has 250 works and they haven't been cleaned up since at least 2008.", and the Category:Deletion requests/Unknown translators has only 35 works sitting on it with a backlog template since 2006. So ideally we (and I'm willing to contribute but I'm not sure my voice is worth as much as others' here) can tackle some of these backlogs constructively and figure out which categories shouldn't be kept in the future, or need templates more closely watched, more frequently used, whatever the case may be. Peace.salam.shalom (talk) 14:40, 4 December 2020 (UTC)
Umm, I asked you to stop creating cross-namespace redirects. So please do not classify actions that are contrary to our rules as undertaking maintenance. I also at that same time pointed you here. RH does not equal "Right header" it is an abbreviation for "RunningHeader" and please don't AT me that it is a misleading abbreviation, as it is one that I rail against, and I don't use. If you need guidance in performing maintenance, then simply state that you are looking at a maintenance category, and would like some guidance. You can also look at some of the information at Wikisource:Maintenance. — billinghurst sDrewth 04:39, 5 December 2020 (UTC)
He created one cross-namespace redirection page, from Author: to Portal:, which was entirely appropriate in that situation, and not in violation of the CSD you specified. Your “guidance” on his talk page was largely unhelpful, and your response here quite combative, in violation of the principles of kindness to newcomers that should be promulgated here, but, I realize, are not codified in any rules. His reference to {{lh}} and {{rh}} was not related to you. TE(æ)A,ea. (talk) 13:38, 5 December 2020 (UTC).
Oh BS. If I am quoted as telling someone to stop doing maintenance (in capitals) on something that is factually incorrect, then please expect a clarification. Cross namespaces are not correct in that situation, it is criteria for speedy deletion; and extended uses have always been discussed here. Article namespace in this situation has always included Author: and alike namespaces. — billinghurst sDrewth 13:58, 5 December 2020 (UTC)
i would have more sympathy for the "new" editor, if they did not "joke:... Hey slackers, get to work!" - not offended, just tired of the loud jokes. completing a work would be nice, before practicing with the redirect button. there is help about templates, but i guess, tl;dr. Slowking4Rama's revenge 16:32, 5 December 2020 (UTC)

A main namespace page question

Based on editors' experience, how many pages from the [Page: namespace] are reasonable for inclusion in a single main namespace page? The word count is about 520 per page.— Ineuw (talk) 07:32, 4 December 2020 (UTC)

Rarely do I worry about page count, and more rely on natural breaks in works. Outside of natural breaks, I would only break a work for technical reasons. — billinghurst sDrewth 09:53, 4 December 2020 (UTC)
Thanks. In the meanwhile, I found that in this book, using anchors works very well.— Ineuw (talk) 04:46, 5 December 2020 (UTC)

I am finding problem with its collapsibility. This template has been used in a work I am currently engaged in. There, the output of the template remains in collapsed condition and does not expand. Can anybody please fix the template? Thanks. Hrishikes (talk) 03:11, 6 December 2020 (UTC)

1925-1977 yearbooks with no copyright notice

I have a small collection of grade-school yearbooks, some of which are probably public domain because they were published before 1977 without a copyright notice. One, for example, is an elementary school yearbook from 1962.

I saw that Wikisource has a portal on yearbooks, but the only yearbook listed there is for a university.

Despite seeming to meet the criteria for the public domain that we're all used to at this point, I can't deny I'm a bit concerned about the personal information included therein. I'm divided as to whether I should actually release scans of these to the public, especially considering that many of the students listed in them are still alive today. On the one hand, so much time has passed since then and the lives of the then-students are without a doubt quite different, with many of them probably living in entirely different locations than those listed at the time. And as for phone numbers, for obvious reasons it would be incredible if you actually found that one of them still dialed to the same house.

On some other notes I assume you could call yearbooks "published works" even though yearbooks aren't generally widely available to the public, but authorship I guess tends to just go to the school collectively.

Anyway, what do you guys think about hosting grade-school yearbooks? Are they in the public domain as a usual book of that description would be, or do personal info rights also need to be taken into account legally...and beyond that just morally to be quite honest? It's cool to own these little obscure pieces of history and it feels kind of bad for me to not share them with the world, but then again, would I feel worse if I did? PseudoSkull (talk) 08:49, 6 December 2020 (UTC)

Amateur opinion here; on copyright I have no idea, I was shocked to see New York Times says that it was published without copyright through 1977 and therefore can print more modern stories than common sense suggests would be allowed...but perhaps I am wrong there. On the moral issue though, would it be possible to upload them as DjVu/PDFs but missing the pages that are nothing but individual photos/names, and therefore print mostly the photos of classrooms, anonymous students outside at lunchhour, teachers, fieldtrips, etc? (I feel morally teachers are adults being photographed at the job they voluntarily attend, so there is less expectation of privacy than a child required to attend a classroom and required to be photographed every year for a mugshot). Just blurring the photos wouldn't be great imho though, because if somebody googles "Ruby Leshiqueka Freeman" and finds out she went to your high school and Grade 9/1978 lines up with her details elsewhere, they'll quickly get ahold of the book elsewhere...better the full names just don't appear. Interestingly on the moral issue though, I wouldn't see any problem with printing a 1978 phonebook since it's just names/adults/numbers, not photos and personal details and signed admissions of teenaged love and heartbreak...but yearbooks, definitely a weird territory. Best compromise that comes to my mind would be uploading the individual group/classroom/fieldtrip photos to Commons under a Category:Franklin North Secondary School in 1978 category, etc...but cropping out any names. Peace.salam.shalom (talk) 09:40, 6 December 2020 (UTC)
in general yearbooks are PD-not renewed in the US - but, so far only ad hoc uploads have been done, see also c:Category:Yearbook photos. you could search here for renewals https://cocatalog.loc.gov/cgi-bin/Pwebrecon.cgi?DB=local&PAGE=First , but also before 1977 by each year, at IA copyright catalogs, https://archive.org/details/1977periodicalsj3312libr .- you might have to fight deletion at commons by europeans who do not understand US formalities.
personality rights are not handled well at commons or wikidata, which might give you pause, to a mass upload. you would have to get access to a book scanner, which expedites creating the multi-page pdf. --Slowking4Rama's revenge 16:38, 6 December 2020 (UTC)

Slow response times when previewing large pagelist on Index : page

Just me or is the previewed pagelist here slow to show or update when changed:- https://en.wikisource.org/w/index.php?title=Index:The_Book_of_Orders_of_Knighthood_and_Decorations_of_Honour_of_All_Nations.djvu&action=edit

Thanks..

If something doesn't apparently cope I'd like to know why. ShakespeareFan00 (talk) 20:16, 6 December 2020 (UTC)

When is a template justified?

So there's a page Page:Waylaid_by_Wireless_-_Balmer_-_1909.djvu/181 with a quoted quoted quote, using " ' " . Knowing about {{" '}} and such, I went and looked around. There *is* a {{' " '}}, but there is not a {{" ' "}}. Should I create that now, even if apparently no other usages have been previously encountered? Or do your great worthies know of a better way? Shenme (talk) 04:02, 7 December 2020 (UTC)

@Shenme: Thank you for doing the validations on WBW, in general. Also, thanks for making me aware of the gap between quotation marks and apostrophes being possible in the first place, and being concordant with Wikisource's styling practices.
I have gone ahead and created the template you suggested. I have quite a few books left to do by Balmer specifically, and I wouldn't doubt that he'd used the "'" combination in another book of his. Plus, who knows, maybe someone else has done it at some point too, so good to have it just in case. PseudoSkull (talk) 05:47, 7 December 2020 (UTC)

How to notice existence of a file's hidden first page before uploading?

A short time ago I have uploaded File:Zawis and Kunigunde (1895).djvu from the IA to Commons using the IA uploader. The IA uploader showed me the page with the title Zawis and Kunigunde as the first page of the book (which really is the first page of the book) and asked whether I wanted to remove it. I ticked "No", as this page was a part of the book. However, after the upload was finished, a completely different page (some technical one with rulers and coloured pattern) appeared there at the position of the first page. The one I was asked about appeared on the position of the 2nd page after the upload.

It is not a big problem for Wikisource (I can just mark it as empty), but it does not look good on Commons, as all thumbnails of the file display this technical page instead of the cover of the book (not only on the file’s page, but also in its categories etc.). The IA uploader advised me that I should raise all technical problems on Phabricator, which I did (task T268246), but quite unsurprisingly it did not trigger any response there. Does anybody have any idea how the existence of such a page can be noticed before the upload so that one can tick it to be removed? --Jan Kameníček (talk) 09:09, 7 December 2020 (UTC)

2020 Coolest Tool Award Ceremony on December 11th

Entering a new work

Hello Wikisource community-

I have recently participated in collaborating with a more experienced Wikipedian with two articles, and plan on writing my first article soon. With that project I transcribed an important unpublished autobiography for the person we did a biographical article on. It was unpublished and I have permission from descendants to transcribe and archive it. It would be a significant addition to the collection here, as it documents first hand experience with the Northern Transcontinental Survey in 1883, and rugged life in the Western frontier. I have already transcribed it as the typed copy I was given was very sketchy. Of the 71 pages, I think I have guessed at about 4 or 5 words, otherwise accurate. There are I think 2 or 3 illustrations included.

Now my question: Once this is archived with Wikisource, can I then refer to that work in our Wikipedia biography as a reference from here?

Thanks, looking forward to entering that and a number of other works of interest.

Noel unsigned comment by Noel Andrew Sherry (talk) 12:35, 7 December 2020‎.

@Noel Andrew Sherry:: yes, as long as the autobiography is in the public domain (or otherwise released under a permissible license for Wikisource, such as CC-BY-SA), you can post it here. Since it is (was!) unpublished, {{PD-US-unpublished}} applies and as long as the author died over 70 years ago, it should be fine (but ideally we'd know the author's date of death) and we'd love to have it.
We would ask that you also upload the scanned document for validation purposes. If you need help preparing such a document from raw images, I can help with that.
Once uploaded here, you can refer to it from Wikipedia with their en:w:Template:Wikisource template, or cite it as a reference with en:w:Template:Cite wikisource. Inductiveloadtalk/contribs 12:49, 7 December 2020 (UTC)

Fantastic, Inductiveload, very helpful. I have some other projects this week but will dive in over the Christmas holiday period on this. Looking forward to it. Noel Andrew Sherry (talk) 13:08, 7 December 2020 (UTC)

Login & Password Problem

Hello Wikisource Community, I had a login and password for Wikipedia and have used it for editing two articles for the last month or so, with lots of activity.

Then today I learned about Wikisource and so logged in and created a new login and password, which it seemed to be necessary to do, but now that seems to be the new login and password for all Wiki accounts, which I did not realize.

So my question: how do I revert to my original login and password so I can benefit from my "history" with Wikipedia?

Noel Andrew Sherry (talk) 13:19, 7 December 2020 (UTC)

@Noel Andrew Sherry: I think log out from this account and re-log in with your old account. Then you can just re-sign these messages and I'll move your Wikisource user talk page for you when I know what to move it to. Inductiveloadtalk/contribs 13:32, 7 December 2020 (UTC)

16:15, 7 December 2020 (UTC)

Minks, covid and denmark

I am new here so please be gentle.

The w:Canadian Broadcasting Corporation has recently reported covid discovered in farmed minks in British Columbia. When I saw this article I vaguely remembered seeing another article linking covid with farmed minks somewhere else. I googled and found https://www.who.int/csr/don/06-november-2020-mink-associated-sars-cov2-denmark/en/ which says this association dates back to at least June 2020, which I do not see reflected in w:COVID-19 pandemic in Denmark. Is there something WS can do in terms of collecting source information related to this topic? Just wondering. Ottawahitech (talk) 20:39, 7 December 2020 (UTC)

@Ottawahitech: Welcome. Generally, Wikisource hosts works which are in the public domain and cannot host copyrighted works including most (or practically all) modern news articles. What we can gather are government documents which are public domain in many countries (but not all). For what we already have see Category:COVID-19. --Jan Kameníček (talk) 20:54, 7 December 2020 (UTC)
@Jan.Kamenicek: Thanks for replying so promptly and for pinging me.
The link I provided above is to material published by the w:World Health Organization, which I assume is considered a government source of some sort? Thanks in advance, Ottawahitech (talk) 21:09, 7 December 2020 (UTC)
@Ottawahitech: the WHO claims copyright on (most of) its documents and doesn't generally release them under a license compatible with Wikisource. See https://www.who.int/about/who-we-are/publishing-policies/copyright. Inductiveloadtalk/contribs 21:20, 7 December 2020 (UTC)
@Ottawahitech: (after edit conflict) I had a quick look at it and at the bottom of the page there is a link to WHO’s Copyright, Licensing and Permissions, where it is stated that all publications published by WHO fall under the CC BY-NC-SA 3.0 IGO licence, which unfortunately disallows commercial use. Wikisource as such does not use any material commercially, but it allows third parties to do so, and so it cannot host texts under such restricted conditions. For more details see Help:Licensing compatibility. So the only possible way is to contact WHO and ask them to allow the usage of the material under some less restrictive license, such as CC BY-SA 3.0, which would require some effort with quite small chances of success. --Jan Kameníček (talk) 21:31, 7 December 2020 (UTC)
OFFTOPIC: I am pleasantly surprised to see so much well thought out and elaborate response to my query. I don't remember receiving so much help anywhere else on wmf-sites from user-IDs that I do not recognize. Having said this, I wonder if I am in the minority when I give up when confronted by the complexity of all the different copyrights?
When I initially looked at the who-page I could not find the copyright notice, but I did see a w:Twitter, w:Facebook and other commercial social-networking logos on the page. I also do not understand the implication of that. Thanks in advance, Ottawahitech (talk) 21:50, 7 December 2020 (UTC)
@Ottawahitech: copyright is incredibly complex and frustrating. It is not just you! The thing about copyright is that it nearly always subsists except when explicitly released, so unless you happen to know that a certain organisation licenses its output appropriately (e.g. the UN sometimes, but not the WHO, US federal govt, some UK govt stuff, etc., plus a few exceptions like non-US legislation due to US rules being the norm at Wikisource), anything recent you find online will sadly be out of the public domain probably until we're all dead or our minds are uploaded to the cloud.
On the other hand, quite a lot of stuff is published online with incorrect licenses that assert rights that do not subsist. It's changing a bit now, but it has been "fashionable" for libraries to slap non-commercial licenses on their scanned works from the birth of the Internet until recently, which is a doctrine known as "sweat of the brow" that is generally not recognised in the US or by Wikisource or Commons. Inductiveloadtalk/contribs 22:31, 7 December 2020 (UTC)
@Ottawahitech: Which is the reason why many of us decide to stay on the safe side and usually transcribe here only what we are sure is definitely out of copyright, which in the U.S. (as en.wikisource follows U.S. copyright laws) means works published more than 95 years ago :-) --Jan Kameníček (talk) 22:43, 7 December 2020 (UTC)
Rant: @Inductiveload: Yes, it is incredibly frustrating especially when trying to build up information about w:COVID-19. Governments/news media/you name it are hoarding information which they are not sharing freely with the public to the world's detriment. Ottawahitech (talk) 22:53, 7 December 2020 (UTC)
yeah, United Nations is misunderstood, like European Union. (we are spoiled by PD-USGov, but even their feeds include republishing copyrighted work) much deletion over at commons over these works. c:Commons:Copyright_rules_by_territory/United_Nations. Slowking4Rama's revenge 00:08, 8 December 2020 (UTC)

The Great Gatsby becoming PD in 2021

Hi! I found this piece of news https://chicago.suntimes.com/2020/1/22/21076846/great-gatsby-copyright-ends-2021-f-scott-fitzgerald-public-domain

When 2021 comes we may have a lovely novel on here. WhisperToMe (talk) 05:26, 6 December 2020 (UTC)

I assume people can find the best PDF/ebook of it now, and may or may not be able to do some proofreading and annotations so long as it doesn't appear to us commoners until January 1? Or do you not do work until January 1 on it at all? Peace.salam.shalom (talk) 05:45, 6 December 2020 (UTC)
To be fair, everything published in the United States in 1925 becomes PD in 2021. We could add the original novel of Gentlemen Prefer Blondes. BD2412 T 06:37, 6 December 2020 (UTC)
What I did, when I was proofreading a newly copyright-free work last year, was wait until a scan on the Internet Archive became available (which occurred in mid-January), and then upload the work to Wikimedia Commons and begin proofreading on Wikisource. The work I chose was quite obscure, however, so there may already be an accessible scan in existence, or they might release the scan quite quickly, so that work may begin much sooner. Also, there is a (seemingly forgotten) list of works that should be on Wikisource entering the public domain in 2021 listed at Requested texts, 1925. TE(æ)A,ea. (talk) 14:30, 6 December 2020 (UTC).
There's a scan available right now on the Internet Archive; it's from a clearly PD edition, but one from the 1950s or 1960s. There's a couple more modern scans as well. There's a Gutenberg Australia edition, and a Wikilivres version, at least buried in web.Archive.org. (Unfortunately the DJVU scan they used doesn't seem to have been archived with it.) Wikimedia is not a private group; there's nothing proofread here that can't appear to everyone. I'm sure it will be done quickly enough once it does hit here.--Prosfilaes (talk) 07:14, 8 December 2020 (UTC)
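The arithmetic behind "published in 1925 becomes PD in 2021" can be written out as a small sketch. The function name is made up for this example, and it assumes the simple 95-year publication term mentioned above; it ignores works that entered the public domain earlier, for instance through a missing notice or a missed renewal.

```python
def us_public_domain_year(publication_year: int) -> int:
    """Year on whose 1 January a work leaves US copyright under a 95-year term.

    Simplified assumption: the term runs to the end of the 95th calendar
    year after publication, so the work becomes free on the following
    1 January.
    """
    return publication_year + 95 + 1


# The Great Gatsby was published in 1925.
gatsby_pd_year = us_public_domain_year(1925)
```

Here `gatsby_pd_year` is 2021, matching the date discussed in this thread.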

How can I add a category?

I am trying to add Category:Wikimedia to Jimmy Wales Speaks at Closing Ceremony of Wikimania 2015. Thanks in advance, Ottawahitech (talk) 18:49, 9 December 2020 (UTC)

  1. @Ottawahitech: You can simply click the edit button and add [[Category:Wikimedia]] to the very bottom of the page.
  2. You can also go to Gadgets in your Preferences and tick "HotCat". This will add + and − signs to the bottom of every page, which makes adding categories very easy. If you want to add a category, you click + and write the category’s name. To remove a category, you click −. --Jan Kameníček (talk) 21:11, 9 December 2020 (UTC)

Community Wishlist Survey 2021

SGrabarczuk (WMF)

15:03, 11 December 2020 (UTC)

Translation license box does not expand

The translation license box does not expand after clicking it, see e. g. at the bottom of Tyrolean Elegies. --Jan Kameníček (talk) 17:59, 9 December 2020 (UTC)

Can somebody have a look at it, please? It is not possible not to show the license of a work to the readers, it definitely has to be accessible.
The problem does not appear in the template’s page but it appears in the main namespace; besides the above mentioned page see e. g. Constitution of Haïti, Grimm's Household Tales (Edwardes) or any other page using the template. --Jan Kameníček (talk) 10:46, 12 December 2020 (UTC)
@Inductiveload: This is triggered by MediaWiki:Gadget-PageNumbers.js. If the gadget is enabled it breaks as above; if it is disabled it starts working again. It doesn't break on the template’s page because PageNumbers isn't active in the Template: namespace.
I am unable to reproduce by disabling the gadget and then manually pasting the code into my console, so it seems timing-dependent. Collapsibility in MW is provided by jquery.plugin.makeCollapsible, but trying to set a breakpoint in the debugger didn't work and it looks like the module isn't even getting loaded (no ideas what's going on there; could be anything from a local content blocker to my insufficiently caffeinated brain).
I suspected this to be caused by PageNumbers somehow either copying the .licenseContainer rather than moving it (and losing attached event handlers), or causing it not to exist in the DOM at the point the plugin runs (so they never get attached). But I wasn't able to verify or exclude that before running out of time. --Xover (talk) 15:24, 12 December 2020 (UTC)
@Xover: I suppose it is indeed some horrid timing thing, because it doesn't work if the gadget is loaded via prefs (as reported by Jan), but it works fine if you load it from user JS with
mw.loader.load(['ext.gadget.PageNumbers']);
So the move from Mediawiki:Common.js to gadget probably provoked whatever unsafe timing/unmet dependency is causing this. I will investigate further. Inductiveloadtalk/contribs 22:22, 12 December 2020 (UTC)
Well, looks like this was caused by an abundance of caution in not moving some code from Mediawiki:Gadget-PageNumbers.js (i.e. the DOM-ready hooks) to Mediawiki:Gadget-PageNumbers-core.js. I had considered it, but went for a softly-softly approach, not wishing to change too much at once. This resulted in a delicate race condition between the DOM-ready hooks, the core.js gadget loading and the collapsing code (I think). Moving the DOM-ready hooks to the core gadget (where, IMO, they belong anyway) seems to have fixed it for me.
Relevant diffs:
@Jan.Kamenicek, @Xover: is it working for you now? Inductiveloadtalk/contribs 23:01, 12 December 2020 (UTC)
@Inductiveload: Great, it works well now. Thanks very much. --Jan Kameníček (talk) 23:21, 12 December 2020 (UTC)

Captains courageous and version choice.

Fae has uploaded a million books to Commons from the Internet Archive. We might want to develop a process to pick the version we want to work on. For example, we have

c:File:"Captains courageous", a story of the Grand Banks (IA captainscourageo00kipl).pdf,
c:File:Kipling - Captains courageous, 1899.djvu, and
[5]. and search for works is poor, so it is easy to start on a reprint. Slowking4Rama's revenge 14:48, 12 December 2020 (UTC)

Copyright renewal records search engine linked from Help:PD doesn't work

Help:Public Domain refers to some "United States Copyright renewal records search engine", but the provided link seems dead and so should be replaced. I can replace it e. g. by this link, but I do not know which possibilities the previous search engine offered and so I would like to ask whether it is an adequate substitute. --Jan Kameníček (talk) 17:28, 12 December 2020 (UTC)

Request for retrieval of scans from HathiTrust

There are five volumes of The Works of Honoré de Balzac (Avil Publishing, 1901) missing on IA, but they appear to be present in HathiTrust. The missing volumes are: 9, 10, 19, 23, 24. Would it be possible for someone with the appropriate access and tooling to pull a scan from HathiTrust and place it in the appropriate Commons category? Thanks —Beleg Tâl (talk) 14:41, 8 December 2020 (UTC)

Note: You may notice that Volume 24 already exists on Commons; this is a mistake in the title, the file is actually Volume 29, and a move request has already been submitted. —Beleg Tâl (talk) 14:41, 8 December 2020 (UTC)
@Beleg Tâl: There is a discussion of bulk retrieval of texts occurring at commons. I thought this thread was to be something like that. I don't understand the sequestering of PD text under a privilege wall....--RaboKarbakian (talk) 16:02, 8 December 2020 (UTC)
@Beleg Tâl: It looks like Avil issued two editions in 1901, a deluxe edition in 36 vv. and a university edition in 18 vv. Your link is to the 18-volume edition. Hathi also has the 36-volume edition (incomplete). Which edition are you after? --Xover (talk) 16:07, 8 December 2020 (UTC)
@Xover:, I am after the 36-volume deluxe edition. In fact, I thought it was a 35-volume edition; if you are able to also grab Volume 36 that would be fantastic. —Beleg Tâl (talk) 16:08, 8 December 2020 (UTC)
I'll start with the last one and work my way forward, if anybody wants to start in the other end. --Xover (talk) 16:22, 8 December 2020 (UTC)
Ok, vv. 19–36 are downloaded and being processed, so I'll upload those some time tomorrow. vv. 9 and 10 are missing from that set, so those will have to be tracked down separately. --Xover (talk) 20:39, 8 December 2020 (UTC)
19, 23, and 36 have now been uploaded, but I'm waiting on the rename before uploading 24. --Xover (talk) 10:28, 9 December 2020 (UTC)
@Beleg Tâl: I found and uploaded 9, and 24 is currently uploading (since the rename of 29 is done). But I can't find volume 10 of this edition anywhere so that one is still missing. --Xover (talk) 18:46, 9 December 2020 (UTC)
fyi, hathi trust works tend to come from google books, not IA, and they now link to the google book version, to download [7] Slowking4Rama's revenge 01:54, 17 December 2020 (UTC)

Why does <pages> now put page tag in odd place?

I don't know whether to ask this here or on Mediawiki: in recent days, output from the <pages> feature has put the marginal "original" page number somewhere in the middle of the source page, not at its top. This makes it difficult to know at a glance which source page contained a piece of text being used in a citation, for example on Wikipedia. I realize (actually, just noticed) I can hover and see the highlight, but that doesn't work on a touchscreen—not that touch-only is usually a useful scenario.

One example (of many): 1911 Encyclopædia Britannica/Aldrich, Thomas Bailey. Where does the text of page 537 start? More exactly, what's the rule for where the page number is located now, and why the change? A few days ago, [536] would have been at the top and [537] opposite the word "during". DavidBrooks (talk) 04:17, 16 December 2020 (UTC)

@DavidBrooks: there's ongoing wrestling with the pagenumbers script. Can you try it now - I think I may have found a way to suppress the issue for now. It's hard to be sure because it's something to do with timing, so it depends on lots of things including your browser, whether you have the browser dev tools open, your browser cache, WS's own caches and the phase of the moon. Inductiveloadtalk/contribs 05:01, 16 December 2020 (UTC)
A very small sample (10 pages, 3 page crossings) does indeed seem to be fixed. Thanks. It's a 4% waning crescent btw. DavidBrooks (talk) 05:15, 16 December 2020 (UTC)

Template:bar not working anymore

{{bar|2}} is supposed to appear at "at once to —— Ashland" on this page.

This template is not working on my laptop or my phone at all, and is giving me this blank space everywhere, even on the template's own documentation. Is it just happening on my end or is this a sitewide issue? PseudoSkull (talk) 14:36, 18 December 2020 (UTC)

Same for me. What browsers are you seeing this on? (I'm currently using Safari for iOS.) BethNaught (talk) 14:43, 18 December 2020 (UTC)
I'm seeing it on Chrome for Mac and the same on Android. PseudoSkull (talk) 14:46, 18 December 2020 (UTC)
I'm also seeing it on Firefox on Windows--both logged in and logged out. Seems like it's a WS issue then. BethNaught (talk) 15:43, 18 December 2020 (UTC)
@Inductiveload: An unintended effect of [8]? BethNaught (talk) 15:50, 18 December 2020 (UTC)
Reverted. Odd, but it was just an attempt to work around some device limitations on export. Which hopefully will be fixed upstream eventually anyway. Inductiveloadtalk/contribs 16:03, 18 December 2020 (UTC)
@Inductiveload: visibility:hidden hides the entire inner span, but the line effect is applied by showing a number of em dashes of transparent colour with a strikethrough inherited from the outer span. You can better see what's going on if you add some content between the outer and inner span  —————  (the non-hidden non-breaking spaces are displayed with strike-through; the inner span is hidden, including its character box decorations), or by removing the transparent color and replacing the dashes with underscores: _ _ _ _ _. I don't think this approach to faking a horizontal bar is susceptible to visibility:hidden. --Xover (talk) 19:45, 18 December 2020 (UTC)

I just moved an index and a handful of pages due to a filename change at commons. I was asked not to move things (something about tools) but I thought that was about Main. So, there are a few deletes in my recent contributions which I get an error for when I try to paste {{speedy}}. Can someone...?--RaboKarbakian (talk) 15:07, 21 December 2020 (UTC)

@RaboKarbakian: Done. The issue is that when a normal user moves pages the software will always leave a redirect behind at the old name. Administrators get an extra checkbox when they move pages to suppress redirects. So for any case where you have multiple pages to move and where redirects from the old names are not desirable, it's generally going to be preferable to flag down an admin to do the move. Cleaning up afterwards takes longer for everyone involved. --Xover (talk) 17:35, 21 December 2020 (UTC)

PastLovingBot for bot status

Hi, I may have only fairly recently become very active here at Wikisource, but this isn't my first time using bots on a wiki. I have a little experience with it on Wiktionary as well, where I had been semi-automating some surname entries. I have general amateur coding experience of about a few years' worth in Python.

PastLovingBot's main purpose yet is to automatically add noinclude headers and footers, which are mathematical and tedious, and honestly should be automated more often than typed manually, as we should be focusing all our human energy on more high-level tasks. I have stopped adding the headers manually myself because I know that I can just use the bot to do it all for me.

It also corrects some minor errors in proofreading that are consistently errors every time they are to be found. More detail about my bot's tasks can be found on the user page of the bot.

I have used PastLovingBot, within the near 1 minute throttle and with heavy supervision, to iterate through all the pages of both A Wild-Goose Chase (Balmer) and Waylaid by Wireless, while tinkering with it when it had issues. Please refer to its activity on these pages as proof that it has ultimately done its work correctly on those pages. Now I believe the bot could probably run through just about any book's Index page in this manner (but will have to be slightly customized for each book as many books do their headers and footers differently, even those by the same publisher or author).

I am requesting the bot flag for my bot, so that 1. I can be permitted to do these jobs more quickly, 2. so that the bot can be recognized as a community bot and I can receive requests from other people to add/fix headers and footers to their books, and 3. so that the edits can be marked as bot edits, and can be filtered out in recent changes logs by people who only want to see human edits. PseudoSkull (talk) 00:33, 10 December 2020 (UTC)

Please can you give an example of an edit to "add noinclude headers and footers" in the sense referred to here. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 21:39, 10 December 2020 (UTC)
Sure, I can give several examples:
  • In diff the bot fixed an error with the header in which Template:ft was erroneously used instead of Template:rh, which is the raw header template.
  • In diff, a header was added where one didn't exist. As you can see, the bot well knew in this case that the page was an even number, meaning that the book's name would need to be shown in the header, and it was aware of what page number was to be placed there. It was no coincidence that it got this right either; you can see on diff (the next page), it also knew that the odd number was a chapter page, and got the chapter name right.
  • diff It knows when chapters begin, so as to add only a proper footer instead of a header like usual.
@Pigsonthewing: PseudoSkull (talk) 23:27, 10 December 2020 (UTC)
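
The header rule described in these examples (book title on even pages, chapter title on odd pages, no header on a page where a chapter begins) can be sketched roughly as below. This is only an illustration, not PastLovingBot's actual code: the function name `running_header`, its parameters, and the choice of which side of {{rh}} carries the page number are all assumptions that would differ from book to book.

```python
def running_header(page_number, book_title, chapter_title, chapter_start=False):
    """Return wikitext for a running header, or "" on a chapter-opening page.

    Hypothetical layout: even pages show the page number on the left and the
    book title in the centre; odd pages show the chapter title in the centre
    and the page number on the right.
    """
    if chapter_start:
        # Chapter-opening pages get only a footer, so no header is produced.
        return ""
    if page_number % 2 == 0:
        cells = [str(page_number), book_title, ""]
    else:
        cells = ["", chapter_title, str(page_number)]
    return "{{rh|" + "|".join(cells) + "}}"


print(running_header(12, "WAYLAID BY WIRELESS", "THE START"))
print(running_header(13, "WAYLAID BY WIRELESS", "THE START"))
```

A real run would still need the per-book customisation mentioned above, since publishers place titles and page numbers differently.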
@Inductiveload: @Billinghurst: @Xover: Pinging admins who seem experienced with code, as this discussion has been stagnant for almost two weeks. Also, the bot has been updated since the last posting and can now do more tasks. I still use it. It'd be nice to have the bot flag. PseudoSkull (talk) 02:01, 24 December 2020 (UTC)

Quick scan.

I don't see clear technical statements of identification and fixes that you are undertaking. In fact, we would generally not approve bots to do bot work on such a wide range of tasks in a blanket fashion. I would think that we are wanting to see clear examples of anything being done, not expecting anyone to fathom it.
run through Index pages and automatically correct headers or footers where necessary, or add them if they're missing.

Index pages? Headers? Footers? Fix? What does that mean?
Do you mean adding {{RunningHeader}}? If yes, then let us call it that.

run through Index pages and automatically correct headers or footers where necessary, or add them if they're missing.

No, don't do it. Why would you do that on others work? What you do with your works is your business, there is no consensus that the template is to be used. In fact, I would say that I would prefer that it isn't.

Replace any accidental en dashes (–) with em dashes (—)

Accidental? How do you know that it is accidental and not what is shown in the work?

Replace plain double-em-dashes (——) with ——

See above

Empty any OCR content of pages that do not need to be proofread

Huh?

create TOC and Illustrations pages, to make it easier and make PseudoSkull not have to type as much

If you are doing preliminary work prior to proofreading, then I don't have any concern

will paste autocorrected OCR scans into PseudoSkull's userspace at User:PseudoSkull/Waylaid for now

OCR from where? OCR current aligned to a Page: from a File: Why? Why would we not put OCR against the image and leave not proofread.

I would expect that each bot run would be able to clearly identify the task being undertaken, possibly with a brief technical explanation. I know that I run a lot of ad hoc stuff though would say something like convert {{header}} to {{IrishBio}}. Where I am doing more complex replacements I will note the detail on the bot user page (see user:sDrewthbot for examples). More complex changes get their own subpage. One can never have too much specific technical detail, either for when you look for faults, or someone else wants to do something similar some time down the track. — billinghurst sDrewth 02:25, 24 December 2020 (UTC)

Addended comment. Some of the changes that you are proposing for works that you are proofreading are better to be fixed WHEN proofreading. Look to utilise something like Wikisource:TemplateScript (TS). I have a truckload of coding in mine to pick up that ugly legwork, or some of those problematic to identify eg, bom => born. TS is really good for works that have lots of technical formatting like TIWW or IndianBio. — billinghurst sDrewth 02:32, 24 December 2020 (UTC)
@Billinghurst: Thank you for your feedback. I suppose I did leave the page a bit more ambiguous than necessary, and I will tidy it up to make it more specific. Most problematic of all of it having been ambiguous is that people seem to get the idea that I might be using the bot on other people's works. I only thus far have used the bot for my own works, and I don't plan to use it on anyone else's works, unless they specifically request it; in which case, I will carefully look through the technical aspects of their work, to see if, for example, en dashes are to ever legitimately appear in that text. The books (only books) thus far I've done are all guaranteed never to have en dashes appear in them, so if an en dash ever appears in their transcriptions this is an error, likely caused by me pressing Option + - on the Mac keyboard instead of Option + Shift + - when correcting a nonexistent em dash in the OCR. Anyway, since you have pointed out that a community bot should be more formed to do a very specific and large-scale task and not lots of very minor tasks, I think now I agree that now is not the time to give my bot the community bot status. The bot is mostly for my personal use, and the idea is not and was never to impose it on other people. PseudoSkull (talk) 02:52, 24 December 2020 (UTC)
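
As a rough sketch of the safeguard described here, an en dash replacement can be gated on an explicit per-work confirmation, so that it never runs on a text where en dashes may be legitimate. The function and parameter names are hypothetical and this is not the bot's actual code.

```python
EN_DASH = "\u2013"
EM_DASH = "\u2014"


def fix_dashes(text, work_has_no_en_dashes=False):
    """Replace en dashes with em dashes, but only for a confirmed work.

    The flag defaults to False so that a work is left untouched unless
    someone has checked that en dashes never legitimately occur in it.
    """
    if not work_has_no_en_dashes:
        return text
    return text.replace(EN_DASH, EM_DASH)


print(fix_dashes("I've been\u2013I may as well confess", work_has_no_en_dashes=True))
```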

Deleted redirect

I have asked User:Beleg Tâl to restore a redirect, The Condor/2 (2)/Prominent Californian Ornithologists. III. A. M. Shields, which they recently deleted, on the false premise that it was unused. It is not, but they have declined to restore it. Can someone please do so? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 14:36, 14 December 2020 (UTC)

Modernising numbers of old style biographical templates, removing project disclaimers

We have a number of older templates for some biographical works that utilise an older style header formatting that put the wikipedia link into the notes field, rather than paired with the other interwiki, and have parameters like other_projects and wikipedia2. They also include a "disclaimer".

Examples of these templates:

Full list via [12]

These have elements that precede provision of Wikidata, and change to the standard linking that we provide the {{plain sister}}.

They are not all utilising newer header fields that we have incorporated, eg. contributor

There was an earlier discussion that took place at Wikisource:Proposed deletions/Archives/2017#Project_disclaimers about the disclaimers that failed to reach a consensus.

I wish to update and standardise these templates to modern form of formatting and fields. — billinghurst sDrewth 05:52, 20 December 2020 (UTC)

 Support —Beleg Tâl (talk) 18:48, 23 December 2020 (UTC)
 Support Inductiveloadtalk/contribs 11:21, 24 December 2020 (UTC)
 Support --Jan Kameníček (talk) 13:33, 24 December 2020 (UTC)
 Support --Xover (talk) 22:31, 24 December 2020 (UTC)

Sentence case for titles...?

...why? Book titles are supposed to be title case, are they not? I thought this was just undoubted consensus as the grammatically correct way to portray titles of works. It's been bothering me for a long time that I see so many titles of books here using sentence form, and I thought these were some errors due to misunderstanding of grammar on part of the transcribers, until I read through some of the style guide again and figured out it actually says we prefer that, "unless an original capitalisation is consistently used". The quote I used makes it even stranger. I don't think I've ever seen a book being referred to in its sentence case, unless the writer was clearly being ungrammatical throughout his writing maybe, and certainly I've never seen anything but title case used for any of the books I've transcribed for Wikisource. Is this something that makes books here easier to sort through and more easily searchable, or is sentence case for titles a writing convention that was used for books far older than the ones I'm reading? PseudoSkull (talk) 18:25, 23 December 2020 (UTC)

@PseudoSkull: It's a library convention (from the Anglo-American Cataloguing Rules, Appendix A, which goes into excruciating detail) that all words except proper nouns are lowercase.
Although the Wikisource:Style Guide technically says we use it, very very many works use Title Case. With title case, there are still ambiguities like "do you capitalise "A" and "To"?". Inductiveloadtalk/contribs 18:42, 23 December 2020 (UTC)
@PseudoSkull: Either replicate what is in the work or what is described by Inductiveload, and do a redirect from one to the other. Be generous in redirects and do not get caught up in the changing styles of conventions. What modern publishers use now is not what publishers or authors used then, so there is no perfect solution, so be flexible and cover them all. — billinghurst sDrewth 23:49, 23 December 2020 (UTC)
One of the big arguments against sentence case is that "A Rose is a Rose" could arguably be, in sentence case, "A Rose is a Rose" if one of the characters is named Rose, and I understand there are real life examples where a book title was deliberately ambiguous.--Prosfilaes (talk) 00:50, 24 December 2020 (UTC)
In my opinion fussing too much over page titles is not necessary. They're basically just unique page labels, they could be integers (actually page IDs are, internally), they can be moved. The default title size on page gives them undue perception of importance. The important thing is they're clear enough to be useful for maintenance and browsing. Attempting to capture all the nuances would basically lead to replicating 38 pages (yes, really) of the AACR Appendix A: Capitalization rules as Wikisource policy and would just be a waste of mental cycles. Inductiveloadtalk/contribs 08:31, 24 December 2020 (UTC)
The problem with title case is that it is often not really clear what should be capitalized: e. g. w:Title case mentions three different styles (start case, AP Stylebook, Chicago Manual of Style), but I have already met even more. Some of these styles are also not quite clear, lacking e.g. definition of a "principal" word. I personally always use title case when the work itself uses it either in its title page or anywhere inside the work where the work’s title is mentioned (e.g. cover, halftitle page, colophon, foreword, page headers…) and keep its style. If the work does not manifest which title-case style should be used, we can either choose our way of its title-casing or stick to the sentence case, which is usually clear (with very rare exceptions like the one mentioned above by Prosfilaes). --Jan Kameníček (talk) 13:58, 24 December 2020 (UTC)

Is there a standardized way for the index page for a newspaper at Wikisource to appear?

We have these very different styles:

Is there one to harmonize on, or a way that we can have each type so the reader can choose what they prefer? --RAN (talk) 01:53, 25 December 2020 (UTC)

Magazines for scan backing

I've been working on a lot of magazines recently, and I'd like to make a list of the more notable ones that I should pull from IA or HathiTrust when I get a chance. A lot of this information may be on Wiki; e.g. Author:Zona Gale says that certain volumes of The Smart Set, Everybody's, Sunday Magazine, Harper's Magazine, American Magazine, The Century Magazine, and Harper's Monthly Magazine would scan-back existing works. I know about Lovecraft and Author:Robert Ervin Howard; I am working on 1925 Weird Tales, but not any post-1925 pulps at this time. Any other suggestions of authors or magazines that I should look for for works that need scan backing on Wikisource?--Prosfilaes (talk) 10:27, 25 December 2020 (UTC)

Dropinitial: paragraph is gone despite a blank line

1, 2, 3. But it works with two paragraphs: 4. Caused by recent changes in {{Dropinitial}}. --Ratte (talk) 12:19, 26 December 2020 (UTC)

@Ratte: I reverted the problematic change. It was intended to fix drop initials in an indented container, but it's not that simple (and might not be possible). Thanks for reporting. Inductiveloadtalk/contribs 12:42, 26 December 2020 (UTC)
Ok, thanks. Ratte (talk) 12:47, 26 December 2020 (UTC)

Bug in text file export?

In proof-reading Anne of Green Gables, I found some occasional issues with whitespace in exported text files. The general idea is that the core data is fine when viewed/edited using the normal means on Wikisource, but the exported text is not. This means that no fix can be applied by normal editing.

It seems like there's a bug in the text export code. For a list of instances of this problem, see the listing here, where items are marked as exp (for export). The steps I took to generate the text file export are described here.

The issues result in extra spaces, or a problem with a paragraph break.

Some of the issues occur near page breaks in the underlying book, where a page with a captioned picture occurs. John O'Hanley (talk) 20:01, 28 December 2020 (UTC)

@John O'Hanley: I flicked through the "exp" labelled lines. From what I can see, the whitespace errors do not come from the production technique; they come from how a person compiled the work. ProofreadPage transclusion inserts a space between pages, so the errors we made are things like not including the apostrophe inside {{hwe}} when the hyphenated word spans a page, or not using exclude = when transcluding to skip blank pages (and their added space). Some of the whitespace issues that you reported seem to have been fixed by another user in mid-Dec. @Samwilson: can we look for the generation of a string of (soft) spaces and remove duplicates?
FYI - I have added exclude attributes for several other blank pages. John O'Hanley (talk) 15:11, 29 December 2020 (UTC)
I don't see the paragraph break issues when I quickly throw into PDF documents, and none of them are at page breaks so that sounds weird and I ask that you check those reported again. [I edited four Page: ns, and amended two transclusions in main ns]
Thanks for your report. — billinghurst sDrewth
Addendum, YES, I can regenerate the paragraph issue. It seems that they fail where the previous paragraph is incomplete through some sort of break, e.g. interrupted by an image => example https://wsexport.wmflabs.org/?lang=en&page=Anne+of+Green+Gables+(1908)/Chapter+XXXVIII&format=txt&fonts=&images=false and compare with last page of text Page:Anneofgreengables-rbsc.djvu/457 — billinghurst sDrewth 05:29, 29 December 2020 (UTC)
I can see what is happening here with the wikitext. When transcluded paragraphs seem to generate </p><p> to start a paragraph, in these examples I see that the new paragraph just starts with a <p> with no prior termination. I am unable to determine what is happening with the prior components generation of the wikitext to cause that difference. Best we can do is force the issue to indicate that it is a new paragraph with something like a {{nopt}}. — billinghurst sDrewth 22:31, 29 December 2020 (UTC)
How would I do that workaround? John O'Hanley (talk) 23:16, 29 December 2020 (UTC)
My theory is that 1) last line of a page and first line of next page are enclosed in <p> ... page numbering span etc. ... but no divs... </p> 2) since there is an image between the two "text" pages, a div is inserted, making the <p> of the page before and the </p> of the page after, orphans, so they are stripped. And the poor first line after the image page is broken. Mpaa (talk) 01:17, 30 December 2020 (UTC)
That is
<p>"I forgave you that day by the pond landing,&#32;<span><span class="pagenum ws-pagenum" id="428" data-page-number="428" data-page-name="Page:Anneofgreengables-rbsc.djvu/454" data-page-index="454" title="Page:Anneofgreengables-rbsc.djvu/454"><span id="pageindex_454" class="pagenum-inner">&#8203;</span></span></span>although I didn't know it. What a stubborn little goose I was. I've been—I may as well make a complete confession—I've been sorry ever since."
</p><p>"We are
As suggested above, I added a {{nop}} and it works, but one must be careful not to leave blank lines in between otherwise an extra <br /> is added. It works but easy to get it wrong.Mpaa (talk) 16:17, 30 December 2020 (UTC)
Imitating your example, I have applied the {{nop}} workaround to the remaining 3 cases. You may want to verify the changes. Thanks to everyone for your help in this regard. Well done. John O'Hanley (talk) 18:08, 30 December 2020 (UTC)


I see some zero-width spaces in the text export of Anne of Green Gables (1908)/Chapter XIV. Is that intentional? This comes from the space between just and found:
  • U+0074 : LATIN SMALL LETTER T
  • U+0020 : SPACE [SP]
  • U+200B : ZERO WIDTH SPACE [ZWSP]
  • U+0066 : LATIN SMALL LETTER F
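
A listing like the one above can be reproduced with a few lines of Python from the standard library. This is a minimal sketch and not the tool used for the report; the helper name `describe` is made up for illustration.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX : NAME' string per character in text."""
    return [
        f"U+{ord(ch):04X} : {unicodedata.name(ch, 'UNNAMED')}"
        for ch in text
    ]


# The sequence reported above: "t", a space, a zero-width space, "f".
for line in describe("t \u200bf"):
    print(line)
```

Running it on a suspicious stretch of exported text makes invisible characters such as U+200B easy to spot.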

I have retested today, and updated the listing. Many items are now fixed, but not all. There are 7 items remaining, 3 with a ZERO WIDTH SPACE [ZWSP], and 4 with a missing paragraph-break issue (as noted above by billinghurst). John O'Hanley (talk) 16:56, 29 December 2020 (UTC)

I forgot to commit my changes to github - the listing should be updated in a moment... John O'Hanley (talk) 18:09, 29 December 2020 (UTC)

There's something wacky with this page. The image of the page is a mismatch for the text (wrong page). This page has one of those ZWSP issues. John O'Hanley (talk) 17:58, 29 December 2020 (UTC)

The thumb is frozen on the old page; if you put any number other than 1024 here, it displays the right page: https://commons.wikimedia.org/w/thumb.php?f=Anneofgreengables-rbsc.djvu&w=1024&p=164 I have not been able to refresh it. Mpaa (talk) 21:20, 29 December 2020 (UTC)
@John O'Hanley: we are typically visual checkers of text, and really until now that has not been identified as an issue unless we have the occasional link and template issue. Is it a deal breaker with the exported text, or is it more that you are seeing these issues. I would say that we have similar issues in many places due to OCR'd text and visual checking and fully unaware.
I understand. I appreciate that you are interested mainly in the visual appearance. But if you publish text exports, you don't have any control over other use cases for the text. It would be nice if the text output was a bit cleaner. It's not a serious issue for me at all, but it would, I think, be a nice improvement to your site. Perhaps a filter of some sort could be done on the text output? I can't see any reason not to strip out those ZWSP characters, for example. That would be simple to implement, no? Similarly, I can't see any reason not to change double-spaces into single-spaces. Just strip them out; you wouldn't need to waste time tracking down the exact cause... The issue with missing paragraph breaks is a more important defect, though, and should be addressed. John O'Hanley (talk) 23:08, 29 December 2020 (UTC)
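
The kind of filter suggested here could look roughly like the sketch below, applied by a reader to a downloaded text file. It is not part of the export tool, and the function name `clean_export` is hypothetical. It assumes that runs of plain spaces are never meaningful in the text, which may not hold for works that use spaces for layout (verse, tables).

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE


def clean_export(text):
    """Strip zero-width spaces, then collapse runs of plain spaces."""
    # Remove ZWSP first, so that "space, ZWSP, space" becomes a run of two
    # spaces that the next step can collapse.
    text = text.replace(ZWSP, "")
    # Only U+0020 runs are collapsed; newlines and paragraph breaks stay.
    return re.sub(r" {2,}", " ", text)


print(clean_export("I just \u200bfound it.  Really."))
```

This does nothing for the missing paragraph breaks, which have to be fixed at the source.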

 Comment re "dinner-time", "dinner time" and "dinnertime". We proofread against what is there, not what dictionaries show. There will even be variations within a work, and I noticed in that work two variations of the same word with regard to hyphenation, so not overly fussed with that one. — billinghurst sDrewth 22:05, 29 December 2020 (UTC)

I see ZWSP in the text. I use this tool. John O'Hanley (talk) 23:13, 29 December 2020 (UTC)
I see a ZWSP for each page numbering span, most likely as they are inside a valid span they are considered text and exported. I agree that stripping them in the export tool would be the easiest thing to do. E.g.
the pond landing,&#32;<span><span class="pagenum ws-pagenum" id="428" data-page-number="428" data-page-name="Page:Anneofgreengables-rbsc.djvu/454" data-page-index="454" title="Page:Anneofgreengables-rbsc.djvu/454"><span id="pageindex_454" class="pagenum-inner">&#8203;</span></span></span>although I
Mpaa (talk) 21:20, 29 December 2020 (UTC)
If we don't want the pagenumber spans to go to export at all, we can just add ws-noexport to the inner span, and the export tool will drop it, leaving an empty outer span. Or drop the whole thing with ws-noexport in the outer span.
The ZWSP was inserted after a discussion in 2019. @Xover: is it still something that belongs in MediaWiki:Proofreadpage pagenum template and not in the JS? Inductiveloadtalk/contribs 09:02, 30 December 2020 (UTC)
@Inductiveload: In the message you just edit-conflicted—:)—I was going to suggest ws-noexport'ing the whole page number span. So far as I can tell we neither need nor want it in the output; but ebook export is not something I've looked at, which was why I was going to ping you for input on that. :)
The zero-width spaces are still needed in the template, and for the same reasons. But now that we can more easily hack the script we can prevent it from immediately overwriting them (grr!). There are some possible approaches that might obviate the need which we could explore, but that's a ways down the line. --Xover (talk) 10:52, 30 December 2020 (UTC)
@Xover:, OK I ws-noexport'd the entire thing. There's no useful text content there anyway at export. If we can think of some useful way to use the pagenum span in export one day, we can revisit it easily enough. Inductiveloadtalk/contribs 11:01, 30 December 2020 (UTC)
Confirmed: I see no more ZWSP issues. Thank you! John O'Hanley (talk) 14:32, 30 December 2020 (UTC)
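
As a rough illustration of the text filter suggested earlier in this thread (strip the zero-width spaces, turn double spaces into single spaces), here is a minimal Python sketch. It is not the WS-Export implementation, which was fixed by marking the page number span ws-noexport instead; the function name `clean_export_text` and the sample string are made up for the example.

```python
import re

# Zero-width space, which appears as &#8203; in the page number span quoted above.
ZWSP = "\u200b"


def clean_export_text(text: str) -> str:
    """Hypothetical post-export filter: drop ZWSP and collapse repeated spaces."""
    # Remove every zero-width space outright; it carries no visible content.
    text = text.replace(ZWSP, "")
    # Collapse runs of two or more ordinary spaces into one.
    # Newlines are not touched, so paragraph breaks are preserved.
    return re.sub(r" {2,}", " ", text)


sample = "the pond landing, \u200balthough  I"
print(clean_export_text(sample))
```

A filter like this only cleans plain text after export; it would not address the missing paragraph breaks mentioned above.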

Template:Engine to index pages by default

A short time ago I learnt about an excellent template, {{Engine}}, thanks to billinghurst’s contribution in a discussion above, and about the possibility of adding it to an index page. This proved so useful to me that I would like to suggest making it accessible on all index pages by default. Is it technically possible? --Jan Kameníček (talk) 14:20, 30 December 2020 (UTC)

I would prefer not to. Primarily it only works after the text has been added to the pages, so of no value to a new work. Secondarily, with compiled works we make it more general—multiple Index:, rather than specific to the Index:—and it is in those multi-volume works that it has best value. If you think that there is value to it, then possibly we can mention it on the page that discusses building an Index: page, and give some instruction about it. — billinghurst sDrewth 22:40, 30 December 2020 (UTC)

Deblacklisting YouTube, Amazon, eBay etc. for autoconfirmed users

It'd be really handy to be able to show at Wikisource:WikiProject Film/Not uploaded to Commons where an encode of a film is available to download/buy, as a reminder for users (such as myself) to get the film from that source before it goes out of stock or gets deleted, and rip it for Wikimedia Commons. But the filters on Wikisource prevent all of these types of links for any user except administrators apparently. But I am an autoconfirmed user, and this filter I presume is mostly protecting against spammers, who would almost certainly not be autoconfirmed. As what I am doing is not spam, could you please lift these filters for autoconfirmed users, so I do not have to avoid the filters by only including video IDs or the ends of Amazon links? PseudoSkull (talk) 13:08, 27 December 2020 (UTC)

I (as an administrator) tried to add the link you had a problem with (for The Primitive Man) and was also prevented from doing so. Per phab:T36928, it is not possible to allow certain user groups to override the blacklist.
Thoughts on some options:
  1. Ask an admin to manually whitelist all your desired links. This is probably too unwieldy.
  2. Hack your way around the blacklist somehow. I don't know if this would work, and it's probably a bad idea per w:en:WP:BEANS.
  3. Deblacklist all of YouTube. I think this might be fine if we also add an AbuseFilter rule to stop new users posting YouTube links.
  4. Shepherd that bug through to a fix and obtain the relevant user right.
BethNaught (talk) 16:25, 27 December 2020 (UTC)
you could go complain at [13], but they think blacklists are a good thing, and like the "admin may i" gatekeeping. Slowking4Rama's revenge 22:14, 3 January 2021 (UTC)

 Comment

  • Blacklists are blacklists; except where we whitelist, there are no exemptions for any level of editor. Making it simpler to add items to the whitelist is always worth asking about.
  • I would argue that it is not our job to be adding links to commercial products. How would you like us to try and differentiate between your adding commercial links, another editor account adding commercial links, an IP address adding commercial links, and spammers adding commercial links.
  • I see no reason to remove Amazon or eBay from the blacklist, though I can see some argument that YouTube has limited value and could be added to works' talk pages for use as a source within {{textinfo}}. [I will note that it is heavily abused by spambots]
  • Any user is able to utilise search engines to search and find commercial products and hardly needs our help.
  • To Slowking4 -- living a life of snideness must be marvellous. That may be how you would approach the role, but let me say that blacklists save a whole lot of work. special:log/spamblacklist

billinghurst sDrewth 04:50, 4 January 2021 (UTC)

thanks for making the admin case. the notorious commercial pirate UK, and US governments persist in using youtube as a reference. as we saw in The Report of the Iraq Inquiry - Executive Summary, filtering youtube prevented doing actual work on that text. but it is a small price to pay for admin comfort. (that was not snide, that was an accurate reflection of admin attitudes, which you have confirmed.) filters are so aversive and abrupt that even veteran editors are confused, as we saw at Scots wikipedia. Slowking4Rama's revenge 17:53, 4 January 2021 (UTC)
As I said, if you have a case for removing youtube from the blacklist, then present it. Don't give me the sarcasm, the snideness, the bitterness, just be pleasant and present the case. (If those expressions are not your intent, they are how they come across.) At this time, admins have said that we will whitelist addresses, or suspend the blacklist entry as required.

FWIW we have substantially more hits of youtube from spammers than we do from users. And it is far easier to assist in whitelisting than the repeated removal of the spam. Minor inconvenience is a two-way street, and would be identified as part of any conversation about the best means to move forward with removing a domain from the blacklist. — billinghurst sDrewth 01:06, 5 January 2021 (UTC)

Duplication of government works (again)

I noticed this issue before, but it has come to life again. The whistleblower letter on the Trump–Ukraine scandal is now scan-backed twice from two different scans at two different locations—Letter to Chairman Burr and Chairman Schiff, August 12, 2019 (the original) and Trump–Ukraine whistleblower complaint. The files were created at almost the same time, the latter only two hours after the former, but the latter was transcluded nearly one year after the former. The problem with the duplication of the transcript needs to be resolved as well. TE(æ)A,ea. (talk) 23:03, 27 December 2020 (UTC).

Done I have redirected to one work, and deleted the other components. I have also resolved at Commons. Thanks for the alert. — billinghurst sDrewth 01:16, 5 January 2021 (UTC)

Anne of Green Gables

I recently did an independent proof-read of Anne of Green Gables. In doing so, I compared my transcription with yours. It helped me find errors in my text. At the same time, it led me to errors in the wikisource text. Here is the list of issues.

https://johanley.github.io/anne-of-green-gables/index.html#issues-wikisource

Could you get in touch with the original wikisource proof-readers, and let them know?

Second question: the nature of the issues is in many cases repetitive - missing quotes, period-versus-comma. Are we sure this work was actually edited by at least one human? unsigned comment by John O'Hanley (talk) .

Thanks. Do you have the page numbers? Fixing mistakes that a sincere comparison has found would be welcomed here, especially as the Wikisource edition has the scans to compare against. ShakespeareFan00 (talk) 21:05, 21 December 2020 (UTC)

No page numbers. But the listing has chapter number, and the text that starts off the paragraph. In addition, about half the cases have links to the page on archive.org, so that will have the page number. John O'Hanley (talk) 21:10, 21 December 2020 (UTC)

Meant the scan page numbers, because that's what the Page: structure typically uses.
BTW If you find errors, don't feel you need to ask to repair them if what's been nominally validated, doesn't agree with the scan. If you want to mark something that looks like a genuine error in the original printing use {{SIC}} , I've done this for stuff I've proofread. ShakespeareFan00 (talk) 21:23, 21 December 2020 (UTC)
No scan page numbers, no. I can apply repairs if needed. But given the nature and number of the errors, it's best to let you folks know explicitly, in case something unusual has happened with this text. John O'Hanley (talk) 21:38, 21 December 2020 (UTC)


@BethNaught: , You are typically good at finding typos and scan errors others may have overlooked? Your comments? ShakespeareFan00 (talk) 21:07, 21 December 2020 (UTC)

I don't have any comments that someone else wouldn't be able to give. I would only note for John O'Hanley that yes, each page of this book has been checked by at least two Wikisource users. You can see this by clicking the "Source" tab at the top of the work, which takes you to the scan index: pages in green are "validated" i.e. checked by two people. The fact that they didn't catch these errors is unfortunate, but we don't expect people to be perfect. BethNaught (talk) 21:44, 21 December 2020 (UTC)

 Comment To help find the errors in Page: ns, I have added a search box to the Index: page for quick searches. — billinghurst sDrewth 04:22, 22 December 2020 (UTC)

and if they are "repetitive - missing quotes, period-versus-comma" then a find and replace in visual editor should expedite another sweep through. Slowking4Rama's revenge 21:30, 22 December 2020 (UTC)


FYI - Wikisource editors have fixed most of the issues. Updated listing.

thank you for the error reporting. you can also leave notes on the talk page [14] and leave a note at scriptorium, and people will respond. cheers. Slowking4Rama's revenge 01:05, 1 January 2021 (UTC)


Barging in, but regarding the "checked by two people" thing... I did a test on one work several months ago. (I could find the discussion where I mentioned this if wanted) I closely reviewed between 50 and 100 pages that had been gone over very nicely by quite competent people here. However, there was still an error every 10 pages or so. I even found an error *I* had missed. It left me with the feeling there will be lots of errors in the average review done here.

You might see me occasionally peek into other works, not only for possible interest, but also to check 'quality'. I *often* find errors in the first or second page I peek at, even of 'validated' pages. I sometimes end up downright snippy in my edit summaries, when a long series of missed opportunities is found.

Please take this not so much as a criticism of the project, but rather a reflection on one aspect of work here - fidelity to the source. It is not at a point we should be comfortable with. It is that resulting discomfort that certainly motivates me in my review pass. Shenme (talk) 05:10, 10 January 2021 (UTC)

Are you referring to this thread? If not, I'm still pointing it out as I found it interesting.
It's alleged there that "Validations are not being done properly … so the texts are often only little improved after the initial proofreading". Naming no names, but I can sympathise with that. Some validators of texts I proofread go very fast, such that I'm not confident they're checking everything properly. Perhaps some of them are professionals, or skilled enough to justify that—but for some, when I check works they proofread, I see lots of errors.
I think some may have a "quantity over quality" mindset. The appropriate balance can be debated, but the fact that (at least for this book) we've been empirically shown to be worse than DP Canada should give us pause. To illustrate, people sometimes praise me for being a high-quality and thorough proofreader, but I think the truth is just that I go more slowly, double-check easily-scannoed punctuation, and spellcheck the result.
Also, the discussion of The Great Gatsby at Wikisource talk:Proofread of the Month#June (Fiction: Novel) was illuminating to me. I didn't realise people routinely made assumptions about the strengths and weaknesses of the OCR/uncorrected transcription: to me, marking something as proofread or validated meant signing off on every aspect of the page.
In short, I don't think we have clear enough guidance or expectations about what level of checking "proofread" and "validated" require. BethNaught (talk) 11:23, 10 January 2021 (UTC)
PS. I'm not claiming to be perfect, I do sometimes miss things still. BethNaught (talk) 11:23, 10 January 2021 (UTC)
@BethNaught: Help:Page status and Help:Proofread are the likeliest spaces to add further information, where would you want it? 20 cents ... to me it means aligns to Wikisource:style guide

1. all words are correct
2. general formatting reflects the work (hyphenation, italics, dashes, quotations, bold, size, ...)
3. requisite linking is undertaken (eg. q.v. links done). Sometimes works are progressed to proofread without links which is okay, though I would not expect them to be validated without required links.

I also think that we need to look to similar guidance relating to presenting a work/transclusion about what is best practice, and including something about WD items. — billinghurst sDrewth 02:09, 11 January 2021 (UTC)

@Billinghurst: I agree, we specifically need something somewhere with a brief list of expectations, especially for being able to turn a page "yellow/proofread". Quite a few new editors don't know about the bajillions of templates and we end up with things like Page:Five_Irish_comic_songs.pdf/1 going yellow a little prematurely, which means it moves into the "to be validated" queue and loses eyes on it. It's unfair, IMO, to expect newcomers to intuit all the expectations themselves.
WD processes are absolutely also required to be documented somewhere, because they're not obvious at all, and the work/edition ontologies are confusing (at least to me!). I know we have the WE-framework gadget, but it's a bit "secret". Inductiveloadtalk/contribs 09:36, 11 January 2021 (UTC)
WD is "documented" at d:Wikidata:Wikisource, then follows d:Wikidata:Books, though not in an easy access process. — billinghurst sDrewth 10:16, 11 January 2021 (UTC)
and we have Wikisource:Wikidata. AND I am the worst to write such pages, and even to go and read them. :-/ — billinghurst sDrewth 10:19, 11 January 2021 (UTC)