Wikisource:Scriptorium/Help

From Wikisource
The Scriptorium is Wikisource's community discussion page. This subpage is especially designated for requests for help from more experienced Wikisourcers. Feel free to ask questions or leave comments. You may join any current discussion or start a new one. Project members can often be found in the #wikisource IRC channel (a web client is available).

This page is automatically archived by Wikisource-bot



Multiple works on the same topic

I have recently transcribed four works about the forger, William Booth. He is not known to have ever published anything, so does not require an 'Author:' page. How can I group the works together? Would Wikisource policy support a category, or a portal page? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 08:02, 11 August 2020 (UTC)

Wikisource:Portal guidelines suggests that a portal would be appropriate. Moreover, person-based categories are generally not used here. BethNaught (talk) 08:12, 11 August 2020 (UTC)
@Pigsonthewing: definitely portal, you can use either {{person}} or {{portal header}}. Categorise to category:People in portal namespace. We would also link that portal to the person item in WD. — billinghurst sDrewth 23:17, 11 August 2020 (UTC)
Done. Why doesn't {{person}} automatically apply category:People in portal namespace? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 09:06, 12 August 2020 (UTC)
Probably just hasn't had the focus as they were developed separately. {{person}} desperately needs to be converted to be Wikidata native as default. — billinghurst sDrewth 01:23, 20 August 2020 (UTC)

Anchors for Sidenotes...

@Xover:,@Billinghurst:,@Inductiveload:

Affected templates: {{left sidenote}} {{right sidenote}} {{LR sidenote}} {{RL sidenote}}

Page:Ruffhead - The Statutes at Large - vol 2.djvu/52 has sidenotes, and these can be associated with a particular section of the text. Whilst I could add {{anchor}}, it would be sensible if the anchors could be added from the sidenote templates directly, so as to limit the number of templates needed in a long work.

The fix would be to do something like:

  <span {{#if:{{{@|{{{anchor|}}}}}}|id="{{{@|{{{anchor|}}}}}}"}}> ....

in the relevant templates, so that the x.y style numbering suggested previously could be included directly, for this and other works using this template family. Unsigned comment by ShakespeareFan00 (talk).

Interwiki to Translation namespace not appearing

According to https://www.wikidata.org/wiki/Q2898441, the page Translation:Tikunei_Zohar is linked to its source language he:s: page https://he.wikipedia.org/wiki/%D7%AA%D7%99%D7%A7%D7%95%D7%A0%D7%99_%D7%94%D7%96%D7%95%D7%94%D7%A8 However, although the link appears on the he: page, it does not appear on the en: page.

Missing page 97 for Volume 24 of EB1911

As of Feb. 12, 2021: https://en.wikisource.org/wiki/Page%3AEB1911_-_Volume_24.djvu/110 gives page 96. https://en.wikisource.org/wiki/Page:EB1911_-_Volume_24.djvu/111 gives page 98. Suslindisambiguator (talk) 23:28, 12 February 2021 (UTC)

@Suslindisambiguator: The pages on the index are fine. The numbers on "Page:" subpages refer to the page numbers of the source file. 廣九直通車 (talk) 11:38, 14 February 2021 (UTC)
Not really: index 97->98; 98->99; and 99->97. The index needs some cleanup and I'm not sure how to do it. Languageseeker (talk) 03:51, 25 February 2021 (UTC)

Why can't I make Template:Dotted TOC page listing indent?

After I discovered the existence of {{Dotted TOC page listing}}, the template has definitely helped me with my works on Hong Kong laws. However, some of them require indentation of some dotted TOC listings (such as Page:Prevention of Child Pornography Ordinance (Cap. 579).pdf/1, sections 138A and 153Q to 153R). How can I achieve this? I tried everything from the most common method, :, to templates like {{Indent}}, but nothing works. Assistance will be appreciated. 廣九直通車 (talk) 13:25, 13 February 2021 (UTC)

@廣九直通車: The only way I found is to make a separate entry for "14. Section added" and a separate one for "138A. Use, procurement…". The problem is that Dotted TOC page listing does not allow the dots to be switched off, and TOC page listing is a completely different kind of template, not suitable for combining with Dotted TOC page listing within one list. So I tried a workaround, replacing the dots in line 14 with spaces; the result can be seen at my sandbox. It is not ideal; it would be better if the dots in line 14 could simply be switched off. --Jan Kameníček (talk) 14:03, 13 February 2021 (UTC)
@廣九直通車: I really cannot emphasise enough how strongly I recommend to not use {{Dotted TOC page listing}}. From a technical perspective it is a very bad solution, that creates problems (like the one you bring up here), and is really not needed. The dot leaders are purely stylistic, often exist in a very paper-page specific form, and are better omitted even if they could be added in a technically sound way. This is an issue that's up to each contributor to decide, of course, but I really strongly urge everyone to not use {{Dotted TOC page listing}} except in some hypothetical special case where having the dot leaders is extra special important. --Xover (talk) 14:25, 13 February 2021 (UTC)
Well, they are not only stylistic, their main purpose is to help the eye keep the line, and until somebody finds a better way of generating dot lines (e. g. something as easy as generating dynamic dot lines in MS Word documents), this template is the best thing we have in this regard. However, I admit that they are used also for stylistic reasons, but this stylistics is imo very important too. Without dots we would have to help the eye e.g. by creating bordered tables or something, which would interfere with the style of Contents pages of books (especially old books) too much. --Jan Kameníček (talk) 17:35, 13 February 2021 (UTC)
{{dtpl}} is specifically annoying because it produces a whole, separate, table for every single row. As well as being semantically kind of messed up, this also means it generally doesn't work brilliantly on export.
In this case, I'd say {{TOC begin}} probably produces the easiest result:

Example

{{TOC begin}}
{{TOC row 1-dot-1|spaces=0
 | 1.
 | Short title and commencement
 | A1387}}
{{TOC row 1-dot-1|spaces=0
 | 2. 
 | Interpretation
 | A1387}}
{{TOC row c|3|'''Amendments to Crimes Ordinance'''}}
{{TOC row 1-1-1
 | 14.
 |Section added}}
{{TOC row 1-dot-1|spaces=0
 | 
 | 138A. Use, procurement or offer of persons under 18 for making pornography or for live pornographic performances
 | A1405}}
{{TOC row 1-dot-1|spaces=0 
 | 15.
 | Conviction for offence other than that charged
 | A1407}}
{{TOC end}}
1.   Short title and commencement .......... A1387
2.   Interpretation .......... A1387
Amendments to Crimes Ordinance
14.  Section added
     138A. Use, procurement or offer of persons under 18 for making pornography or for live pornographic performances .......... A1405
15.  Conviction for offence other than that charged .......... A1407
The dot markup is still "messy" at the HTML level, but it's functional in-browser and removed on export (since very very few readers can deal with it). It might be worth investigating if we can export them only in PDF (since the PDF renderer probably can handle them).
{{TOC begin}} also handles things like vertical alignment and setting text wrapping by default. Inductiveloadtalk/contribs 19:38, 13 February 2021 (UTC)
After some cleanup, I found Inductiveload's solution is very suitable for the case: I just replaced the space between the indented TOC section number and the subtitle with {{Gap|1em}}. Definitely feels good!
By the way, what's the issue with HTML? I think previously, when Billinghurst told me not to use <center></center> here, he mentioned similar reasons. 廣九直通車 (talk) 08:37, 14 February 2021 (UTC)
@廣九直通車: Great work! But is there any particular reason you have not marked the pages as Proofread? See Help:Page status for details. --Xover (talk) 10:06, 14 February 2021 (UTC)
Aren't the pages for other users to proofread? I'm only the guy who imports them into Wikisource... 廣九直通車 (talk) 11:36, 14 February 2021 (UTC)
@廣九直通車: "Importing" isn't really a concept in our model. Transcribing and formatting a page is what we call "proofreading". Proofreading can happen iteratively and collaboratively, but once a page is presumed to be "finished" (and thus ready to be transcluded for presentation) it should be marked as "Proofread". At that point it should be double-checked by a second person, and once that's done it is marked as "Validated". When you mark a page as "Not proofread" you're saying there is more work to be done before it's ready, and it should generally not be transcluded for presentation in mainspace. By my cursory look you have finished proofreading the pages and should mark them as "Proofread". I could be wrong, of course, as I only took a quick look. --Xover (talk) 13:44, 14 February 2021 (UTC)
@Jan.Kamenicek: I'll concede your point about dot leaders sometimes being used to help the eye track, but most of the time (in my experience) they are used purely stylistically and sometimes to the outright detriment of readability. And in either case I think the technical disadvantages of {{dtpl}} outweigh any benefit of the dot leaders in all but the most exceptional of cases. I very much wish the draft specification for dot leaders in CSS would materialise soon, but absent that we have no good ways to reproduce them, only various degrees of bad ways, and {{dtpl}} is the worst of the bunch. Please avoid using it whenever possible (I absolutely guarantee that at some point down the line, some poor schmuck is going to have to go through and redo every single page we use {{dtpl}} on, and it's already a bear of a task with what we have so far). --Xover (talk) 13:51, 14 February 2021 (UTC)
OK, many thanks for everyone's assistance!廣九直通車 (talk) 13:59, 14 February 2021 (UTC)

Corruption in main namespace page?

There is some corruption on the top of this main namespace page and I have no clue what caused it. Can someone please look at it. Thanks.— Ineuw (talk) 07:30, 14 February 2021 (UTC)

@Ineuw: I excluded the empty page and now it looks OK. --Jan Kameníček (talk) 08:09, 14 February 2021 (UTC)
@Jan.Kamenicek: Thanks, but the page error is still showing this morning, even though I deleted and recreated the page. This is what it looks like on the recreated page. I copied the contents to this Sandbox where it shows OK. The error disappears when I am logged out. So, I deleted the browser cookies of all wikis, logged back in and the problem re-appeared.— Ineuw (talk) 15:21, 14 February 2021 (UTC)
I am not experiencing this problem in my browsers, either logged in or logged out, trying Chrome, Firefox and Edge, so it must be something related to your settings… Could Xover give some advice? --Jan Kameníček (talk) 17:29, 14 February 2021 (UTC)
Since the last post, it has also disappeared while I was logged in. I mostly use Firefox, and then Vivaldi as a Chromium substitute, but looked there too late. Thanks. — Ineuw (talk) 18:20, 14 February 2021 (UTC)
(e/c) Not happening for me either. However, there was something screwy with the pagelist command, which I've corrected in accordance with the policy on these. Beeswaxcandle (talk) 18:22, 14 February 2021 (UTC)
@Beeswaxcandle: Is there a policy that limits pagelist layout to numbers but not alpha letters? What about roman letters? I have done many index pages with indicating the chapters, sections, images, etc. I need it! Especially, when another policy bars me from creating wiki links to the page namespace in a book's table of contents.
In my opinion, if something is technically feasible it should be allowed. Otherwise remove the feasibility. Community policies barring the use of features are the same as telling developers not to use one programming language vs. another. Let's see how that works for Wikimedia. If I am not allowed to organize and identify the data my way to ease my work, then I don't see how I can contribute at the level I have been contributing.— Ineuw (talk) 19:07, 14 February 2021 (UTC)
@Ineuw, @Beeswaxcandle:, there is nothing technically wrong with that syntax AFAICT, it's a string like any other. It would be good to be able to have an extra "label" that doesn't affect the numbering for the purpose of splitting up the page list and making it easier to navigate during proofreading: so I created phab:T274740. Inductiveloadtalk/contribs 20:18, 14 February 2021 (UTC)
Any page that is part of a numbered flow needs to be labelled as that number. Whatever is put into the pagelist command is an anchor when the page is transcluded into the mainspace. If a page is labelled as something other than its number, then it cannot be linked to in a standardised way from other works. Pages that are interpolated (images, plates, and the like) can be labelled with what they are. In terms of roman numerals as opposed to arabic, yes, these can be done. It is because the pagelist command accepts strings, that the policy at Help:Index pages#Pages was developed to ensure that we have a standardised approach to implementing the pagelist command. Beeswaxcandle (talk) 08:40, 16 February 2021 (UTC)
@Beeswaxcandle: Yes, I understand that, which is why I'm suggesting an extra "label" that doesn't affect the numbering. Something like this:
[Image: ProofreadPage pagelist with extra label.png]
Such a thing is a pretty trivial gadget, but would be easier if it could be part of the tag markup. Inductiveloadtalk/contribs 09:10, 16 February 2021 (UTC)
Such labels would also assist with the situation of multiple works in a single pagelist? Currently some of these end up as separate pagelists, which confuses some gadgets. 88.97.96.89 09:21, 16 February 2021 (UTC)
@88.97.96.89: Yes. My personal use case is delineating issues in a periodical or other collective work, which are otherwise really hard to work out. E.g. find the start of the December issue in Index:Notes and Queries - Series 5 - Volume 10.djvu, or things like Index:Parliamentary Papers - 1857 Sess. 2 - Volume 43.pdf. Inductiveloadtalk/contribs 09:28, 16 February 2021 (UTC)

Comment: It has long been possible and acceptable to have multiple <pagelist> tags where that adds value to the Index, and it doesn't break transclusions. I did it at index:Men-at-the-Bar.djvu years ago, as it is a work that has been dipped into and out of, and has no ToC, nor is it easy to work out where you are in the work. It wouldn't be normal to do it where there is a ToC, as it becomes superfluous. — billinghurst sDrewth 11:50, 16 February 2021 (UTC)

@Billinghurst: Thanks for the example of multiple pagelists. But, please keep in mind that what is superfluous because of your knowledge and experience, may not apply to others.— Ineuw (talk) 18:41, 17 February 2021 (UTC)
Hey, what? If a transcluded ToC on an index page says chapter 1 starts at 53, then it starts at 53. Not typically a hard concept for anyone transcribing or transcluding. I didn't say don't do it, I just said it is generally superfluous if there is a ToC. — billinghurst sDrewth 06:27, 18 February 2021 (UTC)
I suspect we are talking about two different things. I find your solution of using multiple pagelists in a single index page to be an excellent solution for my problem and I am working on it now.— Ineuw (talk) 07:47, 18 February 2021 (UTC)

Comment/plug: The User:Inductiveload/index preview script can help here since you can see a page preview without leaving the index page. Actually, so can User:Inductiveload/popups reloaded.js, but that's still WIP. Inductiveloadtalk/contribs 08:01, 18 February 2021 (UTC)

Corruption in main namespace page? Redux

I have a simple question. All Index files' pagelists have characters in them. In fact, all Index pages I edited have text in the pagelist, and I never heard about this being an issue. The problem only cropped up in one Index page. I am continuing to indicate pages as before, and no problems. Could it be that the page is damaged? Otherwise, what changed? — Ineuw (talk) 01:02, 3 March 2021 (UTC)

Bug in Text Downloading function

I found that the blue "Download" button on the upper-right-hand corner cannot process Chinese text, instead outputting replacement characters. Can somebody report the stuff to Phabricator? Many thanks.廣九直通車 (talk) 10:33, 17 February 2021 (UTC)

@廣九直通車: have you got a specific example? Inductiveloadtalk/contribs 10:35, 17 February 2021 (UTC)
@Inductiveload:Please try to download any of the transcribed pages on Category:Laws of Hong Kong (like this), or precisely, try to download any texts hosted on Chinese Wikisource (like this). I actually suspect that the problem is not limited to English Wikisource.廣九直通車 (talk) 10:39, 17 February 2021 (UTC)
Ah, I see, it's only in the PDFs - I was looking at the EPUBs. Inductiveloadtalk/contribs 10:41, 17 February 2021 (UTC)
Also, it seems that the bug affects Japanese and Korean as well, as reflected in those (totally scrambled) downloaded Chinese texts. Even the Japanese and Korean disambiguation are affected.廣九直通車 (talk) 10:42, 17 February 2021 (UTC)
@廣九直通車: reported at phab:T274997. CJK is generally a whole "thing", so it's unsurprising that Japanese and Korean are also affected. Inductiveloadtalk/contribs 10:50, 17 February 2021 (UTC)

Bot to replace long s with {{ls}}

I've noticed that there are numerous texts that contain a long s "ſ". I was wondering if it would be possible to create a bot to replace every long s "ſ" with {{ls}}. This would enable users to toggle the view between "ſ" and s. It would also help to make the texts more compatible with no drawbacks. Perhaps we could limit the bot to just validated texts, in case {{ls}} makes proofreading more challenging. Languageseeker (talk) 17:36, 22 February 2021 (UTC)
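For illustration only, the core substitution such a bot would perform could look roughly like the sketch below. This is not an existing bot: the function name `replace_long_s` is made up here, and a real run would also have to leave alone places where the literal character is wanted (for example inside nowiki sections or file names), which this sketch does not attempt.

```python
LONG_S = "\u017f"  # ſ, LATIN SMALL LETTER LONG S


def replace_long_s(wikitext: str) -> str:
    """Replace every literal long s with a call to the {{ls}} template.

    Hypothetical helper: it does a plain string substitution and knows
    nothing about wikitext structure.
    """
    return wikitext.replace(LONG_S, "{{ls}}")


sample = "A \u017fen\u017fe of lo\u017fs"
print(replace_long_s(sample))
```

Text without a long s passes through unchanged, so running the substitution twice over the same page would be harmless.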

  • Whether or not to reproduce long s is not a settled matter and is currently treated as a matter of discretion for the initial proofreader of a work. In general, we don't mess with their formatting provided that it's consistent with policy and they've completed the work with that formatting.
  • {{ls}} is certainly not drawback-free. The switching script is a user script that's incompatible with one of our gadgets.
So I don't think this would be appropriate.
Also, since this would be a substantial change, if you want to pursue this, it should be at WS:S, not this subpage. BethNaught (talk) 19:11, 22 February 2021 (UTC)
Gosh, is that still used? I thought it was turned off years ago; I didn't realise it was still advertised at {{ls}}. It might be fixable now I know kind of what I'm doing :-s. Inductiveloadtalk/contribs 19:24, 22 February 2021 (UTC)
I think that it's an important script. Can you fix it please. Languageseeker (talk) 21:45, 22 February 2021 (UTC)
OK, I have fixed it up a bit and it seems to work. Whatever was the problem with the alt-index gadget seems to be resolved. Perhaps this should be a Gadget? Inductiveloadtalk/contribs 09:53, 23 February 2021 (UTC)
I am against the forced use of {{ls}}. We have works where the long-s has some significance and works where its orthographic reproduction is entirely superfluous. As for proofreading, my experience is that any differences from standard modern English are a challenge for proofreaders. This includes ligatures such as æ, diacritics such as in é or ï, and archaic hyphenations as in "to-day". We have no mechanism for turning these off. Long-s is just one of many issues that proofreaders have to train their eyes for. --EncycloPetey (talk) 16:40, 23 February 2021 (UTC)
I am for using long s but paste a ſ when proofreading. I think there should be an annotated copy -sans ſ- produced when the book is transcluded for those who find it difficult to read with ſ. Old style script is actually part of the charm of reading old books so should be preserved, I think. Zoeannl (talk) 21:49, 23 February 2021 (UTC)
Thank you for all the thoughtful feedback. @EncycloPetey I'm not advocating for forcing proofreaders to use the long s, but I think that if a text already has a long s, it might be useful to have a bot that will automatically replace it with {{ls}} in the posted text. Then, readers can turn it on or off. Some enjoy the long s, others find it distracting. Languageseeker (talk) 02:24, 24 February 2021 (UTC)

@BethNaught: I would disagree that this is not a settled matter, I believe that it is a well-settled matter. The discussion was had, and the decision was to use a standard s, or to have {{long s}}, so that users could have their long s in page: and we had a modern characterisation in main. The documentation is pretty certain about the approach. For many years we used to fix this up through patrolling and stopping the use of long s in works, though it seems that today's patrollers have not been as stringent.

With regards to bots, where I see (stumble upon) works with long s in use, I do in fact replace them using my bot. I don't go hunting them specifically. — billinghurst sDrewth 03:41, 24 February 2021 (UTC)

@Billinghurst: Believe it or not, I did search the archives before making my assertion, and I could find no such definitive discussion about never showing long s in mainspace. Could you actually point us to it, rather than merely asserting it exists? Indeed, while I found several discussions in the past decade, none were definitive. For example: this one, where a variety of opinions were expressed, including that long s should be displayed in the primary version; this one, similarly; this one where you yourself describe long s as a preference.
Additionally, where is the "pretty certain" documentation? The style guide does not mention long s; Wikisource:Style guide/Orthography does talk about long s, and while it discourages literal long s, there is no language forbidding it nor mandating {{long s}}.
It seems to me that on a project where a lot of norms are uncodified, and latitude is given for contributors' editorial discretion, to an extent practice is policy. And given that we have a featured text with literal long s, you can't rely on historical patrolling practice to counter modern patrolling practice.
I disagree that it is appropriate for you to mass-replace long s with a bot, at least in a completed work where it is consistently used. BethNaught (talk) 10:46, 24 February 2021 (UTC)
@BethNaught: Orthography says don't use long s. It is not wanted in main namespace. Template:Long s was designed to allow for those who wanted long s in the page namespace to represent the work there, yet displaying properly in main ns. So _orthography_ is meant to allow users to use long s template, or use the standard letter s and not have to reproduce long s in the text as printed.

No, the guidance doesn't ban it, and as I said elsewhere in the past couple of days, there are works where long s is specifically added, as in a more modern work posing as an older work, for example Manners and customs of ye Englyshe; so banning it would defeat those needing to display it as required. Similarly there is some German use, and also old text reproduced in works where we display it as reproduced. The conversations about the use of ligatures and long s were also part of the search discussions: long s texts do not match plain English searches, so works are lost. — billinghurst sDrewth 11:26, 24 February 2021 (UTC)

So you can't point to an actual discussion deciding in favour of your position? I'm looking for the authority of the community, not the dictum of an elder statesman. BethNaught (talk) 12:15, 24 February 2021 (UTC)
  • I strongly oppose this proposal. In addition, if this is a request for a bot to accomplish that task, it needs approval elsewhere. TE(æ)A,ea. (talk) 14:42, 24 February 2021 (UTC).
Comment: By the way, the internal CirrusSearch search engine and Google, Bing and Yandex treat "ſenſe" the same as "sense" (and vice versa). On the other hand, Yahoo and DuckDuckGo do not "normalise" long-s. So ſ at least doesn't totally torpedo searchability any more. Inductiveloadtalk/contribs 09:22, 25 February 2021 (UTC)
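As a small illustration of what "normalising" long s can mean, here is a sketch using Python's standard library. It is not a description of how CirrusSearch or any of the search engines above work internally; it only shows that Unicode itself treats ſ (U+017F) as a compatibility variant of s, so a generic normalisation step is enough to make the two spellings compare equal. The function name `search_key` is an assumption made for this example.

```python
import unicodedata


def search_key(text: str) -> str:
    """Build a comparison key in which long s and plain s are the same.

    NFKC applies Unicode compatibility mappings (which include
    U+017F -> 's'); casefold() then removes case differences.
    """
    return unicodedata.normalize("NFKC", text).casefold()


print(search_key("\u017fen\u017fe") == search_key("Sense"))
```

A search index that stores and queries on such a key would find "ſenſe" for the query "sense" and the other way round.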
I have also noticed that search engines can cope with long s well and so I tend to agree with not preventing its usage in our main namespace. It is historical typography and it looks good in historical documents. So I would not forbid users to enter a typographical version according to their preference. If somebody wanted, they could also be allowed to enter both typographical versions (e.g. one of them as annotated) and we may think of some ways of enabling users to switch between the two versions. --Jan Kameníček (talk) 10:20, 25 February 2021 (UTC)
@Jan.Kamenicek: +1. There actually is such a system, unofficially, at least: pages using {{ls}} can be toggled using this script:
mw.loader.load("//en.wikisource.org/w/index.php?title=User:Inductiveload/Visibility.js&action=raw&ctype=text/javascript");
It also allows you to toggle external links' blue color. Inductiveloadtalk/contribs 12:42, 25 February 2021 (UTC)
@Inductiveload: Yes, I know, but that is not what I meant. I can do it once it has been explained to me, but my parents would not do it no matter how well you explained it to them. Such tools make real sense only if they are accessible to ordinary users. What is more, such scripts can be used only by logged-in users, while the vast majority of our readers do not log in. --Jan Kameníček (talk) 13:03, 25 February 2021 (UTC)
Sure, but with approval, this can be made a default gadget. Inductiveloadtalk/contribs 13:09, 25 February 2021 (UTC)
I think it would be much more user friendly to have a button to toggle long s on and off on the page than to have to make custom css. Perhaps, we could place a slider button on the top right corner of the page. Languageseeker (talk) 13:20, 25 February 2021 (UTC)
Yes, something like that would be great. --Jan Kameníček (talk) 14:49, 25 February 2021 (UTC)
"it looks good in historical documents" — no it doesn't. It looks awful and causes difficulties in reading fluently—principally because the common fonts in use don't distinguish it effectively from "f". The glyph had its hey-day in the Tudor and Stuart periods and fell into disfavour during the end of the Stuart period and was mainly used through the 18th and 19th centuries by publishers who wanted to pretend antiquity. My approach to this is to restrict use of the long s to works printed in the Tudor and Stuart periods up to and including the First Commonwealth. I use a plain "s" for works printed after the Restoration. Beeswaxcandle (talk) 03:59, 26 February 2021 (UTC)
I'm not a huge fan of the long-s except in facsimile reprints. But I will say that as w:long s details, use of the long-s in English plummeted between 1790 and 1810; its use in the 18th century was pretty standard.--Prosfilaes (talk) 03:26, 27 February 2021 (UTC)

Poem tag and page wraps

How do we handle <poem>, when the verse in the source document is across two or more pages? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 10:25, 25 February 2021 (UTC)

You simply finish the page with </poem> and start the new page with <poem> again. Only when a page break coincides with a stanza break do you finish the page with
…
Stanza one.
<br />
</poem>
{{nop}}

For details see Help:Poetry --Jan Kameníček (talk) 11:58, 25 February 2021 (UTC)

Alternatively, don't use poem tags at all, but use <br /> between lines and block center/s & /e. This is what many of us do because of the problems of alignment of poems across pages when using poem tags. Beeswaxcandle (talk) 03:39, 26 February 2021 (UTC)

Tooltips lost in PDF Export and in Kobo epub Reader

I noticed that Richard II (1921) Yale has the notes in tooltips. When exported to a PDF, the relevant text is underlined, but there is no pop-up note. When exported to an EPUB, the comments render in Calibre, but not on a Kobo. Languageseeker (talk) 13:59, 25 February 2021 (UTC)

Whether or not PDFs can support tooltips rather depends on whether the PDF format supports such a thing. @Samwilson: is there any hope of this working in PDFs?
For EPUB, which is basically HTML, Calibre uses the Chrome engine internally, so it behaves more or less just like a browser. Kobo uses a much less capable rendering engine, based on RMSDK which presumably doesn't handle the title attribute. Koreader also does not handle the title attribute. On a touchscreen, it's tricky to handle in general, because there's no "hover" concept without a mouse, and it's down to the program in question to look for a title attribute on long press.
Something that might be possible is on-the-fly conversion of elements with tooltips to references in the export process.
There are more details about what does and does not currently work at Help:Preparing for export. Inductiveloadtalk/contribs 15:29, 25 February 2021 (UTC)
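To make the idea of converting tooltips to references during export more concrete, here is a rough sketch. It is not how the export tool works; it assumes, purely for illustration, that a tooltip reaches the exporter as `<span ... title="note">text</span>` with no nested spans, and the function name `tooltips_to_footnotes` is invented. A real implementation would use a proper HTML parser rather than a regular expression.

```python
import html
import re

# Assumed markup: <span ... title="note">text</span>, with no nested spans.
TOOLTIP = re.compile(
    r'<span\b[^>]*\btitle="([^"]*)"[^>]*>(.*?)</span>', re.DOTALL
)


def tooltips_to_footnotes(fragment: str) -> str:
    """Replace each tooltip span with its text plus a numbered marker,
    and append the collected notes as an ordered list."""
    notes = []

    def swap(match: re.Match) -> str:
        # Keep the visible text, remember the note, add a marker.
        notes.append(html.unescape(match.group(1)))
        return f"{match.group(2)}<sup>[{len(notes)}]</sup>"

    body = TOOLTIP.sub(swap, fragment)
    if not notes:
        return body  # nothing to convert
    items = "".join(f"<li>{html.escape(note)}</li>" for note in notes)
    return f"{body}<ol>{items}</ol>"


print(tooltips_to_footnotes('He was <span class="tooltip" title="sic">ded</span>.'))
```

Because the notes end up as ordinary elements rather than a `title` attribute, a reader that cannot show hover text would still display them.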
Thanks for your reply. Seems that things are a bit broken with Kobo. Would it be possible to automatically convert tooltips to footnotes when generating an EPUB? Languageseeker (talk) 04:06, 26 February 2021 (UTC)
For PDFs, could we transform tooltips into comments? Languageseeker (talk) 06:24, 26 February 2021 (UTC)
@Languageseeker, @Inductiveload: This is mainly about the {{tooltip}} template isn't it, and maybe {{SIC}}? I sort of feel like it shouldn't be up to the exporter to do the translation to footnotes; that it'd be better to implement tooltips on-wiki in a better way. Because viewing the wiki on a touch screen is as annoying as on an ereader, as far as not being able to hover goes. Maybe re-implementing them as an 'annotation' reference group would be better? That'd work in print as well, and it'd give ereaders the required structure to be able to display them as popups (which my Kobo does). Open to suggestions though! :) — Sam Wilson 09:05, 26 February 2021 (UTC)
Personally speaking, they are web page features only, along the lines of Wikisource:Annotations. The exported works should not have our annotations, and they should be #noprint. — billinghurst sDrewth 12:07, 26 February 2021 (UTC)

Two Enhancements to Search

Would it be possible to create two enhancements to search that would make this site much more usable to those in university?

  1. Add a citation Box to the Mainpage of a series. For example, The Czechoslovak Review would have a box that says Volume:|_| Number: |_| Page: |_| in the top right corner. You can either type in Volume:|3| Number: |4| Page: |97| or you can select the volume and number from a drop down menu and that would take you immediately to The_Czechoslovak_Review/Volume_3/Joža_Úprka#97 or to the Page:The_Czechoslovak_Review,_vol3,_1919.djvu/131 if there is no proofread version.
  2. Allow users to perform a search within a Category. For example, in Category:Periodicals, Current affairs, there would be a search box that searches only the texts in the category. You could also perform an advanced search. The search would look through both proofread text and OCR text for the words. For instance, you could search for "soldiers" from June 12, 1912 until June 16, 1914 in Category:Periodicals, Current affairs

Languageseeker (talk) 14:50, 25 February 2021 (UTC)

  • For searching within categories, use incategory:. The problem with the first proposal would be finding what article is on that page; this would likely be a manual entry for each work. As a drop-down menu (showing volumes, issues, and then articles), it is much more feasible, and can likely be done automatically. TE(æ)A,ea. (talk) 03:04, 26 February 2021 (UTC).
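As a rough illustration of the incategory: suggestion, here is a minimal sketch that builds a search request for the MediaWiki search API with an incategory: filter. The helper name build_search_url and the example category are made up for illustration, and nothing is actually sent over the network. Note that only pages that are themselves in the category will match.

```python
from urllib.parse import urlencode

API = "https://en.wikisource.org/w/api.php"


def build_search_url(term: str, category: str, limit: int = 20) -> str:
    """Build a search API URL restricted to pages in one category.

    build_search_url is a hypothetical helper, not part of any library.
    """
    # Quote the category so that names containing spaces stay one token.
    query = f'incategory:"{category}" {term}'
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


url = build_search_url("soldiers", "Periodicals")
print(url)
```

The same query string (incategory:"Periodicals" soldiers) can be typed directly into the normal search box.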
Thank you for your reply. I think that it's important to be able to jump to the precise page number because there are works with tens of thousands of pages where the citation is given as a specific page number. For example, the Federal Register is routinely over 50,000 pages per year with thousands of entries. A standard reference would be 78 FR 51713. A citation tool should be able to take this and jump to the right place in the Federal Register. Perhaps there can be a lookup table generated as part of the transclusion process that records that this page is found in Consumer Product Safety Commission: Notices: Meetings; Sunshine Act.
The incategory: is a good start, but I know that when I'm teaching undergrads, we use databases that allow us to search specific subsets of works. The wiki category is the best equivalent. However, strong search capabilities are essential. I'm thinking of something more along the lines of the search box in Popular Science Monthly with advanced options. For example, imagine a student wants to find all mentions of parasol by women authors between 1789 and 1815. They would be able to go to the Category "Women authors," go to the search box, type in "parasol," select the advanced option and limit it by date. Languageseeker (talk) 04:05, 26 February 2021 (UTC)
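The lookup-table idea for page citations could look roughly like the sketch below. Everything in the table is invented for illustration (the page ranges and subpage titles are hypothetical, and no such table exists today); the point is only that a sorted list of first-page numbers lets a citation such as volume 78, page 51713 be resolved with a binary search.

```python
from bisect import bisect_right
from typing import Optional

# Hypothetical table: volume -> sorted list of (first_page, subpage title).
LOOKUP = {
    78: [
        (51700, "Example Agency/Rules"),
        (51710, "Example Commission/Notices"),
        (51720, "Another Agency/Proposed Rules"),
    ],
}


def resolve(volume: int, page: int) -> Optional[str]:
    """Return 'subpage#page' for a citation, or None if it is not covered."""
    entries = LOOKUP.get(volume)
    if not entries:
        return None
    starts = [first for first, _ in entries]
    # Index of the last entry whose first page is <= the cited page.
    i = bisect_right(starts, page) - 1
    if i < 0:
        return None
    return f"{entries[i][1]}#{page}"


print(resolve(78, 51713))
```

The "#page" suffix mirrors the anchor style in the Czechoslovak Review example (…/Joža_Úprka#97).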

Comment: I have added a search box that will search within the publication of The Czechoslovak Review. We can make it more configurable, though at this time it hasn't been necessary. Otherwise Search doesn't work the way you want as the search data is not recorded that way, nor is the work set up that way. Search itself does not understand dates, which need to be a special indexed component for any search engine. You can read more about WMF's search at mw:Help:CirrusSearch. These searches can also be set up with the Index: namespace, per Index:Men-at-the-Bar.djvu, though we would not typically set up a main ns work that searches in Page: ns for works. Main ns searches should and will only retrieve transcluded work. — billinghurst sDrewth 12:22, 26 February 2021 (UTC)

Comment: 2. Re incategory: searches, that is going to be a little problematic with how we do things, and maybe it is something that we should rethink now that search is more expanded. We typically only categorise the top level of a work, so INCATEGORY search parameter will only search that page, not the subpages of the work which are not categorised where you identified your interest. There are means around that, though each of them has its consequences, and definite amounts of work to achieve. — billinghurst sDrewth 12:35, 26 February 2021 (UTC)
I see. It seems to me more of a limitation with CirrusSearch than anything else. To perform the kind of research searches required in universities would require a new search engine. Would it be appropriate to post a request on Phabricator? Languageseeker (talk) 16:09, 26 February 2021 (UTC)
Phabricator is probably not (yet) the right place for this. What you describe requires a base of structured data, which we store on Wikidata. But we don't have very good tools for managing that data, so what's there is and will continue to be very spotty for the foreseeable future. First step, thus, is improving our tooling for adding and maintaining that data. Once we have a reasonable coverage on Wikidata, the next is allowing search and navigation based on it. That's going to require a custom search facility that understands the information model (i.e. it knows that works have authors, were published on a given date, can be collective works, are split into volumes and issues that in turn contain articles, etc.). This kind of engine is not useful outside Wikisource (i.e. Wikipedia has no direct use for it) and is a relatively complex bit of software, so it's not something that the (often volunteer) developers working on MediaWiki can just add. It's a tall order, so it needs to start with coming up with a reasonable specification and description of how this will interact across projects and how the technology will interact with the community. Then it needs wider discussion with those impacted (the various language Wikisource projects, Wikidata, and Commons). And then we can start looking for ways to actually get it implemented (probably starting by trying to find a volunteer willing to do the programming).
Don't get me wrong, I think it's a really good idea and something I wish we had; but the realism of getting there any time soon is questionable. --Xover (talk) 16:36, 26 February 2021 (UTC)
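To make the "information model" point a little more concrete, here is a very small sketch of the kind of structure such a search facility would need to understand. The class and field names are assumptions chosen for illustration, not an existing Wikisource or Wikidata schema, and the sample data is invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Article:
    title: str
    authors: List[str]
    year: int
    text: str


@dataclass
class Issue:
    number: int
    articles: List[Article] = field(default_factory=list)


@dataclass
class Volume:
    number: int
    issues: List[Issue] = field(default_factory=list)


@dataclass
class Periodical:
    title: str
    volumes: List[Volume] = field(default_factory=list)

    def search(self, term: str, first_year: int, last_year: int) -> List[Article]:
        """Return articles mentioning `term` published within the year range."""
        needle = term.lower()
        return [
            article
            for volume in self.volumes
            for issue in volume.issues
            for article in issue.articles
            if first_year <= article.year <= last_year
            and needle in article.text.lower()
        ]


# Invented sample data.
review = Periodical("Example Review", [
    Volume(1, [Issue(1, [
        Article("On Parasols", ["A. Writer"], 1790, "The parasol was green."),
        Article("On Hats", ["B. Writer"], 1820, "No parasol here."),
    ])]),
])
hits = review.search("parasol", 1789, 1815)
print([a.title for a in hits])
```

A date-limited search only becomes possible once the year is stored as structured data next to the text, which is the gap described above.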
Can we start the conversation now? I think that this is the perfect time because the pandemic has devastated university budgets and they are looking for alternatives to costly databases. Most university libraries spend millions of dollars a year for databases that mostly use texts in the public domain. We can offer them a free alternative. I'm sure that if we reach out to them, some will be willing to provide technical support and grants. What about the various MARC standards as the basis of our dataset? It's the universal standard for library catalogs and will make it super easy to batch import information from university libraries to Wikidata. We could also sell Wikisource to libraries as a way to become ADA compliant. Imagine a student with a disability requests an electronic version of an article from 1893. Then the university can pay a student to proofread the article on Wikisource and Wikisource will generate the electronic version. It's a win-win. I know that the National Library of Scotland and other libraries have participated here. Maybe we can reach out to them and see what they want. Languageseeker (talk) 18:26, 26 February 2021 (UTC)
The conversation not only could and should start now, but it is actually overdue. I am just trying to manage the expectations of what it is realistic to achieve in any relatively close time frame for a chronically resource-starved all volunteer project. We need much better Wikidata integration for all sorts of things, and what you propose here would just be one exceptionally good showcase for such functionality. The fact is, for what we have coverage for, our metadata is actually generally better than most libraries and archives' databases; we just don't do a good job of making them structured and reusable. Bulk imports of data to Wikidata are happening at an absurd pace already so that's not really a problem; it's connecting the bibliographic related work we do here to Wikidata that's the major gap. --Xover (talk) 19:23, 26 February 2021 (UTC)
Glad you agree. I looked through Wikidata and, as far as I could tell, there is no MARC parser or exporter. I think that this would be the first step towards working with any library. It will enable us to import the data from libraries into Wikidata and export it out. The MARC standard is available online [1] and open source implementations exist [2]. This is a huge undertaking and will probably require fundraising, but I think that it's the first critical step towards making Wikisource a true online library. It also probably makes sense to reach out to the Open Library for possible collaboration in software development. I'm a new user here so I haven't earned my stripes, but I would love to work with you on this project. Languageseeker (talk) 20:13, 26 February 2021 (UTC)

Importing from PGDP

I was wondering if there is a way to import a project from PGDP to Wikisource. The works on PGDP have images and corrected text with formatting. I know that the formatting will need to be wikified, is there a tool to do this? Or is this a request best posted on Phabricator? Languageseeker (talk) 16:07, 26 February 2021 (UTC)

@Languageseeker: Please don't. Project Gutenberg texts are not generally of any particular edition of a work (amalgamations of multiple editions in some cases), and their transcribers sometimes "innovate" in various ways (modernised or americanised spellings, for example). Works here should generally start with uploading a scan and then proofreading against that scan; and the raw OCR in the scan will usually be a lot better for that than the PG text. If you don't care about the fidelity of the text, why not just read it on PG directly? --Xover (talk) 16:23, 26 February 2021 (UTC)
Totally with you on account of Project Gutenberg. I would advocate for a removal of all texts from Project Gutenberg. However, it's not Project Gutenberg, but Distributed Proofreaders that feed into Project Gutenberg. They have proofread texts with formatting and the original images. So, we would get a proofread text that we could compare to and validate them against the original image. See, for example, [3] (login required) Languageseeker (talk) 17:31, 26 February 2021 (UTC)
@Languageseeker: My apologies. I have obviously not been entirely clear on the distinction between PG and DP. Having a quick look at their guidelines it appears at least mostly compatible with our practices, so they could certainly be one source of text for us (provided what they actually output matches the guidelines, which I haven't checked). We'd need to find some technical way to import page by page to a scan hosted here so we can run our own Proofreading (just with a better starting point than OCR) and to make sure our texts are validatable to that scan for our readers. Possibly a mechanism akin to Help:Match and split, and it would probably require DP to have something API-ish that we could consume, but overall it should be feasible. --Xover (talk) 18:53, 26 February 2021 (UTC)
@Xover: Created a phabricator task. Hope it gets done. Languageseeker (talk) 20:27, 26 February 2021 (UTC)
@Languageseeker: this is an interesting idea, but an importer would almost certainly be done as an external tool that constructs a matching DJVU file from page images, feeds data in over the MediaWiki API, and then uploads the pages. I do wonder if it can be fully automated. The biggest worry so far, after logging in and sniffing around a bit, is that I cannot find page images for the "complete" works, nor a reliable link to something like the IA.
Also, I'm rather jealous of their velocity, even with such a huge number of review stages, they're clocking 140 works a month.
The other problem is that they do not format works to our level, for example "--" instead of "—", capitals, not small caps (they do mark this up), no centering, no sizing, etc etc.
On the subject of Phabricator, I've recently been wishing for a way to track enWS tasks, since they often have dependencies. Does anyone know if we can use Phabricator for that? Can we ask for a project? For example "move {{header}} to module logic". Inductiveloadtalk/contribs 20:53, 26 February 2021 (UTC)
(e/c)At present, the instructions for Match&Split specifically exclude DP works. However, IF a DP work is based on a single edition and the other criteria are met, then the Match & Split tool is fine. Certainly some of Laverock's contributions were done this way and the EB11 project is also utilising a version of the process. We would still require the normal enWS validation process. Beeswaxcandle (talk) 21:00, 26 February 2021 (UTC)
@Beeswaxcandle:, DP provides a file split by page, so you can in theory do better than M&S. However, you do need to figure out where the scan came from (hopefully the IA) and work out the offset (their page 1 is not the front cover) or construct a scan from the DP images, if present. The bigger challenge will be to write a parser for their markup, because it'd be a shame to junk it all. Inductiveloadtalk/contribs 21:30, 26 February 2021 (UTC)
So, I made a really shonky script to import a DP page-by-page text file: User:Inductiveload/dp_reformat.js using the magic of regex. It seems to have worked OK: Index:The ways of war - Kettle - 1917.pdf. However, the biggest issue I see is that once DP "archives" a project, the links to the marked-up source are removed from public view, as well as the page images. I'm unsure of why they do this, but it makes it all-but impossible to do a perfect match/split on the work, even if you can hook it up with a matching edition's scan. Inductiveloadtalk/contribs 10:52, 1 March 2021 (UTC)


Open The long and short is that DP does not appear willing to share their archived projects. So, making a tool that is specific to DP makes little sense. However, I still think that it makes sense to create a tool that would allow us to import OCR from a different source or replace the image files. So, I made a different phabricator ticket. Languageseeker (talk) 14:36, 1 March 2021 (UTC)

@Languageseeker: as long as you can massage the text into "Match and Split" format, you can already drive mass page uploads directly through the normal Wikisource interface. For the case of the User:Inductiveload/dp_reformat.js, this script will (attempt to) transform raw DP text into split-ready text with as much wikiformatting as it can. I will add some quick docs at User:Inductiveload/dp_reformat. It might not work for every type of DP project (since AIUI, different projects have different formatting standards). Inductiveloadtalk/contribs 15:06, 1 March 2021 (UTC)
@Inductiveload: Your script is utterly amazing. I'm astonished. I've used it on several books and it is great. I do have one bug and one suggestion:
  • Bug: If the offset is negative, you need to type the number first before you can insert a minus sign.
  • Request: Can you make the Index Menu a drop down menu so that the tool could be used on non-English Wikisource pages? For example, for French it would be "Livre" and "Page"
Also, is it possible to redo the match and split if the results are incorrect? I started one for Index:The American encyclopedia of history, biography and travel (IA americanencyclop00blak).pdf and it turns out they removed several blank pages from inside of the book, so I would need to rerun the script. In the past, when I tried to do this, it failed silently.
BTW: Everything from [4] upwards is still available on PGDP. It might be good to do a collective project to add these to Wikisource before the files are archived. Languageseeker (talk) 18:20, 2 March 2021 (UTC)
@Languageseeker: I looked at the offset and I think it's a bug in OOUI (phab:T276265).
Re the Index drop down, the namespace names "Index" and "Page" are canonical, so the script should just work at other Wikisources. E.g. s:it:Index:Peregrinaggio di tre giovani figliuoli del re di Serendippo.djvu works, even though the local namespace is "Indice". Let me know if it does not.
As for fixing a bad split, this is best fixed by a bot and admin, otherwise the redirects make a mess. Let me know the range to be moved and the offset and I'll sort it for you.
I am working on salvaging the texts at F2 levels (~1600). Inductiveloadtalk/contribs 19:26, 2 March 2021 (UTC)
Haha, you're awesome. Thanks for salvaging those texts. The ones that are posted to PG are archived first, so it probably makes sense to salvage those first. It's such a rich source.
For the problem with the merge, starting with Page:The American encyclopedia of history, biography and travel (IA americanencyclop00blak).pdf/22, the text should be moved +2 pages. So 22 has the text for 24. Languageseeker (talk) 21:15, 2 March 2021 (UTC)
I tried splitting a French book and it mostly works except for Modèle:Nop and Modèle:Ch don't work. Languageseeker (talk) 21:15, 2 March 2021 (UTC)
@Languageseeker: Move underway for the misaligned pages. In future please be cautious before splitting that the alignment is correct. It is annoying, I know, but if they mess with the pages, thems the breaks.
Re the French templates, I guess it is possible to handle the other subdomains, as long as you know what to map each formatting element to. E.g. I think is their "nop". But it will take a little bit of a fiddle to do so. You can also do the replacements in a text editor if you know what you want to replace with.
Also, I wonder where to put the text files - they total over 950MB when uncompressed! Maybe the IA? Inductiveloadtalk/contribs 23:18, 2 March 2021 (UTC)
Thanks. I checked a few pages in the beginning and this one tricked me.
The IA might be a good place, or you can batch upload them to Commons. It might be good to store the original OCR for the future. You never know.
A few more markup for your script: [ae] = æ; [oe] = œ; {{...}} = … Languageseeker (talk) 02:19, 3 March 2021 (UTC)
@Languageseeker: Commons doesn't accept random zips/txt files, though. Anyway fill your boots: https://archive.org/details/dp_texts
Re the OCR, I'm not really sure about that, as long as we have the scan, we can OCR to our hearts' contents.
{{...}} is actually a WS template, it's designed to prevent a line break in the middle of ". . ." Inductiveloadtalk/contribs 09:09, 3 March 2021 (UTC)
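For anyone who prefers to do the replacements outside the wiki, the conversions mentioned in this thread ("--" to "—", [ae] to æ, [oe] to œ, small caps) can be sketched with a few regular expressions. This is not how dp_reformat.js works internally; it is a standalone illustration. The <sc>…</sc> tag for DP small-caps markup and the choice of {{sc}} as the target template are assumptions, so check them against the project in hand.

```python
import re

# (pattern, replacement) pairs applied in order.
REPLACEMENTS = [
    (re.compile(r"\[ae\]"), "æ"),
    (re.compile(r"\[oe\]"), "œ"),
    # Exactly two hyphens become a dash; longer runs such as "----"
    # are left alone because they mean a horizontal rule in wikitext.
    (re.compile(r"(?<!-)--(?!-)"), "—"),
    # Assumed DP small-caps markup mapped to an assumed {{sc}} template.
    (re.compile(r"<sc>(.*?)</sc>", re.S), r"{{sc|\1}}"),
]


def wikify(text: str) -> str:
    """Apply each replacement in turn and return the converted text."""
    for pattern, replacement in REPLACEMENTS:
        text = pattern.sub(replacement, text)
    return text


sample = "A man[oe]uvre--by <sc>Kettle</sc>"
print(wikify(sample))
```

Order matters only if the patterns overlap; these four do not, so they can be extended one rule at a time.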
Given DP don't have a publicly available archive of their completed projects, and have stated that is intentional, I'm not sure harvesting everything that is available and posting it ... to a publicly available archive ... is a great way to make friends! Nickw25 (talk) 08:24, 4 March 2021 (UTC)

All, it's important to note the position of the DP in relation to the activity in this space. In summary, DP have stated a view that WS should not use their in-progress texts in line with their community wishes. This is stated by the DP General Manager in the forum discussion on this topic at their site, which can be easily located. It was stated on the first phab ticket above that DP pointed the enquirer to in progress texts rather than archived ones. That was never stated publicly there. I don't know if it was stated privately or not, although it is no longer the position, if it ever was. It's also stated by their administrators in the same forum that the bulk harvesting of texts (presumably the same referenced above) was so disruptive it caused their server to become unresponsive for a period. Maybe unintentional, although destabilising other projects' servers to harvest information they don't want harvested cannot be the standard for a WMF project. Given this, I'd think it's reasonable for WS volunteers to refrain from harvesting and bulk importing content from DP given their community wishes. Disclaimer that I'm a volunteer at DP as well, and have been for many years on and off. To be clear, I'm a standard volunteer there, as here, and have no more knowledge other than what has been publicly posted on their forums. Nickw25 (talk) 02:35, 6 March 2021 (UTC)

@Nickw25: I'm somewhat familiar with the situation. Inductiveload archived the project pages and concatenated texts because DP removes the images and text from DP soon after they get posted to PG. It did not cause the server crash. We asked if it was possible to just obtain the concatenated text afterwards and they said no. I'm not sure what the exact issue is. It seems to be a moral/philosophical issue rather than a legal one. They asserted no copyright claim, just a statement that downloading texts is a subversive activity that disrupts the core mission of the site. I'm sure that millions of authors who have their texts lapse into the public domain would like to restrict copying their work with a similar argument, but that is not how the PD works. If the text is an exact mechanical reproduction of a text in the PD, then the reproduction is in the PD. You cannot copyright a PD work. Languageseeker (talk) 02:55, 6 March 2021 (UTC)

UniversalLanguageSelector font list?

ꜵ̈
ꜵ̈̄

Is there anywhere that I can view a list of the fonts included in the UniversalLanguageSelector extension? Perhaps I am just being a dunce, but I can't find any obvious place that lists them.

The reason I ask is – the dictionary I'm transcribing is from a time before the IPA was standardised, and it uses a few bizarre characters that don't seem to be rendered very well with the default fonts that are used (at least on my computer). The ones that are causing the most issue are the two culprits in the infobox. The diaeresis and the macron are supposed to be centred above the ꜵ. Would anyone happen to know if any of the fonts in the UniversalLanguageSelector would render this character properly, and if not, if a suitable font could be added?— 🐗 Griceylipper (✉️) 22:22, 26 February 2021 (UTC)

@Griceylipper: All fonts included with ULS. But your two ao-ligatures seem to render just fine in my browser (Safari on macOS) so I'm not really sure any ULS shenanigans are necessary. --Xover (talk) 07:49, 27 February 2021 (UTC)
I see the error on firefox and IE on Windows. Languageseeker (talk) 15:14, 27 February 2021 (UTC)
Ao ligatures not rendering correctly.png
Thanks for the list @Xover:. This is what I see on Windows 10 with Chrome. Charis SIL seems to render the ao ligature properly, but there seem to be characters missing in it as well such as ꬶ (renders fine for me, but another font is being substituted for this character.) Though I can live with that.
If Charis SIL is the best compromise out of all the fonts included in ULS, would it be appropriate to blanket apply this font to the whole work? Or is that not advisable?— 🐗 Griceylipper (✉️) 19:25, 27 February 2021 (UTC)
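It may help when comparing fonts to see what the problem character is made of. The sketch below assumes the second infobox character is the sequence U+A735, U+0308, U+0304 (written with escapes so that it does not depend on how this page was copied). It prints each code point with its Unicode name and combining class.

```python
import unicodedata

# Assumed code point sequence for the second infobox character.
glyph = "\uA735\u0308\u0304"

for ch in glyph:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  "
          f"combining class {unicodedata.combining(ch)}")

# NFC normalisation leaves the sequence unchanged, i.e. no single
# precomposed code point is substituted for it.
print(unicodedata.normalize("NFC", glyph) == glyph)
```

Because the marks remain separate code points, centring them over the ligature is left to the font and the text renderer, which would be consistent with the same text looking fine in one browser and wrong in another.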

Undo a Move

I moved Index:Milton - Milton's Paradise Lost, tra il 1882 e il 1891.djvu to a different title and it seems to have broken everything. Is there any way to undo the move? Languageseeker (talk) 06:48, 27 February 2021 (UTC)

@Languageseeker: Done. For the future, perhaps temper enthusiasm with a pinch of caution until you gain more experience with the software and our community? :)
Most things can be fairly easily fixed so mistakes are not a big deal, but we do have some relatively unique software and community features that can sometimes make assumptions based on experience elsewhere a little iffy. In this case the issue was that the way the software connects the scanned file at Commons with the Index: and Page: pages here is through the page name. If a file is named "File:The Book.djvu" then the Index: must be at "Index:The Book.djvu" in order to work. We also cannot easily rename the File / Index / Page: pages for various technical reasons and so we tend to treat those as mostly opaque (rather than human readable) strings. We prefer nice logical and accurate file names if they can be had at the time of upload (or shortly after, before dependent pages are created), but we live quite happily with all but the most misleading filenames in most instances. --Xover (talk) 07:43, 27 February 2021 (UTC)
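The naming rule described here can be written down as a tiny sketch. The function names are made up for illustration; the only facts relied on are the ones stated above, namely that the Index: page must carry the same name as the file, and that Page: titles follow the "name/number" pattern seen elsewhere on this page.

```python
def index_title_for(file_title: str) -> str:
    """Return the Index: title that matches a File: title."""
    prefix = "File:"
    if not file_title.startswith(prefix):
        raise ValueError(f"not a File: title: {file_title!r}")
    return "Index:" + file_title[len(prefix):]


def page_title_for(file_title: str, number: int) -> str:
    """Return the Page: title for one scan page of a file."""
    name = index_title_for(file_title)[len("Index:"):]
    return f"Page:{name}/{number}"


print(index_title_for("File:The Book.djvu"))
print(page_title_for("File:The Book.djvu", 22))
```

Moving only the Index: page breaks this correspondence, which is why the move had to be undone rather than patched up.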
@Xover: Thank you! I'll be more cautious in the future. I didn't want to rename the book in Commons because I didn't upload it, but it seems that that would have been the place to do it. Languageseeker (talk) 15:01, 27 February 2021 (UTC)

Table Trouble during proofreading

How to achieve that kind of custom table style as used in paragraph (c) of Page:Administration of Justice (Protection) Act 2016.pdf/35? I'm still confused with Template:Table style. Many thanks for solutions.廣九直通車 (talk) 07:47, 28 February 2021 (UTC)

Xover has beaten me to it. A slightly simpler way of doing the same thing is:

175 | If the document or electronic record is required to be produced in or delivered to a court of justice | Ditto | Ditto | Ditto | Imprisonment for 6 months, or fine*, or both | Ditto

Beeswaxcandle (talk) 08:54, 28 February 2021 (UTC)

@廣九直通車: (e/c) I had a go at it, see if it's what you had in mind.
Tables in HTML and CSS are complicated, table syntax in MediaWiki is complicated, and {{ts}} is complicated. And you need to wrap your head around all three to be able to effectively work with tables here. Add the sometimes really quirky table formatting in the works we reproduce and I would be very much surprised if most people didn't struggle with this at least some of the time (I know I do).
For this particular table the key was to turn off borders for the table overall, and then to apply a single border to each of the cells (except the last one). {{ts}} is just a shortcut for inserting style="border-top: none; text-align: left;" and similar. Each of the obscure little keywords documented for the template equates to one CSS keyword with a specific value (text-align: left, for example, is al).
And since {{ts}} inserts a style attribute, you need to use it where such attributes are valid; most often on the main HTML <table> element (created by {|), a <tr> (table row, created with |-), or a <td> (table cell, created with various combinations of | and ||). Inside a table row each || separates the cells, but you can also put attributes directly on those cells with || attributes | cell content. Attributes can be a {{ts}} (it outputs a style="…", recall) or colspan="…" or rowspan="…".
In any case, the point is that you need to keep four different models of the concept of a "table", with attendant syntax and details, in your head at once so it's really more surprising anyone ever figures it out. --Xover (talk) 09:08, 28 February 2021 (UTC)
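The border approach described above (no border on the table as a whole, one border on every cell except the last) can be illustrated by generating the wikitext mechanically. This sketch writes plain style="…" attributes instead of {{ts}} codes, and the use of border-right is an assumption about which edge carries the rule; adjust it to match the scan.

```python
from typing import List


def table_row(cells: List[str], cell_style: str) -> str:
    """Build one wikitext row; every cell but the last gets cell_style."""
    parts = []
    for i, cell in enumerate(cells):
        is_last = i == len(cells) - 1
        # "attributes | content" is how attributes attach to a single cell.
        attr = "" if is_last else f'style="{cell_style}" | '
        parts.append(f"{attr}{cell}")
    # "||" separates cells that share one source line.
    return "| " + " || ".join(parts)


def table(rows: List[List[str]]) -> str:
    """Build a whole table with borders turned off at the table level."""
    lines = ['{| style="border: none; border-collapse: collapse;"']
    for row in rows:
        lines.append("|-")
        lines.append(table_row(row, "border-right: 1px solid;"))
    lines.append("|}")
    return "\n".join(lines)


wikitext = table([["175", "If the document …", "Ditto"]])
print(wikitext)
```

Once the structure is right, each style="…" can be swapped for the equivalent {{ts}} shorthand, since the template only outputs that attribute.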

Joining lines

Is there a way to easily join lines together? I'm working on a few texts with hanging indents and if I don't join the lines together, then the indent breaks. Languageseeker (talk) 11:42, 28 February 2021 (UTC)

@Languageseeker: See Wikisource:Tools and scripts#PageCleanUp --Jan Kameníček (talk) 13:07, 28 February 2021 (UTC)
@Jan.Kamenicek: Awesome, thank you so much. It's even better than I hoped for. Languageseeker (talk) 14:37, 1 March 2021 (UTC)
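For text prepared offline, the basic operation (joining the hard-wrapped lines of each paragraph while keeping blank lines as paragraph breaks) can be sketched as below. This is not the PageCleanUp script, just an illustration of the idea, and it deliberately does not try to rejoin words hyphenated across lines.

```python
import re


def join_lines(text: str) -> str:
    """Join lines within each paragraph; blank lines separate paragraphs."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = []
    for paragraph in paragraphs:
        # Drop stray whitespace and empty fragments before joining.
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        joined.append(" ".join(lines))
    return "\n\n".join(joined)


sample = "First line of an entry\nwhich wraps here.\n\nSecond entry\nalso wraps."
print(join_lines(sample))
```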

Project to Match and Split OCR from Distributed Proofreaders

Distributed Proofreaders has several thousand projects to proofread and correct texts against a single-edition work. Most of their projects derive their scans from IA or Google. However, they archive their scans after posting to Project Gutenberg. The project is to download the proofread texts, match them with their appropriate scans, and post them here. Inductiveload has created an awesome tool that makes it possible to preserve much of the formatting and easily get the text into a format ready for match-split.

Here are the requirements

  1. Install User:Inductiveload/dp_reformat.js

Project Instructions

  1. Pick a project from one that Inductiveload archived at [5].
  2. The "Concatenated Text File" is in the zip file, and the Project Description is in the matching HTML.
  3. See if the Project Description gives the original location for the scans. If it doesn't, you'll need to manually match the work.
  4. Create an index file for the work and make sure to create page numbers.
  5. Go to the Sandbox. Select Edit and paste in the text from the text file inside of the zip that you downloaded from Distributed Proofreaders.
  6. Select Reformat DP text in the Tool section of the left side bar.
  7. You will need to paste in the Index name for the work and calculate the offset. Distributed Proofreaders deletes the first few pages, so you will need to do a bit of math to get the correct number.
  8. Either select Show Preview or Publish Changes. Verify the pages. Distributed Proofreaders sometimes removes blank pages from within the work and you will need to check a few pages to make sure that the offset is correct.
    1. If there are multiple offsets, you will need to do the merge and split in stages.
  9. A tab called "Split" will appear next to discussion. Select discussion. If you want to verify that your split started, visit [6]
  10. Once you have done this, record that you have imported the work at User:Languageseeker/PGDP.
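The offset arithmetic in steps 7 and 8 can be checked before splitting with a small sketch like the one below. The stage boundaries and offsets are hypothetical numbers chosen for illustration: six front-matter pages missing from the DP text, and two further blank pages dropped just before DP page 16.

```python
from bisect import bisect_right
from typing import Callable, List, Tuple


def make_page_mapper(stages: List[Tuple[int, int]]) -> Callable[[int], int]:
    """Build a DP-page to scan-page mapper.

    stages is a list of (first_dp_page, offset) sorted by first_dp_page;
    each offset applies from that DP page until the next stage begins.
    """
    starts = [first for first, _ in stages]

    def to_scan_page(dp_page: int) -> int:
        i = bisect_right(starts, dp_page) - 1
        if i < 0:
            raise ValueError(f"DP page {dp_page} is before the first stage")
        return dp_page + stages[i][1]

    return to_scan_page


# Hypothetical work: offset 6 up to DP page 15, offset 8 from DP page 16.
mapper = make_page_mapper([(1, 6), (16, 8)])
for dp_page in (1, 15, 16):
    print(dp_page, "->", mapper(dp_page))
```

Spot-checking the first and last page of each stage against the scan is a cheap way to catch a missing blank page before the split runs.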

Disclaimer, this is a personal project and has no official sanction. Just asking for help from the community. Languageseeker (talk) 23:25, 2 March 2021 (UTC)

Follow-up note, Inductiveload's script will work on non-English wikisources, but the markup will need to be manually updated.

Are the downloaded texts coming from Gutenberg? We have had poor success with their "proofread" texts. They often mix-and-match editions, have modernizations, or other problems that make them incompatible with Wikisource standards. --EncycloPetey (talk) 00:09, 3 March 2021 (UTC)
No, Distributed Proofreaders takes a single source, usually from the IA or Google Books, proofreads the books against the source images, and then processes them and posts them to PG. They are strict about matching the source text to the image prior to posting to PG. Before they post to PG, during post-processing, they sometimes correct errata and deviate from the source text. The files on Distributed Proofreaders match the source text. You can take a look at the couple of books that I matched-and-split from Distributed Proofreaders as an example. The major thing that is lost in the Distributed Proofreaders sources is the header and footer, but this is easier to add in than proofreading the text. Languageseeker (talk) 01:08, 3 March 2021 (UTC)
Just commenting on the concerns about PG. Many (most?) of their texts now come from DP, who expect them to be a single edition. Most of DP's final output will list the changes made, and many also state if further silent changes were made. PG has come a ways from the days of refusing to acknowledge which edition something was prepared from (as I understood they did). There is certainly a subset of projects from PG that are lower risk from a WS perspective. They'd have been processed at DP (identifiable in the PG credits line), have an easily identifiable scan set (sometimes referenced in the PG text, otherwise can be traced back via the DP project comments, or a bit of investigative work) and have a transcribers note that silent changes were not made. Nickw25 (talk) 02:41, 6 March 2021 (UTC)

Also posting here to link to my comments above, that DP do not want WS using their in-flight works for this purpose. See that post for more detail. Nickw25 (talk) 02:45, 6 March 2021 (UTC)