bot: "(align formatting)"

Mpaa, in using the bot are all of those edits showing (align formatting) mistakes I made in validating? If so then something is wrong because they looked correct to me and I double-check before saving. —Maury (talk) 19:52, 6 May 2015 (UTC)

Hi. No, no problems, I wanted to align the left side note. It is just that it is easier to process all pages in the same way than to select only some pages.— Mpaa (talk) 19:57, 6 May 2015 (UTC)

Thanks for responding and further info on bot request

What a small wiki world! You may not recall, but you assisted me with my first two texts last fall when I first joined the WS project. Thanks again for that. For this topic I will respond more formally on the bot page (where I hope you'll notice my astute use of the emdash  –  :), but I just wanted to say hi and fill you in on the background to this request.

First, the OCR button: with all due respect, user Beeswaxcandle responded to my problems with the OCR button by explaining that it activates an OCR routine that rescans the image and does not re-emit the underlying text in the DJVU file. I note that this sometimes corresponds to my experience of the button, but for me the button's behavior, and even whether it appears on my toolbar, is unpredictable, even with the appropriate gadgets preference set, so I am unable to use it productively.

Try to set it up. It is all you need.— Mpaa (talk) 11:24, 15 May 2015 (UTC)

Second, the reason for tagging the underlying text, then uploading: I have reluctantly reached the conclusion that there are several types of correction best done on the whole volume because of the context needed to be able to make good decisions about running headers, titles, cross-page hyphenation, front v. back matter, etc., so this is my experiment along those lines. Again, Beeswaxcandle and I discussed this on my talk page if you're curious about the history.

Making corrections on the whole volume is one thing; embedding wiki-syntax in the djvu file is another. I would discourage that, because if someone needs the djvu for other purposes, the text layer will be full of useless stuff.— Mpaa (talk) 11:24, 15 May 2015 (UTC)

LBNL, I chose the Southern Historical Society Papers project because it appeared to have stalled but has a user (Maury) who was interested in making further progress on the series and has several volumes I am interested in seeing completed. For the number of pages involved, a mass reload (assuming my script on the djvu text works sufficiently well) would be far more efficient and make it possible to finish the SHS project this year.

Probably more than you wanted to know, but I thought extra detail might be helpful since you had previously assisted me. Feel free to reply on the bot page or my talk page, and thanks again, Dictioneer (talk) 00:17, 15 May 2015 (UTC)

Thanks for your quick response, both here and on the bot page. Here is my experience with the OCR button: I bring up a page in edit mode, click on "Proofread tools", and there is no OCR box. I click on "Preferences" at the top, then on "Gadgets." On the Gadgets page, in the second section, "Editing tools for Page: namespace", I click the checkbox " OCR: Enable OCR button Button in Page: namespace." and click on "Save" at the bottom of the page. I go back and reclick through from the Index space to the specific page. The OCR button doesn't appear. I go back to preferences/gadgets and in the section "Development (in beta)" I click on "Add a toolbox link to reload the current page with Resource Loader in debug mode." Lather, rinse, repeat, back to the specific page in question. Still no OCR button. I click on the "Debug" button in the left pane under "Tools." The OCR button usually appears, though not always. Sometimes I have to click on the EDIT link again, in which case the OCR button always appears. If I click on the OCR button, it reloads the text, but the text doesn't contain my most recent changes.
I should note I have tried this with Firefox on Ubuntu, Mac, and Windows, with Chrome on Ubuntu and Mac, and with Internet Explorer on Windows. If you have a fix for this, or debugging advice, or can point me to a resource that will help me sort it out, that would be great. I've also copied in common.js and common.css text from user Beeswaxcandle so that my setup was as close to his/hers as possible. If you think it would help to copy in your script source, I'm happy to try that.
In my experience, the only way that reliably reloads my updated text and formatting from the djvu source file is going to the testpage Beeswaxcandle has deleted for me. There, the correct text shows up whether I have an OCR button or not, and the OCR button (if I press it) reloads the updated text correctly in that circumstance. I have not saved this page since it is the only one I have that reliably works. This is the reason for my request for a mass delete of non-proofread pages previously uploaded by LA2-bot.
Thanks for any help you can provide or for pointing me to an appropriate resource to get this debugged. At the moment, the only workaround that gives the desired result is the one proposed by Beeswaxcandle. I am open to alternatives, but would need a link to the relevant bot, upload script or other documentation that would get me started. Also, once the text has been populated, I would be fine to upload a version of the djvu file which has all wiki-formatting removed, it's just difficult at the moment to see how to get to that point. Dictioneer (talk) 17:28, 15 May 2015 (UTC)
Try to copy my User:Mpaa/vector.js and keep your preferences as simple as possible (in editing, select Show edit toolbar and Enable enhanced editing toolbar).— Mpaa (talk) 18:47, 15 May 2015 (UTC)
No luck, I'm afraid. Here's what I did: A) went into preferences and hit the reset button, then verified the Show Edit & Enable Enhanced settings. B) created a blank vector.js page and copied over your source. C) exited the browser, restarted and logged in. No OCR button. Realized that I'd reset the OCR gadgets preference to off in my general reset, went to gadgets and re-enabled OCR. Went back to edit a page. Still no OCR. D) Went back to Gadgets and re-enabled the Debug setting under "Development". I am now back to the original behavior, which is that if I click Edit, then Debug, I usually get the OCR button to show up. E) However, when I click the OCR button it will reload the text, but does not reload the most current version of the text. F) I also went back and copied in your vector.css, common.js, and common.css just for thoroughness' sake. Same result.
I think it might be illuminating to separate this problem into its less important and more important parts: the unpredictable appearance and disappearance of the button (annoying but less important), and what actually happens on a page when you press OCR. Let's assume that the OCR button appears reliably for you. How does it behave when you use it on these text pages? Here is the experiment to try: A) go to Index:Southern_Historical_Society_Papers_volume_35.djvu and click on Index page 19/file page 33 to see what happens. I get a page that warns me this page has been deleted by Beeswax and I should think seriously before recreating it. B) In the text itself you should see a "noinclude" running header and a hwe tag for possessor: this reflects the current updated djvu file. C) Now cancel out and go to page 20/34 (i.e., the next page), which still exists. You will see it displayed without any running header tag. D) Press OCR, and the page is refreshed with a running header but without the noinclude tags. This text is from an old version of the file, not the most current one from commons. E) Therefore, p. 19 (which Beeswax deleted) is correctly updated, p. 20 (not deleted) is not.
Did you get the same result? There are other differences on the page I could detail, but I assume one missing change is enough for now. Thanks for taking the trouble to help me figure out what's going wrong. Dictioneer (talk) 14:45, 16 May 2015 (UTC)
This is the link that is called by the OCR button [1], and then this is parsed. If you copied my settings, OCR should appear under "Proofread Tools".— Mpaa (talk) 19:43, 16 May 2015 (UTC)
Another issue is that OCR considers all the text as part of the body, so I guess it will not include Template:Rh... in the header.— Mpaa (talk) 19:56, 16 May 2015 (UTC)
Unfortunately, OCR still only occasionally appears but generally doesn't. The noinclude tag was included in a revision at Beeswax's suggestion; apparently he and Maury have a hot-key that activates a .js routine that will take the running-header and put it in the header box. In any case, other changes in the text underlying the djvu file also do not appear, not just the running-header noinclude tag.
I can provide other details of what's not appearing if that's your preference, but personally I would suggest that you proceed with the deletion algorithm you propose on the bot page and that we resume trying to chase down the source of this problem at some point in the future. The problem seems important to me since users who edit text and re-upload djvu files will have some of their changes appear and some not for no apparent reason. However, this is clearly a difficult and intermittent problem, so a brief break from chasing it may be good for all involved. Let me know if there's any help I can provide on the revised deletion/reloading based on LA2-bot being the most recent updater of the page. Dictioneer (talk) 20:59, 16 May 2015 (UTC)
I'll try to upload the latest text-layer, I need to customise a script first. But trust me, updating the text layer in the djvu file is a viable option if and only if a new OCR-process is reapplied to the file. All the rest can be done working directly on text files, without bothering to change the djvu. And also dividing header, body and footer is not trivial. If you want to apply this approach in the future, try to stick to this file format to divide the different pages: Note that this is not done to handle the Proofread Page format, so if you want to define headers and footers, try to mark them in the text with a convention that makes it easy to recognise them (e.g. @@HEADER_START@@all the header text@@HEADER_END@@ or similar). It will make the rest of the process easier. There is WIP on the bot side to handle this in an easy way in the future.— Mpaa (talk) 21:12, 16 May 2015 (UTC)
One more thing. The advantage is that once you have the file done, you can apply all the text improvements you want with a text editor, working off-line and using search and replace patterns per file instead of per page. And upload only the final step of all your improvements.— Mpaa (talk) 21:16, 16 May 2015 (UTC)
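The marker convention suggested above could be handled with something like the following Python sketch. It is only an illustration: it assumes symmetric @@HEADER_START@@ / @@HEADER_END@@ markers, and the function name split_header is hypothetical, not part of any existing bot or upload script.

```python
import re

# Assumed marker convention (symmetric @@...@@ delimiters); adjust as needed.
HEADER_RE = re.compile(r"@@HEADER_START@@(.*?)@@HEADER_END@@", re.DOTALL)


def split_header(page_text):
    """Return (header, body) for the text of one page.

    If no marker pair is found, the header is empty and the body is
    the page text unchanged.
    """
    match = HEADER_RE.search(page_text)
    if match is None:
        return "", page_text
    header = match.group(1).strip()
    # Cut the marked span out of the body and trim leftover whitespace.
    body = (page_text[:match.start()] + page_text[match.end():]).strip()
    return header, body
```

Run once per page of the off-line text file, this would give the header and body as separate strings that a later upload step could place in the header box and the page body.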
BTW, your syntax for {{hwe}} is wrong. See Page:Southern_Historical_Society_Papers_volume_35.djvu/86. {{hwe|con|fidence.}} should be {{hwe|fidence.|confidence.}}
Good catch, I've updated my script accordingly. I may start a new topic with questions about the mediawiki link above, but you've given me a huge amount of help already, so I'll let you get back to your own texts for awhile before I bug you again. Thanks so much, and if there's anything I can do for you in return, just let me know. Dictioneer (talk) 22:06, 17 May 2015 (UTC)
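For anyone hitting the same {{hwe}} mistake, here is a rough Python sketch of the kind of per-file search and replace that could repair it. It assumes the mistaken tags have the form {{hwe|start|end}} (as in {{hwe|con|fidence.}}) and rewrites them to {{hwe|end|whole word}}. The name fix_hwe is hypothetical, and this is not the script actually used.

```python
import re

# Matches {{hwe|A|B}} with two plain parameters (no nested templates).
HWE_RE = re.compile(r"\{\{hwe\|([^|{}]*)\|([^|{}]*)\}\}")


def fix_hwe(text):
    """Rewrite {{hwe|start|end}} as {{hwe|end|start+end}}."""

    def repl(match):
        first, second = match.group(1), match.group(2)
        # If the second parameter already ends with the first, the tag
        # looks correct ({{hwe|fidence.|confidence.}}), so leave it alone.
        if second.endswith(first):
            return match.group(0)
        return "{{hwe|%s|%s}}" % (second, first + second)

    return HWE_RE.sub(repl, text)
```

The endswith check is a heuristic so that the replacement can be run twice without mangling tags that are already right; unusual words could still fool it, so the result is worth spot-checking.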

Have used pywikibot to upload v. 36

Hi Mpaa, thanks for pointing me at the upload script and the pywikibot ecosystem. After a bit of fumbling (including about 30 reverts -- power tools can be dangerous!:) I've managed to upload Volume 36 of SHSP and I'm pretty happy with the results. Two questions: first, I've produced a "clean" version of the underlying djvu file (no wiki-tags except for cross-page hyphenation), do you think it's worthwhile to upload the file to commons, or do you think the x-page hwe/hws tags are too much formatting as well? If it's still too much formatting, I can write a script to strip them, but am unsure what to change them to: an unhyphenated word at the bottom of the first page, an unhyphenated word at the top of the next page, or go back to how it was. I am happy to follow whatever direction you give on this. Second, should I create a separate bot account for running this script and register it, or is that unnecessary? Thanks in advance for this advice and at the risk of repeating myself, I really appreciate the technical help you've provided. Dictioneer (talk) 13:19, 29 May 2015 (UTC)
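To make the three stripping options concrete, here is a small Python sketch. It assumes the tags take the forms {{hws|start|whole word}} and {{hwe|end|whole word}}; the function name strip_hyphen_tags and the mode names are hypothetical, chosen only for this illustration.

```python
import re

HWS_RE = re.compile(r"\{\{hws\|([^|{}]*)\|([^|{}]*)\}\}")
HWE_RE = re.compile(r"\{\{hwe\|([^|{}]*)\|([^|{}]*)\}\}")


def strip_hyphen_tags(first_page, next_page, mode="original"):
    """Remove hws/hwe tags from a pair of adjacent pages.

    mode="original":      restore the split word ("con-" / "fidence")
    mode="join_on_first": whole word at the bottom of the first page
    mode="join_on_next":  whole word at the top of the next page
    """
    if mode == "original":
        first = HWS_RE.sub(lambda m: m.group(1) + "-", first_page)
        second = HWE_RE.sub(lambda m: m.group(1), next_page)
    elif mode == "join_on_first":
        first = HWS_RE.sub(lambda m: m.group(2), first_page)
        second = HWE_RE.sub("", next_page).lstrip()
    elif mode == "join_on_next":
        first = HWS_RE.sub("", first_page).rstrip()
        second = HWE_RE.sub(lambda m: m.group(2), next_page)
    else:
        raise ValueError("unknown mode: %r" % mode)
    return first, second
```

For example, with "He had every {{hws|con|confidence}}" on one page and "{{hwe|fidence|confidence}} in the army." on the next, the "original" mode gives back "He had every con-" and "fidence in the army.".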

I would not touch the djvu text layer. If you extract the xml structure, each word has coordinates, etc., so I do not know how much value it has just to add the text. Regarding the bot, you need to ask the community to grant you a bot flag for that task and usually create an account for it; see Wikisource:Scriptorium#BOT_approval_requests. And there must be a policy somewhere; it should be easy to find, but right now I do not have time. Bye— Mpaa (talk) 15:17, 29 May 2015 (UTC)

Strange thing

Hi Mpaa,

I don't really know my way around Wikisource. I made a small correction at Page:The Spell of the Yukon and Other Verses.djvu/60, changing "one" to "none" (there was none could place the stranger's face), which should have increased the size by one byte, but for some reason it went down by 113 bytes or something like that. Obviously something going on I don't understand. Would appreciate it if you'd take a look. --Trovatore (talk) 06:08, 16 June 2015 (UTC)

Hi. I do not know, some internal MW magic ... I wouldn't bother, your change looks fine anyhow.— Mpaa (talk) 07:36, 16 June 2015 (UTC)

A wikisource database question please

Post was moved to User talk:Ineuw#A wikisource database question please. (Unsigned comment by Ineuw (talk).)

Apologies for stealing your topic, Mpaa! Please follow on as I expect your input/experience is going to be essential! AuFCL (talk) 03:31, 5 July 2015 (UTC)


Attempted a validation of this, but I'd appreciate a second view: whilst I've been very cautious, I'd like to be sure I've caught everything. Going to give it a second pass in any event. ShakespeareFan00 (talk) 19:51, 27 July 2015 (UTC)

Help with data extraction

Hi. I am asking you for help to extract some text data from the Wikisource text databases because so far I wasn't successful in achieving this goal.

The data I need runs from this page to this page: the first 105 characters of every paragraph. These may or may not already contain a wiki link and an anchor, but the necessary part of the text is the word "Page nnn" followed by the reference number. From these I convert and create links and generate the anchors in the text pages.

Reference anchor:
{{fs90/s}}{{anchor|463-1}}[[Page:The Conquest of Mexico Volume 1.djvu/53#53-1|Page 9 (<sup>1</sup>)]].—

Main text source:
{{anchor|53-1}}[[Page:The Conquest of Mexico Volume 1.djvu/463#463-1|<sup>1</sup>]]

Not to confuse you, both ends are, or will be anchored and linked for convenience. — Ineuw talk 19:25, 25 August 2015 (UTC)
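As an illustration of the conversion described above, here is a minimal Python sketch that reads the first 105 characters of a note paragraph, finds "Page nnn" and the reference number, and builds both wikitext fragments. Several things are assumptions: the function name build_links, the exact shape of the "Page 9 (1)" lead, and the page offset of 44 (inferred only from the example, where book page 9 sits on file page 53). It is not the extraction that was actually run.

```python
import re

# Assumed lead format: "Page 9 (1)" or "Page 9 (<sup>1</sup>)".
NOTE_LEAD_RE = re.compile(r"Page (\d+) \((?:<sup>)?(\d+)(?:</sup>)?\)")

INDEX = "The Conquest of Mexico Volume 1.djvu"
PAGE_OFFSET = 44  # assumption: book page 9 is file page 53 in the example


def build_links(paragraph, notes_page, index=INDEX, offset=PAGE_OFFSET):
    """Return (reference_anchor, main_text_source), or None if no match."""
    match = NOTE_LEAD_RE.search(paragraph[:105])
    if match is None:
        return None
    book_page, ref = int(match.group(1)), match.group(2)
    text_page = book_page + offset
    note_anchor = "%d-%s" % (notes_page, ref)
    text_anchor = "%d-%s" % (text_page, ref)
    # Fragment for the notes page: anchor plus a link back to the main text.
    reference = "{{fs90/s}}{{anchor|%s}}[[Page:%s/%d#%s|Page %d (<sup>%s</sup>)]]" % (
        note_anchor, index, text_page, text_anchor, book_page, ref)
    # Fragment for the main text: anchor plus a link to the note.
    main = "{{anchor|%s}}[[Page:%s/%d#%s|<sup>%s</sup>]]" % (
        text_anchor, index, notes_page, note_anchor, ref)
    return reference, main
```

Calling build_links("Page 9 (1).—Some note text", 463) is meant to reproduce the two fragments quoted above, apart from the trailing ".—" punctuation.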

Done: User:Ineuw/Sandbox. Hope I got you right.— Mpaa (talk) 21:44, 25 August 2015 (UTC)
Perfect, many many thanks.— Ineuw talk 22:36, 25 August 2015 (UTC)