User talk:Harris7

What to do when a current Wikisource work has no source

{{helpme}}

Hi, I just started on Wikisource recently, and validated/corrected several pages of An Indiscretion in the Life of an Heiress by editing pages in the source text, for example: Page:Littell's Living Age - Volume 139.pdf/90

Now I wanted to validate/correct several errors I noticed in Ardessa, but there is no source.

I assume that is because there is no <pages> element in the article?

I have searched Wikimedia Commons for "The Century Magazine", 1918, but didn't find it.

Can I just edit the pages of Ardessa directly, or do I need to find a source PDF/DJVU to upload first?

Thanks! Harris7 (talk) 20:39, 19 August 2022 (UTC)

Hyphenated words across pages

When a hyphenated word is split across two pages, simply leave it "as is". The software automatically joins the word when the Pages are transcluded. --EncycloPetey (talk) 20:35, 19 July 2023 (UTC)

  • User:ShakespeareFan00 determined how to solve the issue. Apparently a carriage return before the end of section caused the hyphenation to fail to collapse. Once that carriage return was removed, everything worked as it should. --EncycloPetey (talk) 21:38, 21 July 2023 (UTC)
    EncycloPetey: Thanks to you & ShakespeareFan00 for resolving this! I found another one at the end of this page: Middlemarch (1874)/Chapter 14. I tried the same fix, which fixed the page-end-hyphen problem, but broke the italics: it changed the last italic section to bold instead of italics (i.e. interpreted three single quotes as bold instead of single quote-italic text...) Help? Harris7 (talk) 20:24, 22 July 2023 (UTC)
    There are a couple of issues there. (1) There is italicized text enclosed in single quotes, and with three quote marks in a row like that, the software interprets it as bold. The fix to that issue is to use {{'}}, which breaks up the group. (2) There were some carriage returns at the end of the first page. Sometimes the system automatically inserts a carriage return when you start editing a page, so if you spot odd behavior, going back to remove the inserted carriage return can fix the problem. It's all working now. --EncycloPetey (talk) 20:32, 22 July 2023 (UTC)
    On the issue of the hyphen-section space glitch, how quickly could a rule be found to run AWB over Page: pages with the issue? ShakespeareFan00 (talk) 21:33, 22 July 2023 (UTC)
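
    For illustration only, the two fixes discussed in this thread could be sketched roughly as below. This is a minimal sketch, not an actual AWB rule: it assumes the affected pages end with a labeled-section close tag of the form <section end="..."/> and that the glitch is a line break between the trailing hyphen and that tag. The function names are hypothetical, and the exact markup on any given page may differ.

```python
import re

# Assumption: the glitch is whitespace containing at least one line break
# between a page-final hyphen and a closing <section end="..."/> tag.
_HYPHEN_GAP = re.compile(r'-\s*\n\s*(<section end="[^"]*"\s*/>)\s*\Z')


def has_hyphen_gap(page_text: str) -> bool:
    """Return True if a trailing hyphen is separated from the section end by a line break."""
    return _HYPHEN_GAP.search(page_text) is not None


def close_hyphen_gap(page_text: str) -> str:
    """Remove the line break so the hyphen sits directly before the section end tag."""
    return _HYPHEN_GAP.sub(r'-\1', page_text)


def quote_italic(text: str) -> str:
    """Put italic text inside single quotes using {{'}} so no run of three quote marks forms."""
    return "{{'}}''" + text + "''{{'}}"


if __name__ == "__main__":
    broken = 'he was quite cer-\n<section end="ch14"/>'
    print(has_hyphen_gap(broken))
    print(close_hyphen_gap(broken))
    print(quote_italic("word"))
```

    The pattern is anchored to the end of the page text, so a hyphen followed by a line break elsewhere on the page is left alone. Any real rule would need checking against a sample of affected pages first.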

Thank you

For the clearly very meticulous revision of pages in Solo. Many of your corrections are now being used in the OCR correction software I use to help proofread works. Would you be interested in just validating the pages, so that they can rise up in status? PseudoSkull (talk) 21:28, 1 November 2023 (UTC)

Hi PseudoSkull, thanks for the kind words! Sure, will do. I'm still a newbie, and clumsy/unsure with the Wikisource process; I'm hesitant to mark things "validated" when I feel I may have missed something while reading/proofreading.
I see from your talk page that you do software - are you a full-time developer too? I retired from SW dev a couple years ago, and still really enjoy working on code - mainly C/C++/C#/JavaScript. Regarding the OCR correction SW you mentioned - I assume this is your own app? Is it a Windows app, or ... ?
Also - speaking of OCR, I encountered and corrected hundreds of errors in my proofreading of Middlemarch (1874) this summer; I posted some followup questions/notes to Stamlou, who I thought was the person that did the OCR, but that was probably an incorrect assumption. Could you take a quick look at my questions here and post a reply here on my Talk page? Any guidance would be greatly appreciated! Harris7 (talk) 12:43, 2 November 2023 (UTC)
Hey, thanks for the long-winded response.
I am indeed a software developer by job and by hobby. Though, I wouldn't call myself full-time, more like a freelancer. I essentially do business for myself. My primary areas of interest in software are things like automated data entry, automated testing (especially with the Selenium package), APIs, data scraping, and data wrangling. I think my coding skills and my work on Wikisource are very well-aligned as well, which is another advantage. My time at Wikisource has actually inspired most of the learning I've done in terms of coding as well.
Since May of this year, I've been building an application (what is now more like a "power-user" app), called QuickTranscribe. The application was used to proofread Solo, the entire work that you're validating right now. Basically, the process of proofreading, processing, and submitting a work to Wikisource is an exceedingly time-consuming task, and honestly needlessly so. This combined with the fact that there's an endless sea of works that in theory need to be completed for this to become even close to a complete site means that the very long amount of time it takes to even get one work done is a huge problem.
So, my QuickTranscribe system aims to cut out as much of the tedious and repetitive work as possible, making the whole process dynamic, leaving the proofreader to actually proofread for 95% of their time on each project. I would even go as far as to say that QT has split the total work time on a transcription in half, at least. I went from getting one novel done every 3 days, to getting two novels done every day, which is something I honestly never thought possible until now.
The OCR correction software is part of the QT application I've been developing. It uses almost 2,000 lines of code just to correct the OCR that's there now, and that particular script is something I've been adding to for several years. Unfortunately, it can't detect everything. So, to answer your first question on Stamlou's talk page, the fact that there are hundreds of minor errors to be found in a single proofread work is a bit much, but not by that far. The human eye and purely logically-oriented software can only detect so much, out of the ~ 100,000 words in any given novel. If you're only finding one or two errors out of every five pages, that's fairly normal here.
The thing about OCR is that it's extremely unreliable for correctness, and it's really annoying to work with. I would never use it in work outside of Wikisource unless I absolutely had to. No matter what technology you're using, whether it's Tesseract, Google OCR, or AbbyReader, it's all going to produce some very ugly results in specific spots. We only use it here because it's faster than typing the pages out manually...
So, there being a few errors in the work doesn't mark bad proofreading. On the other hand, if all the pages are littered with errors and marked proofread (and you'd know what that looks like if you saw it, because of how bad OCR is as previously mentioned), then that would be a problem. This is a user-generated project that is always by its nature a work-in-progress, so any transcription project sitting around always has room to be improved for accuracy and presentation, even if it's been fully validated. PseudoSkull (talk) 19:51, 2 November 2023 (UTC)
PseudoSkull: Hi again, I saw your 'thank you' for my "sic" marking of "Ada" (instead of "Aïda"), but after encountering several more instances of "Ada", I got the impression that the author did this intentionally, and that it is not a typo. It appears that he has the dog's owner using "Ada" in dialog, but the author still refers to it as "Aïda" in non-dialog text. So I removed the "sics": here and here.  :-)