Help talk:Match and split


pagequality

The section "Notes about <pagequality>" might need a bit of discussion about what qualifies as proofread against our page scans. My view is that there difference between proofread against hard copy and scanned. I could see some texts from external sites as proofread, for example, but others may contain problems that are difficult to detect. I think we should get a second user to validate the file. Cygnis insignis (talk) 04:18, 10 January 2010 (UTC)[reply]

Yes, we do need some criteria, which is one reason that I didn't tell people how to do it, just how to request it. Note that this refers to the Proofread status, not to Validated.
Currently we have our works proofread that we are putting against a text; similarly, we also have Gutenberg texts to put against scans. — billinghurst sDrewth 05:48, 10 January 2010 (UTC)
I don't think that the complete removal of the note is relevant either. There are works here that have been proofread or validated, and are now being matched and split. I think that requesting that the work goes through two further proofreads is overkill, and basically just leaves works sitting in abeyance. There needs to be allowance/flexibility made for works that have been proofread. — billinghurst sDrewth 14:02, 14 February 2010 (UTC)

Proposed criteria:

  • Requires no header, footer, or other content and formatting to be added or changed.
    • That it has been checked for introduced errors (during or after proofing).
  • That a work has been proofread (or validated)
    • at Wikisource.
    • at Project Gutenberg or by Distributed Proofreaders.
    • That instances of formatting conventions, such as ITALIC, are consistent within the transcription, i.e. not just selective conversion of italic to ITALIC or <i>italic</i>.
    • That conversion from HTML and other coding will not introduce errors, and accords with WS standards for Page-namespace transcriptions.

These criteria address some of the problems I found when using it. Detecting the errors, when compared to an OCR text layer, proved difficult in many cases; it is sometimes quicker to use the regular method. I did a book with this: I gave up on our fragmented version and used the etext at PG. I expected there to be errors from my conversion and reproofing against the original; it is easy to get complacent, and someone else found more when they were validating it. Getting one other person to check these M & S procedures would not take long, and it maintains the value of 'validated' that emerged from Page:space transcriptions. Meeting my own standards of thoroughness is not adequate, IMO; I want another to at least glance over it before it is validated. It doesn't bother me if I have indexes that remain unvalidated, as proofreading and my checking are satisfactory and any errors will be corrected by someone, some time. No offence intended, to PG or anyone else, but our scan-backed, verifiable, correctable, and carefully validated transcriptions leave the others in the shade. They do provide a flying start to hosting or improving a proofread text via this method we are helping to test, and make the other value-adding much easier. There is enough activity for errors to crop up, despite the level of proofreading that has gone before. Cygnis insignis (talk) 15:43, 14 February 2010 (UTC)

I am obviously tired; I cannot sort through the argument as to what is in favour and what is not.
As a note, yep, those of the PG texts that I have used have been of good quality, and I basically work on pasting it into place, going through and doing global replacements to our style, getting it up to scratch before letting ThomasBot at it. Little issues like ° being written as degrees, or £300 being 300 pounds. So now I paste a chapter at a time, fix it, and then match it and split it. The only little quirk is for HWS/HWE splits. — billinghurst sDrewth 16:00, 14 February 2010 (UTC)
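
For illustration only, here is a minimal Python sketch of the kind of global replacement described above; the two patterns are made up from the degrees/pounds examples in the comment, not taken from any actual conversion list:

    import re

    def pg_to_ws_style(text):
        # Hypothetical examples only: convert PG-style spellings back to the
        # symbols used at WS, e.g. "45 degrees" -> "45°", "300 pounds" -> "£300".
        text = re.sub(r'(\d+) degrees\b', u'\\1°', text)
        text = re.sub(r'\b(\d+) pounds\b', u'£\\1', text)
        return text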

Deprecate

This is not working out! There are too many variables for this to be a practical solution. Given the accuracy of most OCR, with obvious errors, versus transcripts of different editions, 'corrections', checking for those corrections, skipping the same mistakes that other proofreaders have overlooked, things like removed hyphens, anons changing spelling and who knows what else, it is very inefficient (overall). This disregards the value of OCR and scans and makes a mock proofread (or validated!) text. I'll be direct: I just corrected another result of this procedure and I reckon I could have done it quicker from the OCR. The errors would be obvious as the reader tries to read the text; a supposedly validated text that reinforces the valid criticism of our text integrity. The answer to this is not "well, they can fix it"; they will get annoyed and go elsewhere. Cygnis insignis (talk) 08:53, 9 June 2010 (UTC)

  • This thing creates more problems than it solves. With rare exceptions, and without a great amount of time invested in checking the accuracy, the replacement of the OCR text layer (usually better than 99% accurate) is a very bad idea and a time-wasting venture. It makes reasonable PG texts into faux scan-based texts: the problems will be undetectable, or derive from differences in the edition, and the consequences are not addressed by the users who apply this. It seems like a good idea, especially if one has dealt with very bad text layers, but it becomes horribly inefficient if anything goes wrong. This page should carry a very clear caution about the myriad of problems that emerge in the hundreds of edits it generates; the solution is nearly always to fix the OCR and ignore whatever existed before. cygnis insignis 09:12, 12 February 2011 (UTC)
I believe that our instructions for Match and Split should clearly state that it should only be undertaken where one can match edition and place of publishing. If they cannot be matched, then it should not be undertaken. For many PG texts that information is not available, so it should be sourced by whatever means. I think that a firm checklist should be completed before the match can progress. — billinghurst sDrewth 10:39, 12 February 2011 (UTC)

Headers/footers

The page says "When Page: namespace pages are created the contents of these fields [in the Index page] is used to populate the header and footer fields". But when I've used Match and Split, it has created pages with empty headers and <references/>, which does not match the specified values in the Index page. If I create the pages manually, header and footer are correctly set. Is there some way to have Match and Split populate the fields as desired? --Jellby (talk) 18:37, 8 January 2015 (UTC)[reply]

Error: "Match tag not found"[edit]

I tried to match with this edit. But when I clicked the "match" link I got an error, "match tag not found." Any idea what's up? -Pete (talk) 20:44, 20 March 2018 (UTC)

OCR text is skewed. The text layer is line one column one, line one column two, line two column one, line two column two, etc. so it fails. — billinghurst sDrewth 05:18, 21 March 2018 (UTC)
Match and Split really only works on complete pages. Because you are trying to start partway through a page it fails because the beginning of the page doesn't match. This particular article is so short that a quick proofread with section markers is more efficient than using Match and Split. Beeswaxcandle (talk) 05:41, 21 March 2018 (UTC)
Due to the lack of identification of the columns, it may be more appropriate to just paste the available text from the main ns page into the page ns page and then retransclude. — billinghurst sDrewth 09:47, 21 March 2018 (UTC)
Got it, thanks. I hadn't looked closely enough to find the columns problem. Beeswaxcandle, I've found in the past sometimes the M&S process is able to figure out when something starts mid-page...but I'm sure that ability has its limits. Thank you both for your insights -- mostly just curious to understand the technology and its limitations. Happy to do this one manually. -Pete (talk) 21:54, 21 March 2018 (UTC)

split tab not displayed

Does anyone know the reason for the split tab not appearing on [1]? Currently I've no access to edit interface messages on pt.wikisource. Lugusto 02:34, 30 October 2018 (UTC)

Extending Match to PDF files

@Phe, @Ruthven: I downloaded and edited the Python align.py code for Match, to run it locally and to try to understand it; I found that the text-extracting routine can be adapted to extract the text layer from PDF files too, using pdftotext:

djvu routine
    # djvutxt --detail=page prints one s-expression per page; the loop pulls
    # the quoted text out of each and unescapes it. (filename, data and
    # unquote_text_from_djvu come from the surrounding align.py code.)
    ls = subprocess.Popen(['djvutxt', filename, '--detail=page'],
                          stdout=subprocess.PIPE, close_fds=True)
    text = ls.stdout.read()
    ls.wait()
    for t in re.finditer(u'\((page -?\d+ -?\d+ -?\d+ -?\d+[ \n]+"(.*)"[ ]*|)\)\n', text):
        t = unicode(t.group(1), 'utf-8', 'replace')
        t = re.sub(u'^page \d+ \d+ \d+ \d+[ \n]+"', u'', t)
        t = re.sub(u'"[ ]*$', u'', t)
        t = unquote_text_from_djvu(t)
        data.append(t)
pdf routine
    # pdftotext needs '-enc' and 'UTF-8' as separate arguments, and '-' to
    # send the output to stdout; it appends a form feed (\x0c) after each
    # page, so splitting on it gives one entry per page.
    ls = subprocess.Popen(['pdftotext', '-enc', 'UTF-8', filename, '-'],
                          stdout=subprocess.PIPE, close_fds=True)
    text = unicode(ls.stdout.read(), 'utf-8', 'replace')
    ls.wait()
    data = text.split(u"\x0c")

I'm testing similar code (using os.system; I can't use subprocess so far) on Internet Archive PDF files, and it seems to run well. --Alex brollo (talk) 17:41, 15 January 2019 (UTC)
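
For reference, a minimal sketch of the os.system variant described above, assuming only that pdftotext is installed; the temporary file path and the function name are illustrative, not part of align.py:

    import os
    import codecs

    def extract_pdf_pages(filename, tmp_txt='/tmp/match_pdf.txt'):
        # pdftotext appends a form feed (\x0c) after each page, so splitting
        # on it yields one string per page, as align.py expects.
        os.system('pdftotext -enc UTF-8 "%s" "%s"' % (filename, tmp_txt))
        with codecs.open(tmp_txt, 'r', 'utf-8') as fp:
            text = fp.read()
        return text.split(u'\x0c')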

Just to record a terrible case of a failing match (both for DjVu and PDF), trying to match File:Salgari - La Città dell'Oro.djvu: in that DjVu file there are lots of full-page illustrations, whose captions quote a brief piece of the text. The Match logic searches for that text in the target, and if it finds it on a page following the illustration, it matches it into a completely wrong context, and then the next page can't match. A painful but working fix: manually add the caption into the target text in the right place; then the match runs happily. --Alex brollo (talk) 23:14, 15 January 2019 (UTC)

Match is off by two pages

@Phe: and anybody who might know: I ran a match here but the result, seemingly regardless of where I started it in the text, was off by two pages. The first page matched would contain three pages' worth of text, and then all following pages would be offset accordingly. I fixed this one manually, but later in the overall work I have the same problem; and I'm curious to understand why this might be happening. (I found I can fix it using find-and-replace pretty easily if needed.) -Pete (talk) 16:10, 8 October 2019 (UTC)

Bot is down, Jan. 15

Bot is down as of Jan. 15, 2020 (as reported by Beleg Tal on the Scriptorium). -Pete (talk) 21:46, 15 January 2020 (UTC)

Bot is down, Oct. 23, 2020

Note that the bot is down again: https://phetools.toolforge.org/match_and_split.php -Pete (talk) 00:56, 24 October 2020 (UTC)

Bot only supports DjVu

I attempted to use this on a PDF-based scan and it said it could not find the relevant DjVu. I can upload the DjVu for the relevant edition, but we already had an Index for the PDF. ShakespeareFan00 (talk) 08:15, 3 October 2022 (UTC)

Given that the lede states that the target Page: needs to have an existing DjVu text layer, and that the list of criteria asks "Is the Index file of type DjVu", and that the first of the "When not to use" points states "If any of the above criteria are not met", then it shouldn't be a surprise to you that it didn't work on a PDF. It was never intended to. Beeswaxcandle (talk) 17:15, 3 October 2022 (UTC)
The next question would be: is there anyone interested in writing a PDF version of the tool? A number of IA PDFs contain a text layer, because I've had it show up when viewing PDFs in the Proofread Page interface. Thus it is at least feasible to consider whether there's a way of doing what Match and Split does with a PDF.

ShakespeareFan00 (talk) 09:24, 4 October 2022 (UTC)
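
For what it is worth, a small feasibility sketch (not part of the existing tool; the function name is illustrative): pdftotext can already pull the text layer of a single page out of a PDF, which is the per-page text that a PDF-aware Match and Split would need.

    import subprocess

    def pdf_page_text(filename, page):
        # -f/-l select the first and last page to convert; '-' sends the
        # result to stdout instead of writing a .txt file.
        out = subprocess.check_output(
            ['pdftotext', '-enc', 'UTF-8',
             '-f', str(page), '-l', str(page), filename, '-'])
        return out.decode('utf-8')

The splitting and matching logic itself would not need to change; only the text-extraction step differs from the DjVu case.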

A few random thoughts about this:
  • Yes, it would be nice to have the capability for PDF-based texts.
  • My experience with command line tools suggests that DjVu files are a lot easier to interact with on the command line than PDFs, so I wonder if it might be prohibitively difficult to code? (This is mere speculation; I have little experience with coding.)
  • I also wonder (@Phe:) whether the existing code is open source and/or available? I'd imagine that would be super useful for anybody trying to recreate it for PDFs, and also, I've wondered about the possibility of hosting it elsewhere. I notice that you have not been active in Wikimedia projects recently, Phe (I hope you will be back at some point!) and I am slightly concerned about what will happen to this incredibly valuable code if you should ever lose interest or ability to support it or host it on Toolforge. Maybe this is not a problem, but I'd appreciate having a bit more documentation about how the hosting works and what it would look like if a transition to another server is ever needed. -Pete (talk) 20:42, 4 October 2022 (UTC)

Queue seems stuck

The match queue seems to be stuck trying to perform a match on a page (California Historical Society Quarterly/Volume 22) that was subject to an edit conflict between myself and ShakespeareFan00. It is not performing matches requested after that.

Is there a way to cancel a job out of the match queue? -Pete (talk) 03:59, 19 October 2022 (UTC)