Help talk:Match and split

From Wikisource


The section "Notes about <pagequality>" might need a bit of discussion about what qualifies as proofread against our page scans. My view is that there is a difference between text proofread against hard copy and text proofread against a scan. I could see some texts from external sites as proofread, for example, but others may contain problems that are difficult to detect. I think we should get a second user to validate the file. Cygnis insignis (talk) 04:18, 10 January 2010 (UTC)

Yes, we do need some criteria, which is one reason that I didn't tell people how to do it, just how to request it. Note that this refers to the Proofread status, not to Validated.
Currently we have our own proofread works that we are putting against a text; similarly, we also have Gutenberg texts to put against scans. billinghurst sDrewth 05:48, 10 January 2010 (UTC)
I don't think that the complete removal of the note is relevant either. There are works here that have been proofread or validated, and are now being matched and split. I think that requesting that the work goes through two further proofreads is overkill, and basically just leaves works sitting in abeyance. There needs to be some allowance and flexibility for works that have already been proofread. — billinghurst sDrewth 14:02, 14 February 2010 (UTC)

Proposed criteria:

  • Requires no header, footer, or other content and formatting to be added or changed.
    • That it has been checked for introduced errors (during or after proofing).
  • That a work has been proofread (or validated)
    • at wikisource.
    • at Gutenberg or by Distributed Proofreaders
    • That instances of formatting conventions, such as ITALIC, are consistent within the transcription, i.e. not just selective conversion of italic to ITALIC or <i>italic</i>.
    • That conversion from HTML and other coding will not introduce errors, and accords with WS standards in page space transcriptions.
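The consistency criterion for italic markup could be checked mechanically before a match is requested. The following is a minimal Python sketch, not part of Match and Split or of any bot; the function names and the three patterns (wiki ''…'', HTML <i>…</i>, and the plain-text _…_ often seen in etexts) are assumptions made for illustration.

```python
import re

# Hypothetical patterns for three common italic conventions.
ITALIC_PATTERNS = {
    # Exactly two apostrophes on each side, so '''bold''' is not counted.
    "wiki": re.compile(r"(?<!')''(?!')(.+?)(?<!')''(?!')"),
    "html": re.compile(r"<i>(.+?)</i>", re.IGNORECASE),
    # Underscores at word boundaries only, so snake_case is not counted.
    "underscore": re.compile(r"(?<!\w)_(\w[^_\n]*?)_(?!\w)"),
}


def italic_conventions(text):
    """Return a dict mapping each convention found in text to its count."""
    counts = {}
    for name, pattern in ITALIC_PATTERNS.items():
        n = len(pattern.findall(text))
        if n:
            counts[name] = n
    return counts


def is_consistent(text):
    """True when at most one italic convention appears in text."""
    return len(italic_conventions(text)) <= 1
```

A text for which is_consistent returns False mixes conventions and would need a pass of global replacement before being matched.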

These criteria address some of the problems I found when using it. Detecting the errors, when compared to an OCR text layer, proved difficult in many cases; it is sometimes quicker to use the regular method. I did a book with this: I gave up on our fragmented version and used the etext at PG. I expected there to be errors from my conversion and from reproofing against the original, since it is easy to get complacent, and someone else found more when they were validating it. Getting one other person to check these M & S procedures would not take long, and it maintains the value of 'validated' that emerged from Page:space transcriptions. Meeting my own standards of thoroughness is not adequate, IMO; I want another to at least glance over it before it is validated. It doesn't bother me if I have indexes that remain unvalidated; proofread status and my own checking are satisfactory, and any errors will be corrected by someone, sometime. No offence intended, to PG or anyone else, but our scan-backed, verifiable, correctable, and carefully validated transcriptions leave the others in the shade. They do provide a flying start to hosting or improving a proofread text via this method, which we are helping to test, and make the other value-adding much easier. There is enough activity for errors to crop up, despite the level of proofreading that has gone before. Cygnis insignis (talk) 15:43, 14 February 2010 (UTC)

I am obviously tired; I cannot sort through the argument to see what is in favour and what is not.
As a note, yep, those of the PG texts that I have used have been of good quality, and I basically work on pasting it into place, going through and doing global replacements to our style, getting it up to scratch before letting ThomasBot at it. Little issues like ° being written as degrees, £300 being 300 pounds. So now I paste a chapter at a time, fix it, and then match it, and split it. The only little quirk is for HWS/HWE splits. — billinghurst sDrewth 16:00, 14 February 2010 (UTC)
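The kind of global replacement described above can be sketched with regular expressions. This is only an illustration in Python: the rule list and the function name are hypothetical, the two rules cover just the two examples mentioned (degrees and pounds), and every change would still need checking against the scan, since "pounds" may be a weight rather than a sum of money.

```python
import re

# Illustrative rules only; each pair is (pattern, replacement).
STYLE_RULES = [
    # "300 pounds" -> "£300" (would also hit weights, so review the diff)
    (re.compile(r"\b(\d[\d,]*) pounds\b"), r"£\1"),
    # "45 degrees" -> "45°"
    (re.compile(r"\b(\d+(?:\.\d+)?) degrees\b"), r"\1°"),
]


def apply_style_rules(text):
    """Apply every rule in order; return the new text and the number of substitutions."""
    total = 0
    for pattern, replacement in STYLE_RULES:
        text, n = pattern.subn(replacement, text)
        total += n
    return text, total
```

Returning the substitution count makes it easy to see whether a pasted chapter needed any changes at all before it is matched and split.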


This is not working out! There are too many variables for it to be a practical solution. Given the accuracy of most OCR, whose errors are obvious, working instead from transcripts of different editions is very inefficient overall: it means dealing with 'corrections', checking for those corrections, skipping the same mistakes that other proofreaders have overlooked, removed hyphens, anons changing spelling, and who knows what else. This disregards the value of OCR and scans and makes a mock proofread (or validated!) text. I'll be direct: I just corrected another result of this procedure and I reckon I could have done it quicker from the OCR. The errors would be obvious as the reader tries to read the text, a supposedly validated text that reinforces the valid criticism of our text integrity. The answer to this is not "well, they can fix it"; they will get annoyed and go elsewhere. Cygnis insignis (talk) 08:53, 9 June 2010 (UTC)

  • This thing creates more problems than it solves. With rare exceptions, and without a great amount of time invested in checking the accuracy, the replacement of the OCR text layer (usually better than 99% accurate) is a very bad idea and a time-wasting venture. It is making reasonable PG texts into faux scan-based texts: the problems will be undetectable, or derive from differences in the edition, and the consequences are not addressed by the users who apply this. It seems like a good idea, especially if one has dealt with very bad text layers, but it becomes horribly inefficient if anything goes wrong. This page should give a very clear caution about the myriad problems that emerge in the hundreds of edits it generates; the solution is nearly always to fix the OCR and ignore whatever existed before. cygnis insignis 09:12, 12 February 2011 (UTC)
I believe that our instructions for Match and Split should clearly state that it should only be undertaken where one can match edition and place of publishing. If they cannot be matched, then it should not be undertaken. For many PG texts that information is not available, so it should be sourced by whatever means. I think that a firm checklist should be completed before the match can progress. — billinghurst sDrewth 10:39, 12 February 2011 (UTC)


The page says "When Page: namespace pages are created the contents of these fields [in the Index page] is used to populate the header and footer fields". But when I've used Match and Split, it has created pages with empty headers and <references/>, which does not match the specified values in the Index page. If I create the pages manually, header and footer are correctly set. Is there some way to have Match and Split populate the fields as desired? --Jellby (talk) 18:37, 8 January 2015 (UTC)

Error: "Match tag not found"

I tried to match with this edit. But when I clicked the "match" link I got an error, "match tag not found." Any idea what's up? -Pete (talk) 20:44, 20 March 2018 (UTC)

OCR text is skewed. The text layer is line one column one, line one column two, line two column one, line two column two, etc. so it fails. — billinghurst sDrewth 05:18, 21 March 2018 (UTC)
Match and Split really only works on complete pages. Because you are trying to start partway through a page it fails because the beginning of the page doesn't match. This particular article is so short that a quick proofread with section markers is more efficient than using Match and Split. Beeswaxcandle (talk) 05:41, 21 March 2018 (UTC)
Due to the lack of identification of the columns, it may be more appropriate to just paste the available text from the main ns page into the page ns page and then retransclude. — billinghurst sDrewth 09:47, 21 March 2018 (UTC)
Got it, thanks. I hadn't looked closely enough to find the columns problem. Beeswaxcandle, I've found in the past sometimes the M&S process is able to figure out when something starts mid-page...but I'm sure that ability has its limits. Thank you both for your insights -- mostly just curious to understand the technology and its limitations. Happy to do this one manually. -Pete (talk) 21:54, 21 March 2018 (UTC)
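For anyone curious about the skewed text layer described above, where the lines run line one column one, line one column two, line two column one, and so on, the following is a minimal Python sketch of how such text could be regrouped by column before pasting. It is not something Match and Split does, the function name is hypothetical, and it assumes the lines alternate strictly between the columns, which a real text layer may not do.

```python
def deinterleave_columns(text, columns=2):
    """Regroup strictly alternating lines into one block of text per column."""
    lines = text.splitlines()
    # Line i belongs to column i % columns; slicing with a step collects them.
    blocks = [lines[c::columns] for c in range(columns)]
    return "\n".join(line for block in blocks for line in block)
```

If the columns have headings, blank lines, or uneven lengths, the alternation breaks and the output would need to be checked against the scan by hand.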