Index talk:Black's Law Dictionary (Second Edition).djvu


On archive.org there is a much sharper version of this book: (external scan). I was going to suggest changing the source file, but the downloaded djvu file is much poorer quality than the images displayed by archive.org. So, anyone proofing this work might save their eyesight by going to this url, changing "n510" to the number of the page they're looking at + 10, enlarging the image and using that to proofread from, or at least to check the illegible bits. Mudbringer (talk) 06:58, 29 November 2015 (UTC)
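The page arithmetic in the comment above can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and it assumes the "n510" token in the archive.org URL is the letter "n" followed by the page number plus 10, exactly as the comment describes. The offset may not hold for every part of the scan.

```python
def archive_leaf_token(page_number: int, offset: int = 10) -> str:
    """Build the 'nNNN' token to substitute for 'n510' in the URL.

    Assumption: the token is 'n' + (page number + 10), per the comment
    above. 'offset' is exposed in case a different shift is needed.
    """
    if page_number < 0:
        raise ValueError("page_number must be non-negative")
    return f"n{page_number + offset}"


# Page 500 maps back to the 'n510' token quoted in the comment.
print(archive_leaf_token(500))
```
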

Also the Googlebooks' scan is much better, the umpteenth edition (1995), reprinted (2008), and with a huge number of typesetting glitches resolved. The scan here is hopeless, having at best sucked up the mistypes ('repiiedly') or else giving up and dropping words. I would not attempt any work here using this scan, after looking at two pages that were blessed as proofed and validated. Shenme (talk) 03:26, 19 May 2020 (UTC)

Which scanned pages were poor? ShakespeareFan00 (talk) 21:38, 19 May 2020 (UTC)
The original JPGs at https://web.archive.org/web/20151017030506/http://blacks.worldfreemansociety.org/2 look reasonable. @Xover: Do you have the skills to do a rebuild? 21:38, 19 May 2020 (UTC)
@ShakespeareFan00: If there's a set of scan images available anywhere I can probably figure out a way to grab them and build a new DjVu. But two caveats: 1) a quick peek did not make obvious what the problem with the current DjVu is, so knowing what to fix is a bit challenging; and, 2) all 1k+ Page: pages have been created, so if we use a different scan with differing pagination we'll have a monster job adjusting and moving Page: pages to match. And given #2, I'm somewhat reluctant to tackle #1 blind. But if someone points me at a bunch of JPEGs I can certainly make sure a DjVu comes out the other end.
@Mudbringer, @Shenme: Can you shed any light on the above? Note that the post-1910 editions are highly unlikely to be usable here as they contain changes and additions that are still protected by copyright. Only pre-1925 editions will have expired, and it looks like that's only the 1st edition (I may be wrong on the 2nd and 3rd edition, I didn't see exact dates on those). --Xover (talk) 05:54, 20 May 2020 (UTC)

┌──────┘
I have only looked at two pages, p. 346 and p. 1236. I was attracted to these as marked proofed/validated and thus I could see both how the pages should look and how to get there.

That there were so many errors remaining in each of these was very concerning. After doing both I felt it couldn't be due *merely* to lacks on the part of the reviewers. The incredible number of tweak edits in the history trying to clean up the scannos rather confirmed the dreadfulness of the scan. Comparing the scan here with the Googlebooks' one showed a much clearer scan, better typesetting, and also that the later editions/reprints had many corrections.

The scans here have so many problems that trying to make it right is too dire a requirement. Even after all the automated edits the "case load" is too great for mere mortals and two passes.

p. 346 [1]: this page forced me to look for other scans. Note the first error. I'd swear the scan here has 'hoards', which makes no sense. Other (later ed.?) scans have 'boards' - they've been corrected. Again, in extremis, words dropped. A first instance of 'i'/'l' being so very hard to distinguish. The typesetting - the type! - is dreadful. The number of '.'->',' should show that.

p. 1236 [2]: After two passes: menu/mean , live/five , 28-14/2845 , 2&7/2847 , Iupon/Ripon, necesslty/necessity , one-founh/one-fourth , astray/estray , it/if , 51/54 , ram;/rank ; the need for corrections is so great, the art required so much, that errors like dropped words are *introduced*: 'making' , "of the roll" ; with so much to do the usual problems are missed also, and certainly one is too exhausted to look up 'estray'.

It's not just the usual 'n'/'u' and 'a'/'s' problems, but 'b'/'h', and others. Harder to see is that lowercase 'l' and 'i' are near indistinguishable, and often were. I had to look up "paroi evidence" to discover it is actually "parol evidence". Under "YA ET NAY", is that 'deniai' bad typesetting or bad type? Under "YARD", 'inciosed'. Below, 'iand'. Elsewhere 'iuuar', 'Engilsh', 'ilcense', 'Coweli', 'hoids', and many many more.

Under "YEA AND NAY", third line, what character _is_ that? And there's another just like that on the same page - "YEAR", fifth line. Another really looks like 'psyment'. It's not just an 's', they don't look near as bad elsewhere. The type is so bad. The ink too?

(After looking closer, I believe I can see a mix of type from different sources, 3 different 'i's for one, and either they just didn't have enough lowercase 'l' or the devil couldn't sort.)

I do hear y'all saying this might be the only earliest-enough copy found. But I'm looking at it and saying it might be too much to ask to use as a source scan. Garbage in, smells out. Shenme (talk) 18:45, 20 May 2020 (UTC)

@Shenme: Thanks for the detailed explanation. I see the problem now: the scan images have been recompressed with a lossy compression algorithm with waay too aggressive settings. Compare the "YEA AND NAY" passage in our DjVu here with the source scan here.
I can almost certainly achieve better results than that (some loss is inevitable, but this is an extreme example); but there are a couple of problems. One is that the uploader of our current DjVu inserted a missing page somewhere among the 1322 pages and didn't document where. If I generate a new DjVu without that page, we'll get a lot of pages that are offset (in addition to missing a page). The second is that since all the Page: pages here have been created, you will not automatically get any OCR improvements. You'd need to manually use the Google OCR button on each page, which would overwrite any formatting and such that had already been done (and no OCR is perfect: it may still be poor enough to be a chore to proofread). --Xover (talk) 20:23, 20 May 2020 (UTC)
I am reassured that this was computer-aided obfuscation, and that would explain the randomness of e.g. 'l' vs. 'i'. The difference in scans is incredible, and the totally strange disappears once the lens is corrected.
The project is in a state where replacement might not be as bad as you suppose. Only *9* pages have had anything but AWB-type generalized fixes applied. And looking at the results of those attempts on those 1300 pages will have you looking for the recycling bin.
Could I try the Google OCR on selected pages, e.g. [[Page:Black's_Law_Dictionary_(Second_Edition).djvu/243|p. 235]], "COMPUTATION" "COMPULSION", to get a comparison, providing diffs y'all could look over?
Would a binary-search for the inserted page find it? I could do that... Shenme (talk) 01:28, 21 May 2020 (UTC)
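For what it's worth, the binary search suggested above could look something like the sketch below. Everything in it is an assumption for illustration: the function name, the `aligned` callback, and the toy page lists are hypothetical, and in practice `aligned(i)` would be a human or a script comparing page i of the current DjVu with page i of the source scan. It only works if every page before the inserted one matches and every page from the inserted one onward does not, which can fail where adjacent pages look alike (blank pages, for example).

```python
from typing import Callable


def find_first_misaligned(n_pages: int, aligned: Callable[[int], bool]) -> int:
    """Return the smallest index whose page no longer matches the source.

    Assumes aligned(i) is True for every i before the inserted page and
    False from the inserted page onward. Returns n_pages if all match.
    """
    lo, hi = 0, n_pages
    while lo < hi:
        mid = (lo + hi) // 2
        if aligned(mid):
            lo = mid + 1   # mismatch, if any, lies to the right of mid
        else:
            hi = mid       # mid is misaligned; the first one is at or before mid
    return lo


# Toy data: a 20-page source scan and a copy with one extra page at index 7.
source = [f"page {i}" for i in range(20)]
current = source[:7] + ["inserted page"] + source[7:]


def pages_match(i: int) -> bool:
    # Hypothetical comparison; real use would compare page images or text.
    return i < len(source) and current[i] == source[i]


print(find_first_misaligned(len(current), pages_match))
```

With roughly 1,300 pages this needs about eleven comparisons rather than a page-by-page check.
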
I'm not seeing one *extra* page here. The pages I sampled check out against the web.archive.org capture. It has the same pages i thru vii and 1 thru 1322 for a total 1328. Note that it does not have a page for either p. iv or viii. Could that/those be the *two* inserted pages here? Shenme (talk) 05:17, 21 May 2020 (UTC)
@Hrishikes: Was the missing page one you had to insert from a second source, or one you had forgotten to add when initially creating the DJVU?
@Shenme, @Xover: — I have updated the file. Please check and give opinion. Sorry for the delay, but I did not get the alert for this ping. Hrishikes (talk) 06:55, 13 December 2020 (UTC)
The scan certainly looks better; to me, at least. Would it be desirable to delete all of the pages currently “not proofread” and replace them with the new text layer, or just run a program to replace them all? (By the way, you didn’t get alerted because the comment wasn’t signed.) TE(æ)A,ea. (talk) 13:43, 13 December 2020 (UTC).
@TE(æ)A,ea.: -- Much work has been done by other editors on these pages. Deleting the pages would cause all that history to be lost. Fresh ocr can easily be obtained by clicking the old Tesseract ocr button (the default one, without icon). This ocr can recognise columns. See my ocr at Page:Black's Law Dictionary (Second Edition).djvu/208 -- Hrishikes (talk) 14:08, 13 December 2020 (UTC)