Wikisource talk:WikiProject 1911 Encyclopædia Britannica


Index:EB1911 - Volume 08.djvu pages needing transcluding

I am doing checks of the proofread works, and EB1911 vol. 8 is one of those on-site. Looking at https://tools.wmflabs.org/checker/?db=enwikisource_p&title=Index:EB1911_-_Volume_08.djvu shows multiple pages that have not been transcluded, and I was wondering whether anyone is interested in undertaking the task. — billinghurst sDrewth 11:56, 3 May 2016 (UTC)

Hi billinghurst, I agree that transcluding pages is good, but would we be better off focussing on articles that have not been created, or pages that have not been proofed yet? Personally, I think it's better to spend time creating an article that didn't exist before than to convert an existing article (which may be 99–100 % accurate) to a transcluded version.
All articles from "A" to "Céspedes y Meneses, Gonzalo de" have been created so far (at time of writing); see Vol 5:10 for the start of articles that need to be created. — DivermanAU (talk) 07:29, 6 May 2016 (UTC)

Automated conversion script - Gutenberg to Wikisource format

To anyone who has used or would like to use the already-proofed text version of the EB1911 at Gutenberg.org (http://www.gutenberg.org/ebooks/search/?sort_order=title&go=Go&query=britannica), which has nearly all the articles from "Andros" to "Mecklenburg": I've been working on an automated script to convert the HTML to Wikisource format, e.g. <i>italic text</i> to ''italic text''. The more complex "style=" statements, author initials etc. are also handled. Features:

  • HTML markup to Wikisource markup
  • Table style conversions
  • addition of extra italics marker if the line has an uneven number of italics markers (otherwise italics don't render properly)
  • All author initials (for vols. 6 & 7) are converted e.g. <div class="author">(A. J. E.)</div> converts to {{EB1911 footer initials|Arthur John Evans|A. J. E.}}
  • article links added
  • smallcaps for A.D. and B.C.
  • subscript and superscript conversion
  • "sidenote" to "EB1911 Shoulder Heading" conversion
  • Section tags automatically added (watch for cases where an existing section tag like "s1" is in use)
  • use of <div align=center> ... </div> for centered text - this allows an equals sign in text where a Template:center won't render
  • convert <div class="condensed"> to "EB1911 fine print"
  • Greek text has the "{{Greek|" template added (where Gutenberg has "span class="grk" title") - but single Greek characters still need the template added manually.
  • Hyphens "-" are converted to en dashes "–" where there are four digits before the hyphen, so 1850-1851 will be converted to 1850–1851 (I added code to handle all occurrences on a line on 3 March 2017)
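
The year-range rule in the last bullet can be sketched roughly as follows. This is an illustrative Python sketch, not the AutoIt source: the function name is hypothetical, and the extra conditions (exactly four digits, with a digit following the hyphen) are assumptions added here to avoid touching ordinary hyphenated words.

```python
import re

# Assumption: a year is exactly four digits (not part of a longer number),
# and the hyphen must be followed by another digit.
YEAR_HYPHEN = re.compile(r"(?<!\d)(\d{4})-(?=\d)")


def hyphens_to_en_dashes(line: str) -> str:
    """Replace the hyphen in year ranges such as 1850-1851 with an en dash."""
    # re.sub replaces every match, so all ranges on the line are handled.
    return YEAR_HYPHEN.sub(r"\1–", line)


if __name__ == "__main__":
    print(hyphens_to_en_dashes("reigned 1850-1851 and again 1860-1861"))
```

Ordinary hyphenated words such as "well-known" are left alone because no four-digit group precedes the hyphen.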

Known limitations:

  • Footnotes are not handled, these have to be done manually.
  • Cases where a template like "smallcaps" etc. spans multiple lines will have to be corrected manually (you can end up with a </span> on the line following)
  • Diagrams have to be added manually (but the caption formatting is handled)

It's not 100% perfect, but I've been using it recently and it's been a great help. I have to add italics manually to variables in most maths articles, as the Gutenberg version doesn't usually include these. As always, check with "Show changes" to see what has been changed before saving. Also be aware that although the Gutenberg version is usually very accurate, it has a few errors too (apart from the lack of italics for math variables and the use of hyphens rather than en dashes for year ranges).

The source code (written in AutoIt; see https://www.autoitscript.com/site/) and executable (Guten2WikiV1.38.exe) as well as some converted text slices are here: https://www.dropbox.com/sh/dssahqtjtqleml9/AACOO55819IOefkYIXafYyZva?dl=0 . Run it from a Windows command-line with the source text as a parameter and it will convert the text into another file with "Wiki-" as a prefix. Or launch the executable and be prompted for the source text name. Use "View page source" when viewing the Gutenberg version and save that as the source text. Feel free to use it, and I'll try to answer any questions you may have. Some likely interested users: @Billinghurst: @PBS: @Bellerophon5685: @DavidBrooks: @Library Guy: @Suslindisambiguator: @Slowking4: @DutchTreat: DivermanAU (talk) 08:46, 14 February 2017 (UTC)

This sounds great. It looks like a fun project. I don't have time to look at it this afternoon, but two questions occur. Links to other articles can be "See FOOBAR", in small caps, or "FooBar (q.v.)" using regular mixed case, which is enforced using the "nosc" parameter. Can those be distinguished?
Also, whenever I've seen them, an "uneven number of italics markers" usually means the italics are opened on one line and closed on the next. When editing in Page space manually, I usually just replace the newline by a space and let auto-wordwrap do the window fitting. I prefer that to closing the italics at the end of one line and reopening them at the start of the next (if nothing else, it makes the space that's inserted by the paginator into a non-italic space). Or are you referring to another phenomenon? DavidBrooks (talk) 21:52, 15 February 2017 (UTC)
(ETA) Sorry, I remembered one more thing. In Page space, I think we're using proper curly apostrophes and quote marks (’ and “”); at least, I have. The raw scans are inconsistent on rendering quotes as straight (ASCII) quotes sometimes, and two tick marks at others. Also, EB1911 has a wide gap after opening and before closing quotes, which I always close up. How does Gutenberg work with those? DavidBrooks (talk) 21:58, 15 February 2017 (UTC)
Hi, good questions! :) Luckily, the Gutenberg version is quite consistent, and for article links I follow these rules: if the Gutenberg source has <span class="sc"><a href="#artlinks">, I replace that string with "{{1911link|" (and substitute the terminator), which produces a link in small caps; if the source has <a href="#artlinks">, I use the "{{11link|" template, which produces a link without small caps. {edit - It does depend on whether the Gutenberg version has the tags for article links; on some later pages, there is a (q.v.) but no link tag.}
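
As a rough illustration of the two link rules just described, here is a Python sketch (not the AutoIt source). The function name is hypothetical, the href is matched literally as quoted in the comment, and the handling of the closing tags ("substitute the terminator") is an assumption about how the terminator is replaced.

```python
import re

# Small-caps links: <span class="sc"><a href="#artlinks">...</a></span>
SC_LINK = re.compile(r'<span class="sc"><a href="#artlinks">(.*?)</a></span>')
# Plain links: <a href="#artlinks">...</a>
PLAIN_LINK = re.compile(r'<a href="#artlinks">(.*?)</a>')


def convert_article_links(line: str) -> str:
    """Convert Gutenberg article links to 1911link / 11link templates."""
    # The small-caps form must be handled first, because it contains the
    # plain form inside it.
    line = SC_LINK.sub(r"{{1911link|\1}}", line)
    return PLAIN_LINK.sub(r"{{11link|\1}}", line)


if __name__ == "__main__":
    print(convert_article_links('See <span class="sc"><a href="#artlinks">Foobar</a></span>.'))
    print(convert_article_links('<a href="#artlinks">FooBar</a> (q.v.)'))
```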
The italics-handling code checks whether the number of italics start tags on a line matches the number of italics end tags. If there isn't a match, I add an italics end tag to the end of the line and add an italics start tag to the start of the next line. This was the easiest way to automate the fixing up of italics, and it leaves the line-break intact. I had previously used macros in Notepad++ to partly automate the conversion, but italics would often get out-of-sync on a page and I had to manually fix those (like you describe above).
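
The close-and-reopen approach can be sketched as below. This is a minimal Python sketch, not the actual script: it assumes the text has already been converted to wiki markup (so italics are the '' marker), that bold markers (''') do not occur, and the function name is hypothetical.

```python
def balance_italics(lines):
    """Close italics left open on a line and reopen them on the next line."""
    balanced = []
    carry = False  # True when the previous line ended inside italics
    for line in lines:
        if carry:
            line = "''" + line          # reopen italics at the start
        # An odd number of '' markers means italics are still open.
        carry = line.count("''") % 2 == 1
        if carry:
            line = line + "''"          # close italics at the end
        balanced.append(line)
    return balanced


if __name__ == "__main__":
    for fixed in balance_italics(["the ''Historia", "Naturalis'' of Pliny"]):
        print(fixed)
```

Each line break is kept, which matches the stated goal of leaving the soft line breaks intact.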
The Gutenberg version uses curly quotes and apostrophes (without spaces between the word and the quote mark) as HTML character entities (e.g. &ldquo; &rdquo; &rsquo; etc.). I just convert these to the literal equivalents “ ” ’ . Thanks for your interest. DivermanAU (talk) 06:59, 16 February 2017 (UTC)
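
For illustration, the entity-to-literal step might look like this in Python (a sketch, not the AutoIt source). The mapping covers only the four curly-quote entities and is an assumption; the real script may handle more. The function name is hypothetical.

```python
# Assumed mapping: only the curly quote and apostrophe entities.
QUOTE_ENTITIES = {
    "&ldquo;": "\u201c",  # left double quotation mark
    "&rdquo;": "\u201d",  # right double quotation mark
    "&lsquo;": "\u2018",  # left single quotation mark
    "&rsquo;": "\u2019",  # right single quotation mark / apostrophe
}


def convert_quote_entities(text: str) -> str:
    """Replace curly-quote HTML entities with their literal characters."""
    for entity, literal in QUOTE_ENTITIES.items():
        text = text.replace(entity, literal)
    return text


if __name__ == "__main__":
    print(convert_quote_entities("&ldquo;Don&rsquo;t&rdquo;"))
```

An explicit table is used here rather than a general unescape so that entities such as &lt; and &amp; are left untouched.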
that’s excellent - i note you have gotten to volume 6, this should speed the work to volume 17.
deleting the soft carriage returns is an improvement gutenberg does not do; curly quotes and apostrophes are a problem, we have find and replace in VE to convert to straight.
i’m creating articles now, but will try to pitch in in a few months. Slowking4SvG's revenge 16:10, 1 March 2017 (UTC)
I noticed a while back that there’s a gap in the Gutenberg version for articles in-between “Conduction, Electric” and “Constantine” (vol 6. pages 890 onwards), so I'm working on those currently to try and finish vol. 6. (http://www.theodora.com/encyclopedia/ is useful in these cases — though it’s not as accurate as Gutenberg).
Regarding quotes, the consensus — see "Typographical changes in Page space" section above with comments by @PBS:, @DavidBrooks: and myself — seemed to be to maintain the curly quotes and apostrophes. The OCR scan usually has curly quotes & apostrophes, Gutenberg has them and they are more faithful to the original printed book.
The soft carriage returns can be eliminated by the “clean up” script; Billinghurst added that functionality to my JavaScript page User:DivermanAU/common.js. But do some users prefer to maintain the soft line breaks as it makes it easier to proof? (The line breaks still match the print book when editing.) Currently the script reads one line at a time and writes out each line terminated with <carriage return><line-feed> characters. This could be changed to replace the terminating <carriage return><line-feed> with a <space> character, but you'd need to check it didn't have adverse effects (e.g. in tables etc.). DivermanAU (talk) 23:36, 1 March 2017 (UTC) (Update: I think it's best to maintain the soft line breaks; that way it is easier to compare changes that have been made by using the "Show changes" button. If needed, the line breaks can be removed by the TemplateScript "clean up", or with a replace (\r\n with <space>) in Notepad++ if just one section needs the line breaks removed.) DivermanAU (talk) 22:09, 2 March 2017 (UTC)
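
The optional line-break removal mentioned above could be sketched like this in Python. It is only an illustration with a hypothetical function name; it assumes paragraphs are separated by blank lines and makes no attempt to protect tables, which is the adverse effect warned about above.

```python
def join_soft_breaks(text: str) -> str:
    """Replace single line breaks with spaces, keeping blank-line paragraph breaks."""
    # Normalise Windows line endings first.
    paragraphs = text.replace("\r\n", "\n").split("\n\n")
    # Within each paragraph, join the lines with a single space.
    return "\n\n".join(" ".join(p.split("\n")) for p in paragraphs)


if __name__ == "__main__":
    print(join_soft_breaks("one\r\ntwo\r\n\r\nthree\r\nfour"))
```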
I completely agree that the soft line breaks make proof-reading much easier. Unfortunately editors in the past have often removed them (using the editor's word-wrap) for various reasons of their own. I only fuss about the close-and-reopen-italics problem because it introduces a small semantic error: I assume the newline is replaced in HTML with a non-italic space. That's a distinction that's probably (literally) invisible though, and when or if I go back to WS I'll stop worrying about that. It's probably more significant to worry whether a final period or comma is part of the citation (and should be italic) or not. DavidBrooks (talk) 00:01, 2 March 2017 (UTC)

Problems with Gutenberg encoding

I find problems with character encodings in the Gutenberg material. Most recently ṃ when it should have been ṁ, and ṅ when it should have been ṇ (v. 13, p. 501). I think I've only found problems with dotted characters so far. Bob Burkhardt (talk) 18:04, 3 August 2017 (UTC)

Titles in name of the article

History of the article 1911 Encyclopædia Britannica/Moltke, Helmuth Carl Bernhard, Count von:

  • 22:11, 24 January 2009 Bob Burkhardt
  • 22:15, 24 January 2009 Bob Burkhardt (1911 Encyclopædia Britannica/von Moltke, Helmuth Carl Bernhard, Count moved to 1911 Encyclopædia Britannica/Moltke, Helmuth Carl Bernhard, Count von: von should be at the end rather than the beginning)
  • 14:39, 19 March 2010 Dan Polansky (1911 Encyclopædia Britannica/Moltke, Helmuth Carl Bernhard, Count von moved to 1911 Encyclopædia Britannica/Moltke, Helmuth Carl Bernhard: Keep the boldfaced text as the headword.)
  • 16:48, 30 May 2011 Suslindisambiguator
  • 15:21, 24 December 2012 MpaaBot (Bot Request: Volume information for EB1911)
  • 23:53, 4 November 2014 Library Guy (Library Guy moved page 1911 Encyclopædia Britannica/Moltke, Helmuth Carl Bernhard to 1911 Encyclopædia Britannica/Moltke, Helmuth Carl Bernhard, Count von over redirect: now include title in article label)

The start of this article is:

MOLTKE, HELMUTH CARL BERNHARD, Count von (1800-1891), Prussian field marshal...

@user:Library Guy you chose to move this article from just the bold part of the name to include the part in small caps. While the longer name is useful if disambiguation is needed, why do you think it is a good idea to move pages so that the page's name does not reflect the bold title used in the original? -- PBS (talk) 12:49, 25 March 2017 (UTC)

Yes. And I discussed this with people also, some years ago, and I also put it in the style guide. The problem is, some of the entries just don't read well if the small-caps portion is omitted, and here, while I think an English-speaking person might be perfectly comfortable calling him Helmuth Carl Bernhard Moltke, I think the von was an essential part of his name, and the EB1911 editors would have agreed. Sometimes the name associated with the title is the same as a person's surname and in the boldface the name is repeated twice and it reads very oddly when you put the last name last without including the portion in small-caps. See 1 in the Style Manual. Notice it doesn't say all small-caps. Sometimes the small-caps are just alternative names etc. It can be a judgement call sometimes, even with just the boldface. For example, Reproductive System, I decided to omit part of the boldface, since it looked like a production error. Other similar articles I think put "in Anatomy" in small caps or body font. This is the only article where I have seen this happen. All the links to it omit "in Anatomy". Bob Burkhardt (talk) 15:03, 25 March 2017 (UTC)

Two archiving bots available

FYI. The bots user:Wikisource-bot and user:SpBot are both available to automatically archive if you need them to do so. — billinghurst sDrewth 15:18, 25 March 2017 (UTC)

Thanks, I guessed there were but did not know their names. I have installed the config file for user:Wikisource-bot (chosen because the documentation was better). I hope it works :-) --PBS (talk) 16:00, 25 March 2017 (UTC)