Wikisource talk:WikiProject 1911 Encyclopædia Britannica



Copy and pasting text from the searchlight version

copied from my Wikipedia user talk page:

OK, I've spent the day working on the EB1911 template and the wikisource version. I actually began by copy and pasting text from the searchlight version

https://www.studylight.org/encyclopedias/bri/

under the impression that I would easily be able to link them to the appropriate templates (other than John Schonfeld, a random article that I remembered had this problem with the template, and the (accent) Eduard Lartet article, all the articles I worked on were in the X-Z field). As fate would have it, the only articles that had template links were ones that were not listed on searchlight under Z.

https://www.studylight.org/encyclopedias/bri/browse.cgi?l=z

Which led me to one of the problems with that site: sometimes the articles are placed under the first letter of the given name, e.g. Aaron Burr is put under A rather than B. (This seems to be particularly true of Hispanic and German names.) Also, articles for letters like Z apparently are not available, and the entire section of articles starting with X is not available from the contents page (I had to use the search function).

All the articles that had a parallel with an EB1911 article X-Z have been linked up. In the majority of cases I had to create the article on wikisource using searchlight. In one instance it was another language confusion: Xàtiva needed to be linked to

https://en.wikisource.org/wiki/1911_Encyclop%C3%A6dia_Britannica/J%C3%A1tiva

which already existed. Another, Zerhoun, does have a listed EB1911 article under the name Zarhón, and yet I cannot find it on searchlight. These and the other remaining articles needing a template link illustrate the problems we have been having: there is no article in EB1911 for Zona Austral or "Southern Zone"; it could be under the EB1911 article Chile, but I do not want to link it without being sure that that was where the text was from. The same goes for Karl Eduard Zachariae von Lingenthal and Alexander Ypsilantis: there are EB1911 articles for the former's father and the latter's family, but I'm not sure if I should link to those pages. I also cannot find Zhetysu despite searching the dozen or so variant spellings; nada for Johann Zahn, Caroline Yale and Zapotec peoples.

For the new wikisource articles I have created, I only transferred over the text and the bare metadata: predecessor, successor and wikipedia article. They probably need to be proofread and given whatever treatment the wikisource team usually gives to its articles. Also, I've been working on an EB1911 project with John Mark Ockerbloom on the Online Books Page; this is our preliminary draft

http://onlinebooks.library.upenn.edu/webbin/metabook?id=britannica11

Hope this helps.--Bellerophon5685 (talk) 03:14, 20 April 2016 (UTC)

Current state of the XYZ articles that have the EB1911 template and need links

https://en.wikipedia.org/wiki/Category:Wikipedia_articles_incorporating_a_citation_from_the_1911_Encyclopaedia_Britannica_with_no_article_parameter?from=Xa

--Bellerophon5685 (talk) 03:16, 20 April 2016 (UTC)

I have answered you briefly at w:Wikipedia talk:WikiProject Encyclopaedia Britannica#Copy and pasting text from the searchlight version.

You have to be careful copying text from https://www.studylight.org because it may or may not be accurate. The way I would copy text from the studylight.org Z index would be to compare the names with those in the subdirectories at 1911 Encyclopædia Britannica/Vol 28 VETCH to ZYMOTIC DISEASES, e.g. 1911 Encyclopædia Britannica/Vol 28:16.

Let us suppose that you are interested in creating 1911 Encyclopædia Britannica/Zaisan. Go to 1911 Encyclopædia Britannica/Vol 28 VETCH to ZYMOTIC DISEASES and follow the link "scan index" to the djvu index. Then find the appropriate page (951), Page:EB1911 - Volume 28.djvu/978, and check it against the studylight page Zaisan. Use the EB1911 MOS to format the text, and then follow the advice at Using transclusion on how to use the template {{EB1911set}} to populate 1911 Encyclopædia Britannica/Zaisan --PBS (talk) 21:50, 20 April 2016 (UTC)

  • If the text in both html and scan is already at Page:EB1911 - Volume 28.djvu/978 then what is the point of creating 1911 Encyclopædia Britannica/Zaisan? Couldn't we just link it to the djvu page or better yet Internet Archive?--Bellerophon5685 (talk) 00:07, 21 April 2016 (UTC)
I agree. I think the best bang for buck is to use the transclusions from the Page space versions, which themselves are to be proofread based on some pretty good scans, as described in this project page's transclusion sub-page. It seems to me that using studylight is a stop-gap. I concede that it has the advantage of getting mainspace text installed more quickly, and proofreading can be tedious, but IMO the end-game should be all transclusions so it ends up being throw-away work. Also, I don't know about the quality of studylight (but have no reason to doubt it yet). Its predecessors, such as LoveToKnow, were pretty corrupt. We copied text from them back in 2004 or so and ended up with a lot of remedial work. DavidBrooks (talk) 06:34, 21 April 2016 (UTC)
well, it is not thrown away, it is copy pasted to the side by side page view that is transcluded. there are many old articles like this so a few more won’t matter. it would be a stop gap, to link to wikipedia, until the proofreading is done. an example is [1]. Slowking4₮₳₤₭ 11:33, 28 April 2016 (UTC)
Another thought: you might compromise by using studylight as a source to copy into the Page space version, and then complete the job by using transclusion into article space (the trans syntax can be tricky but you soon get used to it). But, again, you still need to convert to wiki markup and be concerned about the quality. I recently verified three pages for no particular reason, one of which is Page:EB1911 - Volume 01.djvu/154 (didn't do the transclusion yet). Looking at Accountants, studylight did a good job with the archaic English, but there is some obviously mangled text (Saint 011aves!) and there are layout errors. DavidBrooks (talk) 16:41, 21 April 2016 (UTC)
@user:Bellerophon5685 carrying on from my last posting, I will run through how to use transclusion:
  • open a window (or a tab) onto Wikisource:WikiProject 1911 Encyclopædia Britannica/Transclusion and follow the instructions
  • in another window go to the djvu page, in this case Page:EB1911 - Volume 28.djvu/978
  • alter the "## headings ##" if necessary to the ones used in the Index page (in this case 1911 Encyclopædia Britannica/Vol 28:16)
  • edit the article you are interested in to match the facsimile copy to the right (I usually copy the link into another window so the text can be read more easily). I usually do the whole page at once but for this example I have just done the article "Zaisan" (diff)
  • using the index page, make a note of the volume number and the preceding and succeeding page/article names, in this case volume=28 previous=Zaire next=Zaleucus. On the transclusion page click on {{EB1911set}} and read the documentation of how to use this template; following those instructions, make a note of the page number as described. Then in the other window (the one open to the index page) fill in the template as described:
        {{subst:EB1911set
         |volume    = 28
         |previous  = Zaire
         |next      = Zaleucus 
         |wikipedia = Zaysan (town)
         |extra_notes= 
         |from= 978
         |to= 978
        }}
-- PBS (talk) 20:48, 21 April 2016 (UTC)

Using http://www.theodora.com/encyclopedia/ as a source of mostly-proofed text

Sort of following on from the discussion above — "Copy and pasting text from the searchlight version" — I’ve found the text at www.theodora.com/encyclopedia/ to be a good source. Right-clicking and choosing "Page Source" on an article page to get the markup code shows it’s formatted already with bold and italics; I use Notepad++ to do a global replace on <b> for ''' etc. — you could record a macro if desired.

I’ve found theodora superior to the OCR text in the wikisource version. I used it to help edit https://en.wikisource.org/wiki/Page:EB1911_-_Volume_28.djvu/978 . Before using the theodora version, I did a rough proof of the auto-generated text of that page and saved it (still as not-proofed). When I pasted in the theodora version (and copied back section headers etc.), I found about 22 corrections! e.g. “Pythageras” to “Pythagoras”, “Zaieucus” to “Zaleucus”.

(edit: I’m referring to volumes which don’t exist in Gutenberg (http://www.gutenberg.org/ebooks/search/?query=Britannica&go=Go) which is the best source of proofed text, but covers only articles Andros–Magnetism.) DivermanAU (talk) 05:51, 27 April 2016 (UTC)
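For anyone without Notepad++, the global replace described above can be sketched in Python. This is only an illustration, not the workflow actually used: the function name and the tag table are hypothetical, and only the <b> and <i> tags mentioned in the comment are handled.

```python
import re

# Hypothetical mapping: only the <b>/<i> replacements mentioned above.
TAG_TO_WIKI = {"b": "'''", "i": "''"}


def html_inline_to_wiki(text: str) -> str:
    """Replace opening and closing <b>/<i> tags with wiki bold/italic markers."""
    for tag, marker in TAG_TO_WIKI.items():
        # </?b> matches <b> and </b> but not longer tags such as <br>.
        text = re.sub(rf"</?{tag}>", marker, text, flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    print(html_inline_to_wiki("<b>ZAISAN</b>, a <i>lake</i>"))
```

Tags with attributes and any other markup are left untouched, so the result still needs checking against the scan.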

Thanks for the information -- PBS (talk) 07:24, 30 April 2016 (UTC)

Index:EB1911 - Volume 08.djvu pages needing transcluding

I am doing checks of the proofread works, and EB1911 vol. 8 is one of those on-site. Looking at https://tools.wmflabs.org/checker/?db=enwikisource_p&title=Index:EB1911_-_Volume_08.djvu shows multiple pages that have not been transcluded, and I was wondering whether anyone is interested in undertaking the task. — billinghurst sDrewth 11:56, 3 May 2016 (UTC)

Hi billingshurst, I agree that transcluding pages is good — but would we be better off by focussing on articles that have not been created, or pages that have not been proofed yet? Personally, I think it's better to spend time creating an article that didn't exist before than to convert an existing article (which may be 99–100 % accurate) to a transcluded version.
All articles from "A" to "Céspedes y Meneses, Gonzalo de" have been created so far (at time of writing) see Vol 5:10 for the start of articles that need to be created. — DivermanAU (talk) 07:29, 6 May 2016 (UTC)

Automated conversion script - Gutenberg to Wikisource format

To anyone who has used or would like to use the already-proofed text version of the EB1911 at Gutenberg.org (http://www.gutenberg.org/ebooks/search/?sort_order=title&go=Go&query=britannica), which has nearly all the articles from "Andros" to "Mecklenberg": I've been working on an automated script to convert HTML to Wikisource format, e.g. <i>italic text</i> to ''italic text''; the more complex "style=" statements, author initials etc. are handled as well. Features:

  • HTML markup to Wikisource markup
  • Table style conversions
  • addition of extra italics marker if the line has an uneven number of italics markers (otherwise italics don't render properly)
  • All author initials (for vols. 6 & 7) are converted e.g. <div class="author">(A. J. E.)</div> converts to {{EB1911 footer initials|Arthur John Evans|A. J. E.}}
  • article links added
  • smallcaps for A.D. and B.C.
  • subscript and superscript conversion
  • "sidenote" to "EB1911 Shoulder Heading" conversion
  • Section tags automatically added (watch for cases where an existing section tag like "s1" is in use)
  • use of <div align=center> ... </div> for centered text - this allows an equals sign in text where a Template:center won't render
  • convert <div class="condensed"> to "EB1911 fine print"
  • Greek text has the "{{Greek|" template added (where Gutenberg has "span class="grk" title") - but single Greek characters still need the template added manually.
  • Hyphens "-" are converted to ndashes "–" where there are four numbers before the hyphen, so 1850-1851 will be converted to 1850–1851 (I added code to handle all occurrences on a line on 3 March 2017)
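As an illustration of the year-range rule in the last bullet, here is a minimal Python sketch. It is not the AutoIt source: the function name is hypothetical, and the requirement that a digit follows the hyphen is an added assumption to avoid touching things like "1850-style".

```python
import re

# Assumption: exactly four digits before the hyphen and at least one digit after it.
YEAR_RANGE = re.compile(r"(?<!\d)(\d{4})-(?=\d)")


def hyphen_to_ndash(line: str) -> str:
    """Convert the hyphen in every year range on the line to an en dash (U+2013)."""
    return YEAR_RANGE.sub("\\1\u2013", line)


if __name__ == "__main__":
    print(hyphen_to_ndash("ruled 1850-1851 and again 1860-61"))
```

Because re.sub replaces every match, all occurrences on a line are handled in one pass.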

Known limitations:

  • Footnotes are not handled, these have to be done manually.
  • Cases where a use of a template like "smallcaps" etc. spans multiple lines will have to be corrected manually (you can end up with a </span> on the line following)
  • Diagrams have to be added manually (but the caption formatting is handled)

It's not 100% perfect, but I've been using it recently and it's been a great help. I have to add italics manually to variables in most maths articles, as the Gutenberg version doesn't usually include these. As always, check with "Show changes" to see what has been changed before saving. Also be aware that although the Gutenberg version is usually very accurate, it has a few errors too (apart from the lack of italics for math variables and hyphens, not ndashes, for year ranges).
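The author-initials conversion listed among the features could look roughly like the following Python sketch. It is a guess at the approach, not the script's code: the lookup table holds only the one pair quoted in the feature list, and the names AUTHORS and convert_author_initials are hypothetical.

```python
import re

# Only this entry comes from the discussion; a real table would need every contributor.
AUTHORS = {"A. J. E.": "Arthur John Evans"}

AUTHOR_DIV = re.compile(r'<div class="author">\(([^)]+)\)</div>')


def convert_author_initials(line: str) -> str:
    """Turn a Gutenberg author div into an {{EB1911 footer initials}} template call."""

    def replace(match: re.Match) -> str:
        initials = match.group(1)
        name = AUTHORS.get(initials)
        if name is None:
            # Unknown initials are left as-is so they can be fixed by hand.
            return match.group(0)
        return "{{EB1911 footer initials|%s|%s}}" % (name, initials)

    return AUTHOR_DIV.sub(replace, line)


if __name__ == "__main__":
    print(convert_author_initials('<div class="author">(A. J. E.)</div>'))
```

Leaving unknown initials unconverted keeps them visible in "Show changes" instead of silently producing a wrong attribution.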

The source code (written in AutoIt see https://www.autoitscript.com/site/) and executable (Guten2WikiV1.38.exe) as well as some converted text slices are here: https://www.dropbox.com/sh/dssahqtjtqleml9/AACOO55819IOefkYIXafYyZva?dl=0 . Run it from a Windows command-line with the source text as a parameter and it will convert the text into another file with "Wiki-" as a prefix. Or launch the executable and be prompted for the source text name. Use "View page source" when viewing the Gutenberg version and save that as the source text. Feel free to use and I'll try and answer questions you may have. Some likely interested users: @Billinghurst: @PBS: @Bellerophon5685: @DavidBrooks: @Library Guy: @Suslindisambiguator: @Slowking4: @DutchTreat: DivermanAU (talk) 08:46, 14 February 2017 (UTC)

This sounds great. It looks like a fun project. I don't have time to look at it this afternoon, but two questions occur. Links to other articles can be "See FOOBAR", in small caps, or "FooBar (q.v.)" using regular mixed case, which is enforced using the "nosc" parameter. Can those be distinguished?
Also, whenever I've seen them, an "uneven number of italics markers" usually means the italics are opened on one line and closed on the next. When editing in Page space manually, I usually just replace the newline by a space and let auto-wordwrap do the window fitting. I prefer that to closing the italics at the end of one line and reopening them at the start of the next (if nothing else, it makes the space that's inserted by the paginator into a non-italic space). Or are you referring to another phenomenon? DavidBrooks (talk) 21:52, 15 February 2017 (UTC)
(ETA) Sorry, I remembered one more thing. In Page space, I think we're using proper curly apostrophes and quote marks (’ and “”); at least, I have. The raw scans are inconsistent on rendering quotes as straight (ASCII) quotes sometimes, and two tick marks at others. Also, EB1911 has a wide gap after opening and before closing quotes, which I always close up. How does Gutenberg work with those? DavidBrooks (talk) 21:58, 15 February 2017 (UTC)
Hi -good questions! :) Luckily, the Gutenberg version is quite consistent and for article links, I follow these rules: if the Gutenberg source has <span class="sc"><a href="#artlinks"> I replace that string with "{{1911link|" (and substitute the terminator) which produces a link in smallcaps; if the source has <a href="#artlinks">, I use the "{{11link|" template which produces a link without small-caps. {edit - It does depend on whether the Gutenberg version has the tags for article links — on some later pages, there is a (q.v.) but no link tag.}
The italics-handling code adds an italics tag to the end of the line if there is a non-matching number of italics start and italics end tags on the line. If there isn't a match, I add an italics end tag to the end of the line and add an italics start tag to the start of the next line. This was the easiest way to automate the fixing up of italics, and it leaves the line-break intact. I had previously used macros in Notepad++ to partly automate the conversion but italics would often get out-of-sync on a page and I had to manually fix those (like you describe above).
The Gutenberg version uses curly quotes and apostrophes (without spaces between the word and the quote mark) as html characters (e.g. &ldquo; &rdquo; &rsquo; etc.). I just convert these to the literal equivalents “ ” ’ . Thanks for your interest. DivermanAU (talk) 06:59, 16 February 2017 (UTC)
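The close-and-reopen handling of italics described in this reply can be sketched in Python as below. This is an illustration under assumptions, not the AutoIt code: it works on text already converted to wiki markup, counts only runs of exactly two apostrophes as italics markers, and ignores bold-italic (five apostrophes).

```python
import re

# Assumption: an italics marker is a run of exactly two apostrophes.
ITALIC = re.compile(r"(?<!')''(?!')")


def balance_italics(lines: list[str]) -> list[str]:
    """Close unbalanced italics at the end of a line and reopen them on the next."""
    out = []
    reopen = False  # True when the previous line had to be closed artificially
    for line in lines:
        if reopen:
            line = "''" + line
        reopen = len(ITALIC.findall(line)) % 2 == 1
        if reopen:
            line = line + "''"
        out.append(line)
    return out


if __name__ == "__main__":
    for fixed in balance_italics(["see ''Journal of", "Hellenic Studies'' (1901)"]):
        print(fixed)
```

Each output line then has an even number of markers, and the line breaks stay intact.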
that’s excellent - i note you have gotten to volume 6, this should speed the work to volume 17.
deleting the soft carriage returns is an improvement Gutenberg does not do; curly quotes and apostrophes are a problem, we have find and replace in VE to convert to straight.
i’m creating articles now, but will try to pitch in in a few months. Slowking4SvG's revenge 16:10, 1 March 2017 (UTC)
I noticed a while back that there’s a gap in the Gutenberg version for articles in-between “Conduction, Electric” and “Constantine” (vol 6. pages 890 onwards), so I'm working on those currently to try and finish vol. 6. (http://www.theodora.com/encyclopedia/ is useful in these cases — though it’s not as accurate as Gutenberg).
Regarding quotes, the consensus — see "Typographical changes in Page space" section above with comments by @PBS:, @DavidBrooks: and myself — seemed to be to maintain the curly quotes and apostrophes. The OCR scan usually has curly quotes & apostrophes, Gutenberg has them and they are more faithful to the original printed book.
The soft carriage returns can be eliminated by the “clean up” script, Billinghurst added that functionality to my JavaScript page User:DivermanAU/common.js. But do some users prefer to maintain the soft line breaks as it makes it easier to proof? (The line breaks still match the print book when editing). Currently the script reads one line at a time and writes out each line terminated with <carriage return><line-feed> characters. This could be changed to replace the terminating <carriage return><line-feed> with a <space> character, but you'd need to check it didn't have adverse effects (e.g. in tables etc.). DivermanAU (talk) 23:36, 1 March 2017 (UTC) (update — I think it's best to maintain the soft line breaks, that way it makes it easier to compare changes that have been made by using the "Show changes" button. If needed, the linebreaks can be removed by the TemplateScript "clean up", or with a replace (\r\n with <space>) in NotePad++ if just one section needs the linebreaks removed.) DivermanAU (talk) 22:09, 2 March 2017 (UTC)
I completely agree that the soft line breaks make proof-reading much easier. Unfortunately editors in the past have often removed them (using the editor's word-wrap) for various reasons of their own. I only fuss about the close-and-reopen-italics problem because it introduces a small semantic error: I assume the newline is replaced in HTML with a non-italic space. That's a distinction that's probably (literally) invisible though, and when or if I go back to WS I'll stop worrying about that. It's probably more significant to worry whether a final period or comma is part of the citation (and should be italic) or not. DavidBrooks (talk) 00:01, 2 March 2017 (UTC)
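For cases where one section does need its soft line breaks removed, the replace discussed above (line terminator to space) can be sketched in Python. This is a hypothetical helper, not the "clean up" TemplateScript: it assumes blank lines separate paragraphs and should be preserved, and it has not been checked against tables or other markup where newlines are significant.

```python
import re


def join_soft_breaks(text: str) -> str:
    """Join the lines of each paragraph with spaces, keeping blank-line paragraph breaks."""
    text = text.replace("\r\n", "\n")  # normalise <carriage return><line-feed>
    paragraphs = re.split(r"\n{2,}", text)
    joined = [" ".join(line for line in p.split("\n") if line) for p in paragraphs]
    return "\n\n".join(joined)


if __name__ == "__main__":
    print(join_soft_breaks("one\r\ntwo\r\n\r\nthree\r\nfour"))
```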

Titles in name of the article

History of the article 1911 Encyclopædia Britannica/Moltke, Helmuth Carl Bernhard, Count von:

  • 22:11, 24 January 2009 Bob Burkhardt (talk | contribs)
  • 22:15, 24 January 2009 Bob Burkhardt (talk | contribs) (1911 Encyclopædia Britannica/von Moltke, Helmuth Carl Bernhard, Count moved to 1911 Encyclopædia Britannica/Moltke, Helmuth Carl Bernhard, Count von: von should be at the end rather than the beginning)
  • 14:39, 19 March 2010 Dan Polansky (talk | contribs) (1911 Encyclopædia Britannica/Moltke, Helmuth Carl Bernhard, Count von moved to 1911 Encyclopædia Britannica/Moltke, Helmuth Carl Bernhard: Keep the boldfaced text as the headword.)
  • 16:48, 30 May 2011 Suslindisambiguator
  • 15:21, 24 December 2012 MpaaBot (Bot Request: Volume information for EB1911)
  • 23:53, 4 November 2014 Library Guy (Library Guy moved page 1911 Encyclopædia Britannica/Moltke, Helmuth Carl Bernhard to 1911 Encyclopædia Britannica/Moltke, Helmuth Carl Bernhard, Count von over redirect: now include title in article label)

The start of this article is:

MOLTKE, HELMUTH CARL BERNHARD, Count von (1800-1891), Prussian field marshal...

@user:Library Guy you chose to move this article from just the bold part of the name to include the part in small caps. While the longer name is useful if disambiguation is needed, why do you think it a good idea to move pages so that the page's name does not reflect the bold title used in the original? -- PBS (talk) 12:49, 25 March 2017 (UTC)

Yes. And I discussed this with people also, some years ago, and I also put it in the style guide. The problem is, some of the entries just don't read well if the small-caps portion is omitted, and here, while I think an English-speaking person might be perfectly comfortable calling him Helmuth Carl Bernhard Moltke, I think the von was an essential part of his name, and the EB1911 editors would have agreed. Sometimes the name associated with the title is the same as a person's surname and in the boldface the name is repeated twice and it reads very oddly when you put the last name last without including the portion in small-caps. See 1 in the Style Manual. Notice it doesn't say all small-caps. Sometimes the small-caps are just alternative names etc. It can be a judgement call sometimes, even with just the boldface. For example, Reproductive System, I decided to omit part of the boldface, since it looked like a production error. Other similar articles I think put "in Anatomy" in small caps or body font. This is the only article where I have seen this happen. All the links to it omit "in Anatomy". Bob Burkhardt (talk) 15:03, 25 March 2017 (UTC)

Two archiving bots available

FYI. The bots user:Wikisource-bot and user:SpBot are both available to automatically archive if you need them to do so. — billinghurst sDrewth 15:18, 25 March 2017 (UTC)

Thanks, I guessed there were but did not know their names. I have installed the config file for user:Wikisource-bot (chosen because the documentation was better). I hope it works :-) --PBS (talk) 16:00, 25 March 2017 (UTC)