Index talk:The Divine Pymander (1650).djvu

Missing pages

According to the catalog record for this work, the missing pages are available in MS. form. Has anybody tried contacting the Wellcome Library to ask if the MS. pages have been digitized? If we can track down the pages I can build a new DjVu file that includes them in the appropriate positions.

I also note that this particular scan has not properly cropped the page images, leaving large black borders around each page. It would be a slow tedious manual process to remove them, but if I'm redoing the DjVu anyway I might be persuaded to do it if there's any interest (just no promises on when I'll get it done). --Xover (talk) 07:49, 8 December 2019 (UTC)

Hi, I just saw your entry on the discussion page of The Divine Pymander in XVII books Everard John French 1650 The Corpus Hermeticum.pdf
I had kinda given up on finishing the proofreading of this important text and was going to just publish a PDF on my side. There is interest in it, a lot of it, believe me. Could you help me finish inserting the missing pages and a better scan, as you mentioned? I'm still a beginner on Wikisource, but I would be able to get the missing MS. pages from the Wellcome Trust if needed and if they're in the book.
How do we integrate new pages? Do we need to reupload the pages and move the already proofread pages? Nazmifr (talk) 20:42, 14 December 2019 (UTC)
@Nazmifr: Given a set of image files, I have tools that can generate a DjVu file (DjVu is a different file format that works roughly like PDF does) with an OCR text layer. So the process here would be to get ahold of images of the missing pages from Wellcome; combine them with the page images for the existing PDF file; generate a DjVu file from the combined set of images; upload that as a new file on Commons; and then move the Index: and already-proofread pages over to the new file. We lack any good tools for working with PDF files in this way or we could have just updated the existing PDF, but since there aren't all that many pages proofread already it is a surmountable task to migrate them.
In any case, if you can get the scan images of the missing pages from Wellcome, I can take care of all the technical bits there. I see the Wellcome Library tends to upload copies of their scans to the Internet Archive, which would work great for this too, but any place where we can grab JPEG or other common format image files will work fine. The higher the resolution the better: for typeset text the OCR process needs all the resolution it can get, and for handwritten pages (which this sounds like it is) the pseudo-paleography work of deciphering it benefits from every single extra pixel of resolution the human eye can get. --Xover (talk) 07:58, 15 December 2019 (UTC)
@Xover: Good evening, thanks a lot for your response. I know of DjVu; that'd be perfect, so I'll try to get hold of the missing pages and the highest-quality scan available. If you can help with the reupload and migration phase I'd be really thankful. And yes, I had indeed found it on the Internet Archive; I notice Wikisource resized the images from the original on archive.org, by the way, but that's not important. There's also some work to be done on the index: it was the first one I added to Wikisource, so it's far from perfect, I think.
@Xover: I have the missing manuscript pages. The people who digitized the book had a little pagekeeper that told them not to go on to the last pages, which are indeed in MS. The writing is quite easy to read. Now, do we need to create a new item for this book? Do we need to reupload the file and then it'll work right away? I'm available when you are if you're still up for this little project! (IRC on Freenode, mail, messages here, XMPP, Telegram, whatever) Nazmifr (talk) 08:41, 22 January 2020 (UTC)
@Nazmifr: Now the process is to grab the original scan images from https://archive.org/details/b30329619, and the images for the newly scanned manuscript pages, and combine them in a folder in the correct order. If you either download Internet Archive's ZIP file (b30329619_jp2.zip) or browse its contents online (here) you'll see what my starting point will be. I need to be able to download the new scans from somewhere, and I'll need you to tell me where in the image order they should be inserted (don't assume I know anything at all about this work).
Once I have a folder of all the images in the correct order I will go through them to remove extraneous images, such as the image of the book's spine (which we usually do not include), and any other images that are purely artefacts of the scan process. For example, the zip file contains images like this that aren't of the work at all: it's a colour calibration swatch lying on the scanning bed, which is used for digital colour correction of the actual page images.
Once that is done I will start going through all the page images and preprocess them: removing the extraneous black borders, straightening any crooked scans, adjusting colours and contrast where needed to help the OCR process, and so forth, so we end up with a uniform set of page images.
Once I have that I will run some custom scripts that will re-encode, run OCR on, and assemble the images into a DjVu file, upload that to Commons, and then rename the Index: and already proofread Page: pages here so they refer to this new DjVu file instead of the existing PDF file. At that point you should have a complete work and index to proofread from.
The only really time-consuming bit here is cropping and adjusting the images, since that has to be done manually in an image processing application, so no promises on how quickly I'll be able to get that done.
PS. please sign your posts on talk pages using ~~~~ (four tildes; will expand to your username and the date and time when saved). It makes it easier for others to follow conversations, and is required in order for notifications with {{re}} or {{ping}} to work. There's a button in the editor toolbar to insert a signature. --Xover (talk) 08:29, 22 January 2020 (UTC)
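The ordering step described above (combine the existing scan images with the new manuscript images, then drop scan artefacts) could be sketched roughly as follows. This is only a minimal Python illustration and not the scripts actually used for this work: the function names, the file names in the usage note, and the zero-padded output naming are all assumptions, and the cropping, OCR and DjVu assembly steps are deliberately left out.

```python
from pathlib import Path
import shutil


def build_page_order(printed, manuscript, insert_after, skip=()):
    """Return page image names in reading order.

    printed      -- names of the existing scan images, already in order
    manuscript   -- names of the new manuscript images, already in order
    insert_after -- name of the printed image the manuscript pages follow
    skip         -- names of scan artefacts to leave out (spine, colour swatch)
    """
    if insert_after not in printed:
        raise ValueError(f"{insert_after!r} is not one of the printed images")
    ordered = []
    for name in printed:
        if name not in skip:
            ordered.append(name)
        if name == insert_after:
            # The manuscript pages go directly after this printed page.
            ordered.extend(manuscript)
    return ordered


def copy_in_order(ordered, source_dirs, target_dir):
    """Copy images into target_dir under zero-padded sequential names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for number, name in enumerate(ordered, start=1):
        candidates = [Path(d) / name for d in source_dirs]
        source = next((c for c in candidates if c.exists()), None)
        if source is None:
            raise FileNotFoundError(name)
        # Hypothetical naming scheme: 0001.jp2, 0002.jp2, ...
        shutil.copy2(source, target / f"{number:04d}{source.suffix}")
```

With made-up names, `build_page_order(["a.jp2", "spine.jp2", "b.jp2"], ["ms1.jpg"], "b.jp2", skip={"spine.jp2"})` gives the order `a.jp2`, `b.jp2`, `ms1.jpg`. Which printed image the manuscript pages follow is exactly the information that has to come from someone who knows the book.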
@Xover: Thanks a lot for your fast and oh so detailed response. I'll download the folder later and merge the missing pages, then upload that somewhere, probably archive.org (temporarily). Should I put them at the end, as they are placed in the book, or inline with the missing pages we had? I'll try to send you resized images as well if that can help you; I'll include both! Nazmifr (talk) 08:41, 22 January 2020 (UTC)
@Nazmifr: In this particular case you can just upload them locally here on Wikisource (not on Commons!) since I happen to be an administrator on this project so I can just delete the temporary files once we're done with them. Also, since I'll have to shuffle page images around, have them in a specific naming format, and so forth, just upload the new files and give me instructions for where they should go in the page order. I'll take care of integrating them.
Wikisource's aim is to reproduce the original work, and not the scanned copy of it. In other words, if the manuscript pages are part of the page order of the original work they should appear in those positions. It's the same principle as when a scanned page image is broken or missing for some reason: we grab a scan of that page or pages from somewhere else and insert them at the correct position of our scan. Page 42 doesn't stop being page 42 as a result of some later process like scanning a particular physical copy.
A Work (an abstraction that covers all editions) has one or more editions (separate publications, possibly with a completely different publisher, a new editor, and far removed in time), each of which consists of some number of physical copies. A particular scan coincides with a particular physical copy, but we aim to reproduce the edition; and, indeed, often reproduce multiple editions of the same work. See e.g. Hamlet (Shakespeare): we have everything from the edition published in the First Folio in 1623, up to a critical edition in 1917. --Xover (talk) 09:08, 22 January 2020 (UTC)
@Xover: Good evening Xover. That took longer than I wanted. Here are the slightly corrected pages: https://archive.org/details/pymander and here are the original photographs, in case you prefer to do the rotation and post-processing yourself: https://archive.org/details/img20200116125937 Sadly, the book that contained the Divine Pymander holds multiple works and I don't have a scan of all the pages. Still, they were at the end, so I guess the natural order should be the front cover, the printed pages, the MS. pages, then the back cover? How would that work with the transclusion? And I see, it makes sense; that's awesome, as in the end we could compare such different editions to see what changed over time. Anyhow, tell me if the scans are okay and if I need to do anything now (apart from the proofreading once we have the final DjVu). By the way, is there a place where the Wikisource contributors get together online? I've already received some help on other works I worked on, but it's hard to really know what to do sometimes. Nazmifr (talk) 19:53, 24 January 2020 (UTC)
@Nazmifr: Right, that took a little longer than I had hoped, but them's the breaks.
In any case, I've had a stab at integrating the mss. pages into the scan and uploaded it as a DjVu at File:The Divine Pymander (1650).djvu; set up a new index for it at Index:The Divine Pymander (1650).djvu; and moved (some of) the existing pages over (mainly to preserve history).
I had a look at the existing pages and I must say those were rather far from proofread; mostly they looked like raw OCR text pasted in and saved. I've gone over these and corrected them so you have a reference for future pages.
Incorporating the mss. pages when transcluding this to mainspace will be a little bit tricky, but we can tackle that later (it will involve labeled sections and selective transclusion). If you just proofread as normal we can do the necessary tweaking later.
Feel free to ping me if you need help with anything.
The Wikisource community discussion venue is the Scriptorium (which I highly recommend everyone puts on their watchlist and follows), and if you need help with something you can ask at Wikisource:Scriptorium/Help. There's an IRC channel too (#wikisource on Freenode), but it's pretty dead these days. --Xover (talk) 08:00, 29 January 2020 (UTC)
@Xover: Wow, the new index is dope, thank you so much! I'll go back to proofreading; a few questions before I get back to the work:
Shouldn't we preserve the long s (ſ) in the proofreading?
Also, I thought the line structure needed to be kept before transclusion, since there's a template for hyphenated words.
Please tell me which best practices to apply before I continue the proofreading. In the meantime I'll transfer the proofreading I have already done offline for some of the MS. pages. I've followed the Scriptorium page and I'll check the IRC channel, but yeah, that's a sad reality indeed for most projects. Nazmifr (talk) 19:55, 29 January 2020 (UTC)
@Xover: One quick question: how do I add some text inline with the text on the proofreading page that won't be transcluded? Nazmifr (talk) 20:52, 29 January 2020 (UTC)
@Nazmifr: Oh, almost missed this question, sorry. There are a couple of ways that can be done, and which to use depends on what you're trying to achieve. Could you elaborate? --Xover (talk) 13:35, 30 January 2020 (UTC)
@Nazmifr: Preserving long s is optional and up to the contributor(s) active on a particular text (it should be consistent throughout a given text). However, we mostly do not preserve the long s because it is a low-level typographic feature rather than something intrinsic to the work. When done we use the {{ls}} template for it, and that displays the long s in the Page: namespace but displays a modern s when transcluded to mainspace (both because some find it hard to read, and it confuses various search engines). There is thus very little point in preserving it, and it takes extra effort to do when proofreading and clutters up the wikimarkup on each page. In other words, I would recommend against trying to preserve it. But, ultimately, that's up to the ones (i.e. you) doing the work.
I'm not sure what you mean by "line structure", but if you mean hard line breaks we usually do not preserve those except where they serve a specific purpose, such as in a poem or similar. In running prose we normally unwrap lines for better dynamic display and easier editing. It makes no sense to preserve line-wrapping forced by the physical sheet size of the original book when we are trying to produce a modern dynamic version of it. Or put another way, our goal is not a diplomatic or even semi-diplomatic edition: we preserve the spirit of the work, but not those parts of the details of a given edition's physical properties that make no sense in a digital non-paged medium.
Feel free to ping me any time if you have questions. And the existing pages should hopefully be a useful reference for how to do the remainder. --Xover (talk) 07:51, 30 January 2020 (UTC)
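As a rough illustration of the two conventions just described (unwrapping running prose and showing a modern s in place of the long s), here is a minimal Python sketch that works on plain text. It is not how the {{ls}} template or the proofreading tools work internally; the function names are made up, and the rule that a trailing hyphen always marks a word split across lines is an assumption that will be wrong for genuinely hyphenated compounds.

```python
def unwrap_prose(text):
    """Join hard-wrapped lines into paragraphs separated by blank lines."""
    paragraphs = []
    current = ""
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line closes the current paragraph.
            if current:
                paragraphs.append(current)
                current = ""
        elif current.endswith("-"):
            # Assumption: a trailing hyphen marks a word split across lines.
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line
    if current:
        paragraphs.append(current)
    return "\n\n".join(paragraphs)


def modernise_long_s(text):
    """Replace the long s (U+017F) with a modern s."""
    return text.replace("\u017f", "s")
```

Poems and other text where the line breaks carry meaning should not be passed through `unwrap_prose`, for the reason given above.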

Issues on the first book

@Xover:

Hi, I think the first book is in good shape to be transcluded. I carried out some tests but still don't really know what should be done about some things. Are you still around with some insights?

- The end of the first book is on the same page as the beginning of the second, so in the current configuration it's missing "is hidden and kept secret."

- On several pages the last line merges with the first line of the next page, where the text is a list and the new page should start on a new line. These are: 40-41, 56-57, 72-73. I've tried messing with br tags, but to no avail.

- On line 51 there are odd words that are illegible in the MS. I managed to find in a more recent version that it should say "51. Malice is the nourishment of the world". What ought to be done?

Nazmifr (talk) 15:55, 10 April 2020 (UTC)

@Nazmifr: The "two different sections are on the same physical page"-problem is addressed with labelled section transclusion: on the page you tag what belongs to which section, and on transclusion you specify which section it is you want (see the changes I made for an example: it's the fromsection and tosection bits that are relevant for this). It's a bit complicated to get started with, but it works really well to solve these kinds of issues once you get used to it. There's also a Gadget available in your preferences called "Easy LST" that makes a much simpler syntax for simpler uses of labelled sections available. Basically you add "## Book 1 ##" on a line by itself in front of the bit of the page that belongs to Book 1, and "## Book 2 ##" in front of the bit belonging to Book 2. When you save the page the gadget will convert the easy syntax into the full HTML-like syntax described in the docs, and vice versa when you later edit the page again.
Merging lines across pages is a well-known and common problem on Wikisource. The short version is that when you want to preserve the end of one page on a separate line from the beginning of the next, you insert a {{nop}} template at the end of the first page. The long version is… long, so I'll skip it for now.
On the illegible text, strictly speaking you should mark it as {{illegible}} and leave it. But I would personally cheat a little and just replace it with the correct text if you feel certain it is correct. See also the thread just started on the Scriptorium here: Reconstruction on context, and clipped scans....
I am kinda sorta around, but very busy IRL, so my response time may not be the best. Feel free to ping me, but no guarantees on when I'll notice it. You should also not hesitate to ask for help at the Scriptorium where the wider community will see it. --Xover (talk) 18:16, 10 April 2020 (UTC)
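To make the "easy syntax" conversion described above more concrete, here is a minimal Python sketch of the idea: each "## label ##" marker line opens a labelled section, which runs until the next marker or the end of the page. This is not the gadget's code. The function name is made up, the paired section begin/end tags reflect my understanding of the full labelled-section syntax, and the real gadget's output may well differ in whitespace and other details.

```python
import re

# A marker is a line consisting only of "## some label ##".
MARKER = re.compile(r"^##\s*(.+?)\s*##\s*$")


def expand_easy_lst(page_text):
    """Turn '## label ##' marker lines into paired section tags."""
    output = []
    current = None
    for line in page_text.splitlines():
        match = MARKER.match(line)
        if match:
            if current is not None:
                # Close the previous section before opening the next one.
                output.append(f'<section end="{current}" />')
            current = match.group(1)
            output.append(f'<section begin="{current}" />')
        else:
            output.append(line)
    if current is not None:
        output.append(f'<section end="{current}" />')
    return "\n".join(output)
```

For a page holding the end of Book 1 and the start of Book 2, the markers "## Book 1 ##" and "## Book 2 ##" would each become a begin/end pair, and the mainspace transclusion then picks the wanted part with fromsection and tosection.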