User:Mukkakukaku/Guide


Locating and uploading works

HathiTrust

HathiTrust (http://www.hathitrust.org) has a collection of scanned public domain works. Full access is restricted to members of subscribing institutions; if you are affiliated with a university, you may have access this way. Other users are restricted to the built-in, page-by-page PDF reader. Geographical restrictions may also apply, although US-based users will probably not be affected.

The following instructions explain how to download a work from HathiTrust as a non-subscriber. They assume that you have access to the work; if you don't, post at Requested Texts and include the HathiTrust link to the work. Alternatively, you might use the freeware tool "Hathi Download Helper", which downloads all the pages, combines them into a single PDF file, and uploads it to the Internet Archive.

You will need
  1. Microsoft Windows 7 or greater
  2. Firefox browser
  3. "DownThemAll" plugin for Firefox



Process overview

The steps of the overall process are as follows:

  1. Use Firefox and the "DownThemAll" plugin to download all of the pages of the PDF from HathiTrust as images (.jpg or .png files).
  2. Use built-in Windows utilities to convert these images into a PDF.
  3. Add an OCR layer to the PDF.
  4. Upload the finished PDF to Commons.



Downloading pages from HathiTrust
  1. Navigate to the work that you want to download on HathiTrust.
  2. In the PDF viewer, navigate to the last page of the PDF and copy the URL from the address bar of your browser.
  3. Open a text editor and paste the URL you just copied. It will look something like this:
    https://babel.hathitrust.org/cgi/pt?id={ID NUMBER}&view=image&seq={number}
  4. On the line below this URL, paste the following template:
    https://babel.hathitrust.org/cgi/imgsrv/image?id={ID NUMBER};seq={number};width=2000
  5. Copy the {ID NUMBER} (everything after ?id= and before the next &) and the {number} of the last page (everything after seq= and before the next &, if there is one) into the template. Then copy the new URL.
  6. Open the "DownThemAll Manager". It can be found under Tools > DownThemAll! Tools > Manager.
    • Can't see the tools menu? Tap your 'alt' key.
  7. Create a new download by clicking the "+" button on the menu.
    1. In the 'Download' field, paste the URL you just copied. It will look something like this:
      https://babel.hathitrust.org/cgi/imgsrv/image?id={ID NUMBER};seq={number};width={resolution}
    2. Set the 'width' part of the URL to 2000, if it is not already.
    3. Change the 'seq' part of the URL to be formatted like this: [1:{last page number}]. So if your PDF has 200 pages, it would look like this: seq=[1:200]
    4. Change the 'Mask' field to include the page number by adding the *inum* token. I use this pattern: *name**inum*.*ext*.
    5. Click 'Start.'
  8. Once all downloads are complete, verify that you have all the pages and spot check a few to make sure they're in order. Usually the ones that get missed are the very last ones if you fudge the length of the PDF. Make sure, for example, that the last downloaded image is actually the back cover/last page of the book on HathiTrust.
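
The URL editing in steps 3 to 7 above can also be sketched in Python. This is only an illustration of the string manipulation, not part of the original workflow: the function names are made up for this example, the work ID in the usage note is a placeholder, and it assumes the viewer URL uses & between parameters as shown above.

```python
from urllib.parse import urlparse, parse_qs

# Template taken from step 4 above; the {id}, {seq} and {width} slots are filled in below.
IMAGE_TEMPLATE = (
    "https://babel.hathitrust.org/cgi/imgsrv/image?id={id};seq={seq};width={width}"
)


def parse_viewer_url(viewer_url):
    """Return (work_id, last_page) from the viewer URL of the last page."""
    query = parse_qs(urlparse(viewer_url).query)
    return query["id"][0], int(query["seq"][0])


def batch_url(viewer_url, width=2000):
    """Build the single URL with the [1:N] range that step 7 pastes into DownThemAll."""
    work_id, last_page = parse_viewer_url(viewer_url)
    return IMAGE_TEMPLATE.format(id=work_id, seq=f"[1:{last_page}]", width=width)


def page_urls(viewer_url, width=2000):
    """Expand the same range into one image URL per page, in page order."""
    work_id, last_page = parse_viewer_url(viewer_url)
    return [
        IMAGE_TEMPLATE.format(id=work_id, seq=page, width=width)
        for page in range(1, last_page + 1)
    ]
```

For example, calling batch_url("https://babel.hathitrust.org/cgi/pt?id=example.123&view=image&seq=200") gives a URL ending in id=example.123;seq=[1:200];width=2000, which is the form step 7 asks for. The ID example.123 is a placeholder, not a real work.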



Converting HathiTrust downloaded page images into a PDF

Now you need to combine these images, in order, into a PDF. The instructions below are also available with pictures, if that helps.

  1. Open Windows Explorer and navigate to the folder that contains your images.
  2. Change your view settings until the files are listed in the order they should appear in your PDF (e.g. page1.jpg, page2.jpg, etc.). I find the best way to do this is to change the view to "Details" and then sort by "Name".
  3. Select all of the image files.
  4. Right-click on the first file in your selection, and choose "Print" from the menu. It is important that you right-click on the first image, because that image is the one that will be used as your first page. A popup window will appear.
  5. From the list of printers select "Microsoft Print to PDF."
  6. Make sure that your paper size is set to "Letter" or something similarly appropriate ("A4", etc.)
  7. At the bottom right, you'll see a link called "Options...". Click it.
  8. On the new popup, click the option that says "Printer Options".
  9. On the third new popup (Microsoft really likes popups...), there is a dropdown box that lets you select the orientation of your images on the PDF. You'll probably want to select "Portrait" here. Save your choices on the two Options popups.
  10. There is a check box that says "Fit Picture to Frame." You'll want to make sure it's not checked.
  11. Click "Print." It will prompt you for a file name and then take a few minutes saving your new PDF. As with all things, how long it takes depends on how many pages you've pulled.

At this point you have a PDF containing all of the images you downloaded from HathiTrust. Yay! Open it up and make sure all the pages are in order.
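
The two checks described above (that no pages are missing, and that the files sort in true page order) can be sketched in Python. This is a minimal illustration rather than part of the original workflow: the helper names are invented, and it assumes each file name ends in its page number just before the extension, as a mask like *name**inum*.*ext* would produce.

```python
import re


def page_number(filename):
    """Return the last run of digits before the extension as an int, or None."""
    stem = filename.rsplit(".", 1)[0]
    runs = re.findall(r"\d+", stem)
    return int(runs[-1]) if runs else None


def check_pages(filenames, last_page):
    """Return (files in numeric page order, list of missing page numbers)."""
    numbered = {}
    for name in filenames:
        number = page_number(name)
        if number is not None:
            numbered[number] = name
    ordered = [numbered[n] for n in sorted(numbered)]
    missing = [n for n in range(1, last_page + 1) if n not in numbered]
    return ordered, missing
```

To try it on a real folder, pass os.listdir(folder) as filenames and the last page number from HathiTrust as last_page. Sorting by the numeric value matters because a plain alphabetical sort would put page10 before page2.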

Transcribing works

Transcription is something of an art form. There is always a template that you don't know about or don't remember; sometimes it helps to 'watch' the pages you proofread just to see if the person who comes along to validate them might show you an additional trick or two.

The basics

  1. If a paragraph ends at the close of a page, the last thing in the transcription should be a {{nop}}. When works are transcluded to the main namespace, the software tries to join up the text from the end of one page to the beginning of the next, which is helpful for sentences that cross page boundaries. The {{nop}} forces a paragraph break at the end of the page. If you don't include it, the next paragraph will be joined up to the current one. It's best to make a habit of ending such pages with a {{nop}}.
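
The effect of {{nop}} can be pictured with a toy model in Python. This is not how the wiki software is actually implemented; it only imitates the behaviour described above, under the assumption that consecutive pages are joined with a single space unless the earlier page ends in {{nop}}. The function name is made up for this sketch.

```python
NOP = "{{nop}}"


def join_pages(pages):
    """Toy model of transclusion: join page texts, honouring a trailing {{nop}}."""
    result = ""
    for index, page in enumerate(pages):
        text = page.strip()
        ends_paragraph = text.endswith(NOP)
        if ends_paragraph:
            # Drop the marker itself; it only signals the paragraph break.
            text = text[: -len(NOP)].rstrip()
        result += text
        if index < len(pages) - 1:
            # A trailing {{nop}} keeps the paragraphs apart; otherwise text runs on.
            result += "\n\n" if ends_paragraph else " "
    return result
```

In this model, join_pages(["First paragraph.\n{{nop}}", "Second paragraph."]) keeps the two paragraphs separate, while leaving out the {{nop}} merges them into one line of running text.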