Wikisource:WikiProject U.S. Roads/Tutorial

This is a basic tutorial on the steps necessary to transcribe a document from start to finish. For this tutorial, we will use File:AASHO USRN 1970-11-07.pdf from commons:Category:Minutes from the American Association of State Highway and Transportation Officials as our base document for transcription.

You will end up creating pages in three different namespaces: the mainspace, the Index: space and the Page: space.

  • Mainspace holds a single page for the finished document.
  • Index: space holds a single page that corresponds to the copy of the scanned file on Commons.
  • Page: space holds separate pages, one for each page of the scanned file on Commons. At the top of each of these pages, there will be < and > tabs to navigate forward and backward within the larger document. There is also a ^ tab that will link up to the Index: page.

The Index: page for a document will link to the separate pages in the Page: space. Once these are complete, you'll create the document in mainspace, which will transclude all of the separate document pages in the Page: space to create the finished transcribed document.

Steps

  1. Create the Index: page for the desired document. Since we're using File:AASHO USRN 1970-11-07.pdf, we need to create Index:AASHO USRN 1970-11-07.pdf.
    1. On the edit page for the Index, insert the title of the document as a wikilink, appending the date in ISO format to disambiguate if necessary. In this case, it's U.S. Route Numbering Sub-Committee Agenda 1970-11-07.
    2. Add the author, linking to a portal if appropriate. In this case, it's [[Portal:American Association of State Highway and Transportation Officials|American Association of State Highway Officials]], which links as American Association of State Highway Officials.
    3. Add the publisher, location of publication and year of publication.
    4. Set the Scans parameter to the appropriate format of the base file, in this case PDF.
    5. Set up the pagelist. In this case, the first page of the document is number 396, so we'll use <pagelist 1=396 />, which labels the first page of the scan as 396 and numbers the following pages consecutively from there. More complicated documents can have more complicated page lists, such as Index:Report of Joint Board on Interstate Highways.pdf.
    6. Save the page.
  2. On the Index: page there will be a set of redlinks at the bottom, one for each page of the base document and numbered following the internal numbering as set up in the pagelist. Click the first one to start transcribing that page of the PDF.
    1. This will bring you to Page:AASHO USRN 1970-11-07.pdf/1, which is the page that holds the text corresponding to the first page of the PDF file. The original scan will appear next to the edit window.
      1. If the base file has an OCR layer already, that text will pre-load in the edit window.
      2. If not, click the OCR icon on the toolbar to have the server attempt to decipher the text of the scan for you. You may have to enable the default OCR gadget in your Preferences.
      3. There is an additional OCR gadget that uses Google's OCR engine instead of the one on Wikisource; the Google engine works better on poor-quality scans. To enable it, add mw.loader.load('//wikisource.org/w/index.php?title=MediaWiki:GoogleOCR.js&action=raw&ctype=text/javascript'); to your common.js page.
    2. Edit the OCR-generated text, formatting it as necessary.
      1. Each Page: has a header, a body and a footer. Only the content in the body is transcluded into the finished document; the header and footer appear only on the Page: itself, so that it displays correctly on its own. The header can hold the page numbers that appear in the original document (Wikisource inserts them in the finished document another way) as well as other text that we don't need in the finished document.
      2. For the AASHO/AASHTO minutes, most of the content is formatted as a large table that spans multiple pages. The {| that starts the table goes in the body of the first page, but in the header of each subsequent page. The reverse holds for the |} that closes the table: it goes in the footer of the first page and of every subsequent page except the last, where it must appear in the body to properly close the table. The {{nop}} template should also be inserted at the top of the body and in the footer of any page where the table continues across an internal page break. This ensures that the server recognizes the appropriate line breaks when the pages are later transcluded together. (Insert {{nop}} when the new page starts a new paragraph rather than continuing one from the previous page, or starts a new table row rather than continuing the previous row.)
      3. There are several templates available to help format the text, allowing it to appear centered or right-aligned. We can insert <br/> tags as necessary to force line breaks, but we should not force breaks in the middle of a paragraph. We should also recombine words split by hyphenation at the end of a line in the original document. The goal is to create a finished product that mimics the original document, but does not exactly copy its formatting because our readers will read it on different types of devices.
    3. If necessary, you can click and drag the scanned copy to scroll it around next to the edit window.
    4. Once the transcribed copy has been cleaned up and formatted, save the page. Under the edit summary space, there is a set of color-coded options to indicate the status. If you are pleased with the results, you can click the yellow option to mark the page as Proofread. If not, leave it red. (More on editing status later.)
    5. Click the > tab at the top of the page to navigate to the next page of the document. Create and edit that page, continuing with each page in the original file.
  3. Once all of the Page: pages are created, go back to the Index: page.
    1. Edit the page to change the Progress status to mark it as ready to proofread or ready to be validated, as appropriate.
  4. Click the redlink to create the finished document in the mainspace.
    1. There is a {{header}} template that needs to be added here. For simplicity's sake, copy and paste the header from another similar document, such as U.S. Route Numbering Sub-Committee Agenda 1970-06-30.
    2. Under the header, there should be a <pages> tag. This is what tells the server to transclude all of the subpages of the document here. Change the name of the index page in this tag if copied from another document.
    3. There should be a license template that appears at the bottom. For the older AASHO/AASHTO documents, it will be {{PD-US-no-notice}}.
    4. Save the page.
  5. If there are any formatting irregularities that appear, now is the time to edit the separate pages in the Page: space to correct them.
    1. The Source tab on the mainspace page will take you to the Index: page.
    2. It may be necessary to purge the document page to update it after editing the relevant page in the Page: space.
  6. Add the link to the newly transcribed document to the appropriate portal and WS:USRD to advertise it to other editors.
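
As a rough sketch of step 2.2.2, a table that spans three pages might be split across the header, body and footer fields as shown here. The cell contents are placeholders, and the field labels are written as comments purely for illustration (they are not typed into the edit boxes). The position of {{nop}} within the footer is an assumption, so compare against an existing transcription before relying on it.

```wikitext
<!-- First page: body field (the table opens here) -->
{|
|-
| Row 1, cell 1 || Row 1, cell 2

<!-- First page: footer field (closes the table for this page only) -->
{{nop}}
|}

<!-- Middle page: header field (reopens the table for this page only) -->
{|

<!-- Middle page: body field (this page begins with a new row) -->
{{nop}}
|-
| Row 2, cell 1 || Row 2, cell 2

<!-- Middle page: footer field -->
{{nop}}
|}

<!-- Last page: header field -->
{|

<!-- Last page: body field (the table is closed in the body) -->
{{nop}}
|-
| Row 3, cell 1 || Row 3, cell 2
|}
```

Because only the body fields are transcluded, the finished document should see a single {| from the first page, the rows from every page, and a single |} from the last page.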

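For step 4, a minimal sketch of the finished mainspace page might look like the following. The header values are placeholders and the to= page number is hypothetical; copy the real header from a similar document and set to= to the last page of the scan.

```wikitext
<!-- Header values below are placeholders; copy real ones from a similar document -->
{{header
 | title    = U.S. Route Numbering Sub-Committee Agenda 1970-11-07
 | author   = 
 | year     = 1970
 | section  = 
 | previous = 
 | next     = 
 | notes    = 
}}

<!-- to=5 is a hypothetical last page; use the actual page count of the PDF -->
<pages index="AASHO USRN 1970-11-07.pdf" from=1 to=5 />

{{PD-US-no-notice}}
```

The index value is the name of the Index: page without the "Index:" prefix, which is why it must be changed when the tag is copied from another document.
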
Proofreading and validation

As each document's subpages are transcribed, they're marked with a page status.

  • Not proofread is for a page that has not yet been proofread by an editor. This is useful for transcriptions in progress.
  • Proofread indicates that one editor has proofread that transcription.
  • Validated means that the page has been proofread by a second editor. This option appears only to an editor other than the one who marked the page as Proofread.
  • Problematic flags a page with issues.
  • Without text is for a page without text, such as a blank page in the original file.

An editor can proofread their own work, but that editor can't then validate it. Proofreaders and validators should be looking to spot any errors in the document transcription. Typos in the original can be marked with {{SIC}}. For example, if the original has "norhterly", we can use {{SIC|norhterly|northerly}} to give "norhterly" in the displayed text. This preserves the integrity of the original document while indicating the correct spelling.