User:George Orwell III/whitepaper2


You are what you upload


In my view, that should be rule number one around here.

Of course, no newbie is expected to know or follow such nonsense; but for those of you with any respectable amount of contribution time under your belt already, it should be part of your transcription arsenal by now. You can be the most consistent, highest-quality transcriber that ever walked the Earth, but the chances of anyone ever realizing that diminish, to one degree or another, along with the quality of the source files accompanying your work.

A. — What you need to know first

  1. The ability to read and compare times & dates.
  2. Understanding that nothing dealing with technology is ever static, and that we depend on that technology. The entire endeavor -- from Google scanning a physical book into a digital format to validating a transcription of the source file derived from that book here on Wikisource -- would not be possible were it not for the technology behind it all.

    Without getting too technical -- if technology improves over time, it's safe to say that any results based on that same technology should also see improvement over time. The newer the version of any piece of hardware or software, the chances are the "better" it will be compared to its predecessors (e.g. ABBYY 8.0 is not as good as ABBYY 9.0, which is not as good as ABBYY 10.0, which is not as good as ABBYY 10.1, and so on...).

  3. Familiarity with the basics of image and text file types.-- Understanding the nuances of things like: scanning a paper page of text almost always produces an image file - a facsimile of the paper page. In some cases text content might also be embedded within that same file; however, the presence of such text does not change the fact that the file "type" is an image. Multiple image files, one per scanned paper page, can be compiled into a single document file to mimic the pagination of the original source (a .PDF file for a physical book, with the individual images occupying a contiguous order of positions -- not pages -- in that document file). The virtual position number rarely ever lines up with the printed page number. Etc.
  4. Familiarity with the various pages associated with a single work hosted on the Internet Archive.
    1. The URL for the "main" page of any work follows this format:

      https://archive.org/details/IDENTIFIER where for our example, the IDENTIFIER is womanswhoswhoofa00leon making our target URL https://archive.org/details/womanswhoswhoofa00leon.

    2. The URL listing all the files related to or needed by that "main" page (otherwise known as the Index) can be found in the left-hand sidebar titled View the book. It's labeled near the bottom, next to All files:, as HTTPS.

      While the URL to the Index incorporates the IDENTIFIER introduced in A4-1, there is no consistently easy way to ascertain the exact address other than the HTTPS link found on the "main" page (a short URL-building sketch follows this list).

    3. The link to the entire "history" of both the "main" page of a hosted work and the "Index" of files supporting it follows this format:

      https://catalogd.archive.org/history/IDENTIFIER where, again, for our example, the IDENTIFIER is womanswhoswhoofa00leon making our target URL https://catalogd.archive.org/history/womanswhoswhoofa00leon.

  5. ... I'm sure I'll remember something -- placeholder
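
To make A4 concrete, here is a minimal sketch (Python 3) that builds the addresses discussed above from a bare IDENTIFIER. The details and history formats are exactly the ones given in A4-1 and A4-3; the archive.org/download/IDENTIFIER shortcut for the file Index is my own assumption -- as A4-2 notes, the HTTPS link on the "main" page remains the reliable way to find that address.

    # Build the Internet Archive URLs described in A4 from an identifier.
    IDENTIFIER = "womanswhoswhoofa00leon"

    details_url = "https://archive.org/details/" + IDENTIFIER           # A4-1: the "main" page
    history_url = "https://catalogd.archive.org/history/" + IDENTIFIER  # A4-3: the derive history
    index_url = "https://archive.org/download/" + IDENTIFIER            # assumed shortcut to the Index listing

    for label, url in (("Main page", details_url),
                       ("Index (file listing)", index_url),
                       ("Derive history", history_url)):
        print(label + ": " + url)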

B. — What you need to check for

  1. How old is the candidate file?-- If you accept the premise laid out in A2, the age of the candidate file is worth taking into consideration in your decision-making process. And by age we mean two things:
    1. When was the work first scanned into a digital format, and by whom (e.g. by Google)?
    2. When was that work put through the derive process on the Internet Archive?
Ascertaining the "original" date of creation in B1-1 is not always easy, nor always worth the effort. What is easy is determining when the current work on IA was first processed. Why? Because the older that date is compared to today, the more likely it is that the file is not the most optimal one possible, given the ever-improving-technology premise. How? By analyzing the Index of the candidate file on the Internet Archive.
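
For those comfortable with a little scripting, here is a hedged sketch (Python 3) of one way to do that analysis. It assumes the archive.org /metadata/IDENTIFIER endpoint and the addeddate and mtime fields it commonly returns -- assumptions about how IA items usually look rather than anything documented on this page -- so fall back to eyeballing the Index if they are missing.

    import json
    import urllib.request
    from datetime import datetime, timezone

    IDENTIFIER = "womanswhoswhoofa00leon"

    # Pull the item's JSON record (assumed endpoint: /metadata/IDENTIFIER).
    with urllib.request.urlopen("https://archive.org/metadata/" + IDENTIFIER) as resp:
        record = json.load(resp)

    # addeddate (if present) says when the item landed on IA.
    print("Added to IA:", record.get("metadata", {}).get("addeddate", "unknown"))

    # The oldest file timestamp in the Index gives a rough lower bound on when
    # the first derive pass ran.
    mtimes = [int(f["mtime"]) for f in record.get("files", []) if "mtime" in f]
    if mtimes:
        oldest = datetime.fromtimestamp(min(mtimes), tz=timezone.utc)
        print("Oldest file in the Index:", oldest.date())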

  2. What exactly is in the Index?-- Once a file is uploaded to IA and the basic metadata (Title, Author, Language, etc.) has been entered, the "derive" process begins. Depending on the type and make-up of the uploaded file, a consistent, pre-set batch of file-manipulation programs is executed against it until a final set of resulting files has been created -- this processing is more commonly known as the "derive process". The files listed in the Index are the products of the upload and derive stages of processing. Knowing which type of file is the result of which stage, in addition to inspecting the timestamps of each, helps us determine the age aspect.

    Using the previous identifier example for illustrative purposes here again, a typical Index looks something like this:

    [Index file listing for womanswhoswhoofa00leon -- the "line 4" and "line 17" mentioned below refer to positions in this listing]

    Without diving too deep into the details just yet, one thing is obvious -- the source file uploaded to Commons is already approximately 4 years, 3 months old. How old was the source file uploaded to the Internet Archive that produced the file ultimately uploaded to Commons? We can't say with any certainty, but for argument's sake, let's say this book was scanned and the resulting file or files were uploaded for processing by IA on the same day.

    So what difference does 4.3 years make? You tell me. Our example file's derive-history highlights appear first, with a 2-day-old file's derive-history highlights below them -- note the (v#####) version number for each (a small parsing sketch follows the two excerpts).

    <--- BookOp SetupMetaXML (v30094 Sep08 09:19) Starting PDT: 2010-09-08 09:19:58 ----
    <--- BookOp DevelopRawJp2 (v30003 Sep08 09:20) Starting PDT: 2010-09-08 09:20:01 ----
    <--- BookOp AbbyyZipToGz (v13154 Sep08 09:20) Starting PDT: 2010-09-08 09:20:01 ----
    <--- BookOp DevelopMekel (v18767 Sep08 09:20) Starting PDT: 2010-09-08 09:20:01 ----
    <--- Module ProcessJP2 (v30682 2010Sep08 09:20) Starting PDT: 2010-09-08 09:20:30 ----
    <--- Module AnimatedGIF (v30465 2010Sep08 15:35) Starting PDT: 2010-09-08 15:35:55 ----
    <--- Module AbbyyXML (v30243 2010Sep08 15:36) Starting PDT: 2010-09-08 15:36:44 ----
    Updating meta.xml with ocr = "ABBYY FineReader 8.0"
    <--- Module DjvuXML (v28794 2010Sep09 01:28) Starting PDT: 2010-09-09 01:28:18 ----
    <--- Module PDF (v21957 2010Sep09 02:26) Starting PDT: 2010-09-09 02:26:23 ----
    <--- Module DjVu (v27253 2010Sep09 02:26) Starting PDT: 2010-09-09 02:26:23 ----
    <--- Module JPEGCompPDF (v30180 2010Sep09 04:17) Starting PDT: 2010-09-09 04:17:53 ----
    <--- Module HackPDF (v23989 2010Sep09 04:17) Starting PDT: 2010-09-09 04:17:54 ----
    <--- Module GrayscalePdf (v29966 2010Sep09 07:22) Starting PDT: 2010-09-09 07:22:26 ----
    <--- Module DJVUTXT (v23986 2010Sep09 11:19) Starting PDT: 2010-09-09 11:19:28 ----

    <--- BookOp SetupMetaXML (v63345 Jan01 20:30) Starting PST: 2015-01-01 20:30:00 ----
    <--- BookOp DevelopRawJp2 (v38364 Jan01 20:30) Starting PST: 2015-01-01 20:30:01 ----
    <--- BookOp AbbyyZipToGz (v59030 Jan01 20:30) Starting PST: 2015-01-01 20:30:01 ----
    <--- BookOp DevelopMekel (v38252 Jan01 20:30) Starting PST: 2015-01-01 20:30:01 ----
    <--- Module ProcessJP2 (v54800 2015Jan01 20:30) Starting PST: 2015-01-01 20:30:02 ----
    <--- Module AnimatedGIF (v50716 2015Jan01 20:41) Starting PST: 2015-01-01 20:41:35 ----
    <--- Module AbbyyXML (v60634 2015Jan01 20:42) Starting PST: 2015-01-01 20:42:15 ----
    Updating meta.xml with ocr = "ABBYY FineReader 9.0"
    <--- Module DjvuXML (v38071 2015Jan01 21:17) Starting PST: 2015-01-01 21:17:17 ----
    <--- Module EPUB (v36000 2015Jan01 21:18) Starting PST: 2015-01-01 21:18:43 ----
    <--- Module DjVu (v38041 2015Jan01 21:18) Starting PST: 2015-01-01 21:18:49 ----
    <--- Module TOC (v39713 2015Jan01 21:34) Starting PST: 2015-01-01 21:34:07 ----
    <--- Module ScandataXML (v35935 2015Jan01 21:34) Starting PST: 2015-01-01 21:34:07 ----
    <--- Module PDF (v35935 2015Jan01 21:34) Starting PST: 2015-01-01 21:34:11 ----
    <--- Module HackPDF (v35935 2015Jan01 21:34) Starting PST: 2015-01-01 21:34:11 ----
    <--- Module DJVUTXT (v38312 2015Jan01 21:34) Starting PST: 2015-01-01 21:34:58 ----
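
    If you'd rather not compare those version numbers by eye, a small sketch like the following (Python 3) can pull the module name, version, and timestamp out of lines in the format shown above. The line format is taken straight from these two excerpts and may of course change whenever IA changes its logging; the saved-excerpt filenames are hypothetical.

      import re

      # Matches lines like:
      # <--- Module AbbyyXML (v30243 2010Sep08 15:36) Starting PDT: 2010-09-08 15:36:44 ----
      LOG_RE = re.compile(r"<--- (?:Module|BookOp) (\S+) \(v(\d+) .*?\) Starting \w+: (\S+ \S+)")

      def parse_history(text):
          """Return {module_name: (version, timestamp)} for one derive-history excerpt."""
          return {m.group(1): (int(m.group(2)), m.group(3)) for m in LOG_RE.finditer(text)}

      old = parse_history(open("old_history.txt").read())   # hypothetical files holding the excerpts
      new = parse_history(open("new_history.txt").read())
      for module in sorted(old.keys() & new.keys()):
          print("%s went from v%d to v%d" % (module, old[module][0], new[module][0]))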

    Not only have the modules and software been updated in those 4-some-odd years, but we are presented with an unusual opportunity in this case. Look closer at the Index of files for our example... notice that the .PDF file (line 4) is not the "oldest" file of the bunch? That means it wasn't the original source file uploaded to IA for processing but one of the resulting products of that processing. The .tar archive (line 17), presumably containing one .jp2 file for every page scanned, is the source file.
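
    You don't have to guess at which file was the original upload, either. Assuming the same /metadata/IDENTIFIER record used in the earlier sketch, each entry in its files list usually carries a source field of "original" or "derivative" (again, an assumption about typical IA records, not something documented on this page), which a few lines can filter on:

      import json
      import urllib.request

      IDENTIFIER = "womanswhoswhoofa00leon"
      with urllib.request.urlopen("https://archive.org/metadata/" + IDENTIFIER) as resp:
          files = json.load(resp).get("files", [])

      # Originals are what was uploaded; derivatives are what the derive process produced.
      for f in files:
          if f.get("source") == "original":
              print("original upload:", f.get("name"), "-", f.get("format", "unknown format"))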

    Now, a bad scan is a bad scan, and no amount of re-jiggering will dramatically improve the quality of the files derived from it. The flip side is that a good scan is a good scan, and re-running the latest derive modules and updated software against it will likely improve results one way or another - maybe even all around (i.e. better thumbnails and a superior text layer).

    So what should you take away from all this... INVEST SOME TIME and RESEARCH in what you select for upload & hosting by us, and stop letting the 'eye candy of the moment' do your decision-making for you!!! -- George Orwell III (talk) 04:08, 3 January 2015 (UTC)