Help:Internet Archive

Internet Archive
Shortcut:
H:IA
Guidelines to downloading files from and uploading files to the Internet Archive
The Internet Archive

The Internet Archive is a non-profit digital library that holds nearly 3 million digitised books as well as music, audio, video and other files. It is one of the main sources of DjVu files for use on Wikisource. As well as files based on their own scans, the Internet Archive will also derive files (including DjVu files) from scans uploaded by its users. This can be a useful way to convert user-made scans into a DjVu file compatible with Wikisource (as well as making the work available for others).

This help page focuses on DjVu files, because that is the most used file type on Wikisource, but the process can be used for any other file type available from the Internet Archive.

Getting files

Searching

  1. Go to The Internet Archive
  2. Search for the book (or other text) you want. The basic search has a text field and a drop-down list. Type the title of the book in the text field and set the drop-down to "Texts".
  3. Click "Go"
  4. If the correct files are found on the Archive, you should see them in the search results. If there are multiple appropriate files, select the one you deem the best. This is subjective, but a clear scan will work best for proofreading, so aim for the best quality available (also note that some scans may have dirt or writing on the pages, which may or may not make proofreading harder). Different scans may come from different editions. If so, it is up to you which you pick, but the earliest edition available is a popular choice.
  5. If unsuccessful, you can also try following links, searching by subject, searching by author, or using the Advanced Search function.
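
The same kind of search can also be scripted. The sketch below only builds a query URL and does not contact the site. It assumes the Archive's public advancedsearch.php endpoint and its title and mediatype query fields, none of which are described on this page, so check them against the Archive's own documentation before relying on them. The function name and the example title are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the Archive's public Advanced Search endpoint.
SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def build_search_url(title, rows=20):
    """Return an Advanced Search URL for texts whose title matches `title`."""
    # Restrict the search to texts, like the "Texts" entry in the drop-down.
    query = f'title:("{title}") AND mediatype:texts'
    params = [
        ("q", query),
        ("fl[]", "identifier"),  # fields to return for each hit
        ("fl[]", "title"),
        ("rows", str(rows)),
        ("output", "json"),
    ]
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # "Example title" is a placeholder, not a real book.
    print(build_search_url("Example title"))
```

Opening the printed URL in a browser should return the matching identifiers, which can then be used with https://archive.org/details/$ID.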

If you didn't find the book you were looking for but found another that would be interesting to work on, it is strongly recommended that you check whether it is suitable for Wikisource in licensing terms (e.g. whether it is a public domain work or licensed under a compatible copyleft licence). The Internet Archive accepts contributions that are still in copyright or under restrictive licensing terms, but Wikisource will not accept works simply because they are available on archive.org: they must also meet Wikisource's licensing requirements.

DjVu file

The DjVu file can be downloaded (and uploaded to Wikimedia Commons) by following the steps below or manually tweaking the URL to the default DjVu URL format.

1. On the left side of the details page, there will be a box with the title "View the Book", as shown in Fig. 1.
2. Click on the "HTTP" link to get to the list of files. This is indicated by the red arrow in Fig. 1.


Internet Archive View Book Box 2.png
Fig. 1.
A basic form of the "View the Book" box, found in the Details pages of the Internet Archive.


3. This will open a list of files, as shown in Fig. 2.
4. Locate the file with the .djvu suffix. This is indicated by the red arrow in Fig. 2.
Other files can be downloaded instead of the DjVu. If required, proceed with the most appropriate file from the list.
  • An alternative format for text is PDF, with the .pdf suffix.
  • Audio files in the Ogg Vorbis format have the .ogg suffix.
  • Video files in the Ogg Theora format have the .ogv suffix.
  • The original scans are available from this list as well. In this example, the file sikhafghansinco00shahrich_jp2.zip is an archive of JPEG 2000 images of individual pages. This can sometimes be useful as it will contain high quality versions of illustrations, photographs and other elements of the book.
5. This is the file that needs to be uploaded to Wikimedia Commons. See Uploading (below).


Internet Archive files list 2.png
Fig. 2.
Example of a list of files on the Internet Archive.

OR

Instead of following the steps above, the DjVu download link can be obtained by manually editing the book's URL to match the default DjVu download URL format.
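
As an illustration of editing the URL by hand, the sketch below builds a download link from an item identifier. It rests on an assumption this page does not spell out: that the derived DjVu file is named after the identifier and served under /download/. Confirm this against the item's file list, since not every item has a DjVu file. The function name is hypothetical.

```python
from urllib.parse import quote


def djvu_download_url(identifier):
    """Build the presumed DjVu download URL for an Internet Archive item."""
    # Assumption: the file is called "<identifier>.djvu" and lives under
    # https://archive.org/download/<identifier>/ (check the file list).
    safe = quote(identifier, safe="")
    return f"https://archive.org/download/{safe}/{safe}.djvu"


if __name__ == "__main__":
    # Identifier taken from the file list example on this page.
    print(djvu_download_url("sikhafghansinco00shahrich"))
```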

Uploading

There are three main ways to upload the file to Wikimedia Commons.

One: ia-upload tool

Shortcut:
H:IA-Upload

The ia-upload tool is currently the easiest way to upload files from archive.org to Wikimedia Commons. You can check or contribute to its open source code.

  1. Go to IA-Upload. It will request an "OAuth" (permission to have restricted access) from your account on Wikimedia Commons at each run.
  2. Insert the archive.org identifier-access (the $ID portion of the URL as in https://archive.org/details/$ID) in the first field.
  3. Insert the desired filename for the file to be uploaded on Commons in the second field, without the File: prefix or .djvu suffix, and proceed.
  4. Review the automatic metadata, changing it as and if needed. It will be based on Commons' {{book}} template.
  5. Proceed. After a few seconds the file will have been uploaded to Commons and will be listed in your contributions.
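
The two values the tool asks for can be prepared as in the following sketch, which extracts the $ID portion from a details URL and strips the File: prefix and .djvu suffix from a filename. Both helper names are hypothetical and the code does not talk to the tool itself.

```python
from urllib.parse import urlparse


def identifier_from_url(url):
    """Return the $ID part of https://archive.org/details/$ID, or None."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    if len(parts) >= 2 and parts[0] == "details":
        return parts[1]
    return None


def commons_basename(name):
    """Strip a leading 'File:' prefix and a trailing '.djvu' suffix."""
    name = name.strip()
    if name.lower().startswith("file:"):
        name = name[len("file:"):]
    if name.lower().endswith(".djvu"):
        name = name[:-len(".djvu")]
    return name.strip()


if __name__ == "__main__":
    print(identifier_from_url("https://archive.org/details/sikhafghansinco00shahrich"))
    print(commons_basename("File:A rapid sketch of the life of Raja Radhakanta Deva Bahadur.djvu"))
```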

Two: Automatic transfer

Use the URL2Commons tool to automatically transfer the DjVu file from the Internet Archive to Wikimedia Commons.

  1. Refer to Help:URL2Commons for information on using the tool.
  2. Right click on the appropriate file in the Internet Archive file list and select "Copy Shortcut" or equivalent.
  3. Paste this into the top panel of the URL2Commons tool.
  4. Proceed as described in the URL2Commons help document.

Three: Manual download and upload

Download the file to your own computer, then upload it to Wikimedia Commons manually.

  1. To download, right click on the appropriate file in the Internet Archive file list and select "Save Target As.." or equivalent.
    This may take some time, depending on the size of the file.
    If you use download manager software of any kind, follow the instructions for that software.
  2. Once downloaded, go to Wikimedia Commons' Upload Wizard (guided upload process with helpful steps) or Upload page (quicker but requires more knowledge of Commons' policies and methods).
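
For large files, the download step can be scripted instead of using the browser. The following is a minimal sketch using only the Python standard library. It streams the file to disk in chunks so that the whole file is never held in memory. The function name and the placeholder values in the usage comment are hypothetical.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def download(url, destination, chunk_size=1024 * 1024):
    """Stream `url` to `destination` and return the number of bytes saved."""
    destination = Path(destination)
    with urlopen(url) as response, open(destination, "wb") as out:
        # Copy in chunks; large scans can be hundreds of megabytes.
        shutil.copyfileobj(response, out, chunk_size)
    return destination.stat().st_size


# Usage (file_url is the link copied from the Internet Archive file list):
#     size = download(file_url, "MyBook.djvu")
```

The saved file can then be uploaded to Commons as described in step 2.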

Others

There are other ways to upload files to Wikimedia Commons, such as the bulk uploader Commonist. These still require downloading the file(s) to your own computer before uploading to Commons.

Adding files

Files can be added to the Internet Archive by any registered user. The following information is presented for ease of use and reference for Wikisource users. However, Wikisource is not affiliated with the Internet Archive and any or all of these stages may be changed by the Archive at any time. It is strongly recommended that anyone attempting this should refer to the Internet Archive's own instructions, and follow those above the steps listed here.

These instructions are:

The following Internet Archive blog posts might be useful as well:

How to produce a DjVu file

You need to log in (don't use OpenID, it won't work[1]).

Uploading

Click "Upload" at the top-right corner. The Flash uploader (the standard "Share" button) does not work with Firefox (use Opera or Internet Explorer instead[2]) or on Linux. You can use the standard JavaScript (non-Flash) method instead, although it has a file size limit of 2 GB in Firefox (but not in Chromium). FTP upload is deprecated because it is slower and prone to crashing, but it is the only easy-to-learn option if you have to upload many files (which shouldn't be the case here).

OCR tricks

When the upload has been completed, the Internet Archive will start the "derive" work: OCR to create an XML document of the detected text based on the uploaded PDF file, conversion of that to a DjVu file with embedded text, and creation of a plain-text dump file, among others.[3]

Don't forget to set the correct language in the metadata before starting the derive (which runs automatically after upload if there's something to derive); otherwise the OCR language will be set to English and results will be poor for works in any other language. It's not possible to set multiple OCR languages, but you're invited to upload the same book twice with the two languages to have two OCRs.[4] The length of processing time depends on the size and complexity of your file, as well as the current Internet Archive backlog of conversion tests.[5] You can check your progress in the queue here and more detailed information about jobs you submitted here (must be logged in).

The Internet Archive uses professional, proprietary, commercial ABBYY software[6] that produces quite good images and OCR output in many languages and fonts, with aggressive compression[7] that maintains a high quality in the final DjVu file.[8] However, the Internet Archive sometimes produces over-compressed DjVu files with poor quality. If this happens, you can often download a PDF document and convert it manually. You can also reduce the resolution the derivation aims at, which is normally set automatically by some "guessing", via the fixed-ppi field: set it to 300 (dpi) or lower to reduce file sizes, processing time and (sometimes) errors.
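
To make the two metadata settings concrete, the sketch below collects them into a dictionary before an upload. Only the fixed-ppi field name comes from this page. The "language" key, the form of the language value and the function name are assumptions, and the code does not submit anything to the Archive.

```python
def derive_metadata(language, fixed_ppi=None):
    """Collect the metadata values that influence OCR and derivation."""
    if not language:
        # Without a language the OCR falls back to English.
        raise ValueError("set the language before the derive starts")
    metadata = {"language": language}  # assumed key name
    if fixed_ppi is not None:
        if fixed_ppi <= 0:
            raise ValueError("fixed-ppi must be a positive number")
        metadata["fixed-ppi"] = str(fixed_ppi)
    return metadata


if __name__ == "__main__":
    # "Italian" is only an example value; check which form the Archive expects.
    print(derive_metadata("Italian", fixed_ppi=300))
```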

Image formats

Book scans split into several images in tiff, jpg or jp2 format (other formats are not accepted) are converted ("derived") as well, if you put them in a properly created tar or zip archive.[9] It's usually better to upload uncompressed scans or JPEGs; the jp2 files produced in the derivation process are compressed in a way you won't be able to emulate without a lot of effort.

Troubleshooting

If you have severe problems with your deriving process and need admin intervention (tasks shown in red in your task list), ask for help at info@archive.org; they're usually amazingly helpful. General requests for help should be placed in the forums, though: don't bother them for nothing!


Step by step

Preparing the file

If uploading a collection of page scans:

  1. The page scans should each be in an image format. For example, JPEG format.
  2. The page scans should be named in the correct alphabetical order. It may be a good idea to use a naming format such as "MyScan001.jpg", "MyScan002.jpg" etc. If so, remember to use leading zeroes, otherwise page 10 will come after page 1 but before page 2.
  3. Make sure that the page scans are the only files in the folder you are using.
  4. Create a zip file of the folder containing your page scans. The file name should be in the format "Myscan_images.zip", where "Myscan" is whatever you want to call the file. The "_images" suffix is important; your scan may not derive properly later if this is omitted.
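
The naming and zipping steps above can be sketched as follows. The helper names are hypothetical. The archive is written next to the scan folder rather than inside it, and the pages are stored without extra compression, which is a choice made here because JPEG scans are already compressed.

```python
import zipfile
from pathlib import Path


def padded_name(prefix, page, total, suffix=".jpg"):
    """Return a page file name with enough leading zeroes to sort correctly."""
    width = max(3, len(str(total)))
    return f"{prefix}{page:0{width}d}{suffix}"


def build_images_zip(scan_dir, item_name):
    """Zip the folder of page scans as '<item_name>_images.zip'."""
    scan_dir = Path(scan_dir)
    # Alphabetical order is the page order, hence the zero padding.
    pages = sorted(p for p in scan_dir.iterdir() if p.is_file())
    archive = scan_dir.parent / f"{item_name}_images.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_STORED) as zf:
        for page in pages:
            zf.write(page, arcname=f"{scan_dir.name}/{page.name}")
    return archive


if __name__ == "__main__":
    # Without leading zeroes, page 10 sorts between pages 1 and 2.
    print(sorted(["MyScan1.jpg", "MyScan2.jpg", "MyScan10.jpg"]))
    print(padded_name("MyScan", 2, total=10))
```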

Files such as PDFs can just be uploaded as they are.

Uploading

Note: the following instructions are for the classic uploader, superseded by the 2013 upload and create item wizard. Most of the instructions below should be unnecessary and ignorable if you use the new, simpler uploader. A blog post How to upload a scanned book to the Internet Archive is available with many screenshots; ignore all the advice on identifiers and metadata, it's just the author's personal opinion.

  1. Log in to the Internet Archive.
  2. Click the "Upload" button at the top right of the screen.
  3. Select the file to upload.
  4. Fill in the information requested and choose an appropriate licence (this will be similar to the licences on Wikisource).
    • Title (required)
    • Description (required)
    • Keywords (required)
    • Author
    • Creative Commons Licence or Public Domain Mark
  5. Wait for the upload to complete.
  6. Click the "Share my File(s)" button.
  7. You will see the message "Please wait while your page is created..." then "Your Page is Ready!" followed by a link to the page.
  8. Clicking the link will result in a "Your item is not yet public" message.
  9. Pick a collection for your file. The options will include "movie, audio, text, etree" and "community video, community audio, community text". You will probably be using "text" and "community text". Select the appropriate collection and click the "Submit" button to the right.
    • At this stage, you might be told to wait and come back later. This text is: "Your item is in the process of being derived, and you may not replace the metadata until the derive has finished (because any changes queued now would roll back those being made by the derive). Please try this page again after your item has finished deriving. [Item History]" In this case, simply follow those instructions: try again later.
  10. In the Metadata Editor complete more information (including the information from earlier stages).
  11. Click the Submit button. This will enter the file into the log. This will take some time to complete.

Deriving

Derivation can take up to a few days. This can be monitored either through the filename or the 'Contributions' page. The various formats of the work should automatically be derived from the files that were uploaded. If this has not occurred, the "View the book" in the left-hand sidebar will not be showing the various available formats (DjVu, EPUB, Kindle, Daisy etc). Derivation failure can have numerous reasons, many of which are internal to IA and have nothing to do with the uploaded file.

First, force the derivation from the file page:

  1. Click "Edit item"
  2. You will see two choices: "change the information" and "change the files". Click "change the information".
  3. Click "Item Manager"
  4. Click "Derive"

In case this fails:

  1. Go to the 'Contributions' page.
  2. Click on 'See your contribution tasks that are not yet completed.'
  3. The screen will display a list similar to this image.
  4. If the derivation process is still running, then wait.
  5. If the process has stopped and is marked red and 'waiting for Admin', email info at archive.org, tell them about the problem and ask them to restart the derivation process. Be sure to include the link to the uploaded item's page.

Requested uploads

You can request mass upload of public domain book scans from any external website to Internet Archive by preparing

  • 1) A list of URLs of the books to download
  • 2) A CSV table with title, creator, date, description, sponsor (digitising institution) etc.
  • English Books from British India collection of 18 books from Savifa Virtual Library South Asia, University of Heidelberg. Solomon7968 (talk) 09:29, 4 February 2014 (UTC)
    • I can do the upload, but most of the work here is metadata. Please prepare 1) a list of URLs of the books to download, 2) a CSV table with title, creator, date, description, sponsor (digitising institution) etc. I can help you revise the table if you're unsure how to name the fields, but I don't have time for the data entering. --Nemo 09:46, 4 February 2014 (UTC)
    • @Solomon7968:: This set is small and ok to me to work on it. I can do the metadata sorting and the upload process, but only during the next week. This term is ok for you? I can also provide to you the resulting .CSV for further requests (ie., to help you to write new CSV files for new or bigger requests) Lugusto 19:03, 7 February 2014 (UTC)
      • Thanks for the offer! Many books in the collection are rare and not available elsewhere. I also dropped an E-Mail to their Contact person Nicole Merkel a month or so ago, but no response from them so far. Solomon7968 (talk) 03:31, 8 February 2014 (UTC)
        • Scans provided by SavifaDok for Transactions of the Agricultural and Horticultural Society of India v. II are in a very low quality. Fortunately I've found a better quality scan on GBS so I've picked this, did OCR+djvu processing locally on my machine (with ABBYY 11) and uploaded it to Commons as File:Transactions of the Agricultural and Horticultural Society of India - Vol 2.djvu. But please note that Transactions of the Agricultural and Horticultural Society of India isn't exactly a book, but a journal; GBS have also more issues for this title. I will check/work on 17 remaining files from SavifaDok tomorrow. Lugusto 19:46, 10 February 2014 (UTC)
          • @Solomon7968:: So sorry for this delay. Apparently all scans from SavifaDok are in very low quality. Again I've found a better scan quality on GBS (although there are also issues on both versions in the exactly same pages...), this time for A rapid sketch of the life of Raja Radhakanta Deva Bahadur with some notices of his ancestors, and testimonials of his character and learning. Done as in the previous book, File:A rapid sketch of the life of Raja Radhakanta Deva Bahadur.djvu. As soon as possible I will process more books from the 16 remaining. Lugusto 02:47, 15 February 2014 (UTC)
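
The CSV table asked for at the top of this section could be prepared along these lines. The column names title, creator, date, description and sponsor come from the request above. The extra "url" column, which ties each row to the list of URLs, is an assumption, as are the function name and the placeholder row.

```python
import csv
import io

# "url" is an assumed extra column; the other names come from the request.
FIELDS = ["url", "title", "creator", "date", "description", "sponsor"]


def write_request_csv(rows, stream):
    """Write one CSV row per book, refusing rows with empty fields."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for number, row in enumerate(rows, start=1):
        missing = [field for field in FIELDS if not row.get(field)]
        if missing:
            raise ValueError(f"row {number} is missing: {', '.join(missing)}")
        writer.writerow({field: row[field] for field in FIELDS})


if __name__ == "__main__":
    # Placeholder values only; replace them with the real book data.
    example = [{
        "url": "https://example.org/scans/book1.pdf",
        "title": "Example title",
        "creator": "Example author",
        "date": "1850",
        "description": "Placeholder description",
        "sponsor": "Example digitising institution",
    }]
    buffer = io.StringIO()
    write_request_csv(example, buffer)
    print(buffer.getvalue())
```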

Google Books

PDF scans derived from Google Books contain a warning which needs to be stripped off before the text is added to IA, to facilitate proofreading on Wikisource. Such uploads are normally done by the user/bot "tpb" (not affiliated with the Internet Archive): we dream of a way to suggest to tpb the books we're interested in; we can start accumulating Google Books URLs here and then maybe tpb will fetch them at some point.

Also see this Scriptorium thread opened by Yann. Solomon7968 (talk) 10:37, 4 February 2014 (UTC)

  • The work tpb has started seems to be ended years ago, but I'm not sure. In the meantime the GBS original collection grown considerably. Maybe we are in need of a tool to do direct research on GBS + warning page removal + IA upload instead? Lugusto 19:03, 7 February 2014 (UTC)
    • Many editors here equipped with the software to remove the warnings and watermarks replace the existing IA derived file on Commons with the clean one without warnings and watermarks. It is especially trouble-some for large files. An automated system for uploading to IA will help for sure. Solomon7968 (talk) 03:31, 8 February 2014 (UTC)


Admins who are also Wikisourcerors

Admins have a checkbox to rerun or interrupt pending tasks

Some Internet Archive volunteers are given admin status on specific collections and can edit all items in those collections. No volunteers are known to have admin status on the general "Community texts" collection, but they can still help in the simplest cases, namely a derive.php red row waiting for admin.

The following users are available for requests if you don't feel like disturbing the staff:

Notes

  1. See forums: Authentication error; not a valid OpenID, Login problems when I click "Share" .
  2. See forum.
  3. If your original PDF has no embedded text-layer, the derive process will automatically create a second, text-rich, PDF for you by applying the same previously detected OCR generated text to create one.

    Please note, however, that if your PDF comes from Google Books and has a first-page disclaimer notice, the derive process will detect the disclaimer page's hidden text layer, assume that the rest of the pages in the PDF also have embedded hidden text layers (which they never do) and skip the automatic creation of the second PDF file altogether. Keeping the disclaimer page but stripping it of all hidden text is the optimal approach, for reasons having to do with the complementary creation of a DjVu file at the same time. Swapping it with a suitable null or blank page will do just as well, and the last resort is deletion of the disclaimer page.

  4. See forum.
  5. Example: Vocabolario degli accademici della Crusca, 1691, took 5.1 days to derive.
  6. Version 9.0 since 2013.
  7. In the example, dimension is 1/6 compared to djvudigital output.
  8. Example: this 205 MB PDF document of a 1691 book from Gallica is converted by pdf2djvu.sh script in a hardly readable 382.4 MB djvu, in a little better readable 316.7 MB djvu by djvudigital and in a better quality 51.3 MB djvu by the Internet Archive.
  9. FAQ; documentation of the format to use. Remember: put extensions in lowercase everywhere, use tif with a single f, put the ppi value of the images in the metadata. If your archive of images is not recognized as such, it may help to edit the metadata and set its format as "Single Page Processed TIFF ZIP" (even if it's a TAR) and so on. You should probably first try the _images.zip archive format.
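
The extension advice in note 9 can be applied in bulk with a small helper like the one sketched here, which lowercases the extension and turns .tiff into .tif. The function name is hypothetical, and it returns only the bare file name, without any directory part.

```python
from pathlib import PurePath


def normalised_name(filename):
    """Lowercase the extension and spell TIFF with a single f."""
    path = PurePath(filename)
    suffix = path.suffix.lower()
    if suffix == ".tiff":
        suffix = ".tif"
    return path.stem + suffix


if __name__ == "__main__":
    for name in ("Page001.TIFF", "Page002.JPG", "page003.jp2"):
        print(name, "->", normalised_name(name))
```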