Help:Internet Archive

From Wikisource

Guidelines to downloading files from and uploading files to the Internet Archive

The Internet Archive is a non-profit digital library that holds nearly 3 million digitised books as well as music, audio, video and other files. It is one of the main sources of DjVu files for use on Wikisource. As well as files based on their own scans, the Internet Archive will also derive files (including DjVu files) from scans uploaded by its users. This can be a useful way to convert user-made scans into a DjVu file compatible with Wikisource (as well as making the work available for others).

This help page focuses on DjVu files, because that is the most used file type on Wikisource, but the process can be used for any other file type available from the Internet Archive.

Getting files

Searching

1. Go to the Internet Archive
2. Search for the book (or other text) you want. The basic search has a text field and a drop-down list. Type the title of the book in the text field and set the drop-down to "Texts".
3. Click "Go"

Fig. 1.

4. If the correct file is on the Archive, you should see it in the search results. If there are several suitable files, select the one you judge to be best. This is subjective, but a clear scan works best for proofreading, so aim for the best quality available (note that some scans have dirt or writing on the pages, which may make proofreading harder). Different scans may come from different editions. If so, the choice is yours, but the earliest available edition is a popular one.

Fig. 2.
Upper portion of the details screen.

5. If unsuccessful, you can also try following links, searching by subject, searching by author, or using the Advanced Search function.

If you did not find the book you were looking for but found another that would be interesting to work on, it is strongly recommended that you check whether it is suitable for Wikisource in licensing terms (e.g., whether it is a public domain work or licensed under a compatible copyleft licence). The Internet Archive accepts contributions that are still in copyright or under restrictive licensing terms, but Wikisource will not accept works simply because they are available on archive.org; they must also meet Wikisource's licensing requirements.

DjVu file

Note: DjVu files are no longer created for new uploads to the Internet Archive as of March 2016, so you may not find one if the book was uploaded after this date.

The DjVu file can be downloaded (and uploaded to Wikimedia Commons) by following the steps below or manually tweaking the URL to the default DjVu URL format.

1. On the right of the lower half of the details page there is a box titled "DOWNLOAD OPTIONS", as shown in Fig. 3. This section of the page will probably not be visible until you scroll down past the document viewing area.
2. Click on the link indicated by the red arrow to get to the list of files in Fig. 4.

Fig. 3.

3. This will open a list of files, as shown in Fig. 4.
4. Locate the file with the .djvu suffix. This is indicated by the red arrow in Fig. 4.
Other files can be downloaded instead of the DjVu. If required, proceed with the most appropriate file from the list.
  • An alternative format for text is PDF, with the .pdf suffix.
  • Audio files in the Ogg Vorbis format have the .ogg suffix.
  • Video files in the Ogg Theora format have the .ogv suffix.
  • The original scans are available from this list as well. In this example, the file sikhafghansinco00shahrich_jp2.zip is an archive of JPEG 2000 images of individual pages. This can sometimes be useful as it will contain high quality versions of illustrations, photographs and other elements of the book.
5. The file you selected is the one that needs to be uploaded to Wikimedia Commons. See Uploading (below).

Fig. 4.
Example of a list of files on the Internet Archive.

OR

The DjVu file download link can be retrieved by manually tweaking the book URL to the default DjVu download URL format.
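
As a rough sketch of what "tweaking the URL" can look like, the snippet below builds a candidate download URL from an item identifier. The pattern `https://archive.org/download/$ID/$ID.djvu` is an assumption based on the Archive's usual file layout and is not stated on this page; the DjVu may have a different name or may not exist at all (see the note about March 2016 above), so always check the result against the item's file list. The function name `guess_djvu_url` is hypothetical, and the identifier is inferred from the example file name used earlier.

```python
from urllib.parse import quote


def guess_djvu_url(identifier: str) -> str:
    """Return a candidate DjVu download URL for an Internet Archive item.

    Assumption: the item follows the layout
    https://archive.org/download/<ID>/<ID>.djvu
    """
    # Percent-encode the identifier so that unusual characters stay URL-safe.
    safe_id = quote(identifier.strip(), safe="")
    return f"https://archive.org/download/{safe_id}/{safe_id}.djvu"


print(guess_djvu_url("sikhafghansinco00shahrich"))
```

If the guessed URL does not resolve, fall back to the file list described in the steps above.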

Uploading

There are three main ways to upload the file to Wikimedia Commons.

One: IA Upload tool

Shortcut:
H:IA-Upload

The IA Upload tool is currently the easiest way to upload files from archive.org to Wikimedia Commons. You can check or contribute to its open source code.

  1. Go to IA-Upload. On each run it will request OAuth authorisation (permission for restricted access) from your account on Wikimedia Commons.
  2. Insert the archive.org identifier-access (the $ID portion of the URL as in https://archive.org/details/$ID) in the first field.
  3. Insert the desired filename for the file to be uploaded on Commons in the second field, without the File: prefix or .djvu suffix, and proceed.
  4. Review the automatic metadata, changing it as and if needed. It will be based on Commons' {{book}} template.
    • Note that you can select different source files for the DjVu: if you select to create the DjVu from either JP2 or PDF, then your request will be placed in a queue (displayed on the tool homepage) and will usually take about 15 minutes. If you select DjVu as the source, the upload will happen immediately (but not all IA items have this as an option).
    • Using the JP2 files as a source will not result in high quality images being uploaded to Wikimedia servers.
  5. Proceed, and after a short wait you will find the file uploaded to Commons and listed in your contributions.
  6. With certain books on archive.org, the DjVu file generated at Commons will contain a misaligned text layer owing to an unresolved technical issue. If this happens, please mark the generated DjVu for deletion at Commons and post a request for a clean scan of the work on the Scriptorium.
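
The identifier asked for in step 2 can be copied by hand, but as a minimal sketch it can also be pulled out of a details URL programmatically. The helper name `identifier_from_details_url` is hypothetical; the only thing taken from this page is the `https://archive.org/details/$ID` URL shape.

```python
from urllib.parse import urlparse


def identifier_from_details_url(url: str) -> str:
    """Extract $ID from a URL of the form https://archive.org/details/$ID."""
    # Split the path into its non-empty segments, e.g. ["details", "<ID>", ...].
    parts = [segment for segment in urlparse(url).path.split("/") if segment]
    if len(parts) < 2 or parts[0] != "details":
        raise ValueError(f"not an archive.org details URL: {url!r}")
    return parts[1]


print(identifier_from_details_url(
    "https://archive.org/details/sikhafghansinco00shahrich"
))
```

Anything after the identifier in the path (for example page or viewer-mode segments) is ignored.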

Two: Automatic transfer

Use the URL2Commons tool to automatically transfer the DjVu file from the Internet Archive to Wikimedia Commons.

  1. Refer to Help:URL2Commons for information on using the tool.
  2. Right click on the appropriate file in the Internet Archive file list and select "Copy Shortcut" or equivalent.
  3. Paste this into the top panel of the URL2Commons tool.
  4. Proceed as described in the URL2Commons help document.

Three: Manual download and upload

Download the file to your own computer, then upload it to Wikimedia Commons manually.

  1. To download, right click on the appropriate file in the Internet Archive file list and select "Save Target As.." or equivalent.
    This may take some time, depending on the size of the file.
    If you use download manager software of any kind, follow the instructions for that software.
  2. Once downloaded, go to Wikimedia Commons' Upload Wizard (guided upload process with helpful steps) or Upload page (quicker but requires more knowledge of Commons' policies and methods).

Others

There are other ways to upload files to Wikimedia Commons, such as the bulk uploader Commonist. These still require downloading the file(s) to your own computer before uploading to Commons.

Adding files

Files can be added to the Internet Archive by any registered user. The following information is presented for ease of use and reference for Wikisource users. However, Wikisource is not affiliated with the Internet Archive and any or all of these stages may be changed by the Archive at any time. It is strongly recommended that anyone attempting this should refer to the Internet Archive's own instructions, and follow those above the steps listed here.

These instructions are:

The following Internet Archive blog posts might be useful as well:

How to produce a DjVu file

You need to log in (don't use OpenID, it won't work[1]).

Uploading

Click "Upload" at the top-right corner. The Flash uploader (the standard "Share" button) does not work with Firefox (use Opera or Internet Explorer instead[2]) or on Linux. You can use the standard non-Flash JavaScript method instead, although Firefox imposes a file size limit of 2 GB (Chromium does not). FTP upload is deprecated because it is slower and prone to crashing, but it is the only easy-to-learn option if you have to upload many files (which should not be the case here).

OCR tricks

When the upload has been completed, the Internet Archive will start the "derive" work: OCR to create an XML document of the detected text from the uploaded PDF file, conversion of that to a DjVu file with embedded text, creation of a plain-text dump file, and so on.[3]

Don't forget to set the correct language in the metadata before the derive starts (it runs automatically after upload if there is something to derive); otherwise the OCR language will be set to English and the results will be poor for works in any other language. It is not possible to set multiple OCR languages, but you can upload the same book twice, once per language, to get two OCRs.[4] The processing time depends on the size and complexity of your file, as well as on the Internet Archive's current backlog of conversion tasks.[5] You can check your progress in the queue here and more detailed information about jobs you submitted here (must be logged in).

The Internet Archive uses professional, proprietary, commercial ABBYY software,[6] which gives quite good image and OCR output in many languages and fonts, together with aggressive compression[7] that maintains a high quality in the final DjVu file.[8] However, the Internet Archive sometimes produces over-compressed DjVu files of poor quality. If this happens, you can often download a PDF document and convert it manually. You can also lower the resolution the derivation aims for, which is normally "guessed" automatically, via the fixed-ppi field: setting it to 300 (dpi) or lower reduces file sizes, processing time and (sometimes) errors.
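
As a small illustration of the two settings discussed above, the sketch below assembles the relevant metadata fields into a dictionary before upload. Only the `fixed-ppi` field name comes from this page; the `language` key, the form of its value and the helper name `derive_hints` are assumptions, so check them against the Internet Archive's own metadata documentation before relying on them.

```python
from typing import Dict, Optional


def derive_hints(language: str, fixed_ppi: Optional[int] = None) -> Dict[str, str]:
    """Build the metadata fields that influence OCR and derivation.

    Assumed key names: "language" (hypothetical here) and "fixed-ppi".
    """
    language = language.strip()
    if not language or "," in language or ";" in language:
        # Only one OCR language can be set per item.
        raise ValueError("set exactly one OCR language")
    metadata = {"language": language}
    if fixed_ppi is not None:
        if fixed_ppi <= 0:
            raise ValueError("fixed-ppi must be a positive integer")
        metadata["fixed-ppi"] = str(fixed_ppi)
    return metadata


print(derive_hints("Italian", fixed_ppi=300))
```

The single-language check mirrors the limitation described above: a second language needs a second upload of the same book.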

Image formats

Book scans split into several images in tiff, jpg or jp2 format (other formats are not accepted) are converted ("derived") as well, if you put them in a properly created tar or zip archive.[9] It is usually better to upload uncompressed scans or JPEGs; the jp2 files produced in the derivation process are compressed in a way you will not be able to emulate without a lot of effort.

Troubleshooting

If you have severe problems with your derive process and need admin intervention (tasks shown in red in your tasks list), ask for help at info at archive.org; they are usually amazingly helpful. General requests for help should be placed in the forums, though: don't bother the staff for nothing!


Step by step

Preparing the file

If uploading a collection of page scans:

  1. The page scans should each be in an image format, for example JPEG.
  2. The page scans should be named so that they sort in the correct alphabetical order. It may be a good idea to use a naming format such as "MyScan001.jpg", "MyScan002.jpg" etc. If so, remember to use leading zeroes, otherwise page 10 will come after page 1 but before page 2.
  3. Make sure that the page scans are the only files in the folder you are using.
  4. Create a zip file of the folder containing your page scans. The file name should be in the format "Myscan_images.zip", where "Myscan" is whatever you want to call the file. The "_images" suffix is important; your scan may not derive properly later if this is omitted.

Files such as PDFs can just be uploaded as they are.
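
The preparation steps for page scans can be sketched as a short script. This is a minimal illustration, not an official tool: the helper names `page_name` and `zip_page_scans` are hypothetical, the accepted suffixes are taken from the image formats mentioned earlier on this page, and placing the scans inside a folder within the archive is an assumption based on "a zip file of the folder".

```python
import zipfile
from pathlib import Path

# Image suffixes treated as page scans (compared in lowercase).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".tif", ".jp2"}


def page_name(prefix: str, number: int, width: int = 3) -> str:
    """Zero-padded name so that page 10 sorts after page 2, not after page 1."""
    return f"{prefix}{number:0{width}d}.jpg"


def zip_page_scans(folder: str, name: str) -> Path:
    """Zip the page scans in `folder` into '<name>_images.zip' beside it."""
    src = Path(folder).resolve()
    pages = sorted(
        (p for p in src.iterdir()
         if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES),
        key=lambda p: p.name,
    )
    if not pages:
        raise ValueError(f"no page scans found in {src}")
    target = src.parent / f"{name}_images.zip"
    # The scans are already compressed images, so store them as they are.
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_STORED) as archive:
        for page in pages:
            archive.write(page, arcname=f"{src.name}/{page.name}")
    return target


print(page_name("MyScan", 2), page_name("MyScan", 10))
```

Calling `zip_page_scans("scans", "Myscan")` would then write `Myscan_images.zip` next to the `scans` folder, with the required "_images" suffix.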

Uploading

Note: the following instructions are for the classic uploader, superseded by the 2013 upload and create item wizard. Most of the instructions below should be unnecessary and ignorable if you use the new, simpler uploader. A blog post How to upload a scanned book to the Internet Archive is available with many screenshots; the advice on identifiers and metadata is just the author's personal opinion and is optional, however.

  1. Log in to the Internet Archive.
  2. Click the "Upload" button at the top right of the screen.
  3. Select the file to upload
  4. Fill in the information requested and choose an appropriate licence (this will be similar to the licences on Wikisource).
    • Title (required)
    • Description (required)
    • Keywords (required)
    • Author
    • Creative Commons Licence or Public Domain Mark
  5. Wait for the upload to complete.
  6. Click the "Share my File(s)" button.
  7. You will see the message "Please wait while your page is created..." then "Your Page is Ready!" followed by a link to the page.
  8. Clicking the link will result in a "Your item is not yet public" message.
  9. Pick a collection for your file. The options will include "movie, audio, text, etree" and "community video, community audio, community text". You will probably be using "text" and "community text". Select the appropriate collection and click the "Submit" button to the right.
    • At this stage, you might be told to wait and come back later. This text is: "Your item is in the process of being derived, and you may not replace the metadata until the derive has finished (because any changes queued now would roll back those being made by the derive). Please try this page again after your item has finished deriving. [Item History]" In this case, simply follow those instructions: try again later.
  10. In the Metadata Editor complete more information (including the information from earlier stages).
  11. Click the Submit button. This will enter the file into the log, which will take some time to complete.

Deriving

Derivation can take up to a few days. It can be monitored either through the filename or the 'Contributions' page. The various formats of the work should be derived automatically from the files that were uploaded. If this has not occurred, the "View the book" box in the left-hand sidebar will not show the various available formats (DjVu, EPUB, Kindle, Daisy etc). Derivation failure can have many causes, many of which are internal to IA and have nothing to do with the uploaded file.

First, force the derivation from the file page:

  1. Click "Edit item"
  2. You will see two choices: "change the information" and "change the files". Click "change the information".
  3. Click "Item Manager"
  4. Click "Derive"

In case this fails:

  1. Go to the 'Contributions' page.
  2. Click on 'See your contribution tasks that are not yet completed.'
  3. The screen will display a list similar to this image.
  4. If the derivation process is still running, then wait.
  5. If the process has stopped and is marked red and 'waiting for Admin', then email info at archive.org, advise them of the problem and request a restart of the derivation process. Be sure to include the link to the uploaded page.

Requested uploads

You can request a mass upload of public domain book scans from any external website to the Internet Archive by preparing:

  1. A list of URLs of the books to download
  2. A CSV table with title, creator, date, description, sponsor (digitising institution) etc.
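
As a sketch of how the two items above might be prepared together, the snippet below turns a list of book records into a plain URL list and a CSV table. The column names follow the fields named above; the helper name `build_request_files`, the record layout and the sample record are hypothetical placeholders, and the exact format expected by whoever handles the request is not specified on this page.

```python
import csv
import io
from typing import Dict, List, Tuple

CSV_FIELDS = ["title", "creator", "date", "description", "sponsor"]


def build_request_files(books: List[Dict[str, str]]) -> Tuple[str, str]:
    """Return (URL list, CSV table) as text, one row per book, in the same order."""
    urls = io.StringIO()
    table = io.StringIO()
    writer = csv.DictWriter(table, fieldnames=CSV_FIELDS, lineterminator="\n")
    writer.writeheader()
    for book in books:
        missing = [key for key in ["url", *CSV_FIELDS] if not book.get(key)]
        if missing:
            raise ValueError(f"record is missing {missing}: {book!r}")
        urls.write(book["url"] + "\n")
        # Copy only the CSV columns so that extra keys do not reach the writer.
        writer.writerow({key: book[key] for key in CSV_FIELDS})
    return urls.getvalue(), table.getvalue()


# Placeholder record for illustration only; not a real book or URL.
sample = [{
    "url": "https://example.org/scan1.pdf",
    "title": "Example title",
    "creator": "Example, A.",
    "date": "1901",
    "description": "Placeholder description",
    "sponsor": "Example Library",
}]
url_list, csv_table = build_request_files(sample)
print(url_list)
print(csv_table)
```

Keeping the URL list and the CSV rows in the same order makes it easy to match each scan to its metadata.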

Google Books

PDF scans derived from Google Books contain a warning page, which needs to be stripped off before the file is added to IA, to facilitate proofreading on Wikisource. This is normally done by the user/bot "tpb" (not affiliated with the Internet Archive): we dream of a way to suggest to tpb the books we're interested in; we can start accumulating Google Books URLs here and then maybe tpb at some point will fetch them.

Also see this Scriptorium thread opened by Yann. Solomon7968 (talk) 10:37, 4 February 2014 (UTC)

  • The work tpb has started seems to be ended years ago, but I'm not sure. In the meantime the GBS original collection grown considerably. Maybe we are in need of a tool to do direct research on GBS + warning page removal + IA upload instead? Lugusto 19:03, 7 February 2014 (UTC)
    • Many editors here equipped with the software to remove the warnings and watermarks replace the existing IA derived file on Commons with the clean one without warnings and watermarks. It is especially trouble-some for large files. An automated system for uploading to IA will help for sure. Solomon7968 (talk) 03:31, 8 February 2014 (UTC)

The easiest way to remove the warning page is with DjView, see Help:DjVu files#Removing a copyright page. The best place to ask for someone to do it for you is probably the Repairs (and moves) section of the Scriptorium. Beleg Tâl (talk) 03:13, 10 January 2017 (UTC)


Admins who are also Wikisourcerors

Admins have a checkbox to rerun or interrupt pending tasks

Some Internet Archive volunteers are given admin status on specific collections and can edit all items in those collections. No volunteers are known to have admin status on the general "Community texts" collection, but they can still help in the simplest cases, namely a derive.php red row waiting for admin or moving items into collections.

The following users are available for requests if you don't feel like disturbing the staff:

Notes

  1. See forums: Authentication error; not a valid OpenID, Login problems when I click "Share".
  2. See forum.
  3. If your original PDF has no embedded text-layer, the derive process will automatically create a second, text-rich, PDF for you by applying the same previously detected OCR generated text to create one.

    Please note, however, that if your PDF comes from Google Books and has a first-page disclaimer notice, the derive process will detect the disclaimer page's hidden text layer, assume that the rest of the pages in the PDF also have embedded hidden text layers (which they never do), and skip the automatic creation of the second PDF file altogether. Keeping the disclaimer page but stripping it of all hidden text is the optimal approach, for reasons having to do with the complimentary creation of a DjVu file at the same time; swapping it with a suitable null or blank page will do just as well, and the last resort is deletion of the disclaimer page.

  4. See forum.
  5. Example: Vocabolario degli accademici della Crusca, 1691, took 5.1 days to derive.
  6. Version 9.0 since 2013.
  7. In the example, dimension is 1/6 compared to djvudigital output.
  8. Example: this 205 MB PDF document of a 1691 book from Gallica is converted by the pdf2djvu.sh script into a hardly readable 382.4 MB djvu, by djvudigital into a slightly more readable 316.7 MB djvu, and by the Internet Archive into a better quality 51.3 MB djvu.
  9. FAQ; documentation of the format to use. Remember: put extensions in lowercase everywhere, use tif with a single f, and put the ppi value of the images in the metadata. If your archive of images is not recognized as such, it may help to edit the metadata and set its format as "Single Page Processed TIFF ZIP" (even if it's a TAR) and so on. You should probably first try the _images.zip archive format.