Wikisource:Sources/Ebook.lib.hku.hk notes
I just struck a rich vein of public domain document-ore: in the course of some Wikipedia research I found the domain ebook.lib.hku.hk. These ebooks each seem to be composed of a bunch of single-page .pdfs, so I assume they're actually original scans and not re-publications from elsewhere on the net. The distribution of titles amongst genres is also distinctly different from Project Gutenberg, Internet Archive, or Google Books: there's a higher percentage of technical works (because China was especially interested in Western tech pre-1923, I'm guessing?), works on learning English, and Hong Kong publications. (Curiously, I do not seem to have come across anything written in Chinese.)
I can't seem to find a main index or search page for all the books; if someone else finds it, be sure to post it here. So I have been using a domain-scoped Google search. To get the proper ebook interface you have to trim the URL down to the last slash; so if your search result takes you to something like
http://ebook.lib.hku.hk/CADAL/B31423735/index.html
trim it down to
http://ebook.lib.hku.hk/CADAL/B31423735/
And then click the "Go" button in your browser (or pressing "Enter" on the keyboard usually works too). Unfortunately the PDFs do not seem to be OCRed, so any given text would probably still need to be transcribed. Windows users, see my notes on the free ConcatPDF tool, which will allow you to agglomerate all these one-page .pdfs into a single big .pdf for conversion to .djvu. N.B. that you have to install those two Microsoft libraries before you try to install ConcatPDF. If anyone figures out how to just download the entire book at once, let the rest of us know... --❨Ṩtruthious ℬandersnatch❩ 11:46, 15 May 2008 (UTC)
- Examining the JavaScript at the HKU site (http://ebook.lib.hku.hk/res/js/heading.js) reveals that the PDF files are stored in a “pdf” subdirectory under each title’s directory. So, for The Art of Cross-Examination (http://ebook.lib.hku.hk/CADAL/B31423735/index.html), the images are in http://ebook.lib.hku.hk/CADAL/B31423735/pdf/. Files are named nnnnnnnn.pdf, where nnnnnnnn is the page number (padded out with zeroes to make an 8.3 filename, so 00000001.pdf, 00000002.pdf, etc.). Although you can’t browse the contents of the pdf directory, you can request individual files. If you know that the last page number is (for instance) 289, you should be able to write a short script that calls wget 289 times and get them all that way; a sketch follows below. Tarmstro99 00:15, 16 May 2008 (UTC)
- And if you were to do such a thing and concatenate the scans, you might end up with something like Image:The Art of Cross-Examination.djvu. :-) Tarmstro99 01:39, 16 May 2008 (UTC)
Sweet. It's alive! It's alive! --❨Ṩtruthious ℬandersnatch❩ 04:05, 16 May 2008 (UTC)
- It sounds like an interesting book: here is column one of a two-column NYT book review. John Vandenberg (chat) 07:48, 17 May 2008 (UTC)
People are seeing a high percentage of "dud" .pdf files in these ebooks, but I have gotten through to some clear pages. Hopefully there are some entire books that work, or perhaps there's a particular .pdf reader you have to use... I'm going to keep looking into it. --❨Ṩtruthious ℬandersnatch❩ 23:25, 15 May 2008 (UTC)
Here's a Python program to download a 700-page book:
import urllib

# Python 2 script: download pages 1 through 700 as individual one-page PDFs.
# range(1, 701) is needed to include page 700.
for x in range(1, 701):
    s = ("00000000" + str(x))[-8:]  # zero-pad the page number to eight digits
    urllib.urlretrieve('http://ebook.lib.hku.hk/CADAL/B31440708/pdf/' + s + '.pdf', s + '.pdf')
and you can then merge the files with:
pdftk *.pdf cat output combined.pdf
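Because the filenames are zero-padded, the shell's *.pdf glob expands in page order, so pdftk stitches the pages in the right sequence. For the conversion to .djvu mentioned above, one route (just a suggestion on my part; any PDF-to-DjVu converter should do) is the free pdf2djvu tool:
pdf2djvu -o combined.djvu combined.pdf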