User:Alex brolloBot

From Wikisource

Hi all, this is a new bot account linked to my user account User:Alex brollo. I presume I'll use this bot just to help some friends (see User talk:Alex brollo), since my time is devoted to it.source. But... who can say? ;-) --Alex brolloBot (talk) 10:30, 2 May 2010 (UTC)

Features

The bot uses the pywikipedia framework for the basics (mainly wikipedia.py and some utilities such as pagegenerators.py), imported as a library by my own Python scripts; here it will only do occasional ad hoc jobs, so there's no need to flag it. Its philosophy is similar to that of it:User:Alebot, though simpler.

Jan 2011 update

Right now the bot is loading pages into Index:Horse shoes and horse shoeing.djvu. I know from the Scriptorium discussions that bot uploads of pages are not appreciated; however, I'm not running the usual pywikipedia script but a DIY test procedure:

  1. extract the djvu text layer code with the djvused routine (output-txt command);
  2. extract the text from that file, using the field names and commands (select, paragraph, line, word) and adding "begin page" tags;
  3. fix common scannos (mainly spaces near punctuation) and newline characters;
  4. merge words hyphenated at the end of a line.

The resulting file is converted from UTF-8 to unicode; then the text of each page (delimited by the "begin page" tags) is loaded into the Page: namespace.
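
Steps 3 and 4, plus the UTF-8 decoding, could be sketched roughly as follows. This is not the bot's actual script: the function names (fix_scannos, merge_hyphenated, clean_page) and the exact cleaning rules are assumptions chosen for illustration, and a real run would need a longer list of scannos.

```python
import re

def fix_scannos(text):
    """Step 3 (assumed rules): tidy newlines and spaces near punctuation."""
    text = text.replace("\r\n", "\n")                 # normalise newline characters
    text = re.sub(r"[ \t]+([,.;:!?])", r"\1", text)   # "castle ," -> "castle,"
    text = re.sub(r"[ \t]{2,}", " ", text)            # collapse repeated spaces
    return text

def merge_hyphenated(lines):
    """Step 4: move the second half of a word split by an end-of-line hyphen
    up to the previous line."""
    out = []
    for line in lines:
        if out and out[-1].endswith("-") and line.strip():
            head, _, rest = line.lstrip().partition(" ")
            out[-1] = out[-1][:-1] + head   # drop the hyphen, glue the halves
            if rest:
                out.append(rest)
        else:
            out.append(line)
    return out

def clean_page(raw_bytes):
    """Decode UTF-8 bytes, then apply the two fixes to one page of text."""
    text = raw_bytes.decode("utf-8")
    return "\n".join(merge_hyphenated(fix_scannos(text).split("\n")))
```

Note that this naive merge also joins genuine compounds (a "HORSE-" / "SHOES" break would become "HORSESHOES"), so the result still needs a human eye.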

This is only a preliminary test; I guess that, by using the coordinates too, much more automated formatting could be obtained from the djvu text code:

  1. the coordinates of a line correlate with its position on the page. Centering could be automated.
  2. the height of a line correlates with font size. Font sizing could be automated.
  3. perhaps the lengths of contiguous sets of lines mark the "poem areas", allowing automated use of the poem tag.
  4. the features of the first line (position, spacing of blocks of words) could be used to extract the noinclude header and footer of the page text.
  5. ...... other?
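
As a rough illustration of points 1 and 2, the boxes could be compared like this. It is only a sketch under assumptions: boxes are (xmin, ymin, xmax, ymax) tuples as they appear in the djvused output, and the function names and the tolerance and min_margin thresholds are hypothetical values that would need tuning on real pages.

```python
from statistics import median

def is_centered(line_box, column_box, tolerance=0.05, min_margin=0.10):
    """Guess whether a line is centered inside its column.

    A line counts as centered when both side margins are sizeable and
    nearly equal; thresholds are fractions of the column width.
    """
    lx0, _, lx1, _ = line_box
    cx0, _, cx1, _ = column_box
    width = cx1 - cx0
    left, right = lx0 - cx0, cx1 - lx1
    return (min(left, right) >= min_margin * width
            and abs(left - right) <= tolerance * width)

def relative_size(line_box, body_line_heights):
    """Height of a line relative to the median height of ordinary body lines."""
    height = line_box[3] - line_box[1]
    return height / median(body_line_heights)

column = (274, 465, 2612, 4494)
print(is_centered((275, 4411, 2204, 4494), column))   # running header: flush left
print(is_centered((900, 3000, 1986, 3072), column))   # made-up line with equal margins
```

A relative size well above 1 would suggest a heading; a full-width line never counts as centered because its margins are too small.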

Here is a small piece of the original djvu text layer, from Page:Horse shoes and horse shoeing.djvu/429 (it will be uploaded within an hour, at 60' throttle):

# ------------------------- 
select "horseshoesandho00flemgoog_0429.djvu"
set-txt
(page 274 465 2612 4494
 (column 274 465 2612 4494
  (region 274 465 2612 4494
   (para 275 4411 2204 4494
    (line 275 4411 2204 4494
     (word 275 4411 391 4487 "398")
     (word 635 4431 1220 4494 "HORSE-SHOES")
     (word 1268 4435 1444 4493 "AND")
     (word 1489 4434 2204 4494 "HORSESHOEING.")))
   (para 274 4093 2568 4318
    (line 277 4213 2568 4318
     (word 277 4213 540 4313 "proach")
     (word 576 4239 651 4295 "to")
     (word 689 4215 827 4287 "any")
     (word 862 4216 1052 4312 "large")
     (word 1103 4241 1292 4296 "town")
     (word 1327 4243 1409 4289 "or")
     (word 1450 4225 1682 4313 "castle,")
     (word 1718 4218 2085 4316 "inquiring")
     (word 2121 4243 2185 4315 "if")
     (word 2225 4243 2379 4318 "that")
     (word 2413 4244 2568 4291 "were"))
    (line 274 4093 758 4186
     (word 274 4093 710 4186 "Jerusalem.'")
     (word 738 4148 758 4174 "'")))
   (para 275 3090 2590 4064
    (line 424 3973 2579 4064
     (word 424 3990 605 4062 "This")
     (word 662 3989 953 4061 "allusion")
     (word 1003 3989 1056 4059 "is")
     (word 1112 3973 1416 4061 "curious,")
     (word 1466 3990 1839 4064 "inasmuch")
     (word 1900 3991 1971 4037 "as")
     (word 2033 3990 2086 4063 "it")
     (word 2147 3991 2449 4064 "informs")
     (word 2506 3991 2579 4038 "us"))
    (line 276 3838 2582 3938
     (word 276 3863 430 3935 "that")
     (word 467 3864 652 3911 "oxen")
     (word 681 3863 856 3909 "were")
     (word 892 3845 1083 3934 "shod,")
     (word 1111 3846 1274 3933 "and,")
     (word 1303 3863 1372 3909 "as")
     (word 1402 3865 1463 3936 "if")
     (word 1488 3839 1898 3938 "something")
     (word 1933 3838 2105 3911 "very")
     (word 2128 3850 2582 3938 "remarkable,"))

Here is the resulting text after Python filtering and fixing:

398 HORSE-SHOES AND HORSESHOEING.

proach to any large town or castle, inquiring if that were Jerusalem.' '

This allusion is curious, inasmuch as it informs us that oxen were shod, and, as if something very remarkable,
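
The reduction from the text layer to plain paragraphs can be approximated in a few lines. This is a sketch, not the original filter: dsed_to_paragraphs and SAMPLE are names invented here, the sample is abridged from the excerpt above, and backslash escapes inside the quoted words are matched but not decoded.

```python
import re

# One "(word xmin ymin xmax ymax "text")" row; group 5 is the quoted text.
WORD = re.compile(r'\(word (\d+) (\d+) (\d+) (\d+) "((?:[^"\\]|\\.)*)"')

def dsed_to_paragraphs(dsed_text):
    """Collect the words of each "(para ...)" block into one string."""
    paragraphs = []
    for raw in dsed_text.splitlines():
        row = raw.strip()
        if row.startswith("(para "):
            paragraphs.append([])          # a new paragraph starts here
            continue
        match = WORD.match(row)
        if match:
            if not paragraphs:             # words found before any "(para"
                paragraphs.append([])
            paragraphs[-1].append(match.group(5))
    return [" ".join(words) for words in paragraphs if words]

SAMPLE = r'''
   (para 275 4411 2204 4494
    (line 275 4411 2204 4494
     (word 275 4411 391 4487 "398")
     (word 635 4431 1220 4494 "HORSE-SHOES")
     (word 1268 4435 1444 4493 "AND")
     (word 1489 4434 2204 4494 "HORSESHOEING.")))
   (para 274 4093 2568 4318
    (line 274 4093 758 4186
     (word 274 4093 710 4186 "Jerusalem.'")
     (word 738 4148 758 4174 "'")))
'''

for paragraph in dsed_to_paragraphs(SAMPLE):
    print(paragraph)
    print()
```

Joining every line of a paragraph with a single space is what gives one long line per paragraph, as in the text shown above.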

First test of word height analysis

This is the output of a simple script that calculates the distribution of word heights, in pixels, over 5000 rows (starting from the 3000th) of the dsed file, considering only the rows that begin with "(word ":

>>> l=analisiDsed(base=3000,n=5000)
Read 237517 rows from ../Nconvert/djvulab/horse.dsed
>>> for i in l:
	print "*",i[0],i[1]

	
* 16 1
* 18 5
* 19 1
* 20 1
* 21 4
* 22 2
* 23 9
* 24 7
* 25 2
* 26 3
* 27 4
* 28 3
* 29 2
* 30 3
* 31 8
* 32 5
* 33 1
* 34 3
* 35 4
* 36 25
* 37 55
* 38 33
* 39 13
* 40 2
* 41 4
* 42 6
* 43 8
* 44 19
* 45 53
* 46 120
* 47 101
* 48 40
* 49 9
* 50 7
* 51 9
* 52 32
* 53 68
* 54 79
* 55 99
* 56 180
* 57 259
* 58 202
* 59 88
* 60 35
* 61 19
* 62 12
* 63 10
* 64 7
* 65 12
* 66 4
* 67 5
* 68 15
* 69 50
* 70 162
* 71 297
* 72 532
* 73 450
* 74 222
* 75 99
* 76 79
* 77 45
* 78 33
* 79 21
* 80 8
* 81 13
* 82 3
* 84 1
* 85 5
* 86 13
* 87 22
* 88 28
* 89 36
* 90 30
* 91 17
* 92 7
* 93 5
* 94 9
* 95 23
* 96 55
* 97 112
* 98 106
* 99 71
* 100 36
* 101 3
* 103 1
* 106 1
>>> 
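
The code of analisiDsed is not shown here; a comparable count could be obtained along these lines. The sketch is written for current Python, and word_height_histogram with its base and n parameters is a hypothetical stand-in that only mirrors the call above.

```python
import re
from collections import Counter

# Matches a "(word xmin ymin xmax ymax ..." row, with optional indentation.
WORD_ROW = re.compile(r'\s*\(word (\d+) (\d+) (\d+) (\d+) ')

def word_height_histogram(rows, base=0, n=None):
    """Return sorted (height, count) pairs for the word rows in a slice."""
    window = rows[base:] if n is None else rows[base:base + n]
    counts = Counter()
    for row in window:
        match = WORD_ROW.match(row)
        if match:
            ymin, ymax = int(match.group(2)), int(match.group(4))
            counts[ymax - ymin] += 1       # height = ymax - ymin
    return sorted(counts.items())

rows = [
    '     (word 276 3863 430 3935 "that")',
    '     (word 467 3864 652 3911 "oxen")',
    '    (line 276 3838 2582 3938',
    '     (word 681 3863 856 3909 "were")',
]
for height, count in word_height_histogram(rows):
    print("*", height, count)
```

Rows that are not word rows (page, column, region, para, line) are simply skipped.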

Clearly a multimodal distribution, with two main peaks at 57 px and 72 px. I presume these are typical heights for ordinary characters such as "T" and "a". What is the pattern of a "normal" row? Of a row with upper-case characters only? Of a row with a different font size? What kind of fast statistics could accept or reject probable formatting? Difficult, but far from impossible, tasks for the unbelievable power of even a slow CPU!
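
One possible starting point for such "fast statistics" is to locate the peaks of the histogram and then describe each row by the peak nearest to each of its words. The sketch below is only one guess at an approach: find_peaks, row_signature and the min_count and radius defaults are all hypothetical.

```python
from collections import Counter

def find_peaks(hist, min_count=50, radius=2):
    """Heights whose count is at least min_count and strictly larger than
    the counts of every neighbour within +/- radius pixels."""
    counts = dict(hist)
    peaks = []
    for height, count in hist:
        if count < min_count:
            continue
        neighbours = [counts.get(height + d, 0)
                      for d in range(-radius, radius + 1) if d != 0]
        if all(count > other for other in neighbours):
            peaks.append(height)
    return peaks

def row_signature(word_heights, peaks):
    """Count how many words of a row fall nearest to each peak."""
    return Counter(min(peaks, key=lambda p: abs(p - h)) for h in word_heights)

# A slice of the table above, around the two tallest peaks.
hist = [(55, 99), (56, 180), (57, 259), (58, 202), (59, 88),
        (70, 162), (71, 297), (72, 532), (73, 450), (74, 222)]
peaks = find_peaks(hist)
print(peaks)
print(row_signature([72, 47, 46], peaks))
```

A row whose signature holds only the taller peaks might be a candidate for an upper-case heading, while a row whose heights all sit far from every peak might be set in a different font size; both rules are guesses that would need checking against proofread pages.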