Page:Login USENIX Newsletter feb1983.djvu/10

From Wikisource
Jump to navigation Jump to search
This page has been proofread, but needs to be validated.

;login:

This is more easily understood by considering three cases. If the substring lies wholly to the right of the cutoff, the match will terminate successfully. If there is an overlap, the cutoff becomes “soft” and the match continues. If the substring lies completely to the left of the cutoff, then a match would have been discovered for an earlier pathname, so we need not search these characters! Technically, cutoff must be re-anchored to path immediately after matches. This condition is omitted above for the sake of clarity. Statistics on overlap have not been garnered, but a 40-50% speedup is consistently observed.

The author has not discovered this refinement in the literature.

Two Tier Technique

Shell-style filename expansion without undue slowdown can be had by first performing the fast search on a metacharacter-free component of name, then applying regular expression syntax “globbing” to these selected paths via the slower recursive amatch function internal to find. Ergo,

puts ( path );

becomes

if ( globchars = = NO | amatch ( path, name ) )
puts ( path );

where globchars is set if name contains shell glob characters. Using wildcarding, a primitive man command might be

vtroff -man ‘find ’*man* , ’"$l"’.[l-9]’‘

Diminishing Returns

Production find code at Ames exacts a further 20-25% space compression (entropy reduction) by assigning single non-printing ascii codes to the most common 128 bigrams, ".c" and ".P" figure prominently. Room for these codes is made by reserving only 28 count codes for the likeliest "differential" counts (the interline difference between one prefix count and the next), along with a "switch" code for out-of-range counts (remember the possible 1024 byte pathnames, courtesy BSD 4.2). Printable ascii comprises the filename residue. We will not dwell on this rather ad hoc means, which barely reduces search time.

Other algorithms to address the time-space complexity tradeoff such as Huffman or restricted variability coding [Reghbati, 1981] do not look promising — they only change an I/O-bound process to a compute-bound one. Some experiments were done with the inverted file programs inv and hunt. Here, process startup overhead (the fgrep call to disambiguate “false drops”) and space consumption (full pathnames plus an index) make inv invocations noncompetitive. Boyer-Moore sublinear search [Boyer, 1977] or macro model methods [Storer/Szymanski, 1982] might be employed, but must concern typically short 4-10 character patterns and equally short post-compression pathname content, for all their added complexity.

To conclude, we are content to scan 19000 filenames in several seconds using 180 blocks and two extra pages of C code.

REFERENCES

  • Boyer, R. S. A Fast String Searching Algorithm, Commun. ACM, Vol. 20, No. 10, October 1977.
  • Morris, R. and Thompson, K. Webster’s Second on the Head of a Pin, Unpublished Technical Memo, Bell Laboratories, Murray Hill, N. J., 1974.
  • Reghbati, H. K. An Overview of Data Compression Techniques , Computer, Vol. 14, No. 4, April 1981.
  • Storer, J. A. and Szymanski, T. G. Data Compression via Textual Substitution , J. ACM, Vol. 29, No. 4, October 1982.


10
March 1983
Volume 8, Number 1