;login:

Finding Files Fast

James A. Woods

Informatics General Corporation

NASA Ames Research Center

Moffett Field, California 94035

January 15, 1983

ABSTRACT

A fast filename search facility for UNIX is presented. It consolidates two data compression methods with a novel string search technique to rapidly locate arbitrary files. The code, integrated into the standard find utility, consults a preprocessed database, regenerated daily. This contrasts with the usual mechanism of matching search keys against candidate items generated on-the-fly from a scattered directory structure. The pathname database is an incrementally-encoded lexicographically sorted list (sometimes referred to as a “front-compressed” file) which is also subjected to common bigram coding to effect further space reduction. The storage savings are a factor of five to six over the standard ascii representation. The list is scanned using a modified linear search specially tailored to the incremental encoding; typical “user time” required by this algorithm is 40%-50% less than with naive search.
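As a rough illustration of the incremental ("front-compressed") encoding mentioned above, the following Python sketch stores each pathname of a sorted list as a count of leading characters shared with its predecessor, plus the remaining suffix. The function names front_encode and front_decode and the sample paths are hypothetical, chosen only for this example. The sketch does not reproduce the actual database layout consulted by find, and it omits the bigram coding step entirely.

```python
def front_encode(paths):
    """Encode a lexicographically sorted list of pathnames incrementally.

    Each entry becomes (shared, suffix): the number of leading characters
    shared with the previous pathname, and the characters that remain.
    """
    encoded = []
    prev = ""
    for path in paths:
        shared = 0
        limit = min(len(prev), len(path))
        while shared < limit and prev[shared] == path[shared]:
            shared += 1
        encoded.append((shared, path[shared:]))
        prev = path
    return encoded


def front_decode(encoded):
    """Rebuild the original pathnames from (shared, suffix) pairs."""
    paths = []
    prev = ""
    for shared, suffix in encoded:
        path = prev[:shared] + suffix
        paths.append(path)
        prev = path
    return paths


# Hypothetical sample data; any sorted list of pathnames would do.
paths = sorted([
    "/usr/src/cmd/find.c",
    "/usr/src/cmd/grep.c",
    "/usr/src/cmd/ls.c",
])
encoded = front_encode(paths)
# encoded == [(0, "/usr/src/cmd/find.c"), (13, "grep.c"), (13, "ls.c")]
```

For the three sample paths, every entry after the first shrinks to a count of 13 and a short suffix, which suggests where the space savings on long, similar pathnames come from.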

Introduction

Locating files in a computer system, or network of systems, is a common activity. UNIX users have recourse to a variety of approaches, ranging from manipulation of cd, ls, and grep commands, to specialized programs such as U. C. Berkeley’s whereis and fleece, to the more general UNIX find. The Berkeley fleece is unfortunately restricted to home directories, and whereis is limited to eking out system code/documentation residing in standard places. The arbitrary

find / -name "*<filename>*" -print

will certainly locate files when the associated directory structure cannot be recalled, but is inherently slow as it recursively descends the entire file system to mercilessly thrash about the disk. Impatience has prompted us to develop an alternative to the “seek and ye shall find” method of pathname search.

Precomputation

Why not simply build a static list of all files on the system to search with grep? Alas, a healthy system with 20000 files contains upwards of 1000 blocks of filenames, even with an abbreviated /u (vs. /usr) adopted for user home prefixes. Grep on our unloaded 30-40 block/second PDP 11/70 system demands half a minute for the scan. This is unacceptable for an oft-used command.
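The half-minute estimate can be checked with a line of arithmetic, assuming the scan time is dominated by reading the list at the quoted block rate. The helper name scan_seconds is hypothetical and used only for this illustration.

```python
def scan_seconds(blocks, blocks_per_second):
    """Estimate the time to read `blocks` sequentially at the given rate."""
    return blocks / blocks_per_second


# Figures quoted in the text: about 1000 blocks, read at 30-40 blocks/second.
slow = scan_seconds(1000, 30)  # a little over 33 seconds
fast = scan_seconds(1000, 40)  # 25 seconds
```

Both ends of the range land near thirty seconds, consistent with the figure given for grep.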

Incidentally, it is not much of a sacrifice to be unable to reference files which are less than a day old: either the installer is likely to be contactable, or the file is not quite ready for use! Well-aged files originated by other groups, usually with different filesystem naming conventions, are the probable candidates for search.

Compression

To speed access for the application, one might consider binary search or hashing, but these schemes do not work well for partial matching, where we are interested in portions of pathnames. Though fast, the methods do not save space, which is often at a premium. An easily implementable

March 1983

Volume 8, Number 1