Page:Untangling the Web.pdf/35

From Wikisource

DOCID: 4046925

UNCLASSIFIED//FOR OFFICIAL USE ONLY

  1. Learn the search syntax of each search engine you use (never assume). Most search engines use double quotes ("") to enclose a phrase and the plus (+) and minus (-) signs to indicate "must include" and "must exclude" respectively. But these are by no means universal rules, especially with international or metasearch engines.
  2. The default operator for all major US search engines is now AND. As of February 2002, no major search engine used OR as its default operator. However, most search engines will let you use OR in the simple search box. Yahoo and Google, for example, permit it there, but you must capitalize the OR.
  3. Keep in mind that because HTML does not have a "date" tag, "date" can mean many things: the creation date, the date the page was last modified, or the date the search engine found the page. I do not recommend searching by date except when using weblog, news, or newsgroup search engines.
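
The conventions in items 1 and 2 can be made concrete with a small sketch. It is a toy matcher, not any search engine's actual parser: the names `TOKEN`, `parse_query`, and `matches` are hypothetical, and matching is by plain substring, which a real engine would not do. It assumes double quotes for phrases, + and - for "must include" and "must exclude", AND as the default operator, and OR recognized only when capitalized.

```python
import re

# A token is an optionally signed "quoted phrase" or a run of non-space characters.
TOKEN = re.compile(r'([+-]?)"([^"]+)"|(\S+)')


def parse_query(query):
    """Turn a query string into clauses that are ANDed together.

    Each clause is a dict: {"exclude": bool, "terms": [alternatives]}.
    """
    clauses = []
    join_with_or = False
    for sign, phrase, word in TOKEN.findall(query):
        if word == "OR":  # only the capitalized form is treated as an operator
            join_with_or = bool(clauses)
            continue
        if word:
            if word[0] in "+-" and len(word) > 1:
                sign, term = word[0], word[1:]
            else:
                term = word
        else:
            term = phrase
        term = term.lower()
        if join_with_or and sign != "-" and not clauses[-1]["exclude"]:
            clauses[-1]["terms"].append(term)  # alternative to the previous term
        else:
            clauses.append({"exclude": sign == "-", "terms": [term]})
        join_with_or = False
    return clauses


def matches(query, text):
    """Return True if text satisfies every clause (simplified substring test)."""
    text = text.lower()
    for clause in parse_query(query):
        found = any(term in text for term in clause["terms"])
        if found == clause["exclude"]:  # missing a required term, or hit an excluded one
            return False
    return True
```

In this sketch, `yahoo OR google` matches a page that mentions only Google, while `yahoo or google` does not, because the lowercase "or" is treated as one more required term. That is the reason for capitalizing the operator.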

Understanding statistical interfaces is important, especially for researchers used to Boolean and other non-statistical query languages. Most search engines use statistical interfaces. The search engine assigns a relative weight to each search term, depending on:

  • its rarity in the search engine's database
  • how frequently the term occurs on the webpage
  • whether or not the term appears in the URL
  • how close to the top of the page the term appears
  • (sometimes) whether or not the term appears in the metatags.

When you query the database, the search engine adds up the weights of all your query terms that match each document and returns the documents with the highest total weight first. Each search engine has its own algorithm for assigning weights, and these algorithms are tweaked frequently. In general, rare, unusual terms are easier to find than common ones because of the weighting system.
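
As a rough illustration of the weighting just described, the sketch below sums a per-term weight built from the five factors in the list. Everything in it is an assumption made for illustration: the function names, the sample pages, the document counts, and the multipliers (0.5, 0.25) are invented, and no real search engine is known to use this formula.

```python
import math


def term_weight(term, page, doc_freq, total_docs):
    """Weight of one query term on one page (illustrative formula only)."""
    words = page["text"].lower().split()
    term = term.lower()
    if term not in words:
        return 0.0
    # Rarity in the database: the fewer documents contain the term, the larger the weight.
    rarity = math.log(1 + total_docs / (1 + doc_freq.get(term, 0)))
    frequency = words.count(term) / len(words)                 # how often it occurs on the page
    in_url = 0.5 if term in page["url"].lower() else 0.0       # appears in the URL
    near_top = 0.5 * (1 - words.index(term) / len(words))      # closer to the top scores higher
    in_meta = 0.25 if term in page.get("meta", "").lower() else 0.0  # (sometimes) metatags
    return rarity * (1 + frequency + in_url + near_top + in_meta)


def score(query_terms, page, doc_freq, total_docs):
    """Add up the weights of all query terms that match the page."""
    return sum(term_weight(t, page, doc_freq, total_docs) for t in query_terms)


def rank(query_terms, pages, doc_freq, total_docs):
    """Return pages ordered from highest to lowest total weight."""
    return sorted(pages, key=lambda p: score(query_terms, p, doc_freq, total_docs),
                  reverse=True)


# Hypothetical sample data: "search" is common in the database, "boolean" is rare.
TOTAL_DOCS = 1000
DOC_FREQ = {"search": 900, "boolean": 10}
PAGES = [
    {"url": "http://example.com/search-tips", "text": "search tips for search engines", "meta": ""},
    {"url": "http://example.com/boolean", "text": "boolean operators in search", "meta": "boolean"},
]

for page in rank(["boolean", "search"], PAGES, DOC_FREQ, TOTAL_DOCS):
    print(page["url"])
```

With this sample data, the rare term "boolean" carries a much larger weight than the common term "search", which is the effect the paragraph above describes: unusual terms pull their pages to the top of the results.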

However, remember that "popularity" measured by various means often trumps any statistical interface.
