Page:Untangling the Web.pdf/55


DOCID: 4046925

UNCLASSIFIED//FOR OFFICIAL USE ONLY


Google


Google first gained fame and widespread use because of its single-minded focus on search, exemplified by its "clean" interface and its PageRank™ "weighted link popularity." In simple terms, Google gives each webpage a rank based on the number of other pages linking to it and the "importance" of those pages, where a linking page's importance is in turn derived from the links pointing to it. While PageRank is imperfect, it works better than most other approaches to ranking search results and, indeed, is one of the primary reasons for Google's success.
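The idea of ranking by "weighted link popularity" can be illustrated with a minimal sketch of the simplified, textbook form of PageRank. This is not Google's production algorithm: the function name `pagerank`, the damping value of 0.85, the fixed iteration count, and the tiny example link graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank.

    links maps each page name to the list of pages it links to.
    The damping factor and iteration count are illustrative assumptions.
    """
    # Collect every page, including ones that only appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with rank spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: "x" and "y" each have exactly one inbound link,
# but x's link comes from "hub", which is itself linked to by three pages.
example_links = {
    "p1": ["hub"],
    "p2": ["hub"],
    "p3": ["hub"],
    "hub": ["x"],
    "obscure": ["y"],
}
ranks = pagerank(example_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph, "x" and "y" have the same number of inbound links, yet "x" ends up ranked above "y" because its single link comes from a more heavily linked page. That is the sense in which the "importance" of linking pages matters, not just their count.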

Some of Google's features that helped to create this very successful and powerful search tool are:

  • cached versions of webpages; Google was the first search engine to offer this option, which let users peek into its vast database.
  • automatic conversion of non-HTML filetypes to HTML; Google was not the first to do this, but certainly has been the most successful.
  • backlinks (the link: syntax); unfortunately, Google now limits the number of backlinks it shows, greatly reducing the utility of this option.
  • larger limits on the size of indexed pages; Google seems to have increased these limits. I found an indexed PDF document over 764K, a text file over 1000K, and a webpage over 366K. Very few webpages are larger than 500K. Google does not offer HTML versions of very large PDF or Word documents, e.g., the complete 9/11 Commission Report, but I do not know exactly what the cut-off size is.
  • Google refreshes its index continuously, not on a schedule (this is a good thing); Google's Matt Cutts explains Google's refresh rate: "It's true that when an event happens on the web, our index can often pick it up in 1–2 days, and usually even faster. But a typical page in Google's main web index is updated every 2–3 weeks or faster; it's not the case that the entire main web index is updated every 2–3 days."[1]
  • Google stopped advertising the size of its database in 2005, but its index remains one of the largest, if not the largest, of any search engine.

In determining the overall size of its index, Google also includes URLs of pages that it has not crawled and for which it has not indexed the text. These "orphan"


  1. Matt Cutts, "Google Update Speed," Google Blogoscoped, 26 July 2006, <http://blog.outer-court.com/archive/2006-07-26.html#n28> (14 November 2006).