Untangling the Web
the National Security Agency

Introduction to Searching


Search Fundamentals

The September-October 1997 issue of IEEE Internet Computing estimated the World Wide Web contained over 150 million pages of information. At the end of 1998, the web's size had grown to more than 500 million pages. By early 2000, the best estimates put the number over 1 billion, and by mid-2000 a study showed that there were over 550 billion unique documents on the web.[1] Netcraft, which has been running Internet surveys since 1995, reported in its November 2006 survey that there are now more than 100 million websites. "The 100 million site milestone caps an extraordinary year in which the Internet has already added 27.4 million sites, easily topping the previous full-year growth record of 17 million from 2005. The Internet has doubled in size since May 2004, when the survey hit 50 million."[2] The major factors driving this boom are free blogging sites, small businesses, and the relatively low cost of setting up a website. Another recent survey found:

  • The World Wide Web contains about 170 terabytes of information on its surface; in volume this is seventeen times the size of the Library of Congress print collections.
  • Instant messaging generates five billion messages a day (750GB), or 274 terabytes a year.
  • Email generates about 400,000 terabytes of new information each year worldwide.[3]

The numbers hardly matter anymore. The enormous size of the Internet means we simply must use search tools of some sort to find information. Otherwise, we are voyagers lost on a vast uncharted ocean.

Consider this:
When you do a search, you are going through more information in less than 30 seconds than a librarian probably could scan in an entire career 30 years ago.

All the major search engines now index well over a billion pages of information. The problem generally isn't lack of data but finding that one tiny needle in a virtual haystack of almost limitless size (much like looking for a needle in a stack of needles).

Any serious researcher needs to know more about search engines than the average person using the Net for fun or even for very specialized searches associated with a hobby or perhaps a certain topic, e.g., cancer research. How do you learn the ins and outs of search?


The Past, Present, and Future of Search

"Search has become the most hotly contested field in the world of technology."[4]

Remember Northern Light? How about Excite, Galaxy, Lycos, HotBot, Magellan, InfoSpace, Go, Webcrawler, iWon, Netfind, or Webtop? If so, you've been searching the Internet a long time because many of these search engines are long gone and forgotten. Of the many changes in search and search engines in recent years, none has been quite so dramatic as what has occurred in the past two years with the appearance of the new Yahoo and Live Search engines.

While many smaller, focused search tools still exist, the sad fact is that, in terms of large, powerful, world-encompassing search engines, Internet searchers at this moment have fewer major search engines from which to choose.[5] What happened to get us to this point and what does the future portend?

In the early years of the Internet, there was enormous competition in the search market among a large number of search engines vying not only for users but, more importantly, for investors. The "dot bomb" crash in mid-2000 began the shakeout of search companies that continues to this day. The biggest change wrought by the failure of so many Internet-based investments was the growth of pay-per-click advertising in search results. Pioneered by Overture, these so-called sponsored results began to show up at the top of search result lists: the more an advertiser was willing to pay, the higher his result on the list. Then, in 2002 the big search engine consolidation began: first, Yahoo purchased Inktomi, a little-known but major player in the search engine world. In early 2003, Overture bought AltaVista, one of the oldest and most venerable search engines on the Internet, then quickly acquired AlltheWeb, another major search engine. To top it off, in July 2003, Yahoo bought Overture, thus acquiring three huge search properties at one time.

All this was done publicly. The real revolution was what was happening behind the scenes: with a remarkable degree of secrecy, Yahoo gave the engineers it had acquired from AltaVista, AlltheWeb, and Inktomi a new task: create a whole new search engine to compete with Google. On February 18, 2004, Yahoo unveiled its new search engine, which has a database and search features to rival Google's. Shortly thereafter, Yahoo began killing off the "parents" of its new progeny: first Inktomi, then AlltheWeb and AltaVista. While users can still go to the AlltheWeb and AltaVista websites and run searches, the results are pulled from the Yahoo database and many of the unique search options and features of both search engines are no longer available. However, Yahoo continues to add new features and options that are improving its capabilities.

During 2006, two major search engines unveiled major changes that make them serious contenders: Ask and Exalead. Teoma and Ask Jeeves ceased to exist as separate search sites during 2006 and merged under the Ask.com umbrella. The French search engine Exalead came out of beta during 2006 with a new look and a major overhaul, and it continues to offer a number of important and unique search features. MSN Search became Live Search, which left beta status in September 2006 and increased the much-needed competition from a company that knows how to make successful (if imperfect) products. Amazon.com still offers its own search engine, A9, although during 2006, Amazon eliminated some of A9's unique functions, switched from Google to Live Search to power web searches, and appeared to be, if not abandoning A9, then certainly scaling it back.

All the major search sites are still trying to be the "Swiss army knife" of search engines. Google, Yahoo, Live Search, Ask, and Exalead all competed hotly with each other to roll out new, better, faster, fancier, more powerful tools to do everything from searching the contents of your computer in a heartbeat to letting you "fly" around the world with a bird's (or satellite's) eye view of the planet. Among the new search engine-based tools and programs arriving this past year were vastly improved maps and mapping technologies, enhanced multimedia search, desktop search utilities, toolbars integrated into the browser, and application programming interfaces (APIs) for use by individual developers.

If 2004 was the year of the new search engine and 2005 the year of tailored search, 2006 seems to have been the first year of Web 2.0. Interactive, participatory Internet activities such as blogging, podcasts, online video sharing, and wikis dominated the discourse.

Podcasting finally came into its own last year. Podcasting is recording and broadcasting any non-musical information—be it news, radio shows, sporting events, audio tours, or personal opinions—usually in MP3 format for playback using a digital audio player. Many websites now serve as directories to help users find podcasts of every variety anywhere in the world. Podcasting has caught on because it is easy, inexpensive, mobile, flexible, and powerful. Yahoo got out in front of the podcasting trend with its new Podcasts Search site after a study the search giant published with Ipsos Insight, which disclosed that most of the people who are using RSS do so without even knowing it.[6] RSS, which either stands for Rich Site Summary or Really Simple Syndication, is an XML format for news and content syndication. News aggregators are programs designed to read RSS formatted content, which is very popular in the blogging community. Many if not most blogs make their content available in RSS.
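Since RSS comes up again later in this chapter, here is a minimal Python sketch (standard library only) that reads a tiny hand-made RSS 2.0 snippet the way a news aggregator would; the feed contents below are invented purely for illustration.

# Minimal sketch of reading an RSS 2.0 feed with Python's standard library.
# The feed content below is invented for illustration.
import xml.etree.ElementTree as ET

sample_feed = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Search News</title>
    <link>http://example.com/</link>
    <item>
      <title>New search feature announced</title>
      <link>http://example.com/news/1</link>
      <pubDate>Wed, 15 Nov 2006 09:00:00 GMT</pubDate>
      <description>A short summary that a news aggregator would display.</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(sample_feed)
for item in root.iter("item"):
    # An aggregator turns these elements into a readable headline and link.
    print(item.findtext("title"), "-", item.findtext("link"))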

Although there is no agreed upon definition of what Web 2.0 means, in general terms most people believe it involves at a minimum users collaborating to share information online, i.e., an interactive, participatory web in contrast to what is now being called the static web (or Web 1.0). I think the Wikipedia article on Web 2.0 sums the current state of affairs up nicely when it says "To some extent Web 2.0 is a buzzword, incorporating whatever is newly popular on the Web (such as tags and podcasts), and its meaning is still in flux."[7]

Another important aspect of Web 2.0 is that it organizes information differently from traditional web and other news and knowledge models. So reports a Time article on the frontiers of search in its 5 September 2005 issue. There is good reason to believe this claim, given a major investment firm's assessment that "by 2010, search-engine advertising will be a $22 billion industry worldwide, up from an estimated $8 billion today."[8]

One casualty of Web 2.0 appears to be directories. Directories are hierarchical guides to a subset of what are presumably the best, most relevant (or at least most popular) websites on a specific topic. Yahoo was always the king of directories, but several years ago, I noted a marked decline in both the quantity and quality of the Yahoo directory. The other major directory was and remains the Open Directory Project, which has always powered the Google Directory and, ironically, now powers the Yahoo Directory. What distinguished the Open Directory from Yahoo was that, while Yahoo was heavily commercial, the Open Directory has always relied upon volunteers to populate and maintain it. Now that most of users' creative energy seems to have moved to wikis, the ODP is in what may be a permanent and ultimately fatal decline. Today, the most successful directories tend to be specialty directories such as NewsDirectory.com or yourDictionary.com, and vertical search engines, such as Business.com or MedlinePlus, which focus on a particular topic instead of trying to catalog the entire Internet.

Directories were almost always a part of the portal concept. Portals were all the rage for a few years, while search was considered the Internet boondocks: no one was terribly interested in the boring (and unprofitable) technology of search. So where are portals now, those one-stop, handy-dandy, Swiss army knife websites that tried to do and be all things to all people? Most of them are gone, thanks in large part to Google's ascendancy. With its clean, spare look, Google changed the face of Internet search by moving away from the portal concept to pure search. While it is true that Google offers a directory as well as other types of searches (image, news, shopping, groups), Google's focus has always been on web search. Google's new look, which debuted in April 2004, included removing the directory tab from the Google home page, further evidence of the decreasing importance of directories. Although there is growing criticism of the "googlization" of websites, Google continues to be the standard by which most sites are judged.

The rapid and dramatic decline in web directories is only partially attributable to Google's success. The other explanation for the waning of directories is the Tristram Shandy paradox. The Life and Opinions of Tristram Shandy, Gentleman is a nine-volume 18th century novel in which Tristram Shandy tries to record every detail of his life but discovers his task is hopeless because it takes him one year to document only one day. As Shandy writes an additional day, it takes him an additional year to complete the events of that day. Such is the fate, to a somewhat lesser degree, of those who seek to compile an Internet directory. By the time the information in the directory is researched, compiled, and published, the Internet has changed and made much of that information obsolete.

I believe Yahoo's decision to metamorphose from directory to search engine was in part a result of a tacit recognition of the Tristram Shandy paradox. Yahoo just couldn't keep up with the Internet's changes and it became too costly to try. Creating and maintaining a directory is an extremely manpower intensive endeavor, which flies in the face of the Internet model of relying on automation and technology. Undoubtedly, Yahoo's changes were largely driven by Google's enormous financial success. Yahoo sat by for years and watched as Google's popularity (and revenues) increased as Yahoo's stagnated. "By the late '90s much of [Yahoo's] focus was actually diametrically opposed to search, which is supposed to send you to other sites. The Yahoo portal strategy was to keep the eyeballs on its turf, where they viewed more ad units, shopped, and bought premium services. Only when a third of online ad spending moved to search within a few short years did Yahoo decide to buy in big."[9]

Again in 2006 Yahoo changed the look of its homepage, but I believe Yahoo is making a fundamental error by still presenting its busy, messy portal face to the world. Although savvy Internet searchers know to go directly to http://search.yahoo.com in order to avoid the confusion and get a clean interface, most users are still going to the main Yahoo page, where they are confronted with a crowded, cluttered portal.

Here's Yahoo's dilemma: how does it compete with Google for searchers seeking a simple, clean interface while simultaneously retaining and attracting users who want "one stop shopping"? Thus far, more searchers are still going to Google first rather than muddling their way through that kind of clutter. Where Yahoo excels—and in my opinion beats Google—is in shopping and in finding local information. This is a fact Yahoo not only recognizes but also embraces. Says Ted Meisel, head of Yahoo's Overture division, "We never claimed it [Yahoo] was a better approach for doing research on 18th century Spain. But if you are trying to buy a power washer for your back deck, it's a pretty good way to find what you need."[10] That's fine for personal searches, but it does not help the searcher who is using the Internet for work-related, academic, or other types of research.

The future of search seems to be in fewer but more experienced and more commercially driven hands now than a decade ago. Certainly both the quantity and quality of search results are much better today. And there are other trends in search that are going to have a major impact on users, love them or hate them. Among these are greater personalization of search, an area in which Google, Yahoo, and Live Search are all vying for your attention. Then there is the concept of social networking, through which Internet users with similar interests share their web knowledge and experience. Social bookmarking sites such as del.icio.us or digg and sharing software such as Stumbleupon are growing in popularity as individual users seek ways to help each other discover and propagate information.

There has also been a strong impetus towards more localized search for shopping, news, map directions, services, telephone lookups, and more. Yahoo initially outpaced Google in this area because it already owns an enormous warehouse of information about where its users live and work, shop and play. However, Google, Yahoo, Ask, and Live Search all moved strongly into the local and personalized search arena during 2006. Add to the mix all the other services search companies offer or plan to offer, such as Google's much ballyhooed and controversial foray into email with Gmail. The move toward greater personalization (likes and dislikes/interests/shopping/travel) and more services (especially email and tailored news) brings increased concerns about privacy and security. The more Yahoo, Google, Amazon, Microsoft, et al. know about us, the more they can serve up what we want.

But the more they know, the less control we have over our privacy and computer security. I am reminded of a scene from the film Minority Report in which the main character walks into a clothing store and, after his eye scan, the computer welcomes him by name, asks if he was happy with his previous purchase (which it details) and what he would like now. It doesn't take a lot of imagination to see how this technology can be abused. Everyone wants convenience but it is a virtual axiom of technology that every increase in convenience brings with it some decrease in privacy and, most likely, security. Now more than ever, the future of search is one that appears to be heading towards more personalization, more features, more options and, inevitably it seems, less privacy, less security, and fewer companies with the will, technological know-how, and financial resources to build and maintain search engines.


Understanding Search Engines


The best way to keep up to date with search engines in the US is to visit websites devoted to search and to read their newsletters. One of the oldest sites about search is Search Engine Watch. Although Search Engine Watch was originally designed for webmasters (by webmaster Danny Sullivan), it is a good resource for researchers who want and need in-depth information about the major English-language search services and some country specific engines. Search Engine Watch is also home to Search Day, noted search maven Chris Sherman's daily newsletter. While Search Day is kept current, Search Engine Watch now has many out-of-date pages.

Stepping into the breach is the superb Pandia Search Central, which offers current search news and an almost endless number of tips, tutorials, guides, and even its own search tools. Pandia has emerged as the premier site for news about and help with search.

Other good web search sites include John Battelle's Searchblog, Philipp Lenssen's Google Blogoscoped (which covers much more than just Google), Gary Price's Resource Shelf, Phil Bradley's Weblog, Greg Notess's Search Engine Showdown, as well as Web Master World and Web Search Guide. Among the best search engine-specific blogs are the Yahoo Search Blog, the Official Google Blog, Google Operating System, and Live Search Weblog.

The only thing predictable about search engines is how quickly and frequently they change not only their content but also their features. Because there are websites devoted to keeping up with the myriad changes, they are your best bet for staying on top of the ever-changing world of search tools.

Search News and Blogs

Google Operating System http://googlesystem.blogspot.com/
John Battelle's Searchblog http://battellemedia.com/
Live Search Weblog http://blogs.msdn.com/msnsearch/default.aspx
Official Google Blog http://googleblog.blogspot.com/
Pandia Search Central http://pandia.com/
Philipp Lenssen's Google Blogoscoped http://blog.outer-court.com/
Phil Bradley's Weblog http://philbradley.typepad.com/phil_bradleys_weblog/
Research Buzz http://www.researchbuzz.com/
Resource Shelf http://www.resourceshelf.com/
Search Day http://searchenginewatch.com/searchday/
Search Engine Showdown http://www.searchengineshowdown.com/
Search Engine Showdown Reviews http://www.searchengineshowdown.com/reviews/
Search Engine Watch http://searchenginewatch.com/
Search Engine Watch Web Searching Tips http://www.searchenginewatch.com/facts/index.html
Web Master World http://www.webmasterworld.com/
Web Search Guide http://www.websearchguide.ca/
Search Engine Watch Blog http://blog.searchenginewatch.com/blog/
Yahoo Search Blog http://www.ysearchblog.com/

Web Tip

Browsers assume the prefix "http://" unless you tell them otherwise, which means you do not need to type "http://"; just type the URL (address).



Search Engine Basics


A search engine comprises three basic parts:

  1. The spider/robot/crawler is software that "visits" sites on the Internet (each search engine does this differently). The spider reads what is there, follows links at the site, and ultimately brings all that data back to:
  2. The search engine index, catalog, or database, where everything the spider found is stored;
  3. The search engine software that actually sifts through everything in the index to find matches and then ranks or sorts them into a list of results or hits.

Important points to consider about search engines:
  • Spiders are programmed to return to websites on a regular basis, but the time interval varies widely from engine to engine. Monthly or better is considered "fresh."
  • When you use a search engine, you are searching the index or database, not the web pages themselves. This is important to remember because no search engine operates in "real time."
  • Spiders do not index all the web pages they find; for example, they skip pages that employ the "Robots Exclusion Protocol" or the "Robots META tag." The first of these mechanisms is a special file website administrators use to indicate which parts of the site should not be visited by the robot or spider. The second is a special HTML metatag a web page author may insert to indicate whether the page may be indexed or analyzed for links. Not every robot/spider respects these mechanisms. Password protection, firewalls, and other measures will generally keep spiders from crawling a website and indexing it. (A brief robots.txt example follows this list.)
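As an illustration of the Robots Exclusion Protocol, here is a minimal Python sketch (standard library only) that asks a site's robots.txt file whether a particular page may be crawled. The site name, spider name, and page path are placeholders invented for the example.

# Minimal sketch of honoring the Robots Exclusion Protocol (robots.txt).
# The site, spider name, and page path are placeholders for illustration.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt file

# A well-behaved spider checks before fetching and indexing each page.
page = "http://www.example.com/private/report.html"
if rp.can_fetch("ExampleSpider", page):
    print("robots.txt allows this page to be crawled")
else:
    print("the site asks spiders not to visit this page")

# The page-level equivalent is the Robots META tag inside the HTML itself:
#   <meta name="robots" content="noindex, nofollow">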

Not every search engine has its own proprietary search technology; some instead rely upon another company's search service for their results. Most of these strategic alliances now involve Yahoo, Google, and Windows Live Search. All these partnerships are subject to change without notice.


Knowing that Yahoo, for example, is the search tool behind a search engine can save you time because you can be pretty sure that using AltaVista will get you similar (although not identical) results to the other search engines also powered by Yahoo. It is critical to remember that each service powered by a particular search engine produces different results even though they may all use the same core database. Why is this? Because the search interfaces have their own algorithms that decide how queries are run, how results are returned, or even if they query the entire database (most do not). In short, go to the primary search engine (Google, Yahoo, or Live Search) for best results.


A Word About Browsers:
Internet Explorer and Mozilla Firefox


Two years ago I declared that the "browser wars" were for all intents and purposes over and Microsoft's Internet Explorer (IE) had won. IE still commands more than 90 percent of the world's browser market, and AOL abandoned Netscape's Navigator/Communicator in mid-2003. However, during 2004, Mozilla browsers experienced a resurrection thanks largely to user frustration.

Caveat Browser

Alexa and Smart Browsing technology are very controversial because of their privacy implications. For more information, take a look at the article "What's Related? Everything But Your Privacy":

Curtin, M., Ellison, G., Monroe, D., "What's Related? Everything But Your Privacy," 7 Oct 1998, Revision: 1.5, <http://www.interhack.net/pubs/whatsrelated/> (24 October 2006)

Because of Internet Explorer's continued dominance of the browser market and, more importantly, because it is the standard browser for many Untangling the Web readers, I will focus my attention on Internet Explorer.

Nonetheless, all browsers have advantages and drawbacks. I still recommend you configure two browsers, both Internet Explorer and Mozilla Firefox. Both types of browsers allow you to make a number of decisions that affect your privacy and security while browsing. Also, both browsers have become much more customizable with each new release, allowing every user to select and save his own preferences for everything from fonts to what will appear on the toolbar. Be sure to familiarize yourself with the many evolving features of your browser(s). The Microsoft and Mozilla websites have extensive information and documentation about their browsers. At the Mozilla site you can download and install the highly regarded Firefox browser as well as other free software, such as the Thunderbird email program.

In October 2006, both Microsoft and Mozilla introduced new versions of their browsers: Internet Explorer 7 (IE7) and Firefox 2. Microsoft, which had owned upwards of 90 percent of the browser market until Firefox took off a couple of years ago, recognized it has a genuine competitor on its hands and made significant changes and improvements to its browser to try to bring some Firefox users back into the fold. Will it work? PC World offered an excellent comparison of IE7 and Firefox 2.[11] While Firefox 2's changes are mostly refinements of already existing features with no change in the browser's look and feel, IE7 marks a major overhaul since IE6 was released way back in 2001.

Among the changes to Internet Explorer 7 are tabbed browsing, integrated searching, RSS newsfeed support, and an antiphishing tool. The most noticeable change is IE7's look and feel, which is designed to resemble Microsoft's new operating system, Vista. Probably the most obvious and popular addition to IE7 is tabbed browsing, something Firefox already offered. Also, IE7 has a built-in search box, which lets users search from anywhere without having to go to the search engine's home page. Google and other search engines had successfully lobbied Microsoft not to make Live Search the default search service, so you can pick your search engine.

The other major change is invisible: improved security features designed to cope with the almost endless number of vulnerabilities that have afflicted IE6.[12] The most prominent of these security upgrades is one shared with Firefox: an "antiphishing" tool that works by warning users that a website they are about to visit may be fake and redirects them away from the page unless they actively choose to go to it. The other major new IE7 security feature is something called Protected Mode, which prevents a website from changing a computer's files or settings. However, Protected Mode will not work with any Windows operating system except Vista, which is due out next year. Also, one of IE's major appeals had been its universality, that is, it would work with most websites. The security features in IE7 mean that some sites that could be viewed in earlier versions of IE cannot be viewed in IE7, undermining one reason many people still continued to use the Microsoft browser.

Firefox 2 is another in a long line of gradual updates. This version adds a spell checker, a system for suggesting popular search terms, and an option to pick up where you left off after a crash. Firefox 2 also upgrades the RSS newsfeed support so that now, if you click on the feed itself, instead of seeing the usual XML gibberish, Firefox 2 will parse the raw feed into something readable and also lets you subscribe to the feed using one of numerous (but not all) newsreaders.

What is the bottom line? Firefox users should upgrade to version 2; it will be easy and pain-free. IE6 users probably should wait a while before downloading IE7 to let early adopters find the inevitable bugs that Microsoft will have to fix. Frankly, after five years, you would think Microsoft could do better than come up with a browser that basically mimics the best features of Firefox and its other (much smaller) competitors. This looks mostly like catch-up and very little like innovation.

If you are going to use Netscape, another Mozilla-based browser, I do not recommend using Netscape 8x because it has many reported problems. Stick with either Netscape 7.1x or 7.2x. Also, if you prefer a streamlined version of Netscape 7x without all the annoying "extras," I can recommend one from Sillydog (silly name, great tool). "Netscape 7.1 is based on Mozilla 1.4. Both applications share almost identical features, such as tabbed browsing, custom keywords, and Sidebar. Exceptions are additions of proprietary features such as the support for Netscape WebMail and AOL mail."[13] Netscape 7.2 is based on Mozilla 1.7.2. "In addition to the technologies that Netscape 7.2 shares with Mozilla 1.7.2, it includes additional features such as a number of installed plugins, support for Windows Media Player Active X control which are not available in Mozilla."[14]

Microsoft Internet Explorer http://www.microsoft.com/windows/ie/default.htm
Mozilla Firefox http://www.mozilla.com/firefox/
Netscape 7.1 Streamline http://sillydog.org/narchive/sd/71.html
Netscape Archive (7.1 or 7.2) http://browser.netscape.com/ns8/download/archive.jsp

What the heck are "cookies"?

Cookies are small pieces of text placed on your computer's hard disk (yes!) by a website in order to remember something about you. For example, a site may set a cookie that enables you to reenter without logging in or customize its pages based on the type of browser you're using. Cookies remain controversial (more later).
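To make the idea concrete, here is a small Python sketch (standard library only) showing roughly what a Set-Cookie header from a website looks like and how the browser sends the value back on later visits; the cookie name and value are invented for the example.

# Rough illustration of how a cookie travels between a website and your browser.
# The cookie name and value are invented for this example.
from http import cookies

# What the server sends: a Set-Cookie header that the browser stores on disk.
jar = cookies.SimpleCookie()
jar["session_id"] = "abc123"
jar["session_id"]["path"] = "/"
print(jar.output())  # Set-Cookie: session_id=abc123; Path=/

# What the browser sends back the next time you visit the same site,
# which is how the site "remembers" you without a new login.
returned = cookies.SimpleCookie("session_id=abc123")
print("Cookie returned to the site:", returned["session_id"].value)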


The Great Internet Search-Offs


Over the last decade, the inevitable "search offs" have become commonplace (both Internet vs. traditional researcher and Internet researchers against each other). Some of the findings of these "contests" provide insight into how search engines work.

  1. Most search-offs and wide-ranging studies continue to find surprisingly little overlap among major search engines, so use more than one search engine as a general rule.
  2. The Internet is now being widely used for "serious" research, which means higher quality, more reliable information on the web. But, as with any research source, you must weigh the validity, accuracy, currency, and overall quality of the information before using it.
  3. Search engines rely on statistical interfaces, concept-based search mechanisms, or link analysis to return and rank hits; using boolean expressions[15] usually interferes with or defeats these statistical approaches. In general, do not use boolean queries unless you know exactly what you are looking for and are very comfortable with that search engine's boolean rules (no, they are not all the same; for example, you may have to use CAPS for all operators). Also, many search engines do not correctly process nested boolean queries (boolean searches with parentheses).
  4. Be aware that search engines are giving more weight to popular and/or pay-for-placement web pages. In fact, most search engines use services to determine which are the most visited, and therefore most popular, websites and return them at the top of the results list. This is a strategic move away from the traditional "words on a page" ranking system. Trustworthy search engines will clearly indicate which hits are paid entries.
  5. Learn the search syntax of the search engines you use (never assume). Most search engines use double quotes ("") to enclose a phrase and the plus (+) and minus (-) signs to indicate "must include" and "must exclude" respectively. But these are by no means universal rules (especially when using international or metasearch engines); a brief illustration follows this list.
  6. The default operator for all major US search engines is now AND. As of February 2002, no major search engine used OR as its default operator. However, most search engines will let you use an OR in the simple search box: Yahoo and Google permit OR searches in the simple search box, but you must capitalize the OR.
  7. Keep in mind that because HTML does not have a "date" tag, "date" can mean many things: creation date; the last modified date for the page; or the date the search engine found the page. I do not recommend searching by date except when using weblog, news, or newsgroup search engines.
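To illustrate items 5 and 6 above, this short Python sketch builds a few typical queries showing a quoted phrase, the plus and minus prefixes, and a capitalized OR. The terms, the search URL, and the q parameter are invented for illustration, so consult each engine's own help pages for its exact syntax.

# Illustrative examples of common query syntax; the terms, URL, and parameter
# name below are invented for this sketch, not any particular engine's rules.
from urllib.parse import quote_plus

queries = [
    '"Tristram Shandy" paradox',    # double quotes keep a phrase together
    '+java -coffee -indonesia',     # + means must include, - means must exclude
    'podcast OR webcast',           # OR usually must be capitalized
]

for q in queries:
    print(q, "->", "http://searchengine.example/search?q=" + quote_plus(q))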

Understanding statistical interfaces is important, especially for researchers used to boolean and other non-statistical query languages. Most search engines use statistical interfaces. The search engine assigns relative weights to each search term, depending on:

  • its rarity in their database
  • how frequently the term occurs on the webpage
  • whether or not the term appears in the URL
  • how close to the top of the page the term appears
  • (sometimes) whether or not the term appears in the metatags.

When you query the database, the search engine adds up all the weights that match your query terms and returns the documents with the highest weight first. Each search engine has its own algorithm for assigning weights, and they tweak these frequently. In general, rare, unusual terms are easier to find than common ones because of the weighting system.
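As a rough illustration of such a statistical interface (a toy model, not any particular engine's algorithm), the Python sketch below scores a few made-up documents by combining how often a query term occurs on a page with how rare the term is across the whole collection, then returns the highest-weighted pages first.

# Toy illustration of statistical term weighting; not any real engine's algorithm.
# The documents and query are made up for the example.
import math

documents = {
    "page1": "java programming tutorial for java beginners",
    "page2": "coffee growing on java in indonesia",
    "page3": "python and java programming examples",
}

def rank(query, docs):
    scores = {name: 0.0 for name in docs}
    for term in query.lower().split():
        containing = [name for name, text in docs.items() if term in text.split()]
        if not containing:
            continue
        # Rare terms (found on few pages) carry more weight than common ones.
        rarity = math.log(len(docs) / len(containing)) + 1.0
        for name in containing:
            frequency = docs[name].split().count(term)
            scores[name] += rarity * frequency
    # Pages with the highest combined weight are returned first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(rank("java programming", documents))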

However, remember that "popularity" measured by various means often trumps any statistical interface.


Types of Search Tools

Before delving into the intricacies of search engines, let's look at some other ways of finding information on the web. Search engines are not the only and often not even the best way to access information on the Internet.



Web Directories/Subject Guides/Portals


Web directories are organized subject catalogues that allow the web searcher to browse through lists of sites by subject in search of relevant information. Yahoo, Galaxy, Google Directory, Lycos, and the Open Directory are select lists of links to pertinent websites. Directories were once viewed as the future of the Internet because they could sift through the mountains of information and millions of websites to offer only the best and most relevant. However, directories have truly fallen by the wayside over the past several years with the rise of Google and, even more importantly, wikis in general and Wikipedia in particular. Directories continue to recede in importance and value to researchers as they are increasingly replaced by better alternatives, including Custom Search, by which a voluntary community of searchers shares expertise to create more focused searches with more relevant results. The reason for the decline of directories is obvious: directories are simply too manpower intensive and expensive to keep up with the ever-changing and expanding web. I would say at this point directories, while not dead, are probably moribund.

Directories rely on people to create their listings. Obviously, this is a much more labor-intensive business than operating a search engine robot. Websites indexed in a directory are either described/evaluated by editors/reviewers or rely on descriptions provided by web page owners who may pay for placement in a directory. When you search a directory, the only retrievals will come from those descriptions, so keep this in mind. Although directories give you a much more limited view of the web, directories do have their own utility. Most directories also have a backup search that provides responses to queries that don't match anything in the directory listings.

Directories may produce more relevant results

Subject guide databases are always smaller than those of search engines, which means that the number of hits returned tends to be smaller as well. On the bright side, this means the results directories produce are often more relevant. For example, while a search engine typically indexes every page of a given website, a subject guide is more likely to provide a link only to the site's home page. For this reason, they lend themselves best to searching for information about a general subject, rather than for a specific piece of information.

Yahoo still has the best-known subject guide/directory and can be a good starting place for research, even on technical subjects. Yahoo used to list links alphabetically, but once Google came along with its ranked list of sites, Yahoo started offering most popular sites first before going to its alphabetical list. However, Yahoo's directory has suffered in recent years as the Google Directory has steadily improved. Google gets its directory data free in the form of the Open Directory Project.

You may not recognize the Open Directory Project by this name, but you have probably used it. The ODP is the directory behind the Google Directory, AOL Search, Yahoo Directory, and many others. The ODP "is the largest, most comprehensive human-edited directory of the Web. It is constructed and maintained by a vast, global community of volunteer editors."

Galaxy is definitely worth a look because it was designed for and by "professionals," so it has a bent toward business, technology, and science that other directories lack. You may search either the Galaxy collection or the web using their proprietary search engine. Best of the Web started life in 1994 as a web awards site and is now a full-fledged directory.

Many more specialized directories are discussed under the "Invisible" Internet.

Best of the Web http://botw.org/default.aspx
Galaxy http://www.galaxy.com/
Google Directory http://directory.google.com/
Open Directory http://dmoz.org/
Yahoo Directory http://dir.yahoo.com/

Metasearch Sites


The growth in the number of search engines has led to the creation of "meta" search sites. These services allow you to invoke several or even many search engines simultaneously. These metasearchers may do a more thorough job of sifting through the net for your topic than any single search engine. If you are new to using search engines, these are a great way to do a very broad search, while familiarizing yourself with the popular engines and how they respond. But metasearch engines inevitably lack the flexibility of individual search tools.

It is important to note that many metasearch engines do not employ some of the best search engines, such as Google and Yahoo. Also, my biggest complaint about metasearch engines is that they perform shallow searches, usually only retrieving the top ten or so hits from a site, which is far too few to be comprehensive or truly representative of what is "out there."

However, metasearch engines do serve a purpose. If you are unsure if a term will be found anywhere on the web, try a metasearch engine first to "size" the problem: you may get zero hits with a dozen search engines (you've got a problem) or you may get a half-dozen right-on-the-money hits right off the bat.


Vivisimo, in my opinion the best free metasearch tool available, opened a new search site—Clusty—in 2004 and then made Clusty its search home in 2006. Fundamentally, Vivisimo and Clusty are the same, but Clusty adds options for news, image, Wikipedia, government, and blog searches.

The Vivisimo technology behind Clusty is unique because it employs its own clustering engine, software that organizes unstructured information into hierarchical folders. Clusty offers clustered results of web, news, and certain specialty searches. The Clusty default is to search the web using Live Search, Gigablast, Ask, Wikipedia, and the Open Directory.

Clusty is especially useful for searching ambiguous terms, such as cardinal, because it clusters results into logical categories. Also, Clusty lets users look at the sources of the search results and the types of sites (e.g., .com, .gov). Clusty has a unique feature that allows users to search inside clusters; for example, an original search on [iran] can be narrowed with a "find in clusters" search on [nuclear], and the results of that recursive search can then be examined by source.

For news, Clusty searches the New York Times, Associated Press, Reuters, and Yahoo News (which subsumes a huge number of sources). One of the best features of Clusty news search is the ability to toggle among clustered results, sources, and sections (such as business, health, tech, science).

Clusty also provides a number of advanced search options and preferences, including the option to add your own customized tabs to the main search page.

Clusty stands out as one of the best, if not the best, metasearch tools available for free and without registration on the Internet. When clustering works (and the Vivisimo technology was independently rated as accurate 90 percent of the time), it offers advantages for automatically grouping huge amounts of information logically. Because there is no human intervention, Vivisimo's clustering algorithm "also helps in discovering new areas of subject development, avoiding the 'mummy's curse,' in which human catalogers have to recognize a term before approving it for usage and then leaving the earlier material using the term un-indexed and irretrievable by that term as an authorized descriptor or metatag."[16]


Jux2 lets users query three search engines—Google, Yahoo, and Live Search (still referred to as MSN Search)—and then shows you:

  1. The Best Results from all three search engines and the total hits for each.
  2. What only Google found and what is missing from Google.
  3. What only Yahoo found and what is missing from Yahoo.
  4. What only Live/MSN Search found and what is missing from Live/MSN Search.

I believe you will be as surprised as I was to see how little overlap there often is among the "big three" search engines.


Dogpile, despite its name, is a good metasearch engine. Dogpile includes Live Search results, along with those from Google, Yahoo, and Ask Jeeves. This is, of course, very good news because Dogpile is now drawing from all the major US-based search engines with the exception of Gigablast. It also searches smaller or lesser-known search engines and directories, including MIVA (formerly FindWhat), LookSmart, Ask, About, and more. Interestingly, the European version's name is Webfetch because of "unfortunate associations" between Dogpile and manure.


Mamma, the "Mother of All Search Engines," might just be exaggerating a wee bit. Mamma offers web, news, image, and yellow and white page search options. Search engines queried are Ask, Wisenut, Gigablast, and Entireweb (a serious misnomer) and directories queried are Open Directory, About, Business.com, and two pay-per-click sources.


The famed search guide site, Pandia, offers its own excellent metasearch engine. The Pandia metasearch engine "collects and sorts the hits, takes out duplicates, and presents the end result in a simple format." The first results you'll see are from what Pandia describes as the "essential search engines and directories," which include Google, Yahoo, HotBot, and Wisenut. Strangely, Pandia continues to list AlltheWeb (Fast) and AltaVista as search engines while acknowledging elsewhere on its site that Yahoo subsumed both engines. Still, this is a very good metasearch site.


More metasearch sites:

Open Directory's List of Metasearch Sites

Megasearch Sites


Megasearch sites simply store several search engines under one roof, but you have to do the searches one search engine at a time. They are becoming more sophisticated and better as time passes, serving as good entry points for finding and evaluating search engines. They are especially useful for locating international search engines.


Types of Searches and the Best Ways to Handle Them


The first thing to ask yourself is the one question a lot of people never consider: is the Internet the best place to start? In general, the Internet has become so good at answering factual questions—the kinds of things you find in an almanac, an encyclopedia, or a phone book—that it is now usually better in terms of speed, timeliness, and accuracy than other resources. For example, if I need to know the world's largest hydroelectric plants, I can open an almanac and look up this information or I can type [world's largest hydroelectric plants] into Google, Yahoo, or Live Search, where the first result links me to a page at Information Please that contains the answer to the question.

Still, compared to traditional library-type resources, the Internet may be:

  • slower (though this is changing with new technologies).
  • less reliable (large amounts of bad data in among the good).
  • disorganized (a library with all the books on the floor).
  • frustrating (lots of "broken" links).
  • hard to use (generally poor search tools and too much data to sift through).
  • risky because of growing privacy and security threats.

This being said, why do we need to use the Internet? Because:

  • it has an almost unlimited amount of data (also a minus… too much of a good thing and way too much of the bad).
  • the data tend to be current.
  • it offers multimedia (video, audio, charts, tables, illustrations).
  • it allows the individual to do much more of his own research.
  • it is relatively inexpensive (at least in some countries).
  • most importantly, it contains a vast amount of unique information.

You've thought through your research question and decided to use the Internet to find information either because you've already tried traditional sources without success or you believe the Internet is your best option. You're sitting in front of your terminal, you've logged onto the Internet, and you're staring at a blank screen. Now what? Let's start with a (relatively) easy type of search. You need to find general information about a fairly broad topic.

Let's say you need to research a broad topic unfamiliar to you, for example, Java. The best approach may not be to type java into a search engine. Why? Because you'll probably get millions of hits, and the first ones may be to commercial sites trying to sell you something relating to Java and will undoubtedly also include other meanings of Java, such as Indonesia and coffee. For general or broad topics, wikis, specialized (vertical) search engines, and virtual libraries are often better starting points than big search engines.

The single biggest mistake searchers make is using the wrong search tool. For example, search engines are generally not the best tools for finding current news (use a news search engine), for researching broad topics (use a specialty directory or virtual library), or for performing specialized searches such as scientific research (use a specialty search engine). That's why the number one rule for web research is:

Rule One

Use the right tool for the job.

Let's go back to the Java example where you want to find general information on the web about Java programming. Start with the Yahoo directory and see what categories it offers on Java. You can ignore the sponsored results and the categories about Indonesia, classic arcade games, and commercial Java services. Instead, your best bet is Programming Languages > Java:

Computers_and_Internet/Programming_and_Development/Languages/Java/

Right there on one page is a wealth of promising links to documentation, reference, tutorials, news, downloads, articles, etc., and to the most lucrative resource of all, the metaguide. In this case, take a look at Java Boutique, which is a collection of useful Java information, news, forums, and more collected in one convenient location.

Thanks to thousands of individuals, corporations, and organizations, the Internet offers countless such metaguide sites on a huge variety of subjects. Which brings us directly to…

Rule Two

Let other people do as much work for you as possible (use their metaguides, their FAQs, their expertise to your advantage).

Directories are not the only good sources of general information. A number of virtual libraries and reference desks have sprung up on the web and they tend to be terrific starting places for all types of general information because they have thousands of pre-selected links to sources of data the researchers know to be good.

Let's continue with the Java example. If we go to the Intute Science, Engineering, and Technology page (formerly EEVL, the Internet Guide to Engineering, Mathematics and Computing) and search on java, we get back a list of highly relevant and carefully evaluated websites.

In addition to the obvious SUN sites about Java, there are many others, such as links to Java FAQs, news, tutorials, course notes, seminar slides, articles, development tools, users' groups, mailing lists, books, conferences, links to web-based courses, and other resources.

Now you have a new resource for future Java-related research. Naturally, the first thing to do is bookmark the page.

Rule Three

Bookmark constantly, organize your bookmarks, and back them up as though your life depends on it.

One of the biggest and most influential entries into the reference/research world on the Internet is Wikipedia, a self-described free encyclopedia that anyone can edit. Because of its growth and importance, Wikipedia has earned a separate section in this year's edition. According to Wikipedia, the term "wiki" describes "a group of Web pages that allows users to add content, as on an Internet forum, but also allows others (often completely unrestricted) to edit the content. The term wiki also refers to the collaborative software (wiki engine) used to create such a website (see wiki software). In essence, the wiki is a vast simplification of the process of creating HTML pages, and thus is a very effective way to exchange information through collaborative effort. Wiki is sometimes interpreted as the acronym for 'what I know, is', which describes the knowledge contribution, storage and exchange up to some point."[17] The most obvious potential problem with an encyclopedia that "anyone can edit" is quality control, and in fact, one of Wikipedia's co-founders admitted serious problems with the quality and accuracy of some (perhaps a lot) of the Wikipedia content.[18] While there is a tremendous amount of good information in Wikipedia, it should not be relied upon as a sole source. Neither should it be ignored, as its "disambiguation" page on "java" shows.

Wikipedia also has the advantage of offering a free encyclopedia in a number of languages besides English, including French, Polish, Portuguese, Spanish, Dutch, Swedish, Italian, German, and Japanese.

To review, the best starting places for general information on broad topics are web directories/subject guides, virtual libraries, and reference desks. There are hundreds of such websites, but I've selected a few of the best.

About http://www.about.com/
Encyclopedia.com http://www.encyclopedia.com/
Encyclopedia Britannica[19] http://www.britannica.com/
Hotsheet http://www.hotsheet.com/
INFOMINE http://infomine.ucr.edu/
Information Please http://www.infoplease.com/
Internet Library for Librarians http://www.itcompany.com/inforetriever/index.htm
Intute (formerly RDN) http://www.intute.ac.uk/
The Internet Public Library http://www.ipl.org/
Librarians' Index to the Internet http://lii.org/
The Library Spot http://www.libraryspot.com/
Martindale's The Reference Desk http://www.martindalecenter.com/
My Virtual Reference Desk http://www.refdesk.com/
Pinakes Subject Gateway[20] http://www.hw.ac.uk/libWWW/irn/pinakes/pinakes.html
Wikipedia http://en.wikipedia.org/
WWW Virtual Library http://vlib.org/Overview.html
Yahoo Reference http://education.yahoo.com/reference/

Web Tip

Think of search engine databases as huge warehouses in which everything from diamonds to debris is stored. Your job is to find the jewels amid the muck.


  1. Michael K. Bergman, "The Deep Web: Surfacing Hidden Value," BrightPlanet, August 2001, <http://www.brightplanet.com/technology/deepweb.asp> (14 November 2006).
  2. "November 2006 Web Server Survey," Netcraft.com, 1 November 2006, <http://news.netcraft.com/archives/2006/11/01/november 2006 web server survey.html> (15 November 2006).
  3. School of Information Management and Systems, University of California at Berkeley, "How Much Information? 2003," Executive Summary, 27 October 2003, <http://www.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm#summary> (14 November 2006).
  4. Terry McCarthy, "On the Frontier of Search," Time.com, 28 August 2005, <http://www.time.com/time/magazine/article/0,9171,1098955-1,00.html> (14 November 2006).
  5. Of course there are many non-US search engines beyond those run by Google, Yahoo, and Microsoft, but they generally target a particular part of the world and are not serious competitors with Google, Yahoo, or Live Search at this time.
  6. Yahoo! and Ipsos Insight, "RSS: Crossing into the Mainstream," October 2005 [PDF], <http://publisher.yahoo.com/rss/RSS_whitePaper1004.pdf> (14 November 2006).
  7. "Web 2.0," Wikipedia, <http://en.wikipedia.org/wiki/Web_2.0> (15 November 2006).
  8. McCarthy.
  9. Steve Smith, "Search Wars: Google vs. Yahoo!," MediaPost.com, April 2004 Issue, <http://www.mediapost.com/dtls_dsp_mediamag.cfm?magID=245868>(registration required).
  10. Steven Levy, "All Eyes on Google," Newsweek , 29 March 2005, p. 54, <http://www.msnbc.msn.com/id/4570868/> (14 November 2006).
  11. Erik Larkin, "Radically New IE7 or Updated Mozilla Firefox 2--Which Browser is Better?" PC World, 18 October 2006, <http://www.pcworld.com/printable/article/id,127309/printable.html> (24 October 2006).
  12. Not 24 hours after its release, the first vulnerability was detected in IE7. Of course, it also affects IE6, but this is embarrassing for Microsoft given that the company has touted the security of IE7 over its predecessor. <http://secunia.com/Internet_Explorer_Arbitrary_Content_Disclosure_Vulnerability_Test/>
  13. Mozilla FAQ, <http://www.mozilla.org/start/1.4/faq/general.html#ns7> (14 November 2006).
  14. Sillydog.org Browser Archive, 31 October 2005, <http://sillydog.org/narchive/full67.php> (24 October 2006).
  15. The term "boolean," often encountered when doing searchers on the web (and frequently spelled "Boolean"), refers to a system of logical thought developed by the English mathematician and computer pioneer George Boole (1815-64). In boolean searching, an "and" operator between two words or other values (for example, "pear AND apple") means one is searching for documents containing both the words or values, not just one of them. An "or" operator between two words or other values (for example, "pear OR apple") means one is searching for documents containing either of the words. "Boolean," SearchSMB.com, <http://searchsmb.techtarget.com/sDefinition/O,290660, sid44 gci211695, 00.html> (14 November 2006).
  16. Barbara Quint, "Vivisimo Clustering Chosen to Enhance Searching at Institute of Physics Publishing Site," Infotoday, 25 March 2002, <http://www.infotoday.com/newsbreaks/nb0203252.htm> (14 November 2006).
  17. "Wiki," Wikipedia, Wikipedia, 2005. Answers.com <http://www.answers.com/topic/wiki> (14 November 2006).
  18. Andrew Orlowski, "Wikipedia Founder Admits to Serious Quality Problems," The Register, 18 October 2005, <http://www.theregister.co.uk/2005/10/18/wikipedia_quality_problem/> (14 November 2006).
  19. Although full-text articles require a paid subscription to Encyclopedia Britannica, the site is still a useful starting place for research and includes free access to the Britannica Concise Encyclopedia.
  20. Pinakes is the gateway to Intute and dozens of other equally valuable specialized research sites.