Anyone who has used a web search engine with any regularity knows that there is an element of the unknown with every query. Sometimes the user will type in a "stream-of-consciousness" query and the documents retrieved are a perfect match, while the next query can be seemingly succinct and focused only to earn the bane of all search results -- the "no documents found" response. Oftentimes the same queries can be submitted to different databases with just the opposite results. It's an experience aggravating enough to make one swear off web searches and swear at the developers of such systems.
However, because of the transparent nature of computer software design, there is a tendency to forget the decisions and tradeoffs that are constantly made throughout the design process and that affect the performance of the system. One of the main objectives of this book is to identify for the novice search engine builder, such as the senior-level computer science or applied mathematics student or the information sciences graduate student specializing in retrieval systems, the impact of certain decisions that are made at various junctures of this development. One of the major decisions in developing information retrieval systems is selecting and implementing the computational approaches within an integrated software environment. Applied mathematics plays a major role in search engine performance, and Understanding Search Engines (USE) focuses on this area, bridging the gap between the fields of applied mathematics and information management -- disciplines which previously have operated largely in independent domains.
But USE doesn't only fill the gap between applied mathematics
and information management; it also fills a niche in the information retrieval
literature. William Frakes and Ricardo Baeza-Yates' (eds.)
Information Retrieval: Data Structures and Algorithms, a 1992 collection
of journal articles on various related topics, and Gerald
Kowalski's (1997) Information Retrieval Systems: Theory and
Implementation, a broad overview of information retrieval systems, are
fine textbooks on the topic, but both understandably lack the gritty details
of the mathematical computations needed to build more successful search
engines.
With this in mind, USE does not attempt to provide an overview of information retrieval systems; it prefers to assume a supplementary role to the above-mentioned books. Many of the ideas for USE were presented and developed as part of a Data & Information Management course at the University of Tennessee's Computer Science Department -- a course which won the 1997 Undergraduate Computational Engineering and Science Award sponsored by the United States Department of Energy and the Krell Institute. The course, which required student teams to build their own search engines, has provided invaluable background material in the development of this book.
As mentioned earlier, this book concentrates on the applied mathematics
portion of search engines. Although not apparent to the pedestrian
search engine user, mathematics plays an integral part in information retrieval
systems by computing the emphasis the query terms have in their relationship
to the database. This is especially true in vector space modeling,
which is one of the predominant techniques used in search engine design.
With vector space modeling, traditional orthogonal matrix decompositions
from linear algebra can be used
to encode both terms and documents in k-dimensional space.
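As a minimal sketch of this idea, and not the book's own code, the Python fragment below uses the singular value decomposition as one example of an orthogonal matrix decomposition. The toy term-by-document matrix, the choice k = 2, and the names encode and rank_documents are all assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical toy term-by-document matrix (rows = terms, columns = documents).
terms = ["search", "engine", "matrix", "vector"]
A = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
])


def encode(A, k):
    """Encode terms and documents in k-dimensional space via a truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]                    # orthonormal basis for the term side
    term_coords = U_k * s[:k]         # one k-dimensional row per term
    doc_coords = Vt[:k, :].T * s[:k]  # one k-dimensional row per document
    return U_k, term_coords, doc_coords


def rank_documents(query, U_k, doc_coords):
    """Rank documents by cosine similarity to the query in the k-dim space."""
    q_k = U_k.T @ query               # project the query onto the same basis
    denom = np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(q_k)
    scores = (doc_coords @ q_k) / np.where(denom == 0, 1.0, denom)
    order = np.argsort(-scores)
    return [(int(j), float(scores[j])) for j in order]


U_k, term_coords, doc_coords = encode(A, k=2)

# A one-term query for "matrix", written as a vector over the same terms.
query = np.array([0.0, 0.0, 1.0, 0.0])
ranking = rank_documents(query, U_k, doc_coords)
```

Here ranking is a list of (document index, cosine score) pairs, best match first; for this toy matrix the third document (index 2), the only one containing "matrix", comes out on top. Projecting the query with the transpose of U_k puts it in the same coordinates as the rows of doc_coords, which is why the two can be compared directly.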
This is not to say that other computational methods are not useful or valid, but in order to teach future developers the intricate details of a system, a single approach had to be selected. Therefore, the reader can expect a fair amount of math, including explanations of algorithms and data structures and how they operate in information retrieval systems. This book will not hide the math (concentrated in Chapter Three: Vector Space Models and Chapter Four: Matrix Decompositions), nor will it allow itself to get bogged down in it. A person with a non-mathematical background (such as an information scientist) can still appreciate some of the mathematical intricacies involved with building search engines without reading the more technical chapters.
To maintain its focus on the mathematical approach, USE has purposely
avoided digressions into Java programming, HTML programming, and how to
create a web interface. An informal conversational approach has been
adopted to give the book a less intimidating tone, which is especially
important considering the possible multi-disciplinary backgrounds of its
potential readers. Standard math notation will be used, however.
Boxed items throughout the book contain ancillary information such as mathematical
examples, anecdotes, and current practices to help guide the discussion.
Websites providing software (e.g., CGI scripts, text parsers, numerical
software) and text corpora are listed in Chapter Nine: Further
Reading.
Hopefully, USE will help future developers, whether they be students
or software engineers, to lessen the aggravation encountered with the current
state of search engines. It's a critical time for search engines and the
future of the Web itself, as both ultimately depend on how easily users
can find the information they are looking for.