Preface

    Anyone who has used a web search engine with any regularity knows that there is an element of the unknown with every query.  Sometimes the user will type in a "stream-of-consciousness" query and the documents retrieved are a perfect match, while the next query can be seemingly succinct and focused only to earn the bane of all search results -- the "no documents found" response.  Oftentimes the same queries can be submitted to different databases with just the opposite results.  It's an experience aggravating enough to make one swear off web searches, as well as swear at the developers of such systems.

    However, because of the transparent nature of computer software design, there is a tendency to forget the decisions and tradeoffs that are constantly made throughout the design process and that affect the performance of the system. One of the main objectives of this book is to identify for the novice search engine builder, such as the senior-level computer science or applied mathematics student or the information sciences graduate student specializing in retrieval systems, the impact of certain decisions that are made at various junctures of this development.  One of the major decisions in developing information retrieval systems is selecting and implementing the computational approaches within an integrated software environment.  Applied mathematics plays a major role in search engine performance, and Understanding Search Engines (USE) focuses on this area, bridging the gap between the fields of applied mathematics and information management -- disciplines that previously have operated largely in independent domains.

    USE does more than bridge the gap between applied mathematics and information management; it also fills a niche in the information retrieval literature.  Information Retrieval: Data Structures and Algorithms (1992), edited by William Frakes and Ricardo Baeza-Yates, a collection of journal articles on various related topics, and Gerald Kowalski's Information Retrieval Systems: Theory and Implementation (1997), a broad overview of information retrieval systems, are fine textbooks on the topic, but both understandably lack the gritty details of the mathematical computations needed to build more successful search engines.

    With this in mind, USE does not provide an overview of information retrieval systems; instead, it assumes a supplementary role to the above-mentioned books. Many of the ideas for USE were presented and developed as part of a Data & Information Management course at the University of Tennessee's Computer Science Department -- a course which won the 1997 Undergraduate Computational Engineering and Science Award sponsored by the United States Department of Energy and the Krell Institute. The course, which required student teams to build their own search engines, has provided invaluable background material for the development of this book.

    As mentioned earlier, this book concentrates on the applied mathematics portion of search engines.  Although not apparent to the casual search engine user, mathematics plays an integral part in information retrieval systems by computing the emphasis the query terms have in their relationship to the database.  This is especially true in vector space modeling, which is one of the predominant techniques used in search engine design.  With vector space modeling, traditional orthogonal matrix decompositions from linear algebra can be used to encode both terms and documents in k-dimensional space.
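
    As a rough illustration of the vector space idea described above, the following minimal Python sketch encodes a few documents and a query as term-frequency vectors and ranks the documents by cosine similarity to the query.  The toy documents, the query, the raw term-frequency weighting, and all function names are assumptions made for this example only; the sketch does not include the matrix decompositions, which reduce such vectors to k dimensions.

```python
import math
from collections import Counter

# Hypothetical toy collection; the documents are illustrative only.
documents = [
    "matrix decompositions for information retrieval",
    "building a web search engine",
    "linear algebra and matrix computations",
]


def build_vocabulary(texts):
    """Return a sorted list of the distinct terms in the collection."""
    return sorted({term for text in texts for term in text.split()})


def to_vector(text, vocabulary):
    """Encode a text as raw term frequencies over the vocabulary."""
    counts = Counter(text.split())
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, texts):
    """Return (score, document index) pairs, best match first."""
    vocabulary = build_vocabulary(texts)
    query_vector = to_vector(query, vocabulary)
    scores = [
        (cosine(query_vector, to_vector(text, vocabulary)), index)
        for index, text in enumerate(texts)
    ]
    # Sort by descending score; ties are broken by document index.
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


ranking = rank("matrix algebra", documents)
for score, index in ranking:
    print(f"{score:.3f}  {documents[index]}")
```

    In this sketch the third document ranks first because it shares both query terms, while the second document shares none and scores zero.  A production system would typically apply term weighting and dimension reduction rather than raw counts.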

    That is not to say that other computational methods are not useful or valid, but in order to teach future developers the intricate details of a system, a single approach had to be selected. The reader can therefore expect a fair amount of math, including explanations of algorithms and data structures and how they operate in information retrieval systems.  This book will not hide the math (concentrated in Chapter Three: Vector Space Models and Chapter Four: Matrix Decompositions), nor will it allow itself to get bogged down in it.  A person with a non-mathematical background (such as an information scientist) can still appreciate some of the mathematical intricacies involved with building search engines without reading the more technical chapters.

    To maintain its focus on the mathematical approach, USE has purposely avoided digressions into Java programming, HTML programming, and how to create a web interface. An informal, conversational approach has been adopted to give the book a less intimidating tone, which is especially important considering the possible multi-disciplinary backgrounds of its potential readers; however, standard math notation will be used.  Boxed items throughout the book contain ancillary information such as mathematical examples, anecdotes, and current practices to help guide the discussion.  Websites providing software (e.g., CGI scripts, text parsers, numerical software) and text corpora are listed in Chapter Nine: Further Reading.
     

    Acknowledgements

    The authors would like to gratefully acknowledge the support and encouragement of SIAM publishers, the United States Department of Energy, the Krell Institute, the National Science Foundation for supporting related research, the University of Tennessee, the students of CS460/594 (Fall Semester 1997), and graduate assistant Luojian Chen. Special thanks to Alan Wallace and David Penniman from the School of Information Sciences at the University of Tennessee, Padma Raghavan and Ethel Wittenberg in the Department of Computer Science at the University of Tennessee, Barbara Chen at H.W. Wilson Company, and Martha Ferrer at Elsevier Science SPD for their helpful proofreading, comments, and/or suggestions.  The authors would also like to thank Katie Terpstra and Eric Clarkson for their work with the book cover artwork and design, respectively.

    Hopefully, USE will help future developers, whether they be students or software engineers, to lessen the aggravation encountered with the current state of search engines. It's a critical time for search engines and the future of the Web itself, as both ultimately depend on how easily users can find the information they are looking for.