As storage becomes more plentiful and less expensive, the amount of information retained by businesses and organizations is likely to increase. Searching that information and deriving useful facts, however, will become more cumbersome unless new techniques are developed to automatically partition the data into sets small enough for a human to understand.
Richard Wurman, in his book Information Anxiety, reports three startling facts:
Latent Semantic Indexing (LSI) [7], a vector-space approach to conceptual information retrieval, is useful in situations where traditional lexical information retrieval approaches fail. LSI estimates the semantic content of the documents in a collection and uses that estimate to rank the documents in order of decreasing relevance to a user's query. Because the search is based on the concepts contained in the documents rather than on the documents' constituent terms, LSI can retrieve documents related to a user's query even when the query and the documents share no terms. Moreover, because LSI ranks the documents by their relevance to the query, the system helps the user decide which information is likely to be most relevant to the user's interests.
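The retrieval behavior described above can be illustrated with a minimal sketch. It is not the LSI++ implementation. It assumes NumPy, a hypothetical five-term, three-document collection, and one common convention for projecting a query into the reduced space (scaling by the inverse singular values). The function name lsi_scores and all data are invented for illustration.

```python
import numpy as np

# Hypothetical term-document matrix: rows are terms, columns are documents.
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 0],  # car
    [0, 1, 0],  # automobile
    [1, 1, 0],  # engine
    [0, 0, 1],  # flower
    [0, 0, 1],  # petal
], dtype=float)


def lsi_scores(A, q, k):
    """Cosine similarity between query q and each document in a rank-k LSI space."""
    # Thin SVD; singular values are returned in decreasing order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T
    # Project the query into the k-dimensional space (assumed convention).
    q_hat = (U_k.T @ q) / s_k
    # Each row of V_k is a document in the same reduced space.
    norms = np.linalg.norm(V_k, axis=1) * np.linalg.norm(q_hat)
    return (V_k @ q_hat) / norms


# Query containing only the term "car".
query = np.array([1, 0, 0, 0, 0], dtype=float)

lexical = A.T @ query               # plain term overlap per document
scores = lsi_scores(A, query, k=2)  # LSI similarity per document
ranking = np.argsort(-scores)       # document indices, most relevant first
```

In this toy collection the second document contains "automobile" but not "car", so its lexical overlap with the query is zero. Because both terms co-occur with "engine", the rank-2 LSI space places that document close to the query, while the document about flowers stays near zero.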
Although LSI can achieve significant retrieval performance gains over standard lexical retrieval techniques (see [8]), the complexity of the LSI model often causes its execution efficiency to lag far behind that of the simpler Boolean models, especially on large data sets. By carefully examining the LSI model and applying optimizations to its underlying implementation, however, one can attain both the retrieval benefits of LSI and an execution efficiency near that of Boolean retrieval techniques. Here, an efficient, extensible, maintainable, and portable implementation of the LSI model is presented, and a simple user interface built on that implementation is explored. Using the new implementation and its user interface, users can quickly search large data sets without understanding any details of the LSI model or its implementation.
The following sections outline the development and use of an efficient, extensible, maintainable, and portable implementation of the Latent Semantic Indexing model. Section 2 explores the general ideas behind vector-space models for information retrieval and describes LSI in particular. Section 3 introduces LSI++, a C++ class library for searching with the LSI model, and Section 4 examines how LSI++ can be used to build both serial and distributed search engines. In addition, Section 4 presents a World-Wide Web interface for LSI that lets users interact with the LSI++ search engines. Finally, Section 5 examines the execution efficiency of the search engines written with LSI++.