next up previous
Next: Original Implementation Up: The Design and Previous: The Design and

3.1 Functional Specification

Minimally, any implementation of Latent Semantic Indexing must take as input a six-tuple

q = < b, r, n, t, d, f>,

where

On any particular query, b, r, n, f, and at least one of t and d are required. However, t and d cannot both be empty.

Given q, the LSI server must perform the following tasks:

  1. Eliminate from t those terms considered common in the document collection. Typically, when a document collection is assembled, the frequency of each word in the collection is analyzed, and words that occur most often in the collection are not indexed, but instead are added to the stoplist of the collection. A stoplist for a given document collection is a list of words generally not worth indexing in the collection [11].   Words that occur frequently in the collection are not valuable in the index since they make it difficult to distinguish documents from one another, increase the amount of space required to store the index, and increase the time required to process a query. When a search is performed, terms that appear on the stoplist are removed from the query since they were not indexed in the document collection.
  2. Eliminate from t the terms that do not occur in the document collection. Words that do not occur in the document collection can't be used to project the query into the term-document space.
  3. Compose the pseudo-document that represents the query. Each term in the document collection (except those terms considered too common to index) occupies a particular position in the term-document space. The position of a particular term in the space is given by a vector, called its term vector. Similarly, each document in the collection is represented by a document vector that specifies the document's position in the term-document space. In the LSI model, a document is placed at the centroid of the terms that comprise the document. The pseudo-document that represents a query consists of the weighted sums of the term vectors associated with the terms in t combined with the sums of the document vectors associated with the documents in d.
  4. Project the pseudo-document into the term-document space and find the n highest-ranking terms, documents, or both. Although several different similarity measures can be used to determine the highest ranking terms and documents for a given query, the cosine similarity measure is most often used to find terms and documents near the pseudo-document in the term-document space. The cosine similarity measure computes the angle between the vector representing the pseudo-document and each term or document vector. The higher the cosine between the pseudo-document and a term or document, the closer the term or document is to the pseudo-document, regardless of Euclidean distance. When computing the cosine, only the first f elements of the term and document vectors are used. This allows the user to adjust whether he or she desires a more conceptual match (which corresponds to smaller values of f) or a more literal match (corresponding to larger values of f). Then, the results of each cosine are sorted, and the top n ranking terms or documents are returned to the user.

Typically, the results returned to the user include the similarity of the pseudo-document to each of the top n ranked terms or documents and the corresponding terms or document titles. The flowchart in Figure 2 summarizes the process.

 Figure 2



next up previous
Next: Original Implementation Up: The Design and Previous: The Design and



Michael W. Berry (berry@cs.utk.edu)
Tue Jul 23 08:47:48 EDT 1996