Next: Original Implementation
Up: The Design and
Previous: The Design and
Minimally, any implementation of Latent Semantic Indexing must
take as input a six-tuple
q = < b, r, n, t, d, f>,
where

On any particular query, b, r, n, f, and at least one
of t and d are required. However, t and d cannot both be empty.
Given q, the LSI server must perform the following tasks:
- Eliminate from t those terms
considered common in the
document collection. Typically, when a document collection
is assembled, the frequency of each word in the collection
is analyzed, and words that occur most often in the collection
are not indexed, but instead are added to the stoplist of the
collection. A stoplist for a given document collection is
a list of words generally not worth indexing
in the collection [11].
Words that occur frequently in the collection
are not valuable in the index since they make it difficult to
distinguish documents from one another, increase the amount
of space required to store the index, and increase the time
required to process a query. When a search is performed,
terms that appear on the stoplist are removed from the query
since they were not indexed in the document collection.
- Eliminate from t the terms that do not occur in the
document collection. Words that do not occur
in the document
collection can't be used to project the query into the
term-document space.
- Compose the pseudo-document that represents the query.
Each term in the document collection (except those terms
considered too common to index) occupies a particular
position in the term-document space. The position of a
particular term in the space is given by a vector, called
its term vector. Similarly, each document in the collection
is represented by a document vector that specifies the
document's position in the term-document space. In the LSI
model, a document is placed at the centroid of the terms
that comprise the document.
The pseudo-document that represents a query consists
of the weighted sums of the term vectors associated with the
terms in t combined with the sums of the document vectors
associated with the documents in d.
- Project the pseudo-document into the
term-document space and
find the n highest-ranking terms, documents, or both.
Although several different similarity measures can be used
to determine the highest ranking terms and documents for a given
query, the cosine similarity measure is most often used to
find terms and documents near the pseudo-document in the
term-document space. The cosine
similarity measure computes the angle between the vector
representing the pseudo-document and each term or document vector.
The higher the cosine between the pseudo-document and a term
or document, the closer the term or document is to the
pseudo-document, regardless of Euclidean distance. When
computing the cosine, only the first f elements of the term
and document vectors are used. This allows the user to adjust
whether he or she desires a more conceptual match (which
corresponds to smaller values of f) or a more literal match
(corresponding to larger values of f). Then, the results
of each cosine are sorted, and the top n ranking terms or
documents are returned to the user.
Typically, the results returned to the user include the
similarity of the pseudo-document to each of the top n
ranked terms or documents and the corresponding terms or
document titles. The flowchart in Figure 2 summarizes
the process.

Next: Original Implementation
Up: The Design and
Previous: The Design and
Michael W. Berry (berry@cs.utk.edu)
Tue Jul 23 08:47:48 EDT 1996