Once the
matrix A has been created and properly weighted,
a rank-k approximation (
) to A,
, is computed
using an orthogonal decomposition known as the singular value
decomposition (SVD) [14].
The SVD of the matrix A is defined as the product of three
matrices,

where the
columns of U and V are the left and right singular vectors,
respectively, corresponding to the monotonically decreasing (in value) diagonal
elements of
which are called the singular
values of the matrix A. As illustrated in Figure 1, the
first k columns of the U and V matrices and the first (largest) k
singular values of A are used to construct a rank-k approximation
to A via
.
The columns of U and V are orthogonal, such that
,
where r is the rank of the matrix A.
A theorem due to Eckart and Young [15] suggests
that
, constructed from the k-largest singular
triplets of A (a singular value and its corresponding left and
right singular vectors are referred to as a
singular triplet),
is the closest rank-k approximation (in the least squares sense)
to A [5].
With regard to LSI,
is the closest k-dimensional approximation
to the original term-document space represented by the
incidence matrix A. As stated previously, by reducing the
dimensionality of A, much of the ``noise'' that causes poor
retrieval performance is thought to be eliminated. Thus, although
a high-dimensional representation appears to be required for good
retrieval performance, care must be taken to not reconstruct A.
If A is nearly reconstructed, the noise caused by variability of
word choice and terms that span or nearly span the document collection
won't be eliminated, resulting in poor retrieval performance [5].
In LSI, the left and right singular vectors specify the locations of the terms and documents, respectively, in the reduced term-document space. The singular values are often used to scale the term and document vectors, allowing clusters of terms and documents to be more readily identified. Within the reduced space, semantically-related terms and documents presumably lie near each other since the SVD attempts to derive the underlying, semantic structure of the term-document space [5].