Four hypertext matrices used for performance comparisons
among the symbolic and spectral-based methods presented
in Sections 3.3 through 3.5 are listed
in Table 3. The first two matrices, MAN1 and
MAN2, were constructed from the See Also entries of the BSD 4.3
Unix manpages. The manpage of who, for example, contains
the See Also entries getuid and utmp. Hence,
two links are associated with who, namely
who
getuid and
who
utmp. Parsing all 625 manpages
produced 1853 links, and hence the rows and columns of MAN1
correspond to links and manpages, respectively. The nonzero elements
of both MAN1 and MAN2 are all 1's and simply reflect the
incidence rather than frequency of linkage. The MAN2 matrix
was derived from the MAN1 matrix by removing duplicate links
(i.e., who
getuid is the same as
getuid
who) and removing 18 manpages
(columns) whose links were not connected to the main graph of
the MAN1 matrix. The resulting MAN2 matrix had
1426 unique links corresponding to 607 manpages.
The 850 entries under the letter A of the Concise Columbia Encyclopedia (1989 Second Edition, on-line version) produced the 1778 cross-references or links for the CCE-A matrix listed in Table 3. For the 14-th entry ABDOMEN shown below, there are five cross-references or links indicated by brackets: [stomach], [liver], [gall bladder], [pancreas], and [kidneys]. Hence, the 14-th column of CCE-A has five nonzeros (or 1's) in row positions 25 ([stomach]) through 29 ([kidneys]), which correspond to the links in order of their occurrence in the text.
ABDOMEN in vertebrates, portion of the trunk between the diaphragm and lower pelvis. In humans the abdominal cavity is lined with a thin membrane, the peritoneum, which encloses the [stomach], intestines, [liver], and [gall bladder]. The [pancreas], [kidneys], urinary bladder, and, in the female, reproductive organs are also located within the abdominal cavity.The NHSE matrix in Table 3 was derived from 400 of the distributed HTML documents accessible from the National HPCC Software Exchange (or NHSE) [BDG+95] homepage on the World Wide Web (WWW). The selected documents fell under the NHSE's HPCC Programs and Activities heading, and reflect a breadth-first tree search of links of the form <A HREF="..."</A> to 3 three levels which produced a total of 4233 hypertext links.
The density and average number of nonzeros per row (
)
and column (
) of each of the four hypertext matrices are
also provided in Table 3. The Density of
each matrix is defined to be the ratio (Nonzeros)
(Rows
Columns).
Clearly, these matrices are quite sparse with only 1 or 2 links
associated with each document (manpage, article, or HTML page).
Table 4 lists the bandwidth , envelope size
, and the alternative bandwidth measure
(see in Section 3.1) for each of the hypertext matrices, and
Figure 1 illustrates the nonzero patterns for three of
the four matrices considered. The upper-triangular structure for the
nonzeros of matrices MAN2 and
CCE-A reflects the identification of links
(cross-references) in their order of occurrence in the original
(on-line) text.
Table 4: Profiles of the hypertext
matrices prior to reordering;
is the envelope size, denotes the bandwidth,
and
is the alternative (non-diagonal) bandwidth measure.
The goal of the techniques in the next two sections will be to reorder
rows and columns of the term-by-document matrix to reduce both and
.
A similar problem for symmetric matrices arises in the context of sparse
Cholesky factorization. Since the Cholesky decomposition is stable under
any symmetric permutation, various schemes have been proposed to reorder
the rows and columns to reduce the number of generated nonzero values
during the factorization.
Several of the techniques derived below have their antecedents in symmetric
envelope reduction algorithms. The correspondence analysis
technique described in Section 3.5 is the primary exception, since
its expense is too large for the Cholesky reordering problem.