
5.2 Methodology

Four LSI search engines were tested to determine their relative execution efficiency while processing queries. Each search engine received its input via CGI and wrote an HTML document as output. The four search engines tested were:

lsiFinder
lsiFinder is written in the Perl programming language [23] and has approximately the same functionality as the original LSI WWW interface written by Loren Shih at Bellcore in 1994. Given a query, lsiFinder calls mlsisearch with the appropriate parameters. The output of mlsisearch is then processed by lsiFinder to produce an HTML page. Since lsiFinder uses the original Bellcore implementation of LSI to perform the search, lsiFinder is used as the base case to which the LSI++ code is compared.

lsiQuery
lsiQuery is a serial search engine based on LSI++. lsiQuery loads the terms and documents incrementally from disk as they are needed. It receives the query via CGI, performs the search using LSI++, and writes the results as an HTML document.

lsiBackend
lsiBackend is a continuously-running, serial backend server that receives a query from lsiRemote, performs the search using LSI++, and passes the resulting HTML document back to lsiRemote. Like lsiQuery, lsiBackend incrementally loads the term and document vectors from disk.

plsiBackend
plsiBackend is a continuously-running, distributed backend server. Like lsiBackend, plsiBackend receives its input from lsiRemote, performs the search using LSI++, and passes the resulting HTML document back to lsiRemote.

To measure the performance of lsiFinder, lsiQuery, lsiBackend, and plsiBackend, twenty-five random queries were selected for each of four document collections (see Table 1 for a description of each document collection).

 


Table 1: The document collections used during performance testing to determine the execution efficiency of LSI++. The Usenet News Archive document collection is a compilation of USENET newsgroup postings taken from a variety of newsgroups between February 2, 1996 and February 8, 1996.

Each query consisted of one to five terms known to be in the document collection and zero to four terms not in the document collection. In addition, approximately half of the queries were relevance feedback queries, with one to five documents added to the query. Each set of twenty-five queries was posed to the search engines twice, once to find related terms and again to find related documents. Table 2 summarizes the queries selected for each document collection.

 


Table 2: A summary of the 25 random queries selected for each document collection. 
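
The query mix described above can be illustrated with a short sketch. This is not the procedure actually used to select the queries; it is a minimal Python illustration under stated assumptions. The function names, the dictionary layout, and the fixed random seed are all hypothetical, and "approximately half" is modeled as an independent 50% chance per query.

```python
import random

def make_query(known_terms, unknown_terms, doc_ids, rng):
    """Build one random test query (hypothetical helper, not from LSI++).

    Assumes at least 5 known terms, 4 unknown terms, and 5 document ids.
    """
    # One to five terms known to be in the document collection.
    terms = rng.sample(known_terms, rng.randint(1, 5))
    # Zero to four terms that are not in the document collection.
    terms += rng.sample(unknown_terms, rng.randint(0, 4))
    # Roughly half of the queries are relevance feedback queries,
    # which add one to five documents to the query.
    docs = []
    if rng.random() < 0.5:
        docs = rng.sample(doc_ids, rng.randint(1, 5))
    return {"terms": terms, "docs": docs}

def make_query_set(known_terms, unknown_terms, doc_ids, n=25, seed=0):
    """Build a set of n random queries for one document collection."""
    rng = random.Random(seed)  # fixed seed so the set is reproducible
    return [make_query(known_terms, unknown_terms, doc_ids, rng)
            for _ in range(n)]
```

Each set built this way would then be posed to a search engine twice, once asking for related terms and once asking for related documents.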

Each search engine recorded the total wall-clock time for every search. The engine called the gettimeofday() system call once when the search began and again when the search finished, and the total wall-clock time was found by subtracting the starting time from the ending time. Since Perl lacks adequate timers, lsiFinder determined the starting and ending times by calling an external program that reported the results of a call to gettimeofday().
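
The start/stop timing scheme can be sketched as follows. The search engines themselves called gettimeofday() directly; this Python sketch substitutes time.time(), which likewise reports wall-clock time since the epoch. The function name timed_search and its arguments are hypothetical.

```python
import time

def timed_search(search, query):
    """Run search(query) and return (result, elapsed wall-clock seconds)."""
    start = time.time()      # wall-clock time when the search begins
    result = search(query)   # the search being measured
    end = time.time()        # wall-clock time when the search finishes
    # Total wall-clock time: ending time minus starting time.
    return result, end - start
```

Because this measures wall-clock time rather than CPU time, the result includes any time the process spends waiting on disk or on other processes.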

Due to the configuration of the machines on which the performance timings were taken, an HTTP server was not available to pass queries to the search engines. In order to perform the tests, the HTTP server was simulated by setting the environment variables required by CGI and passing the queries to each search engine as a CGI-encoded string. Thus, the time required by the HTTP server to receive a query from a remote browser and spawn either lsiQuery or lsiRemote is not included in the timings given in the following sections.
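
Simulating the HTTP server in this way can be sketched in a few lines. The sketch below is an illustration only and makes several assumptions. It assumes a GET-style request, in which the CGI-encoded query travels in the QUERY_STRING environment variable; the source does not say which request method was used. The form field name in the usage note is hypothetical, and ECHO_ENGINE is a stand-in child process rather than one of the search engines.

```python
import os
import subprocess
import sys
from urllib.parse import urlencode

# Stand-in for a search engine (assumption): a child Python process
# that simply echoes the QUERY_STRING it was given.
ECHO_ENGINE = [sys.executable, "-c",
               "import os; print(os.environ['QUERY_STRING'])"]

def run_as_cgi(command, fields):
    """Run `command` as if an HTTP server had spawned it for a GET request."""
    env = dict(os.environ)
    # Environment variables a CGI program expects the server to set.
    env["GATEWAY_INTERFACE"] = "CGI/1.1"
    env["REQUEST_METHOD"] = "GET"
    env["QUERY_STRING"] = urlencode(fields)  # the CGI-encoded query string
    completed = subprocess.run(command, env=env, capture_output=True,
                               text=True, check=True)
    return completed.stdout  # for a real engine, the HTML document
```

For example, run_as_cgi(ECHO_ENGINE, {"query": "latent semantic"}) passes the encoded string query=latent+semantic to the child process. Since no HTTP server is involved, the measured time excludes the cost of receiving the request and spawning the program, as noted above.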






Michael W. Berry (berry@cs.utk.edu)
Tue Jul 23 08:47:48 EDT 1996