| Purpose: | To begin thinking about large-scale information retrieval by building an inverted index (using the FAST-INV algorithm) and using it to build a query engine. |
| Available: | Monday, September 11, 1995 |
| Due: | Wednesday, October 4, 1995 at 11:59 pm |
The simple query engine we built in lab 2 would be very slow if we were working with extremely large data sets. In reality, we searched for the user-supplied query terms by comparing each term against every word in every document. This approach is useful in small data sets, but I/O costs alone would make this approach unreasonable in large data sets.
Luckily, though, by doing some of the work in a preprocessing step, we can greatly reduce the amount of time required for each query. Instead of looking through the entire text every time we receive a query from the user, we can create an inverted index that transforms the querying process into a few table look-ups and a few file accesses. We merely have to map the user's query to a set of numerical identifiers that represent the keywords in the query, find the identifiers in the inverted index, and read the list of documents that contain those keywords.
Once again, there are three parts to this lab:
We will again use HTML and CGI to provide an interface for this lab. The query.html interface you created in lab 2 can be used for this lab, although you will have to modify the ACTION URL to point to the query engine you produce in part 3 of this lab. DO NOT USE THE ORIGINAL QUERY.HTML FILE YOU CREATED FOR LAB 2. YOU SHOULD COPY QUERY.HTML TO A NEW FILE NAME SINCE LAB 2 WILL NOT HAVE BEEN GRADED BY THE TIME YOU BEGIN THIS LAB. IF YOU MODIFY QUERY.HTML INSTEAD OF COPYING IT TO A NEW FILE NAME, YOU WILL LOSE A SIGNIFICANT NUMBER OF POINTS FOR LAB 2 WHEN IT DOES NOT WORK AS EXPECTED DURING GRADING.
The interface should preserve the same functionality as the interface in lab 2. That is,
You should implement the FAST-INV algorithm on pages 37-42
of the text. Since you will most likely require structs
and lists that aren't available Perl, you should use C or
C++ for this lab. To make sure we all start on the same
footing, the document database already has been converted
to the format required for input to the FAST-INV algorithm.
The file ~cs494/public/lab3/lab3-documents.dvf
(DVF = Document Vector File)
was created by parsing the text, removing the common words,
assigning a unique numerical identifier to each unique keyword,
and sorting the list of identifiers for each document.
Lab3-documents.dvf is a binary file that contains
the HCN (Highest Concept Number) followed by a stream of
tuples of the form
Each part of the tuple is a 4-byte integer. If you are interested, lab3-documents.dvf was created with the script ~cs494/public/lab3/dvf. The script ~cs494/public/lab3/readdvf has been provided to print an ASCII representation of the lab3-documents.dvf file. However, for efficiency, you should write your FAST-INV algorithm to read lab3-documents.dvf as a binary file.
Your implementation of the FAST-INV algorithm must conform to the following two constraints:
The format of any intermediate files and the format
of the final inverted index is left to your own creativity.
However, we encourage you to use binary files throughout
the algorithm for greater time and space efficiency.
The query engine should look up the user-supplied query terms in the inverted index and return to the client (with the query terms highlighted) those documents that contain one or more of the terms. Note that since you will probably again need structs and lists, the query engine will need to be written in C or C++.
In order to find the query terms in the inverted index, you'll have to know the unique identifier assigned to each keyword when the DVF was built. An ndbm file will be used to find the integer identifier associated with a given keyword (type man ndbm for details on ndbm). When the DVF was created, the ndbm files lab3dbm.dir and lab3dbm.pag were also created. The keys for lab3dbm are non-null terminated strings and the values associated with those keys are integers. The script ~/cs494/public/lab3/readdbm has been provided to list all the < key, value > pairs in the dbm file, in case you need to examine its contents.
You may assume that if a query term isn't found in the ndbm file, it has not been indexed in your inverted index and can be discarded as a search term. By making this assumption, you won't have to worry about removing the common words (given in the stoplist) from each query.
The text we will be searching can be found in ~cs494/public/lab3/lab3-documents. Like lab 2, you should not copy this file to your own directory.
Your query engine should record the elapsed wall-clock time
required to look up the query terms in the inverted index and
retrieve all the documents containing those terms. Thus,
after you send the documents to the client, you should send
to the client a line similar to
Since we won't be using Perl for this lab, the cgi-lib.pl
library won't be available to us. However, a similar CGI library
for C programmers has been installed in
~cs494/public/lab3/libcgi. By linking with the library
libcgi.a in that directory, you'll get almost the
functionality the Perl CGI library provided (to link with this
library, add -L/cloud/homes/cs494/public/lab3/libcgi and
-lcgi to your compile line).
You should consult the
documentation for further details about the library.
Points will be deducted if you read the entire lab3-documents file each time retrieve a document.
| Due: | Wednesday, October 4, 1995 at 11:59 pm |
| Deliverables: | By the due date, send the URL of your validated interface, and
the name of the directory that contains the source and
executable for your FAST-INV algorithm/query-engine to
hudgens@cs.utk.edu. When you send
the URL to Watts, please use the subject line and put the full URL, only, at the beginning of the message, before any other info, words of greeting, or other comments. This will assist with grading and it will help Watts better differentiate between your lab submissions and other email he receives. |
| Grading: |
|
| Points: | This lab is worth 50 points. |