CS 494/594: Information Retrieval Lab 3



Purpose: To begin thinking about large-scale information retrieval by building an inverted index (using the FAST-INV algorithm) and using it to build a query engine.



Available: Monday, September 11, 1995
Due: Wednesday, October 4, 1995 at 11:59 pm

The simple query engine we built in lab 2 would be very slow if we were working with extremely large data sets. In reality, we searched for the user-supplied query terms by comparing each term against every word in every document. This approach is useful in small data sets, but I/O costs alone would make this approach unreasonable in large data sets.

Luckily, though, by doing some of the work in a preprocessing step, we can greatly reduce the amount of time required for each query. Instead of looking through the entire text every time we receive a query from the user, we can create an inverted index that transforms the querying process into a few table look-ups and a few file accesses. We merely have to map the user's query to a set of numerical identifiers that represent the keywords in the query, find the identifiers in the inverted index, and read the list of documents that contain those keywords.

Once again, there are three parts to this lab:

  1. The interface

    We will again use HTML and CGI to provide an interface for this lab. The query.html interface you created in lab 2 can be used for this lab, although you will have to modify the ACTION URL to point to the query engine you produce in part 3 of this lab. DO NOT USE THE ORIGINAL QUERY.HTML FILE YOU CREATED FOR LAB 2. YOU SHOULD COPY QUERY.HTML TO A NEW FILE NAME SINCE LAB 2 WILL NOT HAVE BEEN GRADED BY THE TIME YOU BEGIN THIS LAB. IF YOU MODIFY QUERY.HTML INSTEAD OF COPYING IT TO A NEW FILE NAME, YOU WILL LOSE A SIGNIFICANT NUMBER OF POINTS FOR LAB 2 WHEN IT DOES NOT WORK AS EXPECTED DURING GRADING.

    The interface should preserve the same functionality as the interface in lab 2. That is,



  2. Building the inverted index

    You should implement the FAST-INV algorithm on pages 37-42 of the text. Since you will most likely require structs and lists that aren't available Perl, you should use C or C++ for this lab. To make sure we all start on the same footing, the document database already has been converted to the format required for input to the FAST-INV algorithm. The file ~cs494/public/lab3/lab3-documents.dvf (DVF = Document Vector File) was created by parsing the text, removing the common words, assigning a unique numerical identifier to each unique keyword, and sorting the list of identifiers for each document. Lab3-documents.dvf is a binary file that contains the HCN (Highest Concept Number) followed by a stream of tuples of the form

    < doc #, term # >
    < doc #, term # >
    < doc #, term # >
    ...
    Thus, the beginning of the file looks like

    < HCN >
    < doc #, term # >
    < doc #, term # >
    < doc #, term # >
    ...

    Each part of the tuple is a 4-byte integer. If you are interested, lab3-documents.dvf was created with the script ~cs494/public/lab3/dvf. The script ~cs494/public/lab3/readdvf has been provided to print an ASCII representation of the lab3-documents.dvf file. However, for efficiency, you should write your FAST-INV algorithm to read lab3-documents.dvf as a binary file.

    Your implementation of the FAST-INV algorithm must conform to the following two constraints:

    1. Your postings file must reside on disk.
    2. Each load file should contain 10 keywords (except possibly the last load file, which may have fewer)

    The format of any intermediate files and the format of the final inverted index is left to your own creativity. However, we encourage you to use binary files throughout the algorithm for greater time and space efficiency.

  3. The query engine

    The query engine should look up the user-supplied query terms in the inverted index and return to the client (with the query terms highlighted) those documents that contain one or more of the terms. Note that since you will probably again need structs and lists, the query engine will need to be written in C or C++.

    In order to find the query terms in the inverted index, you'll have to know the unique identifier assigned to each keyword when the DVF was built. An ndbm file will be used to find the integer identifier associated with a given keyword (type man ndbm for details on ndbm). When the DVF was created, the ndbm files lab3dbm.dir and lab3dbm.pag were also created. The keys for lab3dbm are non-null terminated strings and the values associated with those keys are integers. The script ~/cs494/public/lab3/readdbm has been provided to list all the < key, value > pairs in the dbm file, in case you need to examine its contents.

    You may assume that if a query term isn't found in the ndbm file, it has not been indexed in your inverted index and can be discarded as a search term. By making this assumption, you won't have to worry about removing the common words (given in the stoplist) from each query.

    The text we will be searching can be found in ~cs494/public/lab3/lab3-documents. Like lab 2, you should not copy this file to your own directory.

    Your query engine should record the elapsed wall-clock time required to look up the query terms in the inverted index and retrieve all the documents containing those terms. Thus, after you send the documents to the client, you should send to the client a line similar to

    Elapsed wall-clock time: 0.34 seconds

    Since we won't be using Perl for this lab, the cgi-lib.pl library won't be available to us. However, a similar CGI library for C programmers has been installed in ~cs494/public/lab3/libcgi. By linking with the library libcgi.a in that directory, you'll get almost the functionality the Perl CGI library provided (to link with this library, add -L/cloud/homes/cs494/public/lab3/libcgi and -lcgi to your compile line). You should consult the documentation for further details about the library.



    Other comments:


    Summary:

    Due: Wednesday, October 4, 1995 at 11:59 pm
    Deliverables: By the due date, send the URL of your validated interface, and the name of the directory that contains the source and executable for your FAST-INV algorithm/query-engine to hudgens@cs.utk.edu. When you send the URL to Watts, please use the subject line

    Subject: IR lab3 submission

    and put the full URL, only, at the beginning of the message, before any other info, words of greeting, or other comments. This will assist with grading and it will help Watts better differentiate between your lab submissions and other email he receives.
    Grading:
    1. Correctness - whether your search engine finds all the documents containing the user-defined search terms. In addition, you must conform to the constraints given above. Namely,
      • The postings file must be written to and read from disk.
      • The load file must be of size 10.
      • You must use lab3-documents.ind to retrieve documents from lab3-documents. That is, you should use the random-access scheme given above rather than reading the entire document database each time you need to access a document.
    2. Documentation - you should use good programming style and good documentation practices. Since we would rather not have to look through code, the description of your algorithm and data structures is very important.
    3. Algorithmic elegance - efficient, elegant code is better than unmaintainable, non-portable, and poorly thought-out code. Use good programming style and plan your code before you start writing it.
    Points: This lab is worth 50 points.