CS 494/594: Information Retrieval Lab 4



Purpose: To begin thinking about methods of displaying documents to help the user interpret the results of his or her search.



Available: Monday, October 2, 1995
Due: Wednesday, October 11, 1995 at 11:59 pm

One of the fundamental problems that constantly arises in information retrieval is that the user does not know what he or she is looking for in the database. However, this is not the fault of the user. Useful databases are usually large collections of fairly heterogeneous information. Even the database maintainers may not know all the types of information contained in the database. Without knowing exactly what is contained in a database, it is impossible to formulate a query with enough detail to extract the proper documents from a database.

Studies have shown that a query from a typical user consists of only a few keywords. For a large database, one or two keywords are not enough to help the query engine find the documents that are useful to the user. Even research librarians, who are specifically trained to formulate queries, have trouble developing useful queries.

Adding to the problems of formulating proper and detailed queries are synonomy and polysemy, two topics we'll discuss later in the course. There are usually many ways to express a given concept (synonomy), so the literal terms in a user's query may not match those of a relevant document. For example, consider the words "display" and "monitor". Both "display" and "monitor" are used to refer to the same device, but if a document refers to that device as a monitor, but the user's query contains the word "display," a literal-text searching algorithm (such as your inverted index in lab 3) won't retrieve that document even though it may be useful. In addition, most words have multiple meanings (polysemy), so a user's query will literally match terms in irrelevant documents. Consider the word "tree." In computer science, "tree" has a very specific meaning. However, a database whose topics range from computer science to the natural world will retrieve documents about binary trees as well as documents about oaks, elms, and sequoias.

A reasonable query engine must not just retrieve all the documents associated with a user's query. It should also help the user select documents that are more relevant to his or her query. We haven't advanced enough to solve the problems of synonomy and polysemy, but we can apply the simple technique of ranking to help the user find relevant information more easily.

In this lab, we will rank documents according to the following premise: the most useful documents are those that contain more of the user-specified query terms. That is, suppose the user wants to search on the keywords "archeology" and "human." A document containing "archeology" five times and "human" three times is more useful to the user than a document that contains "archeology" only once and "human" twice. Thus, the former document should be ranked higher than the latter document, according to our premise (which is a rather shaky premise, by the way). Since the former document is ranked higher than the latter document, we will display the former document before the latter document. We must be careful, though. We are interested in the total number of times all the query terms appear in a document. That is, a document containing the word "human" ten times should be ranked higher than the document containing "archeology" five times and "human" three times, even though it does not contain the word "archeology."

We will use a modified version of your FAST-INV program from lab 3 in this lab. You may have noticed that the DVF file for lab 3 occasionally contained the same tuples multiple times. This meant that the given document contained several instances of that word. While you were instructed in lab 3 to ignore those extra tuples, you'll use them in this lab to retain how many times words appear in documents so you can use that information to rank the documents when a user submits a query to your query engine.

As usual, there are three parts to this lab:

  1. The interface

    We will again use HTML and CGI to provide an interface for this lab. The query.html interface you created in labs 2 and 3 can be used for this lab, although you will have to modify the ACTION URL to point to the query engine you produce in part 3 of this lab.

    The interface should preserve the same functionality as the interfaces in labs 2 and 3. That is,

    The interface will have slightly more functionality than labs 2 and 3, though. The user should be able to specify that the query engine return the top 5, 10, or 20 documents. In HTML, a good way to do this might be the

    <SELECT>...</SELECT>
    tags. See the references to HTML markup commands in the previous labs for more information.

  2. Modifying your FAST-INV program from lab 3 to retain information that can be used to rank the documents

    You should modify your FAST-INV program to keep track of the number of instances of a given term in a document. Thus, when reading the DVF file, instead of ignoring multiple tuples that are the same,

    <0, 2>
    <0, 3> term 3 occurs four times in document 0
    <0, 3>
    <0, 3>
    <0, 3>
    <0, 4>

    you should record the number of similar tuples for a given term in a document and store it in your inverted file. Thus, you should add another field to the records you use in your inverted file to specify the number of times a term appears in a document. The inverted file may look like (although it shouldn't literally look like this - remember, use binary files for efficiency!)
    1:2(3) 4(1)
    2:3(5)
    3:1(1) 2(3) 5(1)
    where the numbers in parentheses provide the number of times a term appears in the document. For example, term 1 appears three times in document 2, but only once in document 4. Term 3 appears once in documents 1 and 5, and three times in document 2.

    This should be a fairly minor change to your program from lab 3. Still, though, your implementation of the modified FAST-INV algorithm must conform to the following two constraints:

    1. Your postings file must reside on disk.
    2. Each load file should contain 10 keywords (except possibly the last load file, which may have fewer).

    Again, we encourage you to use binary files throughout the algorithm for greater time and space efficiency.

    For this lab, the DVF file can be found at ~cs494/public/lab4/lab4-documents.dvf. It has the same format as the DVF file in lab 3. Once again, the script ~cs494/public/lab4/readdvf has been provided to print an ASCII representation of the lab4-documents.dvf file, if you need it for debugging.

  3. The query engine

    The query engine for this lab will be similar to the query engine in lab 3, except this query engine will rank the documents before returning them to the user.

    The query engine should look up the user-supplied query terms in the inverted index and return to the client (with the query terms highlighted) those documents that contain one or more of the terms. However, only the top 5, 10, or 20 documents should be returned to the user, depending on the number he or she specifies to be returned.

    Like lab 3, to find a query term in the inverted index, you'll have to know the unique identifier assigned to each keyword when the DVF was built. An ndbm file will be used to find the integer identifier associated with a given keyword (type man ndbm for details on ndbm). When the DVF was created, the ndbm files lab4dbm.dir and lab4dbm.pag were also created. The keys for lab4dbm are non-null terminated strings and the values associated with those keys are integers. The script readdbm has been provided to list all the < key, value > pairs in the dbm file, in case you need to examine its contents.

    You may assume that if a query term isn't found in the ndbm file, it has not been indexed in your inverted index and can be discarded as a search term. By making this assumption, you won't have to worry about removing the common words (given in the stoplist) from each query.

    The text we will be searching can be found in ~cs494/public/lab4/lab4-documents. Like lab 3, you should not copy this file to your own directory.

    When looking up the documents in the modified inverted file, you should add together the number of occurrences of the user-supplied search terms in each document (let's call this the rank of the document). When you have found all the documents that need to be returned to the client, you should sort the documents according to the rank and display those documents with a higher ranking before those with a lesser ranking. Before you display each document, you should display the document's rank:

    Query Results




    Query String: ancient history



    Rank: 3

    Evolution Comes to Life - SCIENCE IN PICTURES (August 1992)
    by Ian Tattersall
    ======================

    Ancient bones are the objective evidence of biological history. From my standpoint as a paleontologist, they are vastly more informative about extinct creatures than reconstructions or models, in whose creation art plays at least as great a role as science. Yet I am also a museum curator, and from that perspective I am keenly aware that nothing brings the past alive in the public's eye like a well-crafted reconstruction. For the average person, fossil bones are static things: beautiful or majestic, perhaps, but hard to imbue with the attributes of a living, breathing form.

    When I was given the responsibility of curating the American Museum of Natural History's new Hall of Human Biology and Evolution, it -as therefore evident to me and to Willard Whitson, the designer of the hall, that we needed to include some reconstructions of early humans in the exhibition. Furthermore, we wanted to portray these figures dynamically in the context of situations that our ancestors might have faced long ago. Only thus, we thought, could we truly bring these long-departed relatives back to some semblance of life. We hoped that clever sculpting and modern casting materials could provide us with a level of realism rivaling that of the spectacular dioramas of modern animals in the adjacent galleries.

    Ties in ranking should be broken by a random flip of a coin. That is, if two documents have the same ranking, it doesn't matter which document is ranked higher and displayed first.

    Your query engine should record the elapsed wall-clock time required to look up the query terms in the inverted index and retrieve all the documents containing those terms. Thus, after you send the documents to the client, you should send to the client a line similar to

    Elapsed wall-clock time: 0.34 seconds

    You should once again use libcgi.a from ~cs494/public/lab3/libcgi for your dealings with CGI. Again, consult the documentation for further details about the library.



    WARNING: DO NOT MODIFY YOUR SOURCE CODE AND/OR OVERWRITE ANY PART OF LAB 3 WHILE DOING THIS ASSIGNMENT. YOU SHOULD COPY EVERYTHING TO DIFFERENT FILENAMES (AND A DIFFERENT DIRECTORY) BEFORE MODIFYING THE FILES. YOU SHOULD ALSO MODIFY YOUR MAKEFILE SO IT DOES NOT OVERWRITE YOUR LAB 3 EXECUTABLES WITH PARTIALLY COMPLETED LAB 4 EXECUTABLES. IF YOU MODIFY YOUR LAB 3 EXECUTABLES OR SOURCE CODE AFTER THE DUE DATE, YOU WILL LOSE A SIGNIFICANT NUMBER OF POINTS FOR LAB 3 WHEN IT'S TIME STAMP HAS CHANGED OR WHEN IT DOES NOT WORK AS EXPECTED DURING GRADING.



    Summary:

    Due: Wednesday, October 11, 1995 at 11:59 pm
    Deliverables: By the due date, send the URL of your validated interface, the name of the directory that contains the source and executable for your modified FAST-INV algorithm, and a short description of the implementation details (file structures, important algorithms you used, etc.) and data structures that were changed in your FAST-INV algorithm and query engine to hudgens@cs.utk.edu. You should also include instructions on how to run your implementation of the modified FAST-INV algorithm. When you send the URL and description to Watts, please use the subject line
    Subject: IR lab 4 submission

    and put the full URL, only, at the beginning of the message, before any other info, words of greeting, or other comments. This will assist with grading and it will help Watts better differentiate between your lab submissions and other email he receives.
    Grading:
    1. Correctness - whether your search engine finds all the documents containing the user-defined search terms and ranks them correctly. In addition, you must conform to the following constraints as in lab 3:
      • The postings file must be written to and read from disk.
      • The load file must be of size 10.
      • You must use lab4-documents.ind to retrieve documents from lab4-documents. That is, you should use the random-access scheme used in lab 3 rather that reading the entire document database each time you need to access a document.
    2. Documentation - you should use good programming style and good documentation practices. Since we would rather not have to look through code, the description of your algorithm and data structures is very important.
    3. Algorithmic elegance - efficient, elegant code is better than unmaintainable, non-portable, and poorly thought-out code. Use good programming style and plan your code before you start writing it.
    Points: This lab is worth 40 points.