CS 494/594: Information Retrieval Lab 2



Purpose: To be exposed to real-world uses of Perl and to explore the client-server model of communication via Common Gateway Interface (CGI) scripts.



Available: Tuesday, September 5, 1995
Due: Wednesday, September 13, 1995 at 11:59 pm

One of the most important aspects of an information retrieval system (from the user's perspective, anyway) is the interface. The interface hides the details of the query engine and makes it easier for the user to enter queries and interpret the results. However, graphical interfaces are inherently difficult to write. In this class, we are less concerned with the interface than with the algorithms that perform the queries themselves. On the other hand, every query engine we write will need some sort of interface. WWW browsers such as Netscape provide a convenient, easy way to produce relatively good interfaces.

The WWW client can collect information from the user and pass it to the HTTP server. The HTTP server can execute scripts and programs that process the information and send the results back to the user. This forms-processing ability of the server is called the Common Gateway Interface (CGI). For this lab, you will need Pat Ryan's Introduction to Perl (handed out in lab last Wednesday) and one of the many [...]. The lab has two parts, creating the interface and creating the query engine. Let's look at each part in detail:

  1. Creating the interface -- After reading the introduction to CGI document above, and reading Mosaic for X version 2.0 Fill-Out Form Support, you should create an HTML page in your ~/www-home directory called query.html. This page should be an HTML form that allows the user to enter several keywords and submit the form to the second part of this lab (the query engine). For example, your page might appear as follows:

    Anything on this page that needs to be submitted to the HTTP server must be enclosed within the

    <FORM...>...</FORM>
    HTML tags. If some of the input boxes are not enclosed within the
    <FORM...>...</FORM>
    tags, the data the user enters in them will not be properly submitted to the server. Within the beginning
    <FORM...>
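    tag, you should specify METHOD="POST" and an ACTION URL that points to your query engine script (lab2.cgi, written in part 2 of this lab).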

    If you would like to test whether your interface is working correctly, you can use a test server that has been set up specifically for that purpose. To do this, change your ACTION URL to http://hoohoo.ncsa.uiuc.edu/htbin-post/post-query. When you submit your query, it will tell you what your form actually submitted to the server.

    As with all HTML documents you create, you should validate your page at the HALSoft HTML Validation Service. As in lab 1, you should set the level of conformance to "Mozilla". Also, remember to make query.html world-readable. Otherwise, the server will not be able to access it.

  2. Creating the query engine -- To begin, create the directory ~/www-home/cgi-bin and make it world-readable and world-executable. Then, you should write a Perl script named lab2.cgi that (1) reads the query terms from the HTTP server, and (2) searches the document database for documents that contain those terms and sends those documents to the client. We deal with each step in turn:

    1. Read the query terms from the HTTP server

      When the server receives a CGI request, it uses the ACTION URL to determine the name and location of the script that should be invoked. Once the script has started, the server passes the user input (the keywords that make up the query, in this case) to the script in one of two ways:

      • if METHOD="GET", the user input is sent to the script via an environment variable (QUERY_STRING)
      • if METHOD="POST", the user input is sent to the script through stdin.

      In part 1 of this lab, we specified that you should use METHOD="POST". Since some flavors of UNIX might truncate environment variables longer than an arbitrary length (like 256 characters), POSTing is usually the safest way to send input to the CGI script or program.
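To make the two delivery methods concrete, here is a minimal sketch of what a CGI script would have to do by hand if it had no library to help. It is not the code in cgi-lib.pl; the subroutine names raw_form_data and decode_form are hypothetical, and the sketch assumes the usual CGI conventions (QUERY_STRING for GET, CONTENT_LENGTH bytes on stdin for POST, and "name=value" pairs joined by "&").

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fetch the raw form data the way each METHOD delivers it.
# The environment (hash ref) and input filehandle are passed in
# so the routine is easy to try out without a server.
sub raw_form_data {
    my ($env, $fh) = @_;
    my $method = $env->{REQUEST_METHOD} || '';
    return $env->{QUERY_STRING} || '' if $method eq 'GET';
    my $raw = '';
    if ($method eq 'POST') {
        # POST data arrives on stdin; CONTENT_LENGTH says how many bytes.
        read($fh, $raw, $env->{CONTENT_LENGTH} || 0);
    }
    return $raw;
}

# Decode "name=value&name2=value2" into a hash (hypothetical helper).
sub decode_form {
    my ($raw) = @_;
    my %form;
    foreach my $pair (split /&/, $raw) {
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        foreach ($name, $value) {
            tr/+/ /;                               # '+' encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # %XX encodes one byte
        }
        $form{$name} = $value;
    }
    return %form;
}

my %form = decode_form('keywords=africa+humans&max=10');
print "keywords: $form{keywords}\n";   # prints "keywords: africa humans"
```

In a real script the call would look something like decode_form(raw_form_data(\%ENV, \*STDIN)). The library described next takes care of all of this, so the sketch is only meant to show what is happening underneath.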

      Luckily, though, we don't have to worry about the details of retrieving user input from the HTTP server (although it certainly is something of which one ought to be aware). Steven Brenner has created a library of Perl subroutines that simplify CGI scripts. You should copy the file ~cs494/public/cgi-lib.pl to your ~/www-home/cgi-bin directory and read the comments in that file to determine how the library should be used. Note that the library includes a subroutine called PrintHeader. PrintHeader specifies the content-type of the information that will be sent back to the client. YOU MUST CALL PRINTHEADER BEFORE YOU ATTEMPT TO SEND ANYTHING BACK TO THE CLIENT. Otherwise, the client will assume it never received any worthwhile information and provide an error message stating it received an empty document.

      Since cgi-lib.pl will be needed to execute your query engine, you should make it world readable. To use it in your Perl script, you should include the line

      require 'cgi-lib.pl';
      at the top of your Perl script.

      The ReadParse subroutine in cgi-lib.pl is probably the most important routine. When ReadParse is called, it accepts the user input from the HTTP server and places it in the associative array %in. Thus, after ReadParse is called, if your script needs the user input from the form field with NAME="keywords", you can retrieve it with $input = $in{'keywords'};

    2. Search the document database for documents that contain the user-supplied query terms

      The document database you will be searching can be found at /cloud/homes/cs494/public/lab2-documents. YOU SHOULD NOT MAKE A COPY OF THIS FILE! Instead, you should open this file by providing the full path and filename as listed above. This will allow us to easily change the document database during grading. The lab2-documents file contains a small collection of documents separated by newline characters. Each document except the first and the last consists of several lines of text surrounded by single newline characters.

      To search the database, you should open the file and search for the user-specified search terms. You should not call Unix utilities (for example, grep) from within your search engine. Instead, you should use the Perl pattern-matching commands to perform the search. If the user has supplied more than one term, you should consider the query to be the logical OR of all the terms. Thus, the query "africa humans" should find all documents that contain either the word "africa", the word "humans", or both. If one of the user-supplied search terms is found in a document, the term should be surrounded by

      <STRONG>...</STRONG>
      directives to highlight the term, and the entire document should be written to stdout (anything written to stdout after the PrintHeader subroutine is called will be sent to the client). Only those documents that contain the user-specified search terms should be returned to the client. The search engine should not return any irrelevant documents to the client.

      Searching the database can be done in one of many ways, but you should not load the entire document database (that is, the entire lab2-documents file) into memory at one time. You should find a way to incrementally examine parts of the database (line-by-line, or document-by-document, for example) that allows the documents containing the user-supplied search terms to be sent to the client with the search terms highlighted without loading the entire document database into memory. If the user doesn't supply any search terms, the script should send an error message to the client.
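One possible way to examine the database incrementally is sketched below. It is only an illustration under a stated assumption, namely that documents are separated by one or more blank lines. Check the actual file before relying on that. The subroutine name each_document is hypothetical, and the demo reads from an in-memory string rather than the real database.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Call $callback once per document, holding only one document in
# memory at a time.
# Assumption: documents are separated by one or more blank lines.
sub each_document {
    my ($fh, $callback) = @_;
    local $/ = '';                 # "paragraph mode": one blank-line-delimited record per read
    while (my $doc = <$fh>) {
        $doc =~ s/\n+\z//;         # drop the trailing separator
        $callback->($doc);
    }
}

# Demo on an in-memory file; a real script would open the database
# by its full path instead.
my $sample = "first doc\nline two\n\nsecond doc\n";
open(my $fh, '<', \$sample) or die "cannot open sample: $!";
my $count = 0;
each_document($fh, sub { $count++; print "[$count] $_[0]\n"; });
close($fh);
```

Because each record is handed to the callback and then discarded, memory use depends on the size of the largest document and not on the size of the whole collection. A line-by-line loop would work just as well if the separator turns out to be something else.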

      Searching should be case-insensitive. In addition, you should match full words (that is, when searching for "Africa," occurrences of "African" should not be highlighted).
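The matching rules above (logical OR of the terms, case-insensitive, full words only, matches wrapped in STRONG tags) can be sketched with Perl's pattern-matching operators roughly as follows. This is one possible approach, not the required one, and the subroutine names terms_pattern and highlight are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build one case-insensitive, whole-word pattern that matches any of
# the terms (logical OR). Returns undef if no terms were supplied.
sub terms_pattern {
    my @terms = grep { length } @_;
    return unless @terms;
    my $alternatives = join '|', map { quotemeta } @terms;
    return qr/\b($alternatives)\b/i;
}

# Wrap every match in <STRONG> tags; returns the new text and the
# number of matches found.
sub highlight {
    my ($text, $pattern) = @_;
    my $hits = ($text =~ s{$pattern}{<STRONG>$1</STRONG>}g);
    return ($text, $hits || 0);
}

my $pattern = terms_pattern(split ' ', 'africa humans');
my ($marked, $hits) = highlight('Early humans left Africa; African fossils remain.', $pattern);
print "$hits: $marked\n";
# prints "2: Early <STRONG>humans</STRONG> left <STRONG>Africa</STRONG>; African fossils remain."
```

The \b anchors are what keep "African" from matching "africa", and quotemeta keeps any punctuation the user types from being treated as pattern syntax. A match count of zero means the document should not be sent to the client.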

      You should not make your query engine specific to the document database in any way. During grading, we will replace lab2-documents with a different and much larger collection of documents. By doing this, we will be able to determine whether your query engine is specific to the documents in lab2-documents and whether you are attempting to load the entire document database into memory at one time.

      You should note that the documents in lab2-documents are not complete HTML documents on their own. Before your script returns the correct documents to the client, the script will have to write the header information required for every HTML document

      (<HTML><HEAD><TITLE>...</TITLE></HEAD><BODY>...)
      Also, the script should append the proper footer information
      (...</BODY></HTML>)
      to the documents that are returned.
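As a rough illustration of the order in which the pieces of a reply go out, the sketch below prints a content-type line, then the HTML header, then a placeholder body, then the footer. The helpers page_header and page_footer and the title text are hypothetical. The literal Content-type line stands in for what PrintHeader supplies, so that the sketch runs without cgi-lib.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers; the names and the title text are illustrative only.
sub page_header {
    my ($title) = @_;
    return "<HTML><HEAD><TITLE>$title</TITLE></HEAD><BODY>\n<H1>$title</H1>\n";
}

sub page_footer {
    return "</BODY></HTML>\n";
}

# The content-type line and the blank line after it must come first.
print "Content-type: text/html\n\n";
print page_header('Query Results');
print "<P>matching documents would be written here</P>\n";
print page_footer();
```

Running this from the command line shows exactly what the client would receive, which is a quick way to confirm that nothing is written before the content-type line.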



Example 1: Suppose the user types the following query...


Then, the query engine should display something like...




Example 2: On the other hand, if the user types the following...


the query engine should display something like...

Query Results




Query String: ancient history



Evolution Comes to Life - SCIENCE IN PICTURES (August 1992)
by Ian Tattersall
======================

Ancient bones are the objective evidence of biological history. From my standpoint as a paleontologist, they are vastly more informative about extinct creatures than reconstructions or models, in whose creation art plays at least as great a role as science. Yet I am also a museum curator, and from that perspective I am keenly aware that nothing brings the past alive in the public's eye like a well-crafted reconstruction. For the average person, fossil bones are static things: beautiful or majestic, perhaps, but hard to imbue with the attributes of a living, breathing form.

When I was given the responsibility of curating the American Museum of Natural History's new Hall of Human Biology and Evolution, it was therefore evident to me and to Willard Whitson, the designer of the hall, that we needed to include some reconstructions of early humans in the exhibition. Furthermore, we wanted to portray these figures dynamically in the context of situations that our ancestors might have faced long ago. Only thus, we thought, could we truly bring these long-departed relatives back to some semblance of life. We hoped that clever sculpting and modern casting materials could provide us with a level of realism rivaling that of the spectacular dioramas of modern animals in the adjacent galleries.


The Power of Maps - SCIENCE IN PICTURES (May 1993)
by Denis Wood
======================

The objectivity of modern maps of the world is so taken for granted that they serve as powerful metaphors for other sciences, on occasion even for scientific objectivity itself. The canonical history of Western cartography reinforces that assumption of objectivity. The history tells of a gradual progression from crude Medieval views of the world to depictions exhibiting contemporary standards of precision. In actuality, all maps incorporate assumptions and conventions of the society and the individuals who create them. Such biases seem blatantly obvious when one looks at ancient maps but usually become transparent when one examines maps from modern times. Only by being aware of the subjective omissions and distortions inherent in maps can a user make intelligent sense of the information they contain.

The putative history of cartography typically begins in earnest at the time of the Egyptian and Babylonian mapmakers. The scene quickly shifts to ancient Greek and Roman contributions, followed by an acknowledgment of those of the Arabs during the Middle Ages. Mapmaking in Medieval Europe has been long regarded as the nadir of the craft. From the 15th century forward, cartography smoothly advanced, culminating in present maps that benefit from sophisticated optics, satellite imaging and digital...





Hints/tips in case you have trouble:


Debugging tips:

Debugging the CGI part of this lab can be somewhat tricky, especially when you don't know whether the search engine itself actually works. Therefore, you should write the search engine first, and then attempt to make it work with CGI. Since the server doesn't return any error messages to the client, it is often difficult to determine why the search engine failed. To aid in debugging, we can circumvent both the client and the server so we can actually see the error messages that are produced. To do this, set the following environment variables in the window in which you would like to run the Perl script:

Suppose you named the text box on your interface (the HTML form we named query.html earlier in the lab) "keywords". Then, we can simulate the server by typing the following on the Unix command line:

% echo "keywords=africa" | lab2.cgi

The lab2.cgi script should then act exactly as though the user had typed "africa" in the text box on your interface and submitted the form.
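The same simulation can also be driven from a small Perl script, sketched below. It rests on an assumption that should be checked against the comments in cgi-lib.pl: that the script decides how to read its input from the REQUEST_METHOD and CONTENT_LENGTH environment variables, as the CGI convention for POST describes. The helper names post_env and run_cgi and the file name simulate.pl are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the variables a server would set for a form POST.
# Assumption: the script (via cgi-lib.pl) looks at REQUEST_METHOD and
# CONTENT_LENGTH.
sub post_env {
    my ($body) = @_;
    return (
        REQUEST_METHOD => 'POST',
        CONTENT_TYPE   => 'application/x-www-form-urlencoded',
        CONTENT_LENGTH => length($body),
    );
}

# Run a CGI script with that environment and feed it the form data on stdin.
sub run_cgi {
    my ($script, $body) = @_;
    my %vars = post_env($body);
    local @ENV{keys %vars} = values %vars;   # restored when this sub returns
    open(my $cgi, '|-', $script) or die "cannot run $script: $!";
    print {$cgi} $body;
    close($cgi);
}

# Usage: perl simulate.pl ./lab2.cgi 'keywords=africa'
run_cgi(@ARGV) if @ARGV == 2;
```

Anything the script writes, including Perl's own error messages, appears directly in the window, which is the point of bypassing the client and the server.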



Other resources that might be of interest:


A note about ethics: Yes, scripts and forms that do everything in this lab can be found on the web. You may use any references you desire for this lab (except other people's projects), but you are here to learn. Part of learning, unfortunately, is struggling to find solutions to problems you encounter. Please do not cheat yourself and the rest of the class by relying on web sites that have already implemented query engines similar to this lab.

In addition, when you are not actually working on the script and form themselves, you should read-protect them so others won't be tempted to surf through directories to find a project that has already been completed.

Expect to spend at least a few frustrating hours doing this lab. CGI scripts are notoriously difficult to debug since the WWW browser does not report errors in CGI scripts (well, this isn't quite true - a log file for http errors is kept by the server, but mere mortals do not have read access to this file on our system). Because of this, you should start early.



Summary:

Due: Wednesday, September 13, 1995 at 11:59 pm
Deliverables: By the due date, send the full URL of your validated interface (the query.html file) to hudgens@cs.utk.edu. When you send the URL to Watts, please use the subject line

Subject: IR lab2 submission

and put the full URL, only, at the beginning of the message, before any other words of greeting, or other comments. This will assist with grading and it will help Watts better differentiate between your lab submissions and other email he receives.
Grading:
  1. Correctness - whether your search engine finds all the documents containing the user-defined search terms and whether you have followed the searching guidelines (not loading the entire document database into memory at one time, etc.) as given above.
  2. Documentation - since there are many ways to write Perl, you should use meaningful variable names, document any statements that aren't obvious (and in Perl, most statements are not obvious), and provide a blurb at the beginning of the program that describes what it does and how to use it. Also, when you mail your URL to Watts, include a paragraph or two that explains the overall operation/methodology of your code (in the words of Watts himself, a short paragraph saying "This is my design and this is why I chose this design."). It will help us better understand your code without spending the time to look through your source code to determine what algorithms you used.
  3. Algorithmic elegance - there are many ways to write the search engine. Some solutions are definitely better (more efficient, more understandable, etc.) than others. You should use good programming style and (as in all programs you write) find a fairly efficient, understandable algorithm that exudes elegance.
Points: This lab is worth 50 points.