| Purpose: | To prepare for adventures in the land of vector-space approaches to information retrieval |
| Available: | Wednesday, November 1, 1995 at 11:59 pm |
| Due: | November 15, 1995 at 11:59 pm |
Vector-space approaches to information retrieval were developed to improve upon the literal matching that is typical in the Boolean and string search models of automated information retrieval. Techniques such as Latent Semantic Indexing (LSI), a particular vector-space approach, attempt to improve precision and recall by placing the documents in a mathematical space, placing a query in the same space, and retrieving those documents that are "close" to the query in the space.
Before we can attempt query operations, though, we must create a mathematical model of the documents. In this lab, we'll take a large collection of "documents" and create the matrix that represents those documents. As Dr. Berry explained in class, there are three major steps to LSI:
We'll do (1) in this lab. Dr. Berry will take the matrix for each group and perform the SVD (number 2, above), creating the output file that will be used for query matching (number 3, above) in lab 6.
We'll be parsing Nielsen product data in this lab. The data originally contained records of digits:
Each digit or group of digits is a key to a field. For example, the fifty-first digit corresponds to a field called TVitems:
| 1 | Television with no cable |
| 2 | Regular cable only |
| 3 | Regular and pay cable |
| blank | Unclassified |
Dr. Berry has already parsed the data to produce a more humanly readable form:
Your task is to take this more humanly readable representation of the data and produce a sparse matrix representation of the data in compressed sparse column format (also called Harwell-Boeing format). Compressed sparse column format is described in the handout given in class.
In the end, you'll need to automatically generate three files from the data:
The script you write to produce these files must only make a single pass over the data. The data files are much too large for more than one pass over the data.
Since the data files were machine produced, we can guarantee the files conform to a specific pattern. That is, although the data file is too large to be viewed in Netscape, we know it has the following form throughout the file:
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Each table in the data will be a column in our sparse matrix. When you reach the end of a table, you should start a new column in the sparse matrix. The data within the
Unfortunately, we do not know a priori what row headings actually exist in the data. That is, suppose "FHage: 50-64 years" did not exist anywhere in the data. We would not want to allocate a row for it since there would never be any entries in that row. To solve this problem, we could make two passes over the data: the first pass could determine which rows actually exist in the data, and the second pass would create the matrix. On smaller data sets, this approach would be reasonable. On data sets that are as large as the Nielsen data sets, though, making two passes is not acceptable. Thus, we must devise a way to assign rows "on the fly" as we pass through the data and create the matrix. Each time we see a new row heading, we need to assign that row heading the next row number so the next time we encounter that specific row heading we can find the row number to which it corresponds. In Perl, we can use associative arrays to store (row headings, row number) pairs.
As described in class, compressed sparse column format consists of three parts: pointers to the beginning of each column, the row indices, and the non-zeros themselves. Since each of these sections will be expanding as we write the matrix, you should store each section in a separate file and then merge the files into a single file when you are totally done parsing.
Here is an outline of the algorithm I used (remember, this is only an outline - I know it worked for me, but I don't guarantee that it will work for you):
| *_matrix | is the compressed sparse column matrix that was produced |
| *.dir and *.pag | are the dbm files that were produced. Use the readdbm script in that directory to view them |
| *_columnNames | contains the column names that were extracted |
The final goal of this lab is to parse the file
You may use any programming language you desire for this lab, but we recommend that you use Perl. The TA's will not assist with debugging if any languages other than Perl are used. A Perl reference book has been placed in the Computer Science library (room 105). Also, a number of people in the CS department have a copy of the book.
| Due: | November 15, 1995 |
| Deliverables: | Each group should collect in a single directory their
script, the four files it produces for the OJ.html.Z
file (the matrix, the column names files, and the dbm files),
and any other files we specify at a later date (we may
add more data sets (like the cereal and yogurt) later),
and send the path of that directory to hudgens@cs.utk.edu.
When you send the path to Watts, please use the subject line
and put the full path, only, at the beginning of the message, before any other info, words of greeting, or other comments. This will assist with grading and it will help Watts better differentiate between your lab submissions and other email he receives. Notes that only one person in each group needs to submit the lab. |
| Grading: | Correctness - whether your script puts the numbers in the sparse matrix in the same order as our script |
| Points: | This lab is worth 40 points. |