CS 494/594: Information Retrieval Lab 5



Purpose: To prepare for adventures in the land of vector-space approaches to information retrieval



Available: Wednesday, November 1, 1995 at 11:59 pm
Due: November 15, 1995 at 11:59 pm

Vector-space approaches to information retrieval were developed to improve upon the literal matching that is typical in the Boolean and string search models of automated information retrieval. Techniques such as Latent Semantic Indexing (LSI), a particular vector-space approach, attempt to improve precision and recall by placing the documents in a mathematical space, placing a query in the same space, and retrieving those documents that are "close" to the query in the space.

Before we can attempt query operations, though, we must create a mathematical model of the documents. In this lab, we'll take a large collection of "documents" and create the matrix that represents those documents. As Dr. Berry explained in class, there are three major steps to LSI:

  1. Parse the documents and create a matrix that represents the documents
  2. Compute the Singular Value Decomposition (SVD) of that matrix
  3. Use the singular value/vectors that encode the documents and terms to perform query matching

We'll do (1) in this lab. Dr. Berry will take the matrix for each group and perform the SVD (number 2, above), creating the output file that will be used for query matching (number 3, above) in lab 6.

We'll be parsing Nielsen product data in this lab. The data originally contained records of digits:

04103412176191530591611170401217125010104204200046820957601101011560010037812440 04103617174191392191611210001217125010104204200092923757601101011500014434472440 04103941176190632691611210201217125010104204200183420957601101011510010011722440

Each digit or group of digits is a key to a field. For example, the fifty-first digit corresponds to a field called TVitems:

1Television with no cable
2Regular cable only
3Regular and pay cable
blankUnclassified

Dr. Berry has already parsed the data to produce a more humanly readable form:

<td>04107229</td> <td>No female head</td> <td>No female head or unknown</td> ...
View a small subset of the data

Your task is to take this more humanly readable representation of the data and produce a sparse matrix representation of the data in compressed sparse column format (also called Harwell-Boeing format). Compressed sparse column format is described in the handout given in class.

In the end, you'll need to automatically generate three files from the data:

  1. the matrix in compressed sparse column format
  2. a file containing the column headings, each heading on a separate line. The column headings will be the (brand, deal, size, type, and flavor) lines from the data
  3. a dbm file that maps the row numbers of the matrix to a description of what the row represents. Thus, if row number 5 of the matrix represents the subheading "Television with no cable" under the heading "TVitems", fetching a key containing the number 5 should return something like "TVitems: Television with no cable".

The script you write to produce these files must only make a single pass over the data. The data files are much too large for more than one pass over the data.

Since the data files were machine produced, we can guarantee the files conform to a specific pattern. That is, although the data file is too large to be viewed in Netscape, we know it has the following form throughout the file:

Max number of demo records processing per HHID is ##### <html> <head> <title>Orange juice</title> </head> <body>
<table border cellpadding=3> <tr> <th colspan=26>Brand: ... </th> </tr> <tr> <th>HHId</th> <th>FHage</th> <th>FHemployment</th> <th>FHeducation</th> <th>MHage</th> <th>MHeducation</th> <th>MHemployment</th> <th>MHoccupation</th> <th>Race</th> <th>Hispanic</th> <th>Income</th> <th>HHsize</th> <th>HHcomp</th> <th>AgePresenceChild</th> <th>PetOwnership</th> <th>NumCats</th> <th>NumDogs</th> <th>KitchenApplicances</th> <th>NielsenRegion</th> <th>NielsenCounty</th> <th>TVItems</th> <th>MSA</th> <th>MarketSize</th> </tr> Repeat one or more times
<tr> <td>...</td> <td>...</td> ... <td>...</td> </tr> Repeat one or more times
</table>
<br> <hr> </body> </html>

Each table in the data will be a column in our sparse matrix. When you reach the end of a table, you should start a new column in the sparse matrix. The data within the

<td>...</td>
tags will be the rows of the matrix. For example, the second
<td>...</td>
tag represents the FHage (the age of the female head of the household). If the FHage is "50-64 years", you should go to the row of the matrix that corresponds to "FHage: 50-64 years" and increment that counter.

Unfortunately, we do not know a priori what row headings actually exist in the data. That is, suppose "FHage: 50-64 years" did not exist anywhere in the data. We would not want to allocate a row for it since there would never be any entries in that row. To solve this problem, we could make two passes over the data: the first pass could determine which rows actually exist in the data, and the second pass would create the matrix. On smaller data sets, this approach would be reasonable. On data sets that are as large as the Nielsen data sets, though, making two passes is not acceptable. Thus, we must devise a way to assign rows "on the fly" as we pass through the data and create the matrix. Each time we see a new row heading, we need to assign that row heading the next row number so the next time we encounter that specific row heading we can find the row number to which it corresponds. In Perl, we can use associative arrays to store (row headings, row number) pairs.

As described in class, compressed sparse column format consists of three parts: pointers to the beginning of each column, the row indices, and the non-zeros themselves. Since each of these sections will be expanding as we write the matrix, you should store each section in a separate file and then merge the files into a single file when you are totally done parsing.

Here is an outline of the algorithm I used (remember, this is only an outline - I know it worked for me, but I don't guarantee that it will work for you):




open the dbm file open a file for the column names open a file for the non-zero values open a file for the row indices open a file for the column pointers discard the first six lines of the file while (current line is "<table border cellpadding=3>") discard a line get the (brand, size, deal, type flavor) line, strip the HTML tags and write it to the column names file discard some lines while (current line is "<tr>") discard the Household ID get the FHage, strip the HTML tags, and post() it get the FHemployment, strip the HTML tags, and post() it ... get the MSA, strip the HTML tags, and post() it discard a line discard a line writeVector() resetVector() merge the pointers file, the indices file, and the values file ------------------------ sub post() rowName = field name appended to the entry name (for example, if the field name is "FHage" and the entry is "50-64 years", rowName is "FHage: 50-64 years") if (!(defined($index{$rowName}))) $index{$rowName} = next row number store the (row number, rowName) pair in the dbm file increment the row number @vector[$index{$rowName}]++ ------------------------ sub writeVector() write the row index to the row indices file write the non-zero elements of @vector to the values file write the next column pointer to the column pointers file ------------------------ sub resetVector() set all the entries of @vector to 0


To help you, the directory

~cs494/public/lab5
contains the following example files that have already been processed:
teansy_example.html
tiny_example.html
smaller_example.html
small_example.html

*_matrix is the compressed sparse column matrix that was produced
*.dir and *.pag are the dbm files that were produced. Use the readdbm script in that directory to view them
*_columnNames contains the column names that were extracted

The final goal of this lab is to parse the file

~cs494/public/lab5/OJ.html.Z
Your script must read its input from stdin. This allows us to process the file without decompressing by issuing the following command
zcat ~cs494/public/lab5/OJ.html.Z | script name

You may use any programming language you desire for this lab, but we recommend that you use Perl. The TA's will not assist with debugging if any languages other than Perl are used. A Perl reference book has been placed in the Computer Science library (room 105). Also, a number of people in the CS department have a copy of the book.




Summary:

Due: November 15, 1995
Deliverables: Each group should collect in a single directory their script, the four files it produces for the OJ.html.Z file (the matrix, the column names files, and the dbm files), and any other files we specify at a later date (we may add more data sets (like the cereal and yogurt) later), and send the path of that directory to hudgens@cs.utk.edu. When you send the path to Watts, please use the subject line
Subject: IR lab 5 submission (Group ID)

and put the full path, only, at the beginning of the message, before any other info, words of greeting, or other comments. This will assist with grading and it will help Watts better differentiate between your lab submissions and other email he receives. Notes that only one person in each group needs to submit the lab.
Grading: Correctness - whether your script puts the numbers in the sparse matrix in the same order as our script
Points: This lab is worth 40 points.