CS 594/494 COURSE PROJECT

Due on Friday, April 25 at midnight

(Changes made to this page on 4/10/03: Item 1 under Functional Specifications.)

Only one submission per group is required, on or before the due date (Friday, April 25 at midnight). In addition to the software (a tarfile of scripts, image files, etc.), your submission should include a 3-5 page report (in HTML) explaining the following:

  1. Design of the IFS (Inverted File System): the weighting schemes, data structures, and software constructs (php/perl) used.
  2. Design of the relational database used (via PostgreSQL tables).
  3. Specification of the routing query used to produce a rank-ordered listing of possible employees.
Remember, the project in this class is a team effort!

Email the TA a tarfile attachment containing all php/perl/image files needed to implement your text stream processing system. The main script file for your system should be called index.php and should accept a path parameter naming a UNIX directory of applicant XML datafiles formatted by format.php. The syntax of this path specification should be (no quotes needed):

index.php?path=XML_directory_path

Functional Specifications

Your system must perform the following functions:

  1. Parse all (*.xml) files found in the UNIX directory path for IFS keywords and structured data (for your relational database). You must capture any errors reported by the expat XML parser and record the file name (*.xml), error code, error string, and line number for each error. When an error occurs, the parser terminates on the current XML file, and no restart is needed. The results page (providing the rank-ordered list of potential employees) should also contain a list of all errors encountered in the text stream, with those four fields specified for each entry. All *.xml files must be parsed only once (major point deductions will be applied for multiple reads).
  2. Build an IFS (Inverted File System) using all keywords extracted from all parsed *.xml files (of the form FirstName_LastName.xml). You can use any term-weighting scheme discussed in class to perform keyword-based query-matching and document (file) ranking. PHP/Perl scripts are recommended for this task.
  3. Create a relational database for Education and Experience metadata extracted from the *.xml files. You must use at least two tables in your group's work area on the PostgreSQL server dbs.cs.utk.edu. Use php files to push/pull data to/from the SQL tables, which must be created from scratch; no pre-existing tables are permitted.
  4. Perform query-matching via a routing query for ranking potential employees using your IFS and relational database (RB). You must use at least 10 keywords for this query along with at least one join operation between two or more tables in the RB. You must create an output (HTML) page for the browser to show your rank-ordered list of applicants. The list should contain the rank scores (in descending order) and provide links of the form

    display.php?file=XML_directory_path/FirstName_LastName.xml,

    where display.php is used to display the corresponding XML resume file, and XML_directory_path is the parameter provided to your index.php script. You may provide additional information next to each item/record listed in the ranked list, such as the terms found in the resume. However, your list should contain only those applicants (resumes) whose content satisfies the SQL portion of your routing query. That is, the RB is used to extract a subset of the *.xml files for ranking with the keyword-based portion of your routing query (using your IFS).

    You may use any additional images or visually pleasing displays to reflect the client need you are responsible for (see clients.txt). Your results page must also include the list of parsing errors encountered (mentioned above), as well as a natural language description of the routing query used. The actual implementation details of the routing query should be provided in your 3-5 page (HTML) report that is submitted along with your source code. Please name the report report.html.
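As a rough illustration of how items 1-4 above fit together, the sketch below parses each resume exactly once with expat, records the file name, error code, error string, and line number of any parse error, builds an inverted index, and ranks only the files that pass a relational filter. It is written in Python purely to keep the example short; the project itself calls for php/perl scripts and PostgreSQL tables. The function names (parse_once, build_index, rank), the sample resumes, the tf-idf weighting, the decision to drop keywords from a file that fails to parse, and the `allowed` set standing in for the rows returned by the SQL join are all assumptions made for illustration, not requirements of the assignment.

```python
import math
import re
from collections import Counter, defaultdict
from xml.parsers import expat


def parse_once(name, xml_text):
    """Parse one resume in a single pass; return (keywords, error_or_None)."""
    chunks = []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    # Tag boundaries separate words, so mark each one with a space.
    parser.StartElementHandler = lambda tag, attrs: chunks.append(" ")
    parser.EndElementHandler = lambda tag: chunks.append(" ")
    try:
        parser.Parse(xml_text, True)
    except expat.ExpatError as err:
        # expat stops at the first error in this file; record it and move on.
        # Assumption: a file that fails to parse contributes no keywords.
        return [], {"file": name, "code": err.code,
                    "message": expat.ErrorString(err.code),
                    "line": err.lineno}
    return re.findall(r"[a-z0-9+#]+", "".join(chunks).lower()), None


def build_index(docs):
    """Inverted index: term -> {file name: term frequency}."""
    index = defaultdict(dict)
    for name, words in docs.items():
        for term, tf in Counter(words).items():
            index[term][name] = tf
    return index


def rank(index, n_docs, query_terms, allowed):
    """Sum tf-idf over the query terms for files in `allowed`, best first."""
    scores = defaultdict(float)
    for term in query_terms:
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for name, tf in postings.items():
            if name in allowed:
                scores[name] += tf * idf
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical resumes held in memory; a real run would read each file once.
resumes = {
    "Ann_Lee.xml": "<resume><skills>perl php sql php</skills></resume>",
    "Bob_Ray.xml": "<resume><skills>java sql</skills></resume>",
    "Dee_Fox.xml": "<resume><skills>perl c</skills></resume>",
    "Cy_Poe.xml": "<resume><skills>php</resume>",  # malformed on purpose
}

docs, errors = {}, []
for name, text in resumes.items():
    words, error = parse_once(name, text)
    if error:
        errors.append(error)
    else:
        docs[name] = words

index = build_index(docs)
allowed = {"Ann_Lee.xml", "Bob_Ray.xml"}  # stand-in for the SQL join result
ranking = rank(index, len(docs), ["php", "sql", "perl"], allowed)

for name, score in ranking:
    print(f"{score:.3f}  {name}")
for e in errors:
    print(f"{e['file']}: error {e['code']} ({e['message']}) at line {e['line']}")
```

Applying the relational filter before scoring mirrors the requirement that only resumes satisfying the SQL portion of the routing query appear in the ranked list; a real routing query would use at least 10 keywords rather than the three shown here.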

If you have any questions, e-mail Lana Mironova or come by office hours on Monday, 10:00 - noon.