CS 594/494 COURSE PROJECT
Due on Friday, April 25 at midnight
(Changes made to this page on 4/10/03: Item 1 under Functional Specifications.)
Only one submission per group is required, on or before the due date
(Friday, April 25 at midnight).
In addition to software (tarfile of scripts, image files, etc.), your
submission should include a 3-5 page report (in HTML) explaining
the following:
- Design of the IFS (Inverted File System) used; weighting
schemes, data structure, and software constructs used (php/perl).
- Design of the relational database used (via PostgreSQL tables).
- Specification of the routing query used to produce a rank-ordered
listing of possible employees.
Remember the project in this class is a team effort!
You are to email the TA a tarfile attachment of all php/perl/image files
needed to implement your text stream processing system. The main
script file for your system should be called index.php and accept
a path parameter naming a UNIX directory of applicant XML datafiles formatted by
format.php. The syntax of this path
specification should be (no quotes needed):
index.php?path=XML_directory_path
Functional Specifications
Your system must perform the
following functions:
- Parse all (*.xml) files found in the UNIX directory path for IFS keywords and
structured data (for your relational database). You must capture
any errors encountered by the expat XML parser and
record the error code, error string, line number, and file name for each error.
When an error occurs, the parser terminates on the current XML file and no restart is needed.
The results page (providing the rank-ordered list of potential employees) should
also contain a list of all errors encountered in the text stream, with the file
name (*.xml), error code, error string, and line number specified for each entry.
All *.xml files must
be parsed only once (major point deductions will be applied for multiple reads).
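The error-capture requirement above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's binding to the expat parser rather than the php/perl scripts the project calls for, the function name parse_directory is hypothetical, and dropping a file that fails to parse (rather than keeping its partial text) is an assumption, not something the specification states.

```python
import xml.parsers.expat
from pathlib import Path


def parse_directory(xml_dir):
    """Parse every *.xml file in xml_dir exactly once.

    Returns (texts, errors): texts maps file name to extracted character
    data; errors holds one record per file that expat rejected.
    """
    texts, errors = {}, []
    for path in sorted(Path(xml_dir).glob("*.xml")):
        chunks = []
        parser = xml.parsers.expat.ParserCreate()
        parser.CharacterDataHandler = chunks.append  # collect text nodes
        try:
            parser.Parse(path.read_bytes(), True)  # single read, single pass
        except xml.parsers.expat.ExpatError as err:
            # expat stops at the first error in this file; record it and move on.
            errors.append({
                "file": path.name,
                "code": err.code,
                "message": xml.parsers.expat.ErrorString(err.code),
                "line": err.lineno,
            })
            continue  # assumption: a file with an error is not indexed
        texts[path.name] = " ".join(chunks)
    return texts, errors
```

The errors list carries exactly the four fields the results page must show for each entry: file name, error code, error string, and line number.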
- Build an IFS (Inverted File System) using all extracted keywords from
all parsed *.xml files (of the form FirstName_LastName.xml). You can
use any term-weighting scheme discussed in class to perform keyword-based
query-matching and document (file) ranking. PHP/Perl scripts are recommended for
this task.
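As a rough sketch of what such an inverted file might look like, the example below maps each term to the files that contain it, along with a weight. It assumes a plain tf-idf weighting (term frequency times log(N/df)), which is only one possible choice; the scheme you use should be one discussed in class. The function names build_index and rank and the tokenizing pattern are hypothetical, and Python is used for illustration only.

```python
import math
import re
from collections import Counter, defaultdict


def build_index(texts):
    """texts: {file name: extracted text}.

    Returns {term: {file name: weight}} using tf * log(N / df).
    """
    # Term counts per file; the pattern keeps tokens such as "c++" and "c#".
    counts = {name: Counter(re.findall(r"[a-z0-9+#]+", text.lower()))
              for name, text in texts.items()}
    # Document frequency: number of files in which each term occurs.
    doc_freq = Counter(term for c in counts.values() for term in c)
    n_docs = len(texts)
    index = defaultdict(dict)
    for name, c in counts.items():
        for term, tf in c.items():
            index[term][name] = tf * math.log(n_docs / doc_freq[term])
    return index


def rank(index, keywords):
    """Sum the weights of the query keywords per file; highest score first."""
    scores = defaultdict(float)
    for kw in keywords:
        for name, weight in index.get(kw.lower(), {}).items():
            scores[name] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

With this weighting a term that appears in every file gets weight zero, so it contributes nothing to the ranking.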
- Create a relational database for Education and Experience
metadata extracted from the *.xml files. You must use at least two
tables in your group's work area on the PostgreSQL server
dbs.cs.utk.edu. Use php files to
push/pull data to/from the SQL tables, which must be created from scratch;
no pre-existing tables are permitted.
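A two-table layout with a join might be sketched as follows. Everything here is an assumption made for illustration: the table and column names are hypothetical, the in-memory SQLite engine stands in for the PostgreSQL server so the example runs anywhere, and Python stands in for the php files the project requires. The SQL itself is kept simple enough that it should carry over to PostgreSQL with little change.

```python
import sqlite3

# Hypothetical schema: one table per metadata type, keyed by resume file name.
SCHEMA = """
CREATE TABLE education  (file TEXT, degree TEXT, school TEXT);
CREATE TABLE experience (file TEXT, title TEXT, years INTEGER);
"""


def load(conn, education_rows, experience_rows):
    """Create both tables from scratch and insert the extracted metadata."""
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO education VALUES (?, ?, ?)", education_rows)
    conn.executemany("INSERT INTO experience VALUES (?, ?, ?)", experience_rows)


def eligible_files(conn, degree, min_years):
    """Join the two tables and return the files that satisfy both conditions."""
    rows = conn.execute(
        """SELECT DISTINCT e.file
           FROM education e JOIN experience x ON x.file = e.file
           WHERE e.degree = ? AND x.years >= ?
           ORDER BY e.file""",
        (degree, min_years),
    )
    return [row[0] for row in rows]
```

Using the resume file name as the join key keeps the link between a database row and its FirstName_LastName.xml file direct.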
- Perform query-matching via a routing query for ranking potential employees
using your IFS and relational database (RB). You must use at least 10
keywords for this query along with at least one join operation between
two or more tables in the RB. You must create an output (HTML) page for
the browser to show your rank-ordered list of applicants. The list should
contain the rank scores (in descending order) and provide links of the
form
display.php?file=XML_directory_path/FirstName_LastName.xml,
where display.php is used to display
the corresponding XML resume file, and XML_directory_path is the
parameter provided to your index.php script. You may provide additional information
next to each item/record listed in the ranked list, such as the terms
found in the resume. However, your list should contain only those applicants
(resumes) whose content satisfies the SQL portion of your routing query.
That is, the RB is used to extract a subset of the *.xml files for ranking with the
keyword-based portion of your routing query (using your IFS).
You may use any additional images or visually pleasing displays to reflect
the client need you are responsible for (see clients.txt). Your results page must
also include the list of parsing errors encountered (mentioned above),
as well as a natural language description of the routing query used.
The actual implementation details of the routing query should be provided
in your 3-5 page (HTML) report that is submitted along with your source
code. Please name the report report.html.
If you have any questions, e-mail Lana
Mironova or come by office hours on Monday, 10:00 - noon.