Highlighting

Michael W. Berry (berry@cs.utk.edu)
Thu, 28 Sep 1995 20:32:59 -0400


Date: Thu, 28 Sep 1995 19:14:20 -0400
From: Todd Letsche <letsche@cs.utk.edu>

In an email sent out last week, I specified the following algorithm:

**********************************************************************
2) The lab specifies that the query engine should highlight
the words that are found, exactly what was required in lab 2.
Doing this in C, though, is somewhat more difficult. I'd
suggest using strtok() and setting the token separator string
to any non-word character (basically, [^A-Za-z0-9_], although
you'll have to specify all the separators individually rather
than in a regular expression).

The algorithm could go something like this:

find the length of the document (for doc #5, subtract the
  beginning byte address of doc #5 from that of doc #6 and add 1 -
  the last integer of the index file gives the total length
  of the document database, so this works for all the
  documents)
malloc enough memory to hold the entire document
seek the byte address in the document database
use read() or getline() to read the document
while (strtok() returns a token)
    use strcmp to compare the token to the list of
      user-supplied keywords
    if (the token is a keyword)
        print <STRONG>token</STRONG> to stdout
    else
        print token
**********************************************************************

Well, that would be fine and dandy if strtok() didn't overwrite
the characters we use to delimit words (that is, if the word boundaries
are the characters ".,;: \n\t", strtok() would overwrite those
delimiters as it returned the tokens to us). It isn't acceptable
to lose all the punctuation and whitespace in our documents, so...
we'll use lex instead.
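
To see the problem concretely, here is a minimal sketch (not part of the
lab code). The helper name join_tokens() is made up for illustration. It
runs strtok() over a private copy of a string and glues the returned
tokens back together, which shows that the punctuation and whitespace
never come back:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the lab): tokenize a private copy of
 * `text` with strtok() and concatenate the tokens into `out`, which has
 * room for `outsz` bytes. Returns 0 on success, -1 on failure. */
int join_tokens(const char *text, const char *delims, char *out, size_t outsz)
{
    size_t len = strlen(text);
    size_t used = 0;
    char *copy;
    char *tok;

    if (outsz == 0)
        return -1;
    copy = malloc(len + 1);          /* strtok() modifies its argument */
    if (copy == NULL)
        return -1;
    memcpy(copy, text, len + 1);

    out[0] = '\0';
    for (tok = strtok(copy, delims); tok != NULL; tok = strtok(NULL, delims)) {
        size_t n = strlen(tok);
        if (used + n >= outsz)       /* stop rather than overflow `out` */
            break;
        memcpy(out + used, tok, n + 1);
        used += n;
    }
    free(copy);
    return 0;
}
```

Calling join_tokens("Hello, world.\n", ".,;: \n\t", out, sizeof out)
leaves "Helloworld" in out: the comma, space, period and newline are
gone, which is exactly what we can't afford when printing a document.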

For those who have never used lex, do not fear. All of the work
has been done for you (I hope).

First of all, copy the file ~cs494/public/lab3/lex.yy.c to your directory.

Then, ignore the algorithm given above and sent out in email
last week (it sure seemed like a good idea at the time, though). Instead,
use the following algorithm (this is not real code - you'll have to
figure out the correct parameters yourself):

#include <stdio.h>
#include <string.h>

extern FILE *yyin;
extern char *yytext;

char *token;

yyin = fopen(documentDatabase);
for each document you are returning to the client
    fseek(yyin, byte-offset)

    while (yylex())
        token = yytext
        use strcmp to compare the token to the list of
          user-supplied keywords
        if (the token is a keyword)
            print <STRONG>token</STRONG> to stdout
        else
            print token

fclose(yyin)

Every time you call yylex(), yytext will be a pointer to the next
word in the document. Words are considered to contain only [A-Za-z0-9]+.
Thus, anything other than those characters is a word boundary. The lexer
automatically writes non-word characters (including whitespace) to stdout.

Also, yylex() will return false when it reaches the end of the document.
Therefore, you won't have to worry about determining where a document
ends - only where it begins (for the fseek()).
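
The keyword comparison inside the while (yylex()) loop could be sketched
roughly as follows. This is only an illustration under some assumptions:
the names is_keyword() and format_token() are made up, the keywords are
assumed to live in a plain array of strings, and the result is written
into a buffer rather than straight to stdout so it is easy to check.
Note that strcmp() is case-sensitive, so "Lex" will not match "lex".

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `token` exactly matches one of the
 * `n` strings in `keywords`, 0 otherwise. */
int is_keyword(const char *token, const char *const *keywords, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (strcmp(token, keywords[i]) == 0)
            return 1;
    }
    return 0;
}

/* Hypothetical helper: writes `token` into `buf`, wrapped in
 * <STRONG>...</STRONG> if it is a keyword. Returns what snprintf()
 * returns (the length the full output needs, not counting the NUL). */
int format_token(char *buf, size_t bufsz, const char *token,
                 const char *const *keywords, size_t n)
{
    if (is_keyword(token, keywords, n))
        return snprintf(buf, bufsz, "<STRONG>%s</STRONG>", token);
    return snprintf(buf, bufsz, "%s", token);
}
```

In the real loop you would pass yytext as the token and print the
result to stdout after each call to yylex().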

You can find an example that uses the lexer at ~cs494/public/lab3/highlight.c.
Compiling instructions are included in the documentation at the
beginning of the file.

To compile your program, you'll have to add "lex.yy.c" to your compile
line and "-lfl" after all the file names on your compile line.

Thus, your compile line might look like this:

gcc -O -o queryEngine queryEngine.c lex.yy.c -lfl

If you are interested in how "lex.yy.c" was created, look through
"~cs494/public/lab3/html.l". That's the file that was sent to lex
to create "lex.yy.c".

If you have any questions, I'll probably be here all day tomorrow (Friday),
or else send me email.

Todd
letsche@cs.utk.edu