A hash function takes a piece of data as input, and returns an integer as output. Typically, we take this integer modulo the hash table size, and that provides an index to the hash table for the piece of data.
We want our hash functions to have two good properties. First, they should be quick. Second, they should spread their output integers uniformly over the range of possible output values. The reason is that this will minimize collisions, where two pieces of data receive the same hash value.
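To make the modulo step and the idea of a collision concrete, here is a minimal sketch in C. The helper name `table_index` is hypothetical (it is not part of the lecture code), and the raw hash value is passed in directly rather than computed from a string.

```c
#include <stddef.h>

/* Hypothetical helper, not from the lecture code: map a raw hash value
   to an entry of a table with table_size entries (table_size must be > 0). */
size_t table_index(unsigned int hash_value, size_t table_size)
{
    return hash_value % table_size;
}
```

With a table of size 4000, the hash values 12345 and 8345 both map to index 345, so they collide. With a table of size 8000, they map to 4345 and 345, and the collision goes away. That only works because the two raw hash values were different to begin with.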
In this lecture, we are going to explore some hash functions whose input data are standard character strings. In particular, we are going to test four input files, input1.txt through input4.txt.
Think about the input files, and how they may present challenges for hash functions. In particular, which input files have more varied data? Which ones are limited?
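    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "fields.h"    /* libfdr: IS, new_inputstruct(), get_line() */

    /* Defined separately for each hash function under test.
       The exact prototype is an assumption. */
    unsigned int hash(char *s);

    int main(int argc, char **argv)
    {
      IS is;
      int table_size;
      int *hashes;          /* hashes[i] = number of lines that hashed to entry i */
      int h;
      size_t len;
      double ncollisions;
      double nl;            /* number of lines read */

      ncollisions = 0;
      nl = 0;

      if (argc != 2) {
        fprintf(stderr, "usage: hashx table_size\n");
        exit(1);
      }
      table_size = atoi(argv[1]);
      if (table_size <= 0) {
        fprintf(stderr, "usage: hashx table_size (> 0)\n");
        exit(1);
      }

      hashes = (int *) malloc(sizeof(int) * table_size);
      if (hashes == NULL) { perror("malloc"); exit(1); }
      for (h = 0; h < table_size; h++) hashes[h] = 0;

      is = new_inputstruct(NULL);      /* NULL means read from standard input */
      while (get_line(is) >= 0) {
        nl++;
        /* Strip the trailing newline, if there is one. */
        len = strlen(is->text1);
        if (len > 0 && is->text1[len-1] == '\n') is->text1[len-1] = '\0';
        h = hash(is->text1) % table_size;
        if (hashes[h] != 0) ncollisions++;   /* entry already occupied */
        hashes[h]++;
      }

      printf("%.0lf lines, %.0lf collisions: %.2lf\n", nl,
             ncollisions, ncollisions * 100.0 / nl);
      return 0;
    }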
This program creates a table of the specified size, reads lines from standard input, hashes each one, and counts a collision every time a line hashes to a table entry that is already occupied. The hash function is hash(), which takes a string and returns an integer.
If our hash function is a good one, then as the table size increases, the number of collisions should decrease, optimally to zero. Also, the number of collisions should be small -- we could do some mathematical analysis to derive how small, but for now, let's just say that it should be small.
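As a rough guide to what "small" means, here is a sketch of that analysis under one idealized assumption: every key is equally likely to land in any of the m table entries, independently of the others. A given entry is still empty after n keys with probability (1 - 1/m)^n, so the expected number of occupied entries is m(1 - (1 - 1/m)^n), and every key beyond the first in an entry counts as one collision. The function name `expected_collisions` is hypothetical.

```c
/* Expected number of collisions when n keys are hashed uniformly at
   random into m entries (m > 0), counting a collision the same way the
   program above does: each key that lands in an already-occupied entry.
   Hypothetical helper, not part of the lecture code. */
double expected_collisions(unsigned long n, unsigned long m)
{
    double p_empty = 1.0;   /* probability that a given entry is still empty */
    unsigned long i;

    for (i = 0; i < n; i++) p_empty *= 1.0 - 1.0 / (double) m;

    /* keys minus expected number of occupied entries */
    return (double) n - (double) m * (1.0 - p_empty);
}
```

If we assume, purely for illustration, that a file has 4000 lines (the file sizes are not stated here), this gives roughly 36.8% collisions for a table of size 4000, 21.3% for 8000 and 2.0% for 100000. A hash function that gets close to those figures is behaving about as well as random placement would.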
| Input File | Table Size: 4000 | Table Size: 8000 | Table Size: 100000 |
|---|---|---|---|
| input1.txt | 98.28% | 98.28% | 98.28% |
| input2.txt | 98.78% | 98.78% | 98.78% |
| input3.txt | 99.53% | 99.53% | 99.53% |
| input4.txt | 99.97% | 99.97% | 99.97% |
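Another poor feature of hash1, as we can see above, is that the percentage of collisions does not decrease as the table size increases. This is because all the strings are less than 4000 characters in length, so every hash value already fits in the smallest table, and a larger table provides no additional usable entries.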
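The explanation above suggests that hash1's output is bounded by the length of the string. The function itself is not shown here, so the following is only a sketch under that assumption; `length_hash` is a hypothetical stand-in, not the lecture's hash1.

```c
#include <string.h>

/* Hypothetical stand-in for a length-based hash. NOT the lecture's hash1;
   it only illustrates why a bounded hash value cannot use a larger table. */
unsigned int length_hash(const char *s)
{
    return (unsigned int) strlen(s);
}
```

With a hash like this, "cat" and "dog" both hash to 3, and they still collide whether we take the result modulo 4000 or modulo 100000. Growing the table only helps when the raw hash values themselves are spread over a wide range.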
| Input File | Table Size: 4000 | Table Size: 8000 | Table Size: 100000 |
|---|---|---|---|
| input1.txt | 64.88% | 64.42% | 64.42% |
| input2.txt | 58.65% | 58.50% | 58.50% |
| input3.txt | 78.42% | 78.42% | 78.42% |
| input4.txt | 94.67% | 94.67% | 94.67% |
Certainly, it works better than the first hash function, but there are still some bad limitations. First, again, all the hash values are less than 4000, so the performance of hashing does not improve as the hash table size increases. Second, in the last two files, where the input strings are fairly similar in size and character composition, the performance is especially bad.
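Both limitations are what we would expect from a hash that simply adds up the characters of the string. Whether the second hash function does exactly that is an assumption on our part, since its code is not shown here. The sketch below uses a hypothetical function `sum_hash` to illustrate the effect.

```c
/* Hypothetical additive hash: the sum of the string's bytes.
   An assumed illustration, not necessarily the lecture's second hash. */
unsigned int sum_hash(const char *s)
{
    unsigned int h = 0;

    for (; *s != '\0'; s++) h += (unsigned char) *s;
    return h;
}
```

Any two strings made of the same characters in a different order, such as "listen" and "silent", get the same sum. The sum also stays small for short strings, so strings of similar length and composition crowd into a narrow band of hash values no matter how large the table is.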
| Input File | Table Size: 4000 | Table Size: 8000 | Table Size: 100000 |
|---|---|---|---|
| input1.txt | 62.77% | 59.08% | 55.55% |
| input2.txt | 67.08% | 64.05% | 61.08% |
| input3.txt | 86.12% | 85.55% | 85.15% |
| input4.txt | 89.60% | 89.53% | 89.42% |
It improves input4.txt, which makes sense, and input1.txt improves slightly, but for input2.txt and input3.txt its performance is worse. Overall the results are still poor, because there is only a limited number of combinations of the first three letters in each file: 1815 in input1.txt, 1590 in input2.txt, 594 in input3.txt and 481 in input4.txt.
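To see why the number of distinct three-letter prefixes caps the performance, consider a hash that looks only at the first three characters. The function `prefix3_hash` below is a hypothetical sketch; how the lecture's function combines those characters is not shown here, so treating them as a base-256 number is an assumption.

```c
/* Hypothetical hash that uses only the first three bytes of the string,
   combined as a base-256 number. Strings shorter than three bytes use
   the bytes they have. An assumed illustration, not the lecture's code. */
unsigned int prefix3_hash(const char *s)
{
    unsigned int h = 0;
    int i;

    for (i = 0; i < 3 && s[i] != '\0'; i++) {
        h = h * 256 + (unsigned char) s[i];
    }
    return h;
}
```

Any two strings that share their first three characters, such as "Smith" and "Smithson", get the same hash value. A file with n lines and only k distinct prefixes can therefore occupy at most k table entries, which means at least n - k collisions regardless of the table size.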
| Input File | Table Size: 4000 | Table Size: 8000 | Table Size: 100000 |
|---|---|---|---|
| input1.txt | 36.45% | 21.10% | 2.17% |
| input2.txt | 36.02% | 21.05% | 2.38% |
| input3.txt | 36.70% | 21.00% | 2.12% |
| input4.txt | 36.40% | 21.20% | 1.85% |

| Input File | Table Size: 4000 | Table Size: 8000 | Table Size: 100000 |
|---|---|---|---|
| input1.txt | 52.08% | 38.17% | 8.93% |
| input2.txt | 64.05% | 52.23% | 19.02% |
| input3.txt | 65.25% | 56.35% | 32.25% |
| input4.txt | 72.65% | 57.58% | 9.70% |
To be honest, I'm not sure why the middle two files perform so much worse here. My guess is that it has to do with the nature of the later characters in the strings, but I don't have time to prove it.
| Input File | Table Size: 4000 | Table Size: 8000 | Table Size: 100000 |
|---|---|---|---|
| input1.txt | 37.85% | 23.12% | 2.73% |
| input2.txt | 38.15% | 22.55% | 2.25% |
| input3.txt | 36.35% | 20.95% | 1.90% |
| input4.txt | 37.40% | 21.93% | 2.05% |

| Input File | Table Size: 4000 | Table Size: 8000 | Table Size: 100000 |
|---|---|---|---|
| input1.txt | 36.48% | 20.93% | 2.77% |
| input2.txt | 36.08% | 21.52% | 2.85% |
| input3.txt | 36.65% | 21.30% | 2.60% |
| input4.txt | 35.70% | 21.05% | 2.98% |
In your labs, you should use hash functions 4, 6 or 8.