CS140 Lecture notes -- Hash Functions

  • Jim Plank
  • Directory: /home/plank/cs140/Notes/Hashing
  • Lecture notes: http://www.cs.utk.edu/~plank/plank/classes/cs140/Notes/Hashing/
  • Thu Oct 18 11:14:37 EDT 2007

    Brief Lecture Notes on Hash Functions

    These lecture notes explore hash functions for strings. They are not meant to replace the book as reference material. Instead, they are meant to illuminate hash functions.

    A hash function takes a piece of data as input, and returns an integer as output. Typically, we take this integer modulo the hash table size, and that provides an index to the hash table for the piece of data.
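    As a minimal sketch of that last step, the reduction to a table index can be written as below. The name table_index is made up for illustration and is not part of the course code, and treating the hash as unsigned is an assumption made here so that the remainder can never be negative:

```c
#include <assert.h>

/* Hypothetical helper (not from the course code): map a raw hash value
   to a slot in a table with table_size entries. Unsigned arithmetic
   keeps the result in the range [0, table_size). */
unsigned int table_index(unsigned int hash_value, unsigned int table_size)
{
  assert(table_size > 0);        /* a table needs at least one slot */
  return hash_value % table_size;
}
```

    Note that any hash value smaller than the table size maps to itself, which is why a hash function with a small output range cannot benefit from a bigger table.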

    We want our hash functions to have two good properties. First, they should be quick. Second, they should spread their output integers uniformly over the range of possible output values. The reason is that this will minimize collisions, where two pieces of data receive the same hash value.

    In this lecture, we are going to explore some hash functions whose data are standard character strings. In particular, we are going to test four input files, input1.txt through input4.txt.

    Think about the input files, and how they may present challenges for hash functions. In particular, which input files have more varied data? Which ones are limited?


    Evaluating Hash Functions

    I am going to use the following skeleton function for evaluating hash functions:

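    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "fields.h"           /* IS, new_inputstruct(), get_line() */

    unsigned int hash(char *s);   /* assumed prototype; defined in each hashN.c */

    int main(int argc, char **argv)
    {
      IS is;
      int table_size;
      int *hashes;
      int h;
      size_t len;
      double ncollisions;         /* lines whose slot was already occupied */
      double nl;                  /* total number of lines read */

      ncollisions = 0;
      nl = 0;

      if (argc != 2) {
        fprintf(stderr, "usage: hashx table_size\n");
        exit(1);
      }
      table_size = atoi(argv[1]);
      if (table_size <= 0) {
        fprintf(stderr, "usage: hashx table_size (> 0)\n");
        exit(1);
      }

      /* hashes[i] counts how many lines have hashed to slot i. */
      hashes = (int *) malloc(sizeof(int) * table_size);
      if (hashes == NULL) { perror("malloc"); exit(1); }
      for (h = 0; h < table_size; h++) hashes[h] = 0;

      is = new_inputstruct(NULL);   /* NULL means read standard input */

      while(get_line(is) >= 0) {
        nl++;
        /* Strip the trailing newline, if there is one. */
        len = strlen(is->text1);
        if (len > 0 && is->text1[len-1] == '\n') is->text1[len-1] = '\0';
        h = hash(is->text1) % table_size;
        if (hashes[h] != 0) ncollisions++;
        hashes[h]++;
      }
      printf("%.0lf lines, %.0lf collisions: %.2lf\n", nl,
             ncollisions, ncollisions * 100.0 / nl);
      return 0;
    }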

    This creates a table of the specified size, reads lines from standard input, hashes each one, and counts a collision whenever a line hashes to a slot that an earlier line has already used. The hash function is hash(), which takes a string and returns an integer.

    If our hash function is a good one, then as the table size increases, the number of collisions should decrease, optimally to zero. Also, the number of collisions should be small -- we could do some mathematical analysis to derive how small, but for now, let's just say that it should be small.
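    To give a rough idea of what "small" means, here is a sketch of that analysis under the assumption that the hash function behaves like a uniformly random choice of slot. With n lines and m slots, each slot stays empty with probability (1 - 1/m)^n, so the expected number of occupied slots is m * (1 - (1 - 1/m)^n), and every line beyond those is a collision in the sense counted by the skeleton above. The function name and parameters are made up for illustration:

```c
#include <assert.h>

/* Expected percentage of colliding lines when n keys are placed
   uniformly at random into m slots (an idealized model, not a
   measurement of any real hash function). */
double expected_collision_percent(long n, long m)
{
  double p_empty = 1.0;   /* probability that one given slot stays empty */
  double occupied;        /* expected number of slots used at least once */
  long i;

  assert(n > 0 && m > 0);
  for (i = 0; i < n; i++) p_empty *= 1.0 - 1.0 / (double) m;
  occupied = (double) m * (1.0 - p_empty);
  return 100.0 * ((double) n - occupied) / (double) n;
}
```

    For example, with an illustrative n of 4000 lines (a value chosen here, not one taken from these notes), table sizes of 4000, 8000 and 100000 give roughly 37%, 21% and 2%. That is the kind of target a good hash function should approach.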


    Hash1 - Using strlen() as a hash function

    The first hash function that we are going to test is in hash1.c. It uses strlen() as its hash function. As we would expect, it performs very poorly, especially on input4.txt, where all the input lines have the same length. Let's look at the data.

    Input File    Table Size: 4000    Table Size: 8000    Table Size: 100000
    input1.txt    98.28%              98.28%              98.28%
    input2.txt    98.78%              98.78%              98.78%
    input3.txt    99.53%              99.53%              99.53%
    input4.txt    99.97%              99.97%              99.97%

    Hash Function: hash1

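    Another poor feature of hash1, as we can see above, is that the percentage of collisions does not decrease as the table size increases. This is because the hash value is the string's length, and all the strings are less than 4000 characters in length, so a larger table never brings any additional slots into use.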
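    For reference, a hash function of this kind can be sketched in a couple of lines. This is my own illustration, not a copy of hash1.c, and the name hash1_sketch is made up:

```c
#include <assert.h>
#include <string.h>

/* Sketch of a length-based hash: the hash value is just the number
   of characters in the string. */
unsigned int hash1_sketch(const char *s)
{
  assert(s != NULL);
  return (unsigned int) strlen(s);
}
```

    Any two strings of the same length collide, no matter how different their characters are.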


    Hash2 - Adding the ASCII values of each string

    The second hash function (hash2.c) is a rather obvious one -- add up all the ASCII values of the characters. Here's how it performs:

    Input File    Table Size: 4000    Table Size: 8000    Table Size: 100000
    input1.txt    64.88%              64.42%              64.42%
    input2.txt    58.65%              58.50%              58.50%
    input3.txt    78.42%              78.42%              78.42%
    input4.txt    94.67%              94.67%              94.67%

    Hash Function: hash2

    Certainly, it works better than the first hash function, but there are still some bad limitations. First, again, the hash values are small: nearly all of them are less than 4000, and the identical results for table sizes 8000 and 100000 suggest that none reaches 8000. As a result, the performance of hashing barely improves as the hash table size increases. Second, in the last two files, where the input strings are fairly similar in size and character composition, the performance is especially bad.
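    A sketch of this kind of additive hash is below. It is an illustration written for these notes rather than the contents of hash2.c, and the name hash2_sketch is made up:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of an additive hash: sum the character codes of the string. */
unsigned int hash2_sketch(const char *s)
{
  unsigned int sum = 0;

  assert(s != NULL);
  for (; *s != '\0'; s++) sum += (unsigned char) *s;
  return sum;
}
```

    The sum does not depend on the order of the characters, so "ab" and "ba" collide, and for ASCII text it can never exceed 127 times the string length, which is why the values stay so small.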


    Hash3 - Hashing the first three characters

    The third hash function (hash3.c) comes from the book. Use the following function of the first three characters c1, c2 and c3: (c1 + 27*c2 + 27*27*c3). See the book for justification. Here's its performance:

    Input File    Table Size: 4000    Table Size: 8000    Table Size: 100000
    input1.txt    62.77%              59.08%              55.55%
    input2.txt    67.08%              64.05%              61.08%
    input3.txt    86.12%              85.55%              85.15%
    input4.txt    89.60%              89.53%              89.42%

    Hash Function: hash3

    Compared with hash2, it improves input4.txt, which makes sense, and it is somewhat better on input1.txt, but on input2.txt and input3.txt its performance is worse. Overall it is still poor, because there is only a limited number of combinations of the first three letters in each file: 1815 in input1.txt, 1590 in input2.txt, 594 in input3.txt and 481 in input4.txt.
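    The formula can be sketched as follows. This is an illustration rather than the code in hash3.c; the name hash3_sketch is made up, and treating missing characters in strings shorter than three characters as zero is an assumption of this sketch:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of a hash on the first three characters:
   c1 + 27*c2 + 27*27*c3. Characters past the end of a short string
   are treated as 0 (an assumption made for this sketch). */
unsigned int hash3_sketch(const char *s)
{
  unsigned int c1, c2, c3;

  assert(s != NULL);
  c1 = (unsigned char) s[0];
  c2 = (c1 != 0) ? (unsigned char) s[1] : 0;   /* do not read past the end */
  c3 = (c2 != 0) ? (unsigned char) s[2] : 0;
  return c1 + 27 * c2 + 27 * 27 * c3;
}
```

    Everything after the third character is ignored, so all strings that share a three-letter prefix land in the same slot.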


    Hash4 - Shifting each character by a multiple of 27

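    Again, this one comes from the book. Basically, for a string of length l with characters c1 through cl, you sum up ci * 27^(l-i) over all i, in an incremental manner: multiply the running total by 27 and add the next character. The code is in hash4.c.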

    Input File    Table Size: 4000    Table Size: 8000    Table Size: 100000
    input1.txt    36.45%              21.10%              2.17%
    input2.txt    36.02%              21.05%              2.38%
    input3.txt    36.70%              21.00%              2.12%
    input4.txt    36.40%              21.20%              1.85%

    Hash Function: hash4
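    The incremental sum can be sketched as below. This is an illustration, not the contents of hash4.c; the name hash4_sketch is made up, and letting the unsigned arithmetic wrap around on overflow is an assumption of this sketch:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the polynomial hash: after processing characters
   c1..cl the value is the sum of ci * 27^(l-i), computed by
   repeatedly multiplying the running total by 27 and adding the
   next character. Unsigned overflow simply wraps around. */
unsigned int hash4_sketch(const char *s)
{
  unsigned int h = 0;

  assert(s != NULL);
  for (; *s != '\0'; s++) h = 27 * h + (unsigned char) *s;
  return h;
}
```

    Unlike the additive hash, every character contributes with a different weight, so the order of the characters matters and the values spread over a much larger range.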


    Hash5 - Bit-shifting by 5 bits

    The code in hash5.c performs another of the book's examples -- it bit-shifts each character by 5. This performs badly, because it shifts early characters off the left end of the word, and thus, we lose their information.

    Input File    Table Size: 4000    Table Size: 8000    Table Size: 100000
    input1.txt    52.08%              38.17%              8.93%
    input2.txt    64.05%              52.23%              19.02%
    input3.txt    65.25%              56.35%              32.25%
    input4.txt    72.65%              57.58%              9.70%

    Hash Function: hash5

    To be honest, I'm not sure why the middle two files perform so much worse here. My guess is that it has to do with the nature of the later characters in the strings, but I don't have time to prove it.
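    The loss of early characters is easy to demonstrate with a sketch. The code below is an illustration, not the contents of hash5.c; the name hash5_sketch, the use of a 32-bit word and the use of addition to combine characters are all assumptions of this sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of a shift-based hash on a 32-bit word: shift the running
   value left by 5 bits and add the next character. Bits shifted past
   bit 31 are discarded. */
uint32_t hash5_sketch(const char *s)
{
  uint32_t h = 0;

  assert(s != NULL);
  for (; *s != '\0'; s++) h = (h << 5) + (unsigned char) *s;
  return h;
}
```

    In an 8-character string, the first character ends up shifted left by 35 bits, which is entirely off the end of a 32-bit word. So under these assumptions "Axxxxxxx" and "Bxxxxxxx" hash to the same value, while the 7-character strings "Axxxxxx" and "Bxxxxxx" still differ.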


    Hash6 - Bit-shifting by 5 bits, and taking the result modulo a big number

    The code in hash6.c shifts each character by five bits, but then performs a mod operation, which, if you think about it, randomizes the bits somewhat so that we don't lose information from earlier bits:

    Input File    Table Size: 4000    Table Size: 8000    Table Size: 100000
    input1.txt    37.85%              23.12%              2.73%
    input2.txt    38.15%              22.55%              2.25%
    input3.txt    36.35%              20.95%              1.90%
    input4.txt    37.40%              21.93%              2.05%

    Hash Function: hash6
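    One way to sketch the idea is below. This is an illustration, not the contents of hash6.c: the name hash6_sketch, the modulus (the prime 2^31 - 1) and the choice to reduce after every character are all assumptions of this sketch. A 64-bit intermediate is used so that nothing is lost before the mod is taken:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed modulus for this sketch: the prime 2^31 - 1. */
#define HASH6_SKETCH_MODULUS 2147483647u

/* Sketch of shift-then-mod: shift the running value left by 5 bits,
   add the next character, and reduce modulo a large number. The
   running value stays below 2^31, so the shifted value fits easily
   in 64 bits and no bits are discarded before the reduction. */
uint32_t hash6_sketch(const char *s)
{
  uint64_t h = 0;

  assert(s != NULL);
  for (; *s != '\0'; s++)
    h = ((h << 5) + (unsigned char) *s) % HASH6_SKETCH_MODULUS;
  return (uint32_t) h;
}
```

    With the mod in place, high-order bits are folded back into the value instead of being dropped, so two 8-character strings that differ only in their first character no longer collide in this sketch.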


    Hash7 - Ignore


    Hash8 - Using a hash function from CACM

    The last hash function is one that came from an article in Communications of the ACM in June, 1990 ("Fast hashing of variable-length text strings," by Peter K. Pearson, CACM 33(6), June, 1990, pages 677-680). The author gave this hash function as an example of one that works relatively quickly, and that yields good hash values. You'll have to read the code to figure it out -- it's not too hard (look at hash_string_8 for the basic technique).

    Input File    Table Size: 4000    Table Size: 8000    Table Size: 100000
    input1.txt    36.48%              20.93%              2.77%
    input2.txt    36.08%              21.52%              2.85%
    input3.txt    36.65%              21.30%              2.60%
    input4.txt    35.70%              21.05%              2.98%

    Hash Function: hash8
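    The basic technique in Pearson's article is a table lookup: keep a one-byte running value h, and for each character c replace h with T[h XOR c], where T is a permutation of the values 0 through 255. The sketch below illustrates that idea; it is not the code from hash8.c. The function names are made up, the permutation is generated by a simple shuffle rather than taken from the article, and building a 32-bit value from four passes with different starting values is a simplification chosen for this sketch:

```c
#include <assert.h>
#include <stddef.h>

static unsigned char pearson_table[256];
static int pearson_ready = 0;

/* Fill the table with a fixed pseudo-random permutation of 0..255,
   using a Fisher-Yates shuffle driven by a small linear congruential
   generator (an assumption of this sketch, not Pearson's table). */
static void pearson_init(void)
{
  unsigned int seed = 12345u, i, j;
  unsigned char tmp;

  for (i = 0; i < 256; i++) pearson_table[i] = (unsigned char) i;
  for (i = 255; i > 0; i--) {
    seed = seed * 1103515245u + 12345u;
    j = (seed >> 16) % (i + 1);
    tmp = pearson_table[i];
    pearson_table[i] = pearson_table[j];
    pearson_table[j] = tmp;
  }
  pearson_ready = 1;
}

/* One-byte hash: for each character, h = T[h XOR c]. */
unsigned char pearson8_sketch(const char *s, unsigned char start)
{
  unsigned char h = start;

  assert(s != NULL);
  if (!pearson_ready) pearson_init();
  for (; *s != '\0'; s++) h = pearson_table[h ^ (unsigned char) *s];
  return h;
}

/* Wider hash: concatenate four one-byte hashes that start from
   different values (a simplification made for this sketch). */
unsigned int pearson32_sketch(const char *s)
{
  unsigned int h = 0, j;

  for (j = 0; j < 4; j++) h = (h << 8) | pearson8_sketch(s, (unsigned char) j);
  return h;
}
```

    A single pass yields only 256 distinct values, which is far too few for a table of 100,000 entries, so some way of combining several bytes is needed. Each step costs just one XOR and one table lookup.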


    Summary

    The performance of the hash functions is graphed below for a table size of 100,000 elements.
    As you can see, hash functions 4, 6 and 8 were the clear winners in all cases. Note the difference between hash functions 2 and 3 in the first and the last input files -- it is significant indeed -- think about why.

    In your labs, you should use hash functions 4, 6 or 8.