CS61B
21. Dictionaries & Hash Tables
Dictionaries
Suppose we want to look up a two-letter word.
Declare an array of 26*26=676 references, one per possible two-letter word. To insert a definition, use hashCode() to map each word to a unique integer between 0 and 675, and store the definition at that index.
public class Word {
  public static final int LETTERS = 26, WORDS = LETTERS * LETTERS;
  public String word;

  public int hashCode() {  // Map a two-letter Word to 0...675.
    return LETTERS * (word.charAt(0) - 'a') + (word.charAt(1) - 'a');
  }
}

public class WordDictionary {
  private Definition[] defTable = new Definition[Word.WORDS];

  public void insert(Word w, Definition d) {
    defTable[w.hashCode()] = d;  // Insert (w, d) into Dictionary.
  }

  Definition find(Word w) {
    return defTable[w.hashCode()];  // Return the Definition of w.
  }
}
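To check the arithmetic in hashCode(), here is a minimal standalone sketch. The class TwoLetterHashDemo and its method code() are hypothetical names introduced only for illustration; the formula is the one used in Word above, and it assumes lowercase letters 'a' through 'z'.

```java
// Illustrative only: reproduces the two-letter hash formula on plain Strings.
class TwoLetterHashDemo {
    static final int LETTERS = 26;

    // Assumes word has at least two characters, both in 'a'..'z'.
    static int code(String word) {
        return LETTERS * (word.charAt(0) - 'a') + (word.charAt(1) - 'a');
    }

    public static void main(String[] args) {
        System.out.println(code("aa"));  // 26*0 + 0   = 0
        System.out.println(code("hi"));  // 26*7 + 8   = 190
        System.out.println(code("zz"));  // 26*25 + 25 = 675
    }
}
```

Every two-letter word gets its own index, so no two words ever share an array slot.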
Hash Tables
The most common implementation of dictionaries.
A hash table maps a huge set of possible keys into N buckets by applying a compression function to each hash code. The obvious compression function is
h(hashCode) = hashCode mod N
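Hash codes are often negative, and mod is not the same as Java's remainder operator %: in Java, hashCode % N is negative or zero whenever hashCode is negative. If the result is negative, add N. (Python's % already returns a nonnegative result when N is positive.)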
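The difference between mod and Java's % can be sketched as follows. The class name ModCompression and its method names are made up for this example; Math.floorMod is a standard library method available since Java 8.

```java
// Sketch: compress a possibly negative hash code into a bucket index 0..N-1.
class ModCompression {
    // Take the remainder, then add N if it is negative.
    static int compress(int hashCode, int numBuckets) {
        int r = hashCode % numBuckets;        // may be negative in Java
        return (r < 0) ? r + numBuckets : r;
    }

    // Equivalent result using the standard library.
    static int compressFloorMod(int hashCode, int numBuckets) {
        return Math.floorMod(hashCode, numBuckets);
    }

    public static void main(String[] args) {
        System.out.println(-7 % 5);                   // -2
        System.out.println(compress(-7, 5));          // 3
        System.out.println(compressFloorMod(-7, 5));  // 3
    }
}
```

Both versions assume numBuckets is positive.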
Problem: collisions (several keys hash to the same bucket).
Solution: chaining.
- Each bucket in the table references a linked list of entries, called a chain.
Then how do we know which definition corresponds to which word?
- store each key in the table with its definition
- the easiest way is to have each listnode store an entry that references both a key (the word) and an associated value (its definition)
         ---   ----------------------------------------------------------
defTable |.+-->|   .   |   .   |   X   |   .   |   X   |   .   |   .   | ...
         ---   ----|-------|---------------|---------------|-------|-----
                   v       v               v               v       v
                  ---     ---             ---             ---     ---
                  |.+>pus |.+>evil        |.+>okthxbye    |.+>cool|.+>mud
                  |.+>goo |.+>C++         |.+>creep       |.+>jrs |.+>wet dirt
                  |.|     |X|             |X|             |.|     |X|
                  -+-     ---             ---             -+-     ---
                   |                                       |
                   v                                       v
                  ---                ^                    ---
                  |.+>sin        < chains >               |.+>twerk
                  |.+>have fun                            |.+>Miley burping
                  |X|                                     |X| the wrong way
                  ---                                     ---
Hash tables usually support at least three operations. An Entry object references a key and its associated value.
public Entry insert(key, value)
Compute the key's hash code and compress it to determine the entry's bucket.
Insert the entry (key and value together) into that bucket's list.
public Entry find(key)
Hash the key to determine its bucket. Search the list for an entry with the given key. If found, return the entry; otherwise, return null.
public Entry remove(key)
Hash the key to determine its bucket. Search the list for an entry with the given key. Remove it from the list if found. Return the entry or null.
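The three operations above can be sketched as a small chained table. This is one possible arrangement under stated assumptions, not the course's reference implementation: the names ChainedTable, Node, bucketOf, and size are made up for this example, keys are assumed non-null, the compression function is plain hashCode mod N, and inserting a key that is already present simply adds a second entry.

```java
// Sketch of a chained hash table; class and field names are illustrative.
class Entry {
    final Object key;
    final Object value;
    Entry(Object key, Object value) { this.key = key; this.value = value; }
}

class ChainedTable {
    // One list node per entry; each bucket references the head of its chain.
    private static class Node {
        final Entry entry;
        Node next;
        Node(Entry entry, Node next) { this.entry = entry; this.next = next; }
    }

    private final Node[] buckets;
    private int size;  // n, the number of entries stored

    ChainedTable(int numBuckets) { buckets = new Node[numBuckets]; }

    // Compression: hashCode mod N, kept nonnegative with floorMod.
    private int bucketOf(Object key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    public Entry insert(Object key, Object value) {
        Entry e = new Entry(key, value);
        int b = bucketOf(key);
        buckets[b] = new Node(e, buckets[b]);  // push onto the front of the chain
        size++;
        return e;
    }

    public Entry find(Object key) {
        for (Node n = buckets[bucketOf(key)]; n != null; n = n.next) {
            if (n.entry.key.equals(key)) return n.entry;
        }
        return null;  // no entry with this key
    }

    public Entry remove(Object key) {
        int b = bucketOf(key);
        Node prev = null;
        for (Node n = buckets[b]; n != null; prev = n, n = n.next) {
            if (n.entry.key.equals(key)) {
                if (prev == null) buckets[b] = n.next;  // removing the chain's head
                else prev.next = n.next;                // unlink a later node
                size--;
                return n.entry;
            }
        }
        return null;
    }

    public int size() { return size; }
}
```

Inserting at the front of the chain keeps insert() at constant time regardless of chain length; find() and remove() take time proportional to the length of one chain.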
What if two entries with the same key are inserted? There are two approaches.
- Following Goodrich and Tamassia, we can insert both, and have find() or remove() arbitrarily return/remove one. Goodrich and Tamassia also propose a method findAll() that returns all the entries with a given key.
- Replace the old value with the new one, so only one entry with a given key exists in the table.
Which approach is best? It depends on the application.
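The two duplicate-key policies can be contrasted on a single chain. This is only a sketch: DuplicateKeyDemo and its method names are hypothetical (findAll mirrors the method name mentioned above), and a java.util.ArrayList of {key, value} pairs stands in for one bucket's linked list.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: one chain, two ways of handling a key that is already present.
class DuplicateKeyDemo {
    // Stand-in for a single bucket's chain; each element is {key, value}.
    private final List<String[]> chain = new ArrayList<>();

    // Approach 1: keep every entry, even if the key already exists.
    void insertKeepingDuplicates(String key, String value) {
        chain.add(new String[] {key, value});
    }

    // Companion to approach 1: collect the values of all entries with this key.
    List<String> findAll(String key) {
        List<String> values = new ArrayList<>();
        for (String[] e : chain) {
            if (e[0].equals(key)) values.add(e[1]);
        }
        return values;
    }

    // Approach 2: overwrite the value if the key exists; otherwise add an entry.
    void insertReplacing(String key, String value) {
        for (String[] e : chain) {
            if (e[0].equals(key)) { e[1] = value; return; }
        }
        chain.add(new String[] {key, value});
    }
}
```

Note that approach 2 must search the chain on every insert, while approach 1 can insert without looking at existing entries.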
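Load factor
load factor = n / N = #keys in the table / #buckets
- Keep the load factor from growing too large (n >> N): the chains become long, and each operation degrades from O(1) time toward O(n) time spent searching linked lists.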
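A small sketch of how the load factor relates to chain length: the average chain length over all buckets is exactly n / N, whatever the hash codes are. LoadFactorDemo and its methods are hypothetical names, and the hash codes generated in main() are made up for illustration.

```java
// Sketch: load factor and per-bucket chain lengths under hashCode mod N.
class LoadFactorDemo {
    static double loadFactor(int numEntries, int numBuckets) {
        return (double) numEntries / numBuckets;
    }

    // Count how many of the given hash codes land in each bucket.
    static int[] chainLengths(int[] hashCodes, int numBuckets) {
        int[] lengths = new int[numBuckets];
        for (int h : hashCodes) {
            lengths[Math.floorMod(h, numBuckets)]++;
        }
        return lengths;
    }

    public static void main(String[] args) {
        int[] codes = new int[1000];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = 31 * i + 7;  // arbitrary made-up hash codes
        }
        System.out.println(loadFactor(codes.length, 10));  // 100.0
        for (int len : chainLengths(codes, 10)) {
            System.out.print(len + " ");  // entries per bucket
        }
        System.out.println();
    }
}
```

With 1000 entries in 10 buckets, every find() has to walk a chain that averages 100 entries, which is why n >> N is worth avoiding.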
Hash Codes and Compression Functions
- The ideal hash code and compression function would map each key to a uniformly distributed random bucket from zero to N - 1.
- A given key always hashes to the same bucket; "random" means that two different keys, however similar, will hash to independently chosen integers, so the probability they'll collide is 1/N.
For reasons I won't explain (see Goodrich and Tamassia Section 9.2.4 if you're interested),
h(hashCode) = ((a * hashCode + b) mod p) mod N
is a yet better compression function. Here, a, b, and p are positive integers, p is a large prime, and p >> N. With this function, the number N of buckets doesn't need to be prime, whereas plain hashCode mod N tends to distribute keys better when N is prime.
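A minimal sketch of this compression function in Java follows. The class name MadCompression and the constants A, B, and P are assumptions chosen for illustration, not values from the notes: P is 2^31 - 1, which is prime, and A and B are arbitrary positive integers smaller than P. The arithmetic is done in long so that a * hashCode + b cannot overflow.

```java
// Sketch of h(hashCode) = ((a * hashCode + b) mod p) mod N.
class MadCompression {
    static final long P = 2147483647L;  // assumed prime p = 2^31 - 1
    static final long A = 127L;         // assumed multiplier a
    static final long B = 65993L;       // assumed offset b

    // Returns a bucket index in 0..numBuckets-1; numBuckets must be positive.
    static int compress(int hashCode, int numBuckets) {
        // floorMod keeps the intermediate value in 0..P-1 for negative hash codes.
        long t = Math.floorMod(A * hashCode + B, P);
        return (int) (t % numBuckets);
    }

    public static void main(String[] args) {
        System.out.println(compress(0, 10));  // 65993 mod 10 = 3
        System.out.println(compress(1, 10));  // 66120 mod 10 = 0
    }
}
```

Because the inner result is already nonnegative, the final % needs no sign correction.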
I recommend always using a known good compression function like the two above. Unfortunately, it's still possible to mess up by inventing a hash code that creates lots of collisions even before the compression function is used. We'll discuss hash codes next lecture.