Version: 0.3.38

2. Collisions#

Lecture:

Lecture 4.2 (slides)

Objectives:

Understand the problem of collisions when using a hash function and how to work around it

Concepts:

Hash function, collisions, separate chaining, open addressing

2.1. What Are “Collisions”?#

A hash function maps a very large set of keys onto a much smaller set of array indices. The mere size difference between these two sets implies that several keys will be mapped onto the same index: this is a collision.

Important

A collision occurs whenever a hash function maps two different keys onto the same index.

../_images/collisions.svg

Fig. 2.20 A collision: Two entries whose keys unfortunately map to the same index#

Fig. 2.20 illustrates such a collision. The first entry, whose key is the email john@test.com, maps to index 3, but a second entry, whose key is lisa@test.com, maps to the very same index. Which value should we store at that index?

A good hash function therefore minimizes the number of collisions. But collisions will necessarily happen as soon as the set of keys is larger than the set of indices. Otherwise, we would talk about perfect hashing 1Perfect hashing is possible if the set of keys is fixed and known in advance. Then one can allocate enough entries in the table to ensure perfect hashing. .
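Collisions are easy to reproduce. As a small illustration (not from the lecture), the Java strings "Aa" and "BB" are a classic pair with identical hashCode() values, so they land on the same index no matter the capacity:

```java
// Two *different* keys can end up on the same index: in Java, the
// strings "Aa" and "BB" are a well-known pair with equal hashCode()
// values, so any capacity maps them to the same slot.
class CollisionDemo {

    // Maps a key to an index; floorMod keeps the result non-negative.
    static int index(String key, int capacity) {
        return Math.floorMod(key.hashCode(), capacity);
    }

    public static void main(String[] args) {
        System.out.println("Aa -> " + index("Aa", 16));
        System.out.println("BB -> " + index("BB", 16));  // same index, different key
    }
}
```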

As shown on Fig. 2.21, minimizing the probability of collision boils down to mapping the keys evenly onto the possible indices. The mathematics behind hash functions relates to the balls into bins problem. The player is given \(n\) balls, and at each step, she places a ball into one of the \(m\) available bins. The question that connects with hash functions, and collisions in particular, is: At the end of the game, what is the expected number of balls in a bin?
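The game is easy to simulate. The sketch below is a hypothetical illustration (the class and method names are mine): it throws \(n\) balls into \(m\) bins uniformly at random and reports how unevenly they spread.

```java
import java.util.Random;

// A small simulation of the balls-into-bins game: throw n balls into
// m bins uniformly at random and count how many balls land in each bin.
class BallsIntoBins {

    // Returns the bin counts after throwing n balls into m bins.
    static int[] simulate(int n, int m, long seed) {
        int[] bins = new int[m];
        Random random = new Random(seed);
        for (int i = 0; i < n; i++) {
            bins[random.nextInt(m)]++;  // each bin is equally likely
        }
        return bins;
    }

    public static void main(String[] args) {
        int[] bins = simulate(100_000, 1_000, 42);
        int max = 0, empty = 0;
        for (int count : bins) {
            max = Math.max(max, count);
            if (count == 0) empty++;
        }
        // Each bin receives n / m = 100 balls on average, but some bins
        // end up noticeably fuller: those are the "collisions".
        System.out.println("max balls in a bin: " + max);
        System.out.println("empty bins: " + empty);
    }
}
```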

I see no need to dive into the maths here, but we shall see how we can modify our implementation of hash tables to guard against collisions. There are two main strategies: separate chaining and open addressing.

Important

There are two common ways to manage collisions: Separate Chaining and Open addressing.

2.2. Separate Chaining#

So far, our hash table uses an array of values. Given a key-value pair, the hash of the key gives us the index where we shall store the given value. What should we do when a collision happens?

The idea of separate chaining is to use an array of linked lists (as opposed to an array of values). This way, when a collision occurs, we can simply insert the given value into the corresponding list 2Insertion in a linked list runs in \(O(1)\) and does not require resizing. . Fig. 2.22 illustrates this. It shows a hash table where two keys have already collided at Index 3, john@test.com and lisa@test.com. The two corresponding values are stored in a dedicated linked list. Now we insert another key, neil@test.com, whose hash is also 3, so it is inserted into this same list.

../_images/separate_chaining.svg

Fig. 2.22 Resolving collisions using separate chaining#
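The listings in the rest of this section assume a small surrounding class along the following lines. This skeleton is a sketch of mine, not code from the lecture: the entries array holds one (possibly null) list of key-value pairs per index, and hash maps any key to a valid index.

```java
// A minimal sketch of the class assumed by the listings below:
// each slot of `entries` holds either null or a List<Pair<Key, Value>>
// (the "chain" for that index), and `hash` maps a key to a valid index.
class ChainedHashTable<Key, Value> {
    Object[] entries = new Object[16];

    int hash(Key key) {
        // floorMod keeps the index non-negative even when hashCode() is negative
        return Math.floorMod(key.hashCode(), entries.length);
    }
}
```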

2.2.1. Retrieval#

We need to modify our retrieval procedure map.get() to account for these linked lists of key-value pairs. We proceed as follows:

  1. We retrieve the list associated with the given key;

  2. If there is no such list, then the key is not in our map;

  3. Otherwise, we search through that list for the given key and return the associated value. If we cannot find the key, then the key is missing.

Listing 2.8 Retrieving the value associated with a given key under separate chaining#
public Value get(Key key) throws NoSuchKey {
    int index = hash(key);
    var candidates = (List<Pair<Key,Value>>) entries[index];
    if (candidates == null) throw new NoSuchKey(key);
    var pair = search(candidates, key);
    return pair.value;
}

private Pair<Key, Value> search(List<Pair<Key, Value>> candidates, Key key) throws NoSuchKey {
    var iterator = candidates.iterator();
    while (iterator.hasNext()) {
        var pair = iterator.next();
        if (pair.key.equals(key)) return pair;
    }
    throw new NoSuchKey(key);
}

Listing 2.8 shows how one could do that in Java. We first retrieve the list associated with the given key. If that list is null, then the key is missing. Otherwise, we use linear search to find which item holds the desired key. Note the use of an iterator to traverse the list efficiently (see the search procedure).

How fast is this?

Without diving into the mathematics, it takes as long as the search through the candidates for the selected index. So if there have not yet been any collisions, it runs in \(O(1)\); otherwise it runs in \(O(k)\), where \(k\) is the number of “candidates” for that index.

2.2.2. Insertion#

To implement the map.put() function, we proceed in a similar fashion:

  1. We compute the index of the key using the hash function.

  2. We retrieve the list of “candidates” for that index.

  3. If there is no candidate yet, we initialize a new empty list.

  4. We check for duplicates, that is, we search for the given key in the list of candidates.

  5. If we find it, we overwrite the value associated with the given key.

  6. Otherwise, we append a new key-value pair to the list of candidates.

Listing 2.9 Inserting a new key-value pair under separate chaining#
 1public void put(Key key, Value value) {
 2     int index = hash(key);
 3     if (entries[index] == null) {
 4         entries[index] = new LinkedList<Pair<Key, Value>>();
 5     }
 6     var candidates = (List<Pair<Key,Value>>) entries[index];
 7     try {
 8         var pair = search(candidates, key);
 9         pair.value = value;
10
11     } catch (NoSuchKey error) {
12         candidates.add(new Pair<Key,Value>(key, value));
13
14     }
15 }

Listing 2.9 details how we can do that in Java. Note that we reuse the search procedure we created for the get operation. If it throws an exception (see line 11), we know that the given key is missing and we then insert a new key-value pair.

2.2.3. Deletion#

The map.remove() operation must also account for these lists. We proceed as follows:

  1. We compute the index by hashing the given key

  2. If there is no candidate at that index, then the key is missing

  3. Otherwise, we traverse the list of candidates and delete the pair that matches the given key.

  4. If there is no pair with the given key, then, the key is missing.

Listing 2.10 Removing a key-value pair under separate chaining#
public Value remove(Key key) throws NoSuchKey {
    int index = hash(key);
    if (entries[index] == null) throw new NoSuchKey(key);
    var candidates = (List<Pair<Key, Value>>) entries[index];
    var iterator = candidates.iterator();
    while (iterator.hasNext()) {
        var pair = iterator.next();
        if (pair.key.equals(key)) {
            iterator.remove();
            return pair.value;
        }
    }
    throw new NoSuchKey(key);
}

Listing 2.10 lays out how we can do that in Java. To efficiently delete in a linked list, we use an iterator and trigger the deletion directly from the node that matches the given key.

How Fast Is It?

Here as well, the time spent deleting depends on the number of items in the list of candidates. If there is one candidate (or none), it runs in \(O(1)\); otherwise it runs in \(O(k)\), where \(k\) is the number of candidates (thanks to the iterator, the removal itself takes constant time).

2.2.4. Wasted Space#

An important behaviour of separate chaining is the space it takes: for larger hash tables, regardless of the hash function, about a third of the table will remain empty.

Why is that? It relates to the mathematics of balls and bins. After placing \(n\) balls into \(c\) bins, what is the probability that a given bin is empty? For that to happen, we must “miss” that bin every single time. For a given ball, the probability of missing the bin is \(p=\frac{c-1}{c}\), that is, the probability of choosing any of the other \(c-1\) bins. For a bin to be empty after \(n\) balls, every single ball must miss it, which happens with probability \(p^n\). For a full table (\(n = c\)) whose capacity grows, this gives us:

\[\begin{split}\mathbb{P}[B_i = \varnothing] & = \left(\frac{c-1}{c}\right)^n \\ \lim_{c\to\infty} \left(\frac{c-1}{c}\right)^c & = \frac{1}{e} = 0.3678...\end{split}\]
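We can check this limit numerically. The snippet below is a small illustration of mine (not from the lecture), evaluating the empty-bin probability for a full table (\(n = c\)) as the capacity grows:

```java
// Evaluates the empty-bin probability ((c-1)/c)^n for increasing
// capacities with n = c: the value approaches 1/e ~ 0.3679.
class EmptyBins {

    static double emptyProbability(int c, int n) {
        return Math.pow((c - 1.0) / c, n);
    }

    public static void main(String[] args) {
        for (int c : new int[]{10, 100, 1_000, 10_000}) {
            System.out.printf("c = %5d  P[empty] = %.4f%n", c, emptyProbability(c, c));
        }
        System.out.printf("1/e       = %.4f%n", 1 / Math.E);
    }
}
```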

That indicates that, when using separate chaining, we must pay attention to the load factor: the percentage of entries (in the underlying array) that are occupied. When the load factor approaches 66 %, performance gradually degrades: the linked lists start to grow, consuming more memory and increasing the runtime of the put and get operations.

2.3. Open Addressing#

Open addressing is another approach to handling collisions. With open addressing, when we face a collision at a given index, we probe other indices until we find a free entry, or until the underlying table gets full.

../_images/open_addressing.svg

Fig. 2.23 Handling collisions using open addressing: When an entry is occupied, we probe another one#

Fig. 2.23 shows an example. The hash table already contains three entries: Hugo’s record at index 1, John at index 3, and Lisa at index 4. We now try to insert Neil’s record, whose key’s hash is 3. That index already contains John’s record, so we try the next one, which is also taken, so we try the next, which is free. So we insert Neil’s details at Index 5.

As a data structure, open addressing requires less memory than separate chaining. The downside is that the table will eventually get full and must then be resized. Such a resizing, so-called rehashing, resembles resizing a dynamic array (see Lecture 2.3) but requires recomputing the hash of every entry, since the table has a new capacity.
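Rehashing could be sketched as follows. This helper is an assumption of mine (the lecture shows no code for it): it doubles the capacity and re-inserts every pair with linear probing, recomputing each index against the new capacity.

```java
// A sketch of rehashing for an open-addressing table: allocate a larger
// array and re-insert every pair, recomputing each index because the
// capacity (and therefore hash mod capacity) has changed.
class Rehashing {

    static Object[] rehash(Object[] oldEntries) {
        Object[] newEntries = new Object[2 * oldEntries.length];
        for (Object slot : oldEntries) {
            if (slot == null) continue;  // empty entry: nothing to move
            var pair = (Pair<?, ?>) slot;
            // The index must be recomputed against the NEW capacity
            int start = Math.floorMod(pair.key.hashCode(), newEntries.length);
            for (int offset = 0; offset < newEntries.length; offset++) {
                int index = (start + offset) % newEntries.length;
                if (newEntries[index] == null) {  // linear probing for a free slot
                    newEntries[index] = pair;
                    break;
                }
            }
        }
        return newEntries;
    }

    static class Pair<Key, Value> {
        Key key;
        Value value;

        Pair(Key key, Value value) {
            this.key = key;
            this.value = value;
        }
    }
}
```

In a complete implementation, pairs marked as deleted (see the deletion section) would simply be dropped during rehashing instead of being re-inserted.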

2.3.1. Retrieval#

To retrieve the value associated with a given key, we proceed as follows:

  1. Compute the “expected” index where the value should be, using the hash function.

  2. We check the key-value pair stored at that index.

  3. If there is no key-value pair at that index, the key was not defined.

  4. If there is a key-value pair with the given key, we return the value.

  5. Otherwise, we move on to the next entry and return to Step 2.

    1. If we have reached the last entry, we continue from the first one.

    2. If we are back at the initial “expected” index, the key was not defined.

Listing 2.11 details how we can do that in Java. To simplify search, we use an offset, which we add to the “start” index.

Listing 2.11 Retrieving a value from a hash table using open-addressing#
public Value get(Key key) throws NoSuchKey {
    int start = hash(key);
    for (int offset = 0; offset < entries.length; offset++) {
        var index = (start + offset) % entries.length;
        var candidate = (Pair<Key, Value>) entries[index];
        if (candidate == null)
            throw new NoSuchKey(key);
        if (candidate.key.equals(key))
            return candidate.value;
    }
    throw new NoSuchKey(key);
}

How Efficient Is It?

To retrieve a key-value pair, we have to follow a trail of “non-empty” entries. This is the probing sequence, and its length affects the runtime. In the worst case, we have to scan the whole table and the runtime degrades to \(O(n)\).

2.3.2. Deletion#

Removing an entry from the hash table requires a bit of care. If we simply delete the entry, we will not be able to find the keys that were inserted beyond it. Instead, we shall use soft deletion (a.k.a. tombstones): instead of deleting the entry, we only mark it as deleted. We proceed as follows:

  1. We compute the “expected” index where the value should be, using the hash function. This will be our starting position.

  2. We check the key-value pair stored at that index.

  3. If there is no key-value pair at that index, the key was not defined.

  4. If there is a key-value pair with the given key, we mark it as deleted, and we are done.

  5. Otherwise, we move on to the next entry and return to Step 2.

    1. If we have reached the last entry, we continue from the first one.

    2. If we are back where we started, the key was not defined.

The Listing below illustrates how this can be done in Java. We use the same strategy as before to iterate through the table from an arbitrary position, using an offset. The difference is that, when we find the key, we simply mark it as “deleted”.

Listing 2.12 Removing a key-value pair from a hash table using open-addressing#
public Value remove(Key key) throws NoSuchKey {
    int start = hash(key);
    for (int offset = 0; offset < entries.length; offset++) {
        var index = (start + offset) % entries.length;
        var candidate = (Pair<Key, Value>) entries[index];
        if (candidate == null)
            throw new NoSuchKey(key);
        // Skip tombstones: their key is null, so we must not call equals on it
        if (!candidate.isDeleted() && candidate.key.equals(key)) {
            return candidate.markAsDeleted();
        }
    }
    throw new NoSuchKey(key);
}

To mark a given pair as “deleted”, we can modify the Pair class as shown in Listing 2.13.

Listing 2.13 Key-value pair with soft deletion#
class Pair<Key, Value> {
    Key key;
    Value value;

    Pair(Key key, Value value) {
        this.key = key;
        this.value = value;
    }

    Value markAsDeleted() {
        key = null;
        return value;
    }

    boolean isDeleted() {
        return key == null;
    }
}

How Efficient Is It?

To delete a key-value pair, we also have to follow a trail of “non-empty” entries. The length of the probing sequence again affects the runtime. In the worst case, we have to scan the whole table and the runtime degrades to \(O(n)\).

2.3.3. Insertion#

To insert a new key-value pair, we first try the index yielded by the hash function. If that entry is taken, we try the next entry, until we find a free one (or one that has been marked as deleted).

  1. We compute the “expected” index where the value should be, using the hash function. This will be our starting position.

  2. We check the key-value pair stored at that index.

  3. If there is no key-value pair at that index, or if the pair has been marked as deleted, we insert a new key-value pair here and we are done.

  4. Otherwise, we move on to the next entry and return to Step 2.

    1. If we have reached the last entry, we continue from the first one.

    2. If we are back where we started, the table is full and must be resized.

Listing 2.14 Inserting a new key-value pair in a hash table using open-addressing#
public void put(Key key, Value value) {
    int start = hash(key);
    for (int offset = 0; offset < entries.length; offset++) {
        var index = (start + offset) % entries.length;
        var candidate = (Pair<Key, Value>) entries[index];
        // Note: this version does not look for an existing occurrence of
        // the key; a complete implementation would first search for it
        // and overwrite its value, as we did for separate chaining.
        if (candidate == null || candidate.isDeleted()) {
            entries[index] = new Pair<Key, Value>(key, value);
            return;
        }
    }
    throw new RuntimeException("Table is full!");
}

How Efficient Is It?

To insert a key-value pair, we again have to follow a trail of “non-empty” entries. The length of the probing sequence again dictates the runtime. In the worst case, we have to scan the whole table and the runtime degrades to \(O(n)\).

2.3.4. Variations#

In the implementation above, we always probe the directly next entry, but there are other strategies. The most common ones are linear probing, quadratic probing, and double hashing.

Linear Probing

With linear probing, to insert a pair \((k,v)\) in a table of capacity \(c\), we use the following function, which computes the index to try on the \(i\)-th attempt:

\[\textrm{attempt}(k, i) = (\textrm{hash}(k) + i) \mod c\]

This is what we have implemented above. The downside is that entries form clusters in the table. Such clusters degrade the performance of the operations, because insertions must scan through these large clusters.

Quadratic Probing

With quadratic probing, we make bigger and bigger jumps. For instance, we first try the entry \(\textrm{hash}(k)+1\), then \(\textrm{hash}(k)+4\), then \(\textrm{hash}(k)+9\), etc. The attempt function goes as follows:

\[\textrm{attempt}(k, i) = (\textrm{hash}(k) + i^2) \mod c\]

Quadratic probing helps reduce clustering but places a constraint on the capacity, which must be a prime number.

Double Hashing

With double hashing, the jump we make when probing is computed by another hash function, as follows:

\[\textrm{attempt}(k, i) = (\textrm{hash}_1(k) + i\cdot\textrm{hash}_2(k)) \mod c\]

This also helps reduce clustering, but the two hash functions must be chosen carefully: some pairs of hash functions do not play well together and lead to more clustering.
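The three attempt functions above can be written side by side. This is a sketch of mine, where h stands for \(\textrm{hash}(k)\) and h2 for \(\textrm{hash}_2(k)\):

```java
// The i-th index to try for a key with hash h, in a table of capacity c,
// under each probing strategy. Math.floorMod keeps the index in range
// even when the hash is negative.
class Probing {

    static int linear(int h, int i, int c) {
        return Math.floorMod(h + i, c);
    }

    static int quadratic(int h, int i, int c) {
        return Math.floorMod(h + i * i, c);
    }

    // h2 is the value of the second hash function for the same key;
    // it must never be 0 modulo c, otherwise we would never move.
    static int doubleHashing(int h, int i, int h2, int c) {
        return Math.floorMod(h + i * h2, c);
    }
}
```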