Open-addressed double hashed hash table 11 Jun 2017
While I was still in my undergrad program at the University Politehnica of Bucharest, great place btw - you should check out their courses (https://ocw.cs.pub.ro), I took some low level programming classes, and one of the first classes that was actually raising some difficulties was the Operating Systems one. The first assignment for that class was to write a multi-platform hash table, meaning your C code had to run on both Windows and Linux platforms. I am not going to get into details about how you make that work, or why is important now, but I was thinking about why did we get a hash table as a first assignment and about the importance of hash tables in computing in general. So I came up with the idea of writing a tutorial about how to write a hash table in C. What you will get from this is a deeper understanding of how the data structure works and how and when to use it, why sometimes it’s great to use hash tables and other times it’s not.
What is a hash table?
Suppose we want to design a system for storing car license plate numbers and the name of the person whose name the car is registered in. And we want following queries to be performed efficiently:
- Insert a license plate number and corresponding information.
- Search a license plate number and fetch the information.
- Delete a license plate number and related information.
We could do this in a few different ways, we could store the data using:
- Array of license plate number and corresponding info (this solution is not really working, we are going to search in linear fashion, and if our data is really big that is going to take time. Even if we keep the license plate numbers - made out of letters and numbers - in a sorted fashion the fastest we can search this structure is in O(log n) time using binary search. We can do better. )
- A linked list storing license plate number and corresponding info (this solution has the same limitations as the previous one)
- A direct access table (we create a big array and use the license plate numbers as index in the array will not work always, especially if we have again a lot of data, the extra space required for this will be proportional to the number of entries. But this solution is best when it comes to time complexity, the search time for this is actually O(1) - an entry in the access table is NULL if the license plate is not present, otherwise it stores a pointer to the person whose name the car is registered in)
So we want to be able to search in O(1) time, but without the drawbacks that a direct access table gives us. A hash table is an improvement over the direct access table and consists of an array of “buckets”, each bucket stores a key-value pair. The idea is to use a hash function that converts a given license plate or any other key to a smaller number and uses this smaller number as index in a table called hash table. The process is reversible, if we want to get back the initial key-value pair (in our case the license plate number and the person who has that license registered) we give the key to the same hashing function, get back the index and use that to find the value in the array.
hash function
A hash function converts a license plate number to a small index-integer value that will be used in the hash table as identifier for the key-value pair.A good hash function should have following properties
- Efficiently computable.
- Should uniformly distribute the keys (each table position equally likely for each key)
collision handling
The hash gets as input a key and returns a small index number, so there is a possibility that two keys result in the same value. When we want to insert a new key that maps to an already occupied slot in the hash table we have a collision and we have to fix it (there would be no use to our hash table if we end up having an array of key-value pairs that we need to traverse for each index in the hash table).
Open addressing is a method for handling collisions. In Open Addressing, all elements are stored in the hash table itself. So at any point, size of table must be greater than or equal to total number of keys. Open addressing in a hash table can be done in one of the following 3 ways:
- linear probing - we look linearly for the next free slot where we can insert another key:value pair, this is a consequence of the hash function that usually looks like this, where S is the size of the table, and hash(x) is the index computed by the hash function for key k
The problem of this solution is clustering, which makes searching and finding an empty slot slower.
- quadratic probing - looks for i*i-th slot in the i iteration
- double hashing - we use another hash function hash2(k) and we look for i*hash2(k) in the i-th iteration
hash table operations
The operations that our hash table needs to support are the following:
-
Insert(hT, key): insert a new pair key:value in hash table hT
-
Search(hT, key): return the value associated with key from the hash table
-
Delete(hT, key): delete the pair associated with key. This operation is interesting as in if we simply delete a key then search may fail, so that’s why slots of deleted keys are marked as “deleted”. Insert can insert an item in a deleted slot, but search doesn’t look at those slots.
Hash table implementation
We first define the structures that are going to hold the hash table data, as well as the DELETED element that is going to mark the fact that one element from the hash table has been deleted.
Now we need functions to allocate memory for each element of the hash table as well as for the hash table itself, we also need to be able to free this memory.
Next the hash function and how we are implementing the aspects that we talked about before.
Finally we can write the functions for the operations that our hash table is going to perform: insert, search and delete element.
Conclusion
In conclusion there are advantages and disadvantages of using Hash Tables over other data structures. Some of the advantages are:
- synchronization
- in many situations hash tables turn out to be more efficient than search trees or any other table lookup structure. They are widely used in many places, but they are also vulnerable to collisions and that becomes particularly important when hash tables are used in DNS or other web services that can be attacked with denial of services.
The main disadvantages of hash tables are:
- hash collisions are practically unavoidable when hashing a random subset of a large set of possible keys
- hash tables become quite inefficient when there are many collisions