Lecture 7: Binary Search Trees, Heaps and Hash Tables


I. Binary Search Trees

A tree that ensures that every node has at most two children. It also ensures that the left child is always less than the parent and the right child is always greater than the parent (the order can be reversed, but it needs to be explicit).

[Figure: an example binary search tree, courtesy of Wikipedia]

Searching takes, on average, log(n) time to perform (way faster than linear search, and on par with binary search on a sorted array): you only need to walk down one of the tree's branches, cutting away the rest of the tree at each step. Insertion and removal can both be done in average time of log(n)... IF the properties of a BST are kept afterwards. Maintaining the BST property costs log(n) on average, with a worst case of n (when the tree degenerates into a chain). Sorting with a BST (insert everything, then traverse in order) takes nlog(n), the same as heapsort.
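As a quick sketch, insertion and lookup in a BST might look like this in Python (the class and function names are my own, not from the lecture):

    class Node:
        def __init__(self, key):
            self.key = key
            self.left = None
            self.right = None

    def insert(root, key):
        """Insert key, keeping left < parent < right. Duplicates are ignored."""
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
        elif key > root.key:
            root.right = insert(root.right, key)
        return root

    def search(root, key):
        """Walk down a single branch; log(n) steps on average."""
        while root is not None and root.key != key:
            root = root.left if key < root.key else root.right
        return root

    root = None
    for k in [8, 3, 10, 1, 6]:
        root = insert(root, k)
    print(search(root, 6) is not None)  # True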

II. Heaps

Heaps are trees whose values (or ranks) never increase as the level increases (the order can be reversed, but it needs to be explicit). This means each node is always greater than or equal in value to its children (a max heap; the reversed order gives a min heap).

[Figure: an example max heap, courtesy of Wikipedia]

Binary Heaps

A binary heap is a heap that allows each node to have at most two children. All levels of the tree are completely filled (except maybe the last level, which fills from left to right), which means that every internal node, except possibly one just above the leaves, must have two children.

Binary heaps can be stored in an array like the following:

[Figure: a binary heap and its array representation, courtesy of Wikipedia]

The lookup rule is basically the following: for a node stored at (0-based) index i, its left child is at index 2i + 1, its right child is at index 2i + 2, and its parent is at index (i - 1) / 2, rounded down.
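In Python, the index arithmetic is just:

    # Index arithmetic for a 0-based, array-backed binary heap.
    def parent(i):
        return (i - 1) // 2

    def left_child(i):
        return 2 * i + 1

    def right_child(i):
        return 2 * i + 2

    # Example: in the heap [9, 7, 8, 3, 5], the children of 7 (index 1)
    # are at indexes 3 and 4, i.e. the values 3 and 5.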


III. Heapsort

Heapsort sorts an array by building a binary heap out of it. It runs in a consistent nlog(n) time, even in the worst case. It works by building the binary heap and then repeatedly extracting the largest value, working its way down.

For a detailed implementation, refer to Wikipedia.

The basic gist of the algorithm is that you first build the binary heap, which is done via the function "heapify", and then repeatedly pick off the largest element and throw it into the back while keeping the binary heap property going (with additional cost).

Heapify essentially works by taking the first node above the leaves and doing "sift down" (the node being fixed moves down the tree), then repeating for every earlier node until the root is done. Sift down compares a node against its children and swaps it downward until the heap property is restored, which (in a max heap) means making sure that the children are never larger than the parent.

The sorting happens when the algorithm swaps the root (the maximum) to the back of the array. The first element of the array is now some former leaf, not guaranteed to be a proper root, so the algorithm sifts it down. This is a simple form of heapify, without all the schnaz of it, because the thing was a binary heap already before you took the root off. What this means is that the new root is like a new element added into the binary heap, since you aren't sure if it belongs at the root. So it simply boils down to just one sift down every time you take the root and put it in the back of the array.
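A compact sketch of the whole algorithm, with heapify and sift down as described above (the helper names are mine):

    def sift_down(a, start, end):
        """Push a[start] down until neither child is larger (max heap)."""
        root = start
        while 2 * root + 1 <= end:
            child = 2 * root + 1
            # Pick the larger of the two children.
            if child + 1 <= end and a[child] < a[child + 1]:
                child += 1
            if a[root] < a[child]:
                a[root], a[child] = a[child], a[root]
                root = child
            else:
                return

    def heapsort(a):
        n = len(a)
        # Heapify: sift down every node, starting just above the leaves.
        for start in range(n // 2 - 1, -1, -1):
            sift_down(a, start, n - 1)
        # Repeatedly move the root (the max) to the back and repair the heap.
        for end in range(n - 1, 0, -1):
            a[0], a[end] = a[end], a[0]
            sift_down(a, 0, end - 1)

    nums = [5, 1, 9, 3, 7]
    heapsort(nums)
    print(nums)  # [1, 3, 5, 7, 9]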

IV. Hash Tables

Hash tables are arrays that allow for constant access time on average... so if you need to add things, remove things, or get things, it's usually one operation. This is the fastest data structure for access... but potentially the most costly in space.

How hash tables work is that they allocate enough space for all the indexes needed. So if you have 10 unique items with values ranging from 10 to 100, you must make 91 slots for them (one per possible value from 10 through 100). This may sound like it's wasting space... BUT, you can find what you are looking for by simply using the value you are looking for as the index, hence no searching necessary.

[Figure: a hash table mapping keys to slots, courtesy of Wikipedia]
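A sketch of that direct-addressing scheme; subtracting 10 so the array starts at index 0 is my own bookkeeping:

    # Direct-address table for values in the range 10..100.
    SIZE = 91  # 100 - 10 + 1 slots
    table = [None] * SIZE

    def put(value):
        table[value - 10] = value   # the value itself is the index (offset by 10)

    def contains(value):
        return table[value - 10] is not None

    put(42)
    print(contains(42), contains(43))  # True False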

Choosing a key

So the example in the previous paragraph is an ideal situation. You have only unique items, always, and the range spans only 91 values, making it still feasible to hold in memory... so what if your data ranges from 0 to 10,000,000,000? Well, let's not think of that.

One solution is to ditch this and use a linked list or something... but does it have to come to that? Not really... you can make a hash function, a function that maps values, or keys, to indexes. So for instance, say you have one item for every 100 values... you have values like 100, 200, 300, 400, ... 100,000,000... then all you need to do is make your array n/100 slots long, where n is your largest value. Now granted, it's still large, but you just reduced the space requirement 100-fold. Every time you want to find the value k, you can just access it by doing a[k/100].
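A sketch of that divide-by-100 scheme; the extra slot (index 0 goes unused) is my own bookkeeping:

    n = 100_000_000                 # largest key, as in the example above
    a = [None] * (n // 100 + 1)     # one slot per multiple of 100

    def put(k):
        a[k // 100] = k             # hash: divide the key by 100

    def get(k):
        return a[k // 100]

    put(300)
    print(get(300))  # 300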

So what if multiple items hash to the same index? This is called a collision. Just like on the highway with cars, this is bad. A good hash function is always measured by how few collisions you get and how much space you save. The two metrics are opposing forces mathematically... but knowing your data will allow for great hash functions... sometimes. Hash functions are an ongoing research topic, but simple ones can be derived easily at any level.

Collision resolution

So what if you have a collision? Just like with cars on highways... you can't just flee the scene of the collision and hope everything goes fine. There are many techniques, but the three most common are described here:

Linear Probing

Besides its reference in nerdcore to really explicit topics... it is a great way of managing collisions in hash tables. What this method does is find the next closest empty index in the array.

This is a good technique if you know that you will have more indexes than elements. It is easy to implement and not that hard to maintain. It does not, however, guarantee that everything inserted will have a place to go. If all your indexes get filled... and you want to add more... you're screwed.
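A minimal sketch of linear probing, assuming integer keys and a modulo hash (both my choices, not the lecture's):

    SIZE = 8
    slots = [None] * SIZE

    def insert(key):
        """Probe forward from the home slot until an empty index is found."""
        i = key % SIZE
        for _ in range(SIZE):
            if slots[i] is None:
                slots[i] = key
                return i
            i = (i + 1) % SIZE   # next closest index, wrapping around
        raise RuntimeError("table is full")  # all indexes filled: you're screwed

    def search(key):
        i = key % SIZE
        for _ in range(SIZE):
            if slots[i] is None:
                return None      # hit an empty slot: the key is absent
            if slots[i] == key:
                return i
            i = (i + 1) % SIZE
        return None

    insert(3); insert(11)        # 11 % 8 == 3, so 11 collides and lands at 4
    print(search(11))            # 4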

Chaining

This is one of the hardest resolutions to manage... since it creates a linked list at every single index. Each index becomes a linked list, and if there is a collision, you just add to that index's list.

This, despite its maintenance difficulties, does guarantee that absolutely every element can enter the hash table without getting yourself screwed. But it also increases the runtime complexity... every add, search, and essentially everything else must walk the index like a linked list, pushing your operation time toward n in the worst case, instead of just one operation.
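A minimal sketch of chaining, using Python lists to stand in for the linked lists:

    SIZE = 8
    buckets = [[] for _ in range(SIZE)]  # each index holds a chain of keys

    def insert(key):
        chain = buckets[key % SIZE]
        if key not in chain:     # walking the chain is the part that can cost n
            chain.append(key)

    def search(key):
        return key in buckets[key % SIZE]

    insert(3); insert(11)        # both hash to index 3; the chain absorbs it
    print(search(11), search(19))  # True False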

Multi-hashing

So you like hashing keys into indexes in your primary hash table, right? Why not do it at every index too? What this does is create a hash table within an index of the hash table every time there is a collision (granted that the keys are not the same).

This approach is good in that it keeps access close to constant time... but it can degrade down to linear, depending on how bad the hash functions are... you can just keep going deeper and deeper into nested hash tables, creating something like a linked list with a much harder contract to manage.
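A rough sketch of the idea, where each level must hash with a different function; the shift-based per-level hash and non-negative integer keys are purely assumptions for illustration:

    SIZE = 8

    def make_table(depth=0):
        return {"depth": depth, "slots": [None] * SIZE}

    def h(key, depth):
        # A different hash per level; shifting by the depth is an assumption.
        return (key >> depth) % SIZE

    def insert(table, key):
        i = h(key, table["depth"])
        slot = table["slots"][i]
        if slot is None:
            table["slots"][i] = key
        elif isinstance(slot, dict):
            insert(slot, key)        # descend into the nested sub-table
        elif slot != key:
            # Collision with a different key: replace the slot with a sub-table.
            sub = make_table(table["depth"] + 1)
            table["slots"][i] = sub
            insert(sub, slot)
            insert(sub, key)

    def search(table, key):
        i = h(key, table["depth"])
        slot = table["slots"][i]
        if isinstance(slot, dict):
            return search(slot, key)
        return slot == key

    t = make_table()
    insert(t, 3); insert(t, 11)  # both hash to 3 at depth 0: a sub-table appears
    print(search(t, 11))         # True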

Application

Hash tables are used every day in your computer by the CPU cache. The caches are, in essence, hash tables: part of a memory address serves as the index. Your L1 cache has the fewest indexes, and the index count increases as the cache level increases. There is a lot of extra machinery built in too... so it is usually not just a one-way hash table; it usually hashes more than once.

Hash tables are also used in operating systems to implement virtual memory, memory segmentation, and other very important optimizations in your hardware.

Data storage systems are, in some way, hash tables.

Databases, at many levels, are also hash tables.

For more detail on hash tables, refer to Wikipedia.