In my previous post I analyzed how complexity really makes a difference with some raw mathematical data and visualizations, to help programmers make the right efficiency choices.
I also promised that I wanted to make a post “leafing around the concepts of trees just a little”, and this post does exactly that.
Give a man a tree….
For those unfamiliar with trees and their nomenclature: a tree is a data structure composed of nodes. Each node holds pointers to its child nodes and, in many implementations, to its parent node. The topmost node, the one with no parent, is called the ‘root’; when a tree is built by repeated insertion, the root is the first node inserted.
A tree by definition can have any number of children per node, but I personally feel that once the maximum number of children per node exceeds two, the problem at hand becomes more of a graph problem. Hence I shall limit my discussion to trees with at most two children per node, i.e. binary trees. There are a few reasons for this, and not all of them are logical.
- A binary tree follows many logical and philosophical dualities: yin and yang, yes and no, him and her, true and false, 1’s and 0’s, etc. that make it relevant amongst the various types of trees.
- The gain in insertion, deletion and search time achieved by implementing a ternary tree (3 children per node) is minimal over a comparable binary tree (2 children). Both insert, search and delete in logarithmic time, i.e. O(log n): a binary tree performs these three operations in O(log n) base 2, a ternary tree in O(log n) base 3, and the two differ only by a constant factor. As the graph from my last post shows, the time saved is pretty minimal…and a binary tree is much easier to implement.
Binary trees also make very pretty fractal designs.
However, search trees whose nodes can have more than two children (such as B-Trees) do exist and are very widely used in databases due to their flexibility; but the upkeep is rather difficult for beginners. Hence, I will explore trees like B-Trees, radix trees and tries in other posts.
More than just binary trees, I want to focus on binary search trees: a very efficient data structure, and one whose simplicity of use and design I am particularly fond of:
A binary search tree (i.e. BST).
This is a binary tree, courtesy of the prestigious Carnegie Mellon University. The operations of searching, insertion and deletion on a binary search tree all take O(h) worst-case time, where ‘h’ is the height of the root, also called the height of the tree. It is also the depth of the tree, which is equal (in magnitude) to the maximum level of any node.
All these terms are very confusing.
No, really. I have no idea.
Luckily, this link and this link provide a correct, full set of terminology for the tree data structure, free for everyone to see on Wikipedia.
There are several types of binary search tree. Pay special attention to the perfect (not to be confused with complete) and balanced types of binary tree, as they are relevant.
So, why is searching, inserting and deletion from trees so efficient?
This question can be answered by asking why inserting and deleting random data from other data structures is so inefficient.
Inserting data in sorted order takes O(n) time in both arrays and linked lists: in an array you have to shift elements around to make room, and in a linked list you have to walk the list to find the position where the new data belongs.
Inserting into, deleting from and searching a binary search tree, meanwhile, is O(log n) (base 2, for balanced binary search trees). It is also said to be O(h), where h = height of the tree.
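To make this concrete, here is a minimal sketch of a BST with insertion and search. The class and function names are my own, purely illustrative, and duplicates are sent to the right for simplicity:

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None   # pointer to left child
        self.right = None  # pointer to right child

def insert(root, key):
    """Walk down from the root and attach the key as a new leaf; O(h) steps."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def search(root, key):
    """Follow a single root-to-leaf path, discarding one subtree per step."""
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None

root = None
for k in [8, 3, 10, 1, 6]:
    root = insert(root, k)
print(search(root, 6), search(root, 7))  # True False
```

Each comparison discards an entire subtree, which is exactly where the O(h) bound comes from.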
But this is just a function, it doesn’t mean much to us. I’m a fan of expressing numbers relative to one another, as it gives a much better perspective when a choice has to be made.
So let’s take an example:
It has been estimated that around 100 billion humans have lived on earth in total. Some people have had no children, and some people, like the utterly badass Mongol warchief Genghis Khan, are thought to be the ancestor to one out of every 200 men alive. But that’s just some people.
Suppose we assume every person of age has exactly two children, and all those below age have none. Let’s also take a break from reality for a sec, and assume that human reproduction is mitotic, like that of single-celled organisms.
“Don’t look at me, I’m still wondering how the little guy got here.”
If we created records for all these 100 billion people, and inserted them – in the order in which they were born – into a ‘universal tree’, which was sorted according to people’s names, we would get a huge binary search tree.
Now for the fun bit: if we wanted to find any one person’s profile from these 100 billion people, the maximum number of steps we’d have to take from the root node to that person’s node is O(log n). Assuming our tree is balanced, that is log (base 2) of 100 billion ≈ 37 steps.
We could find the records of any single person who ever existed in fewer than 38 steps, using a binary search tree. If a new baby is born, we can also insert their records into the correct place in the tree in 37 or fewer steps.
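A quick sketch to check the arithmetic:

```python
import math

# Worst-case steps to find one record among n entries in a balanced BST.
n = 100_000_000_000              # ~100 billion people
steps = math.ceil(math.log2(n))  # height of a balanced tree, rounded up
print(steps)                     # 37
```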
It gets better…
One of the best things about a binary search tree is that its data is kept in sorted order, and this lets us retrieve the data in sorted order, in O(n) time, by performing an inorder traversal of the tree.
Inorder traversal: start from F and follow the arrows.
Output: A, B, C, D, E, F, G, H, I
Now, there are several ways to perform such an inorder traversal; Wikipedia lists one with recursion and one without, both in pseudocode (the one without recursion uses an explicit stack of node pointers, whereas the recursive one uses the call stack implicitly). Both can easily be implemented in a high-level language: the “visit(node)” step could be replaced with a print statement, or a statement storing that node’s value in an array.
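Both variants might look something like this in Python (a sketch with an illustrative `Node` class, where `visit` is passed in as a callback):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def inorder_recursive(node, visit):
    """Recursive inorder: the call stack remembers where to come back to."""
    if node is None:
        return
    inorder_recursive(node.left, visit)
    visit(node)                      # "visit" could print, or append to a list
    inorder_recursive(node.right, visit)

def inorder_iterative(root, visit):
    """Iterative inorder, using an explicit stack of node pointers."""
    stack, node = [], root
    while stack or node is not None:
        while node is not None:      # slide down to the leftmost unvisited node
            stack.append(node)
            node = node.left
        node = stack.pop()
        visit(node)
        node = node.right

# A small BST:   4
#               / \
#              2   6
root = Node(4, Node(2), Node(6))
out = []
inorder_iterative(root, lambda n: out.append(n.key))
print(out)  # [2, 4, 6] -- sorted order
```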
This kind of traversal is what is called a depth-first search (DFS). It is a common searching technique you will often find when dealing with the graph data structure…because the binary search tree is a graph, albeit a simple one.
There are two more noteworthy depth-first traversals: preorder and postorder. I like the preorder more, because a preorder traversal visits the nodes in an order that, if re-inserted one by one, rebuilds the exact same tree. (Note: this only works if you don’t balance your tree.)
A quick reference (note: L always comes before R, i.e. you never go right before going left):
Preorder = V-L-R [Visit, Left, Right]
Inorder = L-V-R [Left, Visit, Right]
Postorder = L-R-V [Left, Right, Visit]
These kinds of traversals (depth-first) don’t work on infinite trees, i.e. trees with infinite depth, or structures which loop somewhere (such a looping structure is actually called a cyclic graph, and you will study it when you study graphs). For infinite trees and cyclic graphs, a breadth-first traversal (also called level-order traversal) is the only sensible form of traversal. It is also one of the most convenient ways to print your tree. The standard algorithm uses a queue of node pointers.
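A sketch of that queue-based algorithm (again with an illustrative `Node` class):

```python
from collections import deque

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def level_order(root):
    """Visit nodes level by level, using a FIFO queue of node pointers."""
    order = []
    queue = deque([root] if root else [])
    while queue:
        node = queue.popleft()       # oldest node in the queue is visited first
        order.append(node.key)
        if node.left:
            queue.append(node.left)
        if node.right:
            queue.append(node.right)
    return order

#        4
#       / \
#      2   6
#     / \
#    1   3
root = Node(4, Node(2, Node(1), Node(3)), Node(6))
print(level_order(root))  # [4, 2, 6, 1, 3]
```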
But there’s more.
Inorder traversal, like all traversals, will touch every node in the tree at least once (if only to print it). Using this property, and a suitably modified ‘Node’ class, you can compute the height, the depth, whether a node is a left or a right child, whether it has a sibling, the total number of leaf nodes, the number of non-leaf nodes, etc. for all the nodes of your tree, in one simple O(n) traversal.
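As a sketch of what one such pass might look like (my own illustrative names, computing just depth, height and leaf count):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        self.depth = 0   # filled in by annotate()

def annotate(node, depth=0):
    """One O(n) pass: sets .depth on every node on the way down, and
    returns (height, leaf_count) of the subtree on the way back up."""
    if node is None:
        return -1, 0     # empty subtree: height -1, no leaves
    node.depth = depth
    left_h, left_leaves = annotate(node.left, depth + 1)
    right_h, right_leaves = annotate(node.right, depth + 1)
    if node.left is None and node.right is None:
        leaves = 1       # this node is itself a leaf
    else:
        leaves = left_leaves + right_leaves
    return 1 + max(left_h, right_h), leaves

#        4
#       / \
#      2   6
#     /
#    1
root = Node(4, Node(2, Node(1), None), Node(6))
height, leaves = annotate(root)
print(height, leaves)  # 2 2
```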
I made a post to stackoverflow.com with the code (using recursion). You can view it, and add to the discussion (sign in with Google).
So, Binary Search Trees. They insert, delete, search and sort very efficiently, in O(log n) time.
Yes, they do.
Did I mention they are my favorite data structure 😀 ?
…What’s the catch?
There, er, is no catch.
WHAT’S THE CATCH!?!
This sort of…..sort, isn’t perfect.
I mean, the sort is perfect: it traverses the tree in O(n) time…but the very act of making the tree, inserting each node, is where we hit our pitfall. Inserting each node is supposed to take O(log n), or logarithmic, time. Inserting ‘n’ nodes should then take O(n log n) time. This would pit tree sort against other optimal comparison sorting algorithms, like quicksort and mergesort. But that isn’t always the case.
Suppose you entered your data in sorted, increasing order. The first element, x1, will become the root; then x2 (greater than x1) will be its right child; then x3 (greater than x2) will be x2’s right child, and so on and so forth.
So now, you get a tree with no left children. Such a tree is said to be skewed.
Inserting into a skewed tree will take O(n) time (big ‘O’ = worst case). The worst case is that each new element is greater than all the existing elements, so it becomes the rightmost node, and that this happens for each of the ‘n’ elements you enter.
Hence, the tree degrades into a linked list, with the extra space of our left pointers wasted as well. Worst of all, with a skewed tree we lose our incredible O(log n) insertion, deletion and search time; it becomes O(n). So, to create a tree with ‘n’ elements, we will take O(n^2) time if the tree is skewed. This is terribly slow.
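The degradation is easy to demonstrate; a small sketch (illustrative code, reusing the naive insert from earlier):

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Naive BST insert: no balancing whatsoever."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def height(node):
    if node is None:
        return -1
    return 1 + max(height(node.left), height(node.right))

skewed = None
for k in range(100):          # sorted input: every new node goes right
    skewed = insert(skewed, k)
print(height(skewed))  # 99 -- effectively a linked list, not log2(100) ~ 7
```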
Side note: the statement
inserting, deleting and searching a binary tree takes O(h) time, where ‘h’ is the height of the tree.
is always true, even for skewed trees. However, that is because the height is Ω(log n), not O(log n): it equals log (base 2) of the number of inputs in the best case (a perfectly balanced tree), but not in the worst case (a skewed tree). In the average case, though, ‘h’ is close to or even equal to log n, depending on the input.
Trees as arrays:
Implementing a tree as an array, as is done in heapsort, is an interesting concept, though it doesn’t solve our problem of O(n) insertion time for skewed trees.
When you implement a tree as an array, you gain a few things:
The random-access time of an array, which makes some operations easy to implement…heapsort in particular.
Implementing a tree as an array also makes breadth-first traversal extremely easy: you just go through the array from start to finish, and print every non-empty element.
And, if done carefully, you can maintain the O(logn) insertion, deletion and search time. (Note: this is left as an exercise to the reader, which is a way of saying “I think this is possible but I’m too lazy to try it out myself”.)
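The layout itself is simple; a sketch of the usual heapsort-style indexing scheme (my own names):

```python
# The root lives at index 0, and the children of index i sit at
# 2*i + 1 and 2*i + 2. None marks an absent child.
#
#        4
#       / \
#      2   6
#     / \
#    1   3
tree = [4, 2, 6, 1, 3, None, None]

def left(i):
    return 2 * i + 1

def right(i):
    return 2 * i + 2

def parent(i):
    return (i - 1) // 2

# Breadth-first traversal: just walk the array and skip the empty slots.
bfs = [key for key in tree if key is not None]
print(bfs)  # [4, 2, 6, 1, 3]
```

Note that the array must be sized for a perfect tree of the same height, which is exactly the space problem described below.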
The disadvantage of this method:
Since the size of an array is fixed, you have to make the assumption that your tree of height ‘h’ is perfect.
It doesn’t solve our problem with O(n) inserting time for a skewed tree.
The space wasted by implementing a skewed tree as an array is enormous: O(2^h) in the worst case, ‘h’ being the height. And as the table from my last post shows, this exceeds a million elements when ‘h’ is as little as 20. So for a fully skewed tree of just twenty nodes, you would need an array of around a million slots to store those twenty values in their correct positions.
So, is there actually a way to get over the O(n) insertion-deletion-search time barrier?
Yes, thankfully, there is: every time you insert into a tree, you check whether the tree is still balanced.
A balanced binary search tree is the opposite of a skewed tree: it keeps its nodes packed as tightly as possible, so that the tree as a whole has the least possible height for its data.
We could create a function to balance a tree separately when we notice that the tree is becoming increasingly skewed, but this is subject to a lot of fine-tuning and is likely time-consuming. Instead, we could implement one of the several self-balancing binary search trees that already exist.
A self-balancing tree is one that does the balancing automatically: on insertion of a new node, each affected node of the tree is checked for balance, using a balancing factor. If the tree is unbalanced, some operations are performed to get the search tree back into its balanced state.
The most widely-studied amongst these are AVL trees and Red-Black trees, though many others exist. AVL trees were developed first, in 1962, just two years after the first formal binary search tree was defined. Red-Black trees were developed a whole 10 years after AVL trees, in 1972. The two are similar, but not identical.
Both follow the same basic structure for inserting/deleting and balancing:
- Insert/delete node (make appropriate changes for deletion).
- Check if and where the tree is unbalanced, using a balancing factor (the relative heights of a node’s children for AVL trees, the red/black coloring for Red-Black trees).
- If unbalanced, perform the necessary tree rotation operations until it is balanced.
- Doing so, we get the tree back into a proper shape relative to the number of nodes it has, and not slanting to one side.
L-R and R-L Tree Rotation animations
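The rotations themselves are short; a sketch of the two basic ones (names are mine, and real AVL/Red-Black code wraps these with balance-factor or color bookkeeping):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def rotate_left(x):
    """x's right child y becomes the new subtree root; returns y.
    BST ordering is preserved: y's old left subtree slots in under x."""
    y = x.right
    x.right = y.left
    y.left = x
    return y

def rotate_right(y):
    """Mirror image: y's left child becomes the new subtree root."""
    x = y.left
    y.left = x.right
    x.right = y
    return x

# A right-skewed chain 1 -> 2 -> 3 becomes balanced after one left rotation.
root = Node(1, right=Node(2, right=Node(3)))
root = rotate_left(root)
print(root.key, root.left.key, root.right.key)  # 2 1 3
```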
AVL trees rebalance in O(log n) time – the same order as an insertion itself – but allow for faster searching, as they are kept closer to ‘perfect’, with a maximum height limited to ~1.44 lg(n+2) (mathematically proven).
Red-Black trees rebalance with only O(1) rotations and are built for faster insertions/deletions, spending less effort on keeping the tree tightly balanced. Their height is limited to ~2 lg(n+1) (mathematically proven). Hence, their lookups are somewhat slower.
With both AVL and Red-Black trees, an insertion or deletion – rebalancing included – still takes O(log n) time! Hence, we can maintain our tree, all the while keeping our O(log n) search, insertion and deletion time! 😀
Note: B-Trees can also be made self-balancing!
How to implement AVL trees, courtesy of GeeksforGeeks.
Welp, that’s all folks!
I would really like to go into depth about self-balancing trees, but unfortunately, I’ve run out of time and space! In future posts, I hope to fully explore AVL and Red-Black trees, as well as the all-important B-Trees and the strange-looking tries. Until then though, keep your browsers open and your thumbs pointed to the stars!