Trees

Trees are a very useful class of data structures. Like (singly) linked lists, trees are acyclic graphs of node objects in which each node may have some attached information. Whereas linked list nodes have zero or one successor nodes, tree nodes may have more. Having multiple successor nodes (called children) makes trees more versatile as a way to represent information and often more efficient as a way to find information within the data structure.

Like lists, trees are recursive data structures. A non-empty tree has a single root node that is the starting point for searching in the tree. All nodes in the tree are reachable from the root by following a path of edges. All nodes except the root have exactly one predecessor node, called its parent. The root is the only node with no parent.

It is customary to draw trees growing downward. The root node may be indicated by an incoming edge.

Because a tree is a recursive data structure, each node in the tree is the root of a subtree. For example, in the following tree, the node G has two children B and H, each of which is the root of a subtree. This tree is a binary tree in which each node has up to two children, which are implicitly distinguished as its left and right children. We say that a binary tree has a branching factor of 2. The children of a node are usually ordered from left to right, so B is the left child of G and root of the left subtree of G, and H is the right child of G and root of the right subtree of G. A node with no children is called a leaf.

Each node in a tree has a height and a depth. The depth of a node is the length of the (unique) path from the root down to that node. The length of a path is the number of edges in the path, so the depth of the root is always 0. The height of a node is the length of the longest path from that node to a leaf below it, so leaf nodes have height 0. The height of the tree is the height of its root, which is the same as the depth of its deepest leaf. Therefore, a tree consisting of a single (root) node has height 0. It is also possible to have an empty tree, with no nodes or edges; we will consider such a tree to have height −1. (It is possible, though less standard, to define the height of a tree as the number of nodes along the longest path to a leaf, so that an empty tree has height 0; for reasoning about asymptotic complexity, it almost never matters which definition is used.)

We can represent a binary tree using a class with instance variables for the left and right children:

BinaryNode
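The original code figure is not shown here; the following is a minimal sketch of what such a class might look like (the names `BinaryNode` and `data` are assumptions):

```java
// A binary tree node: each node holds a data element and references
// to its left and right children (null if the child is absent).
class BinaryNode<T> {
    T data;
    BinaryNode<T> left;
    BinaryNode<T> right;

    BinaryNode(T data) {
        this.data = data;
    }
}
```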

For trees with a larger branching factor, the children can be stored in another data structure such as an array or linked list:

NAry
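The original figure is not included; one hedged sketch, storing the children in a list (names are assumptions):

```java
import java.util.ArrayList;
import java.util.List;

// A tree node with an arbitrary branching factor: the children are
// kept in a list rather than in individually named fields.
class NaryNode<T> {
    T data;
    List<NaryNode<T>> children = new ArrayList<>();

    NaryNode(T data) {
        this.data = data;
    }
}
```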

Analogously to doubly linked lists, tree nodes in some tree data structures maintain a pointer to their parent node. A parent pointer can be useful for walking upward in a tree, though it takes up extra memory and creates additional maintenance requirements.

Why trees?

Tree data structures are important for two main reasons:

  1. Some information has a natural tree-like structure, so storing it in trees makes sense. Examples of such information include parse trees of expressions, inheritance and subtyping hierarchies, game search trees, and decision trees.

  2. Trees make it possible to find information within the data structure relatively quickly, when they can be arranged so that all nodes are fairly close to the root. For simplicity, let's think about a complete binary tree of height \(h\) in which all leaves are at equal depth \(h\) and all non-leaves have exactly two children. In a complete binary tree of height \(h ≥ 0\), there are \(1 + 2 + 4 + \cdots + 2^h = 2^{h+1} − 1\) nodes. Calling this number of nodes \(n\), we have \(h = \lg(n+1)-1\), so \(h\) is \(O(\log n)\). (Recall that \(\lg x\) is the base-2 logarithm of \(x\), and that \(O(\log_a n) = O(\log_b n)\) for any constant bases \(a\) and \(b\).) Thus, if we are looking for information in a complete binary tree, it can be reached along a path whose length is logarithmic in the total information stored in the tree.

    For large \(n\), logarithmic time is a big speedup over the linear-time performance offered by data structures like linked lists and arrays. For example, \(\lg 1,\!000,\!000\) is approximately 20, a speedup of around 50,000 (ignoring constant factors), and \(\lg 1,\!000,\!000,\!000\) is about 30, a speedup of more than 30,000,000.

Binary search trees

Efficiently searching for information in trees relies on knowing which path to take through the tree. This property can be obtained if the data stored in tree nodes can be ordered according to some ordering relation. For example, we might search for information by using a string value as the key, and order the strings by dictionary order. A search tree is a tree data structure that exploits ordering to allow finding information quickly.

A particularly important form of search tree is a binary search tree, a binary tree that satisfies the following data structure invariant.

Binary Tree Order Invariant: For each node \(n\) in the tree, all data elements stored in \(n\)'s left subtree are less than the element stored at \(n\), and all elements stored in \(n\)'s right subtree are greater than the element stored at \(n\).

We can visualize the tree relative to a given node as in this picture. Note that we've stopped drawing arrowheads on the tree edges, relying instead on the convention that edges go downward:

We can express this data structure invariant as a class invariant:

BST
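The original code figure is not shown; here is one hedged sketch of expressing the order invariant as a checkable class invariant. This version uses the common bounds-passing formulation, where `lo` and `hi` (with `null` meaning "unbounded") delimit the values permitted in a subtree:

```java
class BinaryNode<T extends Comparable<T>> {
    T data;
    BinaryNode<T> left, right;

    BinaryNode(T data) { this.data = data; }

    /**
     * Class invariant check: every element in this subtree lies
     * strictly between lo and hi, and both subtrees are themselves
     * valid BSTs. Call satisfiesInvariant(null, null) on the root.
     */
    boolean satisfiesInvariant(T lo, T hi) {
        if (lo != null && data.compareTo(lo) <= 0) return false;
        if (hi != null && data.compareTo(hi) >= 0) return false;
        return (left == null || left.satisfiesInvariant(lo, data))
            && (right == null || right.satisfiesInvariant(data, hi));
    }
}
```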

Since the invariant is defined on a recursive type, it applies recursively throughout the tree, ensuring that the data structure invariant holds for every node.

For the invariant to make sense, we must be able to compare two elements to see which is greater (in some ordering). One way to provide such an ordering is to require the parameter type T to extend the interface Comparable:

BST
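The original figure is not included; a hedged sketch of the class declaration it likely contained, using `java.lang.Comparable` (whose shape is shown in the comment):

```java
// The Comparable interface is generic; it mirrors java.lang.Comparable:
//
//     public interface Comparable<T> {
//         // Returns a negative, zero, or positive int according to
//         // whether this element is less than, equal to, or greater
//         // than o in the ordering.
//         int compareTo(T o);
//     }
//
// Requiring T to extend Comparable<T> lets BST code call compareTo():
class BinaryNode<T extends Comparable<T>> {
    T data;
    BinaryNode<T> left, right;

    BinaryNode(T data) { this.data = data; }
}
```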
Here, the ordering is specified by the operation compareTo(). To ensure that the type T has such an operation, we specify in the class declaration that T extends Comparable<T>, where Comparable<T> is the generic interface shown above. The keyword extends signifies that T is a subtype of Comparable<T>; thus, T can be any class that implements the interface. The compiler will prevent us from instantiating the class BinaryNode on any type T that is not a declared subtype of Comparable<T>, so the code of BinaryNode can assume that T has the compareTo() method.

A more flexible way to define the ordering is to provide it through a comparator object that implements the Comparator interface:

BST
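The original figure is missing; a hedged sketch of a node class that carries a `java.util.Comparator` supplying the ordering (field and constructor names are assumptions):

```java
import java.util.Comparator;

// A BST node whose ordering comes from a comparator object rather
// than from the element type itself.
class BinaryNode<T> {
    T data;
    BinaryNode<T> left, right;
    final Comparator<T> comparator; // defines the ordering on elements

    BinaryNode(T data, Comparator<T> comparator) {
        this.data = data;
        this.comparator = comparator;
    }
}
```

For example, a tree of strings could be built with `String.CASE_INSENSITIVE_ORDER` or with `Comparator.<String>reverseOrder()`, even though the class String itself fixes a single dictionary ordering.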

The comparator approach, which is an example of the Concept design pattern, has slightly higher storage requirements since the comparator must be remembered, but its advantage is that it allows the binary search tree to be used with any type for which a reasonable comparison operation can be defined, and not just with those types T that happen to declare that they extend Comparable<T>. This looser coupling can be helpful when you don't have control over the class that you would like to put into a tree, or when that class defines a notion of comparison that doesn't match the one you want. For example, the class String has a built-in ordering on strings, but for some applications you might want the reverse ordering, or an ordering that is case-insensitive.

Searching in a binary search tree

Consider what happens when we try to find a path down through the tree looking for an arbitrary element x. We compare x to the root data value. If x is equal to the data value, then we've already found it. If x is less than the data value and it's in the tree, it must be in the left subtree, so we can walk down to the left subtree and look for the element there. Conversely, if x is greater than the data value and it's in the tree, it must be in the right subtree, so we should look for it there. In either case, if the subtree where x must be is empty (null), the element is not in the tree. This algorithm can be expressed compactly as a (tail-)recursive method on BinaryNode:

BST
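The original code figure is missing; a hedged sketch of the tail-recursive search, written as a method of the node class and using the ternary-expression idiom discussed below:

```java
class BinaryNode<T extends Comparable<T>> {
    T data;
    BinaryNode<T> left, right;

    BinaryNode(T data) { this.data = data; }

    /** Returns whether x occurs in the subtree rooted at this node. */
    boolean contains(T x) {
        int c = x.compareTo(data);
        if (c == 0) return true;
        // By the BST invariant, x can only be in one subtree.
        BinaryNode<T> child = (c < 0) ? left : right;
        return (child == null) ? false : child.contains(x);
    }
}
```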

We've used Java's “ternary expression” here to make the code more compact (and to show off another coding idiom!). The expression b ? e1 : e2 is a conditional expression whose value is the result of either e1 or e2, depending on whether the boolean expression b evaluates to true or false.

Since the method is tail-recursive, we can also write it as a loop. Here is a version where the root of the tree is passed in explicitly as a parameter n:

contains-loop
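The figure is not shown; a hedged sketch of the loop version, with the root passed explicitly as a parameter n:

```java
class BinaryNode<T extends Comparable<T>> {
    T data;
    BinaryNode<T> left, right;

    BinaryNode(T data) { this.data = data; }

    /** Returns whether x occurs in the tree rooted at n. */
    static <T extends Comparable<T>> boolean contains(BinaryNode<T> n, T x) {
        while (n != null) {
            int c = x.compareTo(n.data);
            if (c == 0) return true;
            n = (c < 0) ? n.left : n.right; // walk down the correct subtree
        }
        return false; // ran off the bottom of the tree: x is not present
    }
}
```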

Adding an element to a binary search tree

For an element to be found later, it must be added somewhere along the search path that will be used to look for it. Thus, to add an element, we first search for it. If it is found, it need not be added; if it is not found, it is added as a new child of the node at which the search ran off the bottom of the tree. Again, this can be written easily as a tail-recursive method:

add
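The original code figure is not included; a hedged sketch of the tail-recursive add:

```java
class BinaryNode<T extends Comparable<T>> {
    T data;
    BinaryNode<T> left, right;

    BinaryNode(T data) { this.data = data; }

    /**
     * Adds x to the subtree rooted at this node, unless it is already
     * present: searches for x and attaches a new leaf node where the
     * search falls off the tree.
     */
    void add(T x) {
        int c = x.compareTo(data);
        if (c == 0) return; // already in the tree
        if (c < 0) {
            if (left == null) left = new BinaryNode<>(x);
            else left.add(x);
        } else {
            if (right == null) right = new BinaryNode<>(x);
            else right.add(x);
        }
    }
}
```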

To see how this algorithm works, consider adding the element 3 to the tree shown with the black edges in the following diagram. We start at the root (2) and go to the right (5) because 3 > 2. In the recursive call we then go to the left (3 < 5) to node 4. Since 3 < 4, we try to go to the left but observe left == null, and therefore create a left child containing 3, shown by the gray edge.

Asymptotic performance of binary search trees

The time required to search for elements in a search tree is in the worst case proportional to the longest path from the root of the tree to a leaf, or the height of the tree, \(h\). Therefore, tree operations take \(O(h)\) time.

With the simple binary search tree implementation we've seen so far, the worst performance is seen when elements are inserted in order, e.g.: add(1), add(2), add(3), ..., add(n). The resulting binary tree has only right children and has the performance characteristics of a linked list!

For this tree, \(h\) is \(O(n)\), and therefore the main tree operations are \(O(n)\). Our goal is logarithmic performance, which requires \(h\) to be \(O(\log n)\). A tree in which \(h\) is \(O(\log n)\) is said to be balanced. Fortunately, there are many ways to obtain balanced trees, and we will see some of them.

Using binary search trees to implement maps

A map abstraction lets us associate values with keys and look up the value corresponding to a given key. This functionality can be implemented easily using a binary tree. We simply store both the key and its associated value in a tree node and use the keys to order the nodes in the tree. Searching is performed as above, but only comparing keys; when the key is found, so is the associated value. Alternatively, we can view the element being stored in the tree as containing a key and a value, where elements are compared only by looking at their keys. This alternative view can be supported by using a Comparator that only compares keys.

Removing elements from a binary search tree

Removing elements from a tree is generally more complicated than adding them, because the elements to be removed are not necessarily leaves. The algorithm starts by first finding the node containing the value \(x\) to be removed and its parent node \(p\) (if any). Three cases must be considered:

  1. Node \(x\) is a leaf.

    In this case, we prune node \(x\) from the tree by setting the pointer to it from \(p\) to null. The other subtree of \(p\) (shown as a white triangle, though it may be empty) is unchanged.

  2. Node \(x\) has one child.

    We splice out node \(x\) from the tree by redirecting the pointer from \(p\) to \(x\) to now point to the single child of \(x\). Since the BST invariant guarantees that \(A < x < p < B\), this operation preserves the invariant.

  3. Node \(x\) has two children.

    In this case it is not as easy to remove node \(x\). Instead, we replace the data in the node with the data from either the immediate next or the immediate previous element in the tree. Without loss of generality, let's take the immediate next element; call it \(y\). Since \(y\) is the immediate next element, it must occur in the right subtree of node \(x\). To find the immediate next element \(y\) from \(x\), we walk one step down the tree to the right, then walk down to the left as far as possible. The last node we encounter is a node \(y\) that cannot possibly have a left child.

    Having found node \(y\), we remove it from the tree, either by pruning it (if it is a leaf) or by splicing it out (if it is not). Then, we overwrite the data in node \(x\) with the data from node \(y\).

    An alternative to overwriting the data in node \(x\) is to replace the entire \(x\) node with the \(y\) node, which requires updating three pointers: one child pointer in node \(p\) and both child pointers in the \(y\) node. This alternative is preferable if the data to be overwritten is large or if some external client has a reference to the original \(x\) node.

Removing an element from the tree requires walking along a path from the root to a leaf in the worst case, plus some constant number of updates to the tree, so the time required is \(O(h)\), as with searching and adding.
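The three cases above can be sketched in code. This hedged version uses a common alternative style in which remove returns the root of the updated subtree, so the parent's pointer is redirected by the caller instead of tracking the parent \(p\) explicitly:

```java
class BinaryNode<T extends Comparable<T>> {
    T data;
    BinaryNode<T> left, right;

    BinaryNode(T data) { this.data = data; }

    /** Removes x from the subtree rooted at n, returning the new root. */
    static <T extends Comparable<T>> BinaryNode<T> remove(BinaryNode<T> n, T x) {
        if (n == null) return null; // x was not in the tree
        int c = x.compareTo(n.data);
        if (c < 0) n.left = remove(n.left, x);
        else if (c > 0) n.right = remove(n.right, x);
        else if (n.left == null) return n.right;  // cases 1 and 2: prune or splice
        else if (n.right == null) return n.left;  // case 2: splice out n
        else {
            // Case 3: two children. Find the immediate next element y by
            // stepping right once, then walking left as far as possible.
            BinaryNode<T> y = n.right;
            while (y.left != null) y = y.left;
            n.data = y.data;                   // overwrite x's data with y's
            n.right = remove(n.right, y.data); // prune or splice out y
        }
        return n;
    }
}
```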

Supporting duplicate elements

For some applications, it may be useful to allow duplicate elements or keys. To allow equal keys to be stored in the same tree, we need to relax the BST invariant slightly. Given a node containing value \(x\), we must know whether to go left or right to find the other occurrences of \(x\). To build a tree where we go left, we relax the BST invariant so that the left subtree contains keys less than or equal to \(x\), whereas the right subtree contains keys strictly greater than \(x\). Given a key \(k\), all of the nodes with the same key can be found in time \(O(h)\) by walking down from the root.

N-ary search trees

It is possible to define search trees with more than two children per node. With a higher branching factor, paths through the tree are shorter, but the algorithms and invariants involved become considerably more complex. B-trees are an example of an N-ary search tree structure. In an N-ary search tree, each node contains up to \(N-1\) elements \(e_1,...,e_{N-1}\) and has up to \(N\) children \(c_0,...,c_{N-1}\) arranged so that the subtrees of the children contain only elements between successive elements at the node. If a node has \(n\) children, the node contains \(n-1\) elements obeying the following ordering invariant:

\[ c_0 < e_1 < c_1 < e_2 < c_2 < e_3 < \cdots < e_{n-1} < c_{n-1} \]

Note that a formula such as \(c_0 < e_1 <c_1\) is intended to mean that all elements in subtree \(c_0\) are less than \(e_1\) and all elements in \(c_1\) are greater than \(e_1\).

To search for a given element, we do a binary search on the elements \(e_i\) (which are stored in ascending order). If the element is not found, the result of the comparisons to these elements, in conjunction with the invariant, indicates which child subtree to search.
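A hedged sketch of this search (the node representation and names are assumptions; a real B-tree would add size constraints and rebalancing):

```java
import java.util.List;

// An N-ary search tree node: elements are stored in ascending order,
// and children.get(i) holds only elements between elements i-1 and i.
class NaryNode<T extends Comparable<T>> {
    List<T> elements;           // n-1 elements, ascending
    List<NaryNode<T>> children; // n children, or null at a leaf

    NaryNode(List<T> elements, List<NaryNode<T>> children) {
        this.elements = elements;
        this.children = children;
    }

    boolean contains(T x) {
        // Binary search for x among this node's elements.
        int lo = 0, hi = elements.size();
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            int c = x.compareTo(elements.get(mid));
            if (c == 0) return true;
            if (c < 0) hi = mid; else lo = mid + 1;
        }
        // Not at this node: by the invariant, x can only be in child lo.
        return children != null && children.get(lo).contains(x);
    }
}
```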

Toward balanced trees

As observed above, binary search tree operations take \(O(h)\) time, which could be as bad as \(O(n)\) time. What are our options for ensuring that trees are balanced?

Algorithms exist for efficiently rebalancing trees, relying on adding additional bookkeeping information to the tree data structure. For example, AVL trees and Red–Black trees are two commonly used augmented versions of binary search trees, with accompanying algorithms for rebalancing them. However, they add significant complexity.

A simple way to obtain usually balanced search trees is to insert elements into the tree in a random order. Random insertion turns out to create trees whose expected height is \(O(\log n)\). (Proving this is outside the present scope, but see [1, Chapter 12.4] or [2, Lecture 13] for a proof.) Of course, shuffling the elements randomly requires that all of them be available ahead of time. And if the elements do not already arrive in a random order, we need to know how to shuffle a collection of elements into a random order, which involves a small digression to learn a useful algorithm.

How to place a sequence of elements in random order

The Fisher–Yates algorithm (developed in 1938) shuffles \(N\) elements into random order. Recall that there are \(N! = 1 × 2 × 3 × \cdots × N\) possible permutations of \(N\) elements; a perfectly random shuffle should have equal probability \(1/N!\) of producing any given permutation. The algorithm works as follows. Assume we have the \(N\) elements in an array \(a\). We iterate the array index \(i\) from \(N-1\) down to \(1\), deciding for each index which element to place there. We randomly choose an index \(j\) between \(0\) and \(i\) inclusive, and swap the chosen element \(a[j]\) with the element \(a[i]\) at the current index:

fisher-yates.java
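The original fisher-yates.java is not shown; a hedged sketch shuffling an int array with `java.util.Random`:

```java
import java.util.Random;

class Shuffle {
    /**
     * Fisher–Yates shuffle: for i from the last index down to 1, swap
     * a[i] with a uniformly chosen a[j] where 0 <= j <= i.
     */
    static void shuffle(int[] a, Random rng) {
        for (int i = a.length - 1; i >= 1; i--) {
            int j = rng.nextInt(i + 1); // uniform in 0..i
            int tmp = a[j];
            a[j] = a[i];
            a[i] = tmp;
        }
    }
}
```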

The first iteration generates one of \(N\) possible values, the second iteration one of \(N-1\) possible values, and so on until the final iteration generates one of two possible values. Therefore, the total number of possible ways for the algorithm to execute is \( N × (N-1) × (N-2) × \cdots × 2 = N!\), which is equal to the total number of possible permutations of the array elements. Furthermore, given a particular permutation, there is exactly one way for the algorithm to produce it, and all distinct executions of the algorithm are equally likely. Therefore, all permutations are produced with equal probability, assuming the random number generator is truly random.

Traversing trees

Given a tree containing some number of elements, it is sometimes useful to traverse the tree, visiting each element and doing something with it, such as printing it out or adding it to another collection.

The most common traversal strategy is in-order traversal, in which each element is visited between the elements of its left and right subtrees. In-order traversal can be expressed easily using recursion:

inorder
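The figure is missing; a hedged sketch in which visit() simply collects each element into a list, standing in for whatever processing is needed:

```java
import java.util.List;

class BinaryNode<T> {
    T data;
    BinaryNode<T> left, right;

    BinaryNode(T data) { this.data = data; }

    /** In-order traversal: left subtree, then this node, then right. */
    void inorder(List<T> visited) {
        if (left != null) left.inorder(visited);
        visit(visited); // visit this element between the two subtrees
        if (right != null) right.inorder(visited);
    }

    void visit(List<T> visited) { visited.add(data); }
}
```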

(Here the visit method is just some function that does something with the data.) For example, consider using this algorithm on the following binary search tree:

The elements will be visited in the order 1, 3, 5, 10, 11, 17.

The traversal is not tail-recursive and therefore cannot be easily converted into a loop, but this is not a problem unless the tree is very deep. If iterative traversal is required, it can be done if nodes contain parent pointers, or with a stack to remember the nodes in the path from the root down to the node currently being visited.

Note that an in-order traversal of a binary search tree visits the elements in sorted order. This observation gives us an asymptotically efficient sorting algorithm. Given a collection of elements to sort, we first shuffle them into a random order, then add them one at a time into a BST, taking \(O(hn)\) time. Because the elements are in random order, the expected height h is \(O(\log n)\), so adding all the elements takes \(O(n \log n)\) time. Then we can use an in-order traversal, taking \(O(n)\) time, to extract all the elements in their proper order. While no comparison-based sorting algorithm is asymptotically more efficient than \(O(n \log n)\), we will later see some other sorting algorithms that are just as asymptotically efficient but have lower constant factors.

Since it is easy to visit elements in sorted order, search trees also support efficient range queries. Given two elements, we can find all elements between them, in sorted order, by modifying the in-order traversal to skip parts of the tree that lie outside the desired range. If a node's element is above the range, its right subtree can be skipped entirely; symmetrically, if the element is below the range, its left subtree can be skipped.
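A hedged sketch of such a range query (the method name and list-collecting style are assumptions):

```java
import java.util.List;

class BinaryNode<T extends Comparable<T>> {
    T data;
    BinaryNode<T> left, right;

    BinaryNode(T data) { this.data = data; }

    /** Collects, in sorted order, all elements in [lo, hi]. */
    void range(T lo, T hi, List<T> out) {
        // Descend left only if some left-subtree element could be >= lo.
        if (data.compareTo(lo) > 0 && left != null) left.range(lo, hi, out);
        if (data.compareTo(lo) >= 0 && data.compareTo(hi) <= 0) out.add(data);
        // Descend right only if some right-subtree element could be <= hi.
        if (data.compareTo(hi) < 0 && right != null) right.range(lo, hi, out);
    }
}
```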

Other traversals than in-order can be done. By visiting a node before visiting either of its children, we obtain preorder traversal. By visiting a node after visiting both of its children, we obtain postorder traversal.

pre
post
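The two figures are missing; hedged sketches of both traversals, again collecting elements into a list in place of a general visit operation:

```java
import java.util.List;

class BinaryNode<T> {
    T data;
    BinaryNode<T> left, right;

    BinaryNode(T data) { this.data = data; }

    /** Preorder: visit this node before either subtree. */
    void preorder(List<T> visited) {
        visited.add(data);
        if (left != null) left.preorder(visited);
        if (right != null) right.preorder(visited);
    }

    /** Postorder: visit this node after both subtrees. */
    void postorder(List<T> visited) {
        if (left != null) left.postorder(visited);
        if (right != null) right.postorder(visited);
        visited.add(data);
    }
}
```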

For example, a preorder traversal of the tree above would visit the nodes in the order 5, 3, 1, 11, 10, 17, and a postorder traversal would visit the nodes in the order 1, 3, 10, 17, 11, 5.

Postorder traversals have the useful property that a node is visited only after all of its descendants; conversely, a preorder traversal visits a node before all of its descendants.

Parent pointers

So far we have only had pointers going downward in the tree. It is sometimes handy to have pointers going from nodes to their parents, much as nodes in a doubly linked list contain pointers to their predecessor nodes. Parent pointers can simplify some algorithms, but they do take up space and require enforcing additional invariants: in particular, that the parent of the root node is null and that for any node \(n\), the parents of its children (if any) are equal to \(n\):

\[ \text{if \(\texttt{n}\) is not the root node,}~ \texttt{n} ∈ \{\texttt{n.parent.left}, \texttt{n.parent.right}\} \] \[ \text{if \(\texttt{n.left} \neq \texttt{null}\),}~ \texttt{n.left.parent} = \texttt{n} \] \[ \text{if \(\texttt{n.right} \neq \texttt{null}\),}~ \texttt{n.right.parent} = \texttt{n} \]

Using parent pointers, it is possible to implement tree traversals in an iterative way, without relying on a stack. When the last child of a node has been visited, the algorithm follows the parent pointer from the node to its parent and continues with the next child of the parent.

References

[1] Cormen, Leiserson, Rivest, Stein. Introduction to Algorithms.
[2] Kozen. Design and Analysis of Algorithms.