Random Binary Search Trees and Treeps


RANDOM BINARY SEARCH TREES

All basic operations on a binary search tree run in O(h) time, where h is the height (or depth) of the tree.

A randomly built binary search tree on n distinct keys is a binary search tree that arises from inserting the keys in random order into an initially empty tree, where each of the n! permutations of the input keys is equally likely. Therefore, the construction of the actual binary search tree is done by the standard "insertion" method. The only difference is that the input is entered at random. This works only if all keys are initially available and no additional insertions/deletions take place on the tree.

What is your intuition about the expected maximum depth (~height of root) of a randomly constructed binary search tree?

The depth of a particular node is defined to be the length of the path from the root to the node in question. Therefore, it is obvious from this definition, that there exists a relation between the depth of a node and time. Also, the height of a particular node is the maximal distance from that node to any of the leaves in its subtree. The height of a tree is the height of the root of the tree, and it is equal to the maximum depth of any node in the tree. This proves that there is a connection between height and depth. Now, if Dn is the depth of the nth inserted node, the following shows that the expected value of Dn is not more than 2+2log(n) where log is the natural logarithm.

Let us start by constructing a binary search tree with the following, randomly chosen, list of integers:

3 2 14 15 12 5 6 11 9 10 8.

Using this input sequence, we want to determine which keys (integers) are on the path from the root to the last node. In this example, we arbitrarily chose the number 8 to be the last node. We will denote the last node (8) as xn.

First, observe that every node to the left of xn must be less than x n and every node to the right of xn must be greater than xn . This property must hold because it is the definition of a binary search tree. Therefore, let's divide the random list into two sublists:

Although this may seem obvious, it is important to note that a subset of a random permutation is, in itself, a random permutation. Table 1 (below) shows the results of the divisions.

RANDOM LIST 3 2 14 15 12 5 6 11 9 10 8
LEFT OF Xn 3 2 - - - 5 6 - - - -
RIGHT OF n - - 14 15 12 - - 11 9 10 -
Table 1. Separation of keys into two sublists about the key xn.

Now, let's go through the list one node at a time.

This process is continued until we get to the end of the list. Therefore, the above sequence of random numbers corresponds to the following binary search tree:

Nodes on the path that are to the left of xn make up the up-records to xn and those that are on the path to the right of xn make up the down-records to xn.

In this case, it is easy to find the depth of xn by just looking at the sublists in Table 1 (above). Analyze the left sublist first. Go through this list and circle all keys that are on xn's path. These keys are 3, 5, and 6. Do the same with the right sublist. These keys are 14, 12, 11, and 9. We now have the following:

If we count the number of circled keys, this will tell us the depth of our node (the node with the number 8). In this case, we have 7 circled keys. Therefore, the depth of our node, D 8, is 7.

Mathematically, the expected number of up-records in a random permutation is equal to (count probabilities for each one of the numbers):

Prob ( X1 is a record ) + Prob ( X2 is a record ) +. . . + Prob ( Xn is a record )

= 1 + 1/2 + 1/3 + 1/4 + . . . . . + 1/n

* Notice that this series is a harmonic series ! Why?

Suppose we have a list of rainfall figures for a hundred years. How many record-breaking falls of rain do you expect have taken place over that period? We assume that the rainfall figures are random, in the sense that the amount of rain in any one year has no influence on the rainfall in any subsequent year.


The first year was undoubtedly a record year. In the second year, the rain could equally likely have been more than, or less than, the rainfall of the first year. So there is a probability of 1/2 that the second year was a record year. The expected number of record years in the first two years of record-keeping is therefore 1 + 1/2. Go on to the third year. The probability is 1/3 that the third observation is higher than the first two, so the expected number of record rainfalls in three years is 1 + 1/2 + 1/3. Continuing this line of reasoning leads to the conclusion that the expected number of records in the list of observations is

= 1 + 1/2 + 1/3 + 1/4 + . . . . . + 1/n

< 1 + log(n) ( This is the natural Log)

Therefore, the number of up-records in the set "LEFT OF Xn" is not more than 1+ log (n). Similarily, the number of down-records in the set "RIGHT OF Xn" is not more than 1+ log (n) also.

This implies that the expected value of Dn is less than 2 ( 1 + log(n) ) .

Thus, Expected Value of Dn < ( 2 + 2log(n) )

*The 1+log(n) bound (above) was obtained by a method of summation called "comparison with an integral".


Assume all keys are not initially available and insertions/deletions may occur throughout the lifetime of the BST. Is it possible to still use a randomized BST and take advantage of its height balancing properties?

Treeps

A node to be inserted in a Treep is assigned a random priority. Nodes are inserted into the treep using both the bst property on keys and the priority of the node as in a min heap - thus the name.

So think of a treep in the following way. If nodes, { x1, x2, .., xn}, are inserted, then the resulting treep is a normal binary search tree with the node order given by the randomly chosen priorities. Just as with AVL and Red-Black trees, additional information is maintained for height balancing. See example trace from lecture

Modifications 2/23/06 D. Byrnes

Copyright © 1997, Gaetano Zullo, Gabriel Kevorkian, Chris Foote. All rights reserved. Reproduction of all or part of this work is permitted for educational or research use provided that this copyright notice is included in in any copy.