Sorting and Decision Trees
Sorting a list of numbers is one of those classic problems of computer science, which is both practically relevant as well as a useful showcase for various algorithm design techniques. Perhaps the most natural approach is insertion sort, which builds a sorted list from the unsorted input one entry at a time, for a worst case time complexity of O(n2), where \(n\) is the length of the input. Slightly more sophisticated is merge sort, which is a divide and conquer method easily verified to succeed in O(n log(n)) time. And let’s not forget the classic QuickSort, which is a randomized algorithm that works in worse case O(n2) time (but the average case is O(n log n)).
The curious might wonder: can we do better? Presumably any sorting algorithm must look at the whole list of input, and so requires at least n steps. To reframe the question then, it possible then for our worst case complexity to be somewhere between O(n), which we know to be required, and O(n log n), which we know to be possible?
In general, no. At least not for comparison sorting algorithms, which proceed according to yes/no answers to questions like “is a[i] < a[j]”, where `a` is the input list.
When thinking about such matters, it’s important to pick an appropriate level of abstraction. The usual state of affairs, in discussing the complexity of algorithms, is something like this: one takes for granted that “simple” arithmetic operations, memory accesses, etc take some fixed constant time, and the goal is to count these. It would be too impractical to account exactly for every such operation, so one settles for the dominant behavior, up to a constant factor (e.g. what notation like \(O(n \log n)\) provides).
For our purposes, this is still too fine-grained. For starters, we would like to make a claim about all sorting algorithms, or at least a sufficiently rich subset of them, and so it seems hopeless to resort to counting the number of memory accesses without even knowing what exactly the algorithm is.
So what can we assume? Well, most (but not all!) sorting algorithms end up executing code that at many points looks like this:
if a[i] < a[j]:
<foo>
else:
<bar>
where a
is the input list (an array), i
and j
are some variables, and <foo>
and <bar>
are blocks of code – perhaps containing recursive function calls – which themselves invoke similar if statements.
With this in mind, we make the following simplification: we’re just going to count the number of comparisons, i.e. points where we check something like a[i]
< a[j]
and then branch on the outcomes. This is pretty convenient, since
- as far as lower bounds go, it’s fair game to ignore the other work involved by an algorithm and just count these comparisons
- the runtime of algorithms of interest tends to be proportional to the number of comparisons (in particular, we’re not throwing away too much information with this simplification).
Although seemingly a forced move by our desire to make general statements about sorting algorithms, the focus on comparisons admits another benefit: we can reason about algorithms by considering their corresponding decision trees.
When our algorithm is executed on an input list, a sequence of if statements are navigated. We can represent this as a tree, where each internal node represents a comparison, and branching is based on the results of these comparisons.
On any specific input array, the algorithm has an effect of traversing the tree from the root to a leaf, perhaps performing some other work along the way. Of course, a different input could take us to a different leaf, but the tree captures the full set of possibilities. By the time the algorithm has advanced to a leaf, it has ultimately performed enough swaps in the array to sort it. Since we’re only interested in counting the number of comparisons involved, there’s no loss of generality in just supposing that each leaf node represents some sort of permutation of the original array, without any particular concern for how it is implemented.
The number of comparisons made on a given input is the length of the path from root to leaf. The height of the tree is thus the worst-case query complexity is the height of the tree.
What does this buy us? Consider that a general purpose sorting algorithm should be able to handle any input, and so could in principle result in the application of any permutation. This means that for input of length \(n\), the corresponding decision tree needs at least \(n!\) leaves – if there were any less, then there would be a permutation \(\pi\) that our algorithm couldn’t execute, and so one could imagine an input array that we cannot sort (e.g. take the input array to be \(\pi^{-1} [1,2,\ldots,n]\)).
But these decision trees are binary trees. A binary tree of height \(h\) has at most \(2^h\) leaves (perhaps less, e.g. if it is unbalanced). From our previous considerations. this implies \[ 2^h \geq n!. \]
Aha! With a logarithm we can work out a lower bound for \(h\): \[ h \geq \log_2 n!. \]
The last step is to get this into a slightly more convenient form. By Stirling’s formula, we have \(n! \sim \sqrt{2 \pi n} (n/e)^n\), and from this one can work out \(\log_2 n! \geq c n \log(n)\) where \(c\) is a convenient constant.
So we have seen that the height of the decision tree, and (equivalently) the number of comparisons made in the worst case, is \(\Omega(n \log n)\).
There is a constant \(c > 0\) such that, for any comparison sorting algorithm \(\mathcal{A}\) and length \(n\), there is an input array \(x\) of length \(n\) such that \(\mathcal{A}(x)\) involves at least \(c n \log n\) comparisons.
It’s worth pointing out that this is only a bound on comparison sorts, and can be beaten in special cases where there is some additional structure to the problem (e.g. knowledge that all items being sorted are integers in some fixed range).