de Bruijn Sequences

Posted on September 5, 2022 by Erik Davis
Tags: python, combinatorics

I recently came across a neat bit trick, Using de Bruijn Sequences to Index a 1 in a Computer Word. As mentioned in the paper, the three basic ingredients are

  1. Isolating a single 1 bit in a word. If the word is \(n\) bits, then there are \(n\) possibilities here.
  2. Use a hash function to map from the \(n\) possible words to their corresponding indices.
  3. Such a hash function can be designed using a de Bruijn sequence.

I won’t go into any more details, because the paper is extremely readable. However, it did get me thinking a bit about de Bruijn sequences.

de Bruijn Sequences

Nicolaas de Bruijn was a Dutch mathematician best known for his work in combinatorics and logic. (About the name, “Bruijn” means “brown” and you can hear it pronounced here).

The sequences that bear his name are cyclic sequences with the property that, for some integer \(k\) and alphabet \(\mathcal{A}\), every length \(k\) string of \(\mathcal{A}\) appears as a substring exactly once. The number \(k\) is sometimes called the window size.

To keep things simple, we restrict attention to binary de Bruijn sequences, i.e. those with alphabet \(\mathcal{A} = \{0,1\}\). An example with \(k=3\) is 00011101. Any length 3 binary string appears exactly once as a substring, e.g. 011 appears starting at index 2. Because we consider 00011101 as a cyclic sequence, some of these windows wrap around, e.g. 010 appears at index 6 and ends at index 0. Because each substring appears only once, if we slide a window of length 3 through our string we enumerate all bitstrings of length 3: 000, 001, 011, 111, 110, 101, 010, 100.

The Number of de Bruijn Sequences

de Bruijn sequences always exist, but need not be unique. Naturally, a cyclic shift preserves the properties of the sequence, so that 00011101 shouldn’t be considered different from 00111010. Even so, we can still find distinct sequences, e.g. 10111000 is another \(k=3\) sequence which is not a cyclic shift of 00011101. One of the things that de Bruijn did was count the number of distinct sequences, which turns out to be

\[ 2^{2^{k-1}-k}. \]

Constructing de Bruijn Sequences

There is an easy construction of certain de Bruijn sequences, valid for any \(k\), presented below.

def debruijn(k: int) -> list[int]:
    """ Construct a binary de Bruijn sequence with window size k. """
    n = 2**k
    seq = [0]*k
    cur = 0
    seen = {}
    while True:
        seen[cur] = True
        next = (cur << 1) % n
        if next+1 not in seen:
            cur = next+1
        elif next not in seen:
            cur = next
    return seq[:n]

The basic idea is that at each step, we keep track of a \(k\)-bit number cur, which indicates the current window. We consider the two possible values for the next window (shift and add 1, or shift and add 0), emitting either a 1 or 0 as we transition. A table seen is maintained to avoid revisiting windows that we have already visited. We stop when this gets stuck, and drop the extraneous few bits at the end.

Running debruijn(4) gives [0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1].

A graphical explanation

What’s going on in the above algorithm? It’s useful to look at the following graph (courtesy of Wikipedia)

Each vertex corresponds to a window of length \(3\), and an edge goes from \(x\) to \(y\) if there is a binary string of length \(4\) with prefix \(x\) and suffix \(y\) (e.g. 000 and 001 are joined by an edge labeled 1, indicating the sequence 0001).

With respect to our above algorithm, cur is a vertex in this graph, and next represents candidate neighbor vertices. When we update cur = ... we are traversing an edge. We emit a bit in our sequence when we do so, and we avoid visiting the same vertex twice (until we can’t, at which point we halt).

In short, we are computing an Eulerian cycle in a de Bruijn graph. Because each vertex of the de Bruijn graph has two entry edges and two exit edges, such a cycle is guaranteed to exist. Our start and end vertex is 000, and our policy of first trying the edge labeled 1 before 0 ensures that we “keep away” from 000 to avoid shorter cycles that don’t visit every edge. Without this policy, one could imagine something like the traversal 000 -> 001 -> 000 which is a cycle but is not an Eulerian cycle.