de Bruijn Sequences
I recently came across a neat bit trick, Using de Bruijn Sequences to Index a 1 in a Computer Word. As mentioned in the paper, the three basic ingredients are
- Isolating a single 1 bit in a word. If the word is \(n\) bits, then there are \(n\) possibilities here.
- Use a hash function to map from the \(n\) possible words to their corresponding indices.
- Such a hash function can be designed using a de Bruijn sequence.
I won’t go into any more details, because the paper is extremely readable. However, it did get me thinking a bit about de Bruijn sequences.
de Bruijn Sequences
Nicolaas de Bruijn was a Dutch mathematician best known for his work in combinatorics and logic. (About the name, “Bruijn” means “brown” and you can hear it pronounced here).
The sequences that bear his name are cyclic sequences with the property that, for some integer \(k\) and alphabet \(\mathcal{A}\), every length \(k\) string of \(\mathcal{A}\) appears as a substring exactly once. The number \(k\) is sometimes called the window size.
To keep things simple, we restrict attention to binary de Bruijn sequences, i.e. those with alphabet \(\mathcal{A} = \{0,1\}\). An example with \(k=3\) is 00011101
. Any length 3 binary string appears exactly once as a substring, e.g. 011
appears starting at index 2. Because we consider 00011101
as a cyclic sequence, some of these windows wrap around, e.g. 010
appears at index 6 and ends at index 0. Because each substring appears only once, if we slide a window of length 3 through our string we enumerate all bitstrings of length 3: 000
, 001
, 011
, 111
, 110
, 101
, 010
, 100
.
The Number of de Bruijn Sequences
de Bruijn sequences always exist, but need not be unique. Naturally, a cyclic shift preserves the properties of the sequence, so that 00011101
shouldn’t be considered different from 00111010
. Even so, we can still find distinct sequences, e.g. 10111000
is another \(k=3\) sequence which is not a cyclic shift of 00011101
. One of the things that de Bruijn did was count the number of distinct sequences, which turns out to be
\[ 2^{2^{k-1}-k}. \]
Constructing de Bruijn Sequences
There is an easy construction of certain de Bruijn sequences, valid for any \(k\), presented below.
def debruijn(k: int) -> list[int]:
""" Construct a binary de Bruijn sequence with window size k. """
n = 2**k
seq = [0]*k
cur = 0
seen = {}
while True:
seen[cur] = True
next = (cur << 1) % n
if next+1 not in seen:
seq.append(1)
cur = next+1
elif next not in seen:
seq.append(0)
cur = next
else:
break
return seq[:n]
The basic idea is that at each step, we keep track of a \(k\)-bit number cur
, which indicates the current window. We consider the two possible values for the next window (shift and add 1, or shift and add 0), emitting either a 1
or 0
as we transition. A table seen
is maintained to avoid revisiting windows that we have already visited. We stop when this gets stuck, and drop the extraneous few bits at the end.
Running debruijn(4)
gives [0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1]
.
A graphical explanation
What’s going on in the above algorithm? It’s useful to look at the following graph (courtesy of Wikipedia)
Each vertex corresponds to a window of length \(3\), and an edge goes from \(x\) to \(y\) if there is a binary string of length \(4\) with prefix \(x\) and suffix \(y\) (e.g. 000
and 001
are joined by an edge labeled 1
, indicating the sequence 0001
).
With respect to our above algorithm, cur
is a vertex in this graph, and next
represents candidate neighbor vertices. When we update cur
= ...
we are traversing an edge. We emit a bit in our sequence when we do so, and we avoid visiting the same vertex twice (until we can’t, at which point we halt).
In short, we are computing an Eulerian cycle in a de Bruijn graph. Because each vertex of the de Bruijn graph has two entry edges and two exit edges, such a cycle is guaranteed to exist. Our start and end vertex is 000
, and our policy of first trying the edge labeled 1
before 0
ensures that we “keep away” from 000
to avoid shorter cycles that don’t visit every edge. Without this policy, one could imagine something like the traversal 000 -> 001 -> 000
which is a cycle but is not an Eulerian cycle.