# de Bruijn Sequences

I recently came across a neat bit trick, Using de Bruijn Sequences to Index a 1 in a Computer Word. As mentioned in the paper, the three basic ingredients are

- Isolating a single 1 bit in a word. If the word is \(n\) bits, then there are \(n\) possibilities here.
- Use a hash function to map from the \(n\) possible words to their corresponding indices.
- Such a hash function can be designed using a de Bruijn sequence.

I won’t go into any more details, because the paper is extremely readable. However, it did get me thinking a bit about de Bruijn sequences.

## de Bruijn Sequences

Nicolaas de Bruijn was a Dutch mathematician best known for his work in combinatorics and logic. (About the name, “Bruijn” means “brown” and you can hear it pronounced here).

The sequences that bear his name are cyclic sequences with the property that, for some integer \(k\) and alphabet \(\mathcal{A}\), every length \(k\) string of \(\mathcal{A}\) appears as a substring exactly once. The number \(k\) is sometimes called the *window size*.

To keep things simple, we restrict attention to binary de Bruijn sequences, i.e. those with alphabet \(\mathcal{A} = \{0,1\}\). An example with \(k=3\) is `00011101`

. Any length 3 binary string appears exactly once as a substring, e.g. `011`

appears starting at index 2. Because we consider `00011101`

as a cyclic sequence, some of these windows wrap around, e.g. `010`

appears at index 6 and ends at index 0. Because each substring appears only once, if we slide a window of length 3 through our string we enumerate all bitstrings of length 3: `000`

, `001`

, `011`

, `111`

, `110`

, `101`

, `010`

, `100`

.

## The Number of de Bruijn Sequences

de Bruijn sequences always exist, but need not be unique. Naturally, a cyclic shift preserves the properties of the sequence, so that `00011101`

shouldn’t be considered different from `00111010`

. Even so, we can still find distinct sequences, e.g. `10111000`

is another \(k=3\) sequence which is not a cyclic shift of `00011101`

. One of the things that de Bruijn did was count the number of *distinct* sequences, which turns out to be

\[ 2^{2^{k-1}-k}. \]

## Constructing de Bruijn Sequences

There is an easy construction of certain de Bruijn sequences, valid for any \(k\), presented below.

```
def debruijn(k: int) -> list[int]:
""" Construct a binary de Bruijn sequence with window size k. """
n = 2**k
seq = [0]*k
cur = 0
seen = {}
while True:
seen[cur] = True
next = (cur << 1) % n
if next+1 not in seen:
seq.append(1)
cur = next+1
elif next not in seen:
seq.append(0)
cur = next
else:
break
return seq[:n]
```

The basic idea is that at each step, we keep track of a \(k\)-bit number `cur`

, which indicates the current window. We consider the two possible values for the next window (shift and add 1, or shift and add 0), emitting either a `1`

or `0`

as we transition. A table `seen`

is maintained to avoid revisiting windows that we have already visited. We stop when this gets stuck, and drop the extraneous few bits at the end.

Running `debruijn(4)`

gives `[0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1]`

.

## A graphical explanation

What’s going on in the above algorithm? It’s useful to look at the following graph (courtesy of Wikipedia)

Each vertex corresponds to a window of length \(3\), and an edge goes from \(x\) to \(y\) if there is a binary string of length \(4\) with prefix \(x\) and suffix \(y\) (e.g. `000`

and `001`

are joined by an edge labeled `1`

, indicating the sequence `0001`

).

With respect to our above algorithm, `cur`

is a vertex in this graph, and `next`

represents candidate neighbor vertices. When we update ```
cur
= ...
```

we are traversing an edge. We emit a bit in our sequence when we do so, and we avoid visiting the same vertex twice (until we can’t, at which point we halt).

In short, we are computing an Eulerian cycle in a de Bruijn graph. Because each vertex of the de Bruijn graph has two entry edges and two exit edges, such a cycle is guaranteed to exist. Our start and end vertex is `000`

, and our policy of first trying the edge labeled `1`

before `0`

ensures that we “keep away” from `000`

to avoid shorter cycles that don’t visit every edge. Without this policy, one could imagine something like the traversal `000 -> 001 -> 000`

which is a cycle but is not an Eulerian cycle.