How to write a Bloom filter in C++

April 11, 2016 - April 11, 2016

Bloom filters are data structures which can efficiently determine whether an element is possibly a member of a set or definitely not a member of a set. This article will cover a simple implementation of a C++ bloom filter.

It’s not going to cover what bloom filters are or much of the math behind them, as there are other great resources covering those topics.

The interface

Let’s first define the interface of our Bloom filter. We’ll define three public functions:

A constructor
A function to add an item to the Bloom filter
A function to query whether an item is possibly in the bloom filter

We’ll also declare a few member variables, including a bit vector to hold the state of the filter.

#include <vector>

struct BloomFilter {
  BloomFilter(uint64_t size, uint8_t numHashes);

  void add(const uint8_t *data, std::size_t len);
  bool possiblyContains(const uint8_t *data, std::size_t len) const;

private:
  uint8_t m_numHashes;
  std::vector<bool> m_bits;
};

One thing to note is that std::vector<bool> is a space efficient specialization of std::vector, only requiring one bit per element (as opposed to one byte in typical implementations.)

Update: nate_martin from HN suggested templating this struct as an extension. Instead of hardcoding the key type and hash function, we can template the class with something along the lines of

template< class Key, class Hash = std::hash<Key> >
struct BloomFilter {
  ...
};

This will allow the Bloom filter to be generalized to more complex datatypes. Thanks Nate!

Bloom filter parameters

There are two parameters used to construct the Bloom filter:

The filter size, in bits
The number of hash functions to use

We can calculate the false positive error rate, $p$, based on the size of the filter, $m$, the number of hash functions, $ k $, and the number of elements inserted, $n$, with the formula¹: $$ p \approx (1 - \mathrm{e}^{\frac{-kn}{m}})^k $$

This formula isn’t very useful in its current form. We’d like to be able to know how large the filter should be and how many hash functions to use, given an estimated set cardinality and error tolerance. There are two equations we can use to calculate these parameters: $$ m = -\frac{n \ln{p}}{(\ln{2})^2}$$ $$ k = \frac{m}{n}\ln{2}$$

The implementation

There is just one more bit of theory necessary. You may be wondering how we can implement $k$ hash functions; double hashing may be used to generate $k$ hash values without affecting false positive probability! This is summarized using the formula, where $i$ is an ordinal, $m$ is the size of the Bloom filter, and $x$ is the value to be hashed: $$ \text{hash}_i(x,m) = (\text{hash}_a(x) + i \times \text{hash}_b(x)) \mod{m}$$

You can read more about the proof in this paper.

First, let’s write the constructor. It just records the parameters and resizes the bit array.

#include "BloomFilter.h"
#include "MurmurHash3.h"

BloomFilter::BloomFilter(uint64_t size, uint8_t numHashes)
      : m_bits(size),
        m_numHashes(numHashes) {}

Next, let’s write a function to calculate the 128 bit hash of a given item. We’ll be using MurmurHash3, a 128 bit hash function which has good tradeoffs between performance, distribution, avalanche behavior, and collision resistance. As this function generates a 128 bit hash, and we need 2 64 bit hashes, we can split the returned hash in half to get $\text{hash}_a(x)$ and $\text{hash}_b(x)$!

std::array<uint64_t, 2> hash(const uint8_t *data,
                             std::size_t len) {
  std::array<uint64_t, 2> hashValue;
  MurmurHash3_x64_128(data, len, 0, hashValue.data());

  return hashValue;
}

Now that we have the hash values, we need to write a function to return the output of the $n^{\text{th}}$ hash function.

inline uint64_t nthHash(uint8_t n,
                        uint64_t hashA,
                        uint64_t hashB,
                        uint64_t filterSize) {
    return (hashA + n * hashB) % filterSize;
}

All that’s left to do is write the functions to set/check bits for given items!

void BloomFilter::add(const uint8_t *data, std::size_t len) {
  auto hashValues = hash(data, len);

  for (int n = 0; n < m_numHashes; n++) {
      m_bits[nthHash(n, hashValues[0], hashValues[1], m_bits.size())] = true;
  }
}

bool BloomFilter::possiblyContains(const uint8_t *data, std::size_t len) const {
  auto hashValues = hash(data, len);

  for (int n = 0; n < m_numHashes; n++) {
      if (!m_bits[nthHash(n, hashValues[0], hashValues[1], m_bits.size())]) {
          return false;
      }
  }

  return true;
}

Results

With a 4.3MB Bloom filter and 13 hash functions, inserting 1.8M items took roughly 189 nanoseconds per element on my laptop. The false positive rate also matched the theoretical value.

The Wikipedia article on this topic is great. ↩︎