The Bloom filter has already been discussed a lot on Habr. Let me remind you: it is a data structure that lets you check whether an element belongs to a set without storing the element itself. False positives are possible, but a negative answer is always reliable. A filter with a 1% false-positive rate needs only a few bits per element.
This structure is often used to limit the number of requests to the data warehouse, cutting off requests for items that are not there. In addition, it can be used for an approximate calculation of the number of unique events, users, views, etc.
There are more examples of interesting applications as well. However, there are difficulties that can deter web developers from using a Bloom filter.
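To make the size/accuracy trade-off concrete, the textbook formulas m = -n·ln(p) / (ln 2)² and k = (m/n)·ln 2 give the filter length and the optimal number of hash functions (these are standard Bloom filter formulas, not specific to the solution described below); a minimal sketch:

```python
import math

def bloom_params(n: int, p: float):
    """Optimal Bloom filter size m (bits) and hash count k
    for n elements at target false-positive rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = round(m / n * math.log(2))
    return m, k

# 1 million elements at a 1% false-positive rate:
m, k = bloom_params(1_000_000, 0.01)
print(m, k)  # about 9.6 bits per element, 7 hash functions
```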
Storage
All instances of the web application on all servers need access to a shared data set, with the ability to quickly set or check several bits in a large bit vector. In addition, the data must be reliably persisted to disk.
Hashing
To get close in practice to the theoretical operating parameters of a Bloom filter, hashing of the source element has to be done carefully. A filter of length m bits with k hash functions requires forming k independent values from the source element, uniformly distributed over 0 to m-1.
Most implementations I have seen use, at best, a 32-bit MurmurHash, salting it with the hash's ordinal number to produce each successive hash. There is no evidence that this approach does not increase the asymptotic false-positive probability.
In addition, a 32-bit hash caps the filter size at 2^32 bits, i.e. 512 megabytes. That is not little, but not that much either, especially if you need a very low false-positive rate and/or an impressive number of elements in the filter.
In my opinion, there are not many ways to obtain an arbitrary number of hash functions with values in the desired range that use only the information entropy of the original message:
- Cut the output of one sufficiently wide function into equal parts. The function must then have a parametric output width in bits, so that no useful bits have to be discarded.
- Keep only two independent variable-width hash functions h1 and h2 and derive the functions g_i by the formula g_i = h1 + i*h2 (modulo the number of filter positions). I learned about this method only recently.
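The second scheme (often called Kirsch-Mitzenmacher double hashing) is easy to sketch. Here the two halves of a SHA-256 digest stand in for h1 and h2; that choice of base hash is my own for illustration, not something the article prescribes:

```python
import hashlib

def bloom_indexes(element: bytes, k: int, m: int):
    """Derive k bit positions in [0, m) from two base hash values
    via the double-hashing formula g_i = h1 + i*h2 (mod m)."""
    digest = hashlib.sha256(element).digest()
    h1 = int.from_bytes(digest[:8], "big")
    # Keep h2 odd so all k strides stay distinct when m is a power of two.
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]

# 10 bit positions in a 1 GiB (2^33-bit) filter:
print(bloom_indexes(b"Hello Habr!", k=10, m=8 * 2**30))
```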
Performance
For this construct to be useful in shielding the storage from requests, it must respond at least an order of magnitude faster than the storage it protects. Implementing it in the interpreted languages web developers usually work with can turn into a difficult and, above all, non-core and unwanted task.
Proposed Solution
I want to offer you a solution that makes it convenient to integrate Bloom filters into your web application's infrastructure.
This is a container server that stores the Bloom filter in RAM and on disk. Access to filter operations is provided via HTTP.
Storage persistence is provided by snapshots that do not block the daemon's operation. A snapshot is written to a temporary file with a unique name and replaces the old one only after the write completes. The data set is saved on a timer, on server shutdown, and when the process receives the USR1 signal.
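The write-to-temp-then-rename pattern described above can be sketched like this (a general illustration of the technique in Python, not the server's actual C code):

```python
import os
import tempfile

def save_snapshot(path: str, data: bytes) -> None:
    """Write data to a temporary file in the same directory,
    then atomically replace the previous snapshot."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".snap-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the swap
        # Atomic on POSIX: readers see either the old or the new file,
        # never a partially written one.
        os.replace(tmp, path)
    except BaseException:
        try:
            os.unlink(tmp)  # clean up the temp file on failure
        except FileNotFoundError:
            pass
        raise
```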
The hash generation algorithm is to cut an MD6 hash of length k*w bits into k keys of w bits each. The number of MD6 rounds is the default.
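MD6 is not available in most standard libraries, but the same "one wide hash, cut into k keys" idea can be sketched with SHAKE-256, which also has a parametric output width (using SHAKE as a stand-in is my substitution for illustration, not the server's actual algorithm):

```python
import hashlib

def keys_from_wide_hash(element: bytes, k: int, w: int):
    """Request a single k*w-bit digest and slice it into k keys of w bits each,
    so no useful bits are discarded."""
    assert w % 8 == 0, "keep the key width byte-aligned for simplicity"
    step = w // 8
    digest = hashlib.shake_256(element).digest(k * step)  # k*w bits total
    return [int.from_bytes(digest[i * step:(i + 1) * step], "big")
            for i in range(k)]

# 10 keys of 32 bits each from one 320-bit digest:
print(keys_from_wide_hash(b"Hello Habr!", k=10, w=32))
```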
The HTTP interface was chosen for ease of integration on the developer side: most development platforms have mature tools for making synchronous, asynchronous, and parallel (curl-multi) HTTP requests. Despite its simplicity, the interface inherits the advantages of the HTTP standard and supports pipelined requests within a single connection.
The application I am proposing has been extensively tested on production systems and can be considered stable.
The server is written in C using libevent2. On an Intel Pentium G4400 @ 3.30 GHz, the application shows ~35,000 RPS in Apache Benchmark when checking and adding 300-byte elements.
Supported and tested platforms:
- GNU / Linux (various versions and distributions)
- FreeBSD 8, 9, 10
- Mac OS X 10.11
- Solaris 11
Examples
Starting the daemon with default parameters (1 GB, 10 keys: enough for 500,000,000 items with a false-positive probability of 0.1%):
$ bloom habrahabr.snap
Creating space ...
Initializing new file storage ...
Saving snapshot ...
Initial snapshot written.
Check an item:
$ curl http://localhost:8889/check?e=Hello%20Habr%21
MISSING
Add an item:
$ curl http://localhost:8889/add?e=Hello%20Habr%21
ADDED
Check again:
$ curl http://localhost:8889/check?e=Hello%20Habr%21
PRESENT
Check and add in a single request:
$ curl http://localhost:8889/checkthenadd?e=unexistent
MISSING
$ curl http://localhost:8889/checkthenadd?e=unexistent
PRESENT
Alternative solutions
There are libraries that run on the application side and keep the shared data in Redis, manipulating bits in Redis bitmaps with the SETBIT and GETBIT commands.
Their advantages in comparison:
- Hash computation happens on the web-application side, so it does not load the storage server with computational work. (Though not always: in some cases the hash is computed by a Lua stored procedure.)
- They use Redis, which is well known to many, has a rich feature set, and offers scripting and clustering.
Disadvantages:
- Redis limits the length of a bit field to 512 megabytes, and all implementations I could find (1, 2) are designed to use only a single Redis key. This limits the filter size.
- At the moment, such libraries do not exist for all languages and development platforms.
- These solutions are unsuitable for multi-language projects: each implementation hashes in its own way, so the filters are incompatible with each other.
- Performance. These solutions use one of two strategies for talking to Redis: either they set the bits for each key through a pipeline, or they call a Lua stored procedure in Redis that computes the hashes and issues the commands. In the first case, even at a Redis speed of 200,000 operations per second, such solutions start losing on filters with six or more keys, since each bit is a separate operation. In the second case it is the same, plus time is spent launching the script, and on a check-then-add the hashes are computed twice. However, the situation may improve with the arrival of the BITFIELD command in Redis 3.2.0, released in May 2016, which allows several bit operations to be performed in a single command.
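For illustration, here is how the k per-bit SETBIT round trips collapse into one BITFIELD command. This sketch only builds the argument list that a client would send; the key name is hypothetical and no Redis connection is made:

```python
def bitfield_set_args(key: str, offsets):
    """Build a single BITFIELD command that sets several bits at once,
    instead of issuing one SETBIT round trip per bit."""
    args = ["BITFIELD", key]
    for off in offsets:
        # "u1" means an unsigned 1-bit field at the given bit offset.
        args += ["SET", "u1", str(off), "1"]
    return args

print(bitfield_set_args("bloom", [5, 1000, 31337]))
```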