📜 ⬆️ ⬇️

Bloom filter for web developers

On Habré already talked a lot about the Bloom filter . Let me remind you that this is a data structure that allows you to check whether an element belongs to a set without keeping the element itself. There is a possibility of a false positive response, but a negative response is always reliable. In the filter with an accuracy of 1%, only a few bits per element are required.

This structure is often used to limit the number of requests to the data warehouse, cutting off requests for items that are not there. In addition, it can be used for an approximate calculation of the number of unique events, users, views, etc. More examples of interesting applications.

However, there are difficulties that can deter web developers from applying the Bloom filter.

Storage

All web application instances on all servers should have access to some common data set with the ability to quickly change or check several bits in a large bit vector. In addition, the data must be securely saved to disk.
')
Hashing

In order to practice getting closer to the theoretical working parameters of the Bloom filter, you need to do a neat job with hashes from the source element. A filter of length m bits c k hash functions requires that you form k independent values ​​from the source element, uniformly distributed from 0 to m -1.

Most implementations that I saw use 32-bit murmurhash at best, adding salt for each successive hash, depending on the number of that hash. There is no evidence that this approach does not increase the asymptotic probability of a false positive.

In addition, if a 32-bit hash is used, this sets the upper limit of the filter size to 512 megabytes. This is not small, but not much, especially if you need a very low percentage of false positives and / or an impressive number of elements in the filter.

In my opinion, there are not very many ways to get an arbitrary number of hash functions with values ​​in the desired range, and so that they use only the information entropy of the original message:
  1. Cut in equal parts the output of one fairly wide function. In this case, the function must have a parametric output width in bits, so that it does not have to discard the useful bits.
  2. Have only two independent hash functions of variable width h 1 and h 2 and form the derivatives of the function g i by the formula h 1 + i * h 2 (modulo the number of function values). I learned about this method recently from here .


Performance

In order for this construct to be used in deterring requests to the repository, it must respond at least an order of magnitude faster than the protected repository.

When implemented in interpreted languages ​​that web developers most often operate with, it can become difficult, and most importantly, a non-core and unnecessary task.

Proposed Solution


I want to offer you a solution that allows you to conveniently integrate Bloom filters into the infrastructure of your web application.

This is a container server that stores the Bloom filter in RAM and on disk. Access to filter operations is provided via HTTP.

Storage persistence is provided by snapshots that do not block the daemon's work. Snapshot is written to a temporary file with a nontrivial name and replaces the old one only after the end of the recording. The data set is saved by a timer, when the server is shut down and when the process receives a USR1 signal.

The algorithm for generating hashes - cutting MD6 hash with a length of k * w bits into k keys for w bits. The number of MD6 rounds is standard.

The HTTP interface was chosen for ease of integration on the developer side: most development platforms have advanced tools for making synchronous, asynchronous, and parallel (curl-multi) HTTP requests. Despite the simplicity of the interface, it took over all the advantages of the HTTP standard, and allows pipelined requests within a single connection.

The application I have proposed has been extensively tested on production systems and can be considered as stable.

The server is written in C using libevent2. On an Intel® Pentium® CPU G4400 @ 3.30GHz processor, the application shows ~ 35000 RPS in Apache Benchmark to check and add 300-byte elements.

Supported and tested platforms:


Examples

Starting the daemon with default parameters (1 GB, 10 keys - for 500,000,000 items with a false positive probability of 0.1%):
 $ bloom habrahabr.snap
 Creating space ...
 Initializing new file storage ...
 Saving snapshot ...
 Initial snapshot written.


Check item:
 $ curl http: // localhost: 8889 / check? e = Hello% 20Habr% 21
 MISSING


Add an item:
 $ curl http: // localhost: 8889 / add? e = Hello% 20Habr% 21
 ADDED


Check again:
 $ curl http: // localhost: 8889 / check? e = Hello% 20Habr% 21
 PRESENT


We check the availability and add one request:
 $ curl http: // localhost: 8889 / checkthenadd? e = unexistent
 MISSING
 $ curl http: // localhost: 8889 / checkthenadd? e = unexistent
 PRESENT


Alternative solutions



There are libraries that run on the application side and contain common data in Redis. They manipulate the bits in the Redis bitmaps with the SETBIT and GETBIT .

In comparison, their advantages:

Minuses:

Source: https://habr.com/ru/post/304800/


All Articles