📜 ⬆️ ⬇️

PHP bloom filter

What is it?


Wikipedia says:
This is a probabilistic data structure invented by Burton Bloom in 1970, which allows storing multiple elements compactly and verifying that a given element belongs to a set. In this case, it is possible to get a false-positive response (there is no element in the set, but the data structure reports that it exists), but not a false-negative one.



And simpler


This is a way to check the existence of an element in a huge sample.


How it works?


Pretty simple. We have a bit object of a certain length. We also define several unique hash functions. Each sample value passes through each hash function, which returns a position in the bit string and sets it to 1.
Later, when we need to find out if there is an element in the sample, we simply skip it through all the hash functions. If at all positions of the bit string was 1, then the object is “possibly present”, if at least in one place was 0, then there is no object.
On Wikipedia, there is a model to determine the length of the bit string and the number of hash functions for a particular sample size and the percentage of false positive response.
')

Where can this be applied?


I apply a filter to check for the existence of a word or phrase in a dictionary of> 10,000,000 words. Google uses it in its search engine.
In general, the positive moment of the Bloom filter in its speed, when the ratio of Insert / Check operations is more than 0.001 and checks is more than 10,000. But this is only compared with the standard in_array in PHP.
The advantages are obvious: speed, less load on the disk, less memory consumption.
Minuses: long loading of values, no possibility of deletion (there is a solution)

Why another bike? I am sure, we have already done the Bloom filter in php.


Yes, they did. I spent benchmarks and my implementation was ahead of everything in speed and memory consumption that I was able to start. On the uniqueness of hash algorithms, I generally keep silent.


What is the salt?


There are actually two problems.
1. How to store a bit string
2. We need a unique hash.

1. The problem here is rather how to store the bit string for the delete option. In this case, the counter is used and the increment is set instead of 1 when the object is added. I decided to use the alphabet. But deleting a thing is not the best, elements are removed to which the filter responds "possibly present" and if it was a false positive response, we will delete a non-existent object.
2. I am not an expert on hashes and selected a hash function for 2 parameters.
The speed of work is probably the most important indicator. Uniqueness , when creating 1000 unique hashes, on one and the same string, they must produce 1000 different values. I, unfortunately, achieved only 64% of uniqueness. The competition did not rise above 40%.
And this is just:
abs( crc32( md5($this->seed[0] . $string) ) ) % $size; 

Uniqueness is created using the random line $ this-> seed [0].

Where to try?


You are welcome on Github .
And try:

 include 'bloom.class.php'; $parameters = array( 'entries_max' => 2 //     2 ,     0.1% ); $bloom = new Bloom($parameters); // ,     $bloom->set('Some string'); //   echo $bloom->has('Some string'); //true // ,   Bloom     counter $bloom->delete('Some string'); 

Available options:

 /* // entries_max (int)  .  : 100. error_chance (float) (0;1)  .  : 0.001. counter (boolean)     ,     .  : false. //     set_size (int)   .  : . hash_count (int)   .  : . //  ,   ['hash'] strtolower (boolean)       .  : true; */ 


Attached are examples, unit tests, and benchmarks (comparing with in_array in two ways and with three other libraries). Many of you are able to write it yourself, but maybe someone will not want to ride a bike.
Bonus: benchmark for a 50,000 sample with 25,000 checks against in_array without a counter and with a counter.

Source: https://habr.com/ru/post/170705/


All Articles