Good day to all.
We launched a new course -
“Algorithms for Developers” , designed for those to tighten their knowledge of various structures and algorithms for data processing, solving algebraic problems and dynamic programming problems for various languages. So today we are sharing a short note about the work of the Bloom filter in Java.
Introduction')
In this article we will look at the structure of the Bloom filter from the Guava library. The Bloom filter is a probabilistic data structure with efficient memory usage that we can use to answer the question “Is this element contained in a set?”.
There
are no false negatives in Bloom’s filter, so if he returns false, you can be 100% sure that this element is not in the set.
However, the Bloom filter
can return false positives , so when returning true, there is a high probability that the element really exists in the set, but the probability is not 100%.
To learn more about how the Bloom filter works, check out
this guide.
Maven addictionWe will use the Guava-implementation of the Bloom filter, so add a Guava-dependency.
The latest version can be found in
Maven Central .
<dependency> <groupId>com.google.guava</groupId> <artifactId>guava</artifactId> <version>22.0</version> </dependency>
Why use a bloom filter?Bloom filter is designed to be
economical and fast . When using it, we can specify the
probability of false positive responses, which we are ready to accept , and, in accordance with the configuration, the Bloom filter will take as little memory as possible.
Due to its economy, the Bloom filter fits easily in memory even with a large number of elements. Some databases, including Cassandra and Oracle, use this filter for initial checking before accessing a disk or cache, for example, when a request is received for a specific ID.
If the filter says that the ID does not exist, the database may stop processing the request and return to the client. Otherwise, the query goes to the disk and returns the item found.
Creating a Bloom filterSuppose we want to create a Bloom filter for no more than 500 integers, and we are tripled by a one percent (0.01) probability of false positives.
To do this, we can use the
BloomFilter
class from the Guava library. You must pass the number of elements reported to the filter, and the desired probability of false positives:
BloomFilter<Integer> filter = BloomFilter.create( Funnels.integerFunnel(), 500, 0.01);
Now add some numbers to the filter:
filter.put(1); filter.put(2); filter.put(3);
We added only three elements and determined the maximum number of inserts - 500, so our Bloom filter should produce fairly accurate results. Test it using the
mightContain()
method:
assertThat(filter.mightContain(1)).isTrue(); assertThat(filter.mightContain(2)).isTrue(); assertThat(filter.mightContain(3)).isTrue(); assertThat(filter.mightContain(100)).isFalse();
As the name implies, we cannot be 100% sure that this element really exists in the filter if the method returns true.
In our case, when
mightContain()
returns true, 99%, that the element is in the filter, and 1%, that the result is false positive. If the filter returns false, you can be 100% sure that the item is missing.
Bloom supersaturated filterWhen designing a Bloom filter, it is important to provide a reasonably accurate value for the expected number of elements. Otherwise, our filter will return false positives more often than we would like. Consider an example.
Suppose we create a filter with the desired probability of false positives equal to one percent and the expected number of elements equal to five, but then insert 100,000 elements:
BloomFilter<Integer> filter = BloomFilter.create( Funnels.integerFunnel(), 5, 0.01); IntStream.range(0, 100_000).forEach(filter::put);
Expecting such a small number of elements, the filter will take very little memory.
However, after adding a much larger number of elements, the filter becomes supersaturated and has a higher probability of producing a false positive result than the desired one percent.
Note that the
mightContatin()
method returns true even for a value that we have not previously inserted into the filter.
ConclusionIn this quick tutorial, we looked at the probabilistic nature of the Bloom filter's data structure using the Guava implementation.
The implementation of all these examples and code snippets can be found in the
GitHub project .
This is a Maven project, so import and launch should be easy.
THE ENDWe are waiting for comments and questions, which, as always, you can leave here, and go on an
open day to the course instructor
Mikhail Gorshkov .