Filter Bloom in Java using Guava

Good day to all.

We launched a new course - “Algorithms for Developers” , designed for those to tighten their knowledge of various structures and algorithms for data processing, solving algebraic problems and dynamic programming problems for various languages. So today we are sharing a short note about the work of the Bloom filter in Java.

Introduction
')
In this article we will look at the structure of the Bloom filter from the Guava library. The Bloom filter is a probabilistic data structure with efficient memory usage that we can use to answer the question “Is this element contained in a set?”.

There are no false negatives in Bloom’s filter, so if he returns false, you can be 100% sure that this element is not in the set.

However, the Bloom filter can return false positives , so when returning true, there is a high probability that the element really exists in the set, but the probability is not 100%.

To learn more about how the Bloom filter works, check out this guide.

Maven addiction

We will use the Guava-implementation of the Bloom filter, so add a Guava-dependency.
The latest version can be found in Maven Central .

<dependency> <groupId>com.google.guava</groupId> <artifactId>guava</artifactId> <version>22.0</version> </dependency>

Why use a bloom filter?

Bloom filter is designed to be economical and fast . When using it, we can specify the probability of false positive responses, which we are ready to accept , and, in accordance with the configuration, the Bloom filter will take as little memory as possible.

Due to its economy, the Bloom filter fits easily in memory even with a large number of elements. Some databases, including Cassandra and Oracle, use this filter for initial checking before accessing a disk or cache, for example, when a request is received for a specific ID.

If the filter says that the ID does not exist, the database may stop processing the request and return to the client. Otherwise, the query goes to the disk and returns the item found.

Creating a Bloom filter

Suppose we want to create a Bloom filter for no more than 500 integers, and we are tripled by a one percent (0.01) probability of false positives.

To do this, we can use the BloomFilter class from the Guava library. You must pass the number of elements reported to the filter, and the desired probability of false positives:

 BloomFilter<Integer> filter = BloomFilter.create( Funnels.integerFunnel(), 500, 0.01);

Now add some numbers to the filter:

 filter.put(1); filter.put(2); filter.put(3);

We added only three elements and determined the maximum number of inserts - 500, so our Bloom filter should produce fairly accurate results. Test it using the mightContain() method:

 assertThat(filter.mightContain(1)).isTrue(); assertThat(filter.mightContain(2)).isTrue(); assertThat(filter.mightContain(3)).isTrue(); assertThat(filter.mightContain(100)).isFalse();

As the name implies, we cannot be 100% sure that this element really exists in the filter if the method returns true.

In our case, when mightContain() returns true, 99%, that the element is in the filter, and 1%, that the result is false positive. If the filter returns false, you can be 100% sure that the item is missing.

Bloom supersaturated filter

When designing a Bloom filter, it is important to provide a reasonably accurate value for the expected number of elements. Otherwise, our filter will return false positives more often than we would like. Consider an example.

Suppose we create a filter with the desired probability of false positives equal to one percent and the expected number of elements equal to five, but then insert 100,000 elements:

 BloomFilter<Integer> filter = BloomFilter.create( Funnels.integerFunnel(), 5, 0.01); IntStream.range(0, 100_000).forEach(filter::put);

Expecting such a small number of elements, the filter will take very little memory.
However, after adding a much larger number of elements, the filter becomes supersaturated and has a higher probability of producing a false positive result than the desired one percent.

Note that the mightContatin() method returns true even for a value that we have not previously inserted into the filter.

Conclusion

In this quick tutorial, we looked at the probabilistic nature of the Bloom filter's data structure using the Guava implementation.

The implementation of all these examples and code snippets can be found in the GitHub project .

This is a Maven project, so import and launch should be easy.

THE END

We are waiting for comments and questions, which, as always, you can leave here, and go on an open day to the course instructor Mikhail Gorshkov .

Source: https://habr.com/ru/post/433700/

All Articles

Filter Bloom in Java using Guava

More articles: