
Set operations using Google Guava

1. Introduction

This article discusses some theoretical and practical aspects of set operations using the free Google Guava library. The focus is primarily on its use in behavioral-factor analysis systems, which are used to improve the quality (conversion) of large websites.

2. Combinatorics
Suppose we are processing personalized statistics. We know the topics of the documents on a website and the number of views of each (excluding bounces). We also know how many documents each individual user has viewed, broken down by topic. Let each document be an element of one of five sets, where each set corresponds to a document topic. We observe that all users of the site (six in this example, to save space) very actively view only documents from the T1 set.



For easier visual perception, the same data can be displayed as a matrix.





Let's look at a few combinatorics problems. First, let's find out in how many different orders the documents can be viewed. This value depends directly on the cardinality of the set (its number of elements): a set of n elements can be ordered in n! ways. For example, the factorial of 9 is 362880 (in R this is factorial(9)).

Second, we may need to know in how many ways a given number of elements can be chosen from a set (the binomial coefficient, "n choose k"). For this we need the number of elements to choose and the cardinality of the set. Recall that the order of the elements is not taken into account. In R, I would write:
> choose(19,3)
[1] 969


So there are 969 ways to choose 3 documents to read out of 19. Let's return to Java, and more specifically to Google Guava, whose LongMath class provides the same calculations. Let's try:
    import com.google.common.math.LongMath;

    logger.trace(LongMath.factorial(9));    // 362880
    logger.trace(LongMath.binomial(19, 3)); // 969
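To make the two numbers above less of a black box, here is a minimal plain-JDK sketch of what they mean. The class and method names are hypothetical, and this is not how Guava implements LongMath; it only reproduces the same values for the inputs used in the article.

```java
import java.math.BigInteger;

// Hypothetical helper class: a plain-JDK sketch, not Guava's implementation.
class CombinatoricsSketch {

    /** n! = 1 * 2 * ... * n, the number of orderings of n distinct elements. */
    static BigInteger factorial(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= n; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    /** C(n, k): the number of ways to choose k elements out of n, ignoring order. */
    static BigInteger binomial(int n, int k) {
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("need 0 <= k <= n");
        }
        k = Math.min(k, n - k); // C(n, k) == C(n, n - k)
        BigInteger result = BigInteger.ONE;
        for (int i = 1; i <= k; i++) {
            // After step i the value equals C(n - k + i, i), so the division is exact.
            result = result.multiply(BigInteger.valueOf(n - k + i))
                           .divide(BigInteger.valueOf(i));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(factorial(9));    // 362880
        System.out.println(binomial(19, 3)); // 969
    }
}
```

BigInteger is used here only so that the sketch cannot overflow silently on larger inputs.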


3. Set formation

When filling a collection (data is usually imported from other systems through an API or via RabbitMQ, Redis, or Tarantool), you can use a very simple scheme: put the set name into the collection as the key and the element name as the value. As a result, we get:
    import com.google.common.collect.ArrayListMultimap;
    import com.google.common.collect.Multimap;
    import com.google.gson.Gson;
    import redis.clients.jedis.Jedis;

    Multimap<String, String> sets = ArrayListMultimap.create();
    sets.put("main", "a");
    sets.put("new", "b");
    sets.put("new", "c");
    logger.trace(sets.get("main")); // [a]
    logger.trace(sets.get("new"));  // [b, c]
    logger.trace(sets.asMap());     // {new=[b, c], main=[a]}

    // Serialize the map to JSON and store it in Redis under the key "habr"
    Gson gson = new Gson();
    Jedis jedis = new Jedis("localhost");
    jedis.set("habr", gson.toJson(sets.asMap()));
    logger.trace(jedis.get("habr")); // {"new":["b","c"],"main":["a"]}


Now I will try to read the data from a project written in PHP. A ready-made client library is normally used for this, but in this example (do not do this in production) I will manually count the arguments and their lengths in order to send a request directly over the native Redis protocol:
    $fp = fsockopen('127.0.0.1', 6379);
    // *2 = two arguments; $3 and $4 = byte lengths of "GET" and "habr"
    fwrite($fp, "*2\r\n\$3\r\nGET\r\n\$4\r\nhabr\r\n");
    echo fgets($fp); // first reply line: the length header of the value
    echo fgets($fp); // second reply line: the stored JSON
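The manual counting in the request above can also be done programmatically. The following is a minimal Java sketch, assuming only the framing used in that request: an array header `*<number of arguments>` followed by `$<byte length>` and the argument itself, each terminated by CRLF. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds the same request string that the PHP example writes by hand.
class RespSketch {

    /** Encodes a command as an array of bulk strings in the Redis wire format. */
    static String encodeCommand(String... args) {
        StringBuilder sb = new StringBuilder();
        sb.append('*').append(args.length).append("\r\n"); // number of arguments
        for (String arg : args) {
            // The length prefix counts bytes, not characters.
            int byteLength = arg.getBytes(StandardCharsets.UTF_8).length;
            sb.append('$').append(byteLength).append("\r\n");
            sb.append(arg).append("\r\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String request = encodeCommand("GET", "habr");
        // Print with visible line terminators: *2\r\n$3\r\nGET\r\n$4\r\nhabr\r\n
        System.out.println(request.replace("\r\n", "\\r\\n"));
    }
}
```

Counting bytes rather than characters matters as soon as a key or value contains non-ASCII text.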


It is worth mentioning other simple ways of getting the elements of a set from local files or network resources. First of all, there is the readLines method, offered both by the Files class (for the local file system) and by the Resources class (for resources addressed by URL, for example ones freely accessible over HTTP). Sometimes a list of elements has to be extracted from an ordinary string in which the separator is a comma or a space. There are useful methods for this too. Here are some examples:
    import com.google.common.base.Charsets;
    import com.google.common.base.Joiner;
    import com.google.common.base.Splitter;
    import com.google.common.io.Resources;
    import java.net.URL;
    import java.util.List;

    // url is assumed to be defined elsewhere
    List<String> eventNames = Resources.readLines(new URL(url), Charsets.UTF_8);
    logger.trace(eventNames);

    List<String> set = Splitter.on(',').trimResults().omitEmptyStrings().splitToList("a,b,b,b,c");
    logger.trace(set);  // [a, b, b, b, c]

    String test = Joiner.on(" - ").join(set);
    logger.trace(test); // a - b - b - b - c
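The input string in the example above is already clean, so the effect of trimming and of dropping empty strings is not visible. The plain-JDK sketch below shows roughly what those two steps do on a messier input. The class and method names are hypothetical, and `String.trim` is used as an approximation of Guava's whitespace handling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: approximates split + trim + drop-empty with the JDK only.
class SplitJoinSketch {

    /** Splits on commas, trims each piece, and drops pieces that end up empty. */
    static List<String> splitOnComma(String input) {
        List<String> parts = new ArrayList<>();
        for (String raw : input.split(",")) {
            String trimmed = raw.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> parts = splitOnComma(" a, b,,c ,");
        System.out.println(parts);                      // [a, b, c]
        System.out.println(String.join(" - ", parts));  // a - b - c
    }
}
```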


4. Basic Set Operations

Let's look at an example and then describe the operations performed:
    import com.google.common.collect.ImmutableSet;
    import com.google.common.collect.Sets;
    import java.util.Set;

    Set<String> a = ImmutableSet.of("a", "b", "c", "d");
    Set<String> b = ImmutableSet.of("c", "d", "e", "f");

    logger.trace(Sets.union(a, b));               // [a, b, c, d, e, f]
    logger.trace(Sets.intersection(a, b));        // [c, d]
    logger.trace(Sets.difference(a, b));          // [a, b]
    logger.trace(Sets.symmetricDifference(a, b)); // [a, b, e, f]


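In this example, the following operations were performed: union (all elements that belong to at least one of the sets), intersection (elements that belong to both sets), difference (elements of a that are not in b), and symmetric difference (elements that belong to exactly one of the two sets).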
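For readers who want to see these operations expressed through the standard collections, here is a minimal plain-JDK sketch. The class and method names are hypothetical, and it copies elements into new sets, which is a simplification chosen for this illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: copying versions of the four operations, JDK only.
class SetOpsSketch {

    static <T> Set<T> union(Set<T> a, Set<T> b) {
        Set<T> result = new LinkedHashSet<>(a);
        result.addAll(b);      // everything from a, then whatever b adds
        return result;
    }

    static <T> Set<T> intersection(Set<T> a, Set<T> b) {
        Set<T> result = new LinkedHashSet<>(a);
        result.retainAll(b);   // keep only elements that are also in b
        return result;
    }

    static <T> Set<T> difference(Set<T> a, Set<T> b) {
        Set<T> result = new LinkedHashSet<>(a);
        result.removeAll(b);   // drop elements that are in b
        return result;
    }

    static <T> Set<T> symmetricDifference(Set<T> a, Set<T> b) {
        Set<T> result = difference(a, b);
        result.addAll(difference(b, a)); // elements in exactly one of the two sets
        return result;
    }

    public static void main(String[] args) {
        Set<String> a = new LinkedHashSet<>(Arrays.asList("a", "b", "c", "d"));
        Set<String> b = new LinkedHashSet<>(Arrays.asList("c", "d", "e", "f"));
        System.out.println(union(a, b));               // [a, b, c, d, e, f]
        System.out.println(intersection(a, b));        // [c, d]
        System.out.println(difference(a, b));          // [a, b]
        System.out.println(symmetricDifference(a, b)); // [a, b, e, f]
    }
}
```

Unlike these copying helpers, Guava's Sets.union and its sibling methods return unmodifiable views (Sets.SetView) backed by the original sets, so no elements are copied until you ask for a copy.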


5. Sets that may contain non-unique elements

I want to draw your attention to the HashMultiset collection, which makes counting duplicate elements as simple as possible. A natural next step is to get the elements sorted by their number of repetitions:
    import com.google.common.collect.HashMultiset;
    import com.google.common.collect.ImmutableMultiset;
    import com.google.common.collect.Multisets;

    HashMultiset<String> habr = HashMultiset.create();
    habr.add("habr_7");
    habr.add("habr_5", 4); // add four occurrences at once
    habr.add("habr_9", 2);
    habr.add("habr_1", 3);

    logger.trace(habr);                 // [habr_1 x 3, habr_9 x 2, habr_5 x 4, habr_7]
    logger.trace(habr.count("habr_1")); // 3
    logger.trace(habr.count("habr_5")); // 4
    logger.trace(habr.count("habr_0")); // 0 (absent elements have a count of zero)

    ImmutableMultiset<String> highestRank = Multisets.copyHighestCountFirst(habr);
    logger.trace(highestRank);          // [habr_5 x 4, habr_1 x 3, habr_9 x 2, habr_7]
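The idea of counting occurrences and then ordering by count can be sketched with the JDK alone, which may help in understanding what the multiset does. The class and method names below are hypothetical, and the alphabetical tie-breaking rule is an assumption made for this sketch, not something taken from Guava.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: occurrence counting and ranking with the JDK only.
class CountingSketch {

    /** Counts how many times each element occurs in the input. */
    static Map<String, Integer> countOccurrences(List<String> items) {
        Map<String, Integer> counts = new HashMap<>();
        for (String item : items) {
            counts.merge(item, 1, Integer::sum); // start at 1, otherwise add 1
        }
        return counts;
    }

    /** Returns the distinct elements, most frequent first (ties: alphabetical, an assumption). */
    static List<String> highestCountFirst(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue()); // descending count
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        List<String> ranked = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            ranked.add(entry.getKey());
        }
        return ranked;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList(
                "habr_7", "habr_5", "habr_5", "habr_5", "habr_5",
                "habr_9", "habr_9", "habr_1", "habr_1", "habr_1");
        Map<String, Integer> counts = countOccurrences(items);
        System.out.println(counts.get("habr_5"));             // 4
        System.out.println(counts.getOrDefault("habr_0", 0)); // 0
        System.out.println(highestCountFirst(counts));        // [habr_5, habr_1, habr_9, habr_7]
    }
}
```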


6. Conclusion

I hope that this modest article helps some readers understand more quickly a number of theoretical and practical aspects of working with sets using Google Guava.

Source: https://habr.com/ru/post/278009/

