
Meet Apache Mahout

Hey.



My first article on Habr showed that few people know about the Mahout library. (I may, of course, be wrong about this.) There is also no introductory material on the subject here, so I decided to write a post about what the library can do. After a couple of drafts it became clear that the best introduction would be short excerpts from the book “Mahout in Action” by Owen, Anil, Dunning, and Friedman. So I have loosely translated a few passages that, I think, describe the scope of Mahout well.







Meet Apache Mahout (1)



* Here and below, the chapter or section of the book is given in parentheses.



...

Mahout contains a number of models and algorithms, many still in development or in an experimental phase. At this early stage of the project's life, three key topics stand out: recommender systems (collaborative filtering), clustering, and classification. This is not all there is in Mahout, but these topics are the most prominent and mature.

...

In theory, Mahout is a project open to the implementation of any kind of machine learning model. In practice, three key areas of machine learning are currently implemented.

...

Recommender systems (1.2.1)



Recommender systems are the most recognizable machine learning technique in use today. You have seen services or sites that try to recommend books, films, or articles based on your previous actions. They try to infer tastes and preferences, and to identify unknown items that may be of interest.







Clustering (1.2.2)



Clustering is less obvious, but it turns up in equally well-known contexts. As the name suggests, clustering techniques attempt to group a large number of objects into clusters that share some similarity. This is a way to discover hierarchy and order in a large or hard-to-understand data set, and in that way to reveal interesting patterns or make the data set easier to comprehend.





Clustering helps determine structure, and even hierarchy, in a large collection of things that may otherwise be hard to make sense of. Enterprises can use this technique to discover hidden groups among users, to organize a large collection of documents sensibly, or to find common usage patterns for a site from its logs.



Classification (1.2.3)



Classification models decide whether or not an item is part of a specific category or if it has some attribute. ...





Classification helps decide whether a new input or item matches a previously observed pattern, and it is often used to classify behavior or patterns. It can be used to detect suspicious network activity or fraud, or to work out whether a user's message indicates frustration or satisfaction.



Each of these techniques works best when given a large amount of good input data. In some cases they must not only handle large amounts of data but also produce results quickly, and these factors make scalability a major concern. Scalability is one of the main reasons to use Mahout.



As repeatedly noted in the book, there is no ready-made recipe that can be taken and applied to a typical situation. For each case, you need to try different algorithms and input data. Only by understanding the essence of the algorithms can the library be successfully applied.



Running a first recommender system (2.2)



... We now explore a simple user-based recommender system.



Creating Input (2.2.1)


...

The recommender system needs input on which to base its recommendations. In Mahout's terms, this data takes the form of preferences. Since recommender systems are easiest to understand in terms of recommending items to users, we will speak of a preference as a user-to-item association. ... A preference consists of a user ID and an item ID, and usually a number that expresses how strongly the user prefers this item (a rating). IDs in Mahout are always integers. The preference value can be anything, as long as a larger value expresses a stronger positive preference. For example, these values can be ratings on a scale from 1 to 5, where 1 means the user does not like the item and 5 means the user likes it very much.

Create an intro.csv file containing userID, itemID, value information.
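To make the userID, itemID, value format concrete, here is a minimal sketch in plain Java that parses such lines into preference objects. It does not use Mahout at all: the class PreferenceSketch, the nested Preference type, the parseLine and parseAll helpers, and the sample rows are all made up for illustration, and the rows are not the data from the book.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PreferenceSketch {

    /** Hypothetical value type: one user-item association plus its rating. */
    public static final class Preference {
        public final long userId;
        public final long itemId;
        public final float value;

        public Preference(long userId, long itemId, float value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }

        @Override
        public String toString() {
            return "Preference[user=" + userId + ", item=" + itemId + ", value=" + value + "]";
        }
    }

    /** Parses a single "userID,itemID,value" line. */
    public static Preference parseLine(String line) {
        String[] parts = line.trim().split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected userID,itemID,value but got: " + line);
        }
        return new Preference(
                Long.parseLong(parts[0].trim()),
                Long.parseLong(parts[1].trim()),
                Float.parseFloat(parts[2].trim()));
    }

    /** Parses many lines, skipping blank ones. */
    public static List<Preference> parseAll(List<String> lines) {
        List<Preference> result = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                result.add(parseLine(line));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample rows in the userID,itemID,value layout.
        List<String> lines = Arrays.asList("1,101,5.0", "1,102,3.0", "2,101,2.0");
        for (Preference p : parseAll(lines)) {
            System.out.println(p);
        }
    }
}
```

In the real example this parsing is done by FileDataModel; the sketch only shows what one row of the file means.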

...

Now, run the following code.

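import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

class RecommenderIntro {
  public static void main(String[] args) throws Exception {
    // Load the userID,itemID,value preferences from the file
    DataModel model = new FileDataModel(new File("intro.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Ask for one recommendation for user 1
    List<RecommendedItem> recommendations = recommender.recommend(1, 1);
    for (RecommendedItem recommendation : recommendations) {
      System.out.println(recommendation);
    }
  }
}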




DataModel stores and provides access to all the preferences, users, and items needed for the calculations. The UserSimilarity implementation gives a measure of how alike two users' tastes are; it can be based on one of many metrics or calculations. (The metrics are described in the first post.) The UserNeighborhood implementation defines the group of users who are closest to a given user. (The first parameter, 2, is the number of users in this group.) Finally, the Recommender implementation ties the previous three components together to make recommendations to users. The recommend(long userID, int howMany) method takes two parameters: the user and the number of recommendations to make for this user.
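To give a feel for what a user similarity measure computes, here is a minimal plain-Java sketch of the Pearson correlation between two users, taken over the items both have rated. It is an illustration of the metric only and is not Mahout's PearsonCorrelationSimilarity code; the class PearsonSketch, the pearson method, and the sample ratings are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

public class PearsonSketch {

    /**
     * Pearson correlation of two users' ratings, computed only over items
     * that both users have rated. Each map goes from item ID to rating.
     * Returns NaN when the correlation is undefined (no shared items, or
     * one user's shared ratings have zero variance).
     */
    public static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<Long, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other == null) {
                continue; // item not rated by the second user
            }
            double x = entry.getValue();
            double y = other;
            n++;
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
        }
        if (n == 0) {
            return Double.NaN;
        }
        // Sums of centered products: numerator and the two variance terms.
        double cov = sumAB - sumA * sumB / n;
        double varA = sumAA - sumA * sumA / n;
        double varB = sumBB - sumB * sumB / n;
        double denominator = Math.sqrt(varA * varB);
        return denominator == 0 ? Double.NaN : cov / denominator;
    }

    public static void main(String[] args) {
        // Made-up ratings: the second user rates the same items twice as high.
        Map<Long, Double> user1 = new HashMap<>();
        user1.put(101L, 1.0);
        user1.put(102L, 2.0);
        user1.put(103L, 3.0);
        Map<Long, Double> user2 = new HashMap<>();
        user2.put(101L, 2.0);
        user2.put(102L, 4.0);
        user2.put(103L, 6.0);
        System.out.println(pearson(user1, user2));
    }
}
```

A value near 1 means the two users rate shared items in the same relative order, a value near -1 means the opposite, and a neighborhood of size 2 would simply keep the two users with the highest values.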



Output: RecommendedItem [item: XXX, value: Y], where Y is the predicted rating that user 1 would give to item XXX. This item is recommended to the user because it has the highest predicted rating.
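One common way to arrive at such a predicted rating in user-based collaborative filtering is a similarity-weighted average of the neighbors' ratings for the item. The sketch below shows that idea in plain Java. It is an assumption-laden illustration, not a statement of what GenericUserBasedRecommender does internally: the class EstimateSketch, the estimate method, the rule of skipping neighbors with NaN or non-positive similarity, and the sample numbers are all made up here.

```java
public class EstimateSketch {

    /**
     * Similarity-weighted average of neighbors' ratings for one item.
     * similarities[i] is the similarity between the target user and
     * neighbor i; ratings[i] is that neighbor's rating of the item.
     * Neighbors with NaN or non-positive similarity are skipped
     * (a simplification chosen for this sketch).
     * Returns NaN when no neighbor contributes.
     */
    public static double estimate(double[] similarities, double[] ratings) {
        if (similarities.length != ratings.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        double weightedSum = 0;
        double totalWeight = 0;
        for (int i = 0; i < similarities.length; i++) {
            double weight = similarities[i];
            if (Double.isNaN(weight) || weight <= 0) {
                continue;
            }
            weightedSum += weight * ratings[i];
            totalWeight += weight;
        }
        return totalWeight == 0 ? Double.NaN : weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        // Two hypothetical neighbors: a very similar one who rated the item 4,
        // and a less similar one who rated it 2.
        double[] similarities = {1.0, 0.5};
        double[] ratings = {4.0, 2.0};
        System.out.println(estimate(similarities, ratings));
    }
}
```

The more similar neighbor pulls the estimate toward its own rating, which is why the result lands closer to 4 than to 2.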

Source: https://habr.com/ru/post/189098/


