Hey.
My first article on Habr showed that not many people know about the Mahout library (though perhaps I am mistaken about that), and there is no introductory material on the subject here. So I decided to write a post describing the library's capabilities. A couple of early drafts convinced me that the best introduction to the topic would be short excerpts from the book "Mahout in Action" by Owen, Anil, Dunning, and Friedman. So what follows is my free translation of some passages that, I think, illustrate well what Mahout is for.
Meet Apache Mahout (1)
* Here and below, the number in brackets indicates the chapter or section of the book.
- Mahout is an open source machine learning library from Apache. The algorithms it implements fall under the broad umbrella of machine learning, or collective intelligence. That can mean many things, but at the moment it primarily means recommender engines (collaborative filtering), clustering, and classification.
- Mahout is scalable. Mahout aims to be a machine learning tool capable of processing data on one machine or across many. In the current version, the scalable machine learning implementations are written in Java, and some parts are built on Apache's Hadoop distributed computing project.
- Mahout is a Java library. It does not provide a user interface, a server, or an installer. It is a framework of tools intended to be used and adapted by developers.
...
Mahout contains a number of models and algorithms, many of them still in development or in an experimental phase (algorithms). At this early stage of the project's life, three key themes stand out: recommender engines (collaborative filtering), clustering, and classification. This is not everything Mahout offers, but these themes are the most prominent and mature.
...
In theory, Mahout is a project open to the implementation of any kind of machine learning model. In practice, three key areas of machine learning are currently implemented.
...
Recommender systems (1.2.1)
Recommender engines are the most immediately recognizable machine learning technique in use today. You have seen services or sites that try to recommend books, films, or articles based on your past actions. They try to infer your tastes and preferences and identify unknown items that would interest you.
- Amazon.com is probably the most famous e-commerce site to deploy recommendations. Based on your purchases and activity on the site, Amazon recommends books and other items likely to interest you.
- Netflix similarly recommends DVDs that may be of interest, and offered a $1M prize to researchers who could improve the quality of its recommendations.
- Social networks such as Facebook use variants of recommender techniques to identify the people most likely to be a user's not-yet-connected friends.
Clustering (1.2.2)
Clustering is less obvious, but it turns up in equally well-known contexts. As the name implies, clustering techniques attempt to group a large number of things into clusters that share some similarity. In this way hierarchy and order emerge in a large or hard-to-understand data set, revealing interesting patterns or making the data set easier to comprehend.
- Google News groups news articles by topic using clustering techniques.
- Search engines such as Clusty also group their search results.
- Using clustering techniques, customers can be grouped into segments (clusters) based on attributes such as income, location, and buying habits.
Clustering helps identify structure, and even hierarchy, within a large collection of things that may otherwise be difficult to make sense of. Enterprises can use this technique to discover hidden groupings among their users, organize a large collection of documents sensibly, or discover common usage patterns for a site from its logs.
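To make the grouping idea concrete, here is a toy one-dimensional k-means pass in plain Java. This is an illustrative sketch only: the class and the data are invented for the example, and it has nothing to do with Mahout's API; Mahout's own clustering implementations run steps like these at scale, on Hadoop.

```java
import java.util.Arrays;

// Toy 1-D k-means: assign each point to its nearest centroid, then move each
// centroid to the mean of its assigned points; repeat for a fixed number of passes.
public class ToyKMeans {
    static double[] cluster(double[] points, double[] centroids, int iterations) {
        for (int iter = 0; iter < iterations; iter++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double p : points) {
                int nearest = 0; // index of the closest centroid to p
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[nearest])) {
                        nearest = c;
                    }
                }
                sum[nearest] += p;
                count[nearest]++;
            }
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] > 0) centroids[c] = sum[c] / count[c];
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 10.0, 10.5, 11.0}; // two obvious groups
        double[] centroids = cluster(points, new double[]{0.0, 5.0}, 10);
        System.out.println(Arrays.toString(centroids)); // prints [1.5, 10.5]
    }
}
```

The centroids settle at the means of the two natural groups; real clustering implementations handle many dimensions, smarter initialization, and far more data, but the core loop is the same.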
Classification (1.2.3)
Classification techniques decide whether or not an item belongs to a particular category, or whether it has some attribute. ...
- Yahoo! Mail decides whether or not an incoming message is spam, based on previous emails and spam messages from users, as well as the characteristics of the emails themselves.
- Google's Picasa and other photo-management applications can determine whether a region of an image contains a human face.
- Optical Character Recognition software classifies small areas of scanned text into individual characters.
Classification helps decide whether a new piece of input matches previously observed patterns, and it is often used to classify behavior or patterns. It can be used to detect suspicious network activity or fraud, or to determine whether a user's message expresses frustration or satisfaction.
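As a toy illustration of the idea (plain Java, with everything invented for the example and nothing taken from Mahout's API): a message is labeled spam when it shares more words with a known spam example than with a known legitimate one. Real classifiers learn weighted statistics from thousands of labeled examples, but the decision step is the same in spirit.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy classifier: count word overlap with a known spam example vs. a known
// legitimate example, and pick the category with the larger overlap.
public class ToySpamClassifier {
    static Set<String> words(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\s+")));
    }

    static int overlap(Set<String> a, Set<String> b) {
        int n = 0;
        for (String w : a) {
            if (b.contains(w)) n++;
        }
        return n;
    }

    static String classify(String message, String knownSpam, String knownHam) {
        Set<String> m = words(message);
        return overlap(m, words(knownSpam)) > overlap(m, words(knownHam)) ? "spam" : "ham";
    }

    public static void main(String[] args) {
        String spam = "win free prize money now click";
        String ham  = "meeting agenda project report tomorrow";
        System.out.println(classify("free money prize inside", spam, ham));      // prints spam
        System.out.println(classify("project meeting moved to tomorrow", spam, ham)); // prints ham
    }
}
```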
Each of these techniques works best when given a large amount of good input data. In some cases these techniques must not only work on large amounts of data but also produce results quickly, and these factors make scalability a central concern. One of the main reasons to use Mahout is its scalability.
As the book notes repeatedly, there is no ready-made recipe that can simply be applied to a typical situation. For each case you need to experiment with different algorithms and input data. Only by understanding the essence of the algorithms can the library be applied successfully.
Run the first recommendation system (2.2)
... Now we will explore a simple user-based recommender.
Creating Input (2.2.1)
...
A recommender engine needs input on which to base its recommendations. In Mahout's language, this data takes the form of preferences. Because recommender engines are most intuitive when framed as recommending items to users, we will speak of a preference as a user-item association. ... A preference consists of a user ID and an item ID, and usually a number expressing the strength of the user's preference for the item (a rating). IDs in Mahout are always integers. The preference value can be anything, as long as larger values express stronger positive preference. For example, the values might be ratings on a scale of 1 to 5, where 1 means the user dislikes the item and 5 means the user likes it very much.
Create an intro.csv file containing userID, itemID, value information.
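For example, intro.csv might look like this (the user IDs, item IDs, and ratings below are invented for illustration; only the userID,itemID,value format matters):

```csv
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
3,101,2.5
3,104,4.0
3,105,4.5
```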
...
Now, run the following code.
```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

class RecommenderIntro {
  public static void main(String[] args) throws Exception {
    // Load the user-item preferences from the CSV file
    DataModel model = new FileDataModel(new File("intro.csv"));
    // Compare users' tastes using the Pearson correlation metric
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Consider the 2 users most similar to the target user
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Produce 1 recommendation for user 1
    List<RecommendedItem> recommendations = recommender.recommend(1, 1);
    for (RecommendedItem recommendation : recommendations) {
      System.out.println(recommendation);
    }
  }
}
```
A DataModel stores and provides access to all the preference, user, and item data needed in the computation. A UserSimilarity implementation provides some notion of how similar two users' tastes are; it can be based on one of many possible metrics or calculations (the metrics are described in my first post). A UserNeighborhood implementation defines a notion of the group of users most similar to a given user (the first argument, 2, is the number of users in that group). Finally, a Recommender implementation ties the previous three components together to recommend items to users. The recommend(long userID, int howMany) method takes two arguments: the user and the number of recommendations to produce for that user.
The output has the form RecommendedItem[item:XXX, value:Y], where Y is the preference value the engine predicts user 1 would give item XXX. This item is recommended to the user because it has the highest predicted score.