Apache Mahout is a machine learning library designed for building scalable machine learning applications. Recommendation systems are among the most recognizable machine learning applications in use today. In this guide we will use the Million Song Dataset online archive to generate song recommendations for users based on their listening preferences.
What this guide covers:
This guide consists of the following sections.
To complete the tasks in this guide, you will need an account for the Apache Hadoop-based services on Windows Azure, and you will also need to create a cluster. To get an account and create a Hadoop cluster, follow the instructions in the "Getting Started with Microsoft Hadoop on Windows Azure Platform" section of the "Introduction to Hadoop on Windows Azure Platform" article.
Apache Mahout ships with a built-in implementation of item-based collaborative filtering. Item-based collaborative filtering is one of the most commonly used approaches for generating recommendations from this kind of data.
In this example, users interact with items (songs) and express preferences for them through the number of times each song was played. Sample data is provided on the Echo Nest Taste Profile Subset web page.
Fig.1. Sample Million Song Dataset archive data
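Each record in the Taste Profile Subset is a tab-separated triplet of user ID, song ID, and play count. The values below are placeholders that only illustrate the layout, not actual rows from the dataset:

userhash001	SOAAAAA12A00000001	3
userhash001	SOBBBBB12A00000002	1
userhash002	SOAAAAA12A00000001	7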
To use this dataset with Mahout, the data must be prepared in two steps: the string user and song identifiers must be mapped to integer IDs, and the tab-separated triplets must be converted into the comma-separated user,item,preference format that Mahout expects. The following utility program performs both conversions.
Start Visual Studio 2010. In the program window, select File -> New Project. In the Installed Templates pane, under Visual C#, select the Windows category, and then select Console Application from the list. Name the project ConvertToMahoutInput.
Fig.2. Creating a console application
After creating the application, open the Program.cs file and add the following static members to the Program class.
const char tab = '\u0009';

// Map the original string user and song identifiers to sequential integer IDs.
static Dictionary<string, int> usersMapping = new Dictionary<string, int>();
static Dictionary<string, int> songMapping = new Dictionary<string, int>();
Then add the following code to the Main method.
// Convert up to 5000 tab-separated triplets (user, song, play count)
// into comma-separated lines that Mahout can read.
var inputStream = File.Open(args[0], FileMode.Open);
var reader = new StreamReader(inputStream);
var outStream = File.Open("mInput.txt", FileMode.OpenOrCreate);
var writer = new StreamWriter(outStream);

var i = 1;
var line = reader.ReadLine();
while (!string.IsNullOrWhiteSpace(line))
{
    i++;
    if (i > 5000)
        break;

    var outLine = line.Split(tab);
    int user = GetUser(outLine[0]);
    int song = GetSong(outLine[1]);

    writer.Write(user);
    writer.Write(',');
    writer.Write(song);
    writer.Write(',');
    writer.WriteLine(outLine[2]);

    line = reader.ReadLine();
}

Console.WriteLine("saved {0} lines to mInput.txt", i);

reader.Close();
writer.Close();

SaveMapping(usersMapping, "usersMap.csv");
SaveMapping(songMapping, "songMapping.csv");
Console.WriteLine("Mapping saved");
Console.ReadKey();
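After the conversion, each line of mInput.txt holds a comma-separated triple of integer user ID, integer song ID, and play count. The lines below only illustrate this layout; the actual values depend on the data:

1,1,3
1,2,1
2,1,7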
Now create the GetUser and GetSong functions to convert identifiers to integers.
static int GetUser(string user)
{
    if (!usersMapping.ContainsKey(user))
        usersMapping.Add(user, usersMapping.Count + 1);

    return usersMapping[user];
}

static int GetSong(string song)
{
    if (!songMapping.ContainsKey(song))
        songMapping.Add(song, songMapping.Count + 1);

    return songMapping[song];
}
Finally, add the SaveMapping helper method, which saves the program's mapping dictionaries to CSV files.
static void SaveMapping(Dictionary<string, int> mapping, string fileName)
{
    var stream = File.Open(fileName, FileMode.Create);
    var writer = new StreamWriter(stream);

    foreach (var key in mapping.Keys)
    {
        writer.Write(key);
        writer.Write(',');
        writer.WriteLine(mapping[key]);
    }

    writer.Close();
}
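Each mapping file written by SaveMapping contains one original identifier and its assigned integer ID per line. For songMapping.csv it might look like this (illustrative values only):

SOAAAAA12A00000001,1
SOBBBBB12A00000002,2

Keep these files: songMapping.csv will be needed later to translate the recommended song IDs back to the original identifiers.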
Now download the sample data located at this link. After downloading, open the train_triplets.txt.zip archive and extract the train_triplets.txt file.
When running the utility, pass the location of the train_triplets.txt file as a command-line argument. To do this, right-click the ConvertToMahoutInput project node in Solution Explorer and select Properties from the context menu. On the project properties page, add the path to the train_triplets.txt file to the Command line arguments text field.
Fig.3. Setting command line argument
To start the program, press F5. After it completes, open the bin\Debug folder in the location where the project was saved and view the output of the utility program.
Fig.4. Result of running utility program ConvertToMahoutInput
Open the Hadoop Cluster Portal at https://www.hadooponazure.com and click the Remote Desktop icon.
Fig.5. Remote Desktop icon
Pack the mInput.txt file from the bin\Debug folder into a Zip archive and copy it to the c:\ root folder on the remote cluster. After copying, extract the file from the archive.
Now create a file with the IDs of the users for whom recommendations will be generated. To do this, create a text file named users.txt in the root folder c:\ and write the identifier of one user in it.
Note. To create recommendations for other users, add their identifiers on separate lines.
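For example, a users.txt requesting recommendations for three users might look like this (the integer IDs are the ones assigned by the conversion utility; the values here are illustrative):

1
15
42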
Then upload the mInput.txt and users.txt files to HDFS. To do this, open the Hadoop Command Shell and run the following commands.
hadoop fs -copyFromLocal c:\mInput.txt input/mInput.txt
hadoop fs -copyFromLocal c:\users.txt input/users.txt
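If you want to confirm that both files were uploaded, you can list the contents of the input directory with the standard HDFS command:

hadoop fs -ls input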
Now you can perform the task with the command:
hadoop jar c:\Apps\dist\mahout\mahout-core-0.5-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob --input=input/mInput.txt --output=output --usersFile=input/users.txt
The Mahout job runs for several minutes, after which it creates an output file. Run the following command to copy the output file to the local file system.
hadoop fs -copyToLocal output/part-r-00000 c:\output.txt
Open the file output.txt from the root folder c:\ and examine its contents. The file has the following structure.
user [song:rating,song:rating,...]
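The user and song values in this file are the integer IDs assigned by the ConvertToMahoutInput utility, so to interpret the recommendations the song IDs have to be translated back to the original dataset identifiers using songMapping.csv. A minimal C# sketch of such a lookup, assuming songMapping.csv sits in the working directory, might look like this:

using System;
using System.Collections.Generic;
using System.IO;

class SongLookup
{
    static void Main(string[] args)
    {
        // Invert songMapping.csv (original song ID, integer ID) written by
        // ConvertToMahoutInput so that recommended integer IDs can be
        // translated back to Million Song Dataset song identifiers.
        var reverseMapping = new Dictionary<int, string>();
        foreach (var line in File.ReadAllLines("songMapping.csv"))
        {
            var parts = line.Split(',');
            reverseMapping[int.Parse(parts[1])] = parts[0];
        }

        // Example: look up the integer song ID passed as the first command-line argument.
        int songId = int.Parse(args[0]);
        Console.WriteLine(reverseMapping[songId]);
    }
}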
Recommendation systems are an important feature of many modern social networks, media streaming services, online stores, and other web sites. Mahout offers a ready-made recommendation engine that is easy to use, includes many useful features, and scales on the Hadoop platform.
Source: https://habr.com/ru/post/150359/