Apache Mahout is a machine learning library designed for building scalable machine learning applications. Recommendation systems are among the most recognizable machine learning applications in use today. In this guide we will use the Million Song Dataset online archive to generate song recommendations for users based on their listening preferences.
What this guide covers:
This guide consists of the following sections.
To complete the tasks in this guide, you will need an account for the Apache Hadoop-based services on Windows Azure, and you will also need to create a cluster. To get an account and create a Hadoop cluster, follow the instructions in the "Getting Started with Microsoft Hadoop on Windows Azure Platform" section of the "Introduction to Hadoop on Windows Azure Platform" article.
Apache Mahout ships with a built-in implementation of item-based collaborative filtering. Item-based collaborative filtering is one of the most commonly used approaches for generating recommendations from this kind of data.
In this example, users interact with items (songs) and express preferences for them through the number of times each song was played. Sample data is provided on the Echo Nest Taste Profile Subset web page.
Fig.1. Sample Million Song Dataset archive data
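Each record in the Taste Profile Subset is a tab-separated triplet of user ID, song ID, and play count. The values below are placeholders that only illustrate the layout, not actual rows from the dataset:

userhash001	SOAAAAA12A00000001	3
userhash001	SOBBBBB12A00000002	1
userhash002	SOAAAAA12A00000001	7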
To use this dataset with Mahout, the data must be prepared in two steps: the string user and song identifiers must be mapped to integer IDs, and the tab-separated triplets must be converted into the comma-separated user,item,preference format that Mahout expects. The following utility program performs both conversions.
Start Visual Studio 2010. In the program window, select File -> New Project. In the Installed Templates pane, under Visual C#, select the Windows category, and then select Console Application from the list. Name the project ConvertToMahoutInput.
Fig.2. Creating a console application
After creating the application, open the Program.cs file and add the following static members to the Program class.
const char tab = '\u0009';

// Map the original string user and song identifiers to sequential integer IDs.
static Dictionary<string, int> usersMapping = new Dictionary<string, int>();
static Dictionary<string, int> songMapping = new Dictionary<string, int>();
Then add the following code to the Main method.
// Convert up to 5000 tab-separated triplets (user, song, play count)
// into comma-separated lines that Mahout can read.
var inputStream = File.Open(args[0], FileMode.Open);
var reader = new StreamReader(inputStream);
var outStream = File.Open("mInput.txt", FileMode.OpenOrCreate);
var writer = new StreamWriter(outStream);

var i = 1;
var line = reader.ReadLine();
while (!string.IsNullOrWhiteSpace(line))
{
    i++;
    if (i > 5000)
        break;

    var outLine = line.Split(tab);
    int user = GetUser(outLine[0]);
    int song = GetSong(outLine[1]);

    writer.Write(user);
    writer.Write(',');
    writer.Write(song);
    writer.Write(',');
    writer.WriteLine(outLine[2]);

    line = reader.ReadLine();
}

Console.WriteLine("saved {0} lines to mInput.txt", i);

reader.Close();
writer.Close();

SaveMapping(usersMapping, "usersMap.csv");
SaveMapping(songMapping, "songMapping.csv");
Console.WriteLine("Mapping saved");
Console.ReadKey();
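After the conversion, each line of mInput.txt holds a comma-separated triple of integer user ID, integer song ID, and play count. The lines below only illustrate this layout; the actual values depend on the data:

1,1,3
1,2,1
2,1,7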
Now create the GetUser and GetSong functions to convert identifiers to integers.
static int GetUser(string user)
{
    if (!usersMapping.ContainsKey(user))
        usersMapping.Add(user, usersMapping.Count + 1);

    return usersMapping[user];
}

static int GetSong(string song)
{
    if (!songMapping.ContainsKey(song))
        songMapping.Add(song, songMapping.Count + 1);

    return songMapping[song];
}
Finally, add the SaveMapping helper method, which saves the program's mapping dictionaries to CSV files.
static void SaveMapping(Dictionary<string, int> mapping, string fileName)
{
    var stream = File.Open(fileName, FileMode.Create);
    var writer = new StreamWriter(stream);

    foreach (var key in mapping.Keys)
    {
        writer.Write(key);
        writer.Write(',');
        writer.WriteLine(mapping[key]);
    }

    writer.Close();
}
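Each mapping file written by SaveMapping contains one original identifier and its assigned integer ID per line. For songMapping.csv it might look like this (illustrative values only):

SOAAAAA12A00000001,1
SOBBBBB12A00000002,2

Keep these files: songMapping.csv will be needed later to translate the recommended song IDs back to the original identifiers.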
Now download the sample data located at this link. After downloading, open the train_triplets.txt.zip archive and extract the train_triplets.txt file.
When running the utility, pass the location of the train_triplets.txt file as a command-line argument. To do this, right-click the ConvertToMahoutInput project node in Solution Explorer and select Properties from the context menu. On the project properties page, add the path to the train_triplets.txt file to the Command line arguments text field.
Fig.3. Setting command line argument
To start the program, press F5. After it completes, open the bin\Debug folder in the location where the project was saved and view the output of the utility program.
Fig.4. Result of running utility program ConvertToMahoutInput
Open the Hadoop Cluster Portal at https://www.hadooponazure.com and click the Remote Desktop icon.
Fig.5. Remote Desktop icon
Pack the mInput.txt file from the bin\Debug folder into a Zip archive and copy it to the c:\ root folder on the remote cluster. After copying, extract the file from the archive.
Now create a file with the IDs of the users for whom recommendations will be generated. To do this, create a text file named users.txt in the root folder c:\ and write the identifier of one user in it.
Note. To create recommendations for other users, add their identifiers on separate lines.
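For example, a users.txt requesting recommendations for three users might look like this (the integer IDs are the ones assigned by the conversion utility; the values here are illustrative):

1
15
42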
Then upload the mInput.txt and users.txt files to HDFS. To do this, open the Hadoop Command Shell and run the following commands.
hadoop fs -copyFromLocal c:\mInput.txt input/mInput.txt
hadoop fs -copyFromLocal c:\users.txt input/users.txt
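If you want to confirm that both files were uploaded, you can list the contents of the input directory with the standard HDFS command:

hadoop fs -ls input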
Now you can perform the task with the command:
hadoop jar c:\Apps\dist\mahout\mahout-core-0.5-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob --input=input/mInput.txt --output=output --usersFile=input/users.txt
The Mahout job runs for several minutes, after which it creates an output file. Run the following command to copy the output file to the local file system.
hadoop fs -copyToLocal output/part-r-00000 c:\output.txt
Open the file output.txt from the root folder c:\ and examine its contents. The file has the following structure.
user [song:rating,song:rating,...]
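The user and song values in this file are the integer IDs assigned by the ConvertToMahoutInput utility, so to interpret the recommendations the song IDs have to be translated back to the original dataset identifiers using songMapping.csv. A minimal C# sketch of such a lookup, assuming songMapping.csv sits in the working directory, might look like this:

using System;
using System.Collections.Generic;
using System.IO;

class SongLookup
{
    static void Main(string[] args)
    {
        // Invert songMapping.csv (original song ID, integer ID) written by
        // ConvertToMahoutInput so that recommended integer IDs can be
        // translated back to Million Song Dataset song identifiers.
        var reverseMapping = new Dictionary<int, string>();
        foreach (var line in File.ReadAllLines("songMapping.csv"))
        {
            var parts = line.Split(',');
            reverseMapping[int.Parse(parts[1])] = parts[0];
        }

        // Example: look up the integer song ID passed as the first command-line argument.
        int songId = int.Parse(args[0]);
        Console.WriteLine(reverseMapping[songId]);
    }
}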
Recommendation systems are an important feature of many modern social networks, media streaming services, online stores, and other web sites. Mahout offers a ready-made recommendation engine that is easy to use, includes many useful features, and scales on the Hadoop platform.
Source: https://habr.com/ru/post/150359/