
Cloud recommendation system using Hadoop and Apache Mahout


Apache Mahout is a machine learning library built for scalable machine learning applications. Recommendation systems are among the most recognizable machine learning applications in use today. In this guide, we use the Million Song Dataset online archive to recommend songs to users based on their musical preferences.





This guide consists of the following sections:

  1. Examining and formatting data
  2. Running a Mahout job


Installation and Setup



To complete the tasks in this guide, you need an account for the Apache Hadoop-based services for Windows Azure, and you need to create a cluster. To get an account and create a Hadoop cluster, follow the instructions in the "Getting Started with Microsoft Hadoop on Windows Azure Platform" section of the "Introduction to Hadoop on Windows Azure Platform" article.




Examining and formatting data



Apache Mahout offers a built-in implementation of item-based collaborative filtering, an approach commonly used to generate recommendations. It recommends items similar to the ones a user has already interacted with, where similarity between items is derived from the preferences of all users.
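To make the idea concrete, here is a minimal in-memory sketch of item-based filtering in C#. It is an illustration only, not Mahout's implementation: it scores items by simple co-occurrence counts weighted by the user's play counts, whereas the Mahout job is distributed and supports other similarity measures. The ItemBasedSketch class, its method names, and the sample play counts are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: a tiny in-memory item-based recommender.
public static class ItemBasedSketch
{
    // For every ordered pair of distinct items, counts how many users have both.
    public static Dictionary<int, Dictionary<int, int>> CoOccurrence(
        Dictionary<int, Dictionary<int, double>> prefsByUser)
    {
        var result = new Dictionary<int, Dictionary<int, int>>();
        foreach (var prefs in prefsByUser.Values)
        {
            foreach (var a in prefs.Keys)
            {
                foreach (var b in prefs.Keys)
                {
                    if (a == b) continue;
                    Dictionary<int, int> row;
                    if (!result.TryGetValue(a, out row))
                    {
                        row = new Dictionary<int, int>();
                        result[a] = row;
                    }
                    int count;
                    row.TryGetValue(b, out count); // count stays 0 when b is new
                    row[b] = count + 1;
                }
            }
        }
        return result;
    }

    // Scores each unseen item as the sum over the user's items of
    // (co-occurrence with that item) * (user's preference for that item).
    public static List<KeyValuePair<int, double>> Recommend(
        Dictionary<int, Dictionary<int, double>> prefsByUser, int user, int howMany)
    {
        var co = CoOccurrence(prefsByUser);
        var seen = prefsByUser[user];
        var scores = new Dictionary<int, double>();
        foreach (var pref in seen)
        {
            Dictionary<int, int> row;
            if (!co.TryGetValue(pref.Key, out row)) continue;
            foreach (var other in row)
            {
                if (seen.ContainsKey(other.Key)) continue; // skip items already played
                double score;
                scores.TryGetValue(other.Key, out score);
                scores[other.Key] = score + other.Value * pref.Value;
            }
        }
        return scores.OrderByDescending(s => s.Value)
                     .ThenBy(s => s.Key)
                     .Take(howMany)
                     .ToList();
    }

    public static void Main()
    {
        // Hypothetical play counts: user -> (song -> number of plays).
        var prefs = new Dictionary<int, Dictionary<int, double>>
        {
            { 1, new Dictionary<int, double> { { 10, 3 }, { 20, 1 } } },
            { 2, new Dictionary<int, double> { { 10, 2 }, { 20, 4 }, { 30, 5 } } },
            { 3, new Dictionary<int, double> { { 20, 1 }, { 30, 2 } } }
        };
        foreach (var rec in Recommend(prefs, 1, 5))
            Console.WriteLine("song {0}: score {1}", rec.Key, rec.Value);
    }
}
```

In this sample, user 1 has not played song 30, but song 30 co-occurs with both songs the user did play, so it is the only recommendation returned.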





In this example, users interact with items (songs). Each user's preference for a song is expressed by the number of times the user played it. Sample data is available on the Echo Nest Taste Profile Subset web page.





Fig. 1. Sample data from the Million Song Dataset archive





To use a dataset with Mahout, two tasks are required.





  1. Convert identifiers of songs and users to integer values.
  2. Save the new values with their ratings to a comma-delimited file.


Start Visual Studio 2010 and select File -> New Project. In the Installed Templates pane, under Visual C#, select the Windows category, and then select Console Application from the list. Name the project ConvertToMahoutInput.





Fig. 2. Creating a console application





After creating the application, open the Program.cs file and add the following static members to the Program class.





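// Program.cs also needs "using System.IO;" at the top for the file access below.
const char tab = '\u0009'; // field separator in train_triplets.txt

// Original string identifiers mapped to sequential integers.
static Dictionary<string, int> usersMapping = new Dictionary<string, int>();
static Dictionary<string, int> songMapping = new Dictionary<string, int>();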


Then add the following code to the Main method.





const string outputFile = "mInput.txt";
const int maxLines = 5000; // keep the sample small

int written = 0;
using (var reader = new StreamReader(args[0]))
using (var writer = new StreamWriter(outputFile)) // overwrites an existing file
{
    string line;
    while (written < maxLines && !string.IsNullOrWhiteSpace(line = reader.ReadLine()))
    {
        // Each input line: user id <tab> song id <tab> play count
        var fields = line.Split(tab);
        int user = GetUser(fields[0]);
        int song = GetSong(fields[1]);
        writer.WriteLine("{0},{1},{2}", user, song, fields[2]);
        written++;
    }
}
Console.WriteLine("saved {0} lines to {1}", written, outputFile);

SaveMapping(usersMapping, "usersMap.csv");
SaveMapping(songMapping, "songMapping.csv");
Console.WriteLine("Mapping saved");
Console.ReadKey();


Now create the GetUser and GetSong functions to convert identifiers to integers.





static int GetUser(string user)
{
    // Assign the next sequential integer the first time a user is seen.
    if (!usersMapping.ContainsKey(user))
        usersMapping.Add(user, usersMapping.Count + 1);
    return usersMapping[user];
}

static int GetSong(string song)
{
    // Same scheme for songs.
    if (!songMapping.ContainsKey(song))
        songMapping.Add(song, songMapping.Count + 1);
    return songMapping[song];
}


Finally, add the SaveMapping helper method, which saves the mapping dictionaries to CSV files.





static void SaveMapping(Dictionary<string, int> mapping, string fileName)
{
    // One "originalId,integerId" pair per line; an existing file is overwritten.
    using (var writer = new StreamWriter(File.Open(fileName, FileMode.Create)))
    {
        foreach (var entry in mapping)
        {
            writer.Write(entry.Key);
            writer.Write(',');
            writer.WriteLine(entry.Value);
        }
    }
}
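The mapping files are what make it possible to translate integer identifiers back to the original song and user identifiers later. A minimal sketch of that reverse lookup follows. It assumes the "originalId,integerId" line layout written by SaveMapping; the MappingReader class and the LoadReverseMapping method are hypothetical names introduced here for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class MappingReader
{
    // Reads "originalId,integerId" lines and returns integerId -> originalId.
    public static Dictionary<int, string> LoadReverseMapping(TextReader reader)
    {
        var reverse = new Dictionary<int, string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Trim().Length == 0) continue; // skip blank lines

            // Split at the last comma so the integer is always the final field.
            int comma = line.LastIndexOf(',');
            if (comma < 0) continue; // skip lines without a separator

            string original = line.Substring(0, comma);
            int id = int.Parse(line.Substring(comma + 1));
            reverse[id] = original;
        }
        return reverse;
    }

    public static void Main(string[] args)
    {
        // Defaults to the song mapping file name used by the converter.
        string path = args.Length > 0 ? args[0] : "songMapping.csv";
        using (var reader = new StreamReader(path))
        {
            var songs = LoadReverseMapping(reader);
            Console.WriteLine("Loaded {0} song identifiers", songs.Count);
        }
    }
}
```

Taking a TextReader rather than a file name keeps the method easy to try out against an in-memory string before pointing it at the real files.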


Now download the sample data from the Echo Nest Taste Profile Subset web page. After downloading, open the train_triplets.txt.zip archive and extract the train_triplets.txt file.





The utility takes the location of the train_triplets.txt file as a command-line argument. To set it, right-click the ConvertToMahoutInput project node in Solution Explorer and select Properties from the context menu. On the project properties page, enter the path to the train_triplets.txt file in the Command line arguments text field.





Fig. 3. Setting the command-line argument





To start the program, press F5. When it finishes, open the bin\Debug folder under the location where the project was saved and review the files the utility produced.





Fig. 4. Result of running the ConvertToMahoutInput utility



Running a Mahout job



Open the Hadoop Cluster Portal at https://www.hadooponazure.com and click the Remote Desktop icon.





Fig. 5. Remote Desktop icon





Pack the mInput.txt file from the bin\Debug folder into a Zip archive and copy it to the c:\ root folder on the remote cluster. After copying, extract the file from the archive.





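Now create a file containing the ID of the user for whom recommendations will be generated. To do this, create a text file named users.txt in the c:\ root folder and write the identifier of one user in it. Use the integer identifier assigned by the conversion utility (see usersMap.csv), because that is the form in which users appear in mInput.txt.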





Note. To create recommendations for several users, put each identifier on a separate line.





Then upload the mInput.txt and users.txt files to HDFS. To do this, open the Hadoop Command Shell and run the following commands.





hadoop fs -copyFromLocal c:\mInput.txt input/mInput.txt

hadoop fs -copyFromLocal c:\users.txt input/users.txt





Now you can run the job with the following command:





hadoop jar c:\Apps\dist\mahout\mahout-core-0.5-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob --input=input/mInput.txt --output=output --usersFile=input/users.txt





The Mahout job takes several minutes to run, after which an output file is created. Run the following command to get a local copy of the output file.





hadoop fs -copyToLocal output/part-r-00000 c:\output.txt





Open the output.txt file from the c:\ root folder and examine its contents. The file has the following structure.





user [song:rating,song:rating,...]
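A file in this format is easy to read programmatically. The sketch below parses one line into a user identifier and a list of (song, rating) pairs. It assumes the user and the bracketed list are separated by whitespace and that the pairs are comma-separated, and it tolerates spaces around the separators. The RecommendationParser class and the sample line are hypothetical and do not come from a real job run.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class RecommendationParser
{
    // Parses "user [song:rating,song:rating,...]" into the user id and its recommendations.
    public static KeyValuePair<int, List<KeyValuePair<int, double>>> ParseLine(string line)
    {
        int open = line.IndexOf('[');
        int close = line.LastIndexOf(']');
        if (open < 0 || close < open)
            throw new FormatException("Expected 'user [song:rating,...]': " + line);

        int user = int.Parse(line.Substring(0, open).Trim(), CultureInfo.InvariantCulture);
        var items = new List<KeyValuePair<int, double>>();
        string body = line.Substring(open + 1, close - open - 1);

        foreach (var pair in body.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (pair.Trim().Length == 0) continue; // tolerate an empty list such as "[ ]"

            string[] parts = pair.Split(':');
            if (parts.Length != 2)
                throw new FormatException("Expected 'song:rating': " + pair);

            int song = int.Parse(parts[0].Trim(), CultureInfo.InvariantCulture);
            // Invariant culture so that "4.5" parses the same way on any machine.
            double rating = double.Parse(parts[1].Trim(), CultureInfo.InvariantCulture);
            items.Add(new KeyValuePair<int, double>(song, rating));
        }
        return new KeyValuePair<int, List<KeyValuePair<int, double>>>(user, items);
    }

    public static void Main()
    {
        // Hypothetical sample line, not taken from a real job run.
        var parsed = ParseLine("1\t[30:4.5,42:3.0]");
        Console.WriteLine("user {0}", parsed.Key);
        foreach (var item in parsed.Value)
            Console.WriteLine("  song {0}, rating {1}",
                item.Key, item.Value.ToString(CultureInfo.InvariantCulture));
    }
}
```

The song identifiers in the output are the integers written to mInput.txt, so the songMapping.csv file produced by the conversion utility is needed to translate them back to the original song identifiers.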





Conclusion



Recommendation systems are an important feature of many social networks, multimedia streaming services, online stores, and other websites. Mahout offers a ready-made recommendation engine that is easy to use, has many useful features, and scales on the Hadoop platform.





You can combine Hadoop data processing with Apache Mahout and take advantage of cloud scaling on the Windows Azure platform. Try windowsazure.com/ru-ru and www.hadooponazure.com today.

Source: https://habr.com/ru/post/150359/


