
Automatic recommendation: some theory and practice

1. Introduction



This article covers some basic theoretical and practical questions of automatic recommendation. Particular attention is paid to my experience of using Apache Mahout on large portals (written in Yii 2) with high traffic (several million visitors per day). Source code examples in PHP and Java should help the reader better understand how Mahout is integrated.



2. Robustness assessment



First of all, we must make sure that the results are not distorted by noise. As a rule, the easiest way to recommend an item is to rank items by their average user rating: the higher the average, the more likely the item is to be recommended. However, even such a simple approach has a very important pitfall: outliers hurt accuracy. Let's look at a couple of examples. Suppose that a user can rate a product (or another object) on a scale from 1 to 10, and that we have a finite set of user ratings for a certain object: 5, 7, 4, 8, 6, 5, 10, 5, 10, 9, 10, 10, 10, 5, 2, 10. To make them easier to take in, here they are as a chart:







The set contains 16 ratings. In this particular case there is no significant difference between the median (7.5) and the arithmetic mean (7.25), and the sample variance is only 7.267. In some cases, though, we may need to filter out outliers (for example, when the range of values can be large). Naturally, we can use robust statistics for that. Take the set 1, 2, 1, 2, 3, 1, 5, 100, 4. Its graphical representation:





In this example the arithmetic mean (about 13.2) clearly does not reflect the real situation, whereas the median is 2. Moreover, the sample variance is clearly large (slightly more than 1060). Using robust statistics (the median and quartiles) helps us avoid such problems.
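The figures quoted for both sets can be checked with a few lines of Java. This is only a minimal sketch: the class and method names are made up for illustration, the variance is assumed to be the unbiased sample variance (division by n - 1), which is what matches the numbers in the text, and the input arrays are assumed to be non-empty.

```java
import java.util.Arrays;

/** Mean, median and sample variance for a small, non-empty set of ratings. */
class RobustStats {

    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    /** Median: the middle element of a sorted copy, or the average of the two middle ones. */
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Unbiased sample variance (assumption: divides by n - 1). */
    static double sampleVariance(double[] values) {
        double m = mean(values);
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += (v - m) * (v - m);
        }
        return sumOfSquares / (values.length - 1);
    }

    public static void main(String[] args) {
        double[] ratings = {5, 7, 4, 8, 6, 5, 10, 5, 10, 9, 10, 10, 10, 5, 2, 10};
        double[] withOutlier = {1, 2, 1, 2, 3, 1, 5, 100, 4};
        for (double[] set : new double[][] {ratings, withOutlier}) {
            System.out.println("mean=" + mean(set)
                    + " median=" + median(set)
                    + " variance=" + sampleVariance(set));
        }
    }
}
```

For the second set the single rating of 100 drags the mean far away from the typical value, while the median stays put, which is exactly why a robust statistic is the safer ranking key here.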



3. Accounting for correlation



What is the most obvious way to measure correlation? Naturally, it is Pearson's linear correlation coefficient. Here is another small example. Suppose that two experts are asked to evaluate the quality of several sites, and each of them gives every site a rating. For example:







One glance at the chart is enough to see how similar the two experts' opinions are. In this example Pearson's linear correlation coefficient is 0.9684747092. Based on this, we can predict that the two experts are quite likely to give similar ratings to other sites as well. Knowing someone's tastes makes recommending easier, right? And could we rely only on the ratings of like-minded users rather than on everyone's?
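For reference, Pearson's coefficient is the covariance of the two rating series divided by the product of their standard deviations. The sketch below computes it in plain Java. The experts' actual ratings are not available in the text, so the two arrays in main are purely hypothetical and will not reproduce the coefficient quoted above; the class name is also an assumption.

```java
/** Pearson's linear correlation coefficient for two equally long rating series. */
class PearsonCorrelation {

    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("Need two series of equal length with at least 2 values");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double covariance = 0.0;
        double varianceX = 0.0;
        double varianceY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            covariance += dx * dy;
            varianceX += dx * dx;
            varianceY += dy * dy;
        }
        // The result is NaN if either series is constant (zero variance).
        return covariance / Math.sqrt(varianceX * varianceY);
    }

    public static void main(String[] args) {
        // Hypothetical ratings of five sites by two experts (not the data from the article).
        double[] expertA = {3, 5, 7, 8, 9};
        double[] expertB = {2, 6, 6, 9, 10};
        System.out.println("Pearson r = " + pearson(expertA, expertB));
    }
}
```

A value close to 1 means the experts rank sites in nearly the same way, a value close to -1 means their tastes are opposite, and a value near 0 means one expert's ratings say little about the other's.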



4. Automatic recommendation



Let's look at an example of how Apache Mahout, an interesting free library, works. Suppose that there are three objects. I rated only two of them (I gave the first object just two points and the second one five). Besides me, there are three more people who rated the objects, but unlike me they rated all three. Here is the table with all the ratings:







Indeed, if you pass this data to Mahout, it will recommend the third object to me. There is no point in recommending the first two objects, since I not only know about them already but have even rated them. Moreover, Mahout takes into account how similar my opinion is to that of the other three people: if I give the first object a very different rating (say, 10), Mahout will not recommend anything to me. Did I get that right? Let's check.
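The reasoning above can be reproduced without Mahout in a few dozen lines. The sketch below is a simplified stand-in for a user-based recommender, not Mahout's actual implementation: it measures similarity with Pearson correlation over the items both users rated, keeps neighbours whose similarity reaches a threshold, and estimates a rating as the similarity-weighted average of the neighbours' ratings. The ratings of the three other users are hypothetical, because the original table is not available in the text; only my own ratings (2 and 5) come from the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** A simplified user-based recommender; an illustration only, not Mahout itself. */
class UserBasedSketch {

    /** Pearson correlation over the items both users rated; NaN when it is undefined. */
    static double similarity(Map<Integer, Double> a, Map<Integer, Double> b) {
        List<Integer> common = new ArrayList<>();
        for (Integer item : a.keySet()) {
            if (b.containsKey(item)) common.add(item);
        }
        int n = common.size();
        if (n < 2) return Double.NaN;
        double meanA = 0, meanB = 0;
        for (Integer item : common) {
            meanA += a.get(item) / n;
            meanB += b.get(item) / n;
        }
        double cov = 0, varA = 0, varB = 0;
        for (Integer item : common) {
            double da = a.get(item) - meanA;
            double db = b.get(item) - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        return cov / Math.sqrt(varA * varB); // NaN if one user rated all common items equally
    }

    /** Returns item ID -> estimated rating for items the user has not rated, best first. */
    static Map<Integer, Double> recommend(Map<Integer, Map<Integer, Double>> ratings,
                                          int userId, double threshold) {
        Map<Integer, Double> mine = ratings.get(userId);
        Map<Integer, Double> weightedSum = new HashMap<>();
        Map<Integer, Double> weightTotal = new HashMap<>();
        for (Map.Entry<Integer, Map<Integer, Double>> other : ratings.entrySet()) {
            if (other.getKey() == userId) continue;
            double sim = similarity(mine, other.getValue());
            if (Double.isNaN(sim) || sim < threshold) continue; // not a neighbour
            for (Map.Entry<Integer, Double> rating : other.getValue().entrySet()) {
                if (mine.containsKey(rating.getKey())) continue; // already rated by the user
                weightedSum.merge(rating.getKey(), sim * rating.getValue(), Double::sum);
                weightTotal.merge(rating.getKey(), sim, Double::sum);
            }
        }
        List<Integer> items = new ArrayList<>(weightedSum.keySet());
        items.sort((x, y) -> Double.compare(weightedSum.get(y) / weightTotal.get(y),
                                            weightedSum.get(x) / weightTotal.get(x)));
        Map<Integer, Double> result = new LinkedHashMap<>();
        for (Integer item : items) {
            result.put(item, weightedSum.get(item) / weightTotal.get(item));
        }
        return result;
    }

    /** Builds one user's ratings for items 1, 2, 3, ... in order. */
    static Map<Integer, Double> row(double... values) {
        Map<Integer, Double> result = new HashMap<>();
        for (int i = 0; i < values.length; i++) result.put(i + 1, values[i]);
        return result;
    }

    /** User 1 is "me"; the ratings of users 2 to 4 are hypothetical. */
    static Map<Integer, Map<Integer, Double>> sampleRatings(double myFirstRating) {
        Map<Integer, Map<Integer, Double>> ratings = new HashMap<>();
        ratings.put(1, row(myFirstRating, 5));
        ratings.put(2, row(3, 6, 9));
        ratings.put(3, row(1, 4, 8));
        ratings.put(4, row(2, 5, 10));
        return ratings;
    }

    public static void main(String[] args) {
        System.out.println(recommend(sampleRatings(2), 1, 0.1));  // item 3 is recommended
        System.out.println(recommend(sampleRatings(10), 1, 0.1)); // nothing is recommended
    }
}
```

With my ratings of 2 and 5, all three hypothetical users agree with me (they also prefer the second object to the first), so the third object is recommended. If I rate the first object 10 instead, the correlation with each of them becomes negative, nobody passes the 0.1 threshold, and the result is empty.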



The MySQLJDBCDataModel class reads data from a MySQL database (the table should contain the columns user_id, item_id and preference), while the FileDataModel class loads data from a file (a CSV file where each line looks like "1,1,2.0"; empty lines are ignored). In theory, the Yii application should know nothing at all about how recommendations are computed: it simply takes the necessary data from the database (a link table of user ID and recommended object ID) and displays it to the user.



I have worked on quite a lot of tasks for high-load Yii sites (several million visitors per day), including integration with various analytics systems and search platforms. Naturally, I often had to dig through a huge number of Java projects, but this was the first time I integrated Mahout.



Of course, there are many ways to exchange data, from direct uploads from external systems (export into the site database using Hibernate) to queues (Gearman, RabbitMQ). I have seen some funny cases where data was obtained by scraping sites with jsoup and even the very slow PhantomJS, and sometimes it was exported from Excel with the help of Apache POI. But let's not talk about sad things.



By the way, storage options are no less varied: from MongoDB to search engines (Endeca, Solr, Sphinx, even the wonders built into ATG). Of course, such options have a right to exist and are quite often used by huge projects, but in this article I would like to consider a more common setup.



Suppose we have a Yii website with several million visitors per day. A MySQL cluster is used as the database (memcached takes on all the burdens and hardships). The application has no write access to the database; data is passed exclusively through an API (to a Redis cluster), from where it is collected (thanks to the free google-gson and Jedis libraries) by an analytics system written in Java. It was to this system that the Mahout library was added.



But I want to get not just a list of identifiers but data that is ready for the widget. What do I need? Suppose I want to display a picture. I also need a title. And, of course, I need a link to the recommended object (the page the user lands on after clicking the widget). This is a universal option: in the system responsible for the export, I can add whatever logic I need to fill this table. In that case, the table structure might look something like this:



    use yii\db\Schema;
    use yii\db\Migration;

    class m150923_110338_recommend extends Migration
    {
        public function up()
        {
            $this->createTable('recommend', [
                'id' => $this->primaryKey(),
                'status' => $this->boolean()->notNull(),
                'url' => $this->string(255)->notNull(),
                'title' => $this->string(255)->notNull(),
                'image' => $this->string(255)->notNull(),
                'created_at' => $this->datetime()->notNull(),
                'updated_at' => $this->datetime()->notNull(),
            ]);
        }

        public function down()
        {
            $this->dropTable('recommend');
        }
    }


The model should have a method that tells us which entities to recommend to a given user. Mahout will suggest some of the recommended objects. Of course, from the very beginning we should allow for the situation where Mahout cannot recommend anything (or recommends too few items). The model might look something like this:



    namespace common\models;

    use Yii;
    use common\models\Api;

    /**
     * This is the model class for table "recommend".
     *
     * @property integer $id
     * @property integer $status
     * @property string $url
     * @property string $title
     * @property string $image
     * @property string $created_at
     * @property string $updated_at
     */
    class Recommend extends \yii\db\ActiveRecord
    {
        const STATUS_INACTIVE = 0;
        const STATUS_ACTIVE = 1;

        /**
         * @inheritdoc
         */
        public static function tableName()
        {
            return 'recommend';
        }

        /**
         * @inheritdoc
         */
        public function rules()
        {
            return [
                // created_at and updated_at are filled by TimestampBehavior after
                // validation, so they must not be listed as required here.
                [['status', 'url', 'title', 'image'], 'required'],
                [['status'], 'integer'],
                [['created_at', 'updated_at'], 'safe'],
                [['url', 'title', 'image'], 'string', 'max' => 255]
            ];
        }

        /**
         * @inheritdoc
         */
        public function attributeLabels()
        {
            return [
                'id' => 'ID',
                'status' => 'Status',
                'url' => 'URL',
                'title' => 'Title',
                'image' => 'Image',
                'created_at' => 'Created at',
                'updated_at' => 'Updated at',
            ];
        }

        /**
         * @inheritdoc
         */
        public function behaviors()
        {
            return [
                [
                    'class' => \yii\behaviors\TimestampBehavior::className(),
                    'value' => new \yii\db\Expression('NOW()'),
                ],
            ];
        }

        /**
         * Status list
         */
        public function statusList()
        {
            return [
                self::STATUS_INACTIVE => 'Inactive',
                self::STATUS_ACTIVE => 'Active',
            ];
        }

        /**
         * @param integer $userId
         * @param integer $limit
         * @return Recommend[] active records recommended to the user
         */
        public static function getItemsByUserId($userId = 1, $limit = 6)
        {
            $itemIds = [];

            // The get method of the Api class returns the JSON-decoded response,
            // which on success contains the IDs of Recommend records picked by Mahout.
            $mahout = Api::get('s=mahout&order=value&limit=' . (int)$limit . '&user=' . (int)$userId);
            if (!empty($mahout['status']) && $mahout['status'] == true) {
                $itemIds = $mahout['item-ids'];
            }

            if (count($itemIds) < $limit) {
                // Mahout returned too few items, so the list is topped up
                // from a fallback source of recommendations.
                $limit = $limit - count($itemIds);
                $recommend = Api::get('s=recommend&limit=' . (int)$limit . '&user=' . (int)$userId);
                if (!empty($recommend['status']) && $recommend['status'] == true) {
                    $itemIds = array_merge($itemIds, $recommend['item-ids']);
                }
            }

            return static::find()->where(['id' => $itemIds, 'status' => static::STATUS_ACTIVE])->all();
        }
    }


The controller is not complicated either:



    namespace frontend\controllers;

    use Yii;
    use yii\web\Controller;
    use common\models\Recommend;

    class MainController extends Controller
    {
        private $_itemsLimit = 6;
        private $_cacheTime = 120;

        public function actionIndex()
        {
            $userId = Yii::$app->request->cookies->getValue('userId', 1);

            $recommends = Recommend::getDb()->cache(function ($db) use ($userId) {
                return Recommend::getItemsByUserId($userId, $this->_itemsLimit);
            }, $this->_cacheTime);

            return $this->render('index', ['recommends' => $recommends]);
        }
    }


And here is the view:



    <?php
    use yii\helpers\Html;

    $this->title = 'Example';
    $this->params['breadcrumbs'][] = $this->title;
    ?>
    <h3>Recommended:</h3>
    <div class="row">
        <?php foreach ($recommends as $recommend) { ?>
            <div class="col-md-2">
                <a href="<?= Html::encode($recommend->url) ?>" target="_blank">
                    <img src="<?= Html::encode($recommend->image) ?>" class="img-thumbnail" alt="<?= Html::encode($recommend->title) ?>">
                </a>
            </div>
        <?php } ?>
    </div>


The prototype is ready. What remains is to move the necessary code into the real system. I was due to start the task on Monday, so on Saturday I decided to try Mahout on my home computer. Reading a pile of books is good, but practice matters too. In a few minutes I sketched a simple Java application that takes data from a CSV file and writes the result as JSON.



The interface asks us to implement just one method, which returns JSON. In this particular case, we need to pass the path to the CSV data file and the list of identifiers of the users who need recommendations:



    package com.api.service;

    import java.util.List;

    public interface IService {
        String run(String datasetFile, List<Integer> userIds);
    }


Next, create a factory:



    package com.api.service;

    public class ServiceFactory {

        /**
         * Get Service
         *
         * @param type service name (case-insensitive), for example "Mahout"
         * @return the matching service, or null if the type is null or unknown
         */
        public IService getService(String type) {
            if (type == null) {
                return null;
            }
            if (type.equalsIgnoreCase("Mahout")) {
                return new MahoutService();
            }
            return null;
        }
    }


As an example, I will get a list of recommended objects for each user in the list of identifiers:



    package com.api.service;

    import java.io.IOException;
    import java.util.List;

    import org.apache.mahout.cf.taste.common.TasteException;

    import com.api.model.CustomUserRecommender;
    import com.api.util.MahoutHelper;
    import com.google.gson.Gson;
    import com.google.gson.GsonBuilder;

    public class MahoutService implements IService {

        @Override
        public String run(String datasetFile, List<Integer> userIds) {
            Gson gson = new GsonBuilder().create();
            MahoutHelper mahoutHelper = new MahoutHelper();
            List<CustomUserRecommender> customUserRecommenders = null;
            try {
                customUserRecommenders = mahoutHelper.customUserRecommender(userIds, datasetFile);
            } catch (IOException | TasteException e) {
                // On failure the list stays null and is serialized as JSON null.
                e.printStackTrace();
            }
            return gson.toJson(customUserRecommenders);
        }
    }


And here is the helper class itself:



    package com.api.util;

    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.UserBasedRecommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    import com.api.model.CustomUserRecommender;

    public class MahoutHelper {

        /**
         * @param userIds identifiers of the users who need recommendations
         * @param datasetFile path to the CSV file with the ratings
         * @return one CustomUserRecommender per requested user
         * @throws IOException
         * @throws TasteException
         */
        public List<CustomUserRecommender> customUserRecommender(List<Integer> userIds, String datasetFile)
                throws IOException, TasteException {
            List<CustomUserRecommender> customUserRecommenders = new ArrayList<CustomUserRecommender>();

            // Ratings are loaded from CSV; users are compared by Pearson correlation,
            // and only users with a similarity of at least 0.1 count as neighbours.
            DataModel datamodel = new FileDataModel(new File(datasetFile));
            UserSimilarity usersimilarity = new PearsonCorrelationSimilarity(datamodel);
            UserNeighborhood userneighborhood = new ThresholdUserNeighborhood(0.1, usersimilarity, datamodel);
            UserBasedRecommender recommender =
                    new GenericUserBasedRecommender(datamodel, userneighborhood, usersimilarity);

            // Up to 10 recommended items per user.
            for (Integer userId : userIds) {
                customUserRecommenders.add(new CustomUserRecommender(userId, recommender.recommend(userId, 10)));
            }
            return customUserRecommenders;
        }
    }


In the real project, the Mahout library was added to an existing system (a turnkey solution). As I mentioned, an API was chosen as the data transfer method. As practice has shown, adding recommendations to key pages (for example, the product page) has a very good effect on conversion. Quite often a personal selection of recommended objects is also sent by e-mail, for example once a week.



If possible, try to put a small form on every page that asks the user how interesting and useful a given product is to them. At a minimum, you can offer two buttons (“+” and “-”). Such a dichotomous classification is usually converted into numerical ratings (preferably 2 and 10, so that the difference is more pronounced). Try to motivate people to leave ratings: the more ratings there are, the easier it is to give an accurate recommendation. You can also take orders into account (if someone bought a product, they presumably rate it highly). Just be very careful to guard against all kinds of manipulation. And keep verifying the data you get with a series of experiments (A/B tests).
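One possible way to turn such votes into input for FileDataModel is sketched below. It is only an illustration under stated assumptions: the class and method names are invented, the values 2 and 10 follow the suggestion above, and treating a purchase as equal to a “+” vote is one possible choice rather than a rule.

```java
import java.util.Locale;

/** Converts "+" and "-" votes and purchases into "userId,itemId,preference" CSV lines. */
class PreferenceExport {

    static final double DISLIKE = 2.0;
    static final double LIKE = 10.0;
    /** Assumption: a purchase counts the same as an explicit "+" vote. */
    static final double PURCHASE = LIKE;

    /** Maps a "+" or "-" vote to a numeric preference. */
    static double toPreference(String vote) {
        switch (vote) {
            case "+":
                return LIKE;
            case "-":
                return DISLIKE;
            default:
                throw new IllegalArgumentException("Unknown vote: " + vote);
        }
    }

    /** Formats one line in the form used in the article, for example "1,1,2.0". */
    static String toCsvLine(long userId, long itemId, double preference) {
        // Locale.ROOT keeps the decimal separator a dot regardless of system settings.
        return String.format(Locale.ROOT, "%d,%d,%.1f", userId, itemId, preference);
    }

    public static void main(String[] args) {
        System.out.println(toCsvLine(1, 1, toPreference("-")));
        System.out.println(toCsvLine(1, 2, toPreference("+")));
        System.out.println(toCsvLine(2, 3, PURCHASE));
    }
}
```

Keeping the mapping in one place makes it easy to change the numeric values later and compare the variants in an A/B test.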



I do not want to state the obvious, but the opinion of the majority is not always objectively correct. For example, a very beautiful 25-year-old woman may still worry because of insecurities she picked up in childhood. Some guys may firmly believe in NLP and hypnosis as an effective way of seducing girls. Even a kind old grandmother may treat her grandson's wound with an alcohol solution of brilliant green, although Miramistin would clearly be the more reasonable choice. The list could go on for a very long time. Ideally, you should add manual filtering of obviously poor recommendations (if other sites are being rated) or tighten quality control (if objects on your own site are being rated).

Source: https://habr.com/ru/post/267963/


