When a recommendation system works with a large amount of content, the main task is not to filter that content but to rank it. Take news: hundreds of thousands of articles are published every day, and thousands of them may touch on the interests of any given reader. Yet most users read no more than 5-10 articles per day (according to News360 data). Which articles should be shown first?
News360 has been looking for the answer to this question for the third year running. We have found many different answers along the way, but this year we decided to abandon a concept that had been fundamental in all the previous years.
In this article I will try to explain in simple terms why News360 spent several years building and refining a system that clusters articles into events and ranks those events, and then discarded that approach in favor of a different one. I will also say a little about how News360 works, what is under the hood, and where to read more about it.

The main task of News360 is to solve the problem of information overload through personalization. Imagine a modern user who uses social networks, reads the news, tries to follow several publications in order to get an objective picture, and, of course, reads the blogs of friends and other interesting people. Such a user either spends hours keeping up with all the news or misses some of it. And with each new blog or new friend on a social network, the feeling grows that more and more interesting things are slipping out of sight. For such a user, News360 selects the most popular articles from the chosen sources and the most discussed items from social networks, adds new content on the topics the user reads about most, and arranges everything into one convenient, good-looking feed.
Or imagine another modern user: someone who is interested in everything except politics and who likes to cook. News360 can then show this user the most interesting non-political world news and add the latest and most popular culinary blogs and publications to the feed.
To compare the personalized and non-personalized approaches to organizing a news feed, let's look at several different feeds:

CNN shows a feed that is not personalized: a collection of the most important and popular news.

Prismatic shows my personalized news feed, which contains nothing except what may interest me professionally. There is also a separate column with the biggest world news. (I had to cut the images out of this screenshot to fit more articles.)

The main screen of News360 tries to combine all the news that may interest me: the most important world news, things I may find curious, and things directly related to my interests.
To solve this problem, the news that enters the system from the Internet is analyzed to extract additional information:
- The system recognizes named entities in the text, such as the main participants in the event (people, companies, brands) and the places where the event occurs. To do this, we implemented an algorithm based on a grammatical approach to searching for entity templates in the text, inspired by the work “Named Entity Recognition: A Local Grammar-based Approach”.
- News360 classifies news using several different approaches. To classify articles into popular categories such as Sport, Business, or Politics, a support vector machine is used: a libsvm-based classifier is trained on our own labeled corpus of 1000 media articles, in which the keywords extracted from the articles are weighted using TF-IDF.
- To distinguish smaller and narrower topics, a simple rule-based classifier is used, similar to the one in Oracle Rule-Based Classification and in “A Comparison of Classification Techniques for Technical Text Passages”.
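As a rough illustration of the TF-IDF keyword weighting mentioned in the list above, here is a minimal sketch. The toy corpus and the function name `tf_idf` are hypothetical, and the formula is the textbook variant (term frequency times the logarithm of inverse document frequency), which is not necessarily the exact variant used in News360. Vectors of this kind would then be handed to an SVM trainer such as libsvm.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (textbook TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus: each document is a list of extracted keywords.
docs = [
    ["match", "goal", "team", "goal"],
    ["market", "shares", "team"],
    ["election", "vote", "market"],
]
vectors = tf_idf(docs)
```

In this toy corpus, "goal" occurs twice and only in the first document, so it gets a higher weight there than "team", which also appears in the second document.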
For simplicity, we call named entities, topics, categories, and all other knowledge about an article its tags. News360 expresses the interests of the user as the same kind of tags, either by analyzing the articles the user likes or when the user states those interests explicitly:

To further optimize the news feed, News360 groups articles from different sources that cover the same thing, so that the user does not see repeats in the main feed but, once immersed in a story, can choose which point of view to read. This content clustering is performed by a dedicated mechanism using a graph-based algorithm close to the Minimum Cut Trees method.

Graph of News360 articles, with clusters highlighted in color.
As a result of clustering, each cluster is more than a set of related articles: it is a collection of articles from different sources about the same event. News360 shows the user these events instead of individual articles, which hides repetitions in the user's news feed and makes it possible to take the resonance of an event into account when ranking.
Cluster attributes are computed from the attributes of the articles that fall into the cluster: for example, the most recent article, the most frequent tags, and a unique attribute, the resonance of the event, which shows how actively the event is being discussed in the world.
- CIA faulted for choosing Amazon over cloud on contract
  - Resonance: 5
  - Tags: CIA, Amazon, IBM, Cloud Computing
  - The most recent article: Physorg.com, 06/18/2013
- Global IT Spend Projected to Increase
  - Resonance: 5
  - Tags: Gartner, Cloud Computing
  - The most recent article: The Strategic Sourceror, 06/19/2013
- What I Use: Office 365 and SkyDrive Pro
  - Resonance: 1
  - Tags: Office 365, SkyDrive, Cloud Computing, Gadgets
  - The most recent article: WinSuperSite, 06/20/2013
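The way cluster attributes are derived from member articles can be sketched as follows. This is only an illustration under stated assumptions: the function name `cluster_attributes`, the dictionary layout, and the sample articles ("Source A" and so on) are hypothetical, and resonance is counted as the number of distinct sources in the cluster.

```python
from collections import Counter
from datetime import date

def cluster_attributes(articles, top_tags=4):
    """Derive cluster-level attributes from the member articles."""
    latest = max(articles, key=lambda a: a["date"])
    tag_counts = Counter(tag for a in articles for tag in a["tags"])
    return {
        # Resonance: how many distinct sources wrote about the event.
        "resonance": len({a["source"] for a in articles}),
        # The most frequently encountered tags across the cluster.
        "tags": [tag for tag, _ in tag_counts.most_common(top_tags)],
        # The most recent article, as (source, date).
        "latest": (latest["source"], latest["date"]),
    }

# Hypothetical cluster of three articles about one event.
articles = [
    {"source": "Source A", "date": date(2013, 6, 17), "tags": ["CIA", "Amazon"]},
    {"source": "Source B", "date": date(2013, 6, 18), "tags": ["CIA", "Amazon", "IBM"]},
    {"source": "Source C", "date": date(2013, 6, 16), "tags": ["CIA", "Cloud Computing"]},
]
attrs = cluster_attributes(articles, top_tags=2)
```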
Clusters are the basic units of information in News360; they are what the system recommends.
When a user reads an article, shares it, or likes it, the system tries to work out what the user liked about that article. In this way the system learns about each user, builds up the user's “portrait”, and uses this portrait to select the news it considers most interesting for that user. For example, my portrait looks like this (I removed the long tail of rare tags for readability):
Tag | Weight |
Cloud Computing | 0.95 |
William Gates III | 0.72 |
Steve Jobs | 0.62 |
Microsoft | 0.44 |
Music | 0.40 |
IBM | 0.24 |
Startups | 0.18 |
Richard Branson | 0.17 |
Business | 0.17 |
Small business | 0.16 |
Entertainment | 0.16 |
The weight is the system's confidence that a topic will interest me, for example 0.95 for Cloud Computing. It is calculated from how actively the user “interacts” with that topic. For example, by reading articles about Cloud Computing, I increase the weight of this topic in my profile.
Based on the portrait, the system takes the relevant clusters of information and decides which ones to show me.
All three stories above are about cloud technologies, which means that, according to my portrait, they may interest me. The process of content personalization can therefore be represented as follows:

That is, the system filters content by user interests.
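A minimal sketch of this filtering step, assuming that a cluster is relevant when it shares at least one tag with the portrait. The function name `filter_clusters` and the fourth, sports-related cluster are hypothetical and are there only to show something being filtered out.

```python
def filter_clusters(clusters, portrait):
    """Keep the clusters that share at least one tag with the portrait."""
    return [c for c in clusters if portrait.keys() & set(c["tags"])]

# Two entries from the portrait above.
portrait = {"Cloud Computing": 0.95, "IBM": 0.24}

clusters = [
    {"title": "CIA faulted for choosing Amazon over cloud on contract",
     "tags": ["CIA", "Amazon", "IBM", "Cloud Computing"]},
    {"title": "Global IT Spend Projected to Increase",
     "tags": ["Gartner", "Cloud Computing"]},
    {"title": "What I Use: Office 365 and SkyDrive Pro",
     "tags": ["Office 365", "SkyDrive", "Cloud Computing", "Gadgets"]},
    # Hypothetical cluster with no tags from the portrait.
    {"title": "Hypothetical sports story", "tags": ["Sport"]},
]
visible = filter_clusters(clusters, portrait)
```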
This approach spares the user from news that does not interest them, but with today's abundance of content it does not guarantee that the user will learn about all the most important things happening in their areas of interest. In other words, it does not solve the problem of information overload.
By introducing the concept of the “importance of a news item to the user,” we introduce a comparative characteristic (some news is more important for the user, some less), which makes it necessary to rank news by this characteristic individually for each user.
This technique is called “content-based recommendations” and is widely used by various products, such as the imdb.com recommendation system.
For each document, a set of attributes is extracted, and each attribute is weighted in relation to the user to determine how important the news may be for that user. For the example in this article, we will use the following parameters:
- Freshness of the content.
- The number of news tags that are in the user's portrait.
- The likelihood that the user will like news with the relevant tags (the weight from the portrait above).
- Resonance: the number of sources that covered this news, i.e. the number of sources whose articles are in the current cluster.
When I ask the system for news, the information known about me (my “portrait”) is used as the parameters of the request, as described, for example, in the article “Fab: Content-based, collaborative recommendation”. In this way both the article parameters and the user information are used to rank the news.
Let's try this on our data (assume, for example, that we are ranking on June 20, 2013):
- CIA faulted for choosing Amazon over cloud on contract
  - Freshness: two days old (0 points)
  - What the user likes: Cloud Computing (0.95), IBM (0.24) (2 points)
  - Resonance: 5 (2 points)
- Global IT Spend Projected to Increase
  - Freshness: one day old (1 point)
  - What the user likes: Cloud Computing (0.95) (0 points)
  - Resonance: 5 (2 points)
- What I Use: Office 365 and SkyDrive Pro
  - Freshness: published today (2 points)
  - What the user likes: Cloud Computing (0.95) (0 points)
  - Resonance: 1 (0 points)
For simplicity, I scored the news by sorting the items on each parameter and assigning 2 points for first place, 0 for third (last) place, and 1 for everything else.
In total, the news is ranked as follows:
- CIA faulted for choosing Amazon over cloud on contract
- Global IT Spend Projected to Increase
- What I Use: Office 365 and SkyDrive Pro
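The point-based cluster ranking from this example can be reproduced with a short sketch. It assumes one reading of the scoring rule: on each parameter the best value earns 2 points, the worst earns 0, anything else earns 1, and tied values share the same points. The names `place_points` and `rank_clusters` are hypothetical.

```python
from datetime import date

def place_points(values):
    """2 points for the best value, 0 for the worst, 1 for everything else."""
    best, worst = max(values), min(values)
    return [2 if v == best else 0 if v == worst else 1 for v in values]

def rank_clusters(clusters, liked_tags):
    fresh = place_points([c["latest"] for c in clusters])
    liked = place_points([len(liked_tags & set(c["tags"])) for c in clusters])
    reson = place_points([c["resonance"] for c in clusters])
    totals = [f + l + r for f, l, r in zip(fresh, liked, reson)]
    # Sort cluster indexes by total points, highest first.
    order = sorted(range(len(clusters)), key=lambda i: -totals[i])
    return [(clusters[i]["title"], totals[i]) for i in order]

liked_tags = {"Cloud Computing", "IBM"}  # tags taken from the portrait
clusters = [
    {"title": "CIA faulted for choosing Amazon over cloud on contract",
     "latest": date(2013, 6, 18), "resonance": 5,
     "tags": ["CIA", "Amazon", "IBM", "Cloud Computing"]},
    {"title": "Global IT Spend Projected to Increase",
     "latest": date(2013, 6, 19), "resonance": 5,
     "tags": ["Gartner", "Cloud Computing"]},
    {"title": "What I Use: Office 365 and SkyDrive Pro",
     "latest": date(2013, 6, 20), "resonance": 1,
     "tags": ["Office 365", "SkyDrive", "Cloud Computing", "Gadgets"]},
]
ranked = rank_clusters(clusters, liked_tags)
```

With this data the totals come out as 4, 3, and 2 points, which gives the same order as the list above.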
Here we ranked clusters (= stories = events). Ranking clusters has three clear advantages:
- first, the ranking immediately produces a feed that can be shown to the user;
- second, there are fewer elements to rank (one cluster contains many articles), so the work gets done faster;
- third, at no additional cost we get a parameter such as the resonance of the event (i.e. how many sources wrote about it).
But this approach has a problem that led us to move away from ranking clusters and start ranking individual articles: many of the cluster attributes we chose cannot be matched against the interests of the user.
For example, if there are five articles in a cluster, the resonance of the cluster is taken to be 5, but this does not mean that all five articles are interesting to the user. When ranking a cluster for a specific user, each parameter must take the user's interests into account. In this case, the resonance should be calculated from the number of articles in the cluster that mention the user's interests, not from the total number of articles in the cluster.
The same applies to the freshness of the cluster and to the choice of its main article (that is, which article to show the user on the “cover” of the cluster).
At the same time, the user needs to be shown stories (clusters), not articles: first, because the user does not want to see several different articles about the same thing in the feed, even if they come from different sources; second, because ranking requires a parameter such as the resonance of the event.
So we arrived at a system in which articles are ranked, the resonance of events is taken into account, and the user is shown stories.
Let's take all the articles shown in the graph above:
Title | Source | Tags | Date |
Does the CIA have a Wish List? | Time | CIA, Amazon | 06/18/2013 |
Amazon, IBM wrangle over CIA spying cloud contract | The austrian | IBM, CIA, Amazon | 06/18/2013 |
Report: The CIA Picked Amazon To Build Its Cloud | Finding Out About | CIA, Cloud Computing | 06/18/2013 |
Government IT Spending Globally To Decline | Silicononindia | Gartner, India | 06/19/2013 |
Forecast: Global state agencies spend on Information Technology (IT) this year | YoPakistan | Pakistan | 06/19/2013 |
CIA faulted for choosing Amazon over IBM | Physorg.com | IBM, CIA, Amazon | 06/18/2013 |
Report: The CIA Picked Amazon To Build Its Cloud | Business Insider | CIA, Amazon, Cloud Computing | 06/18/2013 |
What I Use: Office 365 and SkyDrive Pro | WinSuperSite | Office 365, SkyDrive, Cloud Computing, Gadgets | 06/20/2013 |
Global IT Spend Projected to Increase | The Strategic Sourceror | Gartner, World News | 06/19/2013 |
Mobile and cloud governments' tech shopping lists | Tech Republic | Gartner, Cloud Computing, World News | 06/19/2013 |
And choose the ones that fit the user (filtering):
Title | Source | Matching tag (weight) | Date |
Amazon, IBM wrangle over CIA spying cloud contract | The austrian | IBM (0.24) | 06/18/2013 |
Report: The CIA Picked Amazon To Build Its Cloud | Finding Out About | Cloud Computing (0.95) | 06/18/2013 |
CIA faulted for choosing Amazon over IBM | Physorg.com | IBM (0.24) | 06/18/2013 |
Report: The CIA Picked Amazon To Build Its Cloud | Business Insider | Cloud Computing (0.95) | 06/18/2013 |
What I Use: Office 365 and SkyDrive Pro | WinSuperSite | Cloud Computing (0.95) | 06/20/2013 |
Mobile and cloud governments' tech shopping lists | Tech Republic | Cloud Computing (0.95) | 06/19/2013 |
Now we sort the articles by total points, calculated in the same way as for the clusters (the points for each parameter are in parentheses, and the last column holds the total). This time, the resonance is the number of articles in the cluster that are interesting to the user, not the total number of articles in the cluster:
Title | Source | Tag (points) | Date (points) | Resonance (points) | Total |
What I Use: Office 365 and SkyDrive Pro | WinSuperSite | Cloud Computing (0.95) (2) | 06/20/2013 (2) | 1 (0) | 4 |
Report: The CIA Picked Amazon To Build Its Cloud | Finding Out About | Cloud Computing (0.95) (2) | 06/18/2013 (0) | 4 (2) | 4 |
Report: The CIA Picked Amazon To Build Its Cloud | Business Insider | Cloud Computing (0.95) (2) | 06/18/2013 (0) | 4 (2) | 4 |
Mobile and cloud governments' tech shopping lists | Tech Republic | Cloud Computing (0.95) (2) | 06/19/2013 (1) | 1 (0) | 3 |
Amazon, IBM wrangle over CIA spying cloud contract | The austrian | IBM (0.24) (0) | 06/18/2013 (0) | 4 (2) | 2 |
CIA faulted for choosing Amazon over IBM | Physorg.com | IBM (0.24) (0) | 06/18/2013 (0) | 4 (2) | 2 |
At the last stage, we group the output by cluster (according to the graph in the figure above), taking the first article of each cluster and hiding the rest:
- What I Use: Office 365 and SkyDrive Pro
- Report: The CIA Picked Amazon To Build Its Cloud
- Mobile and cloud governments' tech shopping lists
This result is more like what I’m interested in.
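The article-level ranking and the final grouping into stories can be sketched in the same spirit. Several things here are assumptions rather than facts from the article: the cluster ids (`cia`, `office`, `it_spend`) are inferred from the titles because the graph figure is not reproduced, the scoring rule is read as 2 points for the best value, 0 for the worst, and 1 otherwise, and ties in total points are broken by the newer date. The function names are hypothetical.

```python
from collections import Counter
from datetime import date

def place_points(values):
    """2 points for the best value, 0 for the worst, 1 for everything else."""
    best, worst = max(values), min(values)
    return [2 if v == best else 0 if v == worst else 1 for v in values]

# (title, source, cluster id, weight of the matching portrait tag, date)
# Cluster ids are an assumption inferred from the titles.
articles = [
    ("Amazon, IBM wrangle over CIA spying cloud contract",
     "The austrian", "cia", 0.24, date(2013, 6, 18)),
    ("Report: The CIA Picked Amazon To Build Its Cloud",
     "Finding Out About", "cia", 0.95, date(2013, 6, 18)),
    ("CIA faulted for choosing Amazon over IBM",
     "Physorg.com", "cia", 0.24, date(2013, 6, 18)),
    ("Report: The CIA Picked Amazon To Build Its Cloud",
     "Business insider", "cia", 0.95, date(2013, 6, 18)),
    ("What I Use: Office 365 and SkyDrive Pro",
     "Winsupersite", "office", 0.95, date(2013, 6, 20)),
    ("Mobile and cloud governments' tech shopping lists",
     "Tech Republic", "it_spend", 0.95, date(2013, 6, 19)),
]

def rank_stories(articles):
    # Per-user resonance: interesting articles per cluster, not all articles.
    per_cluster = Counter(a[2] for a in articles)
    tag_pts = place_points([a[3] for a in articles])
    date_pts = place_points([a[4] for a in articles])
    res_pts = place_points([per_cluster[a[2]] for a in articles])
    scored = [(t + d + r, a)
              for t, d, r, a in zip(tag_pts, date_pts, res_pts, articles)]
    # Highest total first; ties broken by the newer date (an assumption).
    scored.sort(key=lambda s: (s[0], s[1][4]), reverse=True)
    seen, feed = set(), []
    for _, article in scored:
        if article[2] not in seen:  # keep only the first article per cluster
            seen.add(article[2])
            feed.append(article[0])
    return feed

feed = rank_stories(articles)
```

Ranking articles first and grouping afterwards means the "cover" article of each story is chosen per user, which cluster-level ranking could not do.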
In addition to using tag weights from the user's portrait, the system can weigh the parameters of an article differently for different tags. The parameters are the article date, the number of sources, the amount of text, an index of influence in social networks, and other similar attributes. For example, a small amount of text is bad for an analytical article under the Politics tag, but exactly the same amount is acceptable for a photoblog. The same article will therefore have different weights for different tags. After normalization, the ranking function developed at News360 aggregates these parameters into the weight of the article relative to a tag.
Treating the user's portrait as the desire to see a particular tag in an article, we aggregate the article's weights over the tags present in the portrait and obtain the final total weight of the article for that user.
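The actual News360 ranking function is not described here, so the following is only a sketch of the general idea under a simple assumption: the weight of an article for a tag is a weighted sum of its normalized parameters with per-tag coefficients, and the weight for a user is the sum of those per-tag weights multiplied by the portrait weights. All names, coefficients, and parameter values are hypothetical.

```python
def article_weight_for_tag(params, coefficients):
    """Weighted sum of normalized article parameters for one tag."""
    return sum(coefficients.get(name, 0.0) * value
               for name, value in params.items())

def article_weight_for_user(params, article_tags, portrait, tag_coefficients):
    """Aggregate per-tag weights over tags shared by article and portrait."""
    return sum(
        portrait[tag] * article_weight_for_tag(params, tag_coefficients[tag])
        for tag in article_tags
        if tag in portrait and tag in tag_coefficients
    )

# Hypothetical parameters of one short article, normalized to [0, 1].
params = {"freshness": 1.0, "text_length": 0.5, "sources": 0.5}

# Hypothetical per-tag coefficients: text length matters for Politics
# but is ignored for Photo.
tag_coefficients = {
    "Politics": {"freshness": 0.25, "text_length": 0.5, "sources": 0.25},
    "Photo": {"freshness": 0.5, "text_length": 0.0, "sources": 0.5},
}
portrait = {"Politics": 0.5, "Photo": 1.0}

politics_weight = article_weight_for_tag(params, tag_coefficients["Politics"])
photo_weight = article_weight_for_tag(params, tag_coefficients["Photo"])
user_weight = article_weight_for_user(
    params, ["Politics", "Photo"], portrait, tag_coefficients)
```

The same article gets a lower weight for Politics than for Photo because its short text is penalized only under the Politics coefficients.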
The resulting mechanism makes it possible to rank stories using any of the following characteristics:
- Objective characteristics of the stories. For example:
  - Article date
  - Amount of text
  - Rating in social networks
- Subjective preferences of the user for whom the ranking is made. For example:
  - Preferred article genre (analytical articles, news)
  - Weighted list of preferred topics (IBM (0.24), Cloud Computing (0.95))
- The point of view from which the user explores an area of information:
  - For example, an article on the government's IT budget can appear under both the Politics and the IT categories, but with different weights.
It also gives plenty of freedom in how the feed is presented to the user:
- The ability to choose precisely which articles (from a series of articles in different publications on the same topic) will be shown to the user.
- The ability to hide duplicate information from the news feed.
This ranking system is currently in trial operation inside News360, and development of the next update has already begun. It includes using knowledge about the news in ranking (how much an article is liked by all users in general or by users similar to the current one), recommendations for adding a particular interest to the profile, and automatic adaptation of the recommendation algorithm to a specific user through an iterative process of testing and evaluating the effectiveness of different algorithms for that user.
References:
- Wikipedia: Named entity recognition
- Traboulsi, H. N. (2006). Named entity recognition: A local grammar-based approach. PhD thesis, Department of Computing, University of Surrey, Guildford, UK. Retrieved from: scribd.com
- Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), 5th Annual ACM Workshop on COLT, pages 144–152, Pittsburgh, PA. ACM Press. Retrieved from: citeseer.ist.psu.edu
- Chang, C., & Lin, C. (n.d.). LIBSVM: A library for support vector machines.
- Classifying documents in Oracle Text.
- Kornfein, M. M., & Goldfarb, H. (2007, July). A comparison of classification techniques for technical text passages. In M. M. Kornfein (Chair), WCE 2007, London, UK. Retrieved from: citeseerx.ist.psu.edu
- Flake, G. W., Tarjan, R. E., & Tsioutsiouliklis, K. (2004). Graph clustering and minimum cut trees. Internet Mathematics, 1, 385-408. Retrieved from: citeseerx.ist.psu.edu
- Wikipedia: Recommender system: Content-based filtering
- Balabanovic, M., & Shoham, Y. (1997). Fab: Content-based, collaborative recommendation. Communications of the Association for Computing Machinery, 40(3), 66-72. Retrieved from: citeseerx.ist.psu.edu