
Rating system in a high-load project

This story is about a content project whose architecture I had to redo. It originally ran on the classic LAMP stack (Linux, Apache, MySQL, PHP). But the number of visitors kept growing, and as it approached 1M hosts the database server could no longer cope. My first suggestion was to buy another server, but conversion in affiliate programs is rather low in this segment, so the project's management balked at the expense.

If you are curious how I had to change the architecture and also bolt on a rotation and rating system, read on.

The peculiarity of this project is that it distributes video content hosted on donor sites such as YouTube. The site only has to display BB codes (small pieces of HTML). There was therefore no need to generate HTML on the fly for every request; pages were regenerated at intervals, at first once a day and later after every 1000 impressions. Apache was replaced by nginx, which simply served the pre-generated static HTML.

Every time a visitor comes to the site, they should see something new. The new, as often happens, is the well-forgotten old. In short, I needed to rotate the video previews (more on them later). There are several rotation algorithms; you can't even imagine how sophisticated a marketer's mind can be. So I will describe only one, the simplest.
The first 10 slots are filled only with new thumbnails. The next 90 slots are filled with the previews from the same category that have the highest CTR. For those not familiar with the term, CTR (click-through rate) is a measure of clickability: the ratio of the number of clicks on a picture to the number of times it was shown.
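
The rule above can be sketched in plain Lua. This is only an illustration under stated assumptions, not the project's actual generator: the record layout (id, clicks, shows, is_new), the function names ctr and pick_slots, and the decision to treat a never-shown picture as CTR 0 are all hypothetical.

```lua
-- CTR = clicks / shows. A picture that was never shown gets 0 (assumption).
function ctr(clicks, shows)
  if shows == 0 then return 0 end
  return clicks / shows
end

-- items: array of { id, clicks, shows, is_new } records (hypothetical layout).
-- Returns up to new_slots new items, followed by up to top_slots of the
-- remaining items ordered by descending CTR.
function pick_slots(items, new_slots, top_slots)
  local fresh, rest = {}, {}
  for _, it in ipairs(items) do
    if it.is_new and #fresh < new_slots then
      fresh[#fresh + 1] = it
    else
      rest[#rest + 1] = it
    end
  end
  table.sort(rest, function(a, b)
    return ctr(a.clicks, a.shows) > ctr(b.clicks, b.shows)
  end)
  local page = {}
  for _, it in ipairs(fresh) do page[#page + 1] = it end
  for i = 1, math.min(top_slots, #rest) do page[#page + 1] = rest[i] end
  return page
end
```

For the scheme described in the text, the call would be pick_slots(items, 10, 90).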

A video may be potentially popular while its preview is unattractive. This is quite likely, because instead of a student who sits and picks the juiciest moments of the video, a robot generates thumbnails from randomly selected frames. As a result, a quite interesting video can sink to the bottom of the rating. To diversify the site and offset the effect of a random frame, a local rating is used: three previews are generated from each video, and they are rotated as well. Through natural selection, the most attractive pictures remain. There is also a voting system (thumbs up / thumbs down), but its technical implementation is exactly the same as that of the rotation system.
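
As a rough illustration of the local rating, the sketch below picks the preview with the highest CTR among the variants of one video. The function name best_variant and the convention of returning nil when nothing has been shown yet are assumptions made for this example; the article does not describe the real selection code.

```lua
-- clicks and shows are arrays holding one counter per preview variant.
-- Returns the index of the variant with the highest CTR, or nil if
-- none of the variants has been shown yet.
function best_variant(clicks, shows)
  local best, best_ctr = nil, -1
  for i = 1, #shows do
    if shows[i] > 0 then
      local c = clicks[i] / shows[i]
      if c > best_ctr then
        best, best_ctr = i, c
      end
    end
  end
  return best
end
```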

But we are gathered here not to listen to SEO tales but to share technical details. In short, the entire LAMP stack was replaced with a site generator, and nginx served the static files. All that remained was to implement the CTR calculation.

Since the total number of videos on the site was around 100K, a persistent in-memory storage is a perfectly viable choice. The alternatives are Redis, Aerospike, and Tarantool.

Because of its rich functionality and the friendly Russian-speaking support from the guys at MailRu, the choice fell on Tarantool. MySQL has not gone anywhere: the BB codes of the videos, the lists of categories and titles, the content descriptions, and other information needed for site generation are still stored in it. But since the database was now barely used, it was given a minimum of memory.

Now in more detail about Tarantool (hereinafter T*). Much has been written about it in various articles. I will try to describe how it is applied in practice, skipping installation and configuration.

A bit of boring theory to understand what's what. All data in T* is stored in spaces. A space is an analogue of a table in SQL or a collection in MongoDB. Just as a table consists of rows and a collection consists of documents, a space consists of tuples (a tuple is similar to a row in MySQL).

A tuple consists of elements, or fields. I find it convenient to call the elements of a tuple fields, and I will stick to this terminology, which does not contradict the documentation: tarantool.org/doc/book/box/index.html . Unlike the columns of a table row, the fields of a tuple have no names, only a sequence number. Although, as you will see later, this does not matter.

Each tuple must have a primary key, so each space needs a primary index. Indexes come in the following types: TREE, HASH, BITSET, and RTREE. Secondary indexes can also be created on a space, which allows selections that cannot be made in Redis.

Figure 1 shows the analogy between MySQL and T*.

To store the ratings, we create a space named stats. To do this, go to the console and run the commands:
	 box.cfg{} - loads the default configuration
	 box.schema.space.create("stats") - creates a new space



Check how our space was created:
 tarantool> box.space 
 --- 
 - stats: 
     temporary: false 
     engine: memtx 
 ... 

And assign it to the stats variable:
 tarantool> stats = box.space.stats


If we were designing a schema for an SQL database or MongoDB, we would choose the following:
 1   key            primary key, matches the video id
 2   clicks_1       number of clicks on the first picture
 3   clicks_2       same for the second picture
 4   clicks_3       same for the third picture
 5   clicks_sum_1   total number of clicks on the first picture
 6   clicks_sum_2   same for the second picture
 7   clicks_sum_3   same for the third picture
 8   show_1         all the same for shows
             ...
 13  show_sum_3
 14  ctr_1          CTR of the first picture for the last interval
 15  ctr_2
 16  ctr_3
 17  ctr_sum_1      CTR of the first picture for the entire period
 18  ctr_sum_2
 19  ctr_sum_3
 20  ctr            CTR across all pictures for the last interval
 21  ctr_sum        CTR across all pictures for the entire period

The first column is the field number. We define the field names as constants:
	 -- the first field is the primary key
	 clicks_1 = 2
	 clicks_2 = 3
	 . . .
	 ctr_sum = 21
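
Instead of typing each constant by hand, the numbers can be derived from an ordered list of names. The sketch below is one possible way to do it in plain Lua; the table names FIELD_NAMES and F are made up for this example, and the names for fields 9 to 12 (show_2, show_3, show_sum_1, show_sum_2) are inferred from the "all the same for shows" pattern in the schema rather than spelled out in it.

```lua
-- Field names in tuple order. The position in this list is the field number.
FIELD_NAMES = {
  "key",
  "clicks_1", "clicks_2", "clicks_3",
  "clicks_sum_1", "clicks_sum_2", "clicks_sum_3",
  "show_1", "show_2", "show_3",              -- names inferred from the schema
  "show_sum_1", "show_sum_2", "show_sum_3",
  "ctr_1", "ctr_2", "ctr_3",
  "ctr_sum_1", "ctr_sum_2", "ctr_sum_3",
  "ctr", "ctr_sum",
}

-- F maps a field name to its number, e.g. F.clicks_1 is 2.
F = {}
for number, name in ipairs(FIELD_NAMES) do
  F[name] = number
end
```

With such a table, adding or reordering fields changes only the list, and the numbers follow automatically.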

Let's create a primary index on our space, choosing the HASH type:
         stats:create_index('primary', {type = 'hash', parts = {1, 'NUM'}})

Let's check what we have created:
 tarantool> stats.index
 ---
 - 0: &0
     unique: true
     parts:
     - type: NUM
       fieldno: 1
     id: 0
     space_id: 513
     name: primary
     type: HASH
   primary: *0
 ...

Very well. Now we will create a function that increments the clicks_1 field, but first let's insert several records for debugging:
      stats:insert{1,0,0,0,0,0,0}
      stats:insert{2,0,0,0,0,0,0}
      stats:insert{3,0,0,0,0,0,0}

First, let's check what we have:
 tarantool> stats:select{2}
 --- 
 - - [2, 0, 0, 0, 0, 0, 0] 
 ...

Great, everything works! Now let's write the code that increments a field:
 tarantool> stats:update(2, {{'+', 2, 1}})
 tarantool> stats:select{2}
 - [2, 1, 0, 0, 0, 0, 0]
 tarantool> stats:update(2, {{'+', 2, 1}})
 - [2, 2, 0, 0, 0, 0, 0]


The update command takes the following parameters:
- the primary key of the tuple to update;
- a list of operations, each of which is a triplet (a list of three elements):
  - the operation type, in this case addition;
  - the number of the field to change;
  - the operand, here the number to add.

Read more about the update command in the documentation: tarantool.org/doc/book/box/box_space.html#lua-function.space_object.update
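
To make the shape of the operation list concrete, here is a plain-Lua model of how a list of '+' triplets is applied to a tuple. It is not Tarantool code and does not use the box API: apply_ops is a hypothetical helper written only for illustration, and it models only the addition operation.

```lua
-- Plain-Lua model of an update operation list (NOT the Tarantool API).
-- tuple: array of numbers; ops: array of { operation, field number, operand }.
function apply_ops(tuple, ops)
  -- Work on a copy so that the original tuple stays untouched.
  local result = {}
  for i, v in ipairs(tuple) do result[i] = v end

  for _, op in ipairs(ops) do
    local kind, fieldno, arg = op[1], op[2], op[3]
    if kind == '+' then
      result[fieldno] = result[fieldno] + arg
    else
      error("unsupported operation: " .. tostring(kind))
    end
  end
  return result
end
```

Because the second parameter is a list, several triplets can be passed at once, for example {{'+', 2, 1}, {'+', 5, 1}} to bump two counters of the same tuple.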

We see that each execution of stats:update increases the second field of the tuple with key = 2 by 1. Let's write it in a more readable form. First we need to define the constant:
 tarantool> clicks_1 = 2

Then run:
 tarantool> stats:update(2, {{'+', clicks_1, 1}})
 - [2, 4, 0, 0, 0, 0, 0]

Now we wrap this into a function:
 function click_inc(key) stats:update(key, {{'+', clicks_1, 1}}) end

And check:
 tarantool> click_inc(2)
 tarantool> stats:select{2}
 ---
 - - [2, 5, 0, 0, 0, 0, 0]
 ...
 tarantool> click_inc(2)
 tarantool> stats:select{2}
 ---
 - - [2, 6, 0, 0, 0, 0, 0]
 ...

Let's add the number of the picture to our function (numbering starts at 0, which is the first picture):

 function click_inc(key, img_num) stats:update(key, {{'+', clicks_1 + img_num, 1}}) end

After checking, we move the function, in a tidier form, to a separate file, click.lua:

function click_inc(key, img_num)
    -- img_num is 0-based, so only 0, 1 and 2 address the three previews;
    -- a larger value would spill over into the clicks_sum_* fields
    if img_num < 0 or img_num > 2 then
        return false
    end
    box.space.stats:update(key, {{'+', clicks_1 + img_num, 1}})
    return true
end


As you can see, the logic of the function is quite simple: the first argument is the video id, the second is the number of its preview. Now let's consider how all this can be used. In a web project, this function can be called in three and a half ways:
- through a client API: from PHP / Python / Perl / Java scripts, etc.;
- via tarantool-http, with requests proxied through nginx,
  or from your own Lua script using http.lib or another web server (for example, Xavante);
- directly from nginx, using nginx_upstream_module.

If there is interest, I can describe the second method in more detail, but in this case we chose the third option. This article is already long, so you can read about installing and configuring the module in the article "Building services based on nginx & Tarantool" by the authors of T*.

So, our click.lua will look like this:
#!/usr/bin/tarantool

box.cfg {
    log_level = 5;
    listen = 10001;
}

-- number of the field that holds the clicks on the first picture
clicks_1 = 2;

function click_inc(key, img_num)
    -- img_num is 0-based: only 0, 1 and 2 address the three previews
    if img_num < 0 or img_num > 2 then
        return 0
    end
    box.space.stats:update(key, {{'+', clicks_1 + img_num, 1}})
    return 1
end


Check it out:
         curl http://127.0.0.1:8081/echo --data '{"method": "click_inc", "params": [2,1], "id": 0}'
         {"id": 0, "result": [[1]]}

To verify, let's connect to the running instance of T*:
 tarantool> console = require("console")
 tarantool> console.connect("127.0.0.1:10001")
 tarantool: connected to 127.0.0.1:10001
 - true
 127.0.0.1:10001> stats = box.space.stats
 127.0.0.1:10001> stats:select{2}
 - - [2, 7, 0, 0]
 ...


We can also increment the counter of the second image:
  curl http://127.0.0.1:8081/echo --data '{"method": "click_inc", "params": [2,2], "id": 1}'
 {"id": 0, "result": [[1]]}
 

Check the result:
 127.0.0.1:10001> stats:select{2}
 - - [2, 7, 1, 0] 
 ... 

We have seen how simple it is to build a click-counting system. Now let's turn to counting impressions (shows).

Each page with a set of thumbnails, let's call it a "category" page, shows about a hundred of them (we'll assume that one category page contains one hundred thumbnails from that category), so the straightforward approach is to call the show-incrementing procedure, show_inc, a hundred times per page view. As you can guess, this is not optimal. There is another option: a variable is generated in the body of the HTML page,
<script>
show_pictures = "1,2,3,4,5"; /* the ids of all displayed pictures are listed here */
</script>

and the whole list is then sent via AJAX. But in addition to the id of the picture, its display variant must be transferred as well, so the list can take the form "1-1, 2-1, 3-1, 4-2", where the number after the hyphen indicates the display variant.
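
On the generating side, such a list could be assembled roughly as follows. This is a sketch in plain Lua under assumptions: the function name build_show_list and the record layout with id and variant fields are hypothetical, and the article does not say in which language the site generator builds the string.

```lua
-- shown: array of { id = <video id>, variant = <display variant> } records
-- (hypothetical layout). Returns a string such as "1-1,2-1,3-1,4-2".
function build_show_list(shown)
  local parts = {}
  for i, s in ipairs(shown) do
    parts[i] = s.id .. "-" .. s.variant
  end
  return table.concat(parts, ",")
end
```

The result can be written into the page variable shown above and later split on the server by commas and hyphens.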

Unfortunately, Lua has no built-in analogue of a function like explode, so we used this code, found by googling.

function split(inputstr, sep)
    if sep == nil then
        sep = "%s"
    end
    local t = {}
    local i = 1
    -- sep goes inside a character class, so every character in it acts as a separator
    for str in string.gmatch(inputstr, "([^" .. sep .. "]+)") do
        t[i] = str
        i = i + 1
    end
    return t
end


Next, we loop over the resulting table. To implement the loop, we write an iterator function:
function values(t)
    local i = 0
    return function() i = i + 1; return t[i] end
end

-- tt is the table of items to iterate over
for it in values(tt) do show_inc(it, 2) end


As you can see, show_inc is very similar to click_inc, the only difference being that the variable clicks_1 is replaced with show_1. Therefore, we can create a more universal function, stat_inc(key, field, img_num).

function stat_inc(key, field, img_num)
    -- img_num is 0-based: only 0, 1 and 2 address the three previews
    if img_num < 0 or img_num > 2 then
        return 0
    end
    box.space.stats:update(key, {{'+', field + img_num, 1}})
    return 1
end



Since we calculate two kinds of CTR, one since the last generation and one overall, we will create a click procedure that we will call via nginx:

function click(key, img_num)
    stat_inc(key, clicks_1, img_num)
    stat_inc(key, clicks_sum_1, img_num)
end


and show:

function show(key_list)
    local list = split(key_list, ',')
    for it in values(list) do
        -- each item looks like "<video id>-<display variant>"
        local pos = string.find(it, '-', 1, true)
        local key = tonumber(string.sub(it, 1, pos - 1))
        local img_num = tonumber(string.sub(it, pos + 1))
        stat_inc(key, show_1, img_num)
        stat_inc(key, show_sum_1, img_num)
    end
end


Thus, we count both clicks and impressions.

If you are interested in this topic, I can describe how to calculate CTR and how to choose the pictures when generating the HTML.

Source: https://habr.com/ru/post/275281/

