
Our company works in online advertising. About two years ago we finally gave up on the click-fraud protection built into the ad networks and decided to build our own system, at that time purely for internal use.
Below the cut are plenty of technical details about how the system works, along with descriptions of the problems we ran into along the way and how we solved them. If you just want to look at the system, the main image is clickable.
The first task that needed to be addressed was the identification of unique users.
That is, we need to recognize the user even if he switches browsers or clears his cookies.
After some deliberation and a series of experiments, we started writing identifiers not only to cookies, but also to the storage of every browser plugin that offers one, as well as to third-party cookies and various JS storage mechanisms.
As a result, we not only identify the user in most cases, but also end up with a kind of digital replica of his computer (OS, screen resolution, color depth, the presence or absence of particular plugins, browser support for particular JS storage mechanisms and third-party cookies), which lets us identify the user with a high degree of probability even if he manages to wipe everything we have stored for him.
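As a rough illustration of this "digital replica" idea (not our actual code), the attributes collected on the client can be folded into a single fingerprint hash on the server and used as a fallback identifier; the field names below are made up.

```python
import hashlib
import json

# Illustrative only: combine the collected browser/computer attributes
# into a single fingerprint hash that can serve as a fallback identifier
# when all stored IDs have been wiped. Field names are hypothetical.
def fingerprint(attrs: dict) -> str:
    replica = {
        "os": attrs.get("os"),
        "screen": attrs.get("screen_resolution"),
        "color_depth": attrs.get("color_depth"),
        "plugins": sorted(attrs.get("plugins", [])),
        "js_storage": sorted(attrs.get("supported_js_storage", [])),
        "third_party_cookies": attrs.get("third_party_cookies_enabled"),
    }
    blob = json.dumps(replica, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

print(fingerprint({
    "os": "Windows 7",
    "screen_resolution": "1920x1080",
    "color_depth": 24,
    "plugins": ["Flash", "Java"],
    "supported_js_storage": ["localStorage", "sessionStorage"],
    "third_party_cookies_enabled": True,
}))
```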
There were no particular problems at this stage worth writing about.
The second task is to transfer all user data to our server.
To collect the most complete data we use two scripts: a server-side one (PHP / Python / ASP.NET) and a client-side JS one. This lets us receive information even about users who closed the page before it fully loaded and before the client JS had a chance to run. On teaser ads such clicks usually account for at least 30%, and we have not found any other system that takes them into account. As a result, we collect noticeably more data than Yandex.Metrica, Google Analytics, and all the other statistics systems based on JS counters.
The data is sent to our server via cross-domain AJAX, or via an iframe if the browser does not support it. It is sent on page load and also on a number of JS events. This allows us to analyze how users behave on the site and to distinguish bots from real users both by behavioral patterns and by their imperfect imitation of certain JavaScript events. For example, many bots fake the onClick event but never generate the accompanying onMouseDown and onMouseUp events.
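A server-side filter over the logged event stream might look roughly like the sketch below; the event names mirror the DOM events mentioned above, while everything else is an assumption about how such a check could be written.

```python
# Rough sketch: a recorded click whose event stream contains "click" but
# no accompanying "mousedown"/"mouseup" is flagged as suspicious.
# Real behavioral scoring would of course use many more signals.
SUSPICIOUS = "suspicious"
OK = "ok"

def classify_click(events: list[str]) -> str:
    if "click" in events and not ({"mousedown", "mouseup"} <= set(events)):
        return SUSPICIOUS
    return OK

print(classify_click(["mousemove", "mousedown", "mouseup", "click"]))  # ok
print(classify_click(["click"]))                                       # suspicious
```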
This brings us smoothly to the third task: choosing the hardware.
Architecturally at the moment the system consists of 4 segments:
- Frontend
- Data collection and processing
- Indexing landing pages
- Storage of usernames/passwords for third-party services
All domains are hosted on Amazon Route 53 with a TTL of 60 seconds, so that in case of any problem with the servers we can quickly switch over to the backups.
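Repointing a record at a backup server can be scripted; the sketch below uses boto3 with a placeholder zone ID, record name and IP, and is only an illustration of the idea.

```python
import boto3

route53 = boto3.client("route53")

# Placeholder zone ID, record name and IP: repoint the frontend A record
# at a backup server. With a 60-second TTL, clients pick up the change quickly.
route53.change_resource_record_sets(
    HostedZoneId="Z3EXAMPLE",
    ChangeBatch={
        "Comment": "fail over to backup frontend",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "stats.example.com.",
                "Type": "A",
                "TTL": 60,
                "ResourceRecords": [{"Value": "198.51.100.10"}],
            },
        }],
    },
)
```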
There is nothing special to say about the frontend. The load on it is small, and almost any VPS will cope.
With data collection and processing, things are somewhat more complicated, since we have to work with large amounts of data. At the moment we handle about 200 requests per second.
Thanks to the right initial choice of hardware and software, a single server handles this volume just fine.
Hardware-wise: an 8-core AMD, RAID10 on SAS disks, 16 GB of RAM.
Data collection is handled by an nginx + php-fpm + MySQL stack, and processing by scripts written in C++.
Early on we ran into heavy CPU consumption by the data collection script. The solution came quite unexpectedly: replacing all the PHP ereg_ functions with their preg_ counterparts cut CPU consumption roughly 8-fold, which surprised us a lot.
In case of problems with the current server, or if we need to scale, a second server of the same configuration is standing by and can be brought into service within an hour.
Landing page indexing runs on a separate server with a dedicated block of IPs; it is rather hungry for CPU and RAM but makes almost no demands on the disk subsystem. Indexing is done by a “search bot” written in Python.
This segment is not duplicated, but replacing or expanding it takes less than a day, and it does not directly affect the quality of traffic analysis; in the worst case, a few advertising campaigns simply will not be paused if a client's site goes down or our code disappears from it.
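To give an idea of what such a bot does, here is a minimal sketch under the assumption that it mainly checks whether a landing page is reachable and still carries our snippet; the URLs and the marker string are placeholders, not our real ones.

```python
import requests

# Hypothetical marker: the name of our tracking snippet on client pages.
TRACKER_MARKER = "track.example-counter.js"

def check_landing_page(url: str, timeout: int = 10) -> dict:
    """Fetch a landing page and check that it is up and still carries our code."""
    try:
        resp = requests.get(url, timeout=timeout,
                            headers={"User-Agent": "our-search-bot/1.0"})
    except requests.RequestException:
        return {"url": url, "reachable": False, "has_tracker": False}
    return {
        "url": url,
        "reachable": resp.ok,
        "has_tracker": TRACKER_MARKER in resp.text,
    }

if __name__ == "__main__":
    for page in ["http://client-site.example/landing1",
                 "http://client-site.example/landing2"]:
        print(check_landing_page(page))
```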
Storing logins and passwords for third-party services is a rather peculiar thing and, generally speaking, not great from a security point of view.
However, for most ad networks the API does not provide all the necessary functionality, so you have to scrape their web interface, which is hard to do without the password. For example, in Google AdWords, banning IP addresses is only possible through the web interface. As a bonus, users can jump from our system's interface into their ad network accounts with a single click.
This is the fourth task: ensuring the security of data that must be stored in recoverable (clear) form.
To store the data as securely as possible, we came up with the following scheme (a simplified code sketch follows the list):
- If the password reaches us through the web interface:
  - It is stored in the frontend database, symmetrically encrypted using the client's password to our service as the key.
  - A copy is also stored in the frontend database, asymmetrically encrypted with the storage's public key.
  - The storage periodically queries the frontend database, picks up the encrypted ad network passwords, decrypts them with its private key and puts them in its own database.
- If the password is generated by us on the storage side:
  - It is stored in the storage database.
  - On the user's next login, his password to our service is placed in the frontend database, asymmetrically encrypted with the storage's public key.
  - The storage periodically queries the frontend database, picks up the encrypted passwords and decrypts them with its private key.
  - The storage then symmetrically encrypts the ad network passwords from its database with the recovered user passwords and uploads them, in encrypted form, to the frontend.
- When a user logs into our service, his password is kept inside the JS in a particular way and is used to decrypt the ad network passwords and log into those networks on the client side.
- Access to the storage is allowed only from a small set of IPs under our control.
- The storage's IP is kept secret, and the storage accepts no incoming requests.
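To make the flow above concrete, here is a minimal sketch of the two encryption primitives involved, written with Python's cryptography package; key management, the databases, and the client-side JS are omitted, and all names are illustrative rather than taken from our code.

```python
import base64
import os

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

# Symmetric part: derive a key from the user's service password and use it
# to encrypt an ad network password (this is what sits on the frontend and
# is decrypted on the client side).
def encrypt_with_user_password(ad_password: bytes, user_password: bytes, salt: bytes) -> bytes:
    kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32, salt=salt, iterations=200_000)
    key = base64.urlsafe_b64encode(kdf.derive(user_password))
    return Fernet(key).encrypt(ad_password)

# Asymmetric part: the frontend encrypts with the storage's public key;
# only the storage holds the private key and can decrypt.
def encrypt_for_storage(secret: bytes, public_key) -> bytes:
    return public_key.encrypt(
        secret,
        padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                     algorithm=hashes.SHA256(), label=None),
    )

def decrypt_in_storage(ciphertext: bytes, private_key) -> bytes:
    return private_key.decrypt(
        ciphertext,
        padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                     algorithm=hashes.SHA256(), label=None),
    )

if __name__ == "__main__":
    # Demo with a throwaway key pair; in reality the private key never
    # leaves the storage server.
    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    blob = encrypt_for_storage(b"ad-network-password", private_key.public_key())
    assert decrypt_in_storage(blob, private_key) == b"ad-network-password"

    salt = os.urandom(16)
    token = encrypt_with_user_password(b"ad-network-password", b"user-service-password", salt)
```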
Since some web interfaces cannot be scraped without full browser emulation, the storage is demanding on RAM and CPU. A backup storage server is waiting in another data center, ready to take over within an hour.
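As an illustration of what "full browser emulation" means in practice, here is a minimal Selenium sketch; the URL, selectors and credentials are placeholders and do not correspond to any real ad network interface.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Hypothetical example: log into an ad network's web interface and read a
# value that its API does not expose. URL and selectors are placeholders.
driver = webdriver.Firefox()
try:
    driver.get("https://adnetwork.example/login")
    driver.find_element(By.NAME, "login").send_keys("user@example.com")
    driver.find_element(By.NAME, "password").send_keys("secret")
    driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

    driver.get("https://adnetwork.example/campaigns/ip-exclusions")
    blocked = [row.text for row in
               driver.find_elements(By.CSS_SELECTOR, "table#exclusions tr")]
    print(blocked)
finally:
    driver.quit()
```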
The fifth and final task was integration with ad networks for automatic banning of “bad” IPs and sites.
With the relatively small networks, such as Begun and MarketGuide, there were no problems: all interaction goes through the API, and if some methods are missing, the partners add them promptly.
But with Direct and, especially, AdWords there are plenty of problems. Getting API access in AdWords turns into a quest. First you spend a month obtaining it, then it turns out that half the functions are missing and you still have to scrape the web interface. Then it turns out that even the functions that do exist are strictly limited by units, which cannot be bought with Basic API access. And a new quest begins to obtain the next access level, with a larger unit allowance. As you can see, the search giants do everything they can to make it harder for advertisers to optimize their advertising spend. Nevertheless, at the moment we successfully analyze their traffic and clean it automatically.
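For the networks that do expose a suitable method, pushing a blacklist automatically is straightforward; the endpoint, parameters and token below are purely hypothetical and only sketch the general shape of such an integration.

```python
import requests

# Purely hypothetical endpoint and parameters, for illustration only:
# push a list of fraudulent IPs to an ad network that exposes a
# blacklist method in its API.
API_URL = "https://api.adnetwork.example/v1/campaigns/{campaign_id}/ip-exclusions"

def ban_ips(campaign_id: int, ips: list[str], token: str) -> bool:
    resp = requests.post(
        API_URL.format(campaign_id=campaign_id),
        json={"ips": ips},
        headers={"Authorization": f"Bearer {token}"},
        timeout=15,
    )
    return resp.ok

if __name__ == "__main__":
    ban_ips(12345, ["203.0.113.7", "198.51.100.42"], token="...")
```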
The bottom line: at the moment our system has no real competitors with similar capabilities and, most importantly, comparable quality of low-quality traffic detection. In some cases we see 40-45% more traffic than other analytics systems do.
On average, auditing traffic costs about 100 times less than the advertising itself, and for some ad networks the service is completely free. The savings range from 10 to 50% of the advertising budget, and sometimes reach 90%.
At the moment the system works in fully automatic mode with Yandex.Direct, Google AdWords, Begun and MarketGuide. With any other advertising system, the service works in traffic audit mode, with fraudulent IPs and sites then added to the blacklist manually.
Join now!