It so happens that we at seo11.ru know the traffic of roughly 1 million sites. The data comes from the Liveinternet, Mail, Rambler, Openstat and Hotlog ratings. But a huge number of sites do not take part in these ratings and prefer to measure traffic with Google Analytics or Yandex.Metrica. Analytics has no public informers, so there is no way to get its data. Metrica, however, does!
Plan
1. Collect a database of RuNet sites.
2. Look for the Metrica code on them.
3. Check whether the Metrica informer is open.
4. If it is open, download the image, recognize the numbers and write them to the database.
Solution
1. First, we need a list of all RuNet sites. The first thought is to crawl all domains in the .ru, .su and .рф zones. However, many Russian-language sites live on international domains. We could also crawl the Alexa top, Yandex.Catalog and the Russian section of Dmoz, but none of that would give a complete database. I would have had to write a full-fledged crawler, but after soberly assessing my resources I started looking for alternatives.
After all, I was not the first person to need to crawl the RuNet. I decided to turn to colleagues at Keys.so. They have their own crawler and almost 20 million analyzed sites. They crawl sites to collect keywords and other SEO data.
2. So, we have a database of 20 million sites. What remains is to find the Metrica code on them. The JS counter code comes in several variants. If you search for yandexMetrikaId, many sites will go undetected. For example, yandex.ru itself uses Metrica, but a search for yandexMetrikaId does not find it. If you search for yaCounter or Ya.Metrik, many other sites will be missed, for example dnevnik.ru.
The most reliable approach is to look for the sequence “mc.yandex.ru/watch/”, for example “mc.yandex.ru/watch/17969140”. Here 17969140 is the site's counter ID. This way, Keys.so sees Metrica on 3,846,867 domains.
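As a rough illustration of that search, here is a minimal sketch that pulls counter IDs out of a page's HTML. The function name `findMetricaIds` and the sample markup are made up for this example; how Keys.so actually does the matching is not known.

```javascript
// Matches the counter ID that follows "mc.yandex.ru/watch/" anywhere in the page.
const WATCH_RE = /mc\.yandex\.ru\/watch\/(\d+)/g;

// Returns the unique counter IDs found in an HTML string, in order of appearance.
// (findMetricaIds is a hypothetical helper name, not part of any library.)
function findMetricaIds(html) {
  const ids = new Set();
  for (const match of html.matchAll(WATCH_RE)) {
    ids.add(match[1]);
  }
  return [...ids];
}

// Illustrative markup only; real counter snippets vary.
const sample = '<noscript><img src="https://mc.yandex.ru/watch/17969140" alt=""></noscript>';
console.log(findMetricaIds(sample)); // [ '17969140' ]
```

Matching on the URL fragment rather than on a JS variable name is what makes this work across the different variants of the counter code.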
3. Knowing the counter ID, you can request the informer image at:
informer.yandex.ru/informer/37616330/3_0_FFFFFFFF_FFFFFFFF_0_pageviews
From top to bottom, it shows views, visits and visitors. If the informer is disabled in the Yandex.Metrica settings, the image will look like this:
informer.yandex.ru/informer/17969140/3_0_FFFFFFFF_FFFFFFFF_0_pageviews
There is no point in requesting and recognizing such an informer. It is enough to check the content-length and weed out the ones we do not need.
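The content-length filter could look roughly like the sketch below. Several things here are assumptions rather than facts from the article: the `https` scheme, a Node 18+ runtime with global `fetch`, a server that answers HEAD requests with a `Content-Length` header, and the `closedLength` parameter, which stands for the byte size of the "informer disabled" image that you would have to measure yourself from a counter known to be closed.

```javascript
// Informer style string taken from the URLs in the text above.
const INFORMER_STYLE = '3_0_FFFFFFFF_FFFFFFFF_0_pageviews';

// Builds the informer image URL for a counter ID (the https scheme is an assumption).
function informerUrl(counterId) {
  return `https://informer.yandex.ru/informer/${counterId}/${INFORMER_STYLE}`;
}

// closedLength is a hypothetical value: the byte size of the "informer disabled"
// placeholder, measured once from a counter known to be closed.
function looksOpen(contentLength, closedLength) {
  return Number.isFinite(contentLength) && contentLength !== closedLength;
}

// Reads Content-Length with a HEAD request (assumes Node 18+ global fetch and
// that the server includes a Content-Length header in its HEAD response).
async function fetchContentLength(counterId) {
  const res = await fetch(informerUrl(counterId), { method: 'HEAD' });
  const header = res.headers.get('content-length');
  return header === null ? NaN : Number(header);
}
```

Only counters for which `looksOpen` returns true would then be passed on to the download and recognition step.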
4. Of the 3.8 million sites, just over 1 million have an open informer. We will download and recognize them with NodeJS. For downloading, I use the request module and build a queue with async.queue. I recognize the images with the okrabyte OCR library.
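To show the idea behind the queue, here is a dependency-free sketch of a concurrency-limited task runner. It is a stand-in written for this example, not the async.queue API, and the worker is a placeholder: the real one would download the informer and pass the image to the OCR library, whose calls are deliberately not shown here.

```javascript
// Runs worker(task) for every task with at most `concurrency` running at once.
// Resolves to an array of results in the same order as `tasks`.
// (runQueue is a hypothetical helper, not part of the async module.)
async function runQueue(tasks, concurrency, worker) {
  const results = new Array(tasks.length);
  let next = 0;

  // Each runner pulls the next unclaimed index until the list is exhausted.
  async function runner() {
    while (next < tasks.length) {
      const index = next++;
      try {
        results[index] = await worker(tasks[index]);
      } catch (err) {
        // One failed informer should not stop the whole crawl.
        results[index] = { error: err.message };
      }
    }
  }

  const runners = [];
  for (let i = 0; i < Math.min(concurrency, tasks.length); i++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}

// Usage with a stand-in worker; a real one would download and OCR the informer.
runQueue(['17969140', '37616330'], 2, async (id) => `informer for ${id}`)
  .then((results) => console.log(results));
```

Capping the number of requests in flight matters at this scale: with about a million informers to fetch shortly before midnight, the concurrency limit is the main knob for trading speed against load on the remote server.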
The first problem: the informer only shows data for the current day. The solution is to download the informers at 23:55. Of course, there will be small discrepancies from the real data, but this is better than nothing.
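Scheduling the 23:55 download can be sketched as follows. This assumes the machine's local time zone is the one you care about; `msUntilDownload` and `downloadInformers` are hypothetical names used only for this example.

```javascript
// Milliseconds from `now` until the next 23:55 in the process's local time zone.
function msUntilDownload(now = new Date()) {
  const target = new Date(now);
  target.setHours(23, 55, 0, 0);
  if (target <= now) {
    // Already past 23:55 today, so wait for tomorrow's slot.
    target.setDate(target.getDate() + 1);
  }
  return target - now;
}

// Hypothetical hook: downloadInformers would be the crawl from the previous step.
// setTimeout(downloadInformers, msUntilDownload());
```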
The second problem: the informer resets at 00:00 in the time zone selected in the counter settings. How do we find out which time zone is selected? There is no way to. So we need to poll the informer in advance, once an hour, and see when it resets.
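Detecting the reset from hourly polls might look like this sketch. The sample shape (`hourUtc`, `visits`) and the numbers are invented for illustration; the idea is simply that a daily counter only grows during the day, so the first drop marks the counter's midnight.

```javascript
// samples: [{ hourUtc: 0..23, visits: number }, ...] taken once per hour, in order.
// Returns the UTC hour of the first sample whose counter dropped below the
// previous one, or null if no drop was seen.
function findResetHourUtc(samples) {
  for (let i = 1; i < samples.length; i++) {
    if (samples[i].visits < samples[i - 1].visits) {
      return samples[i].hourUtc;
    }
  }
  return null;
}

// Made-up readings for one counter.
const hourly = [
  { hourUtc: 19, visits: 950 },
  { hourUtc: 20, visits: 990 },
  { hourUtc: 21, visits: 12 },
  { hourUtc: 22, visits: 40 },
];
console.log(findResetHourUtc(hourly)); // 21
```

A reset first seen at 21:00 UTC would suggest that the counter's midnight falls at UTC+3, so its informer should be downloaded shortly before 21:00 UTC. Counters with very little traffic may show no drop at all, in which case the function returns null and the time zone stays unknown.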
That's all. The result of the work is available at seo11.ru