Once upon a time 1stat.ru stopped working - frankly, for us it was a tragedy (now it seems to work somehow).
The working day usually began with a cup of coffee and a pleasant view of the values of domain growth. Of course, such metrics do not show the success of the company, nor its capitalization, nor the type of customers that are placed with the company. But despite this, you can indirectly assess the dynamics of growth or degradation of the company - if the number of domains on NS servers grows, and does not fall, this is good. As practice has shown, it somehow reflects the state of affairs in the market.
Without thinking twice, they threw a script with a table, where they collected the values for the NS servers for the domain and made some statistics. The relevance of the table was not the best, since we summarized all the domains on all servers and simply divided them by the number of servers, which over time with some providers changed. But this did not prevent us from arranging corporate events when moving upward by rating =). There was nothing interesting in this realization, it was done in one evening and it was too lazy to talk about it.
')
About a year ago,
pavelodintsov sent interesting statistics on the distribution of domains across autonomous systems.
As well as a link to the code that collects these statistics. The collection itself was designed to run various bash, perl and sqlite scripts. At the same time, the algorithm was quite simple and interesting, in principle, from Pavel, as a talented specialist, nothing should have been expected, although he worked rather slowly and collected data only on AS.
- requested from the registrar a list of domains in the zones ru, su and rf
- for each of the domains requested A record
- Receive BGP fullview and do network conformity to AS
- Based on this data, statistics on the distribution of domains to offline addresses was collected.
The algorithm was improved - it was completely rewritten in Python, historical tables were introduced, collection and storage of all records from the DNS was added. We also wrote a simple wrapper for displaying data on
firststat.ru , which, given the availability of time and interest from the outside, we will definitely improve.
Statistics collection
Original project:
https://github.com/pavel-odintsov/ru_open_statistics
My fork rewritten in Python:
https://github.com/AlexeyManikin/domain_statistic
To work, you need the Python modules:
- Mysqldb
- dnspython
- Subnettree
- psutil> = 2.2
A MySQL server is also needed and a fast DNS resolver is highly desirable, as the bulk of the load falls on it. You must request all records (A, AAAA, SOA, TXT, MX, SRV, NS) for each of the 6,000,000+ domains. At first, I also collected information from similar projects to check the correctness of the results.
The structure of the project to describe, I believe, does not make sense, the code is quite simple to understand.
update_as_info.py - updates data on AS
update_domain.py - updates data by domains
update_statistic.py - aggregates statistics in the database
DB structure
Below are tables with fields that can be useful when retrieving data:
as_list : the table of correspondence of the AS number and its description, the country
- id - AS number
- descriptions - description
- country - country
domain : the table that stores the current state of the domain, I think most of the fields are clear from the title
- domain_name - FQDN
- registrant - through whom the domain was registered
- tld - quick search zone
- register_date - registration date
- register_date_end - registration end date
- free_date - domain release date
- delegated - domain delegated or not
- a1 ... a4 - A domain record
- ns1 ... ns4 - NS domain records, exactly those that DNS gives out
- mx1 ... mx4 - MX domain records
- txt - TXT domain record
- asn1 ... asn4 is the AS number for the corresponding A record
- aaaa1 ... aaaa4 - AAAA (ipv6) domain records
- cname - CNAME domain record
domain_history : contains the history of changes in the records in the domain table, in addition to all the fields that exist in the domain, there are also other fields:
- domain_id - link to the domain entry
- date_start - the date since this information was current
- date_end - the date when the information has ceased to be an actulae
The following tables contain aggregated information:
- domain_count_statistic - the number of domains in different zones by day
- as_count_statistic - the number of domains on AS by days
- mx_count_statistic - the number of domains using MX, by day
- ns_count_statistic - the number of domains on NS servers by day
- registrant_count_statistic - daily registrar statistics
- a_count_statistic - statistics on the number of domains on one IP per day
- cname_count_statistic - statistics on CNAME data by day
All statistics can be selected from the domain and domain_history tables, but the queries are quite long. We decided to make separate tables for speeding up.
Server load
As a co-founder of hosting, after much persuasion, I was still assigned a server =) Accordingly, I started collecting statistics on an ongoing basis from January 2016, although I sometimes forgot to turn it on after a reboot.
The whole thing works on a 2 x Intel® Xeon® CPU E5-2620 0 @ 2.00GHz server with 32 Gb RAM.
It takes approximately 11 hours to collect data and aggregate statistics for all domains. Currently, the domain_history table contains 41436250 records, the total size of the database is 23 gigabytes. The number of threads pdns and parallel handlers was chosen experimentally, stopped at processing 150 threads.
CPU, LA, and memory usage graphs are shown below:
Run the statistics collection script
The easiest way to run through
Docker . Install Docker and git
sudo apt-get update sudo apt-get install docker.io git
download the repository
cd /home git clone https://github.com/AlexeyManikin/domain_statistic.git
collect images
cd domain_statistic/docker docker-compose build
we launch containers
docker-compoces up -d
Access to database
If anyone is interested, created a copy of the database, opened access to the database in ReadOnly mode, you can go play (limit 300 connections). If there is a habraeffekt - will be stupid =).
Server: manikin.beget.ru
Port: 3310
User: amanikin_stat
Password: openstatistic
Base: domain_statistic
If someone does not know how to use the console, install
PhpMyAdmin :
pma.amanikin.ru
Interesting SQL, which are useful to us:
Or how the DNS for the domain
habrahabr.ru changed
Sample result mysql> select * from domain_history where domain_id = '1717034' \G *************************** 1. row *************************** id: 1717033 domain_id: 1717034 date_start: 2015-07-31 date_end: 2015-09-18 domain_name: habrahabr.ru registrant: registrator-ru tld: ru register_date: 2006-04-18 00:00:00 register_date_end: 2016-04-18 00:00:00 free_date: 2016-05-19 00:00:00 delegated: Y a1: 178.248.233.33 a2: NULL a3: NULL a4: NULL ns1: ns1.habradns.net. ns2: ns2.habradns.net. ns3: ns3.habradns.net. ns4: NULL mx1: alt1.aspmx.l.google.com. mx2: alt2.aspmx.l.google.com. mx3: aspmx.l.google.com. mx4: aspmx2.googlemail.com. txt: mailru-verification: 78856df53544c6fd v=spf1 include:spf.habramail.net include:aspmx.googlemail.com ~all asn1: 197068 asn2: NULL asn3: NULL asn4: NULL aaaa1: NULL aaaa2: NULL aaaa3: NULL aaaa4: NULL cname: NULL nserrors: *************************** 2. row *************************** id: 15762868 domain_id: 1717034 date_start: 2015-09-18 date_end: 2016-04-12 domain_name: habrahabr.ru registrant: registrator-ru tld: ru register_date: 2006-04-18 00:00:00 register_date_end: 2016-04-18 00:00:00 free_date: 2016-05-19 00:00:00 delegated: Y a1: 178.248.233.33 a2: NULL a3: NULL a4: NULL ns1: ns1.habradns.net. ns2: ns2.habradns.net. ns3: ns3.habradns.net. ns4: NULL mx1: alt1.aspmx.l.google.com. mx2: alt2.aspmx.l.google.com. mx3: aspmx.l.google.com. mx4: aspmx2.googlemail.com. txt: google-site-verification=kdqnhqj_7jlohipxgkwxizyedshff0askqtc2ovh1em mailru-verification: 78856df53544c6fd v=spf1 include:spf.habramail.net include:aspmx.googlemail.com ~all asn1: 197068 asn2: NULL asn3: NULL asn4: NULL aaaa1: NULL aaaa2: NULL aaaa3: NULL aaaa4: NULL cname: NULL nserrors: *************************** 3. row *************************** id: 35689334 domain_id: 1717034 date_start: 2016-04-12 date_end: 2099-01-01 domain_name: habrahabr.ru registrant: registrator-ru tld: ru register_date: 2006-04-18 00:00:00 register_date_end: 2017-04-18 00:00:00 free_date: 2017-05-19 00:00:00 delegated: Y a1: 178.248.233.33 a2: NULL a3: NULL a4: NULL ns1: ns1.habradns.net. ns2: ns2.habradns.net. ns3: ns3.habradns.net. ns4: NULL mx1: alt1.aspmx.l.google.com. mx2: alt2.aspmx.l.google.com. mx3: aspmx.l.google.com. mx4: aspmx2.googlemail.com. txt: google-site-verification=kdqnhqj_7jlohipxgkwxizyedshff0askqtc2ovh1em mailru-verification: 78856df53544c6fd v=spf1 include:spf.habramail.net include:aspmx.googlemail.com ~all asn1: 197068 asn2: NULL asn3: NULL asn4: NULL aaaa1: NULL aaaa2: NULL aaaa3: NULL aaaa4: NULL cname: NULL nserrors: 3 rows in set (0.04 sec)
The full SQL dump in the form of an archive can be seen from the
link (size 2.2G)
Similar projects
http://1stat.ru/
http://statonline.ru/
http://stat.nic.ru/
https://myip.ms/view/best_hosting/RUS
other…
Someone uses the statistics of visits, someone does not understand how they think, someone has a lot of advertising. In any case, I did not find the source of any service.
Links
Site -
http://firststat.ru
Repository -
https://github.com/AlexeyManikin/domain_statistic
Creators:
pavelodintsov - creator of the idea and prototype
redfenix - implementation in Python, data collection and storage, site prototype
ilin - site refinement
We will be glad to suggestions for improvement. If anyone needs access to the database for your applications - write, open.