I needed to collect some web statistics. But the search engines didn't even want to share stats about the queries I was feeding them. So, true to the age-old Russian tradition, I set out to write my own "search engine". So, let's begin.
The first thing we need is a list of all registered domains.
After a lengthy search I found a resource offering zone files for download after registration. Registering meant adding a credit card, and after a three-day trial they would start charging it; they would even hand over a ready-made file with the list of domains, but at that point I didn't feel like waiting (the excitement, you know), so I simply downloaded the zone files and wrote a simple parser in Python. Something slowly began to sink; it may have been my hands: 106,138,643 registered domains. Suppose one HTTP response is roughly 100 KB; then I would end up storing about 10 TB. Long reflection and a short nap led me to the conclusion that to think globally, one must act locally. I decided to practice on the .us zone, which has roughly 100 times fewer domains (around 100 GB), especially since the scripts were being written as I went, so errors and restarts were inevitable.
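The parser itself was nothing fancy. Here is a minimal sketch of that zone-file pass (the file names are placeholders, and it assumes fully qualified owner names; real zone files also contain $ORIGIN/$TTL directives and continuation lines, which are simply skipped here):

```python
# Extract unique second-level domains from a TLD zone file.
seen = set()
with open("us.zone") as zone, open("domains_us.txt", "w") as out:
    for line in zone:
        if not line.strip() or line[0] in ";$ \t":
            continue                              # blanks, comments, directives, continuations
        name = line.split()[0].rstrip(".").lower()
        if name.count(".") != 1:                  # keep only names like example.us
            continue
        if name not in seen:
            seen.add(name)
            out.write(name + "\n")
```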
Second: saving resources in general and time in particular. A DNS request is cheaper than an HTTP one, and I was sure that not all domains have an "A" record (for the merely curious among the readers: that means not every domain actually points at a server). So we write a simple script to settle the question (to be honest, by the eleventh restart the script was no longer so simple, but that's beside the point). In total I fed it the 1,746,769 domains of the .us zone and got back 227,051 that don't lead anywhere. Not bad. Looking through the results, I noticed a lot of domains sharing the same IP. Of course! Domain parking! I poked around for about 3 hours and ended up with the following IPs (more than 10,000 entries for each); a sketch of the DNS pass follows the list:
- 108.179.223.250 89247
- 184.168.221.96 22095
- 208.87.35.103 11196
- 216.21.239.197 13574
- 97.74.42.79 14107
- 208.91.197.27 29839
- 64.202.189.170 144693
- 68.178.232.100 328476
- 68.178.232.99 12297
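Roughly what that check looked like; a minimal sketch on the standard library (single-threaded for readability, while the real script naturally needs a pool of workers, and the file names are again placeholders):

```python
import socket
from collections import Counter

ip_counts = Counter()

with open("domains_us.txt") as src, \
        open("no_a_record.txt", "w") as dead, \
        open("resolved.txt", "w") as alive:
    for line in src:
        domain = line.strip()
        if not domain:
            continue
        try:
            ip = socket.gethostbyname(domain)   # A-record lookup
        except socket.gaierror:
            dead.write(domain + "\n")           # no A record: skip the HTTP stage
            continue
        alive.write(f"{domain}\t{ip}\n")
        ip_counts[ip] += 1

# IPs hosting suspiciously many domains are almost certainly parking servers.
for ip, count in ip_counts.most_common(20):
    if count > 10000:
        print(ip, count)
```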
When preparing the list of domains for the final stage, I left all of these out:
ip_blacklist = {
    "74.220.199.6": "domain parking A",
    "74.220.199.8": "domain parking A",
    "74.220.199.9": "domain parking A",
    "74.117.221.143": "parking",
    "68.178.232.100": "GoDaddy's resale/parking shit (320k in .us zone)",
    "64.202.189.170": "GoDaddy's redirector server, parse this shit later (150k .us)",
    "108.179.223.250": "zip code shit, don't need at all (90k in .us, 00000.us like)",
    "184.168.221.96": "GoDaddy's parking server (22k .us)",
    "208.87.35.103": "domain parking B",
    "216.21.239.197": "domains.com parking",
    "68.178.232.99": "google's parking?!?!",
    "208.91.197.23": "parking",
    "208.91.197.24": "parking",
    "208.91.197.25": "parking",
    "208.91.197.26": "parking",
    "208.91.197.27": "parking",
    "97.74.42.79": "GoDaddy's site builder or something like that, parse it later",
    "204.13.160.107": "parking",
    "64.95.64.218": "probably parking, decide later what to do",
    "64.95.64.194": "probably parking, decide later what to do (dead serv)",
    "213.186.33.5": "probably parking, decide later what to do",
}
108.179.223.250 is an amusing server: almost all the domains on it are of the form us-zip-code.us, which I don't need in the statistics, so I filtered it out. GoDaddy, with its odd services like the site builder and the redirector, went into the blacklist as well. Eventually:
At the input - 1,746,769 domains
Without an "A" record - 227,051
Rejected by IP blacklist - 702,459
Remaining - 817,262
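The filtering step itself is trivial once the domain-to-IP map is on disk. A minimal sketch, assuming the resolved.txt format from the DNS pass above and the ip_blacklist dict from the listing:

```python
# Drop every domain whose A record points at a known parking/reseller IP.
kept, rejected = 0, 0
with open("resolved.txt") as src, open("final_domains.txt", "w") as dst:
    for line in src:
        domain, ip = line.rstrip("\n").split("\t")
        if ip in ip_blacklist:
            rejected += 1
            continue
        dst.write(domain + "\n")
        kept += 1

print("rejected:", rejected, "kept:", kept)
```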
Open-source projects: if the picture in the .com zone is more or less the same, then whatever falls over along the way will surely be brought back up. I do a test run on 800k queries, and... god, how many problems... urllib2 is a bad choice, sqlite3 is an even "worse" one, and Debian's limit on the number of open files is around 2k!? I thought only Windows smelled that bad.
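What I'm moving towards instead is roughly this: a thread pool with bounded concurrency instead of a pile of raw urllib2 calls, and a plain append-only file instead of sqlite3. A minimal sketch with the requests library (not the exact script I ran; names and limits are placeholders):

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def fetch(domain):
    """Fetch a domain's front page and return a small record for the stats file."""
    try:
        resp = requests.get(
            "http://" + domain,
            timeout=10,
            allow_redirects=False,   # 301/302 to other domains handled separately
        )
        return {"domain": domain, "code": resp.status_code, "body": resp.text}
    except requests.RequestException as exc:
        return {"domain": domain, "code": 0, "error": str(exc)}

with open("final_domains.txt") as f:
    domains = [line.strip() for line in f if line.strip()]

# A bounded pool keeps the number of simultaneously open sockets (and hence
# open file descriptors) comfortably under the system limit.
with open("responses.jsonl", "w") as out, ThreadPoolExecutor(max_workers=200) as pool:
    futures = [pool.submit(fetch, d) for d in domains]
    for fut in as_completed(futures):
        out.write(json.dumps(fut.result()) + "\n")
```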
The bottom line: you can safely cut the amount of work in half simply by switching your brain on. The preliminary estimates were very rough: the file with the responses ends up weighing 5 GB, not 100. True, I turned redirects off "until further notice": a 301 to another domain is of no use to me, while a redirect from the index to a folder still has to be handled, but I'll do that later (a sketch of the check is right below). And as a follow-up, here is a link to the archive with the final files. If anything is unclear, fellow Pythonistas, just ask.
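Telling a redirect to a foreign domain apart from a cosmetic one (index to a folder, or to www on the same host) comes down to parsing the Location header. A sketch of the check I still have to bolt on (the helper name is mine):

```python
from urllib.parse import urlparse

def is_external_redirect(domain, location):
    """True if the Location header points off the original domain."""
    target = urlparse(location).hostname
    if not target:                      # relative redirect like "/home/" stays local
        return False
    target = target.rstrip(".")
    return target not in (domain, "www." + domain)

# e.g. is_external_redirect("example.us", "/index.html")         -> False
#      is_external_redirect("example.us", "http://example.com/") -> True
```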
Bonus 1 - the final files: 91.222.136.77/tmp/us

Bonus 2 - what to look for in the body of useless pages: kw_blacklist = {
Bonus 3 - the number of responses of each type (out of those 800k):

CODE 0 - 207523 (error during the request: timeout, or a redirect to ...)
CODE 200 - 385543
CODE 202 - 1
CODE 204 - 3
CODE 300 - 5
CODE 301 - 77447
CODE 302 - 114727
CODE 303 - 305
CODE 307 - 180
CODE 400 - 2498
CODE 401 - 1217
CODE 402 - 20
CODE 403 - 14237
CODE 404 - 10475
CODE 405 - 2
CODE 406 - 21
CODE 407 - 1
CODE 409 - 1
CODE 410 - 55
CODE 411 - 2
CODE 418 - 1
CODE 500 - 1807
CODE 501 - 1
CODE 502 - 160
CODE 503 - 965
CODE 504 - 26
CODE 505 - 4
CODE 508 - 1
CODE 509 - 3
CODE 600 - 1
CODE 999 - 1
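For the record, the tally above is just a single pass over the responses file; a sketch, assuming the responses.jsonl format from the crawler sketch earlier:

```python
import json
from collections import Counter

codes = Counter()
with open("responses.jsonl") as f:      # one JSON record per fetched domain
    for line in f:
        codes[json.loads(line)["code"]] += 1

for code, count in sorted(codes.items()):
    print(f"CODE {code} - {count}")
```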
An important addition: looking through the server logs, I was surprised, very surprised, to see requests going to port 80 on 127.0.0.1. I checked, and yes, some domains point at 127.0.0.1, as well as at 10.*.*.* and other reserved subnets.
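Such records are easy to weed out before the HTTP stage with the standard library's ipaddress module; a small sketch:

```python
import ipaddress

def is_bogus_target(ip_string):
    """True for A records pointing at loopback, private or otherwise reserved space."""
    ip = ipaddress.ip_address(ip_string)
    return ip.is_loopback or ip.is_private or ip.is_link_local or ip.is_reserved

print(is_bogus_target("127.0.0.1"))  # True
print(is_bogus_target("10.1.2.3"))   # True
print(is_bogus_target("8.8.8.8"))    # False
```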