I needed to collect some web statistics. But the search engines didn't even want to share stats about the queries I was feeding them. So, true to the age-old Russian tradition, I set out to write my own "search engine". So, let's begin.
The first thing we need is a list of all registered domains.
After a lengthy search I found a resource offering zone files for download after registration. Registering meant adding a credit card, and after a three-day trial they would start charging it; they would even hand over a ready-made file with the list of domains, but at that point I didn't feel like waiting (the excitement, you know), so I simply downloaded the zone files and wrote a simple parser in Python. Something slowly began to sink; it may have been my hands: 106,138,643 registered domains. Suppose one HTTP response is roughly 100 KB; then I would end up storing about 10 TB. Long reflection and a short nap led me to the conclusion that to think globally, one must act locally. I decided to practice on the .us zone, which has roughly 100 times fewer domains (around 100 GB), especially since the scripts were being written as I went, so errors and restarts were inevitable.
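The parser itself was nothing fancy. Here is a minimal sketch of that zone-file pass (the file names are placeholders, and it assumes fully qualified owner names; real zone files also contain $ORIGIN/$TTL directives and continuation lines, which are simply skipped here):

```python
# Extract unique second-level domains from a TLD zone file.
seen = set()
with open("us.zone") as zone, open("domains_us.txt", "w") as out:
    for line in zone:
        if not line.strip() or line[0] in ";$ \t":
            continue                              # blanks, comments, directives, continuations
        name = line.split()[0].rstrip(".").lower()
        if name.count(".") != 1:                  # keep only names like example.us
            continue
        if name not in seen:
            seen.add(name)
            out.write(name + "\n")
```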
Second: saving resources in general and time in particular. A DNS request is cheaper than an HTTP one, and I was sure that not all domains have an "A" record (for the merely curious among the readers: that means not every domain actually points at a server). So we write a simple script to settle the question (to be honest, by the eleventh restart the script was no longer so simple, but that's beside the point). In total I fed it the 1,746,769 domains of the .us zone and got back 227,051 that don't lead anywhere. Not bad. Looking through the results, I noticed a lot of domains sharing the same IP. Of course! Domain parking! I poked around for about 3 hours and ended up with the following IPs (more than 10,000 entries for each); a sketch of the DNS pass follows the list:
- 108.179.223.250 89247
- 184.168.221.96 22095
- 208.87.35.103 11196
- 216.21.239.197 13574
- 97.74.42.79 14107
- 208.91.197.27 29839
- 64.202.189.170 144693
- 68.178.232.100 328476
- 68.178.232.99 12297
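Roughly what that check looked like; a minimal sketch on the standard library (single-threaded for readability, while the real script naturally needs a pool of workers, and the file names are again placeholders):

```python
import socket
from collections import Counter

ip_counts = Counter()

with open("domains_us.txt") as src, \
        open("no_a_record.txt", "w") as dead, \
        open("resolved.txt", "w") as alive:
    for line in src:
        domain = line.strip()
        if not domain:
            continue
        try:
            ip = socket.gethostbyname(domain)   # A-record lookup
        except socket.gaierror:
            dead.write(domain + "\n")           # no A record: skip the HTTP stage
            continue
        alive.write(f"{domain}\t{ip}\n")
        ip_counts[ip] += 1

# IPs hosting suspiciously many domains are almost certainly parking servers.
for ip, count in ip_counts.most_common(20):
    if count > 10000:
        print(ip, count)
```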
When preparing the list of domains for the final stage, I left all of these out:
ip_blacklist = {
    "74.220.199.6": "domain parking A",
    "74.220.199.8": "domain parking A",
    "74.220.199.9": "domain parking A",
    "74.117.221.143": "parking",
    "68.178.232.100": "GoDaddy's resale/parking shit (320k in .us zone)",
    "64.202.189.170": "GoDaddy's redirector server, parse this shit later (150k .us)",
    "108.179.223.250": "zip code shit, don't need at all (90k in .us, 00000.us like)",
    "184.168.221.96": "GoDaddy's parking server (22k .us)",
    "208.87.35.103": "domain parking B",
    "216.21.239.197": "domains.com parking",
    "68.178.232.99": "google's parking?!?!",
    "208.91.197.23": "parking",
    "208.91.197.24": "parking",
    "208.91.197.25": "parking",
    "208.91.197.26": "parking",
    "208.91.197.27": "parking",
    "97.74.42.79": "GoDaddy's site builder or something like that, parse it later",
    "204.13.160.107": "parking",
    "64.95.64.218": "probably parking, decide later what to do",
    "64.95.64.194": "probably parking, decide later what to do (dead serv)",
    "213.186.33.5": "probably parking, decide later what to do",
}
108.179.223.250 is an amusing server: almost all the domains on it are of the form us-zip-code.us, which I don't need in the statistics, so I filtered it out. GoDaddy, with its odd services like the site builder and the redirector, went into the blacklist as well. Eventually:
At the input - 1,746,769 domains
Without an "A" record - 227,051
Rejected by IP blacklist - 702,459
Remaining - 817,262
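The filtering step itself is trivial once the domain-to-IP map is on disk. A minimal sketch, assuming the resolved.txt format from the DNS pass above and the ip_blacklist dict from the listing:

```python
# Drop every domain whose A record points at a known parking/reseller IP.
kept, rejected = 0, 0
with open("resolved.txt") as src, open("final_domains.txt", "w") as dst:
    for line in src:
        domain, ip = line.rstrip("\n").split("\t")
        if ip in ip_blacklist:
            rejected += 1
            continue
        dst.write(domain + "\n")
        kept += 1

print("rejected:", rejected, "kept:", kept)
```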
Open-source projects: if the picture in the .com zone is more or less the same, then whatever falls over along the way will surely be brought back up. I do a test run on 800k queries, and... god, how many problems... urllib2 is a bad choice, sqlite3 is an even "worse" one, and Debian's limit on the number of open files is around 2k!? I thought only Windows smelled that bad.
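What I'm moving towards instead is roughly this: a thread pool with bounded concurrency instead of a pile of raw urllib2 calls, and a plain append-only file instead of sqlite3. A minimal sketch with the requests library (not the exact script I ran; names and limits are placeholders):

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def fetch(domain):
    """Fetch a domain's front page and return a small record for the stats file."""
    try:
        resp = requests.get(
            "http://" + domain,
            timeout=10,
            allow_redirects=False,   # 301/302 to other domains handled separately
        )
        return {"domain": domain, "code": resp.status_code, "body": resp.text}
    except requests.RequestException as exc:
        return {"domain": domain, "code": 0, "error": str(exc)}

with open("final_domains.txt") as f:
    domains = [line.strip() for line in f if line.strip()]

# A bounded pool keeps the number of simultaneously open sockets (and hence
# open file descriptors) comfortably under the system limit.
with open("responses.jsonl", "w") as out, ThreadPoolExecutor(max_workers=200) as pool:
    futures = [pool.submit(fetch, d) for d in domains]
    for fut in as_completed(futures):
        out.write(json.dumps(fut.result()) + "\n")
```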
The bottom line: you can safely cut the amount of work in half simply by switching your brain on. The preliminary estimates were very rough: the file with the responses ends up weighing 5 GB, not 100. True, I turned redirects off "until further notice": a 301 to another domain is of no use to me, while a redirect from the index to a folder still has to be handled, but I'll do that later (a sketch of the check is right below). And as a follow-up, here is a link to the archive with the final files. If anything is unclear, fellow Pythonistas, just ask.
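Telling a redirect to a foreign domain apart from a cosmetic one (index to a folder, or to www on the same host) comes down to parsing the Location header. A sketch of the check I still have to bolt on (the helper name is mine):

```python
from urllib.parse import urlparse

def is_external_redirect(domain, location):
    """True if the Location header points off the original domain."""
    target = urlparse(location).hostname
    if not target:                      # relative redirect like "/home/" stays local
        return False
    target = target.rstrip(".")
    return target not in (domain, "www." + domain)

# e.g. is_external_redirect("example.us", "/index.html")         -> False
#      is_external_redirect("example.us", "http://example.com/") -> True
```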
Bonus 1 - the final files: 91.222.136.77/tmp/us

Bonus 2 - what to look for in the body of useless pages: kw_blacklist = {
Bonus 3 - the number of responses of each type (out of those 800k):

CODE 0 - 207523 (error during the request: timeout, or a redirect to ...)
CODE 200 - 385543
CODE 202 - 1
CODE 204 - 3
CODE 300 - 5
CODE 301 - 77447
CODE 302 - 114727
CODE 303 - 305
CODE 307 - 180
CODE 400 - 2498
CODE 401 - 1217
CODE 402 - 20
CODE 403 - 14237
CODE 404 - 10475
CODE 405 - 2
CODE 406 - 21
CODE 407 - 1
CODE 409 - 1
CODE 410 - 55
CODE 411 - 2
CODE 418 - 1
CODE 500 - 1807
CODE 501 - 1
CODE 502 - 160
CODE 503 - 965
CODE 504 - 26
CODE 505 - 4
CODE 508 - 1
CODE 509 - 3
CODE 600 - 1
CODE 999 - 1
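For the record, the tally above is just a single pass over the responses file; a sketch, assuming the responses.jsonl format from the crawler sketch earlier:

```python
import json
from collections import Counter

codes = Counter()
with open("responses.jsonl") as f:      # one JSON record per fetched domain
    for line in f:
        codes[json.loads(line)["code"]] += 1

for code, count in sorted(codes.items()):
    print(f"CODE {code} - {count}")
```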
An important addition: looking through the server logs, I was surprised, very surprised, to see requests going to port 80 on 127.0.0.1. I checked, and yes, some domains point at 127.0.0.1, as well as at 10.*.*.* and other reserved subnets.
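Such records are easy to weed out before the HTTP stage with the standard library's ipaddress module; a small sketch:

```python
import ipaddress

def is_bogus_target(ip_string):
    """True for A records pointing at loopback, private or otherwise reserved space."""
    ip = ipaddress.ip_address(ip_string)
    return ip.is_loopback or ip.is_private or ip.is_link_local or ip.is_reserved

print(is_bogus_target("127.0.0.1"))  # True
print(is_bogus_target("10.1.2.3"))   # True
print(is_bogus_target("8.8.8.8"))    # False
```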