
Internet mapping

Sitting on the couch, I once again came up with a crazy idea: one of those global things I hadn't yet tried at the hobby level, but it came to me all the same :).

Having considered how much information and audience reach a search engine holds, I wondered why there are so few of them on the Internet. Google, Yandex, Rambler, and a few others you can count on your fingers. Yet they funnel the overwhelming majority of Internet users: an enormous number of people pass through them, and to a certain extent it is the search engines that decide where to send a user. And companies, to a certain extent, promote themselves by not-so-clever methods of influencing the bots of that same Google.

And what is the result? Does anyone have any idea how many Russian-language resources exist? Can you see them ranked by frequency of use or grouped by topic? People keep talking about the semantic web, yet even this elementary kind of structuring seems to be missing. Having told myself "who, if not us?", I set out to implement this idea and work out approaches to it. The main thing is to understand the central difficulty, which, as in many cases, comes down to resources, in this case CPU time. If you are curious about the findings of a novice in this area, albeit one with a fresh pair of eyes, welcome under the cut.


IP as a basis for identification



Well, what's so difficult, I said to myself: you just need to get a list of all sites and then rank them, if only by Google's own PageRank. So I went and wrote a clever little C# program that probed port 80 of each IP address and, on success, obtained the domain name and its country (using the GeoIPService web service). I set my unsophisticated bot loose, and after an hour or two I checked how many sites it had collected: almost one unique site... and that was it. I decided to count how many IP variations there are: 256 * 256 * 256 * 256, about 4 billion. I hadn't expected quite so many, so I looked at how long a single probe took; it turned out to be about 0.1 seconds. I had capped the wait for a response at that same timeout, since the default is substantially longer. At that rate the count would finish in 4971 days. Fourteen years of this? No thanks, I said to myself. At this point I could have marveled at the miracle of Google's technology, realized how much work they do and that one person alone cannot compete with them. But stubbornness got the better of me :)
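For the curious, here is roughly what that first naive scanner looks like: a minimal C# sketch that probes port 80 with the same 0.1-second timeout and does a reverse DNS lookup on success. The GeoIPService call and the full address walk are omitted, and the 93.184.216.0/24 range is just an example.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

class NaiveScanner
{
    // Probe port 80 of a single IP with a 100 ms timeout; return the host name on success.
    static async Task<string> ProbeAsync(IPAddress ip)
    {
        using (var client = new TcpClient())
        {
            var connect = client.ConnectAsync(ip, 80);
            // Give up after 100 ms, the same timeout used in the estimate above.
            if (await Task.WhenAny(connect, Task.Delay(100)) != connect || !client.Connected)
                return null;
            try
            {
                // Reverse DNS: a rough way to recover a domain name for the address.
                var entry = await Dns.GetHostEntryAsync(ip);
                return entry.HostName;
            }
            catch (SocketException)
            {
                return null; // open port, but no PTR record
            }
        }
    }

    static async Task Main()
    {
        // Example: walk a single /24 instead of the whole address space.
        for (int last = 1; last < 255; last++)
        {
            var ip = IPAddress.Parse($"93.184.216.{last}");
            var host = await ProbeAsync(ip);
            if (host != null)
                Console.WriteLine($"{ip} -> {host}");
        }
    }
}
```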

Country Restriction



Fine, I said to myself, Russian-language sites are enough for me; the rest need not be analyzed. But how do you match an IP address to a country? Does the Internet have at least some kind of structure? These were the questions I had to deal with, because the brute-force method did not suit me.

After reading what people do for this, I found that this simple task has a whole framework with the beautiful name of geo-targeting, and I was amazed at the articles on Habr: "Determining a city by IP address", "GeoIP database - countries and cities" and a whole bunch of similar ones. There are even paid databases and so on: a whole branch of business :)

But most importantly, all of this is secondary information, and I had neither the need nor the desire to use it. I wanted to understand where the wind was blowing from: where is the primary information? Where do these geolocation databases get their data?

Regions of the network


I had to figure out who actually runs the Internet. Reading the popular articles on Wikipedia, you learn that from the creation of the network until his death in 1998, Jon Postel was in charge of address allocation under an agreement with the US Department of Defense. Today the allocation of IP addresses is controlled by a non-profit organization, the Internet Assigned Numbers Authority (IANA), which after Postel's death was folded into ICANN, founded by the US government, which in turn operates under a contract from the US Department of Commerce. The scheme is confusing, and there is still a dispute at the UN over whether the United States should hand control of the Internet to the UN, which the US has naturally refused to do. But all of this interests us only insofar as it tells us whether there is any order in the distribution of IP addresses and which subnets a search engine need not scan.

And here is the most important document from IANA, the one holding out hope of order: the IANA IPv4 Address Space Registry.

It lists who is responsible for (read: "who owns") the regions of the network. From here on it is convenient to introduce some terminology: a region of the network is the set of IP addresses determined by the first octet of the address, and a sector is the set determined by the first two octets.
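In code this terminology boils down to the first one or two octets of the address; a tiny sketch, with helper names of my own choosing:

```csharp
using System.Net;

static class Addressing
{
    // Region: the first octet, i.e. a /8 block of 256^3 = 16,777,216 addresses.
    public static int Region(IPAddress ip) => ip.GetAddressBytes()[0];

    // Sector: the first two octets, i.e. a /16 block of 256^2 = 65,536 addresses.
    public static (int, int) Sector(IPAddress ip)
    {
        var b = ip.GetAddressBytes();
        return (b[0], b[1]);
    }
}
```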

From that document it follows that regions 0, 10 and 127 are reserved by IANA itself, and regions 224 through 255 are reserved for so-called multicast and future use. A substantial further share belongs to large US telecommunications and IT companies and to the US and UK military: I counted 35 such regions.

In total, 70 regions out of 256 are inaccessible to mere mortals, and there is no need to scan them. The rest are divided among five regional zones: North America, South America, Africa, Asia-Pacific (Indonesia and China among others), and Europe with parts of Asia. Their further distribution is handled by other regional registries, and the one we care about is the European RIPE NCC. The whois services themselves are provided by and split among these regional organizations.

The European registry has been given 35 regions' worth of IP addresses to distribute to organizations, which I will conditionally call providers (although they are licensed according to each country's rules), plus 4 more regions which, as I understand it, are under special administration.

Only at this level can you say with certainty which territorial zone an IP address belongs to. Anything finer depends on what public information the regional organizations provide. But instead of 14 years for all 256 regions of the Internet, we now need to scan only the 39 European/Asian regions, which is a little over 2 years of work for a single processor.
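A quick back-of-the-envelope check of these figures, assuming the same 0.1 seconds per address as before:

```csharp
using System;

class ScanEstimate
{
    static void Main()
    {
        const double secondsPerProbe = 0.1;                // the 100 ms timeout per address
        const long addressesPerRegion = 256L * 256 * 256;  // one /8 region = 16,777,216 addresses
        const long allAddresses = 256L * addressesPerRegion;

        Console.WriteLine(allAddresses * secondsPerProbe / 86400);        // ~4971 days (~14 years) for all of IPv4
        Console.WriteLine(addressesPerRegion * secondsPerProbe / 86400);  // ~19.4 days per region
        Console.WriteLine(39 * addressesPerRegion * secondsPerProbe / 86400); // ~757 days for 39 regions, a bit over 2 years
    }
}
```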

You can go one level lower. Unfortunately, there is no further order (and there are exceptions even at the region level): a sector may belong to several different countries. But you can download the actual RIPE whois database and find country information in it. Cities are sometimes there too, but they are hardly suitable for machine processing, since, as I understand it, they are entered by careless network administrators when they receive IP addresses for a subnet; the fields are often mixed up (for example, an address instead of a city) or not filled in at all. The country code, however, is reliable.

Having processed their 3 GB text file and picked out the entries belonging to Latvia and Russia, I ended up with 2004 sectors for Russia and 307 for Latvia (the choice of Latvia is dictated by my homeland :)). Addition works a little differently here: 2004 + 307 = 2063 unique sectors. That is, as mentioned earlier, sectors naturally overlap, and they may also contain other European countries, but we have a rough estimate.
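For reference, a sketch of how such a count might be done, assuming the ripe.db.inetnum split dump in its usual form (objects separated by blank lines, with inetnum: and country: attributes); the file name and the two country codes are simply the ones used here:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;

class RipeSectors
{
    static void Main()
    {
        // Distinct /16 sectors per country code of interest.
        var sectors = new Dictionary<string, HashSet<int>>
        {
            ["RU"] = new HashSet<int>(),
            ["LV"] = new HashSet<int>()
        };

        string range = null, country = null;

        foreach (var rawLine in File.ReadLines("ripe.db.inetnum"))
        {
            var line = rawLine.Trim();
            if (line.Length == 0)                 // a blank line ends the current object
            {
                if (range != null && country != null && sectors.ContainsKey(country))
                    foreach (var s in SectorsOf(range))
                        sectors[country].Add(s);
                range = null;
                country = null;
            }
            else if (line.StartsWith("inetnum:"))
                range = line.Substring("inetnum:".Length).Trim();
            else if (line.StartsWith("country:"))
                country = line.Substring("country:".Length).Trim().ToUpperInvariant();
        }

        foreach (var kv in sectors)
            Console.WriteLine($"{kv.Key}: {kv.Value.Count} sectors");
    }

    // "a.b.c.d - e.f.g.h" -> the /16 sectors (first two octets) the range touches.
    static IEnumerable<int> SectorsOf(string range)
    {
        var parts = range.Split('-');
        uint lo = ToUInt(parts[0].Trim()), hi = ToUInt(parts[1].Trim());
        for (uint s = lo >> 16; s <= hi >> 16; s++)
            yield return (int)s;
    }

    static uint ToUInt(string ip)
    {
        var b = IPAddress.Parse(ip).GetAddressBytes();
        return ((uint)b[0] << 24) | ((uint)b[1] << 16) | ((uint)b[2] << 8) | b[3];
    }
}
```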

Namely, about 109 minutes to probe one sector, and about 156 days for all 2063 sectors (a lower-bound estimate, since on every open port 80 some extra time is spent looking up the domain).

This is already feasible, even for me: if I bring in all 8 of my cores, in about a month I will have an Internet map of Russia.
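A sketch of how the sector scan might be spread across the cores; ScanSector here is a placeholder for a walk over all 65,536 addresses of a sector using the port-80 probe from earlier:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

class ParallelScan
{
    static void Main()
    {
        // Sectors to scan, as (firstOctet, secondOctet) pairs taken from the RIPE extraction step.
        var sectors = new List<(byte a, byte b)> { (93, 184) /* ... ~2063 entries ... */ };

        // One sector is 65,536 probes, about 109 minutes at 0.1 s each;
        // spreading 2063 sectors over 8 cores brings ~156 days down to roughly 20.
        Parallel.ForEach(sectors,
            new ParallelOptions { MaxDegreeOfParallelism = 8 },
            sector => ScanSector(sector.a, sector.b));
    }

    // Placeholder: probe every address a.b.x.y of the a.b.0.0/16 sector on port 80.
    static void ScanSector(byte a, byte b)
    {
        // ... ProbeAsync from the earlier sketch, for every x and y ...
    }
}
```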

But what exactly is this kind of Internet mapping for?


Well, I don't want to limit your imagination, and some ideas of what it could be needed for were already sketched at the beginning. I want to emphasize that this time is required only for the first scan, the one that identifies all computers with an open port 80 (that is, potential candidates for providing web services), among which the real domains (DNS names) are very few.

But we really would get all the domains as of 2015, which can later be analyzed, and who knows what else. If you want to help me, see the need for this, or simply don't begrudge the processor time, write to me and I will give you a ready-made program that analyzes a sector; you send back the result, and together we build a public database: an Internet map.

And maybe this is how a new Google is born :)

P.S. If the response is positive, I will create a dedicated resource for this.

Source: https://habr.com/ru/post/247291/

