URL history: domain, protocol and port

On January 11, 1982, twenty-two computer scientists met to discuss “computer mail” (now known as e-mail). Participants included the future founder of Sun Microsystems, the guy who made Zork, the guy who created NTP, and the one who convinced the government to pay for Unix. Their task was to solve a problem: there were 455 hosts on the ARPANET, and the situation was getting out of control.

The problem arose because ARPANET was switching from its original NCP protocol to TCP/IP, the protocol the Internet runs on today. After such a transition, many interconnected networks (an “inter... net”) were expected to appear quickly, and they would need a hierarchical domain system so that ARPANET could resolve its own domains while other networks resolved theirs.

Other networks existed at the time: “COMSAT”, “CHAOSNET”, “UCLNET” and “INTELPOSTNET”. They were run by groups of universities and companies across the United States that wanted to exchange information and could afford to lease 56 kbit/s lines from the telephone company and buy the PDP-11s needed for routing.

In the original ARPANET architecture, the central Network Information Center (NIC) was responsible for a special file listing every host on the network. It was called HOSTS.TXT, similar to /etc/hosts on modern Linux or macOS. Any change to the network required the NIC to connect to every host via FTP (a protocol created in 1971), which put a heavy load on the infrastructure.

Keeping every host on the Internet in a single file could not, of course, be considered scalable. But the priority was e-mail, which was the main addressing problem of the day. In the end, the group decided to create a hierarchical system in which information about a domain or a set of domains could be requested from a remote system. In their words: "The solution was to extend the existing mail identifier of the form 'user@host' to 'user@host.domain', where 'domain' could be a hierarchy of domains." And so the domain was born.

It is important to note that these architectural decisions were made without any assumptions about how domains would be used in the future. They were driven only by ease of implementation: the chosen scheme required minimal changes to existing systems. One proposal was to write e-mail addresses as <user>.<host>@<domain>. If the mail names of the day had not already contained dots, today you would probably be writing to me at zack.eager@io.

UUCP and Bang Path

It has been said that the principal function of an operating system is to define a number of different names for the same object, so that it can busy itself keeping track of the relationship between all of the different names. Network protocols seem to have somewhat the same characteristic.

- David D. Clark, 1982

Another rejected proposal was to separate domains with an exclamation mark. For example, to connect to the ISIA host on the ARPANET network, you would write !ARPA!ISIA. Wildcards could be used as well: !ARPA!* would mean every host on the ARPANET network.

This addressing scheme was not a wild departure from the standard. On the contrary, it was an attempt to stay with an existing one: exclamation-mark separators were used by UUCP, a data transfer tool created in 1976. If you are reading this article on macOS or Linux, uucp is most likely still installed and available in your terminal.

ARPANET appeared in 1969 and quickly became a powerful information sharing tool... among the handful of universities and government organizations that had access to it. The Internet in its familiar form became available to the public in 1991, twenty-two years later. But that does not mean computer users were not exchanging information in the meantime.

In the pre-Internet era, a method was used to directly connect one machine to another via the telephone network. For example, if you wanted to send me a file, your modem would call my modem, and the file would be transferred. To make this a kind of network, UUCP was invented.

In this system, each computer had a file listing the hosts it knew about, their phone numbers, and a login and password for each host. You then had to spell out a route from the current machine to the target through a chain of other hosts:

sw-hosts!digital-lobby!zack

An address like this was used not only to transfer files or connect directly to a computer, but also as an e-mail address. In the era before mail servers, you could not send me an e-mail if my computer was turned off.
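The route above can be taken apart mechanically. The following is a minimal sketch, assuming a bang path is simply a list of relay hosts followed by a user name; the function name parse_bang_path is made up for illustration and is not part of any UUCP tool.

```python
def parse_bang_path(address):
    """Split a UUCP-style bang path into its relay hosts and the final user."""
    parts = [part.strip() for part in address.split("!") if part.strip()]
    if len(parts) < 2:
        raise ValueError("a bang path needs at least one host and a user")
    # Everything except the last element is a hop; the last one is the mailbox.
    return parts[:-1], parts[-1]


hops, user = parse_bang_path("sw-hosts!digital-lobby!zack")
print("relay through:", " -> ".join(hops))
print("deliver to:", user)
```

The sender, not the network, is responsible for knowing the whole route, which is exactly what made these addresses fragile.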

While ARPANET was only available to top universities, UUCP allowed the “underground” Internet to appear for ordinary people. It was the basis for both Usenet and the BBS system.

DNS

The DNS system that we still use was proposed in 1983. If you make a DNS query using the dig utility, for example, the answer will be something like this:

;; ANSWER SECTION:
google.com. 299 IN A 172.217.4.206

This means that google.com is reachable at 172.217.4.206. As you probably know, A means that this is an address record, which maps a domain to an IPv4 address. The number 299 is the time to live (TTL): how many more seconds the record is considered valid before a new request has to be made. But what is IN?

IN stands for "Internet". Like many other details, this field is a leftover from a time when several competing networks existed and had to talk to each other. Other possible values for the field are CH for CHAOSNET and HS for Hesiod (the name service of the Athena system). CHAOSNET shut down long ago, but a heavily modified version of Athena is still used by MIT students. The full list of DNS classes is available on the IANA website, but unsurprisingly only one of the possible values is in common use.
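To make the fields concrete, here is a small sketch that splits an answer line like the one shown above into name, TTL, class, type and data. It assumes the whitespace-separated layout that dig prints; the ResourceRecord type and parse_answer_line function are illustrative names, not part of any DNS library.

```python
from typing import NamedTuple


class ResourceRecord(NamedTuple):
    name: str      # fully qualified domain name, with the trailing dot
    ttl: int       # seconds the record may be cached
    rr_class: str  # "IN" for Internet, "CH" for CHAOSNET, "HS" for Hesiod
    rr_type: str   # "A" for an IPv4 address record
    data: str      # record payload, here the IPv4 address


def parse_answer_line(line):
    """Parse one line of a dig ANSWER SECTION into its five fields."""
    name, ttl, rr_class, rr_type, data = line.split(None, 4)
    return ResourceRecord(name, int(ttl), rr_class, rr_type, data)


record = parse_answer_line("google.com. 299 IN A 172.217.4.206")
print(record)
```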

TLD

The likelihood that some other TLD will be created is extremely low.

- Jon Postel, 1994

Once it was decided that domain names should be arranged hierarchically, it remained to define the root of the hierarchy. That root is denoted by a dot. Ending every domain name with a dot is semantically correct, and such addresses work in a browser: google.com.

The first TLD was .arpa. It let users keep their old ARPANET hostnames during the transition period. For example, if my machine had previously been registered as hfnet, my new address would be hfnet.arpa. This was a temporary fix for the changeover. After that, server administrators had an important decision to make: which of the five TLDs should they choose? .com, .gov, .org, .edu or .mil?

When people say that DNS is hierarchical, they mean that there is a set of root DNS servers responsible for, say, turning .com into the addresses of the DNS servers for the .com zone, which in turn answer the question of how to get to google.com. The DNS root zone is served by thirteen clusters of DNS servers. There are only thirteen because more would not fit into a single UDP packet: historically DNS has run over UDP, so the response to a request could not be longer than 512 bytes.

 ; This file holds the information on root name servers needed to
 ; initialize cache of Internet domain name servers
 ; (e.g. reference this file in the "cache . "
 ; configuration file of BIND domain name servers).
 ;
 ; This file is made available by InterNIC
 ; under anonymous FTP as
 ;     file        /domain/named.cache
 ;     on server   FTP.INTERNIC.NET
 ;     -OR-        RS.INTERNIC.NET
 ;
 ; last update: March 23, 2016
 ; related version of root zone: 2016032301
 ;
 ; formerly NS.INTERNIC.NET
 ;
 .                    3600000  NS    A.ROOT-SERVERS.NET.
 A.ROOT-SERVERS.NET.  3600000  A     198.41.0.4
 A.ROOT-SERVERS.NET.  3600000  AAAA  2001:503:ba3e::2:30
 ;
 ; FORMERLY NS1.ISI.EDU
 ;
 .                    3600000  NS    B.ROOT-SERVERS.NET.
 B.ROOT-SERVERS.NET.  3600000  A     192.228.79.201
 B.ROOT-SERVERS.NET.  3600000  AAAA  2001:500:84::b
 ;
 ; FORMERLY C.PSI.NET
 ;
 .                    3600000  NS    C.ROOT-SERVERS.NET.
 C.ROOT-SERVERS.NET.  3600000  A     192.33.4.12
 C.ROOT-SERVERS.NET.  3600000  AAAA  2001:500:2::c
 ;
 ; FORMERLY TERP.UMD.EDU
 ;
 .                    3600000  NS    D.ROOT-SERVERS.NET.
 D.ROOT-SERVERS.NET.  3600000  A     199.7.91.13
 D.ROOT-SERVERS.NET.  3600000  AAAA  2001:500:2d::d
 ;
 ; FORMERLY NS.NASA.GOV
 ;
 .                    3600000  NS    E.ROOT-SERVERS.NET.
 E.ROOT-SERVERS.NET.  3600000  A     192.203.230.10
 ;
 ; FORMERLY NS.ISC.ORG
 ;
 .                    3600000  NS    F.ROOT-SERVERS.NET.
 F.ROOT-SERVERS.NET.  3600000  A     192.5.5.241
 F.ROOT-SERVERS.NET.  3600000  AAAA  2001:500:2f::f
 ;
 ; FORMERLY NS.NIC.DDN.MIL
 ;
 .                    3600000  NS    G.ROOT-SERVERS.NET.
 G.ROOT-SERVERS.NET.  3600000  A     192.112.36.4
 ;
 ; FORMERLY AOS.ARL.ARMY.MIL
 ;
 .                    3600000  NS    H.ROOT-SERVERS.NET.
 H.ROOT-SERVERS.NET.  3600000  A     198.97.190.53
 H.ROOT-SERVERS.NET.  3600000  AAAA  2001:500:1::53
 ;
 ; FORMERLY NIC.NORDU.NET
 ;
 .                    3600000  NS    I.ROOT-SERVERS.NET.
 I.ROOT-SERVERS.NET.  3600000  A     192.36.148.17
 I.ROOT-SERVERS.NET.  3600000  AAAA  2001:7fe::53
 ;
 ; OPERATED BY VERISIGN, INC.
 ;
 .                    3600000  NS    J.ROOT-SERVERS.NET.
 J.ROOT-SERVERS.NET.  3600000  A     192.58.128.30
 J.ROOT-SERVERS.NET.  3600000  AAAA  2001:503:c27::2:30
 ;
 ; OPERATED BY RIPE NCC
 ;
 .                    3600000  NS    K.ROOT-SERVERS.NET.
 K.ROOT-SERVERS.NET.  3600000  A     193.0.14.129
 K.ROOT-SERVERS.NET.  3600000  AAAA  2001:7fd::1
 ;
 ; OPERATED BY ICANN
 ;
 .                    3600000  NS    L.ROOT-SERVERS.NET.
 L.ROOT-SERVERS.NET.  3600000  A     199.7.83.42
 L.ROOT-SERVERS.NET.  3600000  AAAA  2001:500:9f::42
 ;
 ; OPERATED BY WIDE
 ;
 .                    3600000  NS    M.ROOT-SERVERS.NET.
 M.ROOT-SERVERS.NET.  3600000  A     202.12.27.33
 M.ROOT-SERVERS.NET.  3600000  AAAA  2001:dc3::35
 ; End of file

Root DNS servers sit in safes inside locked cages. A clock stands on the safe so that anyone watching can tell whether the video surveillance feed has been tampered with. Given how slowly DNSSEC is being rolled out, compromising one of these servers would let an attacker redirect all the traffic of a portion of Internet users. This is the plot of the coolest movie that has never been made.

Not surprisingly, the DNS servers of the top-level domains rarely change. 98% of the requests that reach root DNS servers are made in error, usually because clients do not cache results as they should. This became such a problem that several root server operators added special servers whose only job is to answer “go away!” to everyone making reverse DNS queries for their local IP addresses.

TLD servers are administered by different companies and governments around the world (Verisign is responsible for .com). When you buy a domain in the .com zone, approximately 18 cents go to ICANN and $7.85 goes to Verisign.

Punycode

It is rare for the funny name that developers give their project to become the final, public name of the product. You might call your database of companies Delaware (because all the companies are registered in that state), but you can be sure that by the time it reaches production it will be called CompanyMetadataDatastore. Occasionally, though, when the stars align and the boss is on vacation, a name slips through.

Punycode is a system for encoding Unicode in domain names. The problem it solves is simple to state: how do you write a name like 日本語.com when the entire Internet is built around the ASCII alphabet, in which the most exotic character is the tilde?

You can't simply switch domain names to Unicode. The original documents that define domains require them to be encoded in ASCII. Every device on the Internet built over the past forty years, including the Cisco and Juniper routers that delivered this page to you, operates on that assumption.

The web itself was never limited to ASCII. Initially the standard was ISO 8859-1, which included every ASCII character plus a set of special characters like ¼ and accented letters like ä. But that standard covered only Latin scripts.

This restriction on HTML was finally lifted in 2007, and in the same year Unicode became the most popular character encoding standard on the web. But domains were still confined to ASCII.

As you might guess, Punycode was not the first attempt to solve this problem. You have most likely heard of UTF-8, a popular way of encoding Unicode into bytes (the 8 stands for the eight bits in a byte). In 2000, several members of the Internet Engineering Task Force came up with UTF-5. The idea was to encode Unicode in five-bit chunks, each of which maps to a character permitted in domain names (A-V and 0-9). For example, if I had a website about learning Japanese, the address 日本語.com would have turned into the cryptic M5E5M72COA9E.com.
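The mapping can be reproduced in a few lines. This is a sketch based on my reading of the UTF-5 proposal, not a reference implementation: each code point is written in hexadecimal, and the first hex digit of every character is replaced by a letter from G to V so a decoder can tell where a new character starts. The function name utf5_encode is an illustrative choice.

```python
# G..V stand for the hex digits 0..F when they open a new character.
LEADING_DIGITS = "GHIJKLMNOPQRSTUV"


def utf5_encode(text):
    """Encode text with the UTF-5 scheme (sketch, as described above)."""
    encoded = []
    for char in text:
        hex_digits = format(ord(char), "X")           # e.g. U+65E5 -> "65E5"
        first = LEADING_DIGITS[int(hex_digits[0], 16)]  # 6 -> "M"
        encoded.append(first + hex_digits[1:])         # "M" + "5E5"
    return "".join(encoded)


print(utf5_encode("日本語") + ".com")  # M5E5M72COA9E.com
```

Every character costs at least as many output letters as it has hex digits, which is where the length problem comes from.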

This method has several drawbacks. First, the characters A-V and 0-9 are used in the encoded output, so if you want to use one of them in the domain name itself, it has to be encoded like any other character. The resulting names are very long, and that is a serious problem: each domain label is limited to 63 characters. A domain in Burmese would be limited to 16 characters. Still, the whole undertaking has interesting consequences: encoded this way, Unicode can be transmitted in Morse code or by telegram.

A separate question was how clients would know that a domain had been encoded this way, so that they could show the user the real name instead of M5E5M72COA9E.com in the address bar. There were several suggestions, for example to use an unused bit in the DNS response. But it was “the last unused bit in the header,” and the people in charge of DNS “really didn’t want to give it away.”

Another idea was to add ra-- to the front of every domain name that used this encoding. At the time (mid-April 2000) not a single domain began with ra--. I know the Internet, so I am sure of one thing: someone bought an ra-- domain out of spite right after that proposal was published.

The final decision came in 2003: use the Punycode format. It includes delta encoding, which significantly shortens the encoded domain names. Delta encoding is a particularly good fit for domains, because all the characters in a domain name are likely to sit in roughly the same area of the Unicode table. For example, two Farsi characters will be closer to each other than a Farsi character and a Hindi character. To see how this works, let's look at an example using this meaningless phrase:

يذؽ

Unencoded, these are the three characters [1610, 1584, 1597] (their positions in Unicode). To encode them, first sort them numerically (remembering the original order): [1584, 1597, 1610]. Now you can store the smallest value (1584), the distance (delta) to the next character (13), and the delta to the one after that (13 again), which is considerably less information to transmit and store.
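The sort-and-subtract step can be checked directly. This sketch covers only the delta idea described here; real Punycode also records where each character belongs in the name and uses a variable-length integer encoding, both of which are left out. The function name sorted_deltas is made up for the example.

```python
def sorted_deltas(text):
    """Return the smallest code point and the gaps between the sorted code points."""
    code_points = sorted(ord(char) for char in text)
    deltas = [b - a for a, b in zip(code_points, code_points[1:])]
    return code_points[0], deltas


phrase = "\u064a\u0630\u063d"  # the three Arabic-script characters shown above
print([ord(char) for char in phrase])  # [1610, 1584, 1597]
print(sorted_deltas(phrase))           # (1584, [13, 13])
```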

Punycode then efficiently (very efficiently!) encodes these integers into characters allowed in domain names and adds xn-- to the beginning of the string to tell clients that the name is encoded. You may notice that all the Unicode characters end up gathered at the end of the domain. That tail stores not only the values of the characters but also their positions in the name. For example, the address 熱狗sales.com turns into xn--sales-r65lm0e.com. Whenever you type Unicode characters into your browser’s address bar, they are encoded this way.
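Python happens to ship a punycode codec in its standard library, so the round trip can be tried without extra packages. The sketch below encodes a single label and adds the xn-- prefix by hand; it skips the normalization steps that a full IDNA implementation performs, so treat it as an illustration rather than a registrar-grade conversion.

```python
label = "熱狗sales"

# The ASCII characters are copied through first, then a "-" delimiter,
# then the encoded non-ASCII characters with their positions.
encoded = label.encode("punycode")
ascii_label = "xn--" + encoded.decode("ascii")
print(ascii_label + ".com")

# Decoding restores the original label, including character order.
decoded = encoded.decode("punycode")
print(decoded)
```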

This conversion could have been made invisible to the user, but that would open a serious security hole. Some Unicode characters look identical to existing ASCII characters. For example, you will most likely not notice the difference between the Cyrillic letter “а” and the Latin “a”. If I register аmazon.com with a Cyrillic “а” (xn--mazon-3ve.com), you may not realize that you are on a different site. This is why a Unicode domain under .ws looks so dull in your browser: xn--vi8hiv.ws.
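The look-alike problem is easy to demonstrate. The sketch below compares a Latin "amazon" with a spoofed label whose first letter is Cyrillic, using only the standard library; the built-in idna codec implements the older IDNA 2003 rules, which is an assumption worth keeping in mind if you compare its output with a browser.

```python
import unicodedata

latin = "amazon"
spoof = "\u0430mazon"  # first letter is CYRILLIC SMALL LETTER A

# The two strings render almost identically but are not equal.
print(latin == spoof)
for char in (latin[0], spoof[0]):
    print(f"U+{ord(char):04X}", unicodedata.name(char))

# The ASCII form exposes the difference.
spoof_ascii = spoof.encode("idna")
print(spoof_ascii)
```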

Protocol

The first part of the URL is the protocol used to make the connection. The most common protocol is http. It is a simple document transfer protocol that Tim Berners-Lee designed specifically for the web. It was not the only option: some people thought Gopher should be used instead. Gopher was designed specifically for sending structured data, similar to a file tree.

For example, when requesting /Cars you can get this answer:

 1Chevy Camaro              /Archives/cars/cc      gopher.cars.com     70
 iThe Camero is a classic   fake                   (NULL)              0
 iAmerican Muscle car       fake                   (NULL)              0
 1Ferrari 451               /Factbook/ferrari/451  gopher.ferrari.net  70

The response describes two cars, gives some extra meta-information about them, and tells you where to go for more. The idea was that the client would process this information and present it in a convenient form, with each entry linked to its destination page.
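A client's first job is to split each menu line into its parts. The sketch below assumes the usual Gopher menu layout, where the first character is the item type and the remaining fields (display text, selector, host, port) are separated by tabs; the sample data mirrors the response above, and parse_gopher_line is an illustrative name.

```python
def parse_gopher_line(line):
    """Split one Gopher menu line into its item type and four tab-separated fields."""
    item_type, rest = line[0], line[1:]
    display, selector, host, port = rest.split("\t")
    return {
        "type": item_type,    # "1" is a directory, "i" an informational line
        "display": display,
        "selector": selector,
        "host": host,
        "port": int(port),
    }


menu = (
    "1Chevy Camaro\t/Archives/cars/cc\tgopher.cars.com\t70\r\n"
    "iThe Camero is a classic\tfake\t(NULL)\t0\r\n"
    "iAmerican Muscle car\tfake\t(NULL)\t0\r\n"
    "1Ferrari 451\t/Factbook/ferrari/451\tgopher.ferrari.net\t70\r\n"
)
entries = [parse_gopher_line(line) for line in menu.splitlines()]
for entry in entries:
    print(entry)
```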

The first popular protocol was FTP. It was created in 1971 to list and download files on remote machines. Gopher was a logical continuation of this idea: it offered a similar listing, but also included mechanisms for attaching meta-information to entries. That meant it could be used for other tasks as well, such as a news feed or a simple database. What it lacked was the freedom and simplicity that characterize HTTP and HTML.

HTTP is a very simple protocol, especially compared with alternatives like FTP or even HTTP/2, which is growing in popularity today. First, HTTP is entirely text-based; it uses no magic binary elements (which could have improved performance significantly). Tim Berners-Lee correctly judged that a text format would make it easier for generations of programmers to develop and debug applications that use HTTP.

HTTP also makes no assumptions about the content. Although it was designed specifically for transmitting HTML, it lets you specify the type of content (using the MIME Content-Type, which was a new invention at the time). The protocol itself is pretty simple.

Request:

 GET /index.html HTTP/1.1
 Host: www.example.com

Possible answer:

 HTTP/1.1 200 OK
 Date: Mon, 23 May 2005 22:38:34 GMT
 Content-Type: text/html; charset=UTF-8
 Content-Encoding: UTF-8
 Content-Length: 138
 Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
 Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
 ETag: "3f80f-1b6-3e1cb03b"
 Accept-Ranges: bytes
 Connection: close

 <html>
 <head>
 <title>An Example Page</title>
 </head>
 <body>
 Hello World, this is a very simple HTML document.
 </body>
 </html>
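Because the format is plain text, a response can be picked apart with ordinary string operations. The sketch below is not a complete HTTP parser (it ignores chunked bodies, repeated headers and much else); it only shows the status line, headers and body of a shortened sample response, and parse_response is an illustrative name.

```python
def parse_response(raw):
    """Split a raw HTTP/1.1 response into status code, reason, headers and body."""
    head, _, body = raw.partition("\r\n\r\n")  # a blank line ends the headers
    status_line, *header_lines = head.split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return int(code), reason, headers, body


raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "Hello, world!"
)
code, reason, headers, body = parse_response(raw)
print(code, reason)
print(headers["content-type"])
print(body)
```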

To grasp the context, remember that the network is based on IP, Internet Protocol. IP is responsible for transferring a small data packet (about 1500 bytes) from one computer to another. On top of this is TCP, which is responsible for transferring larger blocks of data like entire documents or files. TCP provides guaranteed delivery using multiple IP packets. Over it lives a protocol like HTTP or FTP, which indicates which data format to use for sending via TCP (or UDP or another protocol) to transfer meaningful and understandable data.

In other words, TCP/IP delivers a bunch of bytes to another computer, and a protocol at the HTTP level explains what those bytes are and what they mean.

You can invent your own protocol if you like, assembling the bytes in TCP messages however you please. The only requirement is that the recipient speaks the same language. That is why it is customary to standardize these protocols.
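As an illustration of how little a home-made protocol needs, the sketch below defines one where every message is a 4-byte big-endian length followed by UTF-8 text. This framing is invented for the example. To keep it runnable without a network, it uses socket.socketpair(), which gives two connected stream sockets that behave like the two ends of a TCP connection for this purpose.

```python
import socket
import struct


def send_message(sock, text):
    """Send one message: 4-byte big-endian length, then the UTF-8 payload."""
    payload = text.encode("utf-8")
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, count):
    """Read exactly count bytes; a stream may deliver them in several pieces."""
    chunks = []
    while count > 0:
        chunk = sock.recv(count)
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)


def recv_message(sock):
    """Receive one message framed by send_message."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length).decode("utf-8")


sender, receiver = socket.socketpair()
send_message(sender, "hello")
reply = recv_message(receiver)
print(reply)
sender.close()
receiver.close()
```

Both sides have to agree on the length prefix and the text encoding, which is the "same language" requirement in miniature.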

Of course, there are less important protocols too. For example, there is a protocol for the quote of the day (port 17) and one that produces random characters (port 19). They may seem funny today, but they help show how important a universal protocol for transmitting documents, which is what HTTP became, really is.

Port

The history of Gopher and HTTP can be traced through their port numbers: Gopher is 70, HTTP is 80. The port for HTTP was assigned (most likely by Jon Postel at IANA) at the request of Tim Berners-Lee sometime between 1990 and 1992.

The concept of registering "port numbers" predates the Internet. In the original NCP protocol, which the ARPANET ran on, remote addresses were identified by 40 bits. The first 32 identified the remote host, much as an IP address does today. The last 8 bits were known as the AEN (“Another Eight-bit Number”) and served a port-like purpose: separating messages meant for different recipients on the same machine. In other words, the address pointed to the machine the message should be delivered to, and the AEN (or port number) pointed to the application that should receive it.
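The 32 + 8 split is plain bit arithmetic. The sketch below treats a 40-bit address as a single integer, which is an assumption made for illustration (the text does not describe how NCP laid the bits out on the wire), and the function names are made up.

```python
def split_ncp_address(address):
    """Split a 40-bit address into a 32-bit host part and an 8-bit AEN."""
    if not 0 <= address < (1 << 40):
        raise ValueError("address must fit in 40 bits")
    host = address >> 8   # upper 32 bits: which machine
    aen = address & 0xFF  # lower 8 bits: which application on that machine
    return host, aen


def join_ncp_address(host, aen):
    """Combine a 32-bit host and an 8-bit AEN back into one 40-bit value."""
    return (host << 8) | aen


address = join_ncp_address(host=10, aen=80)
print(split_ncp_address(address))  # (10, 80)
```

Eight bits allow only 256 distinct values per host, which gives a sense of how tight that space was.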

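Users were soon asked to register these “socket numbers” to limit collisions. When TCP/IP widened port numbers to 16 bits, the registration process continued.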

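Although every protocol has a default port, it makes sense to allow the port to be set by hand as well. The same logic led to prefixing website addresses with www. Back then, hardly anyone would be given the root of a domain just to host an “experimental” website. But if you hand users the name of a specific machine (dx3.cern.ch), you are in trouble when that machine has to be replaced. With a common subdomain (www.cern.ch), you can change what it points to as needed.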

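The bit in between

As you probably know, URL syntax puts a double slash (//) between the protocol and the rest of the address:

 http://eager.io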

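The double slash was inherited from the Apollo computer system, one of the first networked workstations. The Apollo team had the same problem as Tim Berners-Lee: they needed a way to separate a path from the machine that path lives on. Their solution was a special path format:

 //computername/file/path/as/usual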

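Tim Berners-Lee copied that scheme. He now regrets the decision, by the way, and wishes the domain (example.com in this case) were the first part of the path:

 http:com/example/foo/bar/baz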

Source: https://habr.com/ru/post/305484/
