
My URL is not your URL



Back in 1996, when I started working on httpget, the predecessor of the curl project, I wrote my first URL parser. At the time, this universal address was called a URL: Uniform Resource Locator. Its specification was published by the IETF in 1994. The abbreviation URL then served as inspiration for the name of the tool and of the curl project itself.

The term "URL" was later superseded: it came to be called the URI, Uniform Resource Identifier, per the specification published in 2005, but the essence was preserved: the syntax for a string that identifies an online resource and specifies the protocol for retrieving it. We say that curl accepts URLs as defined by that specification, RFC 3986. Below, I'll explain why that isn't quite true.

There is also a related RFC describing the IRI: Internationalized Resource Identifier. An IRI is essentially the same as a URI, except that an IRI allows non-ASCII characters.
The WHATWG group later created its own URL specification, largely merging the formats and ideas of URIs and IRIs with a strong focus on browsers (which is not surprising). One of their stated goals is to "align RFC 3986 and RFC 3987 with contemporary implementations and obsolete them in the process." They want to return to the term "URL", rightly pointing out that URI and IRI only confuse matters and that people never understand the distinction (or often don't even know those terms exist).

The WHATWG specification is written in the spirit of the good old browser mantra: be as liberal as possible with users, always try to guess what they mean, and bend over backwards trying to do it. Yet by now we all know that Postel's law is not necessarily the best approach. In practice, it means the specification tolerates too many slashes, spaces, and non-ASCII characters in URLs.

From my point of view, such a specification is also very hard to read and to follow, since it does not describe the syntax or format in any detail, but instead mandates a parsing algorithm. To verify my claim: try to find out what the specification says about a trailing dot after the host name in a URL.

On top of all these standards and specifications, every browser's interface has an address bar (which goes by various names) that lets users enter all sorts of amusing strings and converts them into URLs. If you enter "http://localhost/%41" in the address bar, the percent sequence is converted to "A" (since 0x41 is the ASCII code of the capital letter A), but if you enter "http://localhost/A A" (with a space), the outgoing HTTP GET request is actually sent to "/A%20A" (with the space percent-encoded). I mention this because people often assume that anything that can be typed into this field is a URL.
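Both conversions above can be reproduced with Python's standard `urllib.parse` helpers, which implement the same percent-encoding rules (a sketch; the exact normalization each browser applies varies):

```python
from urllib.parse import quote, unquote

# "%41" decodes to "A": 0x41 is the ASCII code of capital A
print(unquote("/%41"))  # -> /A

# A literal space is not a valid URL character; it is
# percent-encoded as "%20" before the request goes out
print(quote("/A A"))    # -> /A%20A
```
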

The above is, roughly, my (biased) summary of the specifications and standards we have to work with. Now let's add reality and see what problems arise when my URL is not your URL.

So what is a URL?


Or more specifically: how do we write them? What syntax do we use?

I think one of the biggest mistakes of the WHATWG specification (and the reason I oppose it in its current form, firmly convinced that they are wrong) is the belief that they alone get to work with and define URLs, limiting their view of URLs to browsers, HTML, and address bars. Granted, WHATWG was created by the large companies behind the browsers nearly everyone uses, and URLs are used heavily in those browsers, but URLs matter far beyond them.

The WHATWG's notion of what a URL is has not been widely adopted outside browsers.

Colon-slash-slash


If you ask users, ordinary people with no particular knowledge of protocols or networking, what a URL is, what will they answer? The sequence "://" (colon-slash-slash) would top the list of answers; a few years ago, when browsers displayed URLs in full, it would have been even more prominent. Seeing that sequence, we immediately recognize that we are looking at a URL.

But let's step away from users and look around: the world is full of mail clients, terminal emulators, text editors, Perl scripts, and many other things that can recognize URLs and act on them, for example, open a URL in a browser, or turn it into a clickable link in generated HTML. A great many such scripts and programs use exactly the colon-slash-slash sequence as the key marker.

The WHATWG specification says that there must be at least one slash and that the parser must accept any number of them. This means that "http:/example.com" and "http:///////////////example.com" are both perfectly acceptable. RFC 3986 and many others disagree. And indeed, most of the people I have argued with over the past few days, even those who work on the web, say, think, and are convinced that a URL has two slashes. Just take a closer look at the screenshot of the Google image search results for "URL" earlier in this article.
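As a rough illustration of how a strict RFC 3986-style parser treats the "wrong" number of slashes, here is Python's `urllib.parse` (which follows the RFC rather than the WHATWG algorithm): with one or three slashes, the host never ends up in the authority component at all.

```python
from urllib.parse import urlparse

# Two slashes: "example.com" is parsed as the host (authority)
print(urlparse("http://example.com/path").netloc)   # -> example.com

# One slash: no authority at all; everything lands in the path
print(urlparse("http:/example.com/path").netloc)    # -> (empty)
print(urlparse("http:/example.com/path").path)      # -> /example.com/path

# Three slashes: empty authority, host ends up in the path too
print(urlparse("http:///example.com/path").netloc)  # -> (empty)
```

A WHATWG-conformant parser would instead normalize all three inputs to the same two-slash URL.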

We just know that a URL has two slashes (although, yes, file: URLs usually have three, but let's ignore that for now). Not one. Not three. Two. But the WHATWG disagrees.

"Is there even one good reason to accept more than two slashes for non-file URLs?" (I asked the WHATWG members, irritated.)

"Well, all browsers do it."

The specification says this because browsers have implemented it that way.

No better explanation was offered, even after I pointed out that the claim is incorrect and not all browsers do this. You may find that discussion thread informative.

In the curl project, we only recently started discussing how to handle URLs with a number of slashes other than two, because it turns out there are already servers in the wild that send back such URLs in "Location:" headers, and some browsers accept them without objection. curl does not, and neither do most other libraries and command-line tools. Whom should we follow?

Spaces


The space character (code 32 in ASCII, hex 0x20) cannot be part of a URL. If you want to send one, you must percent-encode it, as with any other invalid character that needs to become part of a URL. Percent-encoding is the byte value in hexadecimal, preceded by a percent sign; thus "%20" means a space. It also means that a parser, for example one scanning text for URLs, knows it has reached the end of a URL when it encounters an invalid character, such as a space.
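A minimal sketch of such a scanner: a naive regex-based URL extractor (the text and pattern are my own illustration, not from the original article) that treats whitespace as the end of the URL, so the percent-encoded space survives while the literal space terminates the match.

```python
import re

text = "Read the docs at http://example.com/a%20b now!"

# A naive URL scanner: the URL ends at the first whitespace
# character, which cannot appear in a URL unencoded
match = re.search(r"https?://\S+", text)
print(match.group())  # -> http://example.com/a%20b
```
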

Browsers usually convert any %20 in their address bars into a space character so that links look nicer. If you copy the address to the clipboard and paste it into a text editor, you see the spaces as %20 again, as expected.

I'm not sure whether this is the reason, but browsers also accept spaces as part of a URL when receiving, for example, a redirect in an HTTP response. Such URLs are sent from the server to the client in the "Location:" header. Browsers happily allow spaces in those URLs, encoding them as %20 before sending the next request. This forced curl to accept spaces in redirect "URLs" as well.

Non-ASCII


Support for languages whose writing systems include non-ASCII characters is of course important, especially for non-Western communities, and I agree that the IRI specification was never good enough. I am personally far from an internationalization expert, so I go by what I have heard from others. But clearly, users of non-Latin alphabets and writing systems should be able to write their "internet addresses" to resources and use them as links.

Ideally, we would have an internationalized version for displaying to the user and an ASCII version for internal use in network requests.

For international domain names, the name is converted to punycode so that it can be resolved by ordinary DNS servers that know nothing about non-ASCII names. URIs have no IDN names; IRIs and WHATWG URLs do. curl supports IDN host names.
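The punycode conversion can be seen with Python's built-in idna codec (a sketch using a made-up German host name; curl itself does this via libidn2, not this codec):

```python
# Convert an internationalized host name to punycode so that
# plain DNS servers, which only understand ASCII, can resolve it
host = "bücher.example"
ascii_host = host.encode("idna").decode("ascii")
print(ascii_host)  # -> xn--bcher-kva.example
```
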

WHATWG says that URLs can use UTF-8, whereas URIs are ASCII only. curl does not accept non-ASCII characters in the path part of an address, but percent-encodes them in outgoing requests; this causes "interesting" side effects when the non-ASCII characters are in an encoding other than UTF-8, which is the default on Windows, for example.
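The side effect can be demonstrated with `urllib.parse.quote`: the same character percent-encodes to different byte sequences depending on which character encoding is assumed, so a UTF-8 client and a Latin-1 server end up disagreeing about the path.

```python
from urllib.parse import quote

# The same character "ä" yields different percent-encodings
# depending on the assumed byte encoding of the path
print(quote("ä"))                      # -> %C3%A4 (UTF-8, two bytes)
print(quote("ä", encoding="latin-1"))  # -> %E4    (Latin-1, one byte)
```
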

As I wrote above, this means servers send back non-ASCII bytes in HTTP headers, browsers readily accept them, and non-browsers have to cope with them too.

There is no URL standard


I have not tried to present a complete list of problems or inconsistencies; this is just a selection of difficulties I have run into recently. A "URL" handed out in one place will not necessarily be accepted or understood as a "URL" somewhere else.

Nowadays, even curl no longer strictly follows any published specification; we are slowly drifting away from it in the name of "web compatibility".

There is no unified URL standard, and no work toward one. I cannot consider the WHATWG effort a real attempt, since the specification is written by a closed group with no serious effort to involve the wider community.

Source: https://habr.com/ru/post/301516/
