URI: the complexity behind the simple (Part 1)


Hi, Habr!

I finally found some free time and decided to write this post, which I have been meaning to write for a long time.
It is about something as seemingly simple as the URI, a topic that rarely gets a detailed treatment on RuNet.
"Pff, links are links, what is there to understand?" you might say. Then let me ask a question:

What is what, and where will each of them lead us?

If you do not know the exact answer, or are simply curious, and if you are not afraid of a large number of three-letter abbreviations, you are welcome below the cut.

Before starting, I should point out that there is already a post on a similar topic, which explains everything more simply and a little more clearly. My goal here is a deeper study of the subject and to gather information about URIs in one place, so that it does not get lost. Well, almost in one place: the article is split into two parts.
For your convenience, here is a table of contents. It relies on a URI feature that we will look at later in this article.

  1. URI
    1.1. Syntax
    1.2. URI components
  2. URL
    2.1. Structure
  3. URN
    3.1. Structure


1. URI

Uniform Resource Identifier, or URI for short.
The most recent description of what URIs actually are dates back to January 2005: RFC3986, co-authored by Tim Berners-Lee himself, the creator of the World Wide Web we all love.
Summarizing section 1.1, we can formulate a definition:

A URI is a sequence of characters that identifies a physical or abstract resource. The resource does not have to be accessible over the Internet, and the kind of access to it is determined by the context and/or the mechanism in use.
For example:

On the modern Internet, two kinds of URI are used most often: URL and URN.
The main difference between them lies in their purpose:

Put simply, a URL answers the question "Where and how can something be found?", while a URN answers the question "How can something be identified?".
A couple of interesting facts about URIs
Many of you have noticed that different resources call links either URLs or URIs, and have probably wondered which of the two is correct.
The fact is that the URL first appeared and was documented in 1990, while the URI was documented only in 1994. Until RFC3305 was released in 2002, both names were considered appropriate, which sometimes caused confusion.
RFC3305, section 2, states that the term URL is obsolete as a name for links and that URI is now the correct term, since the W3C uses the term URI in all of its documents. So when you apply the term URL to such links, you are not making a semantic error, only a naming one.

It is also worth noting that until the release of RFC2396 in 1998, URI was expanded as Universal Resource Identifier, as can be seen in RFC1630.

Generalizing over all the variants, a URI looks like this:

Looking ahead, it is worth noting that not all of the components are strictly required. For a link to be considered a URI, it must have:

1.1. Syntax

According to RFC3986, section 2:
A URI is composed from a limited set of characters consisting of digits, letters, and a few graphic symbols, all of which fit into the US-ASCII encoding. A reserved subset of those characters may be used to delimit syntax components within a URI, while the remaining characters, including both the unreserved set and those reserved characters not acting as delimiters in a given component, define each component's identifying data.

Reserved characters
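Reserved characters are divided into two types:
 gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
 sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="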

Unreserved characters
Following from the previous paragraph, unreserved characters are those that belong to neither gen-delims nor sub-delims and have no reserved purpose in a URI. The full set is:
 ALPHA, DIGIT, "-", ".", "_", "~" 
Here, in ABNF terms:
ALPHA is any uppercase or lowercase ASCII letter (as a regular expression: [A-Za-z])
DIGIT is any decimal digit (as a regular expression: [0-9])
HEXDIG is a hexadecimal digit (as a regular expression: [0-9A-F]; in percent-encoding the lowercase digits a-f are treated as equivalent)
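
The character classes from RFC3986 (sections 2.2 and 2.3) can be written down as plain sets. The sketch below is only an illustration; the function name `classify` and the label "other" are my own, not part of the RFC.

```python
import string

# Character classes as listed in RFC 3986, sections 2.2 and 2.3.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def classify(ch: str) -> str:
    """Return the RFC 3986 character class of a single character."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in GEN_DELIMS:
        return "gen-delims"
    if ch in SUB_DELIMS:
        return "sub-delims"
    # Neither reserved nor unreserved: a space, a non-ASCII letter,
    # or the "%" sign itself, for example.
    return "other"


for ch in "a~/= ":
    print(repr(ch), "->", classify(ch))
```

Note that the three sets do not overlap, so the order of the checks does not matter.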

Percent-encoding
If a URI needs to carry characters that are outside ASCII, the so-called percent-encoding mechanism is used. It is also used to pass reserved characters as data. Unreserved characters, according to the rules, do not need to be percent-encoded.
A percent-encoded character is a triplet consisting of the "%" sign followed by two hexadecimal digits:
 pct-encoded = "%" HEXDIG HEXDIG 
Thus %20, for example, means a space.
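
A minimal sketch of the mechanism, assuming the text is first converted to bytes as UTF-8 (an assumption on my part; the encoding actually depends on the scheme and context). The helper `pct_encode` is a hypothetical name; `quote` and `unquote` are the standard library functions from `urllib.parse`.

```python
import string
from urllib.parse import quote, unquote

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def pct_encode(text: str) -> str:
    """Percent-encode every character that is not unreserved."""
    out = []
    for ch in text:
        if ch in UNRESERVED:
            out.append(ch)  # unreserved characters are passed through
        else:
            # Assumption: characters are turned into bytes as UTF-8,
            # and each byte becomes "%" followed by two hex digits.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


print(pct_encode("a b"))           # a%20b
print(pct_encode("a/b?"))          # a%2Fb%3F
print(quote("a/b?", safe=""))      # the standard library gives the same result
print(unquote("a%20b"))            # decoding restores the space
```

With `safe=""`, `quote` leaves only letters, digits and `-._~` untouched, which matches the unreserved set above.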

1.2. URI components

The following list contains descriptions of the major components that make up the URI:
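
To see the components in practice, a URI can be split with the regular expression given in RFC3986, Appendix B. The sketch below wraps it in a hypothetical helper `split_uri`; the sample URIs are illustrative ones in the style of the RFC examples.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri: str) -> dict:
    """Split a URI reference into its five major components.

    Components that are absent are returned as None; the path is
    always a string, possibly empty.
    """
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


print(split_uri("https://example.com:8042/over/there?name=ferret#nose"))
print(split_uri("urn:example:animal:ferret:nose"))
```

The second example shows that a URI without "//" has no authority at all: everything after the scheme lands in the path.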

With that, the general introduction to URIs is probably complete, and we can start digging into the individual kinds of URI, namely:

2. URL

The URL standard is documented in RFC1738 .
From section 2:
URLs are used to "locate" resources by providing an abstract identification of the resource location. Having located a resource, a system may perform a variety of operations on it, which can be characterized by such words as "access", "update", "replace", "find attributes". In general, only the "access" method needs to be specified for any URL scheme.
In short, a URL is designed for a wide range of tasks, from retrieving data to modifying it on a resource, and the one thing every URL scheme must define is the access method. Any full (absolute) URL can therefore be reduced to the form:
 <scheme>:<scheme-specific-part> 

2.1. Structure

In general, URLs share a similar structure across all schemes, although for any individual scheme the structure may differ from the general pattern.
Graphically, it can be expressed as:

At this point it makes sense to look at the difference between absolute and relative URLs. These definitions apply not only to URLs but to URIs as a whole.
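
A relative reference only has a meaning with respect to some base URI. The sketch below resolves a few references with `urljoin` from the Python standard library; the base address and the references are made-up examples.

```python
from urllib.parse import urljoin

BASE = "http://example.com/a/b/c"  # hypothetical base URI

references = [
    "d",                    # relative to the current "directory" /a/b/
    "../d",                 # one level up
    "/d",                   # absolute path, same scheme and host
    "//other.example/x",    # scheme-relative: new host, same scheme
    "#top",                 # same document, only the fragment changes
    "http://example.org/",  # already absolute: the base is ignored
]

for ref in references:
    print(f"{ref!r:24} -> {urljoin(BASE, ref)}")
```

The same reference can therefore lead to different places depending on the page it appears on, which is exactly why the distinction matters.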

Now that we have figured out what relative and absolute paths are, we can answer the question asked at the beginning of the post:

We have already reviewed the major components, so now let's go into the details of how a URL is built.

3. URN

The URN standard is documented in RFC2141 .
From section 1:
Uniform Resource Names (URNs) are intended to serve as persistent, location-independent resource identifiers and are designed to make it easy to map other namespaces (which share the properties of URNs) into URN space. Therefore, the URN syntax provides a means to encode character data in a form that can be sent in existing protocols, transcribed on most keyboards, etc.
That is, unlike a URL, which refers to the place where a document is stored, a URN refers to the document itself, and when the document is moved to another location the link does not change.
Because a URN differs conceptually from a URL, it also has a different name resolution system, DDDS, which converts a URN into URLs where the resource, object, or whatever the URN refers to can be found.

3.1. Structure

A URN has the following form:
 "urn:" <NID> ":" <NSS> 

Self-identifying URN
Such URNs contain the name of a hash function in the NID, and the NSS holds the hash value calculated for the identified object. URNs of this kind are used in magnet links and in the headers of the Gnutella2 p2p network.
For example, a URN from a magnet link on one torrent tracker:
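
A rough sketch of both ideas: splitting a URN into NID and NSS according to the form above, and building a self-identifying URN from a hash. The helper names `parse_urn` and `sha1_urn` are hypothetical. I also assume the `urn:sha1:` convention of writing the SHA-1 digest in Base32; other hash-based namespaces may encode the value differently.

```python
import base64
import hashlib
from typing import Tuple


def parse_urn(urn: str) -> Tuple[str, str]:
    """Split 'urn:<NID>:<NSS>' into (NID, NSS)."""
    prefix, sep, rest = urn.partition(":")
    if not sep or prefix.lower() != "urn":  # the "urn" prefix is case-insensitive
        raise ValueError(f"not a URN: {urn!r}")
    nid, sep, nss = rest.partition(":")
    if not (nid and sep and nss):
        raise ValueError(f"NID or NSS is missing: {urn!r}")
    # Design choice of this sketch: the NID is case-insensitive,
    # so it is normalized to lowercase; the NSS is returned untouched.
    return nid.lower(), nss


def sha1_urn(data: bytes) -> str:
    """Build a self-identifying URN for the given bytes.

    Assumption: the NSS is the Base32-encoded SHA-1 digest
    (20 bytes encode to exactly 32 characters, no padding).
    """
    digest = hashlib.sha1(data).digest()
    return "urn:sha1:" + base64.b32encode(digest).decode("ascii")


print(parse_urn("urn:isbn:0451450523"))
print(parse_urn(sha1_urn(b"hello")))
```

Anyone holding the object can recompute the hash and compare it with the NSS, which is what makes such a URN self-identifying.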

That is all for the theory. In the second part we will look at what can and should be done with URIs when processing them: normalization, parsing, and so on.

That is where I will take my leave. Thanks for reading, I hope it was not boring, and good luck!

