Hi, Habr!
I finally found some free time and decided to write this post, an idea I have had for a long time.
It is about something as seemingly simple as the URI, a topic that rarely gets a detailed treatment on the RuNet.
"Pff, links are links, even in Africa. What is there to understand?" you might say. Then let me ask a question:
Which of these is what, and where will each one take us?
http://example.com
www.example.com
//www.example.com
mailto:user@example.com
If you do not know the exact answer, or are simply curious, and if you are not afraid of a large number of three-letter abbreviations, read on.
Before starting, I should point out that there is already a post on a similar topic that explains everything more simply and a little more clearly. My goal here is to study the question in more depth and to collect the information about URIs in one place, so that it does not get lost. Well, almost in one place: the article is split into two parts.
For convenience, here is a table of contents; it works thanks to a URI feature that we will look at later in this article.
TABLE OF CONTENTS
- URI
1.1. Syntax
1.2. URI components
- URL
2.1. Structure
- URN
3.1. Structure
Introduction
1. URI
Uniform Resource Identifier, or URI for short.
The most recent description of what these notorious URIs actually are dates back to January 2005, namely RFC3986, co-authored by Tim Berners-Lee himself, the father of the Web that we all love.
Summarizing clause 1.1, we can formulate the definition:
A URI is a sequence of characters that identifies a physical or abstract resource, which does not have to be accessible over the Internet; which kind of resource is meant, and how it is accessed, is determined by the context and/or the access mechanism.
For example:
- following http://example.com takes us to the HTTP server of the resource identified as example.com
- at the same time, ftp://example.com takes us to the FTP server of the same resource
- or, for example, http://localhost/ is a URI identifying the very machine from which it is accessed
On the modern Internet, two kinds of URI are used most often: URL and URN.
The main difference between them lies in their purpose:
- URL - Uniform Resource Locator, helps to find a resource
- URN - Uniform Resource Name, helps to identify a resource
Simplifying: a URL answers the question "Where and how do I find something?", while a URN answers the question "How do I identify something?".
A couple of interesting facts about URIs
Many of you have noticed that different resources call links either URLs or URIs, and have probably wondered which option is correct.
The fact is that the URL appeared and was documented in 1990, while the URI was documented only in 1994. Until RFC3305 was released in 2002, both names were considered appropriate, which sometimes caused confusion.
RFC3305, clause 2, states that the term URL is obsolete as applied to links, and that URI is now the correct name, since the W3C uses the term URI in all its documents. So when you apply the term URL to such links you do not make a semantic error, but you do make a naming one.
It is also worth noting that until the release of RFC2396 in 1998, URI was expanded as Universal Resource Identifier, as can be seen in RFC1630.
Summarizing all the variants, a URI looks as follows:
Looking ahead, it is worth noting that not all three components are strictly required. For a link to be considered a URI, it must have:
- either scheme+authority+path,
- or scheme+path,
- or only path.
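To make these three combinations more tangible, here is a small sketch that uses urlsplit from Python's standard urllib.parse module to break a reference into components. The sample strings and the components helper are my own illustrations, not something taken from the RFC.

```python
from urllib.parse import urlsplit

# Illustrative samples: scheme+authority+path, scheme+path, and path only.
samples = [
    "http://example.com/some/path?key=value#top",
    "mailto:user@example.com",
    "some/path",
]

def components(ref):
    """Return (scheme, authority, path, query, fragment) of a URI reference."""
    parts = urlsplit(ref)
    return (parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

for ref in samples:
    print(ref, "->", components(ref))
```

Components that are absent simply come back as empty strings, which makes it easy to see which of the three combinations a given reference uses.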
1.1. Syntax
According to RFC3986, clause 2:
A URI is composed of a limited set of characters consisting of digits, letters, and a few graphic symbols, all of which fit into the US-ASCII encoding. A reserved subset of those characters may be used to delimit syntax components within a URI, while the remaining characters, that is, the unreserved set plus those reserved characters that do not act as delimiters in a given URI component, carry the identifying data of each component.
Reserved characters
Reserved characters are divided into two types:
- gen-delims, the "general delimiters", i.e. characters that divide the URI into its major components:
":", "/", "?", "#", "[", "]", "@"
- sub-delims, the "sub-delimiters", i.e. characters that divide a major component into smaller parts. They differ from component to component; here is the list of the most common ones:
"!", "$", "&", "'", "(", ")", "*", "+", ",", ";", "="
Unreserved characters
Based on the previous paragraph, unreserved characters are the characters that are not in gen-delims, together with those sub-delims that have no special meaning in the given component. In general, though, they are:
ALPHA, DIGIT, "-", ".", "_", "~"
Here, in ABNF terms:
- ALPHA is any upper- or lower-case ASCII letter (as a regexp: [A-Za-z])
- DIGIT is any decimal digit (as a regexp: [0-9])
- HEXDIG is a hexadecimal digit (as a regexp: [0-9A-F])
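The character classes listed above can be written down directly as sets. The sketch below is only an illustration: the char_class helper and the label "other" (for anything outside the three classes, such as a space or "%") are my own naming.

```python
import string

# The three character classes as listed in the text above.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def char_class(ch):
    """Name the class a single character belongs to."""
    if ch in GEN_DELIMS:
        return "gen-delims"
    if ch in SUB_DELIMS:
        return "sub-delims"
    if ch in UNRESERVED:
        return "unreserved"
    return "other"  # hypothetical label: not directly usable, must be percent-encoded

for ch in ":&~ ":
    print(repr(ch), char_class(ch))
```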
Percent-encoding
If characters outside the ASCII range are used, the so-called "percent-encoding" mechanism is applied. It is also used to carry reserved characters as data. Unreserved characters, according to the rules, are not percent-encoded.
A percent-encoded character is a triplet consisting of the "%" sign followed by two hexadecimal digits:
pct-encoded = "%" HEXDIG HEXDIG
Thus %20, for example, stands for a space.
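Percent-encoding is easy to try out with quote and unquote from Python's standard urllib.parse. In this sketch, encode_component is a hypothetical helper of mine; passing safe="" asks quote to encode everything except letters, digits and "-", ".", "_", "~" (the last of these only on Python 3.7 and newer).

```python
from urllib.parse import quote, unquote

def encode_component(text):
    """Percent-encode everything except unreserved characters."""
    return quote(text, safe="")

print(encode_component("a b"))       # space becomes %20
print(encode_component("key=a&b"))   # reserved characters carried as data
print(encode_component("é"))         # non-ASCII: UTF-8 octets, each percent-encoded
print(encode_component("A-z_0.9~"))  # unreserved characters stay as they are
print(unquote("a%20b"))              # decoding restores the space
```

Note that the non-ASCII character is first converted to UTF-8, and each resulting octet becomes its own "%" triplet.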
1.2. URI components
The following list contains descriptions of the major components that make up the URI:
With that, our acquaintance with the URI in general is probably complete, and we can dig into its individual subtypes, namely:
2. URL
The URL standard is documented in
RFC1738 .
From item 2:
URLs are used to locate resources, providing abstract identification of the location of the resource. Having determined the location of the resource, the system can perform many operations on the resource, which can be characterized by such words as 'access', 'update', 'replacement', 'attribute search'. In general, only the access method must be defined for any URL scheme.
So a URL is meant to solve a wide range of tasks, from retrieving data from a resource to changing it, and the only thing that must be defined is the access method. In other words, any full (absolute) URL can be reduced to the form:
<scheme>:< >
2.1. Structure
In general, URLs of all schemes share a similar structure, although the structure for a particular scheme may deviate from the general pattern.
Graphically, it can be expressed as:
At this point it makes sense to look at the difference between absolute and relative URLs; these definitions apply not only to URLs but to URIs as a whole.
- A relative reference uses the hierarchical syntax to express a URI reference relative to the namespace of another hierarchical URI.
Relative references are in turn divided into several subtypes:
- Network-path reference
It looks like: //<authority> <path> [<query>] [<fragment>]
This type of reference is used infrequently; the point is to follow the link using the current scheme.
I.e.: being, for example, on http://example.com and following the link //domain.com, we will get to http://domain.com
And if we follow the same link from ftp://example.com, we will get to ftp://domain.com
- Absolute-path reference
It looks like: /<path> [<query>] [<fragment>]
This time we stay on the current host, but we always end up at the given path, no matter which path we are on now.
I.e.: even being on http://example.com/just/some/long/path and following the link /path, we will get to http://example.com/path
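- Relative-path reference
It looks like: <path> [<query>] [<fragment>]
Now we move relative to the current position.
I.e.: being on http://example.com/just/some/long/path/ and following the link path, we will get to http://example.com/just/some/long/path/path. Note the trailing slash: without it, the last segment of the current path is replaced, so from http://example.com/just/some/long/path the same link leads to http://example.com/just/some/long/path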
- Same-document reference
These are references that consist only of the fragment part of the URI, or references in which all components except the fragment coincide with those of the current document.
I.e.: #fragment and http://habrahabr.ru/topic/232385/#fragment are same-document references.
- An absolute reference is a reference of the form
<scheme>:[//<authority>] [<path>] [<query>] [<fragment>]
Using absolute references, we get to the resource we need regardless of where we come from.
I.e.: whether we are on http://example.com/just/some/long/path or on ftp://example.com, by following http://domain.com/path we will in any case get to http://domain.com/path
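The resolution rules for all of these reference types can be tried out with urljoin from Python's standard urllib.parse, which, as far as I know, follows the RFC3986 algorithm for ordinary cases like these. This is just a sketch; the base URLs are the same examples used in the list above.

```python
from urllib.parse import urljoin

base = "http://example.com/just/some/long/path"

# Network-path reference: the scheme of the base is reused.
print(urljoin(base, "//domain.com"))
print(urljoin("ftp://example.com", "//domain.com"))

# Absolute-path reference: the host stays, the whole path is replaced.
print(urljoin(base, "/path"))

# Relative-path reference: the last segment of the base path is replaced...
print(urljoin(base, "path"))
# ...unless the base ends with a slash, in which case the path is appended.
print(urljoin(base + "/", "path"))

# Same-document reference: only the fragment changes.
print(urljoin(base, "#fragment"))

# Absolute reference: the base does not matter at all.
print(urljoin(base, "http://domain.com/path"))
```

The pair of relative-path calls is the one worth remembering: whether the base ends with "/" decides if the reference replaces the last segment or is appended after it.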
Now that we have figured out what relative and absolute references are, we can answer the question asked at the beginning of the post:
- http://example.com - will open http://example.com
- www.example.com - in theory should open http://habrahabr.ru/topic/232385/www.example.com, but Habr itself corrects the link, although by the standard www.example.com is a relative-path reference
- //www.example.com - will open www.example.com with the scheme you are using to view the current page, i.e. most likely http://www.example.com
- mailto:user@example.com - here the browser settings come into play: it will offer to open this link in a mail program and send an email to user@example.com; this is an absolute URL with the mailto scheme
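The same answers can be obtained mechanically. Below is a rough sketch of a classifier built on urlsplit from the standard library; the classify function and its labels are my own, and a real browser applies extra heuristics (such as the link auto-correction mentioned above) that this code does not try to model.

```python
from urllib.parse import urlsplit

def classify(ref):
    """Name the kind of URI reference, following the subtypes described above."""
    parts = urlsplit(ref)
    if parts.scheme:
        return "absolute"
    if parts.netloc:
        return "network-path reference"
    if parts.path.startswith("/"):
        return "absolute-path reference"
    if parts.path or parts.query:
        # A query-only reference is lumped in here for simplicity.
        return "relative-path reference"
    return "same-document reference"

for ref in ["http://example.com", "www.example.com",
            "//www.example.com", "mailto:user@example.com"]:
    print(ref, "->", classify(ref))
```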
We have already reviewed the major components; now let's go into the details of how a URL is built.
- Scheme - as mentioned earlier, the scheme determines the method of access to the resource. A list of current schemes can be found here.
- Userinfo is the authority subcomponent used to authorize a user on a resource. It consists of a username and an optional password, and is separated from the rest of the authority by the "@" symbol. Although the password field is described in the specification, using it is strongly discouraged, since the password is passed along with the username in clear text.
Allowed characters: unreserved, pct-encoded, sub-delims, ":"
An example:
There is a test folder on my local machine that is protected by a login/password pair. That is, going to http://localhost/test/, I will be asked to authorize.
And if I follow the link http://admin:admin@localhost/test/, the authorization will happen automatically, using the data specified in the userinfo block.
- Host is the authority subcomponent used to identify the target node (or resource, if you like, but "node" is more precise), which may be located either on the Internet or outside it, depending on the scheme. This component is case-insensitive.
The host can be either an IP address or a registered name (reg-name), optionally followed by a port.
The syntax supports the existing IP address formats (IPv4, IPv6) as well as any future ones that may be defined later.
A registered name, the domain name familiar to all of us, is a sequence of characters usually intended for lookup within a locally defined host or service name registry, although the scheme-specific semantics of a URI may require that a specific registry (or a fixed name table) be used instead.
The most common name registry mechanism is the Domain Name System (DNS).
A domain name used for DNS lookup consists of domain labels separated by "."; each label may contain the following characters: unreserved, pct-encoded, sub-delims
The registered name syntax allows percent-encoded characters in order to represent non-ASCII characters in a uniform way, independent of the name resolution technology. Non-ASCII characters must first be encoded in UTF-8, and then each octet of the UTF-8 sequence must be percent-encoded.
If a registered name with non-ASCII characters is an internationalized domain name intended for resolution via DNS, it must be converted to IDNA encoding (RFC3490) before lookup; as a consequence, domain name registrars are expected to provide such names in IDNA encoding.
- Port - the decimal port number, separated from the host name by a single colon ":"; it can consist only of digits. A scheme may define a default port to be used when no port is given. For example, for the HTTP scheme the default port is 80, which corresponds to port 80/TCP reserved for it. The type of port, like the assigned port number, is determined by the scheme.
- The Query and Fragment components have been fully described earlier.
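The authority subcomponents described above can be pulled apart with urlsplit from Python's standard library. Treat the following as a sketch under assumptions: authority_parts and the DEFAULT_PORTS table are hypothetical helpers of mine, and the table lists only a few well-known defaults. The last lines show the built-in "idna" codec, which, as far as I know, implements the RFC3490 conversion.

```python
from urllib.parse import urlsplit

# Hypothetical table of default ports, for illustration only.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def authority_parts(url):
    """Split the authority into userinfo, host and port (falling back to a default)."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "userinfo": (parts.username, parts.password),
        "host": parts.hostname,  # urlsplit lower-cases the host name
        "port": port,
    }

print(authority_parts("http://admin:admin@LOCALHOST/test/"))
print(authority_parts("http://[::1]:8080/"))

# An internationalized domain name converted to its IDNA form and back.
encoded = "пример.рф".encode("idna")
print(encoded, encoded.decode("idna"))
```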
3. URN
The URN standard is documented in
RFC2141 .
From item 1:
Uniform Resource Names (URNs) are intended to serve as permanent, location-independent resource identifiers and are designed to simplify the mapping of other namespaces (which share URN properties) into a URN space. Thus, the URN syntax provides a means to encode character data in a form that can be sent using existing protocols, written using most keyboards, etc.
That is, unlike a URL, which refers to the place where a document is stored, a URN refers to the document itself, so when the document is moved to another location the link does not change.
Because a URN is conceptually different from a URL, its name resolution system is different too: DDDS, which converts a URN into URLs where the resource, object, or whatever the URN refers to can be found.
3.1. Structure
A URN looks as follows:
"urn:" <NID> ":" <NSS>
- "urn:" - the mandatory, case-insensitive prefix of a URN
- NID - Namespace Identifier; this component determines the syntactic interpretation of the NSS component.
Minimum length is 2 characters, maximum is 32; allowed characters:
ALPHA, DIGIT, "-"
The NID must begin with a letter or a digit.
Also, the word "urn" is reserved as an NID in order to avoid ambiguity in recognizing the URN as a whole.
The list of officially registered NIDs can be found here.
- NSS - Namespace Specific String; this component carries the actual data.
Allowed characters:
ALPHA, DIGIT, pct-encoded, "(", ")", "+", ",", "-", ".", ":", "=", "@", ";", "$", "_", "!", "*", "'"
Reserved characters:
"%", "/", "?", "#"
Illegal characters, which must be percent-encoded; if such a character appears literally, its position is treated as the end of the URN:
0-32 (0-20 hex), "\", """, "&", "<", ">", "[", "]", "^", "`", "{", "|", "}", "~", 127-255 (7F-FF hex)
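The rules above translate fairly directly into a regular expression. The sketch below is my own reading of them, not a reference implementation: parse_urn is a hypothetical helper, and it simply rejects a string containing an illegal character rather than treating that position as the end of the URN.

```python
import re

# NID: 2 to 32 characters, letters/digits/"-", starting with a letter or digit.
NID = r"[A-Za-z0-9][A-Za-z0-9-]{1,31}"
# NSS: the allowed characters listed above, or percent-encoded triplets.
NSS = r"(?:[A-Za-z0-9()+,\-.:=@;$_!*']|%[0-9A-Fa-f]{2})+"
URN_RE = re.compile(rf"urn:(?P<nid>{NID}):(?P<nss>{NSS})", re.IGNORECASE)

def parse_urn(text):
    """Return (nid, nss) for a syntactically valid URN, otherwise None."""
    match = URN_RE.fullmatch(text)
    if match is None:
        return None
    nid = match.group("nid").lower()  # the NID is lower-cased here for easier comparison
    if nid == "urn":                  # "urn" is reserved and cannot be used as an NID
        return None
    return nid, match.group("nss")

print(parse_urn("URN:ISBN:0451450523"))
print(parse_urn("urn:urn:something"))
print(parse_urn("urn:isbn:a/b"))
```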
Self-identifying URN
Such URNs contain the name of a hash function in the NID and, in the NSS, the hash value computed for the identified object. They are used in magnet links and in the headers of the Gnutella2 p2p network.
For example, a URN from a magnet link from a torrent tracker:
magnet:?xt=urn:btih:c68abc1ba9b8c7c4bc373862cad1a8c01d69e53d...
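As a small illustration, the URN can be extracted from such a link with urlsplit and parse_qs from the standard library. This is only a sketch: magnet_urn is a hypothetical helper, and the sample link contains just the part of the example that is visible above, without whatever was cut off after the hash.

```python
from urllib.parse import urlsplit, parse_qs

def magnet_urn(link):
    """Return (nid, nss) of the URN in the xt parameter of a magnet link, or None."""
    parts = urlsplit(link)
    if parts.scheme != "magnet":
        return None
    xt = parse_qs(parts.query).get("xt")
    if not xt:
        return None
    pieces = xt[0].split(":", 2)  # "urn", NID, NSS
    if len(pieces) != 3 or pieces[0].lower() != "urn":
        return None
    return pieces[1], pieces[2]

link = "magnet:?xt=urn:btih:c68abc1ba9b8c7c4bc373862cad1a8c01d69e53d"
print(magnet_urn(link))
```

Here the NID (btih) names the hashing scheme, and the NSS is the hash value itself.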
That is all for the theory. In the second part we will look at what can and should be done with URIs when processing them, namely normalization, parsing, and so on.
With that I will take my leave. Thanks for reading, I hope it was not boring. Good luck!