
Never validate email addresses against RFC standards.

Many sites require the user to enter an email address, and we, as cool and scrupulous developers, always strive to validate the entered addresses strictly according to the RFC standards. Thanks to this, our applications and websites validate email correctly and have no usability problems, and we sleep soundly, sure that everything works as it should.
Yeah, right!
These arguments sound cool and solid, but the problem is that an email address can contain completely meaningless things, and validating addresses strictly against the RFC standards can actually be terribly confusing.
Why is that? There are many ways to create an email address that is both valid and absurd. This is partly because, for backward compatibility, some mail services accept addresses in formats that became obsolete long ago. For example, email existed before DNS and before the modern user@domain.tld format: back then, UUCP "bang paths" were used, addresses that listed every node along the delivery route.

Mail address internals


The e-mail address looks like this:
mailbox@hostname 

Here the mailbox can be a local user account, a role account, or an automated routing system such as a mailing list, and the hostname can be any host known to the DNS server that the mailer queries on delivery.
In addition, some systems allow you to add tags to the address. This usually takes the following format:
 mailbox+tag@hostname 

where the tag and the separator (usually "+"; qmail uses "-" by default, though this is configurable) are ignored during delivery. Tags are usually used to sort mail into folders and to automate processing, but they can also be used to tell apart the addresses you give to different recipients and so detect abuse of personal data.
So, in an address of the form "mailbox@hostname", "mailbox" is a user account, an application, or a role account, but it may also carry such extravagant things as further routing information or identifiers used for sorting, automation, or tracking; "hostname" is usually a domain name, but it can also be a subdomain, a server, a service, an IP address, or just a host name.
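To make the anatomy concrete, here is a minimal Python sketch that splits an address into mailbox, tag, and hostname. The names `ParsedAddress` and `split_address` are made up for this illustration, and the "+" separator is only a default assumption, since the separator depends on the mail system.

```python
from typing import NamedTuple, Optional


class ParsedAddress(NamedTuple):
    mailbox: str
    tag: Optional[str]
    hostname: str


def split_address(address: str, separator: str = "+") -> ParsedAddress:
    """Split 'mailbox+tag@hostname' into its three parts (illustrative only)."""
    # Split on the LAST "@": a quoted mailbox may itself contain "@".
    local, at, hostname = address.rpartition("@")
    if not at or not local or not hostname:
        raise ValueError(f"not a mailbox@hostname address: {address!r}")
    # The tag is everything after the first separator, if there is one.
    mailbox, sep, tag = local.partition(separator)
    return ParsedAddress(mailbox, tag if sep else None, hostname)


print(split_address("allen+news@example.com"))
print(split_address("allen@example.com"))
```

For qmail-style addresses you would pass `separator="-"`, keeping in mind that a hyphen can also be an ordinary part of a mailbox name.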

Correct Mailbox Names from RFC Point of View


The specification permits rather strange addresses, and supporting them all would be expensive: some are too complicated, and few people know enough to come up with such naming pirouettes. Such addresses make life harder for the staff who have to support those accounts, and they are almost never used in everyday life.
A mailbox may contain spaces. As I recall, pre-Internet AOL allowed spaces in screen names such as "Imya Polzovatelya", which were then used as mailboxes with the spaces stripped: imyapolzovatelya@aol.com. According to the RFC, however, you can put double quotes around mailboxes that contain spaces:
 "Alan Turing"@example.com

By the way, by the same logic, a mailbox consisting of a single space is valid:
 " "@example.com

And here is another valid address, built entirely from characters that are allowed in a mailbox name:
 !#$%&'*+-/=?^_`{}|~@example.com

By the way, check how you handle apostrophes; they should be supported:
 Miles.O'Brian@example.com

Apostrophes do not need to be quoted or escaped in the address itself, but when you save such addresses to a database or pass them on elsewhere, make sure they are handled safely.
There are many more examples on Wikipedia.
Do you need full RFC compatibility? The choice is yours, but I advise against it: spaces and non-standard characters in an address are unusual and most often just a typo. Large email providers disallow them for much the same reasons, so it is usually enough to allow letters, digits, periods, underscores, hyphens, apostrophes, and plus signs.
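As a rough illustration of that simplified rule, the mailbox part can be checked against a small character whitelist. This is a sketch, not a complete validator: the names `SIMPLE_MAILBOX` and `is_simple_address` are invented here, the character set is ASCII-only by assumption, and the hostname is only checked for being non-empty.

```python
import re

# Letters, digits, periods, underscores, hyphens, apostrophes and pluses.
SIMPLE_MAILBOX = re.compile(r"[A-Za-z0-9._'+-]+")


def is_simple_address(address: str) -> bool:
    """Return True if the mailbox part uses only the whitelisted characters."""
    local, at, hostname = address.rpartition("@")
    return bool(at and hostname and SIMPLE_MAILBOX.fullmatch(local))


for candidate in ("Miles.O'Brian@example.com",
                  "allen+one@example.com",
                  '"Alan Turing"@example.com'):
    print(candidate, is_simple_address(candidate))
```

The quoted address with a space is rejected here on purpose, even though the RFC considers it valid.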

Case-sensitive addresses


According to the RFC, the mailbox name is case-sensitive, so addresses that differ only in case are distinct. However, 99.9% of providers treat them as the same and will not let you register VasyaPetrov@example.com if vasyapetrov@example.com is already taken. Treat the mailbox name as case-insensitive:
 ALLEN@example.com
 Allen@example.com
 allen@example.com

A handful of systems do enforce case, accepting mail only for Allen@example.com and discarding mail for every other variant such as AlLen, but this does not work in practice because users are not used to distinguishing case in email addresses.
Should you keep RFC compatibility here? Converting addresses to lowercase before saving may cause problems for a small number of users (you will not be able to send them mail), but having sent millions of emails, I have run into this only a few times.
Converting addresses to lowercase is a good idea for data normalization, since the domain is always case-insensitive and should be lowercase anyway. If you decide to store the address exactly as entered, add a second field that holds the canonical version.
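One way to keep both versions might look like the sketch below. The function name `normalize_case` and the `entered`/`canonical` field names are assumptions made for the example, not part of any particular schema.

```python
def normalize_case(address: str) -> str:
    """Lowercase both the mailbox and the domain of an address."""
    local, at, domain = address.strip().rpartition("@")
    if not at or not local or not domain:
        raise ValueError(f"not a mailbox@hostname address: {address!r}")
    return f"{local.lower()}@{domain.lower()}"


entered = "Allen@Example.COM"
# Keep what the user typed for sending mail, and the canonical form for lookups.
record = {"entered": entered, "canonical": normalize_case(entered)}
print(record)
```

Lookups and uniqueness checks would then go against the canonical field, while outgoing mail can still use the address exactly as the user typed it.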

Non-standard characters


Gmail stands out here: although the standard treats the dot as an ordinary character, Gmail does not distinguish between mailboxes with and without dots. These addresses point to the same mailbox:
 first.last@gmail.com
 firstlast@gmail.com

Note that Google Apps lets you use Gmail on any domain.
The main problem here is finding an address in the database when it is not typed exactly as it was originally entered, which can cause plenty of headaches for the user, the support team, developers, and testers alike. This is where a second, canonical form of the address comes in handy, but more on that later.

Extended mailbox names with tags


As mentioned above, most mail transfer agents (MTAs), including sendmail, Postfix, qmail, Yahoo Plus, and Gmail, support extended mailbox names. This lets the user append a tag for sorting incoming mail. It also lets me create several accounts on one site or in one application:
 allen+one@example.com
 allen+two@example.com

But should you strip the tags from the mailbox address?
No! Be friendly to your users, and they will trust that you are not going to steal and sell their personal data for profit. Even if you are trying to prevent additional accounts from being registered with an existing mailbox, consider how easy it is these days to simply register another mailbox in order to sign up with you again; it is just as easy as creating an alias or a folder (though not everyone knows about aliases, folders, and tags).
So, once again: storing a second, canonical form of the address in the database may well cover you in case of trouble. Make sure all tags, dots, and so on are removed from it, so that you can compare newly added addresses against it.
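A canonical form for comparison could be built roughly like this. It is a sketch under stated assumptions: `canonical_address` is a hypothetical helper, "+" is assumed to be the tag separator, and dots are stripped only for domains listed in `dotless_domains`, which you would have to maintain yourself because Gmail can also serve other domains.

```python
from typing import FrozenSet


def canonical_address(
    address: str,
    separator: str = "+",
    dotless_domains: FrozenSet[str] = frozenset({"gmail.com"}),
) -> str:
    """Build a comparison-only form of an address: lowercase, no tag, no Gmail dots."""
    local, at, domain = address.strip().lower().rpartition("@")
    if not at or not local or not domain:
        raise ValueError(f"not a mailbox@hostname address: {address!r}")
    # Drop the tag: keep only what comes before the first separator.
    local = local.partition(separator)[0]
    # Dots are insignificant only for providers known to ignore them.
    if domain in dotless_domains:
        local = local.replace(".", "")
    return f"{local}@{domain}"


print(canonical_address("First.Last+shop@Gmail.com"))
print(canonical_address("allen+one@example.com") == canonical_address("allen+two@example.com"))
```

The canonical value is meant only for comparing and deduplicating; mail should still be sent to the address the user entered, tag included.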

Unicode and internationalized mailbox names


Mailbox names do not support extended (8-bit) ASCII characters or Unicode characters. This restriction goes back to the SMTP specification: when it appeared, none of this existed yet. Locally defined 8-bit values, for example from the ISO-8859-x family of encodings, can still be used, but you will never know which encoding it is. In fact, I have only ever seen 8-bit mailboxes in spam.
After all, you store your data in UTF-8, right? So, in any case, you will not be able to convert such addresses back to the original locale if you do not know which one it was.

Domain names


Mail domains follow the same rules as in HTTP: they are case-insensitive, so they should be normalized to lowercase.

Subdomains

Some addresses contain unnecessary subdomains: for example, "email.msn.com" and "msn.com" are the same email domain. This also happens often in corporate environments, and it is another good candidate for canonicalization.

Internationalized Domains ( IDN )

IDNs were created to allow local Unicode characters in domain names, which also makes it possible to create a domain out of special characters:
 postmaster@→→→→→→→.ws 

This one neatly depicts the water cycle in nature.
Like HTTP, SMTP supports only 7-bit encoding. To cope with this, IDNs are converted to Punycode, which lets a domain name be converted from its Unicode representation to ASCII and back:
 postmaster@xn--55gaaaaaa281gfaqg86dja792anqa.ws 
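The round trip can be tried in Python with the built-in `idna` codec. Treat this as a sketch: that codec implements the older IDNA 2003 rules, so newer registries may behave differently, and the helper names and the domain bücher.example are just illustrations.

```python
def domain_to_ascii(domain: str) -> str:
    """Unicode domain -> Punycode (ASCII) form, e.g. for handing to an MTA."""
    return domain.encode("idna").decode("ascii")


def domain_to_unicode(domain: str) -> str:
    """Punycode (ASCII) domain -> Unicode form, e.g. for showing to a user."""
    return domain.encode("ascii").decode("idna")


ascii_form = domain_to_ascii("bücher.example")
print(ascii_form)                     # xn--bcher-kva.example
print(domain_to_unicode(ascii_form))  # bücher.example
```

Whichever form you store, converting at the boundaries like this keeps the choice in one place.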

Unfortunately, IDNs open the door to phishing. Unicode contains several look-alikes for some ASCII characters. This lets an attacker create a site whose name looks exactly like the original, because some of the characters look identical while being different code points.
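A crude way to spot such look-alikes is to flag domain labels that mix scripts. The sketch below guesses the script from the first word of each character's Unicode name, which is a heuristic I am assuming for the example and not a standard algorithm; `scripts_in` and `looks_mixed_script` are hypothetical names.

```python
import unicodedata
from typing import Set


def scripts_in(label: str) -> Set[str]:
    """Collect a rough script name for every letter in a domain label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "CYRILLIC SMALL LETTER A".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed_script(domain: str) -> bool:
    """Return True if any single label mixes letters from several scripts."""
    return any(len(scripts_in(label)) > 1 for label in domain.split("."))


print(looks_mixed_script("paypal.com"))        # all Latin
print(looks_mixed_script("p\u0430ypal.com"))   # second letter is Cyrillic
```

This will not catch a name written entirely in one look-alike script, so it is at best a first filter.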
This raises several questions that you should answer:
Should we allow IDNs?
Can our support team help such users (where would support staff get, for example, keyboards with Chinese characters)?
Should we store them in Unicode or in Punycode?
If we store canonical addresses, in which encoding?
Does our mailer (MTA) support IDNs at all, and in what form does it expect an address when sending mail?

IP address syntax

The use of IP addresses is valid:
 allen@[127.0.0.1]
 allen@[IPv6:0:0:1]

However, such addresses look suspicious and are unlikely to be trusted.
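If you want to detect such addresses, for instance to reject or flag them, the bracketed hostname can be parsed with Python's standard `ipaddress` module. This is a sketch: `parse_ip_literal` is a hypothetical helper, and it assumes the `[a.b.c.d]` and `[IPv6:...]` forms shown above.

```python
import ipaddress
from typing import Optional, Union

IPAddress = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]


def parse_ip_literal(hostname: str) -> Optional[IPAddress]:
    """Return the IP address inside '[...]', or None if it is not an IP literal."""
    if not (hostname.startswith("[") and hostname.endswith("]")):
        return None
    literal = hostname[1:-1]
    try:
        if literal[:5].lower() == "ipv6:":
            return ipaddress.IPv6Address(literal[5:])
        return ipaddress.IPv4Address(literal)
    except ValueError:
        # Brackets were present, but the content is not a well-formed address.
        return None


print(parse_ip_literal("[127.0.0.1]"))
print(parse_ip_literal("[IPv6:::1]"))
print(parse_ip_literal("example.com"))
```

Note that in this sketch a shortened form such as `0:0:1` is rejected, because `ipaddress` expects either all eight groups or the `::` compression.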

Temporary email addresses


There are many services that provide users with temporary email addresses. They are usually used for anonymity or for registering on untrusted sites.
Even services such as Hotmail and Yahoo provide aliases that can be used in much the same way, that is, discarded after a while. There is no single technique for identifying such addresses; after all, that is exactly what they are designed for. These services use a huge, constantly rotating set of domain names to stay one step ahead of those trying to block them.

Whitelisted features


Email addresses can be monstrously complex, but, offhand, 99.99% of them (maybe more) follow simple rules, and the rest are too tedious to support.
So you should probably refuse to support an address if it contains:

Of course, this may cause problems for some users, but in that case they will most likely just try another address that works. It will also let your support team provide better service regardless of the user's locale.
I also believe that you should support tags.
If necessary, you can add another database field holding a canonical address, even if you think all this RFC compatibility should be maintained. The address in this field can be:

Although this advice may seem too radical, it is still better than blindly following standards. Who knows, maybe this simplified notation will become a new standard someday?

Source: https://habr.com/ru/post/224623/

