
I Knew How to Validate an Email Address Until I Read the RFC

From the translator: after reading the article, I started replying in the comments, but decided that the text I was going to refer to deserved a separate publication. Here it is.
If you know how to validate an email address, raise your hand. Those of you who raised your hand, put it down right away before someone sees you: sitting alone at a keyboard with your hand in the air looks silly enough. I was speaking figuratively.

Until yesterday, I would have raised my hand too (figuratively). I needed to validate an email address on the server. I have done this hundreds of thousands of times (no kidding, I counted) using a handy regular expression from my personal library.

This time, for some reason, I felt compelled to revisit my assumptions. I had never read (or even skimmed) the RFCs on email addresses. I had simply based my implementation on what I assumed a valid email address to be. Well, you know what they say about assuming. [Translator's note: the author is alluding to the pun “when you assume, you make an ass out of u and me.”]

And I found something interesting: almost all the regular expressions presented on the Internet as “validating email addresses” are too strict.

It turns out that the local part of an email address, the part before the “@” sign, allows a much wider variety of characters than you might think. According to section 2.3.10 of RFC 2821, which defines SMTP, the part before the “@” sign is called the local part (the part after it is the recipient's domain) and is meant to be interpreted only by the receiving server.

Consequently, and due to a long history of problems when intermediate hosts have attempted to optimize transport by modifying them [the addresses - transl.], the local-part MUST be interpreted and assigned semantics only by the host specified in the domain part of the address.
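
To make the local part / domain distinction concrete, here is a minimal sketch that splits an address into the two pieces. The names `AddrSpecSketch` and `TrySplit` are invented for this illustration, and splitting at the last “@” is an assumption: it relies on the domain never containing an “@”, while a quoted local part may.

```csharp
using System;

public static class AddrSpecSketch
{
    // Hypothetical helper (not from the article): splits "local@domain" at the LAST '@'.
    public static bool TrySplit(string address, out string localPart, out string domain)
    {
        localPart = "";
        domain = "";
        if (address == null) return false;

        int at = address.LastIndexOf('@');
        // Require at least one character on each side of the '@'.
        if (at <= 0 || at == address.Length - 1) return false;

        localPart = address.Substring(0, at);
        domain = address.Substring(at + 1);
        return true;
    }
}
```

For `"Austin@Powers"@example.com` this yields the local part `"Austin@Powers"` (quotes included) and the domain `example.com`. It only separates the pieces and says nothing about whether either one is valid.
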
Section 3.4.1 of RFC 2822 gives more detail on the email address specification (emphasis mine).
An addr-spec is a specific Internet identifier that contains a locally interpreted string followed by the at-sign character (“@”, ASCII value 64) followed by an Internet domain. The locally interpreted string is either a quoted-string or a dot-atom.
A dot-atom is a series of atoms separated by dots. An atom, in turn, is defined in section 3.2.4 as a run of alphanumeric characters that may also include any of the following symbols (you know, the ones normally used to stand in for swearing)...

! $ & * - = ^ ` | ~ # % ' + / ? _ { }
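
As a rough sketch of the dot-atom rule, the check below accepts ASCII letters, digits and the symbols listed above, and requires every dot-separated atom to be non-empty. The class and method names (`DotAtomSketch`, `IsDotAtom`) are made up for this example, and it covers only unquoted local parts.

```csharp
using System;
using System.Linq;

public static class DotAtomSketch
{
    // The symbols listed above, allowed in an atom alongside letters and digits.
    private const string AtomSymbols = "!$&*-=^`|~#%'+/?_{}";

    private static bool IsAtomChar(char c)
    {
        bool asciiLetterOrDigit =
            (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
        return asciiLetterOrDigit || AtomSymbols.IndexOf(c) >= 0;
    }

    // A dot-atom is one or more non-empty atoms joined by single dots,
    // so a leading dot, a trailing dot or two dots in a row are rejected.
    public static bool IsDotAtom(string localPart)
    {
        if (string.IsNullOrEmpty(localPart)) return false;
        return localPart.Split('.').All(atom => atom.Length > 0 && atom.All(IsAtomChar));
    }
}
```

Under these assumptions `customer/department`, `$A12345` and `~` pass, while `.wooly`, `wo..oly` and `Ima Fool` do not.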

Moreover, it is perfectly valid (though not recommended and rarely used anywhere) to have quoted local parts, in which almost any character is allowed. Quoting can be done either with the backslash character or by surrounding the local part with double quotes.
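
Here is a hedged sketch of what checking a double-quoted local part could look like. It mirrors the quoted branch of the regular expression shown further down in this article rather than the full RFC grammar: inside the quotes, a double quote, a carriage return or a backslash must be preceded by a backslash, and a backslash may escape nothing else. The names `QuotedLocalPartSketch` and `IsQuotedLocalPart` are invented for the example.

```csharp
using System;

public static class QuotedLocalPartSketch
{
    // Hypothetical helper mirroring the pattern "([^"\r\\]|\\["\r\\])*"
    public static bool IsQuotedLocalPart(string s)
    {
        if (s == null || s.Length < 2 || s[0] != '"' || s[s.Length - 1] != '"') return false;

        // Walk the characters between the opening and closing quotes.
        for (int i = 1; i < s.Length - 1; i++)
        {
            char c = s[i];
            if (c == '\\')
            {
                // A backslash must escape a quote, a carriage return or a backslash.
                i++;
                if (i >= s.Length - 1) return false; // it would swallow the closing quote
                char next = s[i];
                if (next != '"' && next != '\r' && next != '\\') return false;
            }
            else if (c == '"' || c == '\r')
            {
                return false; // these must be escaped inside the quotes
            }
        }
        return true;
    }
}
```

With this sketch `"Ima Fool"` and `"test\\blah"` (two backslashes) are accepted, while `"test\blah"` (a backslash before an ordinary letter) and `"test"blah"` (an unescaped inner quote) are rejected.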

RFC 3696, Application Techniques for Checking and Transformation of Names, was written by the author of the SMTP protocol (RFC 2821) as a human-readable guide to SMTP. In the third section, he gives examples of valid email addresses.

These are valid email addresses!


(Kudos to the RFC author for using my favorite example person, Joe Blow.)

Go ahead, run them through your favorite validator. Well, how many passed?

For fun, I decided to try writing a regular expression (yes, I know, now I have two problems, thanks) that all of them would pass. Here it is.

^(?!\.)("([^"\r\\]|\\["\r\\])*"|([-a-z0-9!#$%&'*+/=?^_`{|}~]|(?<!\.)\.)*)(?<!\.)@[a-z0-9][\w\.-]*[a-z0-9]\.[a-z][a-z\.]*[a-z]$

Note that this expression assumes case-insensitive matching (RegexOptions.IgnoreCase in .NET). I agree, it is a very ugly expression.

I wrote a unit test to demonstrate all the cases it covers. Each row contains an email address and a flag saying whether it is valid or not.

    // Needs System.Text.RegularExpressions plus a test framework that provides row tests.
    [RowTest]
    [Row(@"NotAnEmail", false)]
    [Row(@"@NotAnEmail", false)]
    [Row(@"""test\\blah""@example.com", true)]
    [Row(@"""test\blah""@example.com", false)]
    [Row("\"test\\\rblah\"@example.com", true)]
    [Row("\"test\rblah\"@example.com", false)]
    [Row(@"""test\""blah""@example.com", true)]
    [Row(@"""test""blah""@example.com", false)]
    [Row(@"customer/department@example.com", true)]
    [Row(@"$A12345@example.com", true)]
    [Row(@"!def!xyz%abc@example.com", true)]
    [Row(@"_Yosemite.Sam@example.com", true)]
    [Row(@"~@example.com", true)]
    [Row(@".wooly@example.com", false)]
    [Row(@"wo..oly@example.com", false)]
    [Row(@"pootietang.@example.com", false)]
    [Row(@".@example.com", false)]
    [Row(@"""Austin@Powers""@example.com", true)]
    [Row(@"Ima.Fool@example.com", true)]
    [Row(@"""Ima.Fool""@example.com", true)]
    [Row(@"""Ima Fool""@example.com", true)]
    [Row(@"Ima Fool@example.com", false)]
    public void EmailTests(string email, bool expected)
    {
        // Local part: either a quoted string or dot-separated atoms; then "@" and the domain.
        string pattern = @"^(?!\.)(""([^""\r\\]|\\[""\r\\])*""|"
            + @"([-a-z0-9!#$%&'*+/=?^_`{|}~]|(?<!\.)\.)*)(?<!\.)"
            + @"@[a-z0-9][\w\.-]*[a-z0-9]\.[a-z][a-z\.]*[a-z]$";

        // The pattern only lists lowercase letters, so matching must ignore case.
        Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
        Assert.AreEqual(expected, regex.IsMatch(email),
            "Problem with '" + email + "'. Expected " + expected + " but was not that.");
    }


Before you call me a terrible bore and pedant (you may be right, but hold on), I do not think that this level of email address checking is absolutely necessary. Most email providers have stricter requirements for email addresses. For example, Yahoo requires an address to begin with a letter. There seems to be a common, stricter set of rules that most email providers follow, but as far as I know it is not documented anywhere.

I think I'll create an email address like phil.h\@\@ck@haacked.com and start complaining to tech support at sites that require an email address but won't let me create an account with this one. I love being a troublemaker!

The moral is that it is healthy to challenge your preconceptions and assumptions once in a while, and to never let me near an RFC.

P.S. I corrected several mistakes I made in my reading of the RFC. See? Even after reading the RFC, I'm still not sure what I'm doing, damn it! Which once again confirms the thesis that programmers are not readers.

Source: https://habr.com/ru/post/274985/

