How to find or validate an e-mail address

By far the most feedback I receive, not to mention the "bug" reports, concerns the regular expression for e-mail addresses:

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b

I claim that this regular expression matches any e-mail address. The feedback usually shows one e-mail address that the expression does not match, and the "bug" reports often come with a suggestion for making the regex perfect.

As I will explain below, my claim holds only if you accept my definition of what is and what is not a valid e-mail address. If you use a different definition, you will have to adapt the expression. Matching a valid e-mail address is a great example showing that:

before writing a regular expression, you have to know exactly what it should match and what it should not;
there is often a trade-off between exactness and practicality.

The advantage of my regular expression is that it matches 99% of the e-mail addresses in use today. Every address it matches can be handled by 99% of e-mail software. If you are looking for a quick solution, you only need to read the next paragraph.

If you want to use the regular expression above, you need to understand two things. First, long regular expressions make it hard to format paragraphs nicely, so I did not include "a-z" in any of the three character classes. The expression is therefore meant to be used with case-insensitive matching turned on in your tool. (You would be surprised how many "bug" reports I get about this.) Second, the regular expression above is delimited by word boundaries, which makes it suitable for extracting e-mail addresses from files or large blocks of text. If you want to check whether text entered by the user is a valid e-mail address, replace the word boundaries with start-of-string and end-of-string anchors, like this:

 ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$
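The two uses described above, extracting addresses from text and validating a single input, can be sketched in Python roughly as follows. The function names and the sample text are illustrative assumptions; only the pattern itself comes from the article. Case-insensitive matching is switched on with re.IGNORECASE, because the pattern contains no "a-z".

```python
import re

# Pattern body from the article. It has no a-z, so IGNORECASE is required.
EMAIL_BODY = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"

# Word boundaries: pull addresses out of a larger block of text.
FIND_EMAIL = re.compile(r"\b" + EMAIL_BODY + r"\b", re.IGNORECASE)

# Anchors: the whole input has to be a single address.
CHECK_EMAIL = re.compile(r"^" + EMAIL_BODY + r"$", re.IGNORECASE)


def find_emails(text):
    """Return every substring of text that looks like an e-mail address."""
    return FIND_EMAIL.findall(text)


def is_email(value):
    """Return True if value as a whole looks like an e-mail address."""
    return CHECK_EMAIL.match(value) is not None


if __name__ == "__main__":
    print(find_emails("Write to John@Example.com or jane.doe@mail.example.org today."))
    print(is_email("john@example.com"), is_email("write to john@example.com"))
```

Keeping the pattern body in one place means the extraction and validation variants cannot drift apart when the expression is adjusted later.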

Trade-offs in validating e-mail addresses

Yes, there are plenty of e-mail addresses that my regular expression does not match. The most frequently quoted examples are addresses on the .museum top-level domain, which is longer than the 4 letters my expression allows for the top-level domain. I accept this trade-off because the number of people using .museum addresses is extremely small. I have never received a complaint that the order forms or newsletter subscription forms on my company's websites refused a .museum address.
To allow .museum, you can use the following expression:

 ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$

But this involves another trade-off. This expression will match the address john@mail.office. It is far more likely that John forgot to type the .com at the end of the address than that he created a new top-level domain .office without ICANN's permission.
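A quick way to see this trade-off is to run both variants over a few sample addresses. The sketch below assumes Python's re module; the sample addresses and the compare helper are made up for illustration.

```python
import re

# Top-level domain limited to 2-4 letters (the original expression).
STRICT = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", re.IGNORECASE)

# Top-level domain limited to 2-6 letters (the .museum variant).
LOOSE = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$", re.IGNORECASE)

# Illustrative inputs, not real mailboxes.
SAMPLES = ["curator@gallery.museum", "john@mail.office", "john@mail.office.com"]


def compare(samples):
    """Map each sample to a pair (matches STRICT, matches LOOSE)."""
    return {s: (bool(STRICT.match(s)), bool(LOOSE.match(s))) for s in samples}


if __name__ == "__main__":
    for address, (strict_ok, loose_ok) in compare(SAMPLES).items():
        print(address, "strict:", strict_ok, "loose:", loose_ok)
```

Widening the quantifier accepts the .museum address, and it accepts the probable typo john@mail.office along with it.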

The example above shows another trade-off: do you want the regular expression to check whether the top-level domain exists? My regular expression does not. Any combination of two to four letters is accepted, which covers all existing (and planned) top-level domains except .museum [and .travel - translator's note]. But it also matches addresses with invalid top-level domains, such as asdf@asdf.asdf. By not being overly strict about the top-level domain, I do not have to update the regular expression every time a new top-level domain is created. If you do want to be strict, you can list the allowed top-level domains explicitly:

 ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)$

This regular expression allows any two-letter country-code top-level domain, but only specific generic top-level domains. By the time you read this, the list may already be out of date. I recommend keeping the list as a global constant in your application so that you only have to update it in one place. You could also list all country codes explicitly, although there are about two hundred of them.
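One possible way to keep the list in a single global constant is to build the expression from it at start-up. This is a minimal sketch in Python; GENERIC_TLDS and build_email_regex are hypothetical names, and the tuple simply repeats the domains from the expression above.

```python
import re

# Hypothetical global constant: the generic TLDs listed in the article's
# expression. Update this tuple in one place when the list changes.
GENERIC_TLDS = ("com", "org", "net", "edu", "gov", "mil", "biz", "info",
                "mobi", "name", "aero", "asia", "jobs", "museum")


def build_email_regex(generic_tlds=GENERIC_TLDS):
    """Compile the validation regex for any two-letter TLD plus the given generic TLDs."""
    tld_part = "|".join(re.escape(tld) for tld in generic_tlds)
    pattern = r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|" + tld_part + r")$"
    return re.compile(pattern, re.IGNORECASE)


EMAIL_RE = build_email_regex()

if __name__ == "__main__":
    for address in ("john@example.com", "john@example.de", "asdf@asdf.asdf"):
        print(address, bool(EMAIL_RE.match(address)))
```

Adding a domain such as .travel then only requires extending the tuple, for example build_email_regex(GENERIC_TLDS + ("travel",)).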

An e-mail address can also be on a subdomain, for example john@mail.company.com. All the regular expressions above match this address, because I included a dot in the character class after the @. But they also match john@aol...com, which is invalid because of the consecutive dots. You can exclude such matches by replacing "[A-Z0-9.-]+\." with "(?:[A-Z0-9-]+\.)+". I removed the dot from the character class and instead repeat the character class together with the literal dot that follows it. For example,

 \b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}\b

will match john@server.department.company.com, but not john@aol...com.
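The difference between the two domain parts can be checked with a short Python sketch. The constant names are illustrative, and anchored variants of the expressions are used so that each test string has to match as a whole.

```python
import re

# Dot inside the character class: consecutive dots slip through.
DOT_IN_CLASS = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", re.IGNORECASE)

# Dot moved out of the class: every label must be followed by exactly one dot.
LABEL_THEN_DOT = re.compile(r"^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$", re.IGNORECASE)

if __name__ == "__main__":
    for address in ("john@server.department.company.com", "john@aol...com"):
        print(address,
              "dot in class:", bool(DOT_IN_CLASS.match(address)),
              "label then dot:", bool(LABEL_THEN_DOT.match(address)))
```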

Another trade-off is that my regular expressions allow only Latin letters, digits and a few special characters. The main reason is that I do not trust all of my e-mail software to handle anything else. Even though John.O'Hara@theoharas.com is a syntactically valid e-mail address, there is a risk that some software will misinterpret the apostrophe as a quoting character. For example, blindly inserting this address into an SQL statement will make it fail if strings are delimited with single quotes. And, of course, domain names have been allowed to contain non-Latin characters for many years now. Most software and even domain registrars, however, still stick to the 37 characters they are used to (the 26 Latin letters, the 10 digits and the hyphen).
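The SQL problem mentioned above can be reproduced with Python's built-in sqlite3 module and an in-memory database. This is only an illustration under assumptions: the subscribers table and both function names are invented, and other databases may report the error differently.

```python
import sqlite3


def naive_insert_fails(address):
    """Build the INSERT by string concatenation; return True if SQLite rejects it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE subscribers (email TEXT)")
        try:
            # The apostrophe in the address ends the SQL string literal early.
            conn.execute("INSERT INTO subscribers (email) VALUES ('" + address + "')")
        except sqlite3.Error:
            return True
        return False
    finally:
        conn.close()


def store_and_fetch(address):
    """Insert the address with a bound parameter and read it back."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE subscribers (email TEXT)")
        # The ? placeholder lets the driver pass the value separately from the SQL text.
        conn.execute("INSERT INTO subscribers (email) VALUES (?)", (address,))
        return conn.execute("SELECT email FROM subscribers").fetchone()[0]
    finally:
        conn.close()


if __name__ == "__main__":
    print(naive_insert_fails("John.O'Hara@theoharas.com"))
    print(store_and_fetch("John.O'Hara@theoharas.com"))
```

With bound parameters the apostrophe is harmless, so restricting the character set in the regular expression is a precaution for software you do not control rather than a substitute for correct quoting in your own code.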

The conclusion is this: to decide which regular expression to use, whether you are trying to match an e-mail address or anything else that is only loosely defined, you have to start by considering all the trade-offs. How bad is it to match something that is not valid? How bad is it to miss something that is valid? How complex may your regular expression be? How costly would it be to change the expression later? Different answers to these questions call for different regular expressions. My regular expressions do what I want, but they may not do what you want.

Regular expressions do not send e-mail

Do not go overboard trying to eliminate invalid e-mail addresses with your regular expression. If you have to accept .museum, it is often better to allow any top-level domain of up to six letters than to list all current domains. The reason is that you do not really know whether an e-mail address is valid until you try to send a message to it. And even that may not be enough. Even if the message arrives in a mailbox, that does not mean somebody will read it.

The same principle applies in many situations. When trying to match a valid date, it is often easier to use a bit of arithmetic to handle leap years than to try to do it inside the regular expression. Use regular expressions to find potential matches, or to check whether the input uses the required syntax, and then perform the actual validation on the candidates the regular expression returns. Regular expressions are a powerful tool, but they are far from a panacea.
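For dates, that division of labour might look like the following Python sketch. The YYYY-MM-DD format and the find_valid_dates name are assumptions chosen for illustration; the regular expression only finds candidates, and datetime.date does the calendar arithmetic, including leap years.

```python
import re
from datetime import date

# Syntax only: four digits, two digits, two digits. No calendar knowledge here.
DATE_CANDIDATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def find_valid_dates(text):
    """Return the dates in text that exist on the calendar."""
    valid = []
    for match in DATE_CANDIDATE.finditer(text):
        year, month, day = (int(group) for group in match.groups())
        try:
            # date() raises ValueError for impossible dates such as 2023-02-29.
            valid.append(date(year, month, day))
        except ValueError:
            continue
    return valid


if __name__ == "__main__":
    print(find_valid_dates("Seen on 2023-02-29, 2024-02-29 and 2024-13-01."))
```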

Official Standard: RFC 2822

Maybe you are wondering why there is no "official" foolproof regular expression for matching e-mail addresses. Well, there is an official definition, but it is hardly foolproof.

The official standard is known as RFC 2822. It describes the syntax that valid e-mail addresses must adhere to. You can (but should not) implement it with the following regular expression:

 (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x5e-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

This regular expression has two parts: the part before the @ and the part after it. There are two alternatives for the part before the @. It can consist of a series of letters, digits and certain symbols, including one or more dots. However, dots may not appear consecutively, nor at the start or end of this part. The other alternative requires the part before the @ to be enclosed in double quotes, which allows any string of ASCII characters between the quotes. Whitespace characters, double quotes and backslashes must be escaped with a backslash.

The part after @ also has two alternatives. This can be either a fully qualified domain name (for example, example.com), or a literal Internet address in square brackets. The literal Internet address can be an IP address or a domain-specific routing address.

The reason you should not use this regular expression is that it only checks the basic syntax of an e-mail address. john@aol.com.nospam is considered a valid e-mail address according to RFC 2822. Obviously, this address will not work, since there is no .nospam top-level domain. Nor does the expression guarantee that your e-mail software can handle the address. Not all applications support the syntax with double quotes or square brackets. In fact, RFC 2822 itself marks the notation using square brackets as obsolete.

We get a more practical implementation of RFC 2822 if we omit the syntax using double quotes and square brackets. It still matches 99.99% of all e-mail addresses in actual use today.

 [a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
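As a rough illustration, the simplified expression can be tried out in Python like this. The constant and function names are assumptions; re.fullmatch is used so that the whole input must match, and re.IGNORECASE is added because the pattern is written in lowercase only.

```python
import re

# The practical RFC 2822 variant from the article, split over two raw strings.
RFC2822_PRACTICAL = re.compile(
    r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
    r"@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?",
    re.IGNORECASE,
)


def is_rfc2822_like(value):
    """Return True if the whole value matches the simplified RFC 2822 syntax."""
    return RFC2822_PRACTICAL.fullmatch(value) is not None


if __name__ == "__main__":
    for address in ("John.O'Hara@theoharas.com", "john@aol.com.nospam",
                    "john..doe@example.com", "john@aol...com"):
        print(address, is_rfc2822_like(address))
```

The syntactically fine but undeliverable john@aol.com.nospam is accepted, which is the limitation discussed above.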

A further change you could make is to allow any two-letter country-code top-level domain, but only specific generic top-level domains. This regular expression filters out dummy e-mail addresses such as asdf@adsf.adsf. You will need to update it as new top-level domains are added.

 [a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)\b

So even when you follow official standards, there are still trade-offs to be made. Do not blindly copy regular expressions from online libraries or forums. Always test them on your own data and with your own applications.

Source: https://habr.com/ru/post/175329/
