This article discusses email validation using regular expressions. All regexps here are applied with the i modifier, i.e. matching is case-insensitive.
Preparation
Before writing a validator, you need to know what an email address consists of. As most people know, it is "username@hostname". It is best to split the work on the regexp into 2 logical parts: hostname validation and username validation. Let's start with the larger one.
Hostname validation
First, let's consider what a hostname consists of.
A hostname consists of several components separated by dots, each no longer than 63 characters, followed by a suffix (a first-level domain). Components consist of Latin letters, digits and hyphens, and a hyphen cannot appear at the beginning or end of a component. Suffixes come from a limited list of first-level domains (I found the list on the IANA website). To simplify the expression, we write country domains as [a-z][a-z] (any 2 letters from a to z, case-insensitive). We also will not support non-Latin characters until they are officially introduced for public use. As a result, we get an expression that checks the suffix (the construction (foo|bar) means that either foo or bar is matched, i.e. it acts as "or"):
(aero|arpa|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|[a-z][a-z])
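To try the suffix check on its own, here is a minimal Python sketch. The helper name is_known_suffix is hypothetical, and re.IGNORECASE stands in for the i modifier; the suffix list is the one from the text, not a current IANA list.

```python
import re

# Suffix alternation from the text; country codes are written as [a-z][a-z].
SUFFIX = (
    r"(aero|arpa|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum"
    r"|name|net|org|pro|tel|travel|[a-z][a-z])"
)
SUFFIX_RE = re.compile(SUFFIX, re.IGNORECASE)  # plays the role of the "i" modifier


def is_known_suffix(text: str) -> bool:
    """Return True if the whole string is one of the listed suffixes (hypothetical helper)."""
    return SUFFIX_RE.fullmatch(text) is not None


print(is_known_suffix("com"))   # True
print(is_known_suffix("RU"))    # True: any two letters, case-insensitive
print(is_known_suffix("habr"))  # False: not in the list and not two letters
```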
For components, the expression is more complicated:
([a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?\.)
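Let's break the expression down:
[a-z0-9] # the first character: a letter or digit
([-a-z0-9]{0,61}[a-z0-9])? # the optional rest of the component
\. # the dot that ends the component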
Consider the optional part:
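# a letter, digit or hyphen,
# {0,61} means repeated from 0 to 61 times
[-a-z0-9]{0,61}
# after up to 61 such characters, a closing letter or digit, which keeps the total within 63
[a-z0-9]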
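The length arithmetic (1 + 61 + 1 = 63) and the hyphen rule can be checked with a short Python sketch. The helper name is_component is hypothetical; the pattern is the component expression from above, including its trailing dot.

```python
import re

# One hostname component followed by its dot, as in the text.
COMPONENT_RE = re.compile(r"[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?\.", re.IGNORECASE)


def is_component(text: str) -> bool:
    """Return True if the whole string is a single component plus a dot (hypothetical helper)."""
    return COMPONENT_RE.fullmatch(text) is not None


print(is_component("a."))             # True: the optional part is skipped
print(is_component("my-host."))       # True: hyphen in the middle
print(is_component("-bad."))          # False: leading hyphen
print(is_component("bad-."))          # False: trailing hyphen
print(is_component("a" * 63 + "."))   # True: exactly 63 characters
print(is_component("a" * 64 + "."))   # False: one character too many
```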
As a result, we get an expression that checks the hostname:
([a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?\.)*(aero|arpa|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|[a-z][a-z])
Note that the components are optional (hence the *), since some first-level domains are themselves served by servers.
An example .
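The combined hostname expression can be exercised the same way. This is only a sketch: the helper name is_hostname is hypothetical, and the sample hostnames are made up for illustration.

```python
import re

# Zero or more components, then a suffix from the list in the text.
HOSTNAME = (
    r"([a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?\.)*"
    r"(aero|arpa|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum"
    r"|name|net|org|pro|tel|travel|[a-z][a-z])"
)
HOSTNAME_RE = re.compile(HOSTNAME, re.IGNORECASE)


def is_hostname(text: str) -> bool:
    """Return True if the whole string matches the hostname expression (hypothetical helper)."""
    return HOSTNAME_RE.fullmatch(text) is not None


print(is_hostname("example.com"))         # True
print(is_hostname("mail.example.co.uk"))  # True: several components
print(is_hostname("com"))                 # True: components are optional
print(is_hostname("example.habr"))        # False: unknown suffix
print(is_hostname("example..com"))        # False: empty component
print(is_hostname("-example.com"))        # False: leading hyphen
```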
Username validation
Username may contain:
- Latin letters
- digits
- the characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
- a dot, but not as the first or last character, and not twice in a row
Here is the expression straight away:
[-a-z0-9!#$%&'*+/=?^_`{|}~]+(\.[-a-z0-9!#$%&'*+/=?^_`{|}~]+)*
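It is actually simple: one or more characters from the set
[-a-z0-9!#$%&'*+/=?^_`{|}~]
followed by zero or more groups of the form
\.[-a-z0-9!#$%&'*+/=?^_`{|}~]+
that is, a dot followed by one or more characters from the same set.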
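The dot rules follow directly from this structure, which a small Python sketch makes visible. The names ATOM and is_username are hypothetical; the character set is the one listed above.

```python
import re

# One run of allowed characters; dots may only appear between such runs.
ATOM = r"[-a-z0-9!#$%&'*+/=?^_`{|}~]+"
USERNAME_RE = re.compile(ATOM + r"(\." + ATOM + r")*", re.IGNORECASE)


def is_username(text: str) -> bool:
    """Return True if the whole string matches the username expression (hypothetical helper)."""
    return USERNAME_RE.fullmatch(text) is not None


print(is_username("john.doe"))      # True
print(is_username("o'reilly+tag"))  # True: special characters from the list
print(is_username(".john"))         # False: leading dot
print(is_username("john."))         # False: trailing dot
print(is_username("john..doe"))     # False: two dots in a row
```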
The result
The regexp for email validation:
^[-a-z0-9!#$%&'*+/=?^_`{|}~]+(\.[-a-z0-9!#$%&'*+/=?^_`{|}~]+)*@([a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?\.)*(aero|arpa|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|[a-z][a-z])$
This expression can be optimized a little by using non-capturing groups (?:...) (optimization will probably get a separate article):
^[-a-z0-9!#$%&'*+/=?^_`{|}~]+(?:\.[-a-z0-9!#$%&'*+/=?^_`{|}~]+)*@(?:[a-z0-9](?:[-a-z0-9]{0,61}[a-z0-9])?\.)*(?:aero|arpa|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|[a-z][a-z])$
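As an illustration, the final expression could be wrapped in a Python function roughly like this. It is a sketch, not the function the author uses: the names ATOM, LABEL, SUFFIX and is_valid_email are hypothetical, and two Python-specific choices are assumptions of this example. re.ASCII is added because, in Python, [a-z] combined with re.IGNORECASE otherwise also matches a few non-ASCII letters, and fullmatch is used because $ alone would accept a trailing newline.

```python
import re

# Building blocks of the expression, all groups non-capturing.
ATOM = r"[-a-z0-9!#$%&'*+/=?^_`{|}~]+"
LABEL = r"[a-z0-9](?:[-a-z0-9]{0,61}[a-z0-9])?"
SUFFIX = (
    r"(?:aero|arpa|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum"
    r"|name|net|org|pro|tel|travel|[a-z][a-z])"
)

EMAIL_RE = re.compile(
    rf"^{ATOM}(?:\.{ATOM})*@(?:{LABEL}\.)*{SUFFIX}$",
    re.IGNORECASE | re.ASCII,  # "i" modifier, restricted to ASCII letters
)


def is_valid_email(address: str) -> bool:
    """Return True if the address matches the article's expression (hypothetical helper)."""
    # fullmatch rejects "user@example.com\n", which "$" alone would let through.
    return EMAIL_RE.fullmatch(address) is not None


print(is_valid_email("user@example.com"))                # True
print(is_valid_email("User.Name+tag@Mail.Example.ORG"))  # True
print(is_valid_email("user@example.habr"))               # False
print(is_valid_email("user..name@example.com"))          # False
print(is_valid_email("user@example.com\n"))              # False
print(EMAIL_RE.groups)                                   # 0: nothing is captured
```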
Bonus
Consider the regexp that was cited as an example in the comments to the introductory topic:
^(\S+)@([a-z0-9-]+)(\.)([a-z]{2,4})(\.?)([a-z]{0,4})+$
What are the main problems here?
- Since the username can consist of any characters except whitespace, the username "日本国" is valid.
- The first-level domain can be any 2 to 4 Latin letters, for example .habr
- Terrible domain check: 1 or more characters, a mandatory dot, 2-4 mandatory characters, an optional dot, then groups of 0-4 characters, repeated. Moreover, each of these blocks is captured and stored in memory.
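These problems are easy to reproduce. The sketch below compiles the quoted expression in Python (with [a-z] ranges and re.IGNORECASE for the i modifier) and feeds it a few made-up addresses; the helper name flawed_accepts is hypothetical.

```python
import re

# The expression quoted from the comments, unchanged apart from being compiled in Python.
FLAWED_RE = re.compile(
    r"^(\S+)@([a-z0-9-]+)(\.)([a-z]{2,4})(\.?)([a-z]{0,4})+$",
    re.IGNORECASE,
)


def flawed_accepts(address: str) -> bool:
    """Return True if the flawed expression accepts the address (hypothetical helper)."""
    return FLAWED_RE.match(address) is not None


print(flawed_accepts("日本国@example.com"))  # True: \S+ allows any non-space characters
print(flawed_accepts("user@example.habr"))   # True: any 2-4 letters pass as a suffix
print(flawed_accepts("user@-.aa"))           # True: a lone hyphen passes as a domain
print(flawed_accepts("a@b@example.com"))     # True: \S+ even swallows an extra "@"
print(FLAWED_RE.groups)                      # 6: every block is captured
```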
P.S. Write about bugs and suggestions; I will definitely fix them.
P.P.S. If there is interest, I can post the email verification function that I use.