Hello, dear ladies and gentlemen.
We are actively looking for fresh literature on regular expressions for beginners. In this case we are interested less in a translated book than in one written in Russian from the start, ideally one that touches on regular expressions in natural language processing. We would like to bring the following text to your attention: first, as a reminder of the topic, and second, to show roughly the level of complexity we have in mind.
Sooner or later you will have to deal with regular expressions. Faced with their complicated syntax, confusing documentation, and steep learning curve, most developers settle for copying an expression from StackOverflow and hoping it works. But what if you could actually decode regular expressions and use them to the fullest? In this article I will explain why regular expressions deserve a second look and how they can be useful in practice.
Why do we need regular expressions?
Why bother with regular expressions at all? How can they help you?
- Pattern matching: Regular expressions are great at determining whether a string matches a particular format, for example a phone number, email address, or credit card number.
- Replacement: With regular expressions it is easy to find and replace patterns in a string. For example, the expression text.replace(/\s+/g, " ") replaces every run of whitespace in text, such as " \n\t ", with a single space.
- Extraction: With regular expressions it is easy to extract pieces of information from a pattern. For example, name.match(/^(Mr|Ms|Mrs|Dr)\.?\s/i)[1] extracts a person's title from a string, such as "Mr" from "Mr. Schropp".
- Portability: Almost every common programming language has its own regular expression library. The syntax is largely standardized, so you don’t have to relearn regular expressions when switching to a new language.
- Code: When writing code, you can use regular expressions to search through files; Atom provides find and replace for this purpose, and on the command line there is ack.
- Clarity and conciseness: If you are on familiar terms with regular expressions, you can perform highly non-trivial operations with a minimal amount of code.
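The replacement and extraction items above can be tried out directly. Here is a minimal sketch, assuming plain JavaScript; the helper names collapseWhitespace and extractTitle are made up for this illustration and are not part of any library.

```javascript
// Hypothetical helper: collapse every run of whitespace
// (spaces, tabs, newlines) into a single space.
function collapseWhitespace(text) {
  return text.replace(/\s+/g, " ");
}

// Hypothetical helper: return the leading title ("Mr", "Ms", "Mrs" or "Dr"),
// or null when the name does not start with one.
function extractTitle(name) {
  const match = name.match(/^(Mr|Ms|Mrs|Dr)\.?\s/i);
  // match[1] holds the text captured by the first pair of parentheses.
  return match ? match[1] : null;
}

console.log(collapseWhitespace("regular \n\t expressions")); // regular expressions
console.log(extractTitle("Mr. Schropp")); // Mr
console.log(extractTitle("Schropp"));     // null
```

Note that match returns null when nothing matches, so indexing the result with [1] directly is only safe if you already know the string fits the pattern.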
How to write regular expressions
Regular expressions are easiest to learn by example. Suppose you are building a web page with a field for entering a phone number. Since you are an ace web developer, you also want to display a check mark if the phone number is valid and a cross (X) if it is not.
<input id="phone-number" type="text">
<label class="valid" for="phone-number"><img src="check.svg"></label>
<label class="invalid" for="phone-number"><img src="x.svg"></label>

input:not([data-validation="valid"]) ~ label.valid,
input:not([data-validation="invalid"]) ~ label.invalid {
  display: none;
}

$("input").on("input blur", function(event) {
  if (isPhoneNumber($(this).val())) {
    $(this).attr({ "data-validation": "valid" });
    return;
  }
  if (event.type == "blur") {
    $(this).attr({ "data-validation": "invalid" });
  } else {
    $(this).removeAttr("data-validation");
  }
});
Now, if a person types or pastes a valid number into the field, a check mark is displayed. If the user moves focus away from the input field while it still holds an invalid value, a cross is displayed.
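The decision the event handler makes can be easier to follow when separated from the DOM. The sketch below restates it as a pure function; nextValidationState is a hypothetical name introduced only for this illustration, and it returns null where the handler would remove the data-validation attribute.

```javascript
// Hypothetical helper mirroring the handler's logic:
//   valid input            -> "valid"   (check mark shown)
//   invalid input on blur  -> "invalid" (cross shown)
//   invalid while typing   -> null      (attribute removed, nothing shown)
function nextValidationState(isValid, eventType) {
  if (isValid) {
    return "valid";
  }
  return eventType === "blur" ? "invalid" : null;
}

console.log(nextValidationState(true, "input"));  // valid
console.log(nextValidationState(false, "input")); // null
console.log(nextValidationState(false, "blur"));  // invalid
```

The cross appears only on blur, so the user is not scolded while still typing a number.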
Since you know that phone numbers consist of ten digits, your first version of isPhoneNumber looks like this:
function isPhoneNumber(string) {
  return /\d\d\d\d\d\d\d\d\d\d/.test(string);
}
In this function, the regular expression sits between the / characters and consists of ten \d tokens, each of which matches a single digit. The test method returns true if the regular expression matches the string and false otherwise. If you run isPhoneNumber("5558675309"), the function returns true. Hooray!
However, writing out ten \d tokens is rather tedious. Fortunately, the same thing can be expressed with curly braces:
function isPhoneNumber(string) {
  return /\d{10}/.test(string);
}
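To convince yourself that the curly-brace form behaves the same as ten \d tokens written out, you can run both against a few strings. This is a small sketch; the sample inputs are made up for the illustration.

```javascript
const longForm = /\d\d\d\d\d\d\d\d\d\d/; // ten \d tokens written out
const shortForm = /\d{10}/;              // the same pattern with a repetition count

// Illustrative inputs: ten digits, nine digits, and no digits at all.
const samples = ["5558675309", "555867530", "not a number"];

const results = samples.map((s) => ({
  input: s,
  longForm: longForm.test(s),
  shortForm: shortForm.test(s),
}));

// Both columns agree: true for the ten-digit string, false for the other two.
console.log(results);
```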
Sometimes a person entering a phone number starts with a leading 1. Wouldn't it be nice if your regular expression handled that case too? This can be done with the ? symbol.
function isPhoneNumber(string) {
  return /1?\d{10}/.test(string);
}
The ? character means "zero or one", so isPhoneNumber now returns true for both "5558675309" and "15558675309"!
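One way to see "zero or one" in action is to look at the text the expression actually matched rather than just true or false. The sketch below uses the standard match method; matchedText is a hypothetical helper name.

```javascript
const pattern = /1?\d{10}/;

// Hypothetical helper: return the matched text, or null when nothing matches.
function matchedText(string) {
  const match = string.match(pattern);
  return match ? match[0] : null;
}

console.log(matchedText("5558675309"));  // 5558675309  (the optional 1 is absent)
console.log(matchedText("15558675309")); // 15558675309 (the optional 1 is present)
console.log(matchedText("555867530"));   // null        (only nine digits)
```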
So far isPhoneNumber is quite good, but we are missing one key detail: a regular expression can match just part of a string rather than the whole string. As a result, isPhoneNumber("555555555555555555") returns true, because that string contains ten digits. The problem can be solved with the anchors ^ and $.
function isPhoneNumber(string) {
  return /^1?\d{10}$/.test(string);
}
Roughly speaking, ^ matches the beginning of the string and $ matches the end, so now your regular expression has to match the entire phone number.
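The difference the anchors make is easy to check side by side. A minimal sketch, using the overly long string from the text:

```javascript
const unanchored = /1?\d{10}/;
const anchored = /^1?\d{10}$/;

const tooLong = "555555555555555555"; // far more than eleven digits

// Without anchors, any ten consecutive digits inside the string are enough.
console.log(unanchored.test(tooLong)); // true

// With anchors, the pattern must account for the whole string.
console.log(anchored.test(tooLong));       // false
console.log(anchored.test("5558675309"));  // true
console.log(anchored.test("15558675309")); // true
```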
A serious example
The page has been released and is a roaring success, but there is one significant problem. In the US, a phone number can be written in various ways:
- (234) 567-8901
- 234-567-8901
- 234.567.8901
- 234 / 567-8901
- 234 567 8901
- +1 (234) 567-8901
- 1-234-567-8901
Although users can do without punctuation, it would be much easier for them to enter a formatted number.
Even if you could write a regular expression that handles all of these formats, I think that is a bad idea. No matter how hard you try to account for every format, you will still miss some. Besides, you really only care about the data itself, not its formatting. So rather than wrestling with all this punctuation, isn't it easier to get rid of it?
function isPhoneNumber(string) {
  return /^1?\d{10}$/.test(string.replace(/\D/g, ""));
}
The replace function replaces every match of \D, which matches any character that is not a digit, with an empty string. The global flag g tells replace to substitute all matches of the regular expression, not just the first one.
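To see what the g flag changes, compare the result of replace with and without it on one of the formatted numbers. A short sketch:

```javascript
const input = "(234) 567-8901";

// Without g, only the first non-digit (the opening parenthesis) is removed.
const firstOnly = input.replace(/\D/, "");

// With g, every non-digit is removed, leaving just the ten digits.
const allMatches = input.replace(/\D/g, "");

console.log(firstOnly);  // 234) 567-8901
console.log(allMatches); // 2345678901
```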
An even more serious example
Everyone likes your phone number page, and in the office you are the king of the water cooler. However, professionals like you don't stop there, so you want to make the page even better.
The North American Numbering Plan (NANP) is the telephone numbering standard used in the United States, Canada, and 23 other countries. It has a few simple rules:
- A telephone number such as (234) 567-8901 is divided into three parts: the area code (234), the exchange code (567), and the subscriber number (8901).
- In the area code and the exchange code, the first digit can be anything from 2 to 9, and the second and third digits can be anything from 0 to 9.
- In the exchange code, the third digit cannot be 1 if the second digit is 1.
Your regular expression already satisfies the first rule but violates the second and third. For now, let's deal with the second. The new regular expression should look something like this:
/^1?<AREA CODE><EXCHANGE CODE><SUBSCRIBER NUMBER>$/
The subscriber number is simple: it consists of just four digits.
/^1?<AREA CODE><EXCHANGE CODE>\d{4}$/
The area code is a bit more complicated. We want a digit from 2 to 9 followed by two more digits. For this you can use a character set! A character set lists a group of characters, any one of which may match at that position.
/^1?[23456789]\d\d<EXCHANGE CODE>\d{4}$/
Great, but typing out every character from 2 to 9 is tiresome. Let's make the expression even cleaner with a character range.
/^1?[2-9]\d\d<EXCHANGE CODE>\d{4}$/
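A character range is just shorthand for the corresponding character set. The sketch below checks each digit against both forms on their own (outside the phone number pattern) and collects the digits each one accepts.

```javascript
const explicitSet = /^[23456789]$/; // every allowed digit listed by hand
const range = /^[2-9]$/;            // the same digits written as a range

const digits = Array.from("0123456789");

const acceptedBySet = digits.filter((d) => explicitSet.test(d)).join("");
const acceptedByRange = digits.filter((d) => range.test(d)).join("");

console.log(acceptedBySet);   // 23456789
console.log(acceptedByRange); // 23456789
```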
Better already! Since the exchange code follows the same rule as the area code, you can simply duplicate that part of the regular expression to complete the pattern.
/^1?[2-9]\d\d[2-9]\d\d\d{4}$/
But how can you avoid copying and pasting the part of the expression that describes the area code? It gets easier if you use a group! To group characters, just enclose them in parentheses.
/^1?([2-9]\d\d){2}\d{4}$/
So [2-9]\d\d is contained in a group, and {2} indicates that this group must appear twice.
That's all! Here is the final version of the isPhoneNumber function:
function isPhoneNumber(string) {
  return /^1?([2-9]\d\d){2}\d{4}$/.test(string.replace(/\D/g, ""));
}
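As a quick sanity check, the final function can be run against every format from the list earlier in the article, plus a couple of numbers that break the second NANP rule. The rejected examples are made up for this sketch.

```javascript
function isPhoneNumber(string) {
  return /^1?([2-9]\d\d){2}\d{4}$/.test(string.replace(/\D/g, ""));
}

// The formats listed in the article: all should be accepted.
const accepted = [
  "(234) 567-8901",
  "234-567-8901",
  "234.567.8901",
  "234 / 567-8901",
  "234 567 8901",
  "+1 (234) 567-8901",
  "1-234-567-8901",
];

// Illustrative numbers that violate the second rule: all should be rejected.
const rejected = [
  "123-456-7890", // area code starts with 1
  "234-067-8901", // exchange code starts with 0
];

console.log(accepted.every(isPhoneNumber)); // true
console.log(rejected.some(isPhoneNumber));  // false
```

The third rule, about the exchange code, is still not enforced by this expression.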
When it is better to do without regular expressions
Regular expressions are a great tool, but some problems simply should not be solved with them.
Do not be too strict. There is no point in being overly strict when writing regular expressions. With phone numbers, even if we account for every rule in the NANP, it is still impossible to tell whether a phone number is real. If I enter (555) 555-5555, it will match the pattern, but there is no such telephone number.
Do not write an HTML parser. Although regular expressions are great for parsing simple things, you cannot build a parser for an entire language out of them. Unless you enjoy making trouble for yourself, you are unlikely to enjoy parsing non-regular languages with regular expressions.
Do not use them on very complex strings. The full regular expression for email addresses is 6,318 characters long. A simple, approximate one looks like this: /^[^@]+@[^@]+\.[^@\.]+$/. The general rule: if your regular expression grows longer than a single line of code, you may want to look for another solution.
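To get a feel for how approximate the short email expression is, here is a small sketch that runs it against a few made-up addresses. looksLikeEmail is a hypothetical name, and the function only checks the rough shape "something@something.something", not whether an address is valid or exists.

```javascript
// Hypothetical helper built on the simple expression from the article.
function looksLikeEmail(string) {
  return /^[^@]+@[^@]+\.[^@\.]+$/.test(string);
}

console.log(looksLikeEmail("user@example.com"));  // true
console.log(looksLikeEmail("user@example"));      // false (no dot after the @)
console.log(looksLikeEmail("user@@example.com")); // false (two @ signs)
console.log(looksLikeEmail("user@example."));     // false (nothing after the last dot)

// The pattern is deliberately loose: spaces are accepted, for instance.
console.log(looksLikeEmail("not an address@x.y")); // true
```

The last line shows the trade-off: the short expression stays readable precisely because it does not try to capture every rule.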