Boost is easy. Part 1. Boost.Regex

This article is the first in a series that I plan to devote to what is probably the best C++ library.

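This article covers the following topics related to regular expressions:

  1. regex_match and its parameters
  2. Partial match
  3. regex_search
  4. regex_replace
  5. regex_iterator
  6. regex_token_iterator
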
Introduction



I do not want to argue about whether regular expressions should be used at all; everyone decides that for themselves. My goal is to show how easy Boost.Regex is to use for those who like regular expressions. If you are not familiar with regular expressions, I advise you to read at least the Wikipedia article, and if you want to study them in more depth, I recommend the book Mastering Regular Expressions.

Boost.Regex is a compiled library, that is, you need to build it before you can use it. How to do this is described in the Boost Getting Started guide.

When building the library, you can choose one of two algorithms for the regular expression engine: recursive or non-recursive. The first is faster but can overflow the stack; the second is slightly slower but safe. The corresponding macros are BOOST_REGEX_RECURSIVE and BOOST_REGEX_NON_RECURSIVE. Each algorithm can also be tuned; the configuration macros and their descriptions can be found in the Boost.Regex documentation.

Boost.Regex supports the following syntax types for regular expressions:

  1. Perl (default)
  2. POSIX extended
  3. POSIX basic


Note that '.' (dot) matches '\n' by default. This can be changed by passing the match_not_dot_newline flag to the corresponding algorithm.



Basic algorithms



boost::regex_match



This algorithm checks whether the whole input string matches a regular expression. It returns true if the string matches and false otherwise.

Typical usage: regex_match(input_string, [match_results], regular_expression, [flags]).

For the complete list of overloads, see the documentation.

An example of its use (this and the following fragments assume that <iostream>, <string> and <boost/regex.hpp> are included and that the code is placed inside a function body):

    std::string xStr("AAAA-12222-BBBBB-44455");
    // Four groups separated by hyphens: word, digits, word, digits.
    boost::regex xRegEx("(\\w+)-(\\d+)-(\\w+)-(\\d+)");
    boost::smatch xResults;
    std::cout << "==========================Results==============================\n";
    std::cout << "Does this line match our needs? " << std::boolalpha
              << boost::regex_match(xStr, xResults, xRegEx) << "\n";
    // Element 0 is the whole match, elements 1..4 are the capture groups.
    std::cout << "Print entire match:\n" << xResults[0] << std::endl;
    std::cout << "Print the former string into another format:\n"
              << xResults[1] << "+"
              << xResults[2] << "+"
              << xResults[3] << "+"
              << xResults[4] << std::endl;


The output will be:

    ==========================Results==============================
    Does this line match our needs? true
    Print entire match:
    AAAA-12222-BBBBB-44455
    Print the former string into another format:
    AAAA+12222+BBBBB+44455



A short digression to describe the parameters of the algorithm. These parameters are used by all the algorithms, but we discuss them only here.



The optional match results parameter is an object of the class match_results. This object is an array of sub_match objects, and each sub_match simply holds a pair of iterators to the beginning and the end of a match found in the string. The parameter stores the results of the algorithm. If the algorithm succeeds, element zero of the array holds the sub_match for the whole match (partial matching is an exception, more on that later). Each subsequent element holds the iterators for the corresponding capture group of the regular expression. Whether an element actually took part in the match can be checked through its matched flag.

It is important to remember that each sub_match stores iterators into the input string, so you cannot pass a temporary object as the source string and use the results of the algorithm afterwards: at best you get an assert in a debug build, at worst undefined behavior and a headache.

For repeated captures in a regular expression (for example, "(\w)+"), only the last capture ends up in match_results. This is the default behavior, and it can be changed. To access all repeated captures, pass the match_extra flag in [flags]. That is not all: for match_extra to work, BOOST_REGEX_MATCH_EXTRA must be defined in all translation units, or you can simply uncomment the define in boost/regex/user.hpp. This functionality is marked as experimental and greatly reduces performance. I did not manage to try it, because my VS2008 reports an access violation deep inside xutility when the regex_* algorithms are used with the define uncommented. An untested example of its use:

    std::string xStr("The boost library has a great opportunity for the regex!");
    // Each repetition lazily skips ahead and captures one five-letter word.
    // At least one repetition is required, so the match cannot be empty.
    boost::regex xRegEx("(?:.*?\\b(\\w{5})\\b)+");
    boost::smatch xResults;
    std::cout << "==========================Results==============================\n";
    // Requires BOOST_REGEX_MATCH_EXTRA to be defined in every translation unit.
    if (boost::regex_search(xStr, xResults, xRegEx, boost::match_extra))
    {
        std::cout << "Words consisting of exactly 5 letters have been found in our line:\n";
        // captures(1) holds every repetition of group 1, not only the last one.
        for (std::size_t j = 0; j < xResults.captures(1).size(); ++j)
            std::cout << xResults.captures(1)[j] << std::endl;
    }


[flags] is an optional parameter; its default value is match_default. The available flags are listed in the documentation. Flags are combined with '|' (bitwise or).



Partial match



Partial matching checks whether the input string matches the beginning of a regular expression. It is useful when validating incoming data asynchronously or when working with large amounts of data, that is, whenever at a given moment it is impossible to establish a full match between the regular expression and the source string. To use partial matching, pass the match_partial flag in [flags]. If a partial match is found, the algorithm (regex_match, regex_search, and so on) returns true, but the matched flag of element zero of match_results is set to false. The text found by the partial match is available through the same element zero.

Example of use:

    // The input covers only the first two of the four groups.
    std::string xStr("AAAA-12222");
    boost::regex xRegEx("(\\w+)-(\\d+)-(\\w+)-(\\d+)");
    boost::smatch xResults;
    std::cout << "==========================Results==============================\n";
    std::cout << "Does this line match the regex? " << std::boolalpha
              << boost::regex_match(xStr, xResults, xRegEx,
                                    boost::match_default | boost::match_partial) << "\n";
    // For a partial match the call returns true, but xResults[0].matched is false.
    std::cout << "Is it a partial match? " << std::boolalpha << !xResults[0].matched
              << "\nPrint the partial match:\n" << xResults[0] << std::endl;


Output:

    ==========================Results==============================
    Does this line match the regex? true
    Is it a partial match? true
    Print the partial match:
    AAAA-12222



regex_search



This algorithm searches the source string for a substring that matches a given regular expression.

The usage format is as follows:

regex_search(input_string, [match_results], regular_expression, [flags]).

Example of use:

    std::string xStr("The boost library has a great opportunity for the regex!");
    // A word with a doubled letter inside: group 1 is the pair, group 2 the letter.
    boost::regex xRegEx("\\b(?:\\w+?)((\\w)\\2)(?:\\w+?)\\b");
    boost::smatch xResults;
    std::cout << "==========================Results==============================\n";
    std::string::const_iterator xItStart = xStr.begin();
    std::string::const_iterator xItEnd = xStr.end();
    while (boost::regex_search(xItStart, xItEnd, xResults, xRegEx))
    {
        std::cout << "Word, we've searched, is \"" << xResults[0]
                  << "\". It has two \"" << xResults[2] << "\" inside itself.\n";
        // Continue after the whole match so that no word is scanned twice.
        xItStart = xResults[0].second;
    }


Output:

    ==========================Results==============================
    Word, we've searched, is "boost". It has two "o" inside itself.
    Word, we've searched, is "opportunity". It has two "p" inside itself.



regex_replace



This algorithm replaces every substring that matches a regular expression with a string built from a given format. The result can be written through an output iterator passed as an argument, or returned as a string. Parts of the input string that do not match the regular expression are copied to the output unchanged, unless the format_no_copy flag is set, in which case only the replacement text for the matched parts remains in the result. When the format_first_only flag is passed, only the first matching substring is replaced.

Typical usage:

regex_replace(input_string, regular_expression, format_string, [flags]).

format_string defines the text that replaces each matched substring.

It can follow one of these syntaxes, selected through [flags]:

  1. Perl (default; format_perl)
  2. Sed (format_sed)
  3. Boost-extended (format_all)
  4. Literal (format_literal)



Example of use:

    std::string xStr("AAAA-12222-BBBBB-44455");
    boost::regex xRegEx("(\\w+)-(\\d+)-(\\w+)-(\\d+)");
    // $1..$4 refer to the four capture groups of the regular expression.
    std::string xFormatString("$1*$2*$3*$4");
    std::cout << "==========================Results==============================\n";
    std::cout << "Print string after replace:\n"
              << boost::regex_replace(xStr, xRegEx, xFormatString,
                                      boost::match_default | boost::format_perl)
              << std::endl;


Output:

    ==========================Results==============================
    Print string after replace:
    AAAA*12222*BBBBB*44455





Auxiliary means



regex_iterator



This iterator is convenient for walking through all substrings that match a regular expression. Each increment finds the next match using regex_search. Dereferencing the iterator yields a match_results object that provides all the necessary information.

Usage format: regex_iterator(start_iterator, end_iterator, regular_expression)

Example of use:

    std::string xStr("AAAA-12222-BBBBB-44455");
    // \\w already covers digits, so the \\d alternative is redundant but harmless.
    boost::regex xRegEx("(\\w|\\d)+");
    std::cout << "==========================Results==============================\n";
    boost::sregex_iterator xIt(xStr.begin(), xStr.end(), xRegEx);
    // A default-constructed iterator marks the end of the sequence.
    boost::sregex_iterator xInvalidIt;
    while (xIt != xInvalidIt)
        std::cout << *xIt++ << "*";


Output:

    ==========================Results==============================
    AAAA*12222*BBBBB*44455*




regex_token_iterator



A very useful tool for splitting a string into tokens.

Usage format: regex_token_iterator(start_iterator, end_iterator, regular_expression, [submatch])

[submatch] specifies how tokens in the string should be interpreted.

With -1, the iterator returns the parts of the sequence that do not match the regular expression: the text from the end of one match up to (but not including) the start of the next match, or the text from the beginning of the string up to the first match. In other words, with -1 the regular expression acts as a delimiter.

With 0, each increment of the iterator (++) yields the next part of the string that was matched, that is, each dereferenced iterator is a whole match.

With any positive number, the iterator yields the capture group with that number. You can also pass an array of indices; the iterator then yields the captures in the order given by the array. For example, for the array {4, 2, 1} the first iterator position refers to capture 4, the next to capture 2, and so on. The process repeats over the whole sequence until there are no more matches. By default this parameter is 0.

A dereferenced iterator is an object of class sub_match.

Examples of using:

    std::string xStr("AAAA-12222-BBBBB-44455");
    boost::regex xRegEx("(\\w|\\d)+");
    std::cout << "==========================Results==============================\n";
    // Index 0: every token is a whole match.
    boost::sregex_token_iterator xItFull(xStr.begin(), xStr.end(), xRegEx, 0);
    boost::sregex_token_iterator xInvalidIt;
    std::cout << "Result the same as the regex_iterator: \n";
    while (xItFull != xInvalidIt)
        std::cout << *xItFull++ << "*";

    // Index 1: every token is the first capture group of a match.
    boost::regex xRegEx2("(\\w+)-(\\d+)");
    boost::sregex_token_iterator xItFirstCapture(xStr.begin(), xStr.end(), xRegEx2, 1);
    std::cout << "\nShow only first captures: \n";
    while (xItFirstCapture != xInvalidIt)
        std::cout << *xItFirstCapture++ << "*";

    // Array of indices: for each match, group 2 is returned first, then group 1.
    int aIndices[] = {2, 1};
    boost::sregex_token_iterator xItReverseCapture(xStr.begin(), xStr.end(), xRegEx2, aIndices);
    std::cout << "\nShow captures in the reverse order: \n";
    while (xItReverseCapture != xInvalidIt)
        std::cout << *xItReverseCapture++ << "*";

    // Index -1: the tokens are the pieces of text between the matches.
    boost::regex xRegEx3("(\\w|\\d)+");
    boost::sregex_token_iterator xItDelimiters(xStr.begin(), xStr.end(), xRegEx3, -1);
    std::cout << "\nShow delimiters: \n";
    while (xItDelimiters != xInvalidIt)
        std::cout << *xItDelimiters++ << " ";


Output:

    ==========================Results==============================
    Result the same as the regex_iterator:
    AAAA*12222*BBBBB*44455*
    Show only first captures:
    AAAA*BBBBB*
    Show captures in the reverse order:
    12222*AAAA*44455*BBBBB*
    Show delimiters:
    - - -


Comment



Any algorithm can throw an exception of type std::runtime_error if the complexity of matching N elements starts to exceed O(N^2), or if the stack overflows (when Boost.Regex was built in recursive mode).



Source: https://habr.com/ru/post/64226/


