Oracle regular expressions. Dangerous range

The Oracle developer, who often uses regular expressions in code, especially on bases with orthodox settings, may sooner or later encounter a phenomenon that you can’t name except mysticism. A long search for the causes of the problem can lead to weight loss, appetite and provoke various kinds of psychosomatic disorders - all of which I will now try to prevent. And the regexp_replace function will help me in this. It can have up to 6 arguments:

REGEXP_REPLACE (

source_string
template,
replacement_string
the position of the start of the match with the template (default 1),
the number of the occurrence of the pattern in the source string (by default, 0 - all occurrences),
modifier (for now dark horse)

)
Returns the modified source_line in which all occurrences of the pattern are replaced by the value passed in the replace_string parameter. Often use the short version of the function, where the first 3 arguments are given, which is enough to solve many problems. I'll do it too. Suppose we need in the string 'MASK: lower case' to mask all lowercase characters with asterisks. To specify a range of lowercase characters, the pattern '[az]' should be suitable. Check
')

select regexp_replace('MASK: lower case', '[az]', '*') as result from dual

Expectation

 +------------------+ | RESULT | +------------------+ | MASK: ***** **** | +------------------+

Reality

 +------------------+ | RESULT | +------------------+ | *A**: ***** **** | +------------------+

If on your base this phenomenon is not reproduced, it means that you are lucky so far. But more often begin digging in encodings, converting strings from one set of characters to another, and over time comes about this state

Establishing diagnosis.

The question arises - what is so special about the letter 'A' that it has not been replaced, because the other uppercase characters should not have been either. Maybe, besides her, are there any other correct letters? It is necessary to watch the entire alphabet of capital letters.

 select regexp_replace('ABCDEFJHIGKLMNOPQRSTUVWXYZ', '[az]', '*') as alphabet from dual +----------------------------+ | ALPHABET | +----------------------------+ | A************************* | +----------------------------+

But.

And the joke is this. If the 6th function argument is not explicitly specified - a modifier, for example, 'i' is case-insensitive, or 'c' is case-sensitive when comparing the source string with a pattern, then the regular expression uses the NLS_SORT parameter of the session / base by default. I have it like this:

 select value from sys.nls_session_parameters where parameter = 'NLS_SORT' +---------+ | VALUE | +---------+ | RUSSIAN | +---------+

This parameter sets the sorting method in ORDER BY. If we are talking about the sorting of simple single characters, then each of them in the binary representation corresponds to a number (NLSSORT – code) and the sorting actually occurs according to the magnitude of these numbers.

For a visual example, take the first few and last few characters of the alphabet, both lowercase and uppercase, and put them in a conditionally unordered table set, let's call it ABC. Then we sort this set by the SYMBOL field and next to each symbol we display its NLSSORT – code in HEX format.

 with ABC as ( select column_value as symbol from table(sys.odcivarchar2list('A','B','C','X','Y','Z','a','b','c','x','y','z')) ) select symbol, nlssort(symbol) nls_code_hex from ABC order by symbol

SYMBOL	NLS_CODE_HEX
A	14000100
a	14000200
B	19000100
b	19000200
C	1E000100
c	1E000200
X	7D000100
x	7D000200
Y	82000100
y	82000200
Z	87000100
z	87000200

The request contains the ORDER BY by the SYMBOL field, but in fact the database was sorted according to the values from the NLS_CODE_HEX field.

Now let's go back to the range from the template and look at the table - what is vertically between the character 'a' (code 14000200) and 'z' (code 87000200)? All but the capital letter 'A'. This is all an asterisk and replaced. And the code 14000100 of the letter 'A' did not fall into the range of replacement from 14000200 to 87000200.

Cyrillic is the same story. Below is a query with similar results, their reasons are now not difficult to understand.

 select 1 id, regexp_replace('ABCDEFJHIGKLMNOPQRSTUVWXYZ', '[az]', '*') result from dual union all select 2, regexp_replace('abcdefjhigklmnopqrstuvwxyz', '[AZ]', '*') from dual union all select 3, regexp_replace('', '[-]', '*') from dual union all select 4, regexp_replace('', '[-]', '*') from dual

ID	RESULT
one	A *************************
2	************************* z
3	BUT*****************************
four	*****************************I

Treatment.

Explicitly specify a case-sensitive modifier

 select regexp_replace('MASK: lower case', '[az]', '*', 1, 0, 'c') from dual +------------------+ | RESULT | +------------------+ | MASK: ***** **** | +------------------+

In some sources they write that the modifier 'c' is set by default, but just that we saw that this is not quite so. And if someone doesn’t see, it means that the NLS_SORT parameter of its session / base is most likely set to BINARY and the sorting is in accordance with the actual character codes. Indeed, if you change the session parameter, the problem will go away.

 ALTER SESSION SET NLS_SORT=BINARY; select regexp_replace('MASK: lower case', '[az]', '*') as result from dual +------------------+ | RESULT | +------------------+ | MASK: ***** **** | +------------------+

Tests, if anything, were conducted in Oracle 12c.

In the meantime, everything. Good health.

Source: https://habr.com/ru/post/269387/

All Articles

Oracle regular expressions. Dangerous range

More articles: