PHP, PREG and UTF-8

This post discusses how PHP 5 handles multibyte strings in the preg_*() functions.

I came across an interesting state of affairs. It was described on the Internet long ago, but it is still relevant today (the question resurfaced because of a recent post about trim()).

For example, here’s a small script:

<?php
// Print the current locale.
print ": " . setlocale(LC_ALL, 0) . "\n";

/**
 * Runs preg_match_all on a test string and prints every match with its offset.
 *
 * @param string $comment  label printed before the results
 * @param string $pattern  pattern passed to preg_match_all
 * @param bool   $usePatch use mb_preg_match_all instead of preg_match_all
 * @return void
 */
function preg_test($comment, $pattern, $usePatch = false)
{
    $test = "one three";
    print "\n{$comment}: {$pattern}\n";
    if ($usePatch)
        mb_preg_match_all($pattern, $test, $matches, PREG_OFFSET_CAPTURE);
    else
        preg_match_all($pattern, $test, $matches, PREG_OFFSET_CAPTURE);
    // Each $v is array(matched text, offset).
    foreach ($matches[0] as $v)
        print " : «{$v[0]}», : {$v[1]}\n";
}

/**
 * Wrapper taken from the comments to preg_match_all; it only corrects offsets.
 */
function mb_preg_match_all($ps_pattern, $ps_subject, &$pa_matches, $pn_flags = PREG_PATTERN_ORDER, $pn_offset = 0, $ps_encoding = NULL)
{
    // WARNING! - All this function does is to correct offsets, nothing else:
    // (code is independent of PREG_PATTERN_ORDER / PREG_SET_ORDER)
    if (is_null($ps_encoding))
        $ps_encoding = mb_internal_encoding();

    // Convert the starting offset from characters to bytes.
    $pn_offset = strlen(mb_substr($ps_subject, 0, $pn_offset, $ps_encoding));
    $ret = preg_match_all($ps_pattern, $ps_subject, $pa_matches, $pn_flags, $pn_offset);

    // Convert every captured byte offset back to a character offset.
    // The outer and inner loops use distinct reference variables.
    if ($ret && ($pn_flags & PREG_OFFSET_CAPTURE)) {
        foreach ($pa_matches as &$ha_group)
            foreach ($ha_group as &$ha_match)
                $ha_match[1] = mb_strlen(substr($ps_subject, 0, $ha_match[1]), $ps_encoding);
        unset($ha_group, $ha_match);
    }
    return $ret;
}

preg_test("« »", "/[\w]+/i");
preg_test("Character range", "/[-a-z]+/i");
preg_test("« » «/u»", "/[\w]+/ui");
preg_test("Character range «/u»", "/[-a-z]+/ui");
preg_test(" «\pL», «/u»", "/[\pL]+/i");
preg_test(" «\p{Cyrillic}», «/u»", "/[\p{Cyrillic}]+/i");
preg_test("(!) «\pL» ", "/[\pL]+/i", true);

$source = highlight_file(__FILE__, true);
?>

The working example is at http://test.dis.dj/utf/.
What conclusions can be drawn from the output:

The offset from the beginning of the string is always counted in bytes:
3 bytes for "one" +
1 byte for the space +
3 × 2 bytes for “two” (each Cyrillic letter takes 2 bytes in UTF-8) +
1 byte for the space +
3 × 2 bytes for “two” +
1 byte for the space =
18 bytes,
whereas it should be
3 + 1 + 3 + 1 + 3 + 1 = 12 characters.
Cyrillic is matched correctly only by the "Character range" test with the "/u" modifier and by the "\pL" property, which means "Unicode letter".
The "\w" class does not match Cyrillic at all; even the "/u" modifier does not help.
On a server running Windows Server 2008, for some unknown reason, the very first construction worked, and the “/u” modifier no longer exists :)
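
The byte-versus-character arithmetic above is easy to check directly. The following is a minimal sketch, assuming the file is saved as UTF-8 and the mbstring extension is loaded; the sample string is an illustrative assumption, not the exact string from the script above.

```php
<?php
// Illustrative sample (an assumption): one ASCII word and two Cyrillic words.
$subject = "one два три";

// strlen() counts bytes; mb_strlen() counts characters.
$bytes = strlen($subject);             // 3 + 1 + 3*2 + 1 + 3*2 = 17
$chars = mb_strlen($subject, 'UTF-8'); // 3 + 1 + 3 + 1 + 3 = 11

// PREG_OFFSET_CAPTURE reports byte offsets, even with the /u modifier.
preg_match_all('/\pL+/u', $subject, $matches, PREG_OFFSET_CAPTURE);

$byteOffsets = array();
foreach ($matches[0] as $m) {
    $byteOffsets[] = $m[1]; // $m is array(matched text, byte offset)
}

printf("%d bytes, %d characters\n", $bytes, $chars);
printf("byte offsets: %s\n", implode(', ', $byteOffsets));
```

In this sample the third word starts at byte 11 but at character 8, which is the same kind of discrepancy as in the count above.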

Useful links:

Codenet forum thread.
More information about the PCRE engine and its modifiers can be found in the official documentation.
Another thread on ixbt has a good explanation of “/u”.
The comments to preg_match_all contain the function mb_preg_match_all, which converts byte offsets into correct character offsets (it is the one used in this post).
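
The offset correction done by mb_preg_match_all comes down to two small conversions. Here is a minimal sketch of both directions; the helper names byte_to_char_offset and char_to_byte_offset are hypothetical and exist only for this illustration, and the sample string is an assumption.

```php
<?php
/**
 * Hypothetical helper: converts a byte offset reported by preg_* functions
 * into a character offset by counting the characters that precede it.
 */
function byte_to_char_offset($subject, $byteOffset, $encoding = 'UTF-8')
{
    return mb_strlen(substr($subject, 0, $byteOffset), $encoding);
}

/**
 * Hypothetical helper: converts a character offset into the byte offset
 * expected by the $offset parameter of preg_* functions.
 */
function char_to_byte_offset($subject, $charOffset, $encoding = 'UTF-8')
{
    return strlen(mb_substr($subject, 0, $charOffset, $encoding));
}

// Illustrative sample (an assumption), saved as UTF-8.
$subject = "one два три";

preg_match('/три/u', $subject, $m, PREG_OFFSET_CAPTURE);
$reportedByteOffset = $m[0][1];
$charOffset = byte_to_char_offset($subject, $reportedByteOffset);
$byteOffset = char_to_byte_offset($subject, $charOffset);

printf("byte %d -> char %d -> byte %d\n", $reportedByteOffset, $charOffset, $byteOffset);
```

Both helpers walk the prefix of the string on every call, so for many matches in a long string this is noticeably slower than a plain preg_match_all.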

Well, we are waiting for PHP 6, which promises proper support for UTF strings, including the BOM that currently breaks our scripts by outputting 3 bytes before header(). In fact, PHP 6 promises a lot of other bonuses too ...
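
In the meantime, a BOM in text that a script reads (a template, for instance) can be removed by hand before it is printed. A minimal sketch, assuming the text is already in a string; strip_utf8_bom is a hypothetical helper name. A BOM at the very start of a PHP source file itself cannot be fixed this way and has to be removed in the editor.

```php
<?php
/**
 * Hypothetical helper: removes a leading UTF-8 byte order mark, if present.
 */
function strip_utf8_bom($text)
{
    $bom = "\xEF\xBB\xBF"; // the 3 bytes of the UTF-8 BOM
    if (strncmp($text, $bom, 3) === 0) {
        // The cast covers older PHP versions where substr() may return false.
        return (string) substr($text, 3);
    }
    return $text;
}

$withBom = "\xEF\xBB\xBF" . "<html>";
$clean = strip_utf8_bom($withBom);

printf("%d bytes before, %d bytes after\n", strlen($withBom), strlen($clean));
```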

P.S. This post does not in any way claim to be the “discovery of America”; I just collected the information I know.

UPD. In the course of the discussion, we came to the following replacement for “\w”: either the recommended conglomerate “(?:\p{L}|\p{M}|\p{Nd}|\p{Pc})” or “[\p{L}\p{Nd}]” (if you want a shorter one). Thank you, khim.
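
The difference between the two replacements can be seen on a string with underscores. This is a minimal sketch, assuming a UTF-8 source file and the “/u” modifier on both patterns; the sample string is an illustrative assumption.

```php
<?php
// Illustrative sample (an assumption): identifiers in Latin and Cyrillic.
$subject = "snake_case имя_1";

// Short form: letters and decimal digits only, so "_" splits the words.
preg_match_all('/[\p{L}\p{Nd}]+/u', $subject, $short);

// Long form: also accepts combining marks and connector punctuation such as "_".
preg_match_all('/(?:\p{L}|\p{M}|\p{Nd}|\p{Pc})+/u', $subject, $long);

print implode(' | ', $short[0]) . "\n"; // snake | case | имя | 1
print implode(' | ', $long[0]) . "\n";  // snake_case | имя_1
```

The long form is the closer match for “\w”, since “\w” also accepts the underscore.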

Source: https://habr.com/ru/post/45910/
