
PHP, PREG and UTF-8

In this post we will discuss how PHP5 handles multibyte strings in the preg_*() functions.

I ran into an interesting state of affairs. It was described on the Internet long ago, but it is still relevant today (the question resurfaced because of a recent post about trim()).

For example, here’s a small script:

<?php

print "Locale: " . setlocale(LC_ALL, 0) . "\n";

/**
 * Run preg_match_all and print every match with its offset.
 * @param string $comment  label printed before the results
 * @param string $pattern  pattern passed to preg_match_all
 * @param bool   $usePatch use mb_preg_match_all instead of preg_match_all
 * @return void
 */
function preg_test($comment, $pattern, $usePatch = false) {

    $test = "one three";

    print "\n<strong>{$comment}:</strong> <u>{$pattern}</u>\n";

    if ($usePatch) mb_preg_match_all($pattern, $test, $matches, PREG_OFFSET_CAPTURE);
    else preg_match_all($pattern, $test, $matches, PREG_OFFSET_CAPTURE);

    foreach ($matches[0] as $v) print " match: «{$v[0]}», offset: {$v[1]}\n";

}

/**
 * preg_match_all wrapper that takes and returns offsets in characters, not bytes.
 */
function mb_preg_match_all(
    $ps_pattern,
    $ps_subject,
    &$pa_matches,
    $pn_flags = PREG_PATTERN_ORDER,
    $pn_offset = 0,
    $ps_encoding = NULL
) {

    // WARNING! - All this function does is to correct offsets, nothing else:
    // (code is independent of PREG_PATTERN_ORDER / PREG_SET_ORDER)

    if (is_null($ps_encoding)) $ps_encoding = mb_internal_encoding();

    // Convert the starting offset from characters to bytes.
    $pn_offset = strlen(mb_substr($ps_subject, 0, $pn_offset, $ps_encoding));
    $ret = preg_match_all($ps_pattern, $ps_subject, $pa_matches, $pn_flags, $pn_offset);

    // Convert every captured byte offset back to a character offset.
    // Distinct loop variables avoid rebinding the outer reference inside the inner loop.
    if ($ret && ($pn_flags & PREG_OFFSET_CAPTURE)) {
        foreach ($pa_matches as &$ha_group)
            foreach ($ha_group as &$ha_match)
                $ha_match[1] = mb_strlen(substr($ps_subject, 0, $ha_match[1]), $ps_encoding);
        unset($ha_group, $ha_match);
    }

    return $ret;

}

preg_test("«\w»", "/[\w]+/i");
preg_test("Character range", "/[-a-z]+/i");
preg_test("«\w» «/u»", "/[\w]+/ui");
preg_test("Character range «/u»", "/[-a-z]+/ui");
preg_test("«\pL», «/u»", "/[\pL]+/i");
preg_test("«\p{Cyrillic}», «/u»", "/[\p{Cyrillic}]+/i");
preg_test("(!) «\pL»", "/[\pL]+/i", true);

$source = highlight_file(__FILE__, true);

?>

A working example is available at http://test.dis.dj/utf/
What conclusions can be drawn from the output:
  1. The offset from the beginning of the string is always counted in bytes:
    3 bytes for "one" +
    1 byte for the space +
    3 × 2 bytes for a three-letter Cyrillic word +
    1 byte for the space +
    3 × 2 bytes for another three-letter Cyrillic word +
    1 byte for the space =
    18 bytes,
    whereas it should be
    3 + 1 + 3 + 1 + 3 + 1 = 12 characters.
  2. Cyrillic is matched correctly only by the "Character range" with the "/u" key and by the "\pL" escape, which means "Unicode letter".
  3. The "\w" class does not match Cyrillic at all; even the "/u" key does not help.
  4. On a server running Windows Server 2008, for some unknown reason, the very first pattern worked, while the "/u" key was not available at all :)
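
The byte-versus-character arithmetic from point 1 can be checked directly. Below is a minimal sketch, assuming the mbstring extension is loaded and the file is saved as UTF-8. The sample string and the helper name byte_to_char_offset are made up for illustration and are not part of the original script.

```php
<?php
// Hypothetical sample: an ASCII word, two three-letter Cyrillic words, another ASCII word.
$subject = "one два три end";

// Hypothetical helper: turn a byte offset reported by preg_* into a character offset.
function byte_to_char_offset($subject, $byteOffset, $encoding = 'UTF-8') {
    return mb_strlen(substr($subject, 0, $byteOffset), $encoding);
}

// With /u, \pL matches whole Unicode letters, but offsets are still reported in bytes.
preg_match_all('/\pL+/u', $subject, $matches, PREG_OFFSET_CAPTURE);

foreach ($matches[0] as $m) {
    list($text, $byteOffset) = $m;
    $charOffset = byte_to_char_offset($subject, $byteOffset);
    print "match: «{$text}», byte offset: {$byteOffset}, character offset: {$charOffset}\n";
}
```

For the last word this prints a byte offset of 18 and a character offset of 12, the same numbers as in the calculation above.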

Well, we are waiting for PHP6, which promises proper support for UTF strings, including the BOM that currently breaks our scripts by sending 3 bytes before header(). In fact, PHP6 promises plenty of other bonuses...

P.S. This post in no way claims to be the "discovery of America"; I have simply gathered the information I know.

UPD. In the course of the discussion we arrived at the following replacements for "\w": either the recommended combination "(?:\p{L}|\p{M}|\p{Nd}|\p{Pc})" or "[\p{L}\p{Nd}]" (if you want something shorter). Thanks, khim.
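
As a rough illustration of the shorter replacement, the sketch below applies a letter-or-digit class with the "/u" key. The sample string is made up, and the script assumes the file is saved as UTF-8.

```php
<?php
// Hypothetical sample mixing Latin letters, Cyrillic letters and digits.
$subject = "abc привет 123";

// Shorter \w substitute: any Unicode letter or decimal digit.
// The /u key makes PCRE treat both the pattern and the subject as UTF-8.
preg_match_all('/[\p{L}\p{Nd}]+/u', $subject, $matches);

// Each multibyte word is kept whole.
print implode('|', $matches[0]) . "\n"; // prints "abc|привет|123"
```

Unlike "\w", this class does not match the underscore; adding \p{Pc} to the class would cover it.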

Source: https://habr.com/ru/post/45910/

