htmlspecialchars() improvements in PHP 5.4

There has been a lot of talk about the new features in PHP 5.4, such as traits and the short array syntax.

But one particularly important change is often forgotten: for PHP 5.4, cataphract (Artefacto on StackOverflow) heroically rewrote most of htmlspecialchars.

The changes in question apply not only to htmlspecialchars, but also to htmlentities, htmlspecialchars_decode, html_entity_decode and get_html_translation_table.
Here is a brief overview of the most important changes:

UTF-8 default encoding
Improved error handling (ENT_SUBSTITUTE)
Doctype processing (ENT_HTML401, ...)

UTF-8 default encoding

As you know, the third argument for htmlspecialchars is an encoding. Most people simply overlook this argument, thus getting the default encoding. This value was ISO-8859-1 until PHP 5.4. The new version fixes this by making UTF-8 the default.
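
The effect of the encoding argument is easiest to see with htmlentities, which converts every character that has a named entity. The following is a minimal sketch, not taken from the original article; the variable names are only illustrative. It passes the same UTF-8 bytes once with the old default encoding and once with the new one:

```php
<?php
// "é" (U+00E9) encoded as UTF-8 occupies two bytes: 0xC3 0xA9.
$utf8 = "\xC3\xA9";

// Read as ISO-8859-1 (the default before 5.4), each byte is a separate character.
$asLatin1 = htmlentities($utf8, ENT_QUOTES, 'ISO-8859-1');

// Read as UTF-8 (the default since 5.4), the two bytes form a single character.
$asUtf8 = htmlentities($utf8, ENT_QUOTES, 'UTF-8');

var_dump($asLatin1); // string(14) "&Atilde;&copy;"
var_dump($asUtf8);   // string(8) "&eacute;"
```

Passing the encoding explicitly, as in both calls here, keeps the result independent of the PHP version's default.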

Improved error handling

Error handling in htmlspecialchars before 5.4 was ... hmm, let's call it “non-intuitive”:

If you pass a string containing an “invalid code unit sequence” (Unicode-speak for “incorrectly encoded string”), htmlspecialchars returns an empty string. Well, okay, so far so good. The funny thing is that it additionally raises an error, but only if error display (display_errors) is turned off. Wonderful, isn't it?

This basically means that on your development machine you will not see any errors, while in production the error log fills up with them. Amazing.

In PHP 5.4, fortunately, this behavior is history. Errors will no longer be generated.

In addition, there are two flags that let you choose an alternative to returning an empty string:

ENT_IGNORE: This flag (which is not really new, it already existed in PHP 5.3) simply drops every invalid code unit sequence. This is bad for two reasons: first, you will not notice that the input was incorrectly encoded. Second, it poses a certain security risk.
ENT_SUBSTITUTE: This is the new alternative. Instead of simply being deleted, invalid sequences are replaced with the Unicode replacement character (U+FFFD).

Let's look at the different behaviors:

<?php // "\80"  UTF-8    var_dump(htmlspecialchars("a\x80b")); // string(0) "" var_dump(htmlspecialchars("a\x80b", ENT_IGNORE)); // string(2) "ab" var_dump(htmlspecialchars("a\x80b", ENT_SUBSTITUTE)); // string(5) "a b"

Obviously, the last option is preferable. In a real application, it will look like this:

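<?php
// In the bootstrap, so that the code also works on 5.3
if (!defined('ENT_SUBSTITUTE')) {
    define('ENT_SUBSTITUTE', 0); // empty-string behavior on 5.3
    // or:
    // define('ENT_SUBSTITUTE', ENT_IGNORE); // drop invalid sequences on 5.3
}

// Pass the charset explicitly so that 5.3 also uses UTF-8
$escaped = htmlspecialchars($string, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');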

Doctype processing

PHP 5.4 adds four flags that specify which doctype to use:

ENT_HTML401 (HTML 4.01) => used by default
ENT_HTML5 (HTML 5)
ENT_XML1 (XML 1)
ENT_XHTML (XHTML)

Depending on which doctype you specify, htmlspecialchars (and the related functions) will use different entity tables.

Example:

<?php
var_dump(htmlspecialchars("'", ENT_QUOTES | ENT_HTML401)); // string(6) "&#039;"
var_dump(htmlspecialchars("'", ENT_QUOTES | ENT_HTML5));   // string(6) "&apos;"

Thus, for HTML 5 the &apos; entity is returned, while for HTML 4.01, which does not support &apos;, the numeric character reference &#039; is returned.
The difference becomes more apparent with htmlentities, because its entity tables differ far more between doctypes.
You can easily see this by looking at the raw translation tables.

To do this, you can use the get_html_translation_table function. Here is an example for the XML 1 doctype:

 <?php var_dump(get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_XML1));

Output:

array(5) {
["""]=>
string(6) "&quot;"
["&"]=>
string(5) "&amp;"
["'"]=>
string(6) "&apos;"
["<"]=>
string(4) "&lt;"
[">"]=>
string(4) "&gt;"
}

This is in line with our expectations: XML itself defines only five basic entities.
Now let's try the same for HTML 5, and we'll see something like this:

array(1510) {
[" "]=>
string(5) "&Tab;"
["
"]=>
string(9) "&NewLine;"
["!"]=>
string(6) "&excl;"
["""]=>
string(6) "&quot;"
["#"]=>
string(5) "&num;"
["$"]=>
string(8) "&dollar;"
["%"]=>
string(8) "&percnt;"
["&"]=>
string(5) "&amp;"
["'"]=>
string(6) "&apos;"
// ...
}

HTML 5 defines a large number of entities: 1510, to be exact. You can also try specifying HTML 4.01 and XHTML; both define 253 entities.
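
To compare the tables without dumping them in full, you can simply count their entries. This is a rough sketch rather than part of the original article: the $doctypes and $counts arrays are illustrative names, and the exact numbers may differ slightly between PHP versions, so no output is shown.

```php
<?php
// Doctype flags from the list above, keyed by a label for printing.
$doctypes = array(
    'HTML 4.01' => ENT_HTML401,
    'XHTML'     => ENT_XHTML,
    'XML 1'     => ENT_XML1,
    'HTML 5'    => ENT_HTML5,
);

$counts = array();
foreach ($doctypes as $label => $flag) {
    // ENT_QUOTES makes sure both quote characters are part of the table.
    $table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | $flag, 'UTF-8');
    $counts[$label] = count($table);
    echo $label, ': ', $counts[$label], " entities\n";
}
```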

The selected document type also affects another new error handling flag, which I did not mention above: ENT_DISALLOWED. With this flag, characters that are validly encoded but not allowed in the given doctype are replaced with the Unicode replacement character.

In this way, you can ensure that the returned string is always well formed with respect to encoding (for this type of document). I'm not sure how much this flag really gains you, though. Browsers handle invalid characters gracefully anyway, so it seems unnecessary to me (although I'm probably wrong).
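
As a small illustration of ENT_DISALLOWED (a sketch under the assumption that U+FFFF, a Unicode noncharacter, is among the code points PHP treats as disallowed for HTML 5; the variable names are illustrative), the same input can be escaped with and without the flag:

```php
<?php
// U+FFFF is validly encoded in UTF-8 (0xEF 0xBF 0xBF) but is a noncharacter.
$input = "x\xEF\xBF\xBFy";

$plain      = htmlspecialchars($input, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$disallowed = htmlspecialchars($input, ENT_QUOTES | ENT_HTML5 | ENT_DISALLOWED, 'UTF-8');

// Without the flag the code point should pass through untouched;
// with it, the code point should be replaced by U+FFFD (0xEF 0xBF 0xBD).
var_dump($plain === $input);
var_dump($disallowed === "x\xEF\xBF\xBDy");
```

Note that this is different from ENT_SUBSTITUTE, which only deals with byte sequences that are not valid in the given encoding at all.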

That's not all

... but I do not want to list everything here. I think that the three changes mentioned above are the most important of the improvements.

 <?php htmlspecialchars("<\x80The End\xef\xbf\xbf>", ENT_QUOTES | ENT_HTML5 | ENT_DISALLOWED | ENT_SUBSTITUTE, 'UTF-8');

Source: https://habr.com/ru/post/137296/
