Your own browser game, the way of the mouse: Theory

So, you have decided to launch your own browser game.

You understand all the accompanying nuances and the typical behavior of a browser game administrator: habrahabr.ru/post/249625
You clearly realize that the only reasonable motive for getting into this is self-development, and that even for that it is not the best way: habrahabr.ru/post/249705

The mice pricked themselves and cried, but kept on eating the cactus...

Okay. Your tenacious paws have gotten hold of some browser game engine downloaded from the Internet. At best, it will be an open-source project with a good number of forks (that is, one that has outgrown its "childhood diseases"). Perhaps it will be a so-called "dump", i.e. an engine leaked from a live commercial server together with a database dump. But most likely it will be yet another homemade hack "on the subject". Here we will focus on the last case, because the minimum necessary actions are the same for all three.
Here I implicitly assume that your engine is based on PHP + MySQL. The rest of the story is told for this stack. If you unexpectedly have Python + PostgreSQL, do not worry. The fundamental principles of working with such engines are identical; only the implementation differs.

What should you do with the engine first? This question is not easy to answer without getting familiar with the basic principles of working with data inside a browser game engine (and indeed, inside any site or even any program). These principles are few and very simple.

Data from the user is toxic.

DO NOT TRUST THE USER INPUT!
All data coming from the user is considered to be “toxic” by definition - i.e. wrong, invalid or even malicious until the opposite is proven. This means that all data from the user must be checked on the server side.

Do you think you have a wonderful JavaScript validator that will never let incorrect data through to the backend? MISTAKE! Any request from the frontend of your engine (the part that the user sees directly and that is responsible for interacting with the user) can be forged. So the backend (the part that does the actual data processing on the server side and writes the changes to the database) MUST always repeat the frontend's checks and calculations, to be sure that nobody has fed you garbage.
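A minimal sketch of what such a server-side recheck might look like in PHP. The function name validateAmount and the "buy N units" scenario are made up for illustration and do not come from any particular engine; only filter_var() and FILTER_VALIDATE_INT are standard PHP.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical backend check for a "buy N units" request.
 * Returns the validated amount, or null if the input is rejected.
 */
function validateAmount(mixed $raw, int $max): ?int
{
    // FILTER_VALIDATE_INT accepts only well-formed integers inside the range;
    // anything else (text, floats, arrays, out-of-range values) yields false.
    $amount = filter_var($raw, FILTER_VALIDATE_INT, [
        'options' => ['min_range' => 1, 'max_range' => $max],
    ]);

    return $amount === false ? null : $amount;
}

var_dump(validateAmount('10', 100));             // int(10)
var_dump(validateAmount('10; DROP TABLE', 100)); // NULL
var_dump(validateAmount('101', 100));            // NULL
```

The point is not this particular filter, but that the check runs on the server regardless of what the JavaScript validator has already done.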

Data normalization

All operations inside the frontend and backend should be performed on the source data with minimal conversion. Sounds a little unclear?

Well, for example, you have a sequence of characters that is in fact a multi-line text. It seems there is nothing complicated here and everything is obvious, right? MISTAKE! The string may have come from an external file.

This is where the dancing with a tambourine around the newline character begins. Different systems encode the end of a line differently. Files on Unix-like systems break lines with the LF character (Line Feed, ASCII 0x0A). Older systems (Mac OS up to version 9, Commodore and some others) liked to end a line with the CR character (Carriage Return, ASCII 0x0D). Windows, the most widespread desktop OS (and the heir of MS-DOS), adores the combination CR + LF, in exactly that order!
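One possible way to bring all three conventions to a single form, assuming the engine settles on LF as its internal line separator (that choice, and the helper name normalizeNewlines, are assumptions made for this sketch):

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: convert CR+LF and lone CR line endings to LF.
 */
function normalizeNewlines(string $text): string
{
    // CR+LF is replaced first, so it does not turn into two line breaks
    // when the lone CR characters are replaced afterwards.
    return str_replace(["\r\n", "\r"], "\n", $text);
}

var_dump(normalizeNewlines("1\r\n2\r3\n4") === "1\n2\n3\n4"); // bool(true)
```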

And then along comes the great HTML, for which everything outside the "pre" tag (and other similar tags) is essentially one line, and line breaks are made with the <br> tag. Which, of course, is properly spelled <br />. Variants such as <br> and <br/> are accepted as well.

It is no accident that string normalization is mentioned first. Incorrect string normalization usually does not cause critical errors, but it is what players notice most often. Is it nice to see something like 1\r\n 2\r\n 3 instead of neatly formatted text? Or even the lines 1, 2 and 3 with extra blank lines between them, which happens when an error in data normalization coincides with an error in output formatting.

Numbers are not free of similar problems either. It would seem, what could be simpler than numbers? intval() and floatval() (and their counterparts in JavaScript, although that is a separate conversation: JS lets you shoot yourself in the foot in many sophisticated ways...) should solve all the problems, right? MISTAKE!

What number ends up in the variable if we write intval(0123) in the PHP backend? You think 123? Well, then go ahead and read the manual: php.net/manual/ru/language.types.integer.php
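For those who would rather run it than read it, here is a small sketch of the effect. The variable names are illustrative only.

```php
<?php
declare(strict_types=1);

// An integer literal with a leading zero is octal: 0123 is 1*64 + 2*8 + 3.
$fromLiteral = intval(0123);

// A numeric string is parsed as decimal by default.
$fromString = intval('0123');

// With base 0, intval() detects the octal prefix inside the string itself.
$fromStringBase0 = intval('0123', 0);

var_dump($fromLiteral, $fromString, $fromStringBase0); // int(83), int(123), int(83)
```

The same digits give different numbers depending on whether they arrive as a literal in the code or as a string from outside, which is exactly why the expected form has to be fixed in advance.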

Floating-point numbers in PHP and their comparison are a song of their own. This is not a PHP lesson, so once again I will refer you to the documentation: php.net/manual/ru/language.types.float.php . I suggest paying special attention to the note "Floating point precision" and the subsection "Comparing floats".
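A short sketch of the comparison problem and of the usual workaround with a tolerance. The helper name floatsEqual and the default tolerance of 1e-9 are assumptions chosen for illustration; a suitable tolerance depends on the values your engine actually works with.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: treat two floats as equal if they differ
 * by less than the given tolerance.
 */
function floatsEqual(float $a, float $b, float $epsilon = 1e-9): bool
{
    return abs($a - $b) < $epsilon;
}

$sum = 0.1 + 0.2;

var_dump($sum == 0.3);            // bool(false): binary rounding error
var_dump(floatsEqual($sum, 0.3)); // bool(true)
```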

This is unlikely to matter at the first stage of development, but to show the importance of data normalization I cannot help mentioning one more interesting issue: identifiers in the database. You know, those funny numbers declared as BIGINT(20) in the database that make each record unique.
To fully understand the essence of the problem, turn to the PHP documentation again: php.net/manual/ru/reserved.constants.php . Search the text for the keyword PHP_INT_MAX. Count the number of digits. Compare with the database column declaration above. Hint: the (20) in MySQL is only a display width, but an unsigned BIGINT really can hold values up to 20 digits long. If you handle such an identifier carelessly in PHP or JS code, it can easily be converted to a float with a loss of precision.
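A sketch of how the precision loss shows up, assuming a 64-bit PHP build and identifiers that arrive from the database as strings. The helper isValidId is hypothetical; the idea is simply to keep such an id as a string of digits and never turn it into a number.

```php
<?php
declare(strict_types=1);

// Two different ids near the top of the unsigned BIGINT range, kept as strings.
$idA = '18446744073709551615';
$idB = '18446744073709551614';

// On 64-bit builds PHP_INT_MAX has only 19 digits, so these ids do not fit.
var_dump(strlen((string) PHP_INT_MAX)); // int(19) on 64-bit builds

// Casting to float "works", but both ids collapse into the same value.
var_dump((float) $idA === (float) $idB); // bool(true): precision is lost

/**
 * Hypothetical helper: accept an id only as a string of 1 to 20 digits.
 */
function isValidId(string $raw): bool
{
    return preg_match('/\A[0-9]{1,20}\z/', $raw) === 1;
}

var_dump(isValidId($idA)); // bool(true)
var_dump($idA === $idB);   // bool(false): as strings the ids stay distinct
```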

What conclusion can be drawn from all of the above? Data should be NORMALIZED, i.e. brought to a standardized form that your engine will expect from then on, everywhere. Usually the author(s) of an engine implicitly choose one particular form of normalization. Unfortunately, the same author(s) by no means stick to it always and everywhere.

Escaping Output

You will be surprised how often the game needs to display data to the players! A bunch of interesting revelations await us here as well.

HTML output

The most important rule is this: ANY output of uncontrolled data to HTML must go exclusively through the htmlentities() function: php.net/manual/ru/function.htmlentities.php ! Once again I emphasize: ANY! WITH NO EXCEPTIONS!

Here it should be noted that data in the engine is divided into "controlled" and "uncontrolled". The first group includes simple localization strings, guaranteed normalized integers, guaranteed normalized single-line strings, and... that is probably all. All other data is "uncontrolled" and must pass through the fine sieve of htmlentities(). Yes, this looks like an unnecessary complication, but in the future it will at the very least save you from a broken game layout, and at best it will, to a certain extent, guarantee that the player's computer cannot be infected with hostile scripts (a 100% guarantee is provided only by Gosstrakh).
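A minimal sketch of such a sieve. The wrapper name e(), the chosen flags and the UTF-8 encoding are assumptions made for this example; your engine may use a different encoding or its own wrapper.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical wrapper: escape uncontrolled data before HTML output.
 * ENT_QUOTES also escapes single quotes; ENT_SUBSTITUTE replaces
 * invalid byte sequences instead of returning an empty string.
 */
function e(string $value): string
{
    return htmlentities($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$nickname = '<script>alert("pwned")</script>';

echo '<td>' . e($nickname) . '</td>', "\n";
// <td>&lt;script&gt;alert(&quot;pwned&quot;)&lt;/script&gt;</td>
```

The player sees the nickname exactly as it was typed, and the browser never gets a chance to execute it.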

Here it is worth making a digression and talking in detail about each type of controlled data.

HTML Controlled Data

Let's start with the simplest: guaranteed normalized integers. These are variables $variable for which the check is_int($variable) has returned true. Dumb. Direct. Head-on. It is a 100% guarantee that in the output these variables will consist only of decimal digits (with a leading minus sign for negative values), will contain nothing extraneous and will look equally good in HTML and JS.
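A sketch of that head-on check. The function renderLevel and the markup it produces are invented for illustration. Note that a string of digits is still a string, so it does not pass.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: output a value without escaping
 * only if it is a real PHP integer.
 */
function renderLevel(mixed $level): string
{
    if (!is_int($level)) {
        throw new InvalidArgumentException('level must be an integer');
    }

    return '<span class="level">' . $level . '</span>';
}

var_dump(is_int(42));   // bool(true)
var_dump(is_int('42')); // bool(false): a string of digits is still a string
var_dump(is_int(42.0)); // bool(false)

echo renderLevel(42), "\n"; // <span class="level">42</span>
```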

Simple localization strings are strings from the current locale that consist of a single line, contain no HTML tags, contain no special HTML characters like <>"' and no line breaks, and are neither regexps nor templates. Here, too, everything seems clear at first glance. It seems that a localization string is completely controlled by the engine developer. It seems that there should be no pitfalls, and that all these stipulations are superfluous... MISTAKE!

The most common mistake is using string templates like "Start of line %s rest of line" and then calling the sprintf() function on them. We remember that our data should be normalized, i.e. brought to a standard format. But nowhere is it said that the normalized format is safe for anything at all! Normalizing the format of variables only guarantees that data of the same type looks uniform in different pieces of code, and nothing more. Whether the data is safe for output to HTML or (especially!) for writing to the database is unknown to us (in the general case, by definition, it is not). Therefore, substituting data directly into a template without additional modifiers is not allowed. Moreover, not every modifier guarantees safe output.

Note that here we are talking about the safety of the output, not about its correctness. For example, outputting a variable through the template "%d" is safe but not correct if, say, the input is a string that cannot be converted to an integer. Here again I cannot help referring to the documentation: php.net/manual/ru/function.sprintf.php
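A small sketch of the difference between "%s" and "%d". The template text and the variable names are made up for the example.

```php
<?php
declare(strict_types=1);

$template = 'Player %s has %d gold';

// Both values are "toxic": they came from outside and contain markup.
$name = '<img src=x onerror=alert(1)>';
$gold = '15 coins<script>';

// Unsafe: %s inserts the string as is, markup included.
$unsafe = sprintf($template, $name, $gold);

// Safer: escape whatever goes into %s; %d keeps only the integer part.
$safe = sprintf($template, htmlentities($name, ENT_QUOTES, 'UTF-8'), $gold);

echo $safe, "\n";
// Player &lt;img src=x onerror=alert(1)&gt; has 15 gold

// Safe but not correct: a non-numeric string silently becomes 0.
var_dump(sprintf('%d', 'abc')); // string(1) "0"
```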

Normally, all localization strings should be guaranteed to be simple. However, an ideal is something to strive for, yet unattainable in reality.

Practical considerations and estimates of labor costs say that sometimes you have to make non-simple localization strings, for example by adding HTML formatting directly to a string. I think this is acceptable as long as the developer is aware of all the risks and keeps the use of such non-simple localization strings under control.

Why guaranteed normalized strings should contain only one line / paragraph should be clear from what you have read above. But just in case, I repeat: multi-line strings (even normalized ones) need different handling when they are output to HTML or JS, or saved to a database.

Uncontrolled HTML data

Above, we looked at three types of controlled data that (with reservations) can be output directly to HTML, bypassing the htmlentities() function. All other data MUST be passed through that function! However, there are nuances.

In essence, "uncontrolled data" is user data, plus any strings that do not fit the criteria above. How to work with user data will be described in detail in the next article. Working with engine strings (localization in particular) is quite a delicate topic.
Depending on the quality of the original product, the principles of working with strings that originate in the engine itself can differ greatly. I can only give recommendations, which should by no means be taken as dogma. A lot depends on how the engine represents multi-line data in general and localization strings in particular.

1. If the engine uses CR, LF or the combination CR + LF to separate lines in text, first apply htmlentities(), and then the PHP function nl2br(): php.net/manual/ru/function.nl2br.php
2. If the engine uses the <br> tag to separate lines, output the data directly, provided you have fully checked the code and are confident that it is safe
3. All output templates that use the "%s" format and the like MUST always go through htmlentities()!
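The order of calls in the first recommendation matters, and a short sketch shows why. The helper name renderMultiline, the flags and the UTF-8 encoding are assumptions made for this example.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: output multi-line text to HTML.
 * Escape first, then turn line breaks into <br /> tags.
 */
function renderMultiline(string $text): string
{
    $escaped = htmlentities($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // nl2br() inserts <br /> before each line break and keeps the break itself.
    return nl2br($escaped);
}

$text = "<b>1</b>\n2";

// Right order: the markup is neutralized, the line break survives.
var_dump(renderMultiline($text) === "&lt;b&gt;1&lt;/b&gt;<br />\n2"); // bool(true)

// Wrong order: the <br /> added by nl2br() gets escaped too and shows up as text.
$wrong = htmlentities(nl2br($text), ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
var_dump(str_contains($wrong, '&lt;br /&gt;')); // bool(true)
```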

In any case, you must ALWAYS remember the fundamental principle: it is better to show the player an "ugly" but guaranteed safe string than to deliver "beauty" in 99.99% of cases and infect the player's computer with a malicious script in the remaining 0.01%.

In the next article:
- Data output in JavaScript: quirks and gotchas;
- Writing data to the database: how not to get caught by SQL injection;
- How to make sure that data from the user is not "toxic";
- ... and much more

Source: https://habr.com/ru/post/275767/
