Data validation essay

In the note “Can I divide by 0.01?” on a testers' website, I wrote that testing should check whether input validators are consistent with the logic the application uses to process that data. From the comments on that note, I realized that to understand how to test data validation, you first need to understand how it should work and what counts as correct. So I wrote a separate article about it. It addresses three questions: 1) why data validation is needed at all, 2) where and when it can be performed, and 3) what kinds of checks exist. And of course, it shows how all this looks in live examples. Perhaps my reasoning will be of interest not only to testers but also to developers.

Why do we need data validation?

It would seem that “invalid” data, which does not satisfy certain constraints, can crash a program. But what does this mean in practice? Suppose that somewhere in the program an exception is raised when a string with an incorrect format is converted to a number. Of course, if the exception is not caught anywhere, the program may crash. But this is an unlikely scenario. Most likely a handler somewhere will catch it and either show the user an error message or write to the error log, after which the program will try to recover from the failure and continue working. That is, even without validation, it is likely that nothing terrible will happen.
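The scenario just described can be sketched in a few lines of Python. This is only an illustration: the function names, the logger name, and the messages are hypothetical and do not come from any particular application.

```python
import logging

logger = logging.getLogger("app")  # hypothetical logger name


def parse_quantity(text: str) -> int:
    """Convert user text to a number; int() raises ValueError on a bad format."""
    return int(text)


def handle_request(text: str) -> str:
    """Top-level code that catches the exception instead of crashing."""
    try:
        quantity = parse_quantity(text)
    except ValueError:
        # Nothing validated the text earlier, so the failure surfaces here:
        # write to the error log and give the user a message.
        logger.error("could not parse quantity: %r", text)
        return "Error: please enter a whole number"
    return f"Ordered {quantity} item(s)"
```

Calling `handle_request("abc")` returns the error message and the program keeps running, which is why the absence of validation often goes unnoticed.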

Still, a lack of validation can have negative consequences. Let's take a closer look at the problems that may arise.

Inability to recover from a failure. The program cannot always “put everything back.” Perhaps it has already performed some irreversible actions: deleted a file, sent data over the network, printed something, or started a cutting machine that has partially processed the workpiece. And even if recovery is possible in principle, the recovery algorithm may itself contain errors, which sometimes leads to very sad consequences.
Additional system load. Recovering from a failure is extra work, and all the work done before the failure is wasted too. This means additional load on the system, which could be avoided by checking the data in advance. On the other hand, validation is also a load: recovery is needed only occasionally, while the check must run every time, so it is not obvious which is cheaper.
Injections do not cause failures. One of the main ways to exploit vulnerabilities in programs is to “trick” validators, that is, to pass data that the validator accepts as valid but that is then interpreted in an unintended way, so that an attacker gains unauthorized access to data or program features, or is able to destroy data or the program itself. If there is no validation at all, the attacker's task becomes as simple as it can be.
Difficulty of identifying the cause of the problem. If an exception is thrown from somewhere deep inside the program, determining its cause is not easy. And even when it is possible, it may be hard to explain to the user that the failure was caused by data he entered some time ago in a completely different part of the program. If the check is performed immediately after data entry, the source of the problem is obvious.

In short, a lack of validation can lead to the problems above (and perhaps some others). Accordingly, validation helps prevent serious failures and simplifies the identification of problems, but it is paid for with performance, since additional checks increase the load on the system. This brings us to the second question: how to reduce this additional load.

Where and when to validate data?

As mentioned above, from the point of view of reducing the load, it is best not to validate the data at all.
But if verification is necessary after all, logic dictates that it is most convenient to check data at the point where it enters the program from the outside world. After such a check, you can be sure that only correct data gets into the program and that it can be used later without additional checks. The entry point can be the user interface through which a person enters data. It can be a file containing program settings or data that the program must process. It can be a database into which information arrives from other programs. It can be a network protocol for exchanging data with other programs. Finally, it can be a programming interface that another program uses by calling functions or procedures and passing parameters to them.

Alas, common sense is sometimes forced to retreat before the onslaught of reality. “Face control” of data at the entrance is sometimes not merely inconvenient but outright impossible. Below are some reasons for this.

Validation requires access to a part of the system state that is not available at the entry point. This is especially true for checking human input in a graphical user interface. Modern applications are often built with a multi-layered architecture, in which the user interface is isolated in a presentation layer, while the check needs access to other layers, all the way down to the database layer. This is especially noticeable in web applications, where the user interface runs in the browser on the client side, and verifying the input requires a comparison with what is stored in the database. In this situation the check has to be performed after the data is sent to the server. (However, with the advent of AJAX, this problem has been partially solved.)
Validation requires repeating the processing logic. As noted above, in a multi-layered architecture the user interface usually lives in a dedicated presentation layer, and the data processing logic sits in a different layer. There are situations where validation requires performing this processing almost completely, because there is no shorter way to find out whether it will succeed.

How to validate data?

Wherever validation is performed, it can be done in several different ways, depending on the constraints imposed on the data.

Character-by-character check. Such checks are typically performed in the user interface as data is entered, but not only there. For example, a compiler's lexical analyzer also detects invalid characters while reading the file being compiled. Such checks can therefore be called “lexical.”
Check of individual values. For a user interface, this means checking the value in a single field, either during input (the incomplete value entered so far is checked) or after input is complete, when the field loses focus. For a programming interface (API), it means checking one of the parameters passed to the called procedure. For data read from a file, it means checking some fragment of the file that has been read. By analogy with compiler terminology again, such checks can be called “syntactic.”
Check of the set of input values. Suppose some data is first passed to the program, and then a signal is given that starts its processing. For example, the user enters data in a form or in several forms (a so-called “wizard”) and finally presses the “OK” button. At this point it is possible to perform so-called “semantic” checks, which validate not only individual values but also the relationships and mutual constraints between them. It is quite possible that each individual value is “syntactically” correct, yet together they form an inconsistent set. For a programming interface, this kind of validation checks the whole set of input parameters of the called procedure; for data read from a file, it checks all the data read.
Check of the system state after processing the data. Finally, there is a last resort for cases where the input data cannot be validated directly: try to process the data, but keep the ability to return everything to its original state. Such a mechanism is often called transactional.

A transaction is a sequence of actions that either all complete successfully or, if some action fails, has the results of all the preceding actions in the chain undone. Validation can be performed while the transaction is executing, and the last check can be performed at the very end of the data processing transaction. At that point we validate not the data itself but the state that results from processing it completely, and if this state violates any constraint, we declare the input data invalid and return everything to its original state.
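A minimal sketch of such a "validate the resulting state, then roll back" check, using Python's standard `sqlite3` module. The `accounts` table, the `transfer` function, and the rule that no balance may go negative are assumptions made up for this illustration.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: int, dst: int, amount: int) -> bool:
    """Move `amount` between accounts; undo everything if the result is invalid."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            # Validate the state after processing, not the input itself.
            (negative,) = conn.execute(
                "SELECT COUNT(*) FROM accounts WHERE balance < 0"
            ).fetchone()
            if negative:
                raise ValueError("an account would be overdrawn")
    except ValueError:
        return False  # the input is declared invalid; the state is unchanged
    return True


# Hypothetical data for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()
```

Here `transfer(conn, 1, 2, 30)` succeeds, while `transfer(conn, 1, 2, 500)` returns `False` and leaves both balances as they were before the call.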

Which validation method should be applied in a particular case? Most often a single method is not enough, and there is no need to limit yourself to one. Data validation can and should be done in several stages, with the checks becoming progressively more complex.

First, as you type, make sure that the data does not contain invalid characters. For example, for a numeric field, the user may not be allowed to enter non-numeric characters.
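A character-level filter of this kind might look as follows. This is a sketch that is not tied to any GUI toolkit; `accept_keystroke` is a hypothetical name for whatever hook the toolkit calls on each key press.

```python
def accept_keystroke(current: str, char: str) -> str:
    """Return the new field content, ignoring any character that is not 0-9."""
    # isascii() excludes non-Latin digits, which isdigit() alone would accept.
    if char.isascii() and char.isdigit():
        return current + char
    return current
```

Feeding the characters of `"1a2-3"` through this function one by one leaves `"123"` in the field.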

After input is complete, you can check the entire value. For the entered number there may be some restrictions, for example, it should not exceed a certain maximum permissible value. If our numeric field is an age, it should be in the range of 0 to, say, 120.
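The check of a complete value could be sketched like this, assuming the age arrives as text from the field. The function name and the error messages are hypothetical; the 0 to 120 range is the one suggested above.

```python
def validate_age(text: str) -> int:
    """Check the complete field value and return it as a number."""
    if not (text.isascii() and text.isdigit()):
        raise ValueError("age must contain digits only")
    age = int(text)
    if not 0 <= age <= 120:
        raise ValueError("age must be between 0 and 120")
    return age
```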

When all fields are filled in, you can check whether the entered values are consistent with each other. For example, if the form has a passport number field in addition to the age field, the application can check that the age is at least 14 whenever a passport number is filled in.
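A cross-field check of this sort takes already validated individual values and looks only at their combination. In the sketch below, `validate_form` and the message text are hypothetical, and the passport number is treated as optional.

```python
from typing import List, Optional


def validate_form(age: int, passport_number: Optional[str]) -> List[str]:
    """Return the list of inconsistencies between the form fields."""
    errors = []
    # Each value may be fine on its own; only the combination is invalid.
    if passport_number and age < 14:
        errors.append("a passport number requires an age of at least 14")
    return errors
```

An empty list means the set of values is consistent, for example for a ten-year-old without a passport number.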

Finally, if everything is entered correctly, you can try to start processing, performing checks along the way, as well as at the very end, and if something went wrong, roll back to the original state.

And, of course, checks at the next level can back up checks at the previous levels. For example, in web applications it is mandatory to check the data that arrives at the server in an HTTP request, regardless of whether it was already validated in the browser. The reason is that a client-side check can be circumvented. In other kinds of applications the checks are not so easy to bypass, but sometimes it is quite possible, as the example below shows.
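For the web application case, the server-side repetition of the check can be sketched as follows. The handler name, the `width` field, and the 1 to 10000 limit are assumptions invented for this illustration; only `urllib.parse.parse_qs` is a real standard library call.

```python
from typing import Tuple
from urllib.parse import parse_qs


def handle_resize_request(body: str) -> Tuple[int, str]:
    """Re-validate a form-encoded request body; return (HTTP status, message)."""
    fields = parse_qs(body)
    raw = fields.get("width", [""])[0]
    # The browser may already have checked this, but the request could have
    # been sent by any other client, so the server checks it again.
    if not (raw.isascii() and raw.isdigit()):
        return 400, "width must be a whole number"
    width = int(raw)
    if not 1 <= width <= 10000:
        return 400, "width is out of range"
    return 200, f"resized to {width}"
```

A body such as `width=-5`, which a browser-side filter would never produce, is rejected with status 400 instead of reaching the processing code.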

Validator testing

We conclude the article with a demonstration of various kinds of validators, along with some recommendations on how to check that they work correctly during testing.

Let's start with a character check. In the Paint graphics editor, open the dialog for resizing a picture and look at the picture width field. Only digits are allowed in this field; when you try to enter other characters, an error message is displayed.

However, with a little ingenuity you can bypass this character validation: a negative number can be pasted from the clipboard, even though the minus sign is an invalid character.

This does not lead to negative consequences, though, because at the next level there is another check, which runs when you click OK.

There are other restrictions on this field, which are also checked after the OK button is pressed.

But the field for entering the skew of the picture, located right next to it in the same dialog, has no character validation, even though it is also a numeric field. Moreover, if you enter invalid characters and click OK, you get a strange message that is practically undecipherable.

All the examples above concern the check of a single field. An example of validating a combination of fields can be found in the same application, in a different place: the page setup dialog for printing. If we set the page margins so that together they exceed the width of the page, we get an error message.

And finally, the note “Why is there not enough memory to reduce the size of the picture?” describes an error caused by the fact that this graphics editor does not handle failures correctly and does not roll back the transaction when the picture size is too large.

The tester needs to work out all these situations. First, you need to check validation at all levels. Secondly, it is necessary to check the consistency of validators at different levels. Thirdly, we must look for ways to bypass the validators, trying to get to the next level without preliminary checks.
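The second and third points can be partly automated. The sketch below models two levels of a hypothetical numeric field: `field_accepts` stands for the character filter, `dialog_accepts` for the check on OK, and the 1 to 500 range is an arbitrary assumption. It then lists the inputs that the first level forbids but the second level would let through if they reached it, for example via the clipboard.

```python
from typing import Iterable, List


def field_accepts(text: str) -> bool:
    """Level 1: what the character filter allows to be typed."""
    return text.isascii() and text.isdigit()


def dialog_accepts(text: str) -> bool:
    """Level 2: the check performed when OK is pressed."""
    try:
        value = int(text)  # int() tolerates a leading sign and surrounding spaces
    except ValueError:
        return False
    return 1 <= value <= 500


def find_bypasses(candidates: Iterable[str]) -> List[str]:
    """Inputs rejected at level 1 but accepted at level 2."""
    return [c for c in candidates if not field_accepts(c) and dialog_accepts(c)]
```

For the candidates `["100", "-5", "abc", "+7", " 42 ", "0", "999"]` this returns `["+7", " 42 "]`: two values on which the levels disagree, which is exactly the kind of inconsistency worth reporting or at least questioning.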

Conclusion

Most of this article is not about methods of testing validators but about how validators are built. Why? Because you need to know your enemy by sight. To find a data validation defect, you need to understand where to look and what to look for.

PS Crosspost

Source: https://habr.com/ru/post/72796/
