
The difficulty of finding errors in scientific applications

This is a continuation of a note on why unit tests work poorly in scientific applications [1]; in this article I want to talk about the difficulties of finding errors and debugging scientific applications that I once faced, many of which were astonishing to me as a web developer.


The article will consist of several sections:
  1. Introduction, for order
  2. Difficulty finding errors
    • Parallelism
    • Error locality
    • Non-obvious typos
    • The impact of the setting on the result
    • Error Identification: Error Or Not?

  3. Conclusion

Introduction


The main purpose of this article is to describe my own experience, in the hope that someone will find it interesting and unusual (especially, I suspect, industrial programmers); perhaps it will help someone prepare better for writing a thesis, coursework, or lab assignments. The preamble, the formulation of the scientific problem, and a brief description of the algorithm can be found in the aforementioned article [1]. So I will turn straight to those difficulties of finding errors that are unusual (for web developers, for example), in order to broaden readers' horizons, and consciousness in general.

Difficulty finding errors


Parallelism

This is an exceptionally short point, included for completeness, in which I will once again mention that finding errors and debugging is much harder in parallel programs than in single-threaded ones. And most resource-intensive scientific programs are parallel.
Error locality

Here I mean that an error in the code of one class can manifest itself in completely unexpected places. The point is not poor application architecture or high coupling between modules, but the nature of the problems and of the modeling. For example, if, when modeling fluid flow in a pipe, the velocity profile is distorted near the walls, this does not mean that the error is in the wall-calculation algorithm: it can be anywhere. And vice versa, if the density is distributed strangely within the bulk of the liquid, it does not follow that the wall-calculation algorithm is innocent.

Compare this with a typical business-application scenario: if product discounts are calculated incorrectly in an online store, the error is almost certainly hidden in the discount-calculation code.

It can be argued that it is easy to identify the source of an error from the change history: as soon as the application stops working, look for the error in the recently modified code. However, this approach is not applicable to parts of the program added for the first time (for example, heat transfer or new boundary conditions), because there is no previous working version of that functionality to fall back on.

In addition, errors in scientific applications may not manifest themselves for a long time. For example, if, after adding a new type of boundary conditions, a Poiseuille flow [2] driven by a body force rather than a pressure gradient suddenly stops being modeled correctly, it may turn out that the problem lies not in the new boundary-condition algorithm but in the logic that accounts for the external force; until then, the error simply was not critical (see also the item "Slow error rate" in [1]).
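To make this concrete, here is a minimal sketch of how such a latent error can hide. This is not the author's actual code; the Shan-Chen-style forcing scheme and all names are my assumption:

    // Hypothetical sketch: a Shan-Chen-style forcing, where the velocity
    // used in the equilibrium is shifted by tau * F / rho.
    struct Node {
        double density;
        double velocity[2];
    };

    // While 'force' is zero everywhere (e.g. in pressure-driven test
    // flows), a wrong sign or factor on the line below changes nothing;
    // the bug first surfaces when a force-driven flow is simulated.
    void shiftVelocityByForce(Node& n, const double force[2], double tau) {
        for (int d = 0; d < 2; ++d)
            n.velocity[d] += tau * force[d] / n.density;
    }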

Non-obvious typos

One of the problems with scientific algorithms is that they are often non-obvious. Even if you build the program with a beautiful architecture, dedicate a separate class to the algorithm, and design and write it well, you will probably not be able to avoid several problems.

First, there are meaningless variable names (because they are auxiliary variables from the original scientific article that carry no semantic load of their own). Second, there are non-obvious operations on class variables (because they, too, are taken from the original article, where they were derived by dark-magic methods such as minimizing a standard deviation, the Fredholm alternative, computing spatial density harmonics, and so on).

If you are debugging a business application and see a line like

    bool categoryIsVisible = categoryIsEnabled || productsCount > 0;

then you will immediately notice the typo, because the condition obviously calls for a logical AND.

But imagine that you come across a line like this (from a real project):

    double probability = latticeVectorWeight * density * (1.0 + 3.0 * dotProduct + 9.0 / 2.0 * dotProduct * dotProduct - 3.0 / 2.0 * velocitySquare);

It is unlikely that you will spot that a plus and a minus have been swapped somewhere. And, by the way, the variable names here are meaningful.
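One cheap defense I can suggest here is to assert the analytic invariants of such formulas. The sketch below assumes the line above computes a D2Q9 lattice Boltzmann equilibrium distribution (the weights and velocity set are the standard ones, not taken from the author's project); summing it over all directions must reproduce the density and momentum exactly, so a flipped sign shows up immediately:

    #include <array>
    #include <cassert>
    #include <cmath>

    // Standard D2Q9 weights and lattice velocities.
    constexpr std::array<double, 9> w = {4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                                         1.0/36, 1.0/36, 1.0/36, 1.0/36};
    constexpr std::array<std::array<int, 2>, 9> e = {{
        {0,0}, {1,0}, {0,1}, {-1,0}, {0,-1}, {1,1}, {-1,1}, {-1,-1}, {1,-1}}};

    // The same formula as in the example line above.
    double equilibrium(int i, double rho, double ux, double uy) {
        double dot = e[i][0] * ux + e[i][1] * uy;
        double u2 = ux * ux + uy * uy;
        return w[i] * rho * (1.0 + 3.0 * dot + 4.5 * dot * dot - 1.5 * u2);
    }

    // Analytic invariants: sum_i f_i = rho and sum_i f_i * e_i = rho * u.
    // A sign typo in the 3/2 or 9/2 term breaks these checks at once.
    void checkEquilibriumInvariants(double rho, double ux, double uy) {
        double mass = 0, mx = 0, my = 0;
        for (int i = 0; i < 9; ++i) {
            double f = equilibrium(i, rho, ux, uy);
            mass += f;
            mx += f * e[i][0];
            my += f * e[i][1];
        }
        assert(std::fabs(mass - rho) < 1e-12);
        assert(std::fabs(mx - rho * ux) < 1e-12);
        assert(std::fabs(my - rho * uy) < 1e-12);
    }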

The impact of the setting on the result

At this point, I will try to explain that the behavior of scientific applications depends much more strongly (compared to business applications) on the input data, the system parameters, and the initial state of the system; that is, on the setting of the system.

The main sources of dependence on the setting are as follows.

1. The system parameters affect the result of the program strongly (qualitatively), unlike in business applications, where parameters usually have only a quantitative effect (for example, the operation of a CMS will not fundamentally depend on whether the administrator adds five lines or ten).

2. The smaller stability region of the algorithms with respect to the input data. In business applications, the main constraint on the data is the absence of overflow errors (and who pays attention to that?!). In scientific algorithms (one of whose distinguishing features is working with sets of far greater cardinality), you have to keep stability in mind (and, following from it, stiff differential equations, stability theory, Lyapunov exponents, and so on) and monitor it. Moreover, in business applications all restrictions are deterministic (say, a name entered at registration cannot be longer than 100 characters, an email must match a certain regular expression), whereas in scientific problems the working range of input data often has to be determined by trial and error.

3. Everything else (hard to formalize for now). In particular, the conversion of quantities from physical units into the units of measurement used by the program.

To illustrate these aspects, I will show the checklist I compiled for myself after weeks of futile debugging of an application for modeling hydrodynamics. Whenever I could not find an error after several hours or days of stepping through the code, I went through this checklist.

Attention! It is somewhat far from the subject matter of Habr and from the interests of most readers, so feel free to skip it and move on to the next item.
So, the checklist:
  1. Check incompressibility
  2. Check Reynolds numbers
  3. Check the unit conversion

The first of these points means that the algorithm works only for a weakly compressible fluid, which is equivalent to low velocities (much less than the speed of sound in the fluid), since the flow is induced by a density gradient. When I forgot about this restriction for the first time, I spent several days searching for errors in the code, because the program seemed to work almost correctly.
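As a sketch, the check amounts to comparing the maximum velocity in lattice units with the lattice speed of sound; the 0.1 bound on the Mach number below is a common rule of thumb, not a value from the author:

    #include <cmath>

    // Sketch of checklist item 1: weak compressibility requires the
    // maximum lattice velocity to stay well below the lattice speed of
    // sound c_s = 1/sqrt(3).
    bool incompressibilityOk(double maxLatticeVelocity) {
        const double soundSpeed = 1.0 / std::sqrt(3.0);
        return maxLatticeVelocity / soundSpeed < 0.1;  // Mach number bound
    }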

The second point is equivalent to checking the stability domain of the algorithm. The point is that the Reynolds number determines how turbulent and unstable the fluid motion is [3]: the larger it is, the more unstable the flow; the smaller, the more "viscous" the flow. It turns out that even if the motion is never physically turbulent (again, as in a Poiseuille flow), the calculations begin to diverge at sufficiently large Reynolds numbers. Of course, until I stepped on this rake (and I searched for something like a week), I did not think about tracking the stability region.
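For illustration, here is a sketch of such a check; the BGK viscosity relation is standard for lattice Boltzmann codes, but the stability bound itself has to be found by trial and error for each setup, which is part of why this point resists automation (as noted below):

    // Sketch of checklist item 2: in lattice units the BGK viscosity is
    // nu = (tau - 0.5) / 3, and the Reynolds number is Re = u * L / nu.
    double latticeViscosity(double tau) {
        return (tau - 0.5) / 3.0;
    }

    // 'reMaxStable' is an empirical bound, found by trial and error.
    bool reynoldsOk(double maxVelocity, double channelWidth,
                    double tau, double reMaxStable) {
        double re = maxVelocity * channelWidth / latticeViscosity(tau);
        return re < reMaxStable;
    }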

The third item is specific to physical calculations and certain algorithms. The method used accepts input physical quantities in special lattice units (where the unit of length is the step of the uniform spatial lattice, and the unit of time is proportional to it). Until I came across a special article [4] devoted to unit conversion in this method, I spent several weeks unsuccessfully trying to understand why the program behaved not quite correctly.
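A sketch of the two basic relations that follow directly from the choice of the lattice step dx and the time step dt (anything subtler than dimensional analysis is exactly what [4] covers):

    // Sketch of checklist item 3: converting physical quantities into
    // lattice units via the grid spacing dx [m] and the time step dt [s].
    struct LatticeUnits {
        double dx;  // meters per lattice cell
        double dt;  // seconds per time step

        double velocity(double uPhysical) const {       // [m/s] -> lattice
            return uPhysical * dt / dx;
        }
        double viscosity(double nuPhysical) const {     // [m^2/s] -> lattice
            return nuPhysical * dt / (dx * dx);
        }
    };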

It is worth noting that the second and third points hardly lend themselves to automatic checking.

Error Identification: Error Or Not?

This problem is completely unimaginable in business applications; it lies in the fact that it is often impossible to say for sure whether a deviation of the program's behavior from the expected behavior is an error at all.

For example, it is known that the velocity profile of a viscous fluid flowing through a cylindrical pipe is parabolic [2]. Suppose, nevertheless, that in the simulation the fluid near the pipe walls flows a little faster than it should. The options usually considered are the following:
  1. it is actually an error
  2. it is a feature of the algorithm
  3. it is a consequence of input data (initial conditions, physical parameters) that are incorrect or unsuitable for the algorithm (see "The impact of the setting on the result")

Checking the first option through unit testing is complicated by the difficulty of writing unit tests in such applications [1].

The second option is easy to check in this example by swapping out the wall-calculation algorithm. However, it may turn out that modeling with the new method also produces distorted results. In that case, you can try a couple more algorithms (if they exist at all, and if you have the time to find, understand, and implement them).

Checking the third option, unfortunately, is not so trivial. One approach is to vary the input parameters and the system setting in order to determine whether there is a region of the initial-data phase space in which the program works. Sadly, this is not so simple, because in complex simulations the number of degrees of freedom in the initial conditions is very large (various physical parameters can be set, such as viscosity and thermal conductivity; the initial distributions of velocities, forces, and densities throughout the system; and so on). For example, in the test with liquid flowing through a pipe, it took me several days to think of starting the simulation not from a stationary velocity distribution but from a fluid at rest, subsequently accelerated by a constant force, and the error disappeared!
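In code, the crude version of this search is simply a scan over a few setup parameters. In the hypothetical sketch below, 'Result', 'runSimulation', and the parameter ranges are stand-ins for the real solver interface:

    #include <iostream>

    // Hypothetical stand-ins for the real solver interface.
    struct Result { bool diverged; };
    Result runSimulation(double tau, double bodyForce, bool startFromRest) {
        return {false};  // stub so the sketch compiles; the real solver goes here
    }

    // Brute-force scan over a small corner of the initial-data phase
    // space, looking for a region where the simulation stays stable.
    void scanSetups() {
        for (double tau : {0.6, 0.8, 1.0, 1.2})
            for (double bodyForce : {1e-6, 1e-5, 1e-4}) {
                Result r = runSimulation(tau, bodyForce, /*startFromRest=*/true);
                std::cout << "tau=" << tau << " F=" << bodyForce
                          << (r.diverged ? " diverged\n" : " ok\n");
            }
    }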

Conclusion


These are, in fact, all the difficulties of finding errors that I wanted to talk about. If anyone has thoughts on how to avoid such effects or cope with them effectively, I will be glad to hear them.

Thanks for reading!

References:

[1] Why unit tests do not work in scientific applications
[2] Poiseuille flow
[3] Reynolds number
[4] Conversion of units in LBM

Source: https://habr.com/ru/post/93570/

