Metamorphic testing: why almost no one knows about this promising method

I must admit: I read ACM Magazine . This makes me a “nerd” even by the standards of programmers. Among other things, I learned from this magazine about "metamorphic testing". I had never heard of him before, like all the people I asked. But the scientific literature on this topic is surprisingly voluminous: there are many incredibly successful examples of its use in completely different fields of research. So why haven't we heard of him before? There is only one article for people outside academia. Let them now be two.

Brief background

Most written tests use oracles . That is, you know the answer and explicitly check whether the calculations give the correct answer.

def test_dist(): p1 = (0, 3) p2 = (4, 0) assert dist(p1, p2) == 5

In addition to the test oracles, there are manual tests. The tester sits at the computer and compares the input data with the results. In the process of complicating systems, manual tests are becoming less and less useful. Each of them checks only one point in a much larger state space, and we need something that examines the entire state space.
')
This leads us to generative testing : writing tests that cover a random set in the state space. The most popular style of generative testing is property based testing , or PBT. We find the “property” of the function, and then generate the input values and check whether the output values correspond to this property.

 def test_dist(): p1 = random_point() p2 = random_point() assert dist(p1, p2) >= 0

The advantage of PBT is to cover more space. Its disadvantage is the loss of specificity. This is not an oracle test! We do not know what the answer should be, and the function may be erroneous, but in a way that has the same property. Here we rely on heuristics.

A serious PBT problem is finding good properties. Most functions have simple, general properties and complex, specific properties. Common properties can be applied to a large variety of functions, but they do not give us very much information. More specific properties provide more information, but they are more difficult to find and are applicable only in areas of limited tasks. If you have a function that determines whether the graph is acyclic, then what property tests will you write? Will they give you the assurance that the function is correct?

Motivation

Now consider the more complex task. Imagine that you want to write a speech-to-text (STT) speech converter for English. It receives a sound file, and displays the text. How would you test it?

The easiest way to use a manual oracle. Dictate the sentence and check whether the text on the output matches it. But this is not even close enough! The range of human speech is huge . It would be better to test 1,000 or even 10,000 different sound files. Manual transcription oracles would be too expensive. This means that we will have to use property-based testing instead.

But how do we generate the input? For example, we can create random strings, skip them through a text-to-speech (TTS) converter, and then make sure that our STT produces the same text. But this again gives us a very limited range of human voices. Can TTS create intonation changes, swallow words, imitate a strong accent? If we do not cope with them, will the STT be particularly useful? It is better to use arbitrary texts, for example, recordings from radio, from podcasts and online videos.

Now there is a new problem. When using TTS, we started with a written text. In the case of arbitrary sound files, we do not have it, and at the same time we do not want to perform the transcription manually. Instead, we are limited to using properties. What properties do we need to test? Examples of the simplest properties: “the program does not crash with any incoming data” (a good property) or “it does not convert acoustic music into words” (perhaps?). These properties do not very well cover the verification of the main task of the program and slightly increase confidence in its quality.

So, we have two tasks. First, we need a large amount of incoming data in the form of speech. Secondly, we need to figure out how to convert them into useful tests, without spending hours on manual transcription of voices into oracles.

Metamorphic Testing

In this case, the output is considered separately. What if we embed them in a wider context? For example, if the sound clip is transcribed into the out output, then we should always get out when:

Double the volume
Frequency boost
Increase tempo
Adding background noise
Adding traffic noise
Any combination of the above.

These are all “simple” transformations that we can easily test. For example, for the test with “traffic noise” we can take 10 car noise samples, put them on the sound clip and check whether the recognition results of all 11 versions match. We can two increase or decrease the volume, turning 11 versions into 33 versions, and then double the pace, getting 66 versions. This principle can be applied to each sound clip in our database, significantly expanding the space of incoming data.

The presence of 66 versions for comparison is quite convenient. But that’s not all: we don’t need to know what the output should be. If all 66 transformations return out , then the test is successful, if at least one returns something else, then the test fails. At no stage do we need to check what is contained in out . This is extremely important. So we significantly increase the test space with very little human participation. For example, we can download a series episode, perform transformations, and check whether all the results of their conversion into text ^{1 are the} same. We got useful tests without listening to the voice clip . Now we can generate complex and deep tests without using an oracle!

Two sets of incoming data, as well as their outgoing data are related to each other. Such a property related to the set of incoming / outgoing data is called a metamorphic link ² . Testing in which this property is applied is called metamorphic testing . In complex systems, interesting metamorphic relationships can be easier to find than interesting properties of individual incoming / outgoing data.

We put it a little more formally: if we have x and f(x) , then we can perform some transformations of x to get x2 and f(x2) . In the case of STT, we simply check f(x) = f(x2) , but we can use any links between these two data sets. There may be metamorphic relationships of the type f(x2) > f(x) or “whether f(x2)/f(x) an integer value”. Similarly, this principle can be extended to multiple sets of input data using f(x) and f(x3) . An example of this would be a comparison of the results of a search engine without filters with the results of a search engine with one or two filters. In most of the descriptions of usage examples I have read, only two sets of incoming data are used, then even they are enough to find crazy bugs.

Examples of using

Speaking of examples of use: how effective is metamorphic testing in practice? It's one thing to talk about the methodology in the abstract or give artificial examples. Learning usage examples are useful for three reasons. First, it shows whether the method actually works. Secondly, from them it is possible to learn about potential difficulties when using MT. Third, examples show us how we can use the technique. Any metaphorical relationship used in the example of use, you can try to adapt to the solution of our problems.

The article “Metamorphic Testing: A Review of Challenges and Opportunities” provides a list of many studies, but they are all scientific articles. Below are the most interesting. Articles marked with (pdf) , posted, as you might guess, in PDF.

METTLE: A Metamorphic Testing Approach To Validating Unsupervised Machine Learning Methods (pdf). It describes 11 different metamorphic links for testing uncontrolled clustering, for example, “will we get the same result if we mix the incoming data?” And “do the additional incoming data along the cluster edges to these clusters?” Different models change with different links. For example, about 5% of the tested k-means models with mixing of the order of the incoming points have an average clustering error of 20%.
DeepTest: Automated Testing of Deep-Neural-Network-Driven Autonomous Cars (pdf). The subject of the article is autopilot vision systems in cars. Examples of metamorphic relationships (MS): “adding a rain filter” or “small image tilting”. The authors laid out the results of the sampling here : almost all of the systems tested by them could not cope with their functions when changing the MS.
Automated Testing of Graphics Shader Compilers (pdf). Injections of the “dead” code and execution time constants into the shaders led to the disappearance of objects in the images or their transformation into noise. Researched on the basis of their work created a startup GraphicsFuzz , which was purchased by Google, and the site is closed.
Metamorphic Testing of RESTful Web APIs (pdf). Will we get the same elements when changing pagination ? What if you sort them by date? This article contains a whole bunch of errors in Spotify and Youtube.
An innovative approach for testing bioinformatics programs using metamorphic testing (earlier pdf, but now not). Search for errors in bioinformatics data. I have little knowledge of bioinformatics, but the article shows that MS are also useful in specialized areas.

Problem

Oh, so all these sources are in PDF.

It took several hours to search all these articles. And this problem is associated with the biggest obstacle to the development of MT: all the above links are preprints or the first drafts of future scientific articles. When I start to understand little-known methods, I first ask myself: “why are they little known?” Sometimes the reason is obvious, sometimes it is a complex set of small reasons, sometimes the problem is simply that the technique is “not lucky.”

In the case of MT, the problem is obvious. Almost all information is hidden behind scientific paywall. If you want to study MT, you either need access to the magazine, or you need to spend several hours searching for preprints ³ .

Further study

The inventor of MT is Ty Chen . He also became the driving force of many studies. Other researchers in this area are Qi Chuan Zhou (Zhi Quan Zhou) and Sergio Segura (Sergio Segura) ; both have put all their preprints on the Internet. Most of the research has been done by some of these people.

Probably the best place to start is Metamorphic Testing: A Review of Challenges and Opportunities and A Survey on Metamorphic Testing . Although this article was written about metamorphic testing , the study also applied metamorphic links in general to a wide variety of other disciplines, for example, to formal code verification and debugging. I have not yet studied in detail these areas of application of the technique, but probably they should also be looked at.

From the point of view of applicability, it is theoretically possible to adapt most PBT libraries to verify metamorphic properties. In fact, the first example in Quickcheck tests the MS, and in this PBT essay, the MS is indirectly applied. In general, it seems to me that most PBT research focuses on efficient generation and truncation of incoming data, and MT research focuses mainly on determining what we really need to test. Consequently, these techniques are likely to complement each other.

Thanks to Brian Ng for research assistance.

Postscript: request

In fact, it is not surprising that I have never heard of this method before. There are many really interesting and useful techniques that could not leave their tiny bubble. I learned about MT more because of luck, not active searches.

If you know something worthwhile of wide application, please write to me .

Well, there may be obvious problems: there may be music in the podcast, speech snippets in other languages, etc. But the theory is reliable: if we are able to get speech samples, we can use them as part of tests without prior manual transcription / markup.
In specifications, the corresponding idea is hyperproperties ( properties of sets of behaviors, rather than individual behaviors). Most hyper-property studies are related to HS security. As I understand it, her HS are a superset of MC.
I had a second, now disproved, hypothesis: since most of the main researchers from China and Hong Kong, perhaps, this technique is more known in the communities of programmers who communicate in Mandarin, and not English. Brian Eun checked this hypothesis for me, but did not find any significant signs of the method being used by the Chinese.

Source: https://habr.com/ru/post/454458/

All Articles