When people talk about testing, they most often discuss the approach known as the “black box”. Here we will talk about the opposite scenario - the “white box”, which lets you ask questions of the code based on an understanding of its internal structure.
This article is based on a talk by Nikita Makarov (Odnoklassniki) at our December conference, Heisenbug 2017 Moscow.
Theory
At a great many conferences, and in a very large number of books, blog posts and other sources, it is said that black-box testing is good and correct, because that is how the user sees the system.
And we kind of go along with it - we see and test the system the same way.
This is all great, but for some reason very little is said about the white box.
At some point I wondered why. What is white-box testing?
White box definition
So I set out to understand. I started looking for sources. The quality of Russian-language sources was very low; those translated from English into Russian were a little better. Eventually I got to English-language sources - all the way to Glenford Myers (G. Myers), who wrote the wonderful book The Art of Software Testing.
Literally in the second chapter, the author starts talking about white-box testing: “To combat the challenges associated with testing economics, you should establish some strategies before beginning. Two of the most prevalent strategies include black-box testing and white-box testing.”
Translation
To stay within reasonable limits on the costs associated with testing, you must develop some kind of strategy before you begin. There are two prevalent strategies: black-box and white-box testing.
At the end of the book, in the glossary, Myers gives a definition of white-box testing: “White-box testing - a type of testing in which you examine the internal structure of a program.”
Translation
White-box testing is a type of testing in which you examine the internal structure of a program.
What does this mean in practice? Myers suggests building test scenarios around coverage criteria (a toy example follows the list):
Statement coverage - covering the statements in the code;
Decision coverage - covering the decisions (branches);
Condition coverage - covering the individual conditions;
Decision-condition coverage - covering decisions and conditions together;
Multiple-condition coverage - combinatorial coverage of condition combinations.
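To make these criteria concrete, here is a toy example (a sketch of mine, not from Myers' book):

```java
// A toy class for illustrating the coverage criteria above.
public class Admission {
    static String admission(int age, boolean hasTicket) {
        if (age >= 18 && hasTicket) {
            return "admitted";
        }
        return "rejected";
    }
}
```

Statement coverage needs both return statements to execute, e.g. the inputs (20, true) and (10, false). Decision coverage requires the whole if condition to evaluate to both true and false - the same two inputs suffice. Condition coverage requires each of age >= 18 and hasTicket to take both values individually. Multiple-condition coverage requires all four combinations of the two conditions.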
Everything Myers talks about was written 35 years ago. What software was written then, and what now? What code bases did people work with then - and now? A lot has changed. Coverage is good, of course, and there are many tools for measuring it, which we will discuss below. But coverage is not everything - especially considering that we live in a world of distributed systems, where a bracelet on a person's wrist sends data through the phone to cloud services.
What should we understand by white-box testing today? We look at the code, understand the structure and the dependencies in that code, ask questions, draw conclusions, and design tests based on this data. We run those tests manually or automatically and use them to obtain new data about the state of our system - how it can or cannot behave. That is our profit.
Why do you need a white box?
Why should we do all this if we have the black box - that is, the system as the user sees it? The answer is very simple: life is complicated.
This is the call stack of a regular modern enterprise application written in Java:
It is not only in Java that everything is so verbose and plentiful - in any other language it will look about the same. What is in there?
There are web server calls; a security framework that handles authorization and authentication, checks permissions and everything else. There is a web framework, and then another web framework (because in 2017 you cannot just take and write an enterprise application on a single web framework). There are frameworks for working with the database and for mapping objects onto tables, rows, columns and all the rest. And there is one small yellow square - a single call to the business logic. Everything below and above it happens in your application every single time.
If you try to reach this thing from somewhere outside, with the black box (the way the user sees it), there is a lot you simply cannot test. And sometimes you really need to - especially when user actions change something in security, the user gets redirected somewhere else, or something happens in the database. The black box does not let you do this. That is why you need to climb inside - into the white box.
How do you do it? Let's take a look at practice.
Practice
So that there are no false or inflated expectations, let's clarify a few details from the very beginning:
There will be no ready-made recipes. None at all. Everything I am going to show requires applying a file, your hands, and your head.
Much depends on the context. I come from Java development (I have been doing it for quite a while), and we have our own tools. Some may seem miraculous to you, others ugly. Some of them cannot or should not exist in your context. That is normal. I came not to show off tools but to share ideas - which is why all my examples are simplified to the limit.
To do all of this with your development team, you need to have influence over it. What do I mean by that? You should be able to read the code that the developers write, and you should speak the same language they do. Without this, nothing I am going to talk about next will work.
To keep my further story more or less structured, I have broken it into three levels. Let's start with the simplest - the easy level.
Easy level
As I said, we look into the code and see:
the code is not formatted;
the code is not written according to the guidelines;
the names of methods, classes and variables do not correspond to what is accepted in the company;
the code is stylistically incorrect (again, it does not match the guidelines);
any static code analyzer will find a bunch of problems standard for your language;
unit tests for the code are either absent or written in a way that does not stand up to criticism.
Fixing this is the very first and simplest thing you can do in the area of white-box testing. Static code analysis tools, which are already quite sophisticated today, cope with all of this remarkably well - tools such as Sonar for Java and its analogues for your language (in fact, Sonar is multilingual and suits almost everyone).
I do not want to dwell on this for long - there are plenty of interesting talks about it.
Medium level
The medium level differs in scale. When you work in a small company or team - you are the only tester, there are three or four developers (the industry average), 100 thousand lines of code shared by everyone, and code review consists of the lead developer being unleashed on whoever is guilty - you do not need any special tools. But that is a rare case.
Large successful projects are usually spread across several offices and development teams, and the size of the code base starts at a million lines.
When a project has a lot of code, the developers begin to establish formal rules for how that code is written:
the code must go to certain places, into certain packages;
the code must be properly formatted;
it must inherit from a certain class, have the correct logger, and carry the necessary annotations, so that all production metrics are counted correctly, statistics are collected, and exceptions are sent to the right place.
In other words, as the amount of code grows, formal rules arise that can be checked - and, accordingly, tools appear that let you check them.
ArchUnit lets you describe, in a more or less domain-specific language, formal rules about what should or should not be in the code, and push them into every project in the form of standard unit tests. From inside the project, ArchUnit thus lets you verify that the “sanitary minimum” is observed.
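Here is a minimal sketch of such a rule written with ArchUnit (JUnit 4 is assumed; the package names follow the description below):

```java
import com.tngtech.archunit.core.domain.JavaClasses;
import com.tngtech.archunit.core.importer.ClassFileImporter;
import com.tngtech.archunit.lang.ArchRule;
import org.junit.Test;

import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;

public class NoSeleniumInternalsTest {

    @Test
    public void testsShouldNotTouchSeleniumInternals() {
        // Import the compiled classes of our test package.
        JavaClasses classes = new ClassFileImporter().importPackages("org.example.out.test");

        // No class from the test package may reach into Selenium internals directly.
        ArchRule rule = noClasses()
                .that().resideInAPackage("..org.example.out.test..")
                .should().accessClassesThat().resideInAPackage("..org.openqa.selenium..");

        rule.check(classes);
    }
}
```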
The rule says that no class (.noClasses()) residing in the test package (org.example.out.test) may directly access the internals of Selenium (..org.openqa.selenium..).
Let's run this test. It fails beautifully:
It reports that we have violated the rule (a class in such-and-such a package is reaching into classes that live in another package). Even more valuable, it shows, in the form of a stack trace, all the lines where the rule is not respected.
ArchUnit is a great tool that lets you embed such checks into the CI/CD cycle - that is, write tests inside the project that verify certain architectural rules. But it has one drawback: it checks everything when the code has already been written and committed somewhere (so either a commit hook rejects the commit, or something else has to react). And there are situations when it must be impossible to write the bad code at all.
At the previous Heisenbug, in the summer of 2017, my colleague from Yandex, Kirill Merkushev, talked about how code generation solves test automation problems. If you have not watched his talk, please do - the video is available here.
Indeed, code generation can solve many problems. It lets you not only create code that you do not want to write by hand, but also prohibit the creation of code that should not be written. Let's see how it works.
Most code generation is built on annotation processing. I have a project that describes a pair of annotation processors specific to the Java world - in particular, around the notion of a Pojo. There is no such thing as a struct in Java programs. The founding fathers of Java are only now thinking about introducing structures into the language; C has had them for more than 40 years. But we found a way out - we have the Pojo (plain old Java object): an object with fields, getters and setters, and nothing else in it - no logic.
So I have an annotation that characterizes a Pojo object, and an annotation that characterizes a Helper - a stateless object crammed with all sorts of procedural methods (pure business logic). And I have two processors for these annotations.
The Pojo annotation processor searches the code for its annotation and, when it finds one, checks whether the code conforms to what a Pojo is (or is not). The Helper annotation processor operates in a similar manner (here is a link to the annotations and annotation processors).
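The real processors live in the linked project; below is a heavily stripped-down sketch of the idea behind the Pojo processor (the annotation name and the accessor check are illustrative):

```java
import java.util.Set;
import javax.annotation.processing.AbstractProcessor;
import javax.annotation.processing.RoundEnvironment;
import javax.annotation.processing.SupportedAnnotationTypes;
import javax.annotation.processing.SupportedSourceVersion;
import javax.lang.model.SourceVersion;
import javax.lang.model.element.Element;
import javax.lang.model.element.ElementKind;
import javax.lang.model.element.TypeElement;
import javax.tools.Diagnostic;

// Checks every class annotated with @Pojo and fails the build
// if the class contains anything beyond fields, getters and setters.
@SupportedAnnotationTypes("org.example.annotations.Pojo")
@SupportedSourceVersion(SourceVersion.RELEASE_8)
public class PojoProcessor extends AbstractProcessor {

    @Override
    public boolean process(Set<? extends TypeElement> annotations, RoundEnvironment roundEnv) {
        for (TypeElement annotation : annotations) {
            for (Element pojo : roundEnv.getElementsAnnotatedWith(annotation)) {
                for (Element member : pojo.getEnclosedElements()) {
                    if (member.getKind() == ElementKind.METHOD && !isAccessor(member)) {
                        // A compile-time error makes the bad code impossible to build.
                        processingEnv.getMessager().printMessage(Diagnostic.Kind.ERROR,
                                "@Pojo class must not contain logic: " + member, member);
                    }
                }
            }
        }
        return true;
    }

    private boolean isAccessor(Element method) {
        String name = method.getSimpleName().toString();
        return name.startsWith("get") || name.startsWith("set") || name.startsWith("is");
    }
}
```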
How does all this work? I have a small project, and I run a compilation in it:
I see that it doesn't even compile:
This is because this project contains code that violates the rules:
Unlike the previous example, this check is embedded right into the development environment and into continuous integration - that is, it covers a larger part of the CI/CD loop.
Nightmare level
Once you have played enough at the previous levels, you want something more.
Since Myers wrote his book, a great many tools for measuring code coverage have appeared - for practically every programming language. Here I list only those I considered popular by the number of references to them on the Internet (you may say that this is the wrong way to judge - I agree with you):
Jacoco, Cobertura - Java;
OpenCover - .NET;
Coverage - Python;
SimpleCov - Ruby;
OpenCppCoverage - C++;
cover, gocov - Go.
In some programming languages - this was a surprise to me - for example, in Python and in Go, tools for measuring test coverage are built into the language toolchain itself.
These tools exist and, moreover, they integrate with development environments: we see that wonderful little strip on the left showing that this piece of code is covered by unit tests (green) and that one is not (red).
Looking at this in the context of unit tests, I want to ask: why can't the same be done with integration or functional tests? In some places it can!
But besides tests, we have users. We can test anything we like (as long as we are not testing garbage), yet users keep pressing in one and the same place, because they use it 95% of the time. So why not draw the same beautiful stripes for the code that is or is not actually used?
In fact, this can be done. Let's see how.
Imagine that I am the tester of this application, and it lands on me for regression testing (“Urgent! We are on fire, we are doing a mega launch, we need to check what works and what does not”). I go through all the manipulations with it - everything works, and we ship the release. The release is successful; all is well.
Six months pass - and the situation repeats itself. Over those six months the developers have changed something. What exactly, I do not know. Whether I can find out is a separate question. The main thing is: what code is called now? Have I checked everything by pressing a single button, or not everything? Clearly not everything - but did I miss anything important?
You can answer these questions if, alongside the application, you launch an agent that records its coverage.
I used Jacoco. You can take any agent - the main thing is that you can later make sense of what it has recorded for you. As a result of the agent's work, we get a jacoco.exec file:
From this file, the application's source code, and the application binaries, you can build a report that shows how it all works.
I have a small script that analyzes this file and creates an html folder.
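The script itself is not the point; the same analysis step can be sketched directly against the JaCoCo API (the paths are illustrative, and the talk's script generates HTML rather than printing to the console):

```java
import java.io.File;
import org.jacoco.core.analysis.Analyzer;
import org.jacoco.core.analysis.CoverageBuilder;
import org.jacoco.core.analysis.IClassCoverage;
import org.jacoco.core.tools.ExecFileLoader;

// The application was started with the JaCoCo agent attached, e.g.:
//   java -javaagent:jacocoagent.jar=destfile=jacoco.exec -jar app.jar
// This sketch reads the resulting jacoco.exec and prints per-class line coverage.
public class CoverageReport {

    public static void main(String[] args) throws Exception {
        ExecFileLoader loader = new ExecFileLoader();
        loader.load(new File("jacoco.exec"));

        CoverageBuilder coverage = new CoverageBuilder();
        Analyzer analyzer = new Analyzer(loader.getExecutionDataStore(), coverage);
        analyzer.analyzeAll(new File("target/classes"));   // compiled application classes

        for (IClassCoverage cls : coverage.getClasses()) {
            System.out.printf("%s: %.0f%% of lines covered%n",
                    cls.getName(), 100 * cls.getLineCounter().getCoveredRatio());
        }
    }
}
```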
The script produces this report:
During testing I exercised some things by hand and some I did not - in varying percentages. But since we are not ashamed to look into the white box and see what happens inside the application, we know exactly where we still need to poke.
In this report, the green lines are the ones I exercised; the red ones are those I did not.
If we read this code more or less thoughtfully (even without delving into what happens inside), we can see that I did not exercise any of the code that handles network failure. Nor did I check the cases of receiving a bad status code (say, that we are not authorized to request this organization's repositories).
To check the network-failure case, you can either actually bring the network down or use fault injection; and to get a status code other than 200 - for example, 401 - you can write another fault-injecting implementation and drop it into the application's directory.
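As a sketch of that last idea - the interface and class names here are hypothetical, and how the substitute gets wired in depends on the application - a fault-injecting implementation might look like this:

```java
import java.io.IOException;

// Hypothetical client interface the application uses to fetch repositories.
interface RepoClient {
    int fetchRepositories(String organization) throws IOException;   // returns an HTTP status
}

// Fault-injecting substitute dropped in instead of the real implementation:
// it simulates exactly the two paths the coverage report showed as unexercised.
class FaultInjectingRepoClient implements RepoClient {

    @Override
    public int fetchRepositories(String organization) throws IOException {
        if ("forbidden-org".equals(organization)) {
            return 401;                                   // bad status code instead of 200
        }
        throw new IOException("connection reset");        // simulated network failure
    }
}
```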
Trying to answer the questions of what our tests actually test, where our users actually click, and how the one relates to the other, we at Odnoklassniki created a service that puts all of this together. Ours is a user-facing service: we could test some forgotten corner of our large portal where nobody ever goes, but what would be the value of that?
At first we called it Cover. But then, thanks to a typo by one of our engineers, we renamed it KOVYOR (Russian for “carpet”).
KOVYOR knows about our software development cycle - in particular, when to turn coverage measurement on, when to turn it off, and when to generate reports from it. And KOVYOR lets us compare reports: what there was last week against this week, or what the autotests exercised against what people exercised by hand.
It looks like this (these are real screenshots from KOVYOR):
We get a side-by-side comparison of the same code. On the left are the autotests, on the right are the users. Red highlights what was not exercised, green what was (in this case, the autotests exercise this particular piece of business logic much better than the users do).
As you would expect, everything is configurable: left and right can be swapped, and so can the colors.
As a result, we get a fairly simple 2x2 matrix characterizing the code:
Where we have both autotest coverage and user coverage, the two need to be compared - that is what KOVYOR does. Where there is autotest coverage but no users, you need to think carefully. On the one hand, it may be dead code - a very big problem in modern development. On the other hand, it may be functionality that is used only in extraordinary circumstances (account recovery, unblocking, backup, restoring from a backup - things that are rarely invoked).
Where there are users but no autotests, you obviously need to write tests covering those places and strive for the reasonable, the good, the eternal. And where there are neither autotests nor users, you should first add some metrics and verify that this code really is never called. After that, remove it mercilessly.
Code coverage tools already exist; you just need to integrate them into your setup. With them you can:
introspect manual testing;
obtain a quality benchmark for your autotests;
find dead code and dead features.
Meta-information
There is a classic mathematical problem about packing a knapsack: how to fit all your things into it so that everything goes in and as much space as possible is left over. I think many of you have heard of it. Let's look at it in the context of testing.
Suppose I have 10 autotests. They look like this:
In reality, each autotest takes a different amount of time to run. So at any given moment they look like this:
And we have two resources on which we run them:
Resource for running tests No. 1
Resource for running tests No. 2
It does not matter what these are - Jenkins slaves, virtual machines, Docker containers, phones - anything.
If we take these 10 tests and spread them evenly across the two resources, we get the following picture:
This picture is neither good nor bad, but it has one feature: the first resource sits idle for a long time while testing on the second is still under way.
Without changing the number of tests on each of these resources, you can simply regroup them and get a picture like this:
Five tests remain on each resource, but the idle time has shrunk - we saved roughly 20% of the testing time. When we first got this optimization into our hands, it really did save us 20%. That figure is not plucked from thin air; it comes from practice.
If you look at this pattern more broadly, test execution speed is always a function of how many resources and how many tests you have. You have to balance the two and optimize somehow.
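A minimal sketch of such balancing - a greedy “longest test onto the least-loaded resource” heuristic, with made-up durations:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Greedy balancing: take tests longest-first and assign each one
// to whichever resource currently has the smallest total load.
public class TestBalancer {

    public static List<List<String>> balance(Map<String, Integer> durations, int resources) {
        List<List<String>> assignment = new ArrayList<>();
        long[] load = new long[resources];
        for (int i = 0; i < resources; i++) assignment.add(new ArrayList<>());

        durations.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .forEach(test -> {
                    int least = 0;
                    for (int i = 1; i < resources; i++) {
                        if (load[i] < load[least]) least = i;
                    }
                    assignment.get(least).add(test.getKey());
                    load[least] += test.getValue();
                });
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical test run times in seconds.
        Map<String, Integer> durations = Map.of(
                "loginTest", 120, "searchTest", 45, "paymentTest", 300,
                "profileTest", 60, "feedTest", 200);
        System.out.println(balance(durations, 2));
    }
}
```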
Why is it important?
Because circumstances are never equal. Suppose someone runs up to your continuous integration server and says the tests must be run as fast as possible - check this fix, and do it right now.
You can give in to this person and hand over every available resource for running their tests.
But the truth may be that their fix is not so important compared to the current release, which is due to roll out in two hours. That is the first thing.
And second, in reality you have far fewer resources than tests. The picture I showed earlier, with 10 tests and two resources, is a very big simplification. There may be 200 resources and 10 thousand tests. And this game of deciding who gets how many resources starts to affect everyone.
To play this game correctly, you must always have answers to two questions: how many resources you have to run on, and how many tests you have.
If you think long enough about how many resources and how many tests you have (especially about the latter), sooner or later you will come to the conclusion that it would be nice to parse the code of your tests and understand what is going on in it:
This thought may seem crazy to you, but do not dismiss it right away. All development environments already do this in order to show you those hints:
And they parse not just the code but all of its dependencies. They know how to do this well. Moreover, some even provide libraries that let you solve such problems in literally six lines (at least for Java).
In those six lines you fully parse a piece of code, and you can extract any meta-information from it: how many fields, methods, and constructors there are - anything, including tests.
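The talk does not name the library, but one well-known candidate is JavaParser; here is a sketch of pulling meta-information out of a test class (the file path is illustrative):

```java
import java.nio.file.Paths;
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.body.MethodDeclaration;

// Parse a test source file into an AST and list the names of its @Test methods.
public class TestMetaExtractor {

    public static void main(String[] args) throws Exception {
        CompilationUnit unit =
                StaticJavaParser.parse(Paths.get("src/test/java/LoginTest.java"));

        unit.findAll(MethodDeclaration.class).stream()
            .filter(m -> m.getAnnotationByName("Test").isPresent())
            .forEach(m -> System.out.println(m.getNameAsString()));
    }
}
```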
With all this in mind, we created a service called BERRIMOR.
BERRIMOR knows how to say “Porridge, sir!”, and it also knows how to:
download code from Git repositories;
parse the code correctly (and do so regularly);
extract meta-information, namely: count the tests, pull meta-information out of them (tags, disabled tests), and determine the owners of the tests.
BERRIMOR serves all of this data to the outside world.
I could show you the BERRIMOR interface, but there would be nothing for you to see there. All of its power lies in the API.
Social Code Analysis
In 2010 I read Sergey Archipenkov's lectures on software project management, and I remember this quote:
"... the reality that lies in the special specifics of the production of programs, as compared with any other production activity, because what programmers produce is intangible, these are collective mental models recorded in a programming language " (Sergey Archipenkov, Lectures on software project management, 2009).
The key word is collective. People have handwriting, and not everyone's is good. Programmers have handwriting too (and it is also not always good). And there are interconnections between people: someone writes a feature, someone patches it, someone fixes it. These dependencies exist within every team, within every development organization - and they influence the quality of what happens in the project.
Social code analysis is an emerging discipline. I have picked out three publicly available videos that can help you understand what it is. Social code analysis lets you:
find implicit links in the code. When you change a class and its test together, that is an explicit link in the code, and it is normal. But when you change a class, its test, and something else besides - and this happens every time - that is an implicit link in the code;
find hot spots in the code - the places that are most often fixed, changed, and broken;
find dead code and dead features. In 2017 it looks very strange for code to have been written back in 2013-2015 and never to have changed since. Either it is perfect and works well - and the metrics will show that - or it is dead;
and if you know what technical debt looks like in your code, you can find that too.
A little more about technical debt. I have a weak hypothesis about technical debt:
take an abstract project in a vacuum with a bug tracker (issue tracker); the tracker holds all the bugs and tasks, each with some kind of ID;
there is a version control system - Git, in the simplest case. Git has commits, and commits have messages in which people write links to task IDs.
My hypothesis is that the files in Git that most often change because of bugs are the places where technical debt accumulates.
Here at Odnoklassniki it looks like this:
When I write something and commit it, I put a link to the Jira ticket in the commit message. Because of the NDA I cannot show you social code analysis on the Odnoklassniki repositories, so I will show it on the open-source project Kafka.
Kafka has an open issue tracker and an open repository with its code:
So, I have a small utility application that pulls up all the commits in this repository and parses their messages, searching with the regular expression Pattern.compile("KAFKA-\\d+") for commits that refer to some ticket.
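The utility itself is linked in the original; the same idea can be sketched with JGit (the path to the local clone is illustrative):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.diff.DiffEntry;
import org.eclipse.jgit.diff.DiffFormatter;
import org.eclipse.jgit.revwalk.RevCommit;
import org.eclipse.jgit.util.io.DisabledOutputStream;

// Walk the commit history, find ticket references in commit messages,
// and index which files changed under which tickets.
public class CommitScanner {

    private static final Pattern TICKET = Pattern.compile("KAFKA-\\d+");

    public static void main(String[] args) throws Exception {
        Map<String, List<String>> fileToTickets = new HashMap<>();
        int total = 0, withoutTicket = 0;

        try (Git git = Git.open(new File("kafka"));
             DiffFormatter diff = new DiffFormatter(DisabledOutputStream.INSTANCE)) {
            diff.setRepository(git.getRepository());

            for (RevCommit commit : git.log().call()) {
                total++;
                Matcher m = TICKET.matcher(commit.getFullMessage());
                if (!m.find()) { withoutTicket++; continue; }

                if (commit.getParentCount() == 0) continue;   // skip the root commit
                for (DiffEntry entry : diff.scan(commit.getParent(0), commit)) {
                    fileToTickets.computeIfAbsent(entry.getNewPath(), k -> new ArrayList<>())
                                 .add(m.group());
                }
            }
        }
        System.out.printf("commits: %d, without a ticket: %d%n", total, withoutTicket);
    }
}
```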
The console shows that there are 4246 commits in total, and 1562 of them carry no such mention. In other words, the analysis is about a third less accurate than we would like.
Then we take each commit and build an index from it: which files changed in it, and under which ticket. We combine all these indices into one large hashmap: file name - the list of tickets under which this file has changed. Here is what it looks like:
For example, here is the KafkaApis file and, next to it, a huge list of the issues under which it has changed (the API changes frequently).
Then we go to Kafka's issue tracker and determine what kind of issue each change belonged to - was it a bug, a feature, an optimization? At the output we get a small hash that says what each issue is and what its priority is (the ones here are all bugs):
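The tracker query itself can be as simple as one REST call; here is a sketch against the public Apache Jira (the issue key is just an example, and a real tool would parse the JSON properly):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Ask the public Apache Jira for an issue's type and priority.
public class IssueTypeLookup {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://issues.apache.org/jira/rest/api/2/issue/KAFKA-1"
                        + "?fields=issuetype,priority"))
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // The JSON body contains the "issuetype" and "priority" fields for the issue.
        System.out.println(response.body());
    }
}
```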
As a result, we get the following output:
It shows what percentage of the changes in each particular file were bug-related:
For example, for the top line: the total number of tickets that passed through this file in commits is 231, and 128 of them are bugs; dividing 128 by 231 gives 55% - the share of bug-related changes. Most likely, technical debt is concentrated in these files.
Results
I have shown you six different examples. They are not everything that exists. But the point is that the white box is, above all, a strategy. How to implement it on your project, you know best. The main thing: do not be afraid to get into the code. The whole truth about your project is always in there. So read the code, write code, and get involved with the code that programmers write.
If the topics of testing and error handling are as close to your heart as they are to ours, you will certainly be interested in these talks at our May conference, Heisenbug 2018 Piter: