
Reflections on evaluating commits, and on robot programmers



Imagine yourself as a programmer at a company that develops a large, complex product used by a great many people. The product has been on the market for many years and earns the company a lot of money. Perhaps you already are such a programmer. With each development cycle you release a new version of the product and hope that it is better than the previous one. More than that, you hope that with every commit the product you are working on keeps getting better.

How can you tell whether the new version has become better or worse? Or maybe your edit does not affect anything at all? After all, what ultimately matters to the company is how much money the new version of the product will bring.
There are various more or less well-defined metrics with which you can try to measure this "better" or "worse":

  1. The number of lines of code.
  2. How many bugs have been fixed.
  3. How many new features that your users want have been added.
  4. How much more productive the product has become.
  5. How much more convenient the product has become.
  6. How much better the product's output has become, if there is a quality metric for it at all (classification accuracy, ranking quality, etc.).
  7. Various other metrics.

But none of them answers the question posed above.

Now imagine that one day humanity invents a metric that can measure the financial contribution of each commit. Then, next to every edit in the repository's log, you could see a number in rubles or some other currency: how much money this edit earned the company, or how much it lost.

That day will be a black day for all programmers. After all, such a metric is an ideal objective function for training a robot programmer.

But how?


You are lucky (or unlucky) if your company has natural ways to evaluate the financial contribution of an edit. Take, for example, the classification accuracy of license plates in a system that records traffic violations and issues fines. Knowing the number of cars and the probability distribution of violations, you can estimate how many rubles in unpaid fines each percentage point of classification error costs. It is somewhat harder to estimate how much money you lose from incorrectly issued fines. Now suppose you are a developer of such a system. You have a large sample of plate images on which you measure the classifier's quality, and you have made an edit that raised the classifier's recall from 80% to 90% at 100% precision. Multiply a few numbers together and you get the value of your edit in rubles.
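
A back-of-the-envelope sketch of that multiplication in Python, with entirely invented numbers for traffic volume, violation rate, and fine size:

    # All numbers here are hypothetical, purely for illustration.
    cars_per_day = 100_000        # vehicles passing the cameras
    violation_rate = 0.02         # share of passes that are violations
    average_fine_rub = 500        # fine per recorded violation

    recall_before = 0.80          # violations recognized before the edit
    recall_after = 0.90           # violations recognized after the edit

    # Precision is 100%, so every extra recognition is a valid fine.
    extra_fines_per_day = cars_per_day * violation_rate * (recall_after - recall_before)
    edit_value_rub_per_day = extra_fines_per_day * average_fine_rub

    print(f"Extra fines per day: {extra_fines_per_day:.0f}")              # 200
    print(f"Value of the edit:   {edit_value_rub_per_day:,.0f} RUB/day")  # 100,000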

But you may work at a company that develops a website or a mobile app whose main income comes from ad impressions and clicks. Here it is harder to assess the contribution of a particular edit, but you can try to evaluate a whole version of the product at once. For example, with A/B testing you can estimate how many more ad clicks the new version will collect, and thus get the value of the new version in rubles.
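
A sketch of the same arithmetic for the A/B case, with invented click counts and an invented price per click (a real experiment would also need a significance test):

    # Invented numbers: control group on the old version, treatment on the new.
    users_a, clicks_a = 1_000_000, 30_000
    users_b, clicks_b = 1_000_000, 33_000
    price_per_click_rub = 10
    monthly_audience = 20_000_000

    ctr_a = clicks_a / users_a    # 0.030
    ctr_b = clicks_b / users_b    # 0.033

    version_value_rub = (ctr_b - ctr_a) * monthly_audience * price_per_click_rub
    print(f"Estimated value of the new version: {version_value_rub:,.0f} RUB/month")
    # -> 600,000 RUB/month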

Such an assessment is even harder if you ship a heavyweight desktop product where each version has a long life cycle. But, in theory, you can devise a technique similar to A/B testing and check which version sells better.

We need more genetics


In the last two cases above, we obtained a value for a whole product version. But how do we get from there to values for individual commits? Several approaches come to mind:

  1. Build after each commit and compare the "before" and "after" versions with each other. The value of the edit is then the difference between the values of the two versions.
  2. Take a stable version and try to build it with some random commit from the development branch. If the build succeeds, determine the value of the edit as in the previous point. Alternatively, take the development version and throw out some random edit.
  3. You can even take random subsets of edits, try to build with each subset, and compare the results with each other or with some fixed versions. To value a particular edit, take the difference in the values of the versions, add a bonus to the edits that ended up in the "winning" version, and add a penalty to those in the "losing" one.
  4. Finally, you can set up a genetic algorithm that crosses and mutates different subsets of edits, and determine which edits most often lead to "wins".

In all of these schemes, many builds will fail to compile. In that case you can, for example, additionally penalize the edits that were included in the failed builds, as in the sketch below.
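
A rough sketch of the bookkeeping from points 3 and 4, where try_build and measure_value_rub are hypothetical stand-ins for your CI system and whatever valuation you have for a whole build:

    import random
    from collections import defaultdict

    def score_edits(edits, baseline_value_rub, try_build, measure_value_rub,
                    trials=1_000, failed_build_penalty_rub=100.0):
        """Attribute value to edits by sampling random subsets of them.

        try_build(subset) is assumed to return a build, or None on failure;
        measure_value_rub(build) is assumed to return the build's value.
        """
        scores = defaultdict(float)
        for _ in range(trials):
            subset = random.sample(edits, k=random.randint(1, len(edits)))
            build = try_build(subset)
            if build is None:                   # the subset failed to compile
                for edit in subset:
                    scores[edit] -= failed_build_penalty_rub
                continue
            delta = measure_value_rub(build) - baseline_value_rub
            for edit in subset:                 # split the win or loss evenly
                scores[edit] += delta / len(subset)
        return scores

A genetic algorithm, as in point 4, would replace the uniform random sampling here with crossover and mutation of the best-scoring subsets.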

And then nothing prevents you from going even further and moving from valuing a commit to valuing a specific line of code, for example by dividing the value of a commit among all the lines it touches.

KPI


Now you can set up an automated system that attaches its value to every commit in your repository. Perhaps not immediately, but with whatever delay is needed for testing.
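
One minimal way to store such values without touching the commits themselves is git notes; here is a sketch in Python, where the note format ("value: … RUB") is just an invented convention:

    import subprocess

    def annotate_commit(sha: str, value_rub: float) -> None:
        """Attach an estimated value to a commit in a dedicated notes ref."""
        subprocess.run(
            ["git", "notes", "--ref=value", "add", "-f",
             "-m", f"value: {value_rub:+,.0f} RUB", sha],
            check=True,
        )

    # The values then show up next to each commit in the log:
    #     git log --notes=value --oneline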



So we have the manager's dream! We open the repository log and see which developers have earned the company serious money, and which have cost it money. We can pay out bonuses and dismiss the losers. Even better, we can build a pyramid inside the company, where every employee receives a share of the value of all their own edits and their subordinates' edits. Long live network marketing!

Let's also add links between commits and tasks. Then we can evaluate the financial effect of implementing each task, and add not only developers to the pyramid but also analysts and other managers. In the same way, nothing prevents us from adding testers, scored by how many bugs they found in the most valuable places in the code.

Robot programmer


Suppose someone really did manage to come up with an adequate function for valuing code in rubles. Imagine what happens if we start applying machine learning methods to a sample of source code, with the value of each line as the target variable.

We can train a classifier or regressor that predicts whether an edit is financially positive, and by how much. We can build a plugin for your favorite IDE that underlines the line you just wrote with a red squiggle and warns you that "with probability 98.7%, this edit will earn you a $42 penalty". Or a plugin for the version control system that issues similar warnings before each commit.
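
A toy sketch of such a classifier with scikit-learn, assuming a historical dataset of diffs labeled by the sign of their measured value (real features would of course be much richer than bag-of-words over diff text):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Tiny invented training set: diff texts and whether they lost money.
    diffs = [
        "+ if user is None: return None",
        "+ cache.clear()  # on every request",
        "+ retries = 3; timeout_s = 5",
        "- release(lock)",
    ]
    loses_money = [0, 1, 0, 1]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(diffs, loses_money)

    new_diff = "+ cache.clear()"
    p = model.predict_proba([new_diff])[0][1]
    print(f"With probability {p:.1%}, this edit will cost the company money")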

Worst of all, neural networks can already generate code simply by looking at existing source code. If you take such a network and give it the goal of generating the most valuable code, why do we need programmers at all? Of course, you can object that something similar has already been tried with genetic programming and did not get far. But at the current pace of progress in machine learning, no one can be sure that such robots will not appear within your lifetime.

Afterword


One can hope that in the "first wave", when the first robot programmers appear, they will not yet be able to replace those who write them. At least not at first. So it is time to start learning machine learning. Fortunately, the Internet is full of free materials, including some in Russian.

Source: https://habr.com/ru/post/302422/

