⬆️ ⬇️

Telemetry and software





About 6 years ago, I participated in a project for the manufacture of hardware and software for one large North American medical company. Standing near the test stand, in which there were several devices under load, I asked myself the question: “If something goes wrong, how can we speed up the search and correction of the error?”



From the moment this issue arose to this day, a lot has been done, and I would like to share with you how the collection and analysis of telemetry in software and hardware has significantly reduced the time of error detection and correction in the whole spectrum of projects in which I participated.



Introduction



Telemetry comes from the ancient Greek τῆλε "far" + μέτρεω - "measured."

Everything is very simple, any measurements that a staff of various engineers and possibly scientists can invent, the target system sends to the processing center for visual and automatic control and processing.

')

Something like this:







When on the server side it might look like this:







Prehistory



Somehow, watching the work of our QA engineers, I wondered why complex devices like satellites, rockets, machines have telemetry, and we, creating, in fact, the program parts of the operating rooms, robots, complex software solutions, do not even think about it direction?



The amount of code is enormous, and there are ways to understand that something went wrong less than fingers on one hand:





In parallel, watching the work of our QA engineers, I looked at one of the screens, which the patient's heartbeat curves, pressure, temperature and heaps of other parameters ran along and I wanted our products to be like this patient - under reliable supervision. So that even if something goes wrong, it was possible to rewind the time and see what circumstances happened.



Bike or ride



As we were taught: bicycles are informative and fascinating, but first look for existing solutions, which I did.



By good tradition, he began with the requirements:





I searched for a long time and thoughtfully ... but alas, everything was not just bad, everything was terrible!



At the beginning of 2011, not a single project that I found, even close, even half of the requirements, came under these requirements.



Telemetry for software in the form of ready-made and open solutions was almost absent as a class, the big players did everything for themselves and were not in a hurry to share.



The second surprise was the reaction of colleagues - indifference or at worst rejection, but, fortunately - it did not last long, until the first results.



The only solution I found at that time (2011) was the P7 library located at the time on google code. The functionality was poor, from the platforms there was only X86, it was difficult to look at the server without tears, but there were also advantages:





After a series of thoughts, studying, it was decided to try to ride a bike on someone else.



First step



Embedding the library in our code was easy and without problems, but then the question arose: what beautiful graphics we want to see and what readings to record? It just seems that the question is simple, in fact - it is complicated and cunning.



At first, and without experience, we began to write a relatively insignificant amount of telemetry:





Unfortunately, the screenshots of those years have not survived, and I will give the closest approximation:







The very first baptism of fire gave excellent results: after a couple of days of inconspicuous work and the reproduction of several bugs, we were finally able to understand the nature of many of them:





There were some corrected on the first or second day, but after years of prescription, I can no longer remember what it was.



After we enthusiastically with our colleagues poked their fingers at the monitor and asked “¿Qué pasa?”, Found the answer and were happy as children - the utility question was no longer there, we got a new toy and wanted to play further.



Go to run



After the first success, we began to consistently increase the number of required meters:





Next, we divided the counters into 3 groups:





And the final step was to update the testing process and implement the practice, that the error recorded by QA should accompany not only the formal description, if possible, the playback method and log files, but also telemetry.



Conclusion



As a conclusion, let me present a few facts:





This article is rather superficial and leaves behind many technical issues “how to get the CPU load by my stream”, “how to make a cyclogram of my stream”, “how do you make plugins” and so on and so on. But if the topic is interesting enough, you can make separate articles on all these points.



I would like to hope that this article will allow you to ask the same question as “will telemetry help our product?”, Then I can say that I wrote it for good reason, as this question rarely sounds in the software industry, there is an opinion that this is a lot of space and defense.



Thank you for reading!



PS: I deliberately did not talk about our company and I will not - this article is not about her.

PPS: if you are interested, we use the Jenkins + Baical + P7 bundle ( www.baical.net ), it fits well for our needs, the author of the project has implemented, at our request, not one or two improvements, in addition, we use P7 for logging ( https://habrahabr.ru/post/313686/ )

Source: https://habr.com/ru/post/340882/



All Articles