Event Tracing for Windows. High-speed transmission of driver debug messages over the network
Debugging techniques differ: some people dig into the debugger, some meditate while waiting for enlightenment, and some frantically change the code and hope for luck. But almost no one will turn down a file that records the last moments of a process's life: what happened, in which threads, on which cores, and at what time. Carefully and pedantically saved debug information can save many working hours, especially when debugging a driver and the hardware it works with. And when an error is random and reproduces on 1 system out of 20 over the course of a week, meditation without debug information can drag on. This article focuses on utilities that intercept debug messages from drivers running on several machines simultaneously and send them to a server for storage and analysis.
Background
Event Tracing for Windows (ETW) first shipped in Windows 2000 and has been a faithful companion of every subsequent operating system, gradually spreading into more and more software, from drivers to user applications. It is hard to call it a breakthrough, but the highlight of ETW is speed. Tens of thousands of debug messages per second from the depths of a driver, with a negligible effect on performance, leave room for an engineer's imagination. Add the detail and accuracy of the information received, and life begins to seem cloudless. Inspired by this feeling, we started adopting it. The difficulties began almost immediately, and their name was TraceView. This utility, designed to receive and view debug messages, stubbornly refused to do what it was created for. Message delivery was worked out well enough, with ETW easily swallowing tens of thousands of messages, but receiving them did not go so well. It seemed that TraceView had been written by a lone engineer in the depths of the corporation for his own needs and had ended up in the DDK by accident.
Problem
After yet another loss of the results of long-running tests, we decided to classify all the difficulties we had faced and to look for an alternative that would suit us. The list of reasons follows:
Working with large files (a file counts as large once it exceeds 25 megabytes). Opening such a file takes considerable time;
Loss of service data. ETW regularly forgets to write the service header when saving a file. Having found such a file later, you have to meditate over a hex editor for a long time, restoring the data. This is especially frustrating when the file contains information about an error that could not be reproduced for several days. This defect was promised to be fixed in Windows 8;
In real-time mode the message viewer holds only 65 thousand messages. At first glance this is a lot, but if the driver sends at least 1 thousand messages per second, all you can see is about 1 minute of history;
Copying message text. A message can only be copied as a whole; there is no way to work with parts of the text;
Search. There is none;
Highlighting is configured through the filtering window, and a color is chosen by its name, for example "Gray - 50";
Filtering. Applying a filter to previously saved data is not a trivial task;
Profiles. Creating a new profile is tedious, time-consuming, and overloaded with settings. Profiles cannot be copied using the utility.
The search for an alternative gave no results. ETViewer came closest to our requirements, but it did not cover everything we needed; besides, work on it had been abandoned, and at the moment its list of developers is empty. Hopes for a quick solution to our problems were dashed. It took time to come to terms with this, during which we kept using TraceView while accumulating strength and ideas.
Solution
The idea of in-house development was met by the management of our R&D department with understanding and enthusiasm, but after we drew up a task list and a rough estimate, it became obvious that we had no such free resources: about 6 person-months for only the most basic functionality. And, as you know, appetite comes with eating. Despite this series of "failures", I could not drop the idea and forget it. Thus began nearly a year of evenings in which, coming home from one job, I started on another. This work resulted in the following applications and libraries (the source code is distributed under the Apache 2.0 license):
Baical - a server framework application to which extension modules (providers, storages, viewers) are connected;
Angara - a client application that manages the ETW session, receives messages, and forwards them to Baical;
P7.Trace - a C++ library that processes the debug messages of the application under development and transfers them to Baical. In essence, it is an ETW replacement that works on both Windows and Linux.
What is the highlight of all this? As with ETW, it is speed: millions of debug messages per second on modern home PCs. For example, the pairing of an i7-870 and a gigabit network can swallow more than 2 million messages per second in the following format:
Trace("Here is my trace and here are my values: %d %d", iVal1, iVal2);
In addition to the format string and its parameters, the following data is also transmitted:
Severity level (trace, info, warning, error, etc.);
Time, accurate to 100 nanoseconds;
The number of the processor on which the code was executed;
ID of the current thread;
Module ID;
Function name, file name, and line number in the file;
Message sequence number;
Information about the process (name, ID, creation time).
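To make the list above more concrete, here is a minimal C++ sketch of what such a per-message record might look like. All type and field names (Level, TraceRecord, ProcessInfo, toTicks100ns) are hypothetical and chosen for illustration only; this is not the actual P7 wire format.

```cpp
#include <chrono>
#include <cstdint>
#include <ratio>
#include <string>

// Hypothetical severity levels; only these four are named explicitly in the article.
enum class Level : std::uint8_t { Trace, Info, Warning, Error };

// Hypothetical per-process information, sent once rather than with every message.
struct ProcessInfo {
    std::string   name;
    std::uint32_t id;
    std::uint64_t creationTime100ns;
};

// Hypothetical per-message metadata mirroring the list above.
struct TraceRecord {
    Level         level;
    std::uint64_t time100ns;  // timestamp in 100 ns ticks
    std::uint32_t processor;  // CPU the code was running on
    std::uint32_t threadId;
    std::uint32_t moduleId;
    std::string   function;
    std::string   file;
    std::uint32_t line;
    std::uint32_t sequence;   // increases by one per message
};

// A duration type whose tick is 100 ns (ten million ticks per second).
using Ticks100ns = std::chrono::duration<std::uint64_t, std::ratio<1, 10000000>>;

// Convert a non-negative duration to 100 ns ticks, truncating any remainder.
inline std::uint64_t toTicks100ns(std::chrono::nanoseconds ns) {
    return std::chrono::duration_cast<Ticks100ns>(ns).count();
}
```

With a 100 ns tick, one second corresponds to 10,000,000 ticks, so a 64-bit counter is enough for any practical session length.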
On the Baical side, it all looks like this:
The second most important convenience is the centralized collection of debug messages. Here is an example: the company I work for develops both its own hardware and almost all layers of software for it. During development and testing, a huge amount of time is spent observing a zoo of assorted hardware, drivers, and software. The ability to monitor vital signs centrally and, in the event of a failure, to reconstruct a detailed picture of what happened is therefore highly valued. Unfortunately, or perhaps fortunately, I cannot fit all the information about every component into one article because of time constraints. An overview of what is under the hood of ETW is also left out, but if there is interest, that gap can be filled with another article.
Angara
So my story has reached the part for which it was started. In this chapter I will explain how to configure and run Angara. It is assumed that you are already familiar with ETW and that your driver is already instrumented with trace messages.
ETW session options
The starting point of every ETW session is the Control GUID. One of your source files should contain lines similar to these:
//{63423ADB-D156-4281-BA74-8ACAEDDBD810}
#define WPP_CONTROL_GUIDS \
    WPP_DEFINE_CONTROL_GUID(ProviderGuid, (63423ADB, D156, 4281, BA74, 8ACAEDDBD810), \
        WPP_DEFINE_BIT(DBG_PROVIDER)   /* bit 0 = 0x00000001 */ \
        /* You can have up to 32 defines. If you want more than that, \
           you have to provide another trace control GUID */ \
    )
We are interested in the GUID passed to WPP_DEFINE_CONTROL_GUID, which can be written in the form {63423ADB-D156-4281-BA74-8ACAEDDBD810}. Remember it; it will be needed later.
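The conversion from the five comma-separated fields of the macro to the braced form is purely mechanical. As an illustration, the following sketch formats the fields into the braced string; the function name formatControlGuid is hypothetical and is not part of WPP or Angara.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Format the five fields of a control GUID tuple as "{8-4-4-4-12}" hex digits.
// The last field is 48 bits wide, so it is passed as a 64-bit integer.
std::string formatControlGuid(std::uint32_t d1, std::uint16_t d2, std::uint16_t d3,
                              std::uint16_t d4, std::uint64_t d5) {
    char buf[39];  // 38 characters plus the terminating NUL
    std::snprintf(buf, sizeof(buf), "{%08lX-%04X-%04X-%04X-%012llX}",
                  static_cast<unsigned long>(d1),
                  static_cast<unsigned>(d2),
                  static_cast<unsigned>(d3),
                  static_cast<unsigned>(d4),
                  static_cast<unsigned long long>(d5 & 0xFFFFFFFFFFFFULL));
    return std::string(buf);
}
```

For example, formatControlGuid(0x63423ADB, 0xD156, 0x4281, 0xBA74, 0x8ACAEDDBD810ULL) yields the braced string quoted above.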
The second most important element required from your driver is the Trace Message Format (TMF) file. It describes the format of the debug messages and contains additional information about each message. The easiest way to get this file is to run the Tracepdb.exe utility on the driver's symbol (PDB) file after a successful build.
The Tracepdb.exe utility comes with the DDK; the default path is C:\WinDDK\{Version}\tools\tracing\
Settings. Provider
So, armed with the Control GUID and the TMF file, it is time to edit the Angara configuration file. It is named simply "Angara.xml" and sits in the same folder as the executable files. Open it in any editor that understands UTF-16 and start editing.
The first section of interest is dedicated to the ETW provider:
TMF - path to the folder with the TMF files that you generated;
GUID - the same Control GUID;
Important but optional parameters are "Flags" and "Level"; they allow currently unwanted messages to be filtered on the driver side, reducing the load on ETW and on the system as a whole. The remaining optional parameters are described in sufficient detail in the documentation and in the XML file itself, so they are not repeated here.
Settings. Network
The next section is dedicated to the network connection:
There is only one mandatory parameter, Address: enter the IP address (IPv4 / IPv6) or the host name of the PC on which Baical is running. Optional parameters are described in detail in the documentation.
Settings. Levels
The last section is a map that converts the ETW message level into a level that Baical understands:
You need to edit this section only if you have redefined the level macros in your driver's source code, for example like this:
// Define debug levels
#define NONE     0 // Tracing is not on
#define FATAL    1 // Abnormal exit or termination
#define ERROR    2 // Severe errors that need logging
#define WARNING  3 // Warnings such as allocation failure
#define INFO     4 // Includes non-error cases such as Entry-Exit
#define TRACE    5 // Detailed traces from intermediate steps
#define LOUD     6 // Detailed trace from every step
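Conceptually, such a level map is just a lookup table from the driver's numeric level to a server-side level. The sketch below illustrates the idea in C++. The ServerLevel enum, the mapLevel function, and the particular mapping chosen (for instance FATAL to Critical, LOUD to Trace) are assumptions made for illustration; they are not taken from Angara's source or its XML format.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical server-side levels, loosely following the levels named in the article.
enum class ServerLevel { Trace, Info, Warning, Error, Critical };

// Map a driver-defined level (0..6, as in the defines above) to a server-side
// level. Unknown values fall back to Trace so that no message is dropped.
ServerLevel mapLevel(std::uint8_t driverLevel) {
    static constexpr ServerLevel table[] = {
        ServerLevel::Trace,     // 0 NONE    (assumed mapping)
        ServerLevel::Critical,  // 1 FATAL
        ServerLevel::Error,     // 2 ERROR
        ServerLevel::Warning,   // 3 WARNING
        ServerLevel::Info,      // 4 INFO
        ServerLevel::Trace,     // 5 TRACE
        ServerLevel::Trace,     // 6 LOUD    (assumed mapping)
    };
    constexpr std::size_t count = sizeof(table) / sizeof(table[0]);
    return driverLevel < count ? table[driverLevel] : ServerLevel::Trace;
}
```

A table keeps the mapping in one place, so redefining the driver macros only requires changing the corresponding rows.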
In this case, the XML section will look as follows:
That is all: the configuration file is ready, and you can proceed to launch. The order in which you start Baical and Angara does not matter, but it is reasonable to start Baical first so that it can accept all sent messages. Just do not forget to politely ask your firewall not to block traffic between Angara and Baical. Launching Angara is very simple:
NB: On older OSs (Windows 2000, XP), an ETW session cannot work stably without an intermediate file, so when starting Angara on these OS versions, keep in mind that it will try to create several temporary files with a total size of 250 MB to 1 GB; if that fails, the process will exit.
Appearance
The whole graphical interface, or rather the lack of one, can be described with the following picture:
In addition to static information about the session, there are 12 different counters, a rather large and, at the moment, excessive amount of information. They were needed while the program was being actively debugged and made it possible to pinpoint the problem area; now they are a historical relic, kept for technical monitoring. The counters are divided into 3 groups:
Main statistics (Angara):
Send - the number of sent messages
Rejected - total number of messages rejected for sending (for all reasons)
Unknown - the number of unidentified messages; this counter increases each time a message arrives whose description cannot be found in the TMF files. Possible reasons: the required TMF file is missing or outdated (it was not regenerated after the driver was compiled)
TPS - estimated number of messages sent per second
Network statistics (P7):
Free mem. - the percentage of free internal memory.
Rej. No connect. - number of messages rejected for sending due to lack of connection with the server (Baical)
Rej. No Memory - number of messages rejected for sending due to lack of memory
Rej. Internal - number of messages rejected for sending due to internal failures
ETW statistics:
Traces lost - the number of messages lost inside ETW
Buffers lost - the number of buffers lost inside ETW
RealT. Buf. lost - the number of real-time buffers lost inside ETW
Broken Seq. - the number of ETW losses not accounted for by the counters above; detection assumes that the sequence numbers of incoming messages increase by exactly one
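The idea behind the Broken Seq. counter can be sketched in a few lines of C++. The SequenceMonitor class below is a hypothetical illustration, not Angara's actual code; it assumes 32-bit sequence numbers and in-order delivery, so a duplicate or reordered message would be miscounted.

```cpp
#include <cstdint>

// Hypothetical gap detector: counts messages that appear to be missing,
// assuming each sequence number is exactly one greater than the previous.
class SequenceMonitor {
public:
    // Feed the sequence number of every received message, in arrival order.
    void onMessage(std::uint32_t seq) {
        if (hasLast_) {
            // Unsigned subtraction keeps the result correct across wraparound
            // from 0xFFFFFFFF back to 0.
            const std::uint32_t gap = static_cast<std::uint32_t>(seq - last_ - 1u);
            lost_ += gap;
        }
        last_ = seq;
        hasLast_ = true;
    }

    // Total number of messages presumed lost so far.
    std::uint64_t lost() const { return lost_; }

private:
    std::uint32_t last_ = 0;
    bool          hasLast_ = false;
    std::uint64_t lost_ = 0;
};
```

For example, feeding the numbers 10, 11, 15 reports three lost messages (12, 13, and 14), while a wraparound from 0xFFFFFFFF to 0 reports none.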
Conclusion
This is where my story ends. I hope the utilities presented here will ease the difficult but creative life of driver and application developers.
PS: The project site offers source code and binaries, as well as English-language help.
PPS: Baical is not a random typo.