In this article, I want to share my observations and recommendations on an important task in software development: logging. There are many articles online describing logging tools, but very little information about which events, and what information, should be recorded in an application's log.
Introduction
A common problem is diagnosing defects in a test or production environment where no development or debugging tools are available. Often the only way to understand the error is to add lines of code that output debugging information and reinstall the application, if such lines were not added beforehand. Is it possible to write code from the start so that the information the application logs is enough to diagnose the problem? This article does not cover logging tools at all. Still, you should be aware that such tools exist and let you filter the data written to the log and route log records to various destinations. The main objective of the article is to explain to developers how logging is done and to recommend where in a program to insert logging statements. The article focuses mainly on tracing.
Logging
I consider logging to be much broader than just writing to a log file. For me, logging is a set of tools and methods that address the following tasks:
Make sure that the system is running and working correctly.
Understand why the system and its data are in their current state.
Find faults quickly.
Learn how to improve the system.
Logging Approaches
These objectives can be made more concrete by identifying the "users" of logging results and the tasks of these "users". Then you can select the tools and methods for accomplishing those tasks. I see four main categories of "users":
Developer - a specialist who develops and improves the application.
Test engineer - a specialist who is responsible for the quality of the application and for detecting and localizing defects during development.
System administrator - a support specialist who is responsible for the smooth running of the application in production and for detecting errors in a timely manner.
Application owner - a business user who knows and understands the application's functionality as a whole and effectively owns the application's data; the employee for whom the application was developed.
The table below lists, for each type of user, the methods and tools most frequently used to solve their tasks.
| Audience | Tasks | Means and methods |
| --- | --- | --- |
| Developer | Find and fix problems; optimize the application | Tracing; performance counters; state of objects/processes |
| Test engineer | Make sure the system works correctly; detect defects; determine the location and cause of found defects as precisely as possible | Event log; audit trail; state of objects/processes |
| System administrator | Make sure the system is running; if there are errors, understand why (whose fault it is, where to fix it); if it is slow, understand why | Event log; performance counters; state of objects/processes |
| Application owner | Make sure the system works exactly as it should | Audit trail; state of objects/processes |
The means and methods indicated in the table are briefly described below.
Tracing is what is usually called "the log": a store into which detailed information about program execution is written sequentially, in the order in which events occur. This is usually a text file or a database table. The developer, test engineer, or support specialist needs it to analyze in detail what is happening in the application.
The event log shows application events from an administrator's point of view, that is, events from which the system administrator can tell whether the application is running. When developing software for Windows, this is most often the Windows Event Log or the application's own logs. I prefer to keep trace stores and event logs separate.
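To illustrate keeping the two apart, here is a minimal sketch using Python's standard logging module (the article itself does not prescribe a tool; the logger names "trace" and "events" and the file paths are hypothetical). Each logger gets its own file handler and does not propagate to the root logger, so trace detail never ends up in the event log.

```python
import logging


def build_loggers(trace_path: str, events_path: str):
    """Create two independent loggers: a detailed trace log and an
    administrator-facing event log, each writing to its own file."""
    trace = logging.getLogger("trace")
    events = logging.getLogger("events")
    for logger, path, level in ((trace, trace_path, logging.DEBUG),
                                (events, events_path, logging.INFO)):
        logger.setLevel(level)
        logger.propagate = False  # keep records out of the root logger
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s %(name)s %(message)s"))
        logger.addHandler(handler)
    return trace, events
```

On Windows, the event-log side could instead use logging.handlers.NTEventLogHandler, which requires the pywin32 package. Calling build_loggers twice would attach duplicate handlers, so it is meant to run once at startup.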
The audit trail is a tool that lets the application's users see who performed (or attempted to perform) which actions in the system.
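An audit trail entry needs little more than who, what, on which object, and with what outcome. The sketch below assumes Python's logging module with a dedicated "audit" logger and one JSON object per entry; the field names are illustrative, not a standard. Denied attempts are recorded too, since the audit trail should also show actions users only tried to perform.

```python
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("audit")  # hypothetical dedicated category
audit_logger.setLevel(logging.INFO)        # audit entries are always recorded


def audit(user: str, action: str, target: str, succeeded: bool) -> dict:
    """Record who performed (or attempted) which action on which object."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "target": target,
        "outcome": "success" if succeeded else "denied",
    }
    audit_logger.info(json.dumps(entry))
    return entry
```

For example, audit("alice", "delete", "invoice/1017", succeeded=False) records a refused deletion attempt.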
Performance counters are a system administrator's tool for detecting performance bottlenecks. An example is Performance Monitor, built into the Windows operating system; other operating systems have similar tools.
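Performance Monitor collects OS-level and application-registered counters; the following is not its API. It is a hypothetical in-process stand-in in Python that keeps a call count and cumulative time per named operation, the kind of figure a counter would expose to an administrator.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process counters: number of calls and total seconds per name.
counters = defaultdict(lambda: {"calls": 0, "total_s": 0.0})


def counted(name: str):
    """Decorator that updates the named counter on every call, even on error."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                c = counters[name]
                c["calls"] += 1
                c["total_s"] += time.perf_counter() - start
        return wrapper
    return decorator


def average_ms(name: str) -> float:
    """Average duration of one call in milliseconds (0.0 if never called)."""
    c = counters.get(name)
    if not c or not c["calls"]:
        return 0.0
    return 1000 * c["total_s"] / c["calls"]
```

In practice such values would be published through the operating system's counter facility or a metrics library rather than kept in a dictionary.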
The state of objects/processes is a tool that shows what state (or stage) objects or processes in the application are currently in, and how they reached that state or processing stage. For example, imagine an application that processes incoming email messages. Each message can be in one of several states: received, processed, deleted. In this case, the object/process state log should record key information about each message, the history of its state changes, and the messages produced while it was being processed. This completely separates the important information about message processing from the "noise".
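The email example could be modeled roughly as follows. The three states come from the text; the allowed transitions, the field names, and the in-memory history list are assumptions (a real system would persist each transition, for example in a dedicated table, separately from the trace log).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional, Tuple


class MailState(Enum):
    RECEIVED = "received"
    PROCESSED = "processed"
    DELETED = "deleted"


# Allowed state transitions (an assumption made for this sketch).
ALLOWED = {
    None: {MailState.RECEIVED},
    MailState.RECEIVED: {MailState.PROCESSED, MailState.DELETED},
    MailState.PROCESSED: {MailState.DELETED},
    MailState.DELETED: set(),
}


@dataclass
class MailRecord:
    """Key facts about one message plus the history of its states."""
    message_id: str
    sender: str
    subject: str
    state: Optional[MailState] = None
    history: List[Tuple[datetime, MailState, str]] = field(default_factory=list)

    def transition(self, new_state: MailState, note: str = "") -> None:
        """Move to a new state and append it, with a note, to the history."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.message_id}: {self.state} -> {new_state} is not allowed")
        self.state = new_state
        self.history.append((datetime.now(timezone.utc), new_state, note))
```

Rejecting invalid transitions means the state log itself flags an inconsistent processing sequence instead of silently recording it.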
Selecting and implementing logging methods is a very important task: it determines how quickly and how well defects are detected and fixed, as well as the quality of maintenance. Therefore, this task should be addressed early, at the planning and development stage, and a sufficient set of logging methods should be chosen.
The purpose of tracing is to find defects in an application quickly in any environment (development, test, or production) by analyzing each step of program execution. Therefore, it makes sense to record in the trace log:
all errors, both handled and unhandled;
startup parameters and the loaded configuration;
the events described below.
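A minimal Python sketch of the first two items: logging startup parameters and configuration once at startup, and routing unhandled exceptions to the log through sys.excepthook. The "Startup" category and the list of key fragments that get masked are hypothetical; masking is a design choice added here so that credentials never reach the trace log.

```python
import logging
import sys
from typing import List

log = logging.getLogger("Startup")  # hypothetical category name
SECRET_MARKERS = ("password", "secret", "token")  # assumption: keys to mask


def log_startup(argv: List[str], config: dict) -> None:
    """Write startup parameters and the loaded configuration to the trace log."""
    log.info("Started with arguments: %s", argv)
    for key in sorted(config):
        value = config[key]
        if any(marker in key.lower() for marker in SECRET_MARKERS):
            value = "***"  # never write credentials to the log
        log.info("Config %s = %r", key, value)


def log_unhandled(exc_type, exc_value, exc_tb) -> None:
    """sys.excepthook replacement: record unhandled exceptions before exit."""
    if issubclass(exc_type, KeyboardInterrupt):
        sys.__excepthook__(exc_type, exc_value, exc_tb)  # default handling for Ctrl+C
        return
    log.critical("Unhandled exception", exc_info=(exc_type, exc_value, exc_tb))


sys.excepthook = log_unhandled
```

Handled errors can be recorded with log.exception(...) inside an except block, which logs at ERROR level and attaches the traceback. Exceptions raised in other threads go through threading.excepthook (Python 3.8+) rather than sys.excepthook.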
Tracing information is intended mainly for developers and test engineers (or, in production, for very highly qualified support staff). A distinctive feature of tracing is that it is usually not described in the requirements, so at the start of a project developers find it hard to imagine what tracing information may be needed and, consequently, what should be recorded and when. The most important thing to understand is that in production, tracing is enabled only when needed, so that it does not flood the logs. In development and test environments, tracing is usually enabled all the time, to monitor that the application works correctly and to aid debugging.
Logging tools usually let you write entries to the log and specify exactly where they will be stored. Another important capability is specifying in the configuration which entries are logged and which are not, usually based on event categories and levels. The big problem, however, is that despite the availability of tools, recommendations on how to use them correctly are rare, namely:
What events need to be written to the trace log
How to choose the level for the event
How to choose event categories
What information should be recorded when an event occurs
This will be discussed further in the article.
What events need to be included in the trace log
In my opinion, the choice of events to write to the trace log depends on two factors:
Whether unit tests are used during development. Unit tests can significantly reduce the number of errors in the business logic of methods that do not interact with external systems (external to the given application layer). However, when code interacts with an external system (business-layer code talking to a database, business-logic layers located on different computers, etc.), unit tests are not effective, because the configuration of the layers may differ between environments. It follows that when unit tests are used, it makes sense to trace only the interactions between layers and the errors (since the logic of each layer on its own is assumed to be well tested). If there are no unit tests, you need to trace every branch of the program logic (method entry and exit, errors occurring in the method, and each branch of every conditional statement).
Type of application. The table below shows some types of applications and the events worth writing to the trace log for each (there are, of course, other types of applications).
| Application type | Logging features |
| --- | --- |
| Isolated desktop application (does not even save anything to disk) | If the application is well covered by unit tests, there is no point in tracing. |
| Data entry and reporting application | The application interacts with a data store, so it makes sense to log that interaction: queries, the number of records written and read, query processing time, and the key parameters used to generate reports. |
| Installer (patches, updates) | The program interacts closely with external systems, so every step (each attempt and its result) must be recorded in the trace log. |
| Integration bus | Summary or complete (full data) information on incoming and outgoing data. |
| Application that users can heavily customize (or extend with add-on modules and plug-ins) | All interactions with such external modules (input/output parameters) and the effect of the configuration parameters on program operation. |
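For the data entry and reporting row, tracing the interaction with the store might look like this. It assumes Python with the standard sqlite3 module standing in for the repository; the "DAL" category follows the article's category examples, and the split between Debug (statement and parameters) and Info (summary) follows the article's level recommendations for external resources.

```python
import logging
import sqlite3
import time

dal_log = logging.getLogger("DAL")


def run_query(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> list:
    """Run a query and trace the statement, parameters, row count and duration."""
    # Detailed request text goes to Debug, the summary to Info.
    dal_log.debug("Query: %s params=%r", sql, params)
    start = time.perf_counter()
    try:
        rows = conn.execute(sql, params).fetchall()
    except sqlite3.Error:
        dal_log.exception("Query failed: %s", sql)
        raise
    elapsed_ms = (time.perf_counter() - start) * 1000
    dal_log.info("Query returned %d rows in %.1f ms", len(rows), elapsed_ms)
    return rows
```

For writes, cursor.rowcount could be logged in the same way to record the number of rows affected.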
What data should be recorded in the trace log
Besides the event name (description), additional information is often needed for analysis. The following table lists data that is useful to record. Of course, events do not always need to be written in this much detail. In addition, tracing tools can typically record some of this information automatically.
| Data | Description |
| --- | --- |
| Date and time | Date and time of the event |
| Server | The server on which the event occurred (useful when analyzing logs collected from several servers) |
| Process | The name of the process in which the event occurred; needed, for example, when different processes use shared libraries |
| Method | Method name, possibly including the class and library name |
| Event category | The name of the layer or logic module |
| Level | The event's level of detail |
| Title | The name of the event (start or end of a method, error, change of object state, etc.) |
| Details | For example, detailed error information (for a critical error, possibly detailed information about the system), parameter values, the object's name, or a description of the action performed on the object |
| Account | The account under which the process runs; the user account that triggered the action; the user account that made the initial call that led to this event |
| Stack | The stack of method calls that led to this event; useful for detailed event analysis |
| Process correlation number | In a multi-user application, shows which request (user) a given event record belongs to |
| Correlation number of the initiating process | In a distributed application, used to match events on different servers (or processes). For example, you can pass a correlation number from the client to the server and record it when tracing; later you can match a call in the client application with the corresponding event on the server |
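Much of this table can be filled in automatically by the logging tool. The sketch below assumes Python's logging module: standard record attributes supply the time, process, module and function (module rather than class), logger name (category), and level, while a filter adds the server name and a correlation ID held in a contextvars variable. The names ContextFilter, start_request, and the corr= field are hypothetical.

```python
import contextvars
import logging
import socket
import uuid
from typing import Optional

# Correlation ID of the current request; every record picks it up automatically.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class ContextFilter(logging.Filter):
    """Adds the server name and the current correlation ID to each record."""

    def __init__(self) -> None:
        super().__init__()
        self.host = socket.gethostname()

    def filter(self, record: logging.LogRecord) -> bool:
        record.host = self.host
        record.correlation_id = correlation_id.get()
        return True  # never drops records, only enriches them


FORMAT = ("%(asctime)s %(host)s %(processName)s[%(process)d] "
          "%(name)s %(levelname)s %(module)s.%(funcName)s "
          "corr=%(correlation_id)s %(message)s")


def make_handler(stream) -> logging.Handler:
    handler = logging.StreamHandler(stream)
    handler.addFilter(ContextFilter())
    handler.setFormatter(logging.Formatter(FORMAT))
    return handler


def start_request(incoming_id: Optional[str] = None) -> str:
    """Reuse the caller's correlation ID if one was passed, else create one."""
    cid = incoming_id or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid
```

On the server, start_request would be called with the ID received from the client, so client and server entries share it. The call stack can be added per record with stack_info=True (for example log.error("...", stack_info=True)), and the account could be added in the filter, for example with getpass.getuser().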
Trace levels
Levels are mainly used to filter events when writing to the log, so that data not needed at a given time is not logged. For example, NLog provides six levels by default (from most to least detailed): Trace, Debug, Info, Warn, Error, Fatal (see the NLog documentation for details). In the configuration you can then specify, for example, that in production only Error and Fatal events are written to the trace log (all others are ignored), and when a problem occurs, change the configuration so that all events are recorded. The following table shows my recommendations for choosing event levels for tracing.
| Event | Level |
| --- | --- |
| Configuration loaded / configuration changed | Info |
| User action | Info |
| Start and end of each "public" method (or a method that implements logic from the specification), input/output parameters, and the method's result | Info |
| Input/output parameters of public methods that are data sets | Debug |
| Logic (program branches) described in the specification | Info |
| Start and end of other methods, input/output parameters, and result | Trace |
| Steps within the remaining methods | Trace |
| Access to external resources (for example, a database or web services) | Info |
| Detailed information on requests (commands) to external resources and the results obtained | Debug |
| Unexpected (non-critical) exceptions | Error |
| Exceptions described in the specification | Warn / Error |
| Handled exceptions | Warn / Info / Debug |
| Critical exceptions (handled or unhandled) | Fatal |
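A possible way to apply these recommendations in Python, offered as an assumption rather than a translation of NLog: Python's logging has no TRACE or FATAL level, so the sketch registers a custom TRACE level below DEBUG and maps Fatal to CRITICAL. The hypothetical traced decorator logs method entry, exit, and errors at a chosen level: Info for public methods described in the specification and Trace for internal helpers, as in the table.

```python
import functools
import logging

# Python's logging has no TRACE or FATAL; add TRACE below DEBUG and map
# NLog-style names onto Python's numeric levels (an assumption of this sketch).
TRACE = 5
logging.addLevelName(TRACE, "TRACE")
LEVELS = {
    "Trace": TRACE,
    "Debug": logging.DEBUG,
    "Info": logging.INFO,
    "Warn": logging.WARNING,
    "Error": logging.ERROR,
    "Fatal": logging.CRITICAL,
}


def set_min_level(logger: logging.Logger, name: str) -> None:
    """Records below this threshold are discarded, e.g. "Error" in production."""
    logger.setLevel(LEVELS[name])


def traced(logger: logging.Logger, level: str = "Trace"):
    """Trace method entry with arguments, exit with result, and errors."""
    num = LEVELS[level]

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            name = func.__qualname__
            logger.log(num, "Enter %s args=%r kwargs=%r", name, args, kwargs)
            try:
                result = func(*args, **kwargs)
            except Exception:
                # Unexpected exception: logged at ERROR with traceback, re-raised.
                logger.exception("Error in %s", name)
                raise
            logger.log(num, "Exit %s result=%r", name, result)
            return result
        return wrapper
    return decorator


log = logging.getLogger("BusinessLogic")


@traced(log, level="Info")  # public method described in the specification
def order_total(prices: list, is_member: bool) -> float:
    total = _sum(prices)
    if is_member:  # branch described by the specification -> Info
        log.info("order_total: member discount applied")
        return total * 0.9
    return total


@traced(log)  # internal helper -> Trace
def _sum(prices: list) -> float:
    return float(sum(prices))
```

With set_min_level(log, "Error"), as one might configure for production, none of the entry/exit records are written; switching to "Trace" while investigating a problem records everything without changing the code.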
Choosing event categories
The second important parameter for filtering logged events is the event category. Categories have to be chosen by the developers themselves (the tools do not provide default categories). I recommend creating a separate category for each logical layer, for example: the user interface layer (UIControls), the business logic layer (BusinessLogic), the data access layer (DAL), the search module (Search), the configuration manager (ConfigManager), and so on. If a layer contains separate components, you can define subcategories for tracing them, separated from the main category by a dot. For example, the visual component that displays a tag cloud (located in the user interface layer) gets the category UIControls.TagsControl. Then, if a component has a problem, you can always tell from the log which component produced an event, and you can configure filtering more flexibly, logging entries only for the selected component.
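Python's logging module happens to use the same dotted-name convention: a logger named "UIControls.TagsControl" is a child of "UIControls" and inherits its settings unless overridden. The configuration sketch below keeps everything at ERROR except the tag cloud component, which is traced in detail; the chosen levels are only an example.

```python
import logging.config

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,  # keep loggers created earlier working
    "formatters": {"plain": {"format": "%(name)s %(levelname)s %(message)s"}},
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "plain"},
    },
    # Default for every category: only errors.
    "root": {"level": "ERROR", "handlers": ["console"]},
    "loggers": {
        # Trace only the tag cloud component in detail while investigating it.
        "UIControls.TagsControl": {"level": "DEBUG"},
    },
}

logging.config.dictConfig(LOGGING)
```

Code in the component would then obtain its logger with logging.getLogger("UIControls.TagsControl"), and any deeper subcategory such as "UIControls.TagsControl.Render" inherits the DEBUG threshold.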
Conclusion
Logging is an important function in any application and requires careful analysis and design. Although tracing is usually not described in the requirements, using it correctly can greatly speed up detecting and fixing defects in test and production environments. These recommendations are based on my own practice and observations, so you may have your own experience and your own approach to logging (and tracing in particular). I would be glad to hear critical feedback and suggestions for improving these recommendations.