System description
Most Tinkoff Bank products are built entirely by our own developers. One of them is Pds (predictive dialing system), a robot that calls customers instead of operators. If the client answers, the call is transferred to an operator; if the client does not answer, the robot records an unsuccessful call.
The task of the system is therefore to keep operators continuously busy while maintaining a given quality of service. By operators' work we mean handling calls made by the Pds system and recording the result of each conversation. The quality-of-service level is the ratio of calls answered by operators to calls missed.
Our implementation is a distributed system consisting of several different component services.
The system architecture is schematically represented in the figure.

I will briefly describe the elements presented in the diagram.
- PDS - technically a Windows service; it calculates how many numbers need to be dialed and records unsuccessful calls.
- PBX - a telephone exchange with an open API that dials customer numbers and establishes telephone connections between customers and operators.
- Operators - a group of specialists who handle the calls made by the Pds system.
- Client M, Client X - the phone numbers of the clients that need to be contacted.
In general, the system works as follows. The PDS calculates how many numbers must be dialed to keep operators busy with useful work until the next cycle starts. It then fetches that many phone numbers from the task source and sends dial commands to the PBX. The PBX dials the received numbers and reports the results back to the PDS service. Calls where the client picked up the phone are transferred to free operators chosen by the PDS service (if readers are interested, I will describe the architecture in more detail in a separate article).
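To make the cycle more concrete, here is a minimal sketch of one iteration under stated assumptions: the ITaskSource and IPbx interfaces, the method names, and the naive prediction formula are illustrative, not the real service contracts.

```csharp
// Illustrative sketch of one dialing cycle; all names and the prediction formula are assumptions.
using System;
using System.Collections.Generic;

public interface ITaskSource
{
    IReadOnlyList<string> GetNumbers(int count);   // numbers to dial, taken from the task source
}

public interface IPbx
{
    void Dial(string number);                      // ask the PBX to call the number
    event Action<string, bool> CallCompleted;      // (number, answered)
}

public class DialingCycle
{
    private readonly ITaskSource taskSource;
    private readonly IPbx pbx;

    public DialingCycle(ITaskSource taskSource, IPbx pbx)
    {
        this.taskSource = taskSource;
        this.pbx = pbx;
        this.pbx.CallCompleted += OnCallCompleted;
    }

    // One cycle: predict how many numbers to dial so that the free operators
    // stay busy until the next cycle, then send the dial commands to the PBX.
    public void RunOnce(int freeOperators, double expectedAnswerRate)
    {
        var dialCount = (int)Math.Ceiling(freeOperators / Math.Max(expectedAnswerRate, 0.01));

        foreach (var number in taskSource.GetNumbers(dialCount))
        {
            pbx.Dial(number);
        }
    }

    private void OnCallCompleted(string number, bool answered)
    {
        if (answered)
        {
            // transfer the call to a free operator chosen by the PDS service
        }
        else
        {
            // record an unsuccessful call
        }
    }
}
```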
Requirements for the quality of the system
The critical point for the Pds service is how fast it reacts to events occurring in the system and to incoming requests. Slow data processing can degrade the quality of service, for example, cause a large number of unprocessed calls or long waiting times for operators. Such behavior is unacceptable because it reduces the efficiency of the business units. Therefore, quality control of the product is one of the priority stages of the development cycle.
During operation and development of the system, new requirements and user requests arise, so the source code changes significantly from version to version, including through ongoing refactoring and optimization. The team's task is therefore to preserve the quality of the system in each new version.
Product quality assurance
Load testing
For the Pds system to work efficiently, in addition to standard development practices (code review, functional testing), we load test the service while measuring the execution time of its methods.
In the early stages of putting the system into operation, we noticed that most of the failures in production were caused by the slow execution of certain sections of code. After optimization, the quality of the system became acceptable. After one of the “successful” versions, we measured the speed of the system and took these values as the “reference”.
After the development and testing of the next product version are complete, we run a load test and measure the speed of the methods, then compare the results with the “reference” values. Based on this comparison, we decide whether optimizations and an update release are needed.
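As an illustration of this step only (the real decision is made by comparing the published tables), the comparison against the reference values might look like the following sketch; the dictionary keys and the 20% tolerance are assumptions.

```csharp
// Hypothetical comparison of measured average execution times against the "reference" values.
using System.Collections.Generic;

public static class ReferenceComparison
{
    // Returns the methods whose average time regressed by more than the tolerance (20% here).
    public static IEnumerable<string> FindRegressions(
        IDictionary<string, double> referenceAvgTime,   // key: "Service.Method", value: seconds
        IDictionary<string, double> measuredAvgTime)
    {
        foreach (var reference in referenceAvgTime)
        {
            double measured;
            if (measuredAvgTime.TryGetValue(reference.Key, out measured)
                && measured > reference.Value * 1.2)
            {
                yield return string.Format(
                    "{0}: avg {1:F4}s vs reference {2:F4}s - candidate for optimization",
                    reference.Key, measured, reference.Value);
            }
        }
    }
}
```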
The practices we apply, the test scenarios, and the evaluation of results are described later in the article.
Implementation of measurements
The PDS service project uses an IoC container for convenient dependency management, unit testing, and the ability to substitute implementations. Thanks to this approach, we can wrap the objects created by the container in interceptors, in which we record the execution time of each method we are interested in, along with the input parameters and the returned results (a logging proxy). For a component's execution speed to be logged, the interface that the component implements must be marked with the special marker attribute LoggingAttribute. All interfaces of components that affect the speed of the application (request and event handlers) carry this attribute, and when the container creates these components it wraps them in special proxy objects.
Below is sample code for the SimpleInjector IoC container and the NLog logging library, with comments (most of the code is available on the SimpleInjector website):
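The sketch below is based on the interception sample from the SimpleInjector documentation; the IInterceptor and IInvocation types and the InterceptWith extension method come from that sample, while the attribute, interceptor, and logger names here are illustrative.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using NLog;

// Marker attribute: interfaces decorated with it are wrapped in a logging proxy.
[AttributeUsage(AttributeTargets.Interface)]
public sealed class LoggingAttribute : Attribute { }

// Interceptor that records the execution time of every intercepted method call.
// IInterceptor and IInvocation come from the interception sample on the SimpleInjector website.
public class LoggingInterceptor : IInterceptor
{
    public void Intercept(IInvocation invocation)
    {
        var watch = Stopwatch.StartNew();
        try
        {
            invocation.Proceed();   // call the real component
        }
        finally
        {
            watch.Stop();

            // One logger per component type, so measurement can be switched on or off
            // for a particular service through the NLog configuration.
            var serviceType = invocation.InvocationTarget.GetType();
            var logger = LogManager.GetLogger(serviceType.FullName);

            logger.Trace("service={0};method={1};duration={2}",
                serviceType.FullName,
                invocation.GetConcreteMethod().Name,
                watch.Elapsed.TotalSeconds);
        }
    }
}

public static class Bootstrapper
{
    public static SimpleInjector.Container BuildContainer()
    {
        var container = new SimpleInjector.Container();

        // ... component registrations go here ...

        // Wrap only the components whose interfaces carry LoggingAttribute.
        container.InterceptWith<LoggingInterceptor>(
            serviceType => serviceType.GetCustomAttributes(typeof(LoggingAttribute), true).Any());

        container.Verify();
        return container;
    }
}
```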
The logger messages are written to a separate file, from which the data is sent to the log analyzer.
The convenience of this approach is that, if necessary, measurement of execution speed can be enabled for a particular service or component through the logger configuration (without restarting the application), which helps reveal performance problems in a specific component. A disadvantage is that the service objects are wrapped in proxies: this is inconvenient during development (when inspecting objects) and causes small performance losses due to the extra call and logic added to the methods of the proxy object. The first drawback can be removed by adding a rule to the configuration transforms so that proxying is enabled only on the environments that need it (QA, Prod). The second one we have to live with, because the benefits outweigh it: diagnostics can always be switched on without restarting the application.
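For example, with NLog the per-component switch can be expressed in the configuration roughly like this (the file name, layout, and logger-name pattern are illustrative; autoReload lets NLog pick up changes without a service restart):

```xml
<nlog xmlns="http://www.nlog-project.org/schemas/NLog.xsd"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      autoReload="true">
  <targets>
    <!-- Separate file for method timings; this file is fed to the log analyzer -->
    <target name="methodTimings" xsi:type="File"
            fileName="${basedir}/logs/method-timings.log"
            layout="dateTime=${longdate};${message}" />
  </targets>
  <rules>
    <!-- Timings are written only for the components matched by the logger name;
         narrowing or widening this pattern enables or disables measurement
         for particular services without restarting the application -->
    <logger name="TCSBank.Pds.*" minlevel="Trace" writeTo="methodTimings" final="true" />
  </rules>
</nlog>
```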
It should be noted that a similar result could be achieved with any .NET profiler (dotTrace, ANTS, etc.), but using a profiler in production during real service operation is difficult.
Sample data from the log file:
dateTime=2016-01-25 16:00:27.3451;service=TCSBank.Pds.HostingService.PdsOperationService;method=SetReady;duration=0.0142099;
dateTime=2016-01-25 16:00:31.6109;service=TCSBank.Pds.HostingService.PdsMonitoringService;method=GetStatisticsAll;duration=0.0002707;
dateTime=2016-01-25 16:00:31.6109;service=TCSBank.Pds.HostingService.PdsMonitoringService;method=GetUsersByQueueId;duration=0.0005592;
dateTime=2016-01-25 16:00:34.2828;service=TCSBank.Telephony.Core.Pds.IOperatorBoardService;method=RefreshUsers;duration=0,0323294;
dateTime=2016-01-25 16:00:36.6110;service=TCSBank.Pds.HostingService.PdsMonitoringService;method=GetUsersByQueueId;duration=0.0006961;
dateTime=2016-01-25 16:00:36.7204;service=TCSBank.Pds.HostingService.PdsMonitoringService;method=GetStatisticsAll;duration=0.0003596;
dateTime=2016-01-25 16:00:41.6112;service=TCSBank.Pds.HostingService.PdsMonitoringService;method=GetUsersByQueueId;duration=0.0002869;
dateTime=2016-01-25 16:00:46.6113;service=TCSBank.Pds.HostingService.PdsMonitoringService;method=GetStatisticsAll;duration=0.0002227;
dateTime=2016-01-25 16:00:49.0177;service=TCSBank.Telephony.Core.Pds.IOperatorBoardService;method=SetReadyState;duration=0,0096144;
dateTime=2016-01-25 16:00:49.0177;service=TCSBank.Pds.HostingService.PdsOperationService;method=SetReady;duration=0.0109496;
dateTime=2016-01-25 16:00:49.0489;service=TCSBank.Telephony.Core.Pds.IOperatorBoardService;method=SetReadyState;duration=0,0086434;
dateTime=2016-01-25 16:00:49.0489;service=TCSBank.Pds.HostingService.PdsOperationService;method=SetReady;duration=0.0097707;
dateTime=2016-01-25 16:00:51.6115;service=TCSBank.Pds.HostingService.PdsMonitoringService;method=GetStatisticsAll;duration=0.0003434;
dateTime=2016-01-25 16:00:51.6271;service=TCSBank.Pds.HostingService.PdsMonitoringService;method=GetUsersByQueueId;duration=0.000687;
dateTime=2016-01-25 16:00:54.2834;service=TCSBank.Telephony.Core.Pds.IOperatorBoardService;method=RefreshUsers;duration=0,0200767;
Carrying out load testing
We conduct load testing after functional and regression testing are complete, that is, immediately before the product release, once the correctness of the system has been verified.
To assess method execution speed correctly, it is important to measure under a load close to the real one; the number of simultaneous requests (parallel threads) matters. Since the load in production differs significantly from the test environment and the number of testers who can take part in a test is limited, we developed an operator work emulator to generate the required load. The emulator is a one-page web application that sends requests and processes results in the same way as a real user of the system. Identification data (login) and behavior parameters (the number of seconds before the next step) can be passed to the application via the URL, which is very convenient for mass launches.
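For instance, a batch of emulator instances can be started from a small helper; the host name and query parameters below are illustrative assumptions, not the emulator's real API.

```csharp
// Hypothetical mass-launch helper: opens one emulator instance per test login,
// passing the login and the step delay through the URL query string.
// The host name and parameter names are illustrative assumptions.
using System.Diagnostics;

class EmulatorLauncher
{
    static void Main()
    {
        for (var i = 1; i <= 20; i++)
        {
            var url = string.Format(
                "http://pds-emulator.local/?login=load_op_{0:D2}&stepDelaySeconds=5", i);

            // Open the emulator page in the default browser.
            Process.Start(new ProcessStartInfo(url) { UseShellExecute = true });
        }
    }
}
```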
The test follows this scenario: several development machines or dedicated virtual machines run several emulator instances each, which send the necessary signals (commands) to the Pds service; the service takes the required number of phone numbers and sends them to the PBX. The load test uses a dedicated telephone number for which a specific behavior is configured on the PBX: answer (pick up) after a random number of seconds, then hang up after a random number of seconds.
The test is carried out for 15-20 minutes. During this time, enough data is collected for analysis.
In the future we plan to add monitoring of the database load using Oracle Enterprise Manager.
Evaluation of test results
The measurements recorded during the test go to the log analyzer, which builds a table of results (a minimal sketch of this aggregation is shown after the column list). The table consists of the following columns:
- Service - the name of the service whose method is called.
- Method - the name of the method being called.
- Count - the number of calls to the method during the test.
- AvgTime - the average execution time of the method.
- MinTime - the minimum execution time of the method.
- MaxTime - the maximum execution time of the method.
- Delta - the difference between the maximum and minimum execution times.
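A minimal sketch of such an aggregation with LINQ (the LogEntry type and the parsing of the log line are assumptions; the real analyzer may work differently):

```csharp
// Aggregates the timing log into the per-method table described above.
// The LogEntry type and the parsing code are illustrative assumptions.
using System;
using System.Globalization;
using System.IO;
using System.Linq;

class LogEntry
{
    public string Service;
    public string Method;
    public double Duration;   // seconds
}

class TimingTable
{
    static void Main()
    {
        var entries = File.ReadLines("method-timings.log")
            .Select(line => line.Split(';')
                .Select(part => part.Split(new[] { '=' }, 2))
                .Where(kv => kv.Length == 2)
                .ToDictionary(kv => kv[0].Trim(), kv => kv[1]))
            .Select(fields => new LogEntry
            {
                Service = fields["service"],
                Method = fields["method"],
                Duration = double.Parse(fields["duration"].Replace(',', '.'),
                                        CultureInfo.InvariantCulture)
            });

        var rows = entries
            .GroupBy(e => new { e.Service, e.Method })
            .Select(g => new
            {
                g.Key.Service,
                g.Key.Method,
                Count = g.Count(),
                AvgTime = g.Average(e => e.Duration),
                MinTime = g.Min(e => e.Duration),
                MaxTime = g.Max(e => e.Duration)
            })
            .OrderByDescending(r => r.AvgTime);

        foreach (var r in rows)
        {
            Console.WriteLine("{0} | {1} | {2} | {3:F6} | {4:F6} | {5:F6} | {6:F6}",
                r.Service, r.Method, r.Count, r.AvgTime, r.MinTime, r.MaxTime,
                r.MaxTime - r.MinTime);   // Delta
        }
    }
}
```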
An example of a table with measurement results:

Service | Method | Count | AvgTime | MinTime | MaxTime | Delta
--- | --- | --- | --- | --- | --- | ---
TCSBank.Telephony.Core.Pds.ICustomerBoardService | FindCustomerByUniqueId | 60 | 0.006371 | 0.003428 | 0.010329 | 0.006901
TCSBank.Pds.HostingService.PdsOperationService | GetAcdAnswerer | 26 | 0.450426 | 0.011836 | 4.247844 | 4.236008
TCSBank.Telephony.Core.Pds.IPdsCoreService | OnConnectionSuccess | 56 | 0.119736 | 0.093473 | 0.526525 | 0.433052
Similar tables with method results are published on the project wiki pages for each released version and are available for evaluation and comparison.
We also evaluate the ratio of answered to missed calls achieved at the measured processing speed. If method execution times grow significantly between versions, leading to an increase in the number of missed calls, optimization changes are made to the product and the load test is repeated.
In the future, we will complement this approach by analyzing the load on the database caused by the test.
Summary
Using the described approach allows us to guarantee the quality of the system and to deal quickly with abnormal situations in production. I want to note that this approach is not applicable to every system or every stage of the product life cycle. Which quality assurance methods to apply in each particular case is up to you. Thank you for your interest, and good luck in your work.