For more than three years, our team has been developing one of the key components of an operator's network: the PCRF. The Policy and Charging Rules Function (PCRF) is a solution for managing subscriber service policies in LTE networks (3GPP). It assigns a policy in real time, taking into account the services connected for the subscriber, their location, the network quality at that location at the moment, the time of day, the amount of traffic consumed, and so on. A policy in this context is the set of services available to the subscriber together with the QoS (quality of service) parameters. After analyzing the price-to-quality ratio of products in this field from various vendors, we decided to develop our own. For more than two years now, our PCRF has been running successfully on the commercial network of Yota. The solution is purely software and can be installed even on ordinary virtual servers. In production it runs on Red Hat Linux, but in general it can be installed on other Linux systems as well.
Of all the capabilities of our PCRF, the most successful were:
- a flexible tool for making subscriber policy decisions, based on the Lua language, which lets the operations team change the policy assignment algorithm easily and on the fly;
- support for various PCEF (Policy and Charging Enforcement Function, the component that directly enforces policies for subscribers), DPI (Deep Packet Inspection, a component that analyzes traffic packets and, in particular, counts the amount of traffic consumed per category) and AF (Application Function, a component that describes service data flows and reports the resources a service requires) nodes. Any number of these network nodes can be installed, and multiple sessions from different network components per subscriber are supported. We have carried out interoperability tests (IOT) with many large manufacturers of such equipment;
- a whole family of external interfaces for other systems in the network, plus a monitoring system that covers all processes occurring in the system;
- scalability and performance.
Later in this article, we will discuss one of the many criteria of the latter.
We have a resource where, six months ago, we published an image for testing, available to everyone under the appropriate license, a list of equipment vendors with whom we have run IOTs, a package of product documentation, and several articles in English about our development experience (about the Lua-based engine, for example, or the various kinds of testing).
When it comes to performance, there are many criteria by which it can be evaluated. The article about testing on our resource describes the load tests and tools that we used in some detail. Here I would like to focus on one parameter: CPU usage.
I will compare the results obtained in a test running 3000 transactions per second with the following scenarios:
- CCR-I: establishing a subscriber session,
- CCR-U: updating the session with information about the amount of traffic consumed by the subscriber,
- CCR-T: terminating the session with information about the amount of traffic consumed by the subscriber.
In version 3.5.2, which we released in the first quarter of last year, the CPU load in this scenario was quite high: 80%. We managed to lower it to 35% in version 3.6.0, which is currently running on the commercial network, and to 27% in version 3.6.1, which is currently in the stabilization phase.

Despite this huge difference, we did not perform any miracle; we simply made 7 straightforward optimizations, which I describe below. Perhaps you can apply some of them in your own product to improve its CPU usage.
First of all, I would like to say that most of the optimizations concerned the interaction between the database and the application logic. More thoughtful use of queries and caching of information is, perhaps, the main thing that we did. To analyze the timing of database queries, we had to write our own utility. The reason is that the application initially used the Oracle TimesTen database, which has no built-in advanced monitoring tools. After we introduced PostgreSQL, we decided that comparing the two databases with a single tool was the right approach, so we kept our utility. In addition, it does not have to collect data all the time: it can be enabled and disabled on demand, for example on a commercial network, at the cost of a small increase in CPU utilization, but with the ability to analyze right on production which query is causing problems at the moment.
The utility is called tt_perf_info and simply measures the time spent at different stages of query execution: fetch, the execution itself, the number of calls per second, and the percentage of total time. Times are shown in microseconds. The top 15 queries for versions 3.5.2 and 3.6.1 can be seen in the tables at the links: 3.5.2 top 15, 3.6.1 top 15 (empty cells correspond to a value of 0 in that version).
Optimization 1: reducing the number of commits
If you look closely at the output of tt_perf_info for the different versions, you can see that the number of calls to pcrf.commit dropped from 12006 per second to 1199, that is, by a factor of 10! The completely obvious decision that occurred to us was to check whether anything in the database had actually changed, and to commit only if the answer was positive. For example, for an UPDATE query, the PCRF checks the number of changed records; if it is 0, no commit is performed. The same applies to DELETE.
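Below is a minimal sketch of this check in C. The db_exec/db_rows_affected/db_commit wrapper and the pcrf.session table name are hypothetical stand-ins for the real database layer, used only to illustrate the idea:

    #include <stdio.h>

    /* Hypothetical database wrapper, declared only to show the pattern. */
    int  db_exec(const char *sql);       /* 0 on success */
    long db_rows_affected(void);         /* rows touched by the last statement */
    int  db_commit(void);

    int delete_expired_session(const char *session_id)
    {
        char sql[160];
        snprintf(sql, sizeof(sql),
                 "DELETE FROM pcrf.session WHERE id = '%s'", session_id);
        if (db_exec(sql) != 0)
            return -1;

        /* Skip the commit entirely when the DELETE matched no rows. */
        return db_rows_affected() > 0 ? db_commit() : 0;
    }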
Optimization 2: removing the MERGE query
While working on Oracle TimesTen, we noticed that a MERGE query locks the entire table, which, with processes constantly competing for the tables, led to obvious problems. So we simply replaced all MERGE queries with a GET-UPDATE-INSERT combination: if the record exists, it is updated; if not, a new one is inserted. We did not even bother wrapping all of this in a transaction; instead, the function calls itself recursively on failure. In pseudocode it looks like this:
    our_db_merge_function() {
        if (db_get() == OK) {
            /* record exists: try to update it */
            if (db_update() == OK) {
                return OK;
            }
            /* it was deleted in between: retry the whole merge */
            return our_db_merge_function();
        }
        /* record does not exist: try to insert it */
        if (db_insert() == OK) {
            return OK;
        }
        /* it was inserted in between: retry the whole merge */
        return our_db_merge_function();
    }
In practice, this almost always completes without a recursive call, since conflicts over a single record are rare anyway.
Optimization 3: configuration caching for calculating the amount of traffic consumed by subscribers
The algorithm for calculating the amount of traffic consumed according to the 3GPP specification has a rather complex structure. In version 3.5.2, the entire configuration was stored in the database as tables of monitoring keys and accumulators with a many-to-many relationship. The system also supported summing traffic accumulators from different external systems into a single value on the PCRF, and this setting was stored in the database as well. As a result, every time new data about accumulated volume arrived, a complex query had to be run against the database.
In 3.6.1, most of the configuration was moved to an XML file; processes are notified when this file changes, and a checksum is computed over the configuration information. The current traffic monitoring subscription is also stored in a blob associated with each user session. Reading and writing a blob is undoubtedly faster and less resource-intensive than a huge join over tables with a many-to-many relationship.
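To illustrate the checksum idea, here is a minimal C sketch of a checksum-gated configuration reload; the FNV-1a hash, the file name and the reload hook are my own illustrative choices, not the actual PCRF code:

    #include <stdio.h>
    #include <stdint.h>

    /* FNV-1a hash over the file contents: a cheap way to detect changes. */
    static uint64_t file_checksum(const char *path)
    {
        uint64_t h = 1469598103934665603ULL;   /* FNV-1a offset basis */
        FILE *f = fopen(path, "rb");
        if (!f)
            return 0;
        int c;
        while ((c = fgetc(f)) != EOF)
            h = (h ^ (uint64_t)c) * 1099511628211ULL;   /* FNV-1a prime */
        fclose(f);
        return h;
    }

    static uint64_t cached_checksum;

    /* Called periodically; reparses the XML only when the file really changed. */
    void maybe_reload_config(const char *path)
    {
        uint64_t h = file_checksum(path);
        if (h != 0 && h != cached_checksum) {
            cached_checksum = h;
            /* reparse_monitoring_config(path);  -- hypothetical reload hook */
            printf("config %s changed, reloading\n", path);
        }
    }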
Optimization 4: reducing the number of Lua engine calls
The Lua engine is called for every CCR-I, CCR-U and RAR request processed by the PCRF and executes the Lua script describing the policy selection algorithm, since the data in the request is likely to change the subscriber's policy. The idea of checksums was applied here as well. In version 3.6.1, we collected all the information on which a real policy change could depend into a separate structure and began computing a checksum over it. As a result, the engine is now invoked only when something has really changed.
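A minimal sketch of this gating in C, assuming an illustrative set of policy inputs and a hypothetical run_lua_policy_engine() entry point (the real structure and engine interface are not shown in the article):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Inputs the policy decision may depend on (illustrative subset).
       The struct is assumed zero-initialized so padding hashes deterministically. */
    struct policy_inputs {
        char     rat_type[16];
        char     location[32];
        uint32_t qci;
        uint64_t used_octets_bucket;   /* e.g. volume rounded to a threshold */
    };

    struct session {
        struct policy_inputs inputs;
        uint64_t             policy_checksum;  /* checksum at the last engine run */
    };

    static uint64_t fnv1a(const void *data, size_t len)
    {
        const unsigned char *p = data;
        uint64_t h = 1469598103934665603ULL;
        for (size_t i = 0; i < len; i++)
            h = (h ^ p[i]) * 1099511628211ULL;
        return h;
    }

    /* Invoke the (hypothetical) Lua policy script only when the inputs changed. */
    void process_request(struct session *s)
    {
        uint64_t h = fnv1a(&s->inputs, sizeof(s->inputs));
        if (h == s->policy_checksum)
            return;                     /* nothing relevant changed, skip Lua */
        s->policy_checksum = h;
        /* run_lua_policy_engine(s);  -- assumed engine entry point */
        printf("policy inputs changed, re-evaluating policy\n");
    }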
Optimization 5: removal of the network configuration from the database
The network configuration has also been stored in the database since the earliest versions of the PCRF. In release 3.5.2, the application logic and the network part heavily shared the tables with network settings: the logic module regularly read connection parameters from the database, and the network part used the database as the repository of all network information. In version 3.6.1, the information for the network part was moved to shared memory, and periodic processes were added to the main logic to update it when the database changes. This reduced the locking on the shared tables in the database.
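A minimal sketch of the shared-memory approach using POSIX shm_open/mmap; the segment name, structure layout and values are illustrative, and on older glibc you may need to link with -lrt:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct net_config {
        unsigned generation;      /* bumped by the writer on each refresh */
        char     peer_host[64];
        unsigned peer_port;
    };

    int main(void)
    {
        /* One process creates the segment; the others map the same name. */
        int fd = shm_open("/pcrf_net_config", O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, sizeof(struct net_config)) < 0)
            return 1;

        struct net_config *cfg = mmap(NULL, sizeof(*cfg),
                                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (cfg == MAP_FAILED)
            return 1;

        /* The logic module would periodically re-read the database and publish
           changes here; the network part just dereferences the mapping. */
        strcpy(cfg->peer_host, "pcef01.example.net");   /* illustrative value */
        cfg->peer_port = 3868;                          /* Diameter port */
        cfg->generation++;

        printf("peer %s:%u (generation %u)\n",
               cfg->peer_host, cfg->peer_port, cfg->generation);

        munmap(cfg, sizeof(*cfg));
        close(fd);
        return 0;
    }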
Optimization 6: Selective Analysis of Diameter Commands
The PCRF communicates with external systems using the Diameter protocol, analyzing and parsing many commands per unit of time. These commands, as a rule, contain many fields (AVPs), but not every component needs all of them. Often only a few fields from the first (header) part of the command are used, such as Destination/Origin Host/Realm, or the fields that identify the subscriber or the session, that is, the ids (which are also usually located at the beginning). Only one or two of the main processes use all of the message fields. Therefore, in version 3.6.1 we introduced masks that describe which fields need to be read for a given component. We also removed almost all memory copy operations: only the original message remains in memory, all processes work with structures containing pointers to the necessary parts, and data is copied inside a process only when strictly necessary.
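A minimal sketch of the mask idea in C. The AVP layout below is a deliberately simplified stand-in (a real Diameter AVP header also carries flags, a vendor id and network byte order), although the codes 263 (Session-Id) and 264 (Origin-Host) are real Diameter base-protocol codes; the point is that the caller passes a mask of the fields it needs and gets back pointers into the original buffer instead of copies:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    enum {
        WANT_ORIGIN_HOST = 1u << 0,
        WANT_SESSION_ID  = 1u << 1,
    };

    struct avp_view { const uint8_t *data; uint32_t len; };   /* no copy */

    struct msg_view {
        struct avp_view origin_host;
        struct avp_view session_id;
    };

    /* Simplified wire format: [u32 code][u32 len][len bytes of payload]... */
    static void parse(const uint8_t *buf, size_t size, uint32_t mask,
                      struct msg_view *out)
    {
        size_t off = 0;
        memset(out, 0, sizeof(*out));
        while (off + 8 <= size) {
            uint32_t code, len;
            memcpy(&code, buf + off, 4);
            memcpy(&len,  buf + off + 4, 4);
            if (len > size - off - 8)
                break;                          /* malformed message, stop */
            const uint8_t *payload = buf + off + 8;
            if (code == 264 && (mask & WANT_ORIGIN_HOST))
                out->origin_host = (struct avp_view){ payload, len };
            else if (code == 263 && (mask & WANT_SESSION_ID))
                out->session_id = (struct avp_view){ payload, len };
            off += 8 + len;   /* AVPs the caller did not ask for are skipped */
        }
    }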
Optimization 7: Time Caching
When the PCRF began processing more than 10,000 transactions per second, it became noticeable that the logging process takes a significant amount of time and CPU. It may sometimes seem that logs can be sacrificed in favor of greater performance, but the operator needs to be able to reconstruct the whole picture of what is happening on the network and on a specific component. So we sat down to analyze it and found out that the most frequent entry in the log is the time and date stamp; naturally, it is present in every log record. Then, limiting time precision to one second, we simply began caching the string with the current time and rebuilding it only when the next second arrives.
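A minimal sketch of such a timestamp cache in C (the log format is illustrative, and a real multi-threaded implementation would need a per-thread buffer or a lock, which is omitted here):

    #include <stdio.h>
    #include <time.h>

    const char *log_timestamp(void)
    {
        static time_t cached_sec = 0;
        static char   buf[32];

        time_t now = time(NULL);
        if (now != cached_sec) {             /* at most one strftime per second */
            struct tm tm;
            localtime_r(&now, &tm);
            strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", &tm);
            cached_sec = now;
        }
        return buf;
    }

    int main(void)
    {
        printf("[%s] example log line\n", log_timestamp());
        return 0;
    }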
All seven of these optimizations will surely seem simple and obvious to an experienced developer of high-performance systems. They seemed so to us as well, but only once we had recognized and implemented them. The best solution often lies on the surface, yet it is also the hardest one to see. To summarize:
- Check that the data has actually changed;
- Try to minimize the number of locks on entire tables;
- Cache configuration data and move it out of the database;
- Perform only the actions that are really needed, even when it seems easier to process the entire list.