
One day, many years ago, a client came to me and tearfully begged me to sort out a certain wonderful project and speed it up.
In short, the task was this: there is a robot of sorts, written in C++, which scrapes HTML pages and stores the results in a database (MySQL). There is also a lot of other functionality and a web frontend on LAMP, but that has nothing to do with this story.
On a 4-core Xeon in the cloud, the previous team had managed to reach a fantastic collection rate of as much as 2 pages per second, with 100% CPU utilization on both the collector and the database, which lived on a separate, identical server.
By the way, realizing that they could not cope, the "team of strong professionals" from Bangalore gave up and ran away, leaving behind, apart from the source code, "nothing but beads" (C). On the intricacies of restoring order in the PHP and in the database schema, let's talk some other time; here I will give just one example of the craftsmanship that came our way.
Starting the autopsy

The first thing that interested me was the heavy load on the database. I turn on detailed query logging and start tearing my hair out everywhere I have it. Here is why.
It turned out that tasks were added to the database from the web interface, and the robot polled it 50 times per second, asking whether a new task had appeared. On top of that, the data was naturally laid out in whatever way was convenient for the interface, not for the robot. The result: three inner joins in the polling query.
The first fix is to increase the polling interval to once a second. Then I get rid of the crazy query: I add a new three-column table, write triggers on the web-facing tables so that it is filled automatically, and replace the joins with a simple
SELECT * FROM new_table WHERE status = 'Pending'
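For illustration, here is a minimal sketch of what the simplified polling can look like on the robot side, using the MySQL C API. The column names other than status and the function name poll_for_tasks are my own placeholders; the trigger that fills new_table lives on the MySQL side and is not shown.

```cpp
// Sketch only: poll new_table once a second instead of 50 times a second.
// Assumes libmysqlclient; column names other than `status` are hypothetical.
#include <mysql/mysql.h>   // on some installs: <mysql.h>
#include <chrono>
#include <thread>

void poll_for_tasks(MYSQL* conn) {
    for (;;) {
        if (mysql_query(conn,
                "SELECT id, url FROM new_table WHERE status = 'Pending'") == 0) {
            if (MYSQL_RES* res = mysql_store_result(conn)) {
                while (MYSQL_ROW row = mysql_fetch_row(res)) {
                    // row[1] is the URL: hand it over to the collector here.
                }
                mysql_free_result(res);
            }
        }
        // One poll per second is more than enough for a task queue.
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}
```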
The new picture: the collector is still 100% busy, the database is at 2%, and we are now at four pages per second.
Picking up the profiler

And suddenly it turns out that 80% of the execution time is taken up by the wonderful functions EnterCriticalSection and LeaveCriticalSection, called (predictably) from the standard allocator of a well-known company.
The evening ceases to be languid, but I realize there is a lot of work ahead and a serious rewrite is coming. And of course, my code-monkey predecessors preferred to parse HTML with a pile of regular expressions, because SAX is oh so difficult. Time to take a closer look at what had been "optimized" before me.
On the dangers of premature optimization by mental ray

Seeing that the database was loaded at 100%, the guys became firmly convinced that the slow part was inserting new URLs into the processing list.
I find it hard even to guess what guided them while they were optimizing this particular piece of code. But the approach itself! "In theory this is where we are slow, so this is exactly what we will speed up." To that end they came up with the following tricks:
- An asynchronous queue of insert requests
- A huge self-written HashMap in memory, behind one giant lock, which remembered every URL ever processed. And since this was a server, it had to be restarted regularly after such "optimizations": they never got around to cleaning that cache. (A sketch of this pattern follows the list.)
- A pile of magic constants, for example: no more than 400 entries are taken when processing the next batch of URLs from the database. Why 400? Picked by hand.
- A large number of "writers" into the database, each trying in a loop to cram its part in, hoping to get lucky.
And of course, plenty of other pearls were on offer.
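To make the second item concrete, here is my reconstruction of that giant-lock pattern. It is not their code (I use std::unordered_set instead of their hand-written hash map), only the shape of it:

```cpp
// Reconstruction of the anti-pattern, not the original code: one global lock,
// one ever-growing in-memory set of every URL ever seen, never cleaned up.
#include <mutex>
#include <string>
#include <unordered_set>

class VisitedUrls {
public:
    // Returns true if the URL has not been seen before.
    bool add(const std::string& url) {
        std::lock_guard<std::mutex> lock(giant_lock_);  // every thread serializes here
        return seen_.insert(url).second;                // grows until the server is restarted
    }

private:
    std::mutex giant_lock_;
    std::unordered_set<std::string> seen_;
};
```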
In general, it was very instructive to watch the code evolve; luckily, nothing was ever deleted, only neatly commented out. Something along the lines of
void GodClass::PlaceToDB(const Foo* bar, ...) { /* ... */ }
What I did

Naturally, all those tricks went straight into the trash. I brought back the synchronous insert and hung a UNIQUE constraint on the table to weed out duplicate URLs, instead of the giant-lock dances around a self-written hashmap.
The auto-increment fields went away too; a UUID is inserted instead (computing the next auto-increment value can take an implicit table lock), as sketched below. At the same time I seriously slimmed down the table: at some 20 KB per row, no wonder the database was sagging.
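A hedged sketch of that direction, with made-up table and column names: the key is generated on the client with libuuid, and a UNIQUE index on the url column together with INSERT IGNORE lets MySQL discard duplicates, so no in-memory dedup structure is needed at all.

```cpp
// Sketch, not the project's code: client-side UUID keys plus database-side
// deduplication. Assumes a UNIQUE index on pending_urls.url; link with
// -lmysqlclient -luuid.
#include <mysql/mysql.h>
#include <uuid/uuid.h>
#include <cstdio>
#include <string>

bool queue_url(MYSQL* conn, const std::string& url) {
    uuid_t id;
    char id_str[37];                       // 36 chars + terminating NUL
    uuid_generate_random(id);
    uuid_unparse_lower(id, id_str);

    // Escape the URL before splicing it into the statement.
    std::string escaped(url.size() * 2 + 1, '\0');
    unsigned long n = mysql_real_escape_string(conn, &escaped[0],
                                               url.c_str(), url.size());
    escaped.resize(n);

    char query[8192];                      // enough for a sketch, not for every URL
    std::snprintf(query, sizeof(query),
                  "INSERT IGNORE INTO pending_urls (id, url, status) "
                  "VALUES ('%s', '%s', 'Pending')",
                  id_str, escaped.c_str());
    return mysql_query(conn, query) == 0;  // duplicate URLs are silently dropped
}
```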
The magic constants went away as well; in their place there is now a normal thread pool with a common task queue and a separate thread that keeps the queue topped up so that it is neither empty nor overflowing (a sketch of this layout follows).
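A minimal sketch of that layout, under stated assumptions: the queue class, the pool size, the watermark and the helpers fetch_pending_urls / process_page are illustrative names, not the project's actual API.

```cpp
// Sketch: fixed worker pool + one shared queue + one feeder thread that keeps
// the queue between "not empty" and "not overflowing". Shutdown handling is
// omitted for brevity.
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

class UrlQueue {
public:
    explicit UrlQueue(size_t high_watermark) : high_(high_watermark) {}

    void push(std::string url) {            // feeder blocks while the queue is full
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [&] { return q_.size() < high_; });
        q_.push(std::move(url));
        not_empty_.notify_one();
    }

    std::string pop() {                     // worker blocks while the queue is empty
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [&] { return !q_.empty(); });
        std::string url = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return url;
    }

private:
    std::queue<std::string> q_;
    std::mutex m_;
    std::condition_variable not_empty_, not_full_;
    size_t high_;
};

// Hypothetical hooks: the first would run the simple SELECT shown earlier,
// the second would download and parse one page.
std::vector<std::string> fetch_pending_urls() { return {}; }  // stub
void process_page(const std::string&) {}                      // stub

void run_collector(size_t workers) {
    UrlQueue queue(/*high_watermark=*/1000);

    std::thread feeder([&] {
        for (;;) {
            auto batch = fetch_pending_urls();
            for (auto& url : batch) queue.push(url);  // blocks when the queue is full
            if (batch.empty())                        // nothing pending: do not hammer the DB
                std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    });

    std::vector<std::thread> pool;
    for (size_t i = 0; i < workers; ++i)
        pool.emplace_back([&] { for (;;) process_page(queue.pop()); });

    feeder.join();
    for (auto& t : pool) t.join();
}
```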
The result is 15 pages per second.
However, re-profiling showed no breakthrough. Of course, a 7x speedup just from fixing the architecture is nothing to sneeze at, but it is not enough: in essence, all the original flaws were still in place, I had only removed the "optimized" pieces.
Regular expressions for parsing megabyte-sized structured files are a bad idea

I continue studying what had been done before me, savoring the approach of the authors unknown to me. With the grace of a tractor, the guys solved the data-extraction problem like this (each step has its own set of regular expressions):
- Cut out all comments in the HTML
- Cut out comments in the JavaScript
- Cut out the script tags
- Cut out the style tags
- Pull two numbers out of the head section
- Cut out everything except the body
- Now collect all the "a href" links and cut them out
- Cut all the unneeded div and table elements out of the body, along with the images
- Then strip the table layout
- In what remains, strip the tags p, strong, em, i, b, g, etc.
- And finally, pull three more numbers out of the remaining plain text.
Surprisingly, even with this approach it managed to chew through its 2 pages per second.
Obviously, I am not quoting the expressions themselves as they ended up after all the tuning: they are a huge wall of unreadable gibberish.
And that is not all: naturally, the proper boost library was used, and every operation was performed on std::string (well, of course: where else would you keep HTML? char* is not conceptual! Hardcore only!). Hence the insane number of memory reallocations.
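To make the cost concrete, here is a hedged reconstruction of the style of that pipeline, not their actual expressions: every boost::regex_replace pass walks the whole multi-megabyte document and returns a brand-new std::string, so a dozen passes means a dozen full copies plus all the intermediate growth reallocations.

```cpp
// Reconstruction of the style, not the original expressions: each pass copies
// the entire document into a fresh std::string.
#include <boost/regex.hpp>
#include <string>

std::string strip_page(std::string html) {
    static const boost::regex comments("<!--.*?-->");
    static const boost::regex scripts("<script\\b[^>]*>.*?</script>", boost::regex::icase);
    static const boost::regex styles("<style\\b[^>]*>.*?</style>", boost::regex::icase);
    const std::string nothing;

    html = boost::regex_replace(html, comments, nothing);  // full copy of the page, #1
    html = boost::regex_replace(html, scripts, nothing);   // full copy, #2
    html = boost::regex_replace(html, styles, nothing);    // full copy, #3
    // ...and so on for another dozen passes in the original pipeline.
    return html;
}
```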
I take char* and a simple SAX-style HTML parser: remember the needed numbers as they stream past and pull out the URLs along the way (a rough sketch of the idea is below). Two days of work.
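For the flavor of it, a minimal sketch of a SAX-style pass over the raw buffer, not the actual parser from the project: one scan over char*, a callback per href, no copies of the document. Real HTML needs far more care (entities, unquoted attributes, comments), and the numeric fields I also extracted are omitted.

```cpp
// Sketch of the SAX-style idea: a single pass over the raw buffer, no string
// copies. strncasecmp is POSIX (<strings.h>).
#include <cctype>
#include <cstddef>
#include <cstring>
#include <strings.h>

template <typename OnUrl>   // OnUrl: void(const char* url, size_t len)
void for_each_href(const char* html, size_t len, OnUrl on_url) {
    const char* p = html;
    const char* end = html + len;
    while ((p = static_cast<const char*>(std::memchr(p, '<', end - p))) != nullptr) {
        // Only "<a " / "<A " tags are interesting here.
        if (end - p > 2 && (p[1] == 'a' || p[1] == 'A') &&
            std::isspace(static_cast<unsigned char>(p[2]))) {
            const char* tag_end = static_cast<const char*>(std::memchr(p, '>', end - p));
            if (!tag_end) break;
            for (const char* q = p; q + 6 < tag_end; ++q) {
                if (strncasecmp(q, "href=\"", 6) == 0) {
                    const char* v = q + 6;
                    const char* v_end =
                        static_cast<const char*>(std::memchr(v, '"', tag_end - v));
                    if (v_end) on_url(v, static_cast<size_t>(v_end - v));
                    break;
                }
            }
            p = tag_end;
        }
        ++p;
    }
}
```

A callback like this can then feed each extracted URL straight into the database insert from the previous section.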
The result is 200 pages per second.
Already acceptable, but still not enough. Only a factor of 100 so far.
Another approach to the barbell

On to the results of the new profiling run. Things have improved, but there are still a lot of allocations, and boost's to_lower() has somehow climbed into first place.
The first thing that catches the eye is a mighty URL class, dragged over wholesale from Java.
Well, of course: this is C++, it will be faster anyway, never mind that the allocators are different. So a stack of copies and substring() calls is the answer to everything. And should to_lower be applied once, directly to URL::host? No, no: it must be called at every comparison and every mention, and by all means through boost.
I remove the excessive use of to_lower() and rewrite the URL class on top of char*, with no reallocations, instead of std::string (a sketch of the direction is below). Along the way I optimize a couple of loops.
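A hedged sketch of that direction, with illustrative names rather than the project's class: the URL becomes a set of pointer/length slices over the char* buffer that already holds the text, and the host is compared case-insensitively in place, so there are no substrings, no temporary std::strings and no to_lower() per comparison.

```cpp
// Sketch only: non-owning URL "view" with case-insensitive host comparison.
// strncasecmp is POSIX (<strings.h>); real URL parsing (ports, userinfo,
// query strings) needs more care.
#include <cstddef>
#include <cstring>
#include <strings.h>

struct UrlView {
    const char* host = "";  size_t host_len = 0;
    const char* path = "";  size_t path_len = 0;

    static UrlView parse(const char* raw) {     // raw must be NUL-terminated
        UrlView u;
        const char* host = raw;
        if (const char* p = std::strstr(raw, "://")) host = p + 3;
        const char* slash = std::strchr(host, '/');
        u.host = host;
        u.host_len = slash ? static_cast<size_t>(slash - host) : std::strlen(host);
        u.path = slash ? slash : "";
        u.path_len = std::strlen(u.path);
        return u;
    }
};

// No allocations, no to_lower(): compare the host bytes case-insensitively.
inline bool same_host(const UrlView& a, const UrlView& b) {
    return a.host_len == b.host_len &&
           strncasecmp(a.host, b.host, a.host_len) == 0;
}
```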
The result is 300 pages per second.
I stopped there: a 150x speedup overall, although reserves for further acceleration remained. As it was, the job had already eaten more than two weeks.
Conclusions

The conclusions, as always, are a classic of the genre. Use tools to measure performance, do not divine it out of your head. Make wider (or "widerer") use of ready-made libraries instead of arranging the sunset by hand.
And may Saint Connecty grant you high performance!