📜 ⬆️ ⬇️

We go around other people's brakes

I backed up the history of messages from Skype with a self-written utility, a year ago it worked perfectly, and now it began to slow down. This is unacceptable, tk. including for the sake of export speed, it was written, so I got into the profiler. As a result, I learned everything and received multiple enlightenments. It turns out that a breakpoint on a function in a loaded system DLL has to be set up with a subloader, and not just by name, but it can also be lightweight. It turns out that the Skype API is written in places brutally crooked, and why tormozishcha. It turns out that foreign binaries can sometimes be very easily podhachit and podoptimizit (thank MS Research!). It turns out that the profiler can lie strongly, and not just lightly ram it. Keywords for the impatient: C ++, VS, CodeAnalyst, Skype COM API, MS Research, Detours, SQLite; and for all other details under the cut.

In short, the background. A year and a half ago, I wanted to export Skype logs in a convenient text format for me personally (almost cut / paste from the chat, with tiny improvements). All found utilities for exporting something did not suit. Either they brutally braked (because there are a little less than 300 contacts and a little more than 360,000 messages in the logs); or wrote logs in an inconvenient format; or all these troubles at once; or stupidly did not work, etc. I decided to write my own, it turned out. At first I tried the Skype API bindings to Python, it turned out slowly. Then I tried Skype4COM and C ++, it turned out to be much faster. The result was a fairly quick utility for exporting logs. Hardcore, console and C ++, of course.

Another time I remembered about backups a couple of weeks ago, I launched a proven utility and ... I realized that I could not wait for the end. Export, which used to take several minutes, in half an hour advanced by about 10%, for a total of ETA, which was about 5 hours. Unacceptable for a long time. What's the matter?!

Slightly correct utilities, we limit the number of exported chats (as I understand it, for some reason, we divide individual sessions of communication with the same contacts into separate so-called chats, IChat), mark the base time. Then we run the profiler, good, CodeAnalyst is now embedded in Visual Studio and starts in just 2-3 clicks. And carefully look at the results.
')
exporting chat 100/7014...
exported 3925 events in 18.1 sec




The profile looks like curly hair, kernelbase.dll eats 81% of the time of the process, another 12% of skype4com.dll, the program itself cannot even be seen. Treason! It is not my code that is slowing down (which is easy to edit), but some kind of third-party. But which one?

The functions InternalLcidToName, LCMapStringEx, NlsValidateLocale confidently override. Who are all these people? I do not call anything like this from my code. So, I'm not calling. So, you need to know where they are called from, and then it will be possible to see why, and what can be done about it. It is necessary to put a breakpoint on the top function InternalLcidToName, see the stack. Oops, poser. Stupidly by the name of the function of the breach is not set How to be?

I know two options, but I suppose there is even more. Bo1x, since the function is all so top-profile in the profile, then by accidentally interrupting the performance several times, we will definitely get into it. Bo2x, slightly googling, manages to find the magic line {,, KernelBase.dll} @ InternalLcidToName @ 8 - it turns out that the desired character is called that way, with dogs and digits. Looking ahead, tsiferki are always multiples of 4, and generally strongly resemble the size of the stack, and instead of the leading dog for "public" characters are typically underlined. Noticing this, the exact rules of mangling are already much longer to search, study and apply, than to sort out a couple of options (dog / underscore, 4/8/12/16 ...), or even get into some _GetStringTypeW @ 16 first shot. Well, after returning from the near future to the actual InternalLcidToName, repeatedly poking F5 and collecting statistics, you can see two interesting things. The first thing is that the stack of most calls looks like this.

KernelBase.dll!@InternalLcidToName@8()
KernelBase.dll!_LCMapStringW@24() + 0x46 bytes
Skype4COM.dll!280c69f2()
// Skype4COM.dll
// ,

The second thing is that the repeatedly executed code looks like this. At the same time always goes on the same codepath.

@InternalLcidToName@8:
752F6F33 mov edi,edi
752F6F35 push ebp
752F6F36 mov ebp,esp
752F6F38 push ecx
752F6F39 push edx
752F6F3A lea eax,[ebp-4]
752F6F3D push eax
752F6F3E mov dword ptr [ebp-4],ecx
752F6F41 call _NlsValidateLocale@8 (752F6E04h)
752F6F46 test eax,eax
752F6F48 je @InternalLcidToName@8+17h (7531BAB0h)
752F6F4E push eax
752F6F4F call _LocaleNameFromHash@4 (752F6F13h)
752F6F54 leave
752F6F55 ret


Those. All our three top functions from the profile are actually reduced to one, _LCMapStringW. Unlike the other two, internal, this function is quite a part of the public interface and described in MSDN, Google instantly finds the link msdn.microsoft.com/en-us/library/dd318700%28v=vs.85%29.aspx - it turns out This is the conversion of strings from one locale to another time, for some reason, it eats, if not to itself.

Well, we put another breakpoint (the _LCMapStringW @ 24 symbol is immediately visible on the stack, conveniently) and a few more in the program, and look at two things. What, actually, API call leads to a call of this most expensive LCMapString (for this purpose in the program). And exactly which parameters are transmitted as a result (for this breakpoint on the function itself). Having walked the function several times, it is clear that the code always follows this path:

752F8188 push 0
752F818A push 0
752F818C push 0
752F818E push dword ptr [ebp+1Ch]
752F8191 push dword ptr [ebp+18h]
752F8194 push dword ptr [ebp+14h]
752F8197 push dword ptr [ebp+10h]
752F819A push dword ptr [ebp+0Ch]
752F819D push eax
752F819E call _LCMapStringEx@36 (752F81ACh)


Corresponding, but all the parameters can be seen very clearly; here they are all, are thrown into the stack (in the reverse order). The results of their careful examination, however, are simply shocking at first, and then cause an involuntary dropping of the jaw and uncontrolled outflow of foam from the mouth.

It turns out that for each (every) call to the IChatMessage :: GetBody method, the following occurs.

First, # APPLICATIONCALLCHATMESSAGECHATMEMBERCHATGROUPSMSUSERVOICEMAIL # arrives in the LCMapStringEx function, with the single flag LCMAP_LOWERCASE (“make lowercase based on locale”), the locale is en-US. Perhaps there will be another on another machine (I have en-US as the interface language for Windows), but for the conversion of the usual Latin into lower case it doesn't matter. Here only these data arrive not in one long line, and not even in a few short lines, but ... one character at a time .

Then, apparently, the Hindu who wrote this frenzy begins to suspect that something was wrong, all of a sudden it will be necessary to bring another command of the protocol to the lower register, and we have a torrent and a bunch of syscolles? Therefore, there is a brilliant preventive optimization. And in the LCMapStringEx function the entire table arrives from 0 to 255 inclusive, again one byte per call. And just in case this is done 2 times in a row.

Then ingenious optimization continues. (Or maybe just another Hindu is doing it now in a different place in the code.) And the entire table from 0 to 255 flies to the function again! But each byte is now repeated in a row 3 times. I personally think this approach is more correct than the previous one, of course. You cannot take two chronometers at sea, you need either one or three.

Total more than 1000 LCMapStringEx calls for each [censored] attempt to take the text of 1 (one) message, each call is made on a line of exactly 1 byte size. Through this, 5000 text message takes 5-10 seconds CPU. It does not depend on the Skype COM API version, that two years ago, that the current one is braked equally.

Oherenno, oherenno inside is arranged by Skype API. Feels written strong professionals.

The problem is clear, what to do? Ideally, these calls should be eliminated altogether, but patching in an unknown number of places skype4com.dll is lazy (who knows how many different points this LCMapString is pulling), and bo2 is fraught (who knows what bugs will suddenly lead). It begs to change the function itself and make a quick exit from it for that very case of a call with 1 byte. The simple technique has long been known, to take the function address in memory, put jmp in there to its function, from its own, if necessary, execute the instructions that were erased at the beginning and jmp-back to the original one. However, the technique is tedious and sawing it with handles on an assembler is also a bit lazy.

It turns out, no longer needed! MS Research has already thought and done everything for us. In nature, there is a library called Detours , which is able to do just that and more. You can, for example, replace a function in some third-party .exe without modifying it at all, on the fly shove your implementation out of your DLL. Well, for my utility, Detours provides a simple and straightforward C / C ++ interface for the necessary substitution, an example called simple.cpp is more than enough. We cling to the project cost of living (detours.h, syelog.lib, detoured.lib, detours.lib), append 20 lines of code, and ...

#define PROTO (LCID Locale, DWORD dwMapFlags, LPCWSTR lpSrcStr, int cchSrc, LPWSTR lpDestStr, int cchDest)
static int (WINAPI * TrueMap) PROTO = LCMapStringW;
int WINAPI MyMap PROTO
{
if (Locale==1033 && dwMapFlags==256 && cchSrc==1)
{
*lpDestStr++ = *lpSrcStr++;
*lpDestStr++ = 0;
return 1;
}
return TrueMap(Locale, dwMapFlags, lpSrcStr, cchSrc, lpDestStr, cchDest);
}

// ...
DetourRestoreAfterWith();
DetourTransactionBegin();
DetourUpdateThread(GetCurrentThread());
DetourAttach(&(PVOID&)TrueMap, MyMap);
LONG error = DetourTransactionCommit();
if (error != NO_ERROR)
printf ( "error detouring LCMapStringW(); export might be slower (code=%d)\n", error );

// ...

exporting chat 100/7014...
exported 3925 events in 5.8 sec


Overall, shorting out the idiotic and generally unnecessary (!) Byte-level translation of everyone into lower case using a system call, they accelerated the program 3.1 times as a whole. Not bad, not bad. Look in the profile further!



The picture has changed dramatically. Now KernelBase.dll eats only 20%, the rest is carried out in unknown wilds of skype4com.dll, which you can’t take so easily. However, it is obviously asking for a check with GetDriveType. This function determines the type of disk (removable, non-removable, CD-ROM, RAM, or network), if it is also called thousands of times, it is suggested to cache the result. Here we are waiting for the next small discovery.

It turns out that the profiler is lying and quite noticeable. The _GetDriveTypeW function is called 1 (one) time during the whole time the program is running. In the profile, it glows quite noticeably, even if you export 1000 chats, not 100, but in reality it does not eat.

However, the profiler does not lie about _GetStringTypeW. Having done a similar analysis of _LCMapStringW calls with simple manipulations, we find out that during the export, its parent function GetStringTypeEx is also constantly pulled on each byte from 0 to 255 (who would doubt). By intercepting GetStringTypeEx and returning the results for a single-byte case from the cache, another 20 lines will result in an acceleration of another 15%, and only 3.6 times.

exporting chat 100/7014...
exported 3925 events in 5.0 sec


Interestingly, after this optimization, KernelBase.dll disappears from the profile altogether. 62% of the remaining time is consumed by skype4com.dll, another 12% is using ntdll.dll (on allocation and critical sections), the program itself is eating about 8%, and then all sorts of things on the system. One feels that the potential for optimization is still 3-5 times better, but it’s fast to deal with what happens when there are function names and documentation on MSDN, and to disassemble the giblets skype4com.dll and to smoke them from hotspots. And the export of the cherished 360 thousand messages already takes less than 10 minutes, which is acceptable.

A brief technical summary seems to have already been announced at the beginning of the article. Around live people, including in famous companies such as Skype, and in some places they write the code ... like real people. Sometimes the situation can still be corrected with minimal effort, even if it slows down in a third-party library for which there are no sources. The investigation of the brakes and optimization took me literally a couple of hours, including mastering Detours (I saw and tried it for the first time). I am afraid that the preparation of the notes in total took more. The tool is powerful, with the right application clearly can do small miracles.

... and Skype message logs, by the way, are stored in the usual SQLite database, which is successfully opened and managed by the SQLite Browser using the usual SQL syntax. Clear history for one contact selectively? You are welcome. For everyone, to the desired depth (which Skype does not change when changing settings)? Easy. Optimize the cleaned base? Again, one click. But this is another, not a C ++ story at all;)

Successful optimizations for you.

Source: https://habr.com/ru/post/122735/


All Articles