
Audit of a “slow” application at a large concern

Originally I just wanted to reply to a comment and cite, as an example, the unexpected results of a web-application audit I once did, but the reply grew far too long.
So this article was born.

Introduction


The idea was that in the corporate sector, hiding behind far-fetched or externally imposed security standards, people sometimes produce completely unjustified and occasionally outright wild implementations of those standards, often bordering on unworkable conditions. For example, a CIO and the like (careless or simply lazy executors), guided by such policies, may overdo it and often do not find (or do not even look for) a better solution.

As a result we get antivirus software on servers, with all the consequences, because it simply must be on every computer. Employees are forced to work as administrator (aka MakeMeAdmin), because creating a technical admin account (tech-user) for scripts is considered silly, and restarting or debugging services is otherwise impossible at all (after all, the same antivirus is there anyway). Policies allow executables to run from anywhere (network shares, temp directories, etc.), because some update service does not know any other way (no big deal, the antivirus is once again the argument). And so on, and so forth.
In fact, I understand perfectly well where such requirements come from. Quite often they really are conditions set by customers, requirements of a parent or partner company, or de facto industry standards that simply must be met. That is, there is plainly no way around them.
But the fact is that some of these things are, technically, completely unjustified as security measures; worse, in reality they are not secure at all, and on top of that they hinder both development and the productive work of that same client.
And if, to cover your backside with such pseudo-standards, you decide to satisfy the security requirement come what may, then at least do it with your head switched on, so that it does not affect (or only minimally affects) the company's productivity.

An example: we once fought with a well-known antivirus (demonstrably a large one; the problem was probably somewhere around its analytics and real-time scanning queues) on a heavily loaded server (32 cores at >= 50% CPU load). And in production it is effectively impossible to disable it on the servers, even temporarily as evidence, since such an antivirus is usually exactly what the requirements hide behind.

One could, for example, segment the network and put that same antivirus behind NAT. Or at least test it for such sores and replace it with another, more reliable product. Compare that cost with, say, one man-hour times 25,000 employees, every day, minute by minute, wasted on dumb waiting for an application to respond.
Instead, the number of reinvented wheels like “safe_rename”, “real_delete” or “start_process_with_observe” keeps growing around projects. The same CIO would quickly reconsider his position if he (or his division) were collectively invoiced for the total “idle” (waiting) time of all employees.
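To put rough numbers on that invoice idea, here is a back-of-the-envelope sketch. All figures below (the two seconds of extra wait per page load and the hundred page loads per employee per day) are made-up assumptions for illustration, not measurements:

```python
# Hypothetical cost of forced waiting; every constant here is an
# illustrative assumption, not a figure from the audit.
EMPLOYEES = 25_000
EXTRA_WAIT_S = 2.0    # assumed extra wait per page load
LOADS_PER_DAY = 100   # assumed page loads per employee per day

wasted_hours_per_day = EMPLOYEES * EXTRA_WAIT_S * LOADS_PER_DAY / 3600
print(f"{wasted_hours_per_day:.0f} person-hours lost per day")
```

Even with such modest assumptions, the bill comes to well over a thousand person-hours per day, which makes the man-hour comparison above rather concrete.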

Audit


I once had to audit a more or less large application at a very large concern: a web application on the company intranet. Rumor had it that it had never been particularly fast, and after some release everything became very, very slow.
The manufacturer claimed everything was fine on their side: supposedly the logs did not show the application being overloaded, and a load test matching production conditions could not be reproduced. In short, the client's patience ran out, hence the audit...

It all started with writing analyzers for the logs the client sent over (thank God the logging in the application, or rather in the app server, could be made quite detailed). As a result, the huge logs were boiled down to a more or less readable form. An example of a trimmed protocol after the analyzer (reformatted for this article) can be seen under the spoiler below for anyone interested. I will just say that the logs showed nothing obvious: yes, the application servers were not fast (here SQL is slow, there the storage or the NAS), but in principle they could have coped with a load ten times greater than the one analyzed.

An example of an analysis protocol ...
Analyze:
  Analyze-Time:  9338645 ms (155.644 min), 07:45:09 - 10:20:48
  Idle (ms):     6069160 (101.153 min)   Idle-avg: 735.66   Count: 8250
  Busy (ms):     3269485 (54.491 min)    Busy-avg: 396.3
  Total (ms):    4536133 (75.602 min)
  AVG (ms):      285.6
  Requests:      15883
  Users:         106

                     Total                       Avg (ms)              Min (ms)         Max (ms)
  WorkTime (ms):     521217468 (8686.958 min)    4917145 (81.952 min)  344 (0.006 min)  9322426 (155.374 min)
  ServerTime (ms):   4536133 (75.602 min)        42793 (0.713 min)     142 (0.002 min)  298511 (4.975 min)

  Name                 Time (ms)             AVG (ms)   Count
  /app/mailbox.htm     1135731 (18.929 min)  1108.03    1025
  /app/docmain.htm      770339 (12.839 min)   514.59    1497
  /port/result.htm      616371 (10.273 min)  1064.54     579
  /app/tree.htm         606983 (10.116 min)   211.35    2872
  /app/view.htm         286304 ( 4.772 min)   239.79    1194
  /app/docnavi.htm      255469 ( 4.258 min)   222.73    1147
  /port/docview.htm     173729 ( 2.895 min)   622.68     279
  /app/result.htm       135370 ( 2.256 min)   474.98     285
  /port/report2.htm     109145 ( 1.819 min)  5457.25      20
  /port/search.htm       72917 ( 1.215 min)   197.07     370
  /port/pdfview.htm      34346 ( 0.572 min)   602.56      57
  /app/action.htm        32499 ( 0.542 min)   172.87     188
  ...
  /app/empty.htm             0 ( 0.000 min)     0           1

User-Times:
  UID:        net\u101165                      net\u144102                      net\u193619                      ...
  Time (ms):  298511 (4.975 min)               238661 (3.978 min)               168190 (2.803 min)
  AVG (ms):   2446.81                          269.07                           282.2
  Count:      122                              887                              596
  Worktime:   131.818 min 07:55:26-10:07:16    117.066 min 08:23:28-10:20:32    150.534 min 07:46:35-10:17:07

  Requests per user (Name, Time (ms), AVG (ms), Count):
    net\u101165:  /port/result.htm   280962 (4.683 min)  12771.00   22
                  /port/docview.htm   11938 (0.199 min)    746.13   16
                  /port/search.htm     3750 (0.063 min)    250.00   15
                  /port/tree.htm        797 (0.013 min)     88.56    9
                  /port/view.htm        640 (0.011 min)     40.00   16
    net\u144102:  /app/docmain.htm    89560 (1.493 min)    621.94  144
                  /app/tree.htm       55327 (0.922 min)    278.03  199
                  /app/docnavi.htm    42245 (0.704 min)    291.34  145
                  /app/view.htm       22986 (0.383 min)    247.16   93
                  /app/mailbox.htm    16126 (0.269 min)   1466.00   11
    net\u193619:  /app/mailbox.htm    53689 (0.895 min)   1677.78   32
                  /app/docmain.htm    39019 (0.650 min)    750.37   52
                  /app/tree.htm       22122 (0.369 min)    254.28   87
                  /app/view.htm       15830 (0.264 min)    316.60   50
                  /port/result.htm     9622 (0.160 min)    253.21   38
    ...

Roughly speaking, even the total idle time of all server workers (101.153 min out of 155.644 min) says as much.
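The per-URL totals in the protocol above come from exactly this kind of aggregation. A minimal sketch of the idea, assuming a simplified one-request-per-line log format (the real app-server protocol was far more verbose, and the sample lines are invented):

```python
from collections import defaultdict

def aggregate(log_lines):
    """Aggregate per-URL time totals from simplified log lines.

    Each line is assumed (for illustration) to look like:
        "<user> <url> <duration_ms>"
    """
    stats = defaultdict(lambda: {"time": 0, "count": 0})
    for line in log_lines:
        user, url, ms = line.split()
        stats[url]["time"] += int(ms)
        stats[url]["count"] += 1
    for s in stats.values():
        s["avg"] = s["time"] / s["count"]
    return dict(stats)

lines = [
    "u101165 /port/result.htm 300",
    "u101165 /port/result.htm 500",
    "u144102 /app/tree.htm 200",
]
print(aggregate(lines)["/port/result.htm"])
```

The real analyzer additionally grouped by user and derived the idle/busy split, but the Time / AVG / Count triples in the protocol are exactly this kind of sum.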

I will not burden the reader with everything else I had to deal with while hunting for the culprit that had pulled the handbrake.
After some reading of tea leaves, the suspects ranged from the same antivirus (where would we be without it) and balancer problems to plain swapping on the client, browser extensions, or JavaScript slowly dying in the browser in some asynchronous calls.
Everything pointed toward a hellish code audit ahead. In the meantime, I decided to check on site how it all behaved.

They show me the application at work, and everything really is very slow. The browser creaks along sluggishly; after a click, pages open frame by frame, slow as over a modem. There is no swapping; memory is not being eaten up at all.

And having spent just an hour searching, I ended up in silent horror (there is actually another word for it).

But first things first: I wedged my own proxy, with access logging enabled, between the browser and the server, complete with MITM tricks such as forwarding the NTLM credentials (the proxy also ran under the user's account), rewriting absolute URLs to another port, and so on. I expected the logs to show that the application had nothing to do with it, and that the culprit was either the antivirus or something in the browser (scripts and the like).
And after a few clicks in the application, I unexpectedly found in the log file a good hundred mini-requests from the browser for every invoked request (for every click). All the statics, i.e. icons, static scripts and stylesheets, were requested again and again with each click: one or two large 200 responses and many (very many) small 304s (Not Modified).
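The pattern is easy to spot mechanically: count the status codes around each click. A minimal sketch, assuming a combined-style access-log format (the log lines below are invented examples, not the audit's real log):

```python
import re
from collections import Counter

# Match the HTTP status code that follows the quoted request line.
STATUS_RE = re.compile(r'" (\d{3}) ')

def status_counts(lines):
    """Count response status codes in access-log lines."""
    return Counter(m.group(1) for line in lines
                   if (m := STATUS_RE.search(line)))

log = [
    '10.0.0.5 - u101165 "GET /app/docmain.htm HTTP/1.1" 200 18342',
    '10.0.0.5 - u101165 "GET /static/icons/mail.gif HTTP/1.1" 304 0',
    '10.0.0.5 - u101165 "GET /static/app.css HTTP/1.1" 304 0',
    '10.0.0.5 - u101165 "GET /static/app.js HTTP/1.1" 304 0',
]
print(status_counts(log))  # one 200 page and a swarm of 304s for statics
```

A healthy cache would have produced the 200 and no further traffic at all for the fresh static assets.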

Yet the application, as expected, sent the correct caching headers (Cache-Control, Expires, etc.) for statics.
In short, it turned out the browser cache had simply been “turned off” (by policy)!
In the entire concern!!!
That is, the IE option “Check for newer versions of stored pages” was set to “Every time I visit the webpage”.
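What that setting does, in effect, is override the normal freshness check. A sketch of the logic at issue (not any browser's actual code):

```python
def must_revalidate(age_s, max_age_s, always_check=False):
    """Decide whether a browser revalidates a cached asset.

    With sensible settings, a fresh entry (age < max-age) is served
    straight from cache with no network traffic at all. With the
    policy-enforced "check every visit" flag, every asset triggers a
    conditional GET, costing a full round trip even when the answer
    is only a 304 Not Modified.
    """
    if always_check:
        return True              # the concern-wide IE policy
    return age_s >= max_age_s    # normal freshness check

# A stylesheet cached 10 s ago with max-age=86400:
print(must_revalidate(10, 86400))                     # served from cache
print(must_revalidate(10, 86400, always_check=True))  # 304 round trip anyway
```

So the server's correct Cache-Control and Expires headers were simply being ignored on every single click.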

Add NTLM on top of all this (with its handshake back and forth and a request-response to the PDC from both sides), take into account that the company has a lot of applications built on a web architecture, multiply by the number of employees of the entire group, and voila! You can walk into the server room and admire the red glow of the overheated network equipment.
That is, the browser not only sends and receives a pile of small request-responses, it also waits for the "not modified" confirmation of the last resource on the page before finishing rendering and showing the page. It seems to me the antivirus, constantly bombarded by the extra requests, also contributed to the big picture (although I do not know how it reacts to, say, a script coming back as a 304).
And as a bonus on top, a thoroughly flooded network: network administrators can tell long (usually obscene) stories about how a huge mass of mini-packets can brighten the life of an entire network segment in general and of a particular router in particular.

By the way, I was lucky, to say the least, with the search (and so was the client): access logging for statics happened to be enabled in my proxy settings, although it was normally off (I had just been testing 304s and their ilk and had not reverted the configuration). And nobody expected (the administrators present were dumbfounded too) that the staff responsible for the policies (in this case, the browser settings) were so incompetent. So incompetent that, IMHO, it borders on grounds for dismissal.

According to rumor, the cache had been “turned off” because in a completely different project some major version of some SAP components could not work otherwise (perhaps Cache-Control, Expires, etc. were not set there, or If-Modified-Since was implemented incorrectly). The point is that the testers did not catch it, and the people who rolled the major release out to production could not roll it back and found nothing better than to dumbly “turn off” the expires and cache-control checks. Once again: in the entire concern! That is, for all applications and for web surfing as such.
They could, for example, have put something proxying in front of the application to change the headers only for exactly those URLs, or hooked into the application server to rewrite them directly. One can come up with a dozen better solutions.
By the way, instead of admitting the failure, they reportedly argued, foaming at the mouth, that it was necessary and good, and that even Microsoft recommends these settings for IE. As a result, your humble servant had to attach to the progress report a whole “dissertation” on the topic of “Why bad is not good.”
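For instance, a targeted fix could be a small header-rewriting middleware in front of the broken application only. A sketch in Python (WSGI), where the URL prefixes and the substituted header value are illustrative assumptions:

```python
def cache_header_rewriter(app, url_prefixes, new_value="no-cache"):
    """WSGI middleware sketch: override caching headers only for the
    URLs the broken component actually serves, instead of disabling
    the cache browser-wide. Prefixes and header value are examples."""
    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "")
        def patched_start(status, headers, exc_info=None):
            if any(path.startswith(p) for p in url_prefixes):
                # Drop the broken component's caching headers and force
                # revalidation for these URLs alone.
                headers = [(k, v) for k, v in headers
                           if k.lower() not in ("cache-control", "expires")]
                headers.append(("Cache-Control", new_value))
            return start_response(status, headers, exc_info)
        return app(environ, patched_start)
    return wrapped
```

Only requests whose path matches the listed prefixes get their caching headers replaced; the rest of the concern keeps its working browser cache.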

Among the ordinary users of that company I gained the reputation of a wizard who solved in an hour a problem the local IT specialists had not been able to find for months (or even years).
Can you imagine: suddenly not just the application but everything else in the company started working much faster! Down to the last network printer...

In return, I ask you not to kick me with phrases like “programmers make mistakes too,” and I write this in the hope that the responsible persons (as well as the executors) out there will listen and at least sometimes switch their heads on... It helps a lot.

Source: https://habr.com/ru/post/256975/
