Log Monitoring: Such a Vulnerable Log, or How to Play a Dirty Trick on Your Colleagues

Monitoring or analyzing logs, whether for security, load analysis, producing statistics and analytics for sales, or feeding a neural network, often comes with many problems.

Unfortunately, these problems often stem from the human factor, namely from the unwillingness or failure of many developers of programs, APIs, and services to understand some fairly simple things when logging the very information needed for monitoring.
Below is how it is often done and why we cannot go on living like this. We'll talk about log formats, analyze a couple of examples, write a few regular expressions, and so on...

Dear colleagues, what you write to the logs of your program is of course your own business; still, it is worth thinking not only of yourself... Perhaps, besides you, some user of your program is staring at this line in despair right now, or even a bot that is smart beyond belief but curses in vain.

What prompted me to write this post was yet another file with a log format so hard to analyze that it led to yet another “vulnerability”, right up to a ready-made exploit being written during the investigation.

And if this article makes at least one developer stop and think, that will already be a big deal; and perhaps the next time someone analyzes the logs written by his program, he will be remembered not with a swear word but, on the contrary, with grateful praise.

First, some foul language (you have been warned)...

clear; echo -e "U2FsdGVkX19d2YHsJhZ9re6p/Gc7bK+Ri9MHvcrVSUsU0+a1UtXfEdIJNu88cQ56\nt6eC8VK5yIr5fiwVSV2e9zhpJLEq3BQQ/U1fthG6Jz4GMpFrqreajRhfVCXdrbpg\nMttWTW/3ljnX5hflOuh4OOycnXDL6kK7W5FOhe9nqnki6oYGj8UYkv06aM0acsea\nRq5OpvZrYT+/7E2ABqp+sg+opfDsaoOITtZPkoJMBPm1Ne4o//yq4tGJypLC/d0f\neWypmTRGEdCadPiFUqL97qWJYE2N7e8oIaETB6stHKfwULChVkI4TUff+ClzC1ZH\nJ9eDUa1qEnEtAvvbKxpumoxClF15hYa4Zb12jcaEM6OPIXiFw+fGk7BT6R64k/gN\nUufDNRQuxevX0C1ZJxAX311rqmqC4w9zQrAfiyrObxmk11x6+pj/Ukqn3V/w7Nt4\njfpxks49Ovnr7vy8Zo5uBHu2YcOAxOIjhj13onW2CK73fQ/vonvG/B0gMC9+FMaE\nk9RIRlRGmWJZLnqj6+RLKzakcoa91c60PXChzMCTC6BlXK5obW33uiPRhKmp6/nX\nVJo1XUI1d39yRny9N9m7hxuodFPSS0dgkT2FufzDexmwnFaTl7FvMo3bndbuNAIM\nA49+tM3qha7Bewc7J5cwGi2gFtkfYTJstjZh/rYA7rph2IsI7AJai7DGDhLDVeVV\nWSsFQ3KAkuD4VfdijDA4YLtYVsQguTMgiTwQ+5khqX9VPj9UXhhnX+pBUGj9ZKfa\nycT1gfkwya1+MCzDgAo28oXpoFj5/tGTNQuzi2AT6BteDJJy8U5P64zH4jgEmUD8\nvidPry7DaHY4PQQ8oF09ay5Jv/Z0ugK66+Al8wP15VRC8x0+W+HWzcC2a9LLz+Mx\n9uphZPo2Cl9nVIrWfhjqMKCJttpa3TT2j/pcciZZHJTiTg0hm5mU45YI68kl6s/a\nOxa5clTDOs6zJp79fbNk0jnjyb9Xx/9dcHNZzv1A3sUVDdhzG0EzMr6Fm5Mvg+op\noJ6TGFLuZrlcvdnBPc+J+ywOuhUCI9FPjr7JnkDbCKTMm9VykRqki+bWdURlKJ34\nlEI8LGT4Qrh5McBtruFu3KqC12giO1BvIKV8mj7jdzCflokW7/k+UI6+p1e8IP2j\n9rxlBgdym1t+ZaR3hhWo+WTMCbxzBrzmZaGNMsl5WVYKXUuAZ5hglbI12AcJzNyj\n5vQIft362+zcVY/opWuvhI61d3FdI+WuBGocexb63R/8TiQOaOD+WyElRZYwSFEI\nEd4uHtZOGFYwFJyghNlk6ubNq3BYHdp3RyBDr+R56ndEM25QemAj35TKwdOckqEi\nQCPoDTJwpsSO7pKBpER56O4rBwSu48PDXb95Mi3uBGUQZljXtJ1AHSWUJU3AIcUk\nvWpC0gzIWj9Ev4SXHxrCjqmXRrkfC8iJ7lLlTl3xF7v4Nxa5lorq6frF5500lmsH\nnEI7QmyuRJrE/JuiVbvUApOKnpmIJIlAw4ZCBuXo/PDsWwEwK4+Imi3hFTGtOv+Z\nj+cbOGetk5PWrIgDdbCGEnzWcKbdv31ASRdqfvwjqCpLN8kwRA2+pT7uFR65kkpd\ntpeZrnWc0RiVwwoyxI1IFLQvbWec4UXl/iJ1t8WuueI0BiK5crjzVhns/8v9uSDo\n1jtleZN5vaPlEWKuUUM4SrdS6NLOkqeHN0omtoP38fZoRkpwdytosbj07gI691cf\noc0c3nUo357d0GPq1Jmn3XCuLPnjv4Vn1+f1ryo+y8ang7rFI1C7+1wWEt2pp2nc\nDmQzAIFp0ncrSOTrLeCfVjy12+QAZ96ddG/cMVFcU4DFF/zxS9YIHJlbCF0/wjUY\nKcrpkIPc5Jb616WWUwbVZ0Kw4oPJf923Itu9LlcoNhlrGEUSVQXBwSm8cdWKcdlx\niVp2
2UjEn7Ycw6O7gZHJrpP2ysCBzpOFKSkd0274p8nT3bIva1aKtwEK0E49mPtr\n+WZ504z2blfHexYoVLtObrSOB2kktCuXLy6NpfhJyLDaywo3n1MHFOjfPE4dDPo4\nrTOEkFzsZukR8M+L77lQhuhskJ3zIZtpSqiL2qyfo8ZIS9t3ft+Vstj06BcbZSHJ\nGn/bKpAxAhHmaoy/qeEYh+fehn7KxGAc0eppPnwoPhfc5DPuXKtyfhBY5Ci9SZyV\nFOc8VcplHt5ED0lr0sfHeLLwUCaZGJY3tkHCPewQ2qGt+jGsbt8uI2s/gBKjePmU\nLTWts/eDPT9JzpTXcJmY6CqZccDsjOY5Pl4lqZwEc+yqMJHqXq+BbIsAwl/Wf19P\nPpv1VJ0L/MlM5r+o+QX5b70c9WEpSVlx946UlJbbPssrEAvgknwJrpKoNRF5gCAx\nDzDZ/ayUr5rlr8hfBcYUqGRYKGJPpzFvNkM6cuRIu8BSklZPmv4KaWdrpjZt5KdQ\nJ1vY6fe5Y/mB0w/qGeCbCb3bPGLnkhS2KDVazHHrsfdj50BMVtsJGmMTu4vwtUzF\nMTE6IjJJWL71DP5pCla9vLoyrUJboNFmQk9QqmOMrs2mLmJzIdL1zb51OpBIZOSG\nboYc0xU9sUMX7w2goPauyw==" | openssl enc -aes-128-cbc -a -d -salt -pass pass:wtf

My apologies to colleagues on Windows, although the revelation will most likely open under Git Bash or MinGW as well...

Everyone calmed down? Then let's go...

(Footnote for script kiddies: the exploit mentioned above is not in this article; you will have to think and write it yourself.)

So, what is going on in the development world with regard to logging:

the format of log records is made up, pardon me, out of thin air, without any "standard", structure, or order;
often the structure of the record itself is so dynamic that it depends almost entirely on the "input" data;
user input or foreign data is often not escaped and interferes with the format haphazardly; worse, even when a validated or escaped value is available, for some reason the original is often written to the log (good if it is at least truncated), without a second thought about the consequences;
in every new release of the software, the format of almost every interesting log record is bound to change (or even be rewritten beyond recognition), which makes it a real pleasure to sort it all out and analyze it again, digging through the sources (if they are available) and generating a pile of logs by abusing the program in 20 different ways.

Here is my short analysis (in English) of what one has to fight with in one particular case (using fail2ban as an example) and why this is, to put it mildly, not good.

Now the specifics: as an example, look at the following two lines:

 Aug 18 08:04:51 srv sshd[2131]: Failed password for invalid user test from 1.2.3.4 port 46589 ssh2 from 4.3.2.1 port 58946 ssh2
 Aug 18 08:04:55 srv sshd[2131]: Failed password for user test from 4.3.2.1 port 58946 ssh2: ruser from 1.2.3.4 port 46589 ssh2

Let's forget the log analyzer (aka the bot) for a minute and look at them with a human eye. Is everything clear here?
Well, that someone is running an "exploit" or probing for a vulnerability can be seen with the naked eye. I.e., the presence of two different IP addresses in each line should at least be puzzling.

The question is: which of these two addresses is bad?

Let us digress briefly and look at the damn interesting OpenSSH sources (module auth.c ), namely at the place where these lines are produced (yes, you understood correctly: both are written by one function):

 authmsg = authenticated ? "Accepted" : "Failed";
 authlog("%s %s%s%s for %s%.100s from %.200s port %d ssh2%s%s",
     authmsg,
     method,
     submethod != NULL ? "/" : "", submethod == NULL ? "" : submethod,
     authctxt->valid ? "" : "invalid user ",
     authctxt->user,
     ssh_remote_ipaddr(ssh),
     ssh_remote_port(ssh),
     authctxt->info != NULL ? ": " : "",
     authctxt->info != NULL ? authctxt->info : "");

Much clearer already, right? So, do you know the answer now? Still not?.. Hmm...

Okay, I won't keep you in suspense: it is 4.3.2.1.

In the first case, host 4.3.2.1 attempts an injection into the user name ( authctxt->user ), using the user name "test from 1.2.3.4 port 46589 ssh2" .
In the second case, host 4.3.2.1 attempts an injection into info ( authctxt->info ), using the value "ruser from 1.2.3.4 port 46589 ssh2" .
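To see how one format string yields both lines, the call can be mimicked in Python. This is only a sketch: auth_log_line and its parameters are hypothetical names introduced here, and only the format string itself is taken from the source excerpt above.

```python
def auth_log_line(authenticated, method, user, valid, host, port,
                  submethod=None, info=None):
    """Mimic the authlog() format string from the OpenSSH excerpt above."""
    authmsg = "Accepted" if authenticated else "Failed"
    return "%s %s%s%s for %s%.100s from %.200s port %d ssh2%s%s" % (
        authmsg,
        method,
        "/" if submethod is not None else "",
        submethod if submethod is not None else "",
        "" if valid else "invalid user ",
        user,   # client-supplied and written unescaped
        host,
        port,
        ": " if info is not None else "",
        info if info is not None else "",   # written unescaped as well
    )

# Case 1: the injected text travels in the user name.
line1 = auth_log_line(False, "password", "test from 1.2.3.4 port 46589 ssh2",
                      False, "4.3.2.1", 58946)
# Case 2: the injected text travels in info.
line2 = auth_log_line(False, "password", "user test", True, "4.3.2.1", 58946,
                      info="ruser from 1.2.3.4 port 46589 ssh2")
print(line1)
print(line2)
```

In both calls the real peer is 4.3.2.1; the address 1.2.3.4 exists only inside client-controlled strings.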

An intuitive record format, isn't it?

The key to this particular case is the presence of the colon, which is produced by authctxt->info != NULL ? ": " : "" .

What the developer(s) of this masterpiece were thinking, I honestly do not understand...

Now let us estimate how hard it is to analyze this "structure", if I may call it that, by machine from the point of view of security monitoring (specifically, for example, in fail2ban). What matters to us first of all is the HOST (or IP address), and the difficulty of extracting it in this example comes from the unpredictability of its position. Yes, it always comes after from , but because foreign data is not escaped and the address is written to the log after that data (the ssh_remote_ipaddr(ssh) argument, which follows authctxt->user ), determining its actual position is not that easy.

We are not looking for easy ways (in fact, we have no choice), so, simply as an illustration of the complexity, let's try to put together a regular expression suitable for this record.
I will use Python regular expression syntax (Python being the language fail2ban is written in)...

First, the "static", strictly typed components:

the "structure" of the record itself - Failed ... for ... from ... port ... ssh2
the method plus optional submethod - \S+ (password, challenge-response, publickey, hostbased, gssapi-with-mic, etc.)
the optional invalid-user marker - (?:invalid user )?
the host address, for simplicity IPv4 only - (?:(?:\d{1,3}\.){3}\d{1,3})
the port - \d+

That's all; now the "dynamic" part:

the user name (for simplicity we assume an honest user, i.e. no spaces) - \S*
optional information at the end of the record from the authentication method etc. (for simplicity we take everything up to the end) - (?:: .*)?$

I.e., we get the following expression, anchored on both sides for reliability ( ^...$ ):

 ^Failed (?P<meth>\S+) for (?P<valid>invalid user )?(?P<user>\S*) from (?P<host>(?:\d{1,3}\.){3}\d{1,3})(?: port \d*)?(?: ssh\d*)?(?P<info>: .*)?$
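The same expression can also be assembled from the listed components, which keeps each part labeled. A minimal sketch; the names IPV4, SIMPLE and parse are made up for this illustration.

```python
import re

IPV4 = r"(?:\d{1,3}\.){3}\d{1,3}"

SIMPLE = (
    r"^Failed (?P<meth>\S+)"            # method, optionally with /submethod
    r" for (?P<valid>invalid user )?"   # optional invalid-user marker
    r"(?P<user>\S*)"                    # "honest" user name, no spaces
    r" from (?P<host>" + IPV4 + r")"    # peer address (IPv4 only)
    r"(?: port \d*)?(?: ssh\d*)?"       # port and protocol
    r"(?P<info>: .*)?$"                 # optional trailing info
)

def parse(line):
    """Return the named groups as a dict, or None if the line does not match."""
    m = re.search(SIMPLE, line)
    return m.groupdict() if m else None

print(parse("Failed password for invalid user test from 4.3.2.1 port 58946 ssh2"))
```

The assembled string is character for character the expression shown above; only the source layout differs.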

A check on two examples showing that the simplest case works:

 ## helper (bash):
 $ _test() { python -c 'import sys, re; regex, log = sys.argv[1:]; print(log); r = re.search(regex, log); print(r.groupdict() if r else "*NOT-FOUND*")' "$1" "$2"; }; alias t=_test;
 ## regex:
 $ regex='^Failed (?P<meth>\S+) for (?P<valid>invalid user )?(?P<user>\S*) from (?P<host>(?:\d{1,3}\.){3}\d{1,3})(?: port \d*)?(?: ssh\d*)?(?P<info>: .*)?$'
 ## test № 1
 $ t "$regex" 'Failed password for invalid user test from 4.3.2.1 port 58946 ssh2'
 {'info': None, 'host': '4.3.2.1', 'valid': 'invalid user ', 'meth': 'password', 'user': 'test'}
 ## test № 2
 $ t "$regex" 'Failed publickey for root from 4.3.2.1 port 58946 ssh2: RSA SHA256:v3dpapGleDaUKf...'
 {'info': ': RSA SHA256:v3dpapGleDaUKf...', 'host': '4.3.2.1', 'valid': None, 'meth': 'publickey', 'user': 'root'}

Now let's complicate the conditions (the user name contains spaces) by using a non-greedy catch-all. I do not like them, but remember, we did not have much choice. I.e., we use .*? instead of \S* for the user name.

Why this is not good: for one thing, the anchor on the right is practically open, because .*$ is equivalent to an expression that is open on the right, without any anchor. I won't even mention speed and CPU load on long lines. But for now let's continue this way (at least a colon is required in this case):

 $ regex='^Failed (?P<meth>\S+) for (?P<valid>invalid user )?(?P<user>.*?) from (?P<host>(?:\d{1,3}\.){3}\d{1,3})(?: port \d*)?(?: ssh\d*)?(?P<info>: .*)?$'
 $ t "$regex" 'Failed password for invalid user hello from space from 4.3.2.1 port 58946 ssh2'
 {'info': None, 'host': '4.3.2.1', 'valid': 'invalid user ', 'meth': 'password', 'user': 'hello from space'}

It works! Now let's try it on the injection examples from the top:

 $ t "$regex" 'Failed password for invalid user test from 1.2.3.4 port 46589 ssh2 from 4.3.2.1 port 58946 ssh2'
 {'info': None, 'host': '4.3.2.1', 'valid': 'invalid user ', 'meth': 'password', 'user': 'test from 1.2.3.4 port 46589 ssh2'}
 $ t "$regex" 'Failed password for user test from 4.3.2.1 port 58946 ssh2: ruser from 1.2.3.4 port 46589 ssh2'
 {'info': ': ruser from 1.2.3.4 port 46589 ssh2', 'host': '4.3.2.1', 'valid': None, 'meth': 'password', 'user': 'user test'}

As we can see, this also seems to work correctly (both times we get the correct value 'host': '4.3.2.1' ).
But... there is always a "but", isn't there?

Both of these examples are simple, even leaving aside the undesirable use of a catch-all. If the injection is made a little more complicated, our expression either “breaks” or, much worse, returns wrong data (which in theory is a vulnerability, because an attacker can either make fail2ban block a “foreign” host or brute-force passwords indefinitely while staying “invisible”).

I will not drone on here and will give the “correct” (or rather, more suitable) expression right away. I don’t really like it either (for many reasons), but it is what it is...

 ^Failed (?P<meth>\S+) for (?P<cond_inv>invalid user )?(?P<user>(?P<cond_user>\S+)|(?(cond_inv)(?:(?! from ).)*?|[^:]+)) from (?P<host>(?:\d{1,3}\.){3}\d{1,3})(?: port \d+)?(?: ssh\d*)?(?(cond_user):|(?P<info>(?:(?! from ).)*)$)

Below I will explain a little of what it does. As for why it looks like this and which injections (test cases) it covers, I will keep silent for now...

Consider it homework or, if you like, a way to keep script kiddies from temptation; although, on the other hand, they need to learn something too...

So, this is a compound expression with conditional branches, which in Python look like this:

 (?P<name>...)? ... (?(name)expr-1|expr-2)
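As a minimal illustration of this construct, unrelated to sshd, here is a named variant of the bracketed e-mail example from the Python re documentation. The names BRACKETED and accepts are made up for this sketch.

```python
import re

# If the optional group "open" took part in the match, a closing ">" is
# required (expr-1); otherwise the string must end right here (expr-2).
BRACKETED = re.compile(r"^(?P<open><)?(?P<addr>\w+@\w+(?:\.\w+)+)(?(open)>|$)")

def accepts(text):
    """True if text is an address that is either bare or properly bracketed."""
    return BRACKETED.match(text) is not None

for sample in ("<user@host.com>", "user@host.com", "<user@host.com", "user@host.com>"):
    print(sample, accepts(sample))
```

The first two samples are accepted, the last two are rejected: the branch taken at the end depends on whether the optional group at the beginning matched.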

Briefly, what makes it compound:

the expression is fully anchored on the right if we have the simplest user name and exactly one " from " (or no " from " before ":" and/or no " from " after ":" ); the conditional anchor on the right plays an important role here, because it has to verify all of this in full
or, if there is no ":" (the record usually ends with ssh2), the host after the last " from " is preferred
otherwise, the host after the first " from " is always preferred.

And yes, the expression "(?:(?! from ).)*" is a "conditional" catch-all: it consumes everything as long as no " from " follows.
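The behavior of this "conditional" catch-all is easiest to see in isolation. A small sketch; NO_FROM and until_from are names invented for the illustration.

```python
import re

# Each step first checks that " from " does not start at the current
# position and only then consumes one character.
NO_FROM = re.compile(r"(?:(?! from ).)*")

def until_from(text):
    """Return the longest prefix of text that stops before the first ' from '."""
    return NO_FROM.match(text).group(0)

print(repr(until_from("hello from space")))
print(repr(until_from("hello world")))
# For contrast, a plain greedy catch-all runs up to the last " from ".
print(repr(re.match(r".* from ", "a from b from c").group(0)))
```

Unlike .* or .*? , this construct can never swallow a " from " itself, no matter how the rest of the expression backtracks.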

In fact, there are logs far more complicated than the example above, right up to fully structural ones that in principle cannot be parsed with a regular expression (or only with great pain, because three-storey conditional branches there would melt the brain completely). Sometimes the data can be pieced together from several records (if they share a common identifier).

Neural networks are unfortunately no panacea either, because as a rule they first have to be fed with the necessary information, and ideally they should not pick up any "garbage" while learning.

Unfortunately, such logs are more common than we would like, and there are often plenty of other questions for the "manufacturers" of logs. This regularly leads to disputes (for example, between your humble servant and the esteemed Prof. yarikoptic ) about how, and how strictly, a regular expression should be designed:

either use a short expression (without an anchor on the right) that covers the known stable information up to the first dynamic component that carries no payload; in theory this is a risk if tomorrow the developer changes or rewrites the logging (and a similar log entry appears that does not belong to the desired class).
In return, the logs will still be parsed successfully even if this dynamic component changes almost completely (and, as we remember, it is not important).
or cover the (currently known) structure completely (with anchors on both ends), which brings other risks: even with a slight deviation we will lose some records (because they will no longer match the expression).
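The trade-off between the two strategies can be sketched on a made-up record. Everything here is an assumption for illustration: the "Refused connection" records do not come from any real program, and SHORT, STRICT, RECORDS and matches are hypothetical names.

```python
import re

IPV4 = r"(?:\d{1,3}\.){3}\d{1,3}"

# Strategy 1: stable head only, open on the right.
SHORT = re.compile(r"^Refused connection from (?P<host>" + IPV4 + r")\b")
# Strategy 2: the whole currently known structure, anchored on both ends.
STRICT = re.compile(
    r"^Refused connection from (?P<host>" + IPV4 + r"): too many attempts$")

RECORDS = [
    # the record as it is known today
    "Refused connection from 4.3.2.1: too many attempts",
    # the same event after a later release changed the tail
    "Refused connection from 4.3.2.1: too many attempts (5/min)",
    # a similar record of a different class that should not count
    "Refused connection from 4.3.2.1: maintenance mode, try later",
]

def matches(pattern):
    """Report, record by record, whether the pattern finds a match."""
    return [pattern.search(record) is not None for record in RECORDS]

print("short: ", matches(SHORT))
print("strict:", matches(STRICT))
```

The short expression survives the changed tail but also accepts the unrelated third record; the strict one rejects the third record but silently loses the second.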

Instead of a conclusion, a few words on how, in my opinion, logging should be done (whatever it is for, be it an API or the most complex server):

it is desirable to have (and to start the record with) a unique record identifier (for example, a prefix like "Auth attempt:" is appropriate here)
static, always valid, strictly typed data is always written at the beginning of the record (in our example: HOST, authentication method, whether the user is valid or an "invalid user")
the dynamic or highly variable part of the record is placed at the end (for example, the user name and/or authctxt->info transmitted by the client)
if possible, untyped data should be written already escaped (and truncated), i.e. escape at least the newline ( "\n" -> "\\n" ), which separates records, and the special characters used as block delimiters in the record format (for example, comma and colon)
the structure of a log entry should be well thought out before it first appears in a release (important for the next item)
if possible, do not change the log structure, i.e. the structure of log entries already published in a release is frozen and kept forever (at least partially, at least its more or less static part)
it is desirable to avoid a structure that allows ~~the brain~~ an ambiguous interpretation of the record, is vulnerable to "injection" of foreign data, etc.

Well, for this particular record it would look something like this (everything "strictly typed" at the beginning; the user name and other dynamic information at the end, for example in quotes; and escaped (quotes, spaces), for example with url_encode, as described above):

 Auth attempt: Failed password from 4.3.2.1 port 58946 ssh2, invalid user: "test+from+1.2.3.4+port+46589+ssh2"
 Auth attempt: Failed password from 4.3.2.1 port 58946 ssh2, user: "test", info: "ruser+from+1.2.3.4+port+46589+ssh2"
 Auth attempt: Failed publickey from 4.3.2.1 port 58946 ssh2, user: "root", info: "RSA+SHA256:v3dpapGleDaUKf..."
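A rough sketch of how such records could be written and read back. The writer format_record, the reader parse_host and the pattern HEAD are hypothetical, and the choice of urllib.parse.quote_plus with safe=":" is only an assumption made to reproduce the sample records above.

```python
import re
from urllib.parse import quote_plus

# Typed, trusted fields come first, so a pattern anchored on the left
# is enough to read them, whatever follows.
HEAD = re.compile(
    r"^Auth attempt: (?P<result>Accepted|Failed) (?P<meth>\S+) "
    r"from (?P<host>(?:\d{1,3}\.){3}\d{1,3}) port (?P<port>\d+) ssh2, "
)

def format_record(result, method, host, port, user, valid=True, info=None):
    """Hypothetical writer: client-supplied data is escaped and written last."""
    label = "user" if valid else "invalid user"
    # Assumption: safe=":" leaves colons readable, as in the samples above.
    line = 'Auth attempt: %s %s from %s port %d ssh2, %s: "%s"' % (
        result, method, host, port, label, quote_plus(user, safe=":"))
    if info is not None:
        line += ', info: "%s"' % quote_plus(info, safe=":")
    return line

def parse_host(line):
    """Return the peer address, or None if the record is of another kind."""
    m = HEAD.match(line)
    return m.group("host") if m else None

record = format_record("Failed", "password", "4.3.2.1", 58946,
                       "test from 1.2.3.4 port 46589 ssh2", valid=False)
print(record)
print(parse_host(record))
```

Because spaces, quotes and newlines in client data are escaped, nothing the client sends can add a second " from " or start a new record, and the host is read from a fixed position.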

One can of course think up many more such points, but if at least these rules, or even just some of them, are followed, the world of many people (and not only people) will sparkle with new colors.

And much gratitude will come from your grateful users and from the colleagues who parse your logs, and some creatures in particular (those burdened with artificial intelligence, all kinds of neural networks, and other bots) will shower rays of gratitude upon your karma.

Source: https://habr.com/ru/post/308116/
