
Scaling TLS



Habr, this is a talk from one of the "non-main" halls of Highload++ 2016. Artem "Ximaera" Gavrichenkov, technical director of Qrator Labs, talks about applying encryption, including in high-load projects. Video and slides are at the end of the post; thanks to Oleg Bunin.

Greetings! We continue the session about HTTPS, TLS, SSL and all that.
What I'm going to talk about now is not a tutorial. As my university database lecturer, Sergey Dmitrievich Kuznetsov, used to say: "I will not teach you how to set up Microsoft SQL Server, let Microsoft do it; I will not teach you how to configure Oracle, let Oracle do it; I will not teach you how to configure MySQL, do it yourself."
In the same way, I will not teach you how to configure NGINX; that is all on Igor Sysoev's site. What we will discuss is a general view of the problems, and of the ways to solve them, that arise when you roll out encryption on public services.



A small excursion into very recent history: as you know, lately the topic of encrypting public services has risen to a new level. This is well known, and the slide above shows roughly how it happened.

It all started, I believe, in 2010, with Google's invention of the SPDY protocol. De jure, SPDY could work without encryption, but de facto its implementation depended on the NPN (Next Protocol Negotiation) extension, which existed in TLS and did not exist in the standard TCP transport layer. Therefore, in practice SPDY did not work without encryption.

For a long time the industry argued and held lively discussions: does SPDY help? Does it speed things up? Does it reduce load or not? During that time many managed to deploy it, and many people ended up with encryption one way or another.

Four years later, in 2014, the Google Search team announced in their blog that from then on, having HTTPS on a site would improve its ranking in search results. That became a serious accelerator: not everyone thinks about user security, but almost everyone thinks about their place in search results.

So from that point on, HTTPS began to be used by people who had never thought about it before and had not seen any problem for themselves.

A year later the HTTP/2 standard was finalized and published; again, the requirement that it work only on top of TLS did not make it into the text, but de facto no modern browser will speak HTTP/2 without TLS. And finally, in 2016, a number of companies, with the support of the Electronic Frontier Foundation, founded the non-profit project Let's Encrypt, which issues certificates to everyone automatically, quickly and for free.



This is roughly the graph of the number of active Let's Encrypt certificates issued, scale in millions. Adoption is very active; the other statistics below come from Firefox telemetry.



Over the year from November 2015 to November 2016, the share of web pages that Firefox users visited over HTTPS, compared with HTTP, grew by a quarter to a third. Many thanks to Let's Encrypt for that; by the way, they ran (and have already finished) a crowdfunding campaign, so don't pass it by.

That is only one side of the story; speaking about it, it is impossible not to mention the second thread.



It concerns the revelations of a well-known employee of the Booz Allen Hamilton corporation. Who knows who I am talking about? Edward Snowden, yes.

At the time, this served as a kind of marketing push for HTTPS: "We are being watched, let's protect our users." Then another revelation was published, about how the NSA had allegedly been intercepting internal communications between Google's data centers; the company's engineers wrote some unprintable words about the US government in their blog and encrypted all internal communications as well.

Accordingly, about a month later the post about HTTPS being taken into account in ranking was published; some conspiracy theorists may see a connection between these events.

So since 2013-2014 we have had two factors pushing TLS forward, and it really did become popular. It became so popular that, as it turned out, the encryption infrastructure and libraries were not ready for that popularity.



In 2014, no fewer than two critical vulnerabilities were discovered in what was then the most popular SSL library, OpenSSL. Some even tried to migrate to another popular alternative, GnuTLS, which turned out to be even worse. In February 2015 the IETF even issued a separate RFC (RFC 7457) describing the well-known attacks on TLS and datagram TLS (over UDP). That document is evidence of the working group's high optimism: in February 2015 they announced they would "now sum up all the attacks", and since then there have been three more.



And the very first of them was found 20 days after the RFC was published. This story is really about the following: in the discipline of project management there is a term called "technical debt".

Roughly speaking: you are solving some problem, you understand that a proper solution will take 6 months, but you only have 3, and if you put in a crutch here, you can in principle make it in three months. At that moment you are, in effect, borrowing time from the universe. The crutch you put in will sooner or later collapse, and you will have to fix it. If there are too many such crutches, a "technological default" occurs, and you simply cannot do anything new until you have cleaned up everything old.

Encryption, as it existed at the end of the 2000s, was made by enthusiasts. But when people really started using it, it turned out there were a lot of rough edges, and they needed to be fixed somehow. And the most important one is the awareness of the target audience, that is, you, of how it all works and how to tune it.

Let's go through the main points and take a closer look, right through a typical large setup.

Let's start with truisms. To bring up a service, you need to configure it; to configure it, you need a certificate, so you have to buy one. From whom?



The SSL and TLS protocols naturally include a public key infrastructure. There is a set of certificate authorities; accordingly, if you have a service, it must have a certificate signed either by a certificate authority or by someone that a certificate authority has signed. That is, an unbroken chain of trust must be built from you to someone whom the browser and the user's device trust. Whom exactly? For example:



I downloaded a vanilla Firefox, and it ships with about a hundred certificates of trusted certificate authorities. That is, you have a large choice when it comes to buying a certificate.

What are these hundred or so companies?




According to the original ideology, these companies can be trusted because signing certificates and keeping track of security is their core business. Slipping up here is costly, otherwise the company simply goes bankrupt: a simple economic model. These were supposed to be relatively small companies, independent of large corporations and independent of governments. Well, except that some of them are governments, in particular those of Japan, Taiwan and Valencia.



And others have been bought by large corporations for staggering sums.



So today we cannot say with certainty how particular certificate authorities behave. They should have protected, and still should protect, their interests as players in the market, as certificate authorities. But, again, if a CA belongs to a government or to a large organization, and a large client comes along wanting something done for a lot of money, the CA may simply be ordered to do it. The question is what you can do about it.



How did it happen that the original model now works so strangely? Why do these companies, owned, for example, by the Chinese communications ministry (which your browser trusts), still remain certificate authorities?

The point is that it is quite difficult to stop being a certificate authority.



A visual illustration is the history of a certificate authority called WoSign. If I am not mistaken, since 2009 it has been a trusted certificate authority in all popular browsers and operating systems. It became a CA along a fairly standard trajectory: after an audit by Ernst & Young (now EY) it launched an aggressive marketing campaign, handed out certificates left and right, and everything was fine.



In 2016, Mozilla Foundation employees wrote a whole page on their wiki listing the various violations that WoSign had committed at different times. Most of these violations are dated 2015-2016.

What were they? As you know, you cannot get a certificate for someone else's domain. You must somehow prove that you own the domain: as a rule, you must either receive something by physical mail or place a specific token on the site at a strictly defined address. WoSign at least once issued a certificate for AliCDN that AliExpress itself had never requested (they use Symantec's services); nevertheless the certificate was issued and published.



WoSign allowed non-privileged ports to be used for verification, so any user of a server (not just its administrator) could obtain a certificate for any domain hosted on it. WoSign allowed domains of the third, fourth and lower levels to be used to validate the second-level domain. Thanks to that, researchers probing WoSign's resilience managed to get a valid certificate for GitHub.



WoSign allowed arbitrary files to be used, that is, you could specify an arbitrary path on the site where the token would be placed; this way the researchers managed to get signed certificates for Google and Facebook.



Well, in principle all of this is trifles, because under certain conditions WoSign made it possible to issue any certificate for any domain with no verification at all (laughter in the audience).

After that all the small stuff fades into the background, but in total Mozilla listed about 13 complaints. WoSign backdated certificates, and for many years ran software on its validation servers that had gone unpatched, including the parts responsible for security.



WoSign bought StartCom and refused to publish any information about it, which violates the Mozilla Foundation's policies, but compared with everything else that is not so painful.

What is the result?




In September of this year, WoSign was banned for a year in Chrome and Firefox, but Microsoft and the Windows operating system still continue to trust it.

You cannot just stop being a certificate authority.

And this, mind you, is still about the small fry. If we talk about, say, Verisign (Symantec), you have to understand that you cannot just take and revoke Symantec's root certificate, because they signed that same AliExpress, they signed Google, they signed a lot of people, and users would lose access to those resources. So there is no way to revoke a certificate with a snap of the fingers, as was once assumed.

As the expression goes, large certificate authorities are "too big to fail": too large to cease to exist.

There is an alternative to the TLS PKI (public key infrastructure) called DANE, built around DNSSEC, but it has so many infrastructure problems of its own, including latency, that it is pointless to discuss it now; we need to live with what we have.

How to live with it?


I promised to give advice on how to choose a certificate authority. Given the above, my advice is as follows:



Take the free one; there won't be much difference.

If there is an API that lets you automate the issuance and renewal of certificates, great.

The second piece of useful advice for today (we are talking about a distributed setup with a lot of balancers): get not one certificate, but two or three. Buy one from Verisign, a second one from the Polish Unizeto, and get a third for free from Let's Encrypt. This does not increase OPEX (operating costs) much, but it will seriously help you in unforeseen situations, which we will talk about later.
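A minimal sketch of what "certificates from different CAs" can look like on the nginx side: two balancers terminate TLS for the same site with chains issued by different authorities. The hostname and file paths are hypothetical, and in practice each server block lives on its own machine.

```nginx
# balancer-1: certificate bought from the first CA (paths are hypothetical)
server {
    listen 443 ssl;
    server_name www.example.com;
    ssl_certificate     /etc/nginx/tls/example.com.ca1.crt;
    ssl_certificate_key /etc/nginx/tls/example.com.ca1.key;
}

# balancer-2 (a separate machine): the same site, but the chain comes from
# another CA, for instance a free Let's Encrypt certificate
server {
    listen 443 ssl;
    server_name www.example.com;
    ssl_certificate     /etc/nginx/tls/example.com.ca2.fullchain.pem;
    ssl_certificate_key /etc/nginx/tls/example.com.ca2.key;
}
```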



An immediate question: "Let's Encrypt doesn't issue extended validation (EV) certificates (the ones with the extended vetting procedure and the beautiful green indicator in the browser), so what do we do about that?" In fact, EV is security theater, because hardly anyone looks at it; the mobile browser on Windows Phone doesn't even know how to display that indicator. Users don't notice the presence of the lock icon at all, let alone its color.



So if you are not a bank, and your security audit does not force you to buy extended validation, you will get no real benefit from it.



It is better to buy certificates with a short validity period, because if something happens to the certificate authority, you will have room to maneuver. Diversify.



Now a bit more about certificate lifetimes. Long-lived certificates (a year, two, three) have a number of advantages. For example, less hassle: you don't have to renew the certificate every 2-3 months, you install it once and forget about it for three years. In addition, certificate authorities usually offer significant discounts to those who buy certificates for one, three or five years.

At this point, the pluses end and the minuses begin.




If something happens to a certificate, for example you lose the private key and it ends up in someone else's hands, TLS has mechanisms for revoking the certificate. There are two of them, CRL and OCSP. The problem is that at the moment both work in soft-fail mode: if the browser fails to reach the CRL server and check the certificate's status, then "to hell with it", the certificate is considered fine.

Adam Langley of Google compared soft-fail CRL and OCSP to a seat belt that stretches all the time and snaps exactly when you get into an accident. There is no point in it, because the same attacker who can do a man-in-the-middle and substitute a stolen key can also block access to the CRL and OCSP servers, so these mechanisms simply will not work.

Hard-fail CRL and OCSP, that is, variants that show an error if the CRL server is unreachable, are currently used by almost no one.

And again about the hassle: yes, it is inevitable. But to be honest, you will have to automate certificate deployment anyway. Deployment is exactly what needs to be automated, because replacing certificates by hand is the same technical debt.



It will not bite you instantly, but sooner or later problems can begin.



What does automated certificate management give you? You can add, delete and change certificates with one movement of a finger. You can issue them with a short validity period, you can use many keys, you can set up client authentication by certificates. And you will be able to handle, very easily and very quickly, situations like: "Oops, our certificate authority's intermediate certificate has been revoked." This is not a theoretical story; it happened in October 2016.



One of the largest CAs, GlobalSign, with a long history and a very solid reputation, was carrying out maintenance work in October 2016, reclassifying its root and intermediate certificates... and accidentally revoked all of its intermediate certificates. The engineers found the problem quickly, but you have to understand that CRL and OCSP are queried by every browser of every user; it is a rather heavily loaded service. GlobalSign, like everyone else, uses a CDN for this purpose, in its case Cloudflare, and for some reason Cloudflare was unable to flush the cache promptly: incorrect lists of revoked certificates kept circulating on the Internet for several hours.



And they got cached in browsers. Moreover, the CRL and OCSP caches in the browser and the operating system live for about four days. That is, for up to four days users saw an error message when trying to reach Wikipedia, Dropbox, Spotify and the Financial Times; all of these companies were affected. Of course, there are ways to resolve the problem faster, and as a rule they come down to changing the certificate authority and replacing the certificate; then everything is fine. But you still have to get to that point.

Note that in a situation where everything depends on a CDN, everything depends on traffic volume, that is, on whether the bad response managed to get cached on a given CDN node. Sites with high traffic suffer more, because for them the probability is much higher.



We said that "resolving the situation is simple: you just replace the certificate," but first you need to understand what is going on. What does a mass certificate revocation look like from the point of view of the person running the resource?

Prime time, everything is fine, the load is below average, nothing shows up in monitoring... and yet traffic has dropped by 30%. The problem is distributed in nature and very hard to catch: you have to understand that here you depend on a vendor, controlling neither the CRL nor the OCSP, so some unknown damned nonsense is happening and you don't know what to do.



What helps to find such problems?


Firstly, if you have many balancers with certificates from different vendors (as we discussed), you will see a correlation. Wherever there was a GlobalSign certificate, problems started; on the other balancers everything is fine. There's your connection.

It should be understood that, as of November 2016, there were no distributed services for checking CRL and OCSP. That is, there was no service distributed widely enough to monitor Cloudflare and at the same time able to query CRL and OCSP for an arbitrary domain.

But we still have a tool called tcpdump, which also shows clearly what the problem is: sessions break off around the time of the TLS "server hello", so it is obvious that the problems are somewhere in TLS, which means it is at least fairly clear where to look.

An additional advantage of having different keys on different machines is that if one of them leaks, at least you will find out where. That is, you will be able to write a root cause analysis.

Some deployment ideologies say that a private key should never leave the machine on which it was generated. That is a kind of technological nihilism, but such a position exists.



Returning to the GlobalSign story: we can see that TLS is still at the bleeding edge of technology. We still have insufficient tooling and still insufficient understanding of why such problems arise.



You can spend hours searching for the source of such a problem, and once it is found, it has to be fixed very quickly. Hence: automate certificate management.

This requires a certificate authority that provides an API, for example Let's Encrypt, which is a good choice if you do not need a wildcard certificate. In 80% of cases you do not need a wildcard, since it is primarily a way to save money: buy one certificate instead of 20. But since Let's Encrypt gives them away for free, you may simply not need a wildcard.

Beyond that, there is also tooling for automating certificate management with other certificate authorities, such as SSLMate and similar things. And if none of that suits you, you will have to write your own plugin for Ansible.
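Whatever tool ends up rotating the certificates, it helps if the web server config does not hard-code them. A minimal nginx-side sketch with hypothetical paths: the automation only rewrites one small include file and reloads nginx.

```nginx
# Main config: certificate paths live in a small file owned by the automation.
server {
    listen 443 ssl;
    server_name www.example.com;            # hypothetical hostname
    include /etc/nginx/tls/www.example.com/current.conf;
}

# /etc/nginx/tls/www.example.com/current.conf is regenerated by the tooling
# (an Ansible role, a Let's Encrypt client hook, etc.) on every reissue:
#   ssl_certificate     /etc/nginx/tls/www.example.com/2016-11/fullchain.pem;
#   ssl_certificate_key /etc/nginx/tls/www.example.com/2016-11/privkey.pem;
```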



So, 25 minutes of the talk have passed, we are on the 50th slide, and we have finally managed to buy a certificate (laughter in the hall).

Now, how do we tune it? Which points should we stop at?




The first: HTTP defines a header called Strict-Transport-Security, which tells clients that if they have once reached the site over HTTPS, they should keep using HTTPS from then on and never go back to plain HTTP. This header should be present on any server that supports HTTPS, simply because voluntary encryption in the style of "if encrypting works out, fine, if not, never mind" just does not work. Tools like SSLstrip are far too popular. Users will not check "whether there is a lock or not"; they will simply work with the page, so their browser must remember and know in advance that it has to go over HTTPS. HTTPS only makes sense when it is enforced.
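A minimal nginx sketch of enforced HTTPS with HSTS; the hostname, paths and max-age value are illustrative, not a recommendation.

```nginx
server {
    listen 80;
    server_name www.example.com;            # hypothetical hostname
    return 301 https://$host$request_uri;   # no voluntary HTTP: always redirect
}

server {
    listen 443 ssl;
    server_name www.example.com;
    ssl_certificate     /etc/nginx/tls/www.example.com.crt;   # hypothetical paths
    ssl_certificate_key /etc/nginx/tls/www.example.com.key;

    # Tell browsers to come back only over HTTPS for the next year.
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
}
```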



Another useful header is Public-Key-Pins, that is, public key pinning. We said earlier that it is hard for us to trust certificate authorities, so when a user visits the site we can give them a list of hashes of the public keys that may be used to identify this site. That is, no certificates other than those listed may be used by this site; if the browser sees another certificate, it is either theft or some kind of fraud.

Accordingly, Public Key Pinning must include all keys that will be valid during the pinning period: if you set a period of 30 or 60 days, you must list the hashes of all certificates that the site will use during those 30 or 60 days, including future certificates that you are only issuing now and will start using in a month.
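A sketch of what the pinning header might look like in nginx; the base64 values are placeholders, not real pins, and in reality you would pin at least the current key, a backup key and, per the advice above, a key from another CA.

```nginx
# Placeholders, not real pins. A pin is the base64-encoded SHA-256 of the
# public key (SPKI); it can be derived from the private key roughly like this:
#   openssl rsa -in example.key -pubout -outform DER \
#     | openssl dgst -sha256 -binary | openssl enc -base64
add_header Public-Key-Pins
    'pin-sha256="base64+of+current+key="; pin-sha256="base64+of+backup+key="; pin-sha256="base64+of+other+CA+key="; max-age=2592000; includeSubDomains'
    always;
```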

Again, this is very hard to do by hand, it is a big pain, so automation is needed.



The next topic is ciphers. The Mozilla wiki has a page with a config generator for almost all popular web servers: NGINX, Apache, HAProxy and lighttpd (if anyone still has one). Go there periodically, go there often; it may even be worth automating the process. If you have a cryptographer on staff, everything is fine, though then it is not clear what you are doing here. If you do not have a full-time cryptographer, the Mozilla Foundation does, which means that all the necessary changes to cipher suites, all the work cryptographers do trying to understand how ciphers can be vulnerable, is available on that page.
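For reference, this is roughly the shape of the nginx fragment that the Mozilla generator produced for its "intermediate" profile around that time; treat it as a sketch and regenerate your own, since the recommended list changes over time.

```nginx
# Roughly the shape of a Mozilla-generator "intermediate" cipher list for nginx.
# Do not copy this verbatim: regenerate it, the recommended suites change.
ssl_prefer_server_ciphers on;
ssl_ciphers 'ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:!aNULL:!MD5:!DSS';
```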

There is a golden rule of cryptography: "Do not invent your own cryptography." It applies not only to development, but also to administration.



There are many non-obvious things here. For example, this is the list of ciphers that Mozilla recommends enabling by default. Note the underlined ones: two completely identical ciphers, both of which the server must support and offer to the client in the list of available suites. They differ only in that one of them is AES with a 256-bit key and the other with a 128-bit key.

If the 128-bit cipher is reliable enough, why do we need 256? And if there is 256, why do we need 128? Who knows why?



Let me tell you the story of Rijndael; the name already sounds like something out of Tolkien.



The AES encryption standard was commissioned by the US federal government in 1998: a competition was organized, and it was won by the Belgian cipher Rijndael (the word looks so strange because it is assembled from the names of its two inventors, both Belgians). In 2001 the cipher was finally approved by the NSA and put into use, including in the Department of Defense and the US Army. And here the cipher's developers ran into the requirements of the American military, which are, to put it mildly, dated.

What were they?



The military demanded that the cipher provide three security levels; three different levels, with the weakest used for the least important data and the strongest for the most important (Top Secret proper). The cryptographers, faced with this, read the tender requirements very carefully and found that there was no requirement that the weakest variant actually be weak. So they delivered a cipher with three key lengths, where even the weakest, the 128-bit one, still cannot be cracked or decrypted in the foreseeable future.

Accordingly, AES-256 is a completely redundant thing that exists only because the military wanted it that way. AES-128 is quite sufficient today, and the question matters little anyway, because AES is supported by most modern processors out of the box, in hardware, so we do not worry much about it. I am telling this as an illustration of the fact that the cipher list published by the Mozilla Foundation will contain things that may not be obvious to you.

Or here's another example.




Here is how non-obvious modern encryption can be to an ordinary person. There is a property of session encryption algorithms called Perfect Forward Secrecy; it has no established Russian translation, so everyone uses the English term. Many cipher suites have this property, including the very common ephemeral variants of the Diffie-Hellman protocol.

The property means that once a session key has been generated, it cannot be recovered or broken using the original private key, the one that corresponds to the public key recorded in the certificate.

Perfect Forward Secrecy makes passive traffic analysis impossible, as well as what is called out-of-path analysis: analysis of a saved traffic history. There are a great many DPI and WAF solutions out there, and since about 70% of HTTPS requests use algorithms that support PFS, that traffic passes through those DPI boxes like a knife through butter, without any analysis. Moreover, of that 70% of traffic only about 10% is legitimate; the rest is bots and attackers who know how to get past the DPI. Perhaps at this point some of you already see where I am going.



And if it so happens that the state requires you to hand over the private keys of your service so that it can decrypt traffic it collected and stored three months earlier, then thanks to PFS and Diffie-Hellman we can only wish it good luck on its way.
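If you want to be sure that only forward-secret key exchanges are offered, a hedged nginx sketch looks roughly like this; the cipher aliases are OpenSSL's, and the paths are illustrative.

```nginx
# Only ephemeral (EC)DH key exchange, so session keys cannot later be
# recovered from the certificate's private key.
ssl_ciphers 'EECDH+AESGCM:EDH+AESGCM:!aNULL:!MD5';
ssl_prefer_server_ciphers on;
# For DHE suites, generate your own group instead of relying on a small default:
#   openssl dhparam -out /etc/nginx/tls/dhparam.pem 2048
ssl_dhparam /etc/nginx/tls/dhparam.pem;
```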

Protocol summary


1. There is no doubt that SSLv2 (version two) is already dead. If anyone still has it, stop using it right now; it is a very carelessly written protocol.
2. SSLv3 and TLSv1.0 are just as dead and should not be used, with the caveat that IE6 does not support TLS at all; thank God very few people use it nowadays, but in principle such people still exist.

More importantly, a whole bunch of TVs do not support TLS at all; they only support older versions of SSL. There is some noise around this connected with the certification of systems that handle payment cards, PCI DSS. A little over a year ago the PCI DSS Council decided that, starting in June 2016, SSLv3 and early versions of TLS (that is, 1.0) may not be used on services that process payment card data.

Reasonable requirements in principle, except that a bunch of vendors then grabbed their rusty farm implements and went to the PCI DSS Council saying: "In that case we cannot work with TVs or with old smartphones, that is a huge amount of traffic, and there is nothing we ourselves can do about it."

In February 2016, the PCI DSS Council postponed this requirement until 2018 or even 2019. Here we stop and sigh, because SSL of any version is vulnerable, while TLS of any version is not supported by a bunch of TVs, Korean ones in particular. It's a shame.
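On the server side the protocol choice comes down to a single directive; a sketch for nginx, assuming you can afford to drop the legacy clients (otherwise TLSv1 has to stay in the list for now).

```nginx
# SSLv2/SSLv3 are simply absent from the list, so nginx refuses them.
ssl_protocols TLSv1.1 TLSv1.2;
# If old TVs and phones (or the postponed PCI DSS deadline) still matter to you:
#   ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
```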



Nevertheless, the TLS protocol continues to live and develop (with the exception of version 1.0).

The current version is 1.2, and 1.3 is under development. It may even be that TLS is growing too fast, because TLS 1.2 made it possible to implement a rather interesting mechanism: under certain conditions, to force the server to sign a strictly defined token sent to it by the client and to obtain that signature. That is, we can prove a statement like: "We connected to this server over an encrypted connection this many times during this period of time." Which is exactly the kind of primitive you need to build a blockchain on: see the DDoSCoin cryptocurrency as a concept; there is a fairly detailed paper, and the code can be found on GitHub.



And now, just skimming the surface, a few final things.


OCSP stapling is a very useful thing. We already said that OCSP certificate revocation is soft-fail. On the other hand, you can enable OCSP stapling: the server itself goes to the validation server, obtains the necessary signed response from it, and then hands it to the client.

For most high-load sites, certificate authorities may even require this, since it is unprofitable for them to carry that load themselves. Carry it yourself.

The problem is that in some cases OCSP stapling turns into a real hard fail: if the certificate authority's OCSP server is unavailable for a while and you do not manage to get a valid signed response for the current moment, you get downtime and hurt feelings. So before enabling stapling, gather the data and understand who your certificate authority is right now and how its infrastructure is built.
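A minimal nginx sketch of OCSP stapling; the resolver address and the path to the issuer chain are illustrative.

```nginx
ssl_stapling on;
ssl_stapling_verify on;
# Certificate chain of the issuing CA, used to verify the stapled response:
ssl_trusted_certificate /etc/nginx/tls/ca-chain.pem;
# nginx needs its own resolver to reach the CA's OCSP responder:
resolver 127.0.0.1 valid=300s;
resolver_timeout 5s;
```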

Next, when rolling out TLS you need to understand that a TLS handshake is expensive and time-consuming: the packet rate goes up, so does the load, pages open more slowly. So even if for some reason you were not using keep-alive connections with a sensible lifetime before, with TLS Adam Langley himself tells you to.
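A sketch of the related nginx knobs: connection reuse plus TLS session resumption, so the full handshake is paid as rarely as possible. The values are illustrative.

```nginx
# Reuse TCP+TLS connections between requests instead of handshaking every time.
keepalive_timeout 65s;
keepalive_requests 1000;
# Let returning clients resume TLS sessions without a full handshake.
ssl_session_cache shared:SSL:10m;
ssl_session_timeout 1d;
```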

Another important thing, which I will not dwell on in detail: if you have any unencrypted content, you need to do something about it. If you are a large company with a large infrastructure, you can go to your banner network and say: "Give me this over an encrypted channel." If you are a small service, you will not hear anything nice in response.

Three final points to keep in mind:





Source: https://habr.com/ru/post/326824/