How an unfortunate default behavior can mask a misconfiguration for years

It is very convenient when well-chosen defaults make everything work out of the box, with nothing to configure. The moral of this story is that defaults must always lead to a genuinely working setup; otherwise there is a risk of an unexpected failure after many years of trouble-free operation.

We ran into undocumented behavior of Windows Server in Microsoft Azure web roles. For many years it masked an incorrect configuration of our Cloud OCR SDK service, until one not-so-fine day it caused serious problems for some users.

In February 2016, by which time the service had been running for several years, individual users began reporting problems when trying to establish a secure connection. Their programs reported some kind of problem with the service certificate. Since “it worked before, but now it’s broken,” users first assumed that the certificate had expired and, quite predictably, were Very Unhappy™. In fact, the certificate was not due to expire for another few months.

Attempts to reproduce the problem mostly failed, even on exactly the platforms the users named. Two third-party verification tools sometimes reported that the intermediate certificates were installed incorrectly.
From this point on, some technical background is needed. Certificates are issued by special organizations, certificate authorities (CAs), that everyone trusts. Each issued certificate carries a digital signature, which client programs verify.

Each certificate authority has one or more self-signed root certificates that client programs trust unconditionally. In principle, a certificate issued by the authority could be signed directly with the root certificate, but then the root certificate's private key would be needed every time a new certificate is issued, which increases the risk of that key leaking. A leaked private key of a root certificate that millions of systems trust unconditionally is a serious problem.

Therefore, certificate authorities use intermediate certificates. An intermediate certificate is signed once with the root certificate, and all issued certificates are then signed with the intermediate one. Chains with several intermediate certificates are also possible. To verify the digital signatures, the client program needs every certificate in the chain: it stores the root certificate and trusts it unconditionally, but the root only lets it verify the signature on the intermediate certificate, and to verify the signature on the service certificate it needs the intermediate certificate itself.
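The chain walk described above can be sketched as a toy model. This is only an illustration: it links certificates by name and checks no real signatures, and every name in it (`Cert`, `build_chain`, the example subjects) is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cert:
    """Toy certificate: names only, no real keys or signatures."""
    subject: str
    issuer: str


def build_chain(leaf: Cert, available: list[Cert],
                trusted_roots: set[str]) -> Optional[list[Cert]]:
    """Follow issuer links from the leaf until a trusted root is reached.

    Returns the chain (leaf first), or None if some issuer is unavailable.
    """
    by_subject = {c.subject: c for c in available}
    chain = [leaf]
    current = leaf
    # Bound the walk so a malformed loop of issuers cannot spin forever.
    for _ in range(len(available) + 1):
        if current.issuer in trusted_roots:
            return chain
        parent = by_subject.get(current.issuer)
        if parent is None:
            return None  # the signer of `current` is not available
        chain.append(parent)
        current = parent
    return None


# Hypothetical names, for illustration only.
leaf = Cert(subject="service.example", issuer="Example Intermediate CA")
intermediate = Cert(subject="Example Intermediate CA", issuer="Example Root CA")
roots = {"Example Root CA"}

print(build_chain(leaf, [], roots))              # None: the intermediate is missing
print(build_chain(leaf, [intermediate], roots))  # chain of two certificates
```

With only the service certificate and the trusted root, the walk stops at the missing intermediate. That is the situation a client is in when nobody gives it the intermediate certificate.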

To verify signatures, client programs only need the certificates, without the private keys. The private keys are kept secret by the certificate authority (for root and intermediate certificates) and by the service owner (for the service certificate).

The client program has several ways to obtain intermediate certificates.

The first option is to install the intermediate certificates on the machine where the program runs. This is inconvenient, and users cannot be expected to install intermediate certificates themselves.

The second option is for the program to try to download them over the network from the certificate authority's servers. This is convenient but not supported by all SSL client implementations. It is also unreliable, because the authority's servers must be reachable at the moment the connection is established, and it slows down the first connection.

The third option is the most reliable and universal. The server itself sends all intermediate certificates to the client along with its own certificate at the start of the secure connection. Roughly: “here is my certificate; it is signed by these certificates, the last of which is signed by a root that you, most esteemed client program, should already have marked as trusted; kindly check all the signatures, satisfy yourself, and let's establish the connection.”

Third-party verification tools showed that our service sometimes sent the intermediate certificate when establishing a connection and sometimes did not. On investigation, it turned out that, by following the available examples, we had configured the service deployment incorrectly.

Here is the typical definition of a service with a web role:

<?xml version="1.0" encoding="utf-8"?>
<ServiceDefinition name="CoolCloudService" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition" schemaVersion="2014-06.2.4">
  <WebRole name="CoolRole">
    <Sites>
      <Site name="Web">
        <Bindings>
          <Binding name="HttpIn" endpointName="HttpIn" />
          <Binding name="HttpsIn" endpointName="HttpsIn" />
        </Bindings>
      </Site>
    </Sites>
    <Endpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
      <InputEndpoint name="HttpsIn" protocol="https" port="443" certificate="ProductionCert" />
    </Endpoints>
    <Certificates>
      <!-- !!! Only the service certificate is listed here !!! -->
      <Certificate name="ProductionCert" storeLocation="LocalMachine" storeName="My" />
    </Certificates>
  </WebRole>
</ServiceDefinition>

Usually only the service certificate is listed inside the Certificates element. That is wrong. All intermediate certificates in the chain should be listed there as well, even though they are not referenced in the Endpoints section.

The correct version:

  <Certificates>
    <!-- Intermediate certificate from the chain: must be listed too -->
    <Certificate name="IntermediateForProductionCert" storeLocation="LocalMachine" storeName="CA" />
    <!-- Service certificate, same as before -->
    <Certificate name="ProductionCert" storeLocation="LocalMachine" storeName="My" />
  </Certificates>

Listing all intermediate certificates causes them to be installed into the certificate stores on each role instance during instance initialization. After that, IIS can send the intermediate certificates to client programs when establishing a secure connection.

WIN? No, not so fast.

After fixing the settings and deploying the change, we contacted a user who conveniently happened to be in a nearby time zone. He replied that no, it did not help, EVERYTHING IS BAD, nothing works.

The next suspect was the operating system image the service ran on. Shortly before the certificate verification problems began, the service had been moved to the next, newer image with the latest set of updates. With rare exceptions, about which Microsoft warns in advance, such images include only “neutral” updates that break nothing. This time the list contained two updates whose descriptions mentioned certificate hashes and the resumption of secure connections.

We temporarily moved the service back to the previous image... and the problem went away.

WIN? No, not even close to a WIN.

First, it was unclear which of the two changes had solved the problem. Staying on the older image forever is impossible: in a couple of months the Azure infrastructure will force the service onto a newer image, and the problems may return. Moving the service to a newer operating system image and testing on users was somehow not an option we liked.

Second, it was unclear why the problem appeared at this particular time and not earlier: the settings had not changed, yet “it worked before, and then it broke.”

On a test service with the incorrect “before” settings, with only the service certificate listed, all third-party diagnostic tools showed that the certificates were installed correctly and were returned by the service in response to a client request. Yet code that we added to the test service to list the installed certificates showed that there was no intermediate certificate in the stores.

There is no certificate in the stores, yet the service hands it to users. Teleportation, apparently. Nothing unusual.

We also had reports from users. One of them went like this: “Here is a piece of PHP code. We run it in a loop; sometimes it succeeds and sometimes it fails, and as the problem got worse, it failed more and more often.”
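A loop of the kind the user describes can be sketched as follows. This is not the user's PHP code, only a rough Python equivalent under stated assumptions: the helper names are hypothetical, `service.example` is a placeholder host, and the classification relies on OpenSSL verify codes 20 and 21, which typically indicate that an issuer certificate could not be found. To the best of our knowledge, Python's `ssl` module does not download missing intermediate certificates on its own, which makes it a convenient strict client for this kind of check.

```python
import socket
import ssl
from collections import Counter
from typing import Callable

# OpenSSL verify codes that usually mean "an issuer certificate is missing":
# 20 = unable to get local issuer certificate,
# 21 = unable to verify the first certificate.
MISSING_ISSUER_CODES = {20, 21}


def handshake_once(host: str, port: int = 443, timeout: float = 10.0) -> None:
    """Open one TLS connection with default verification; raise on failure."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host):
            pass  # the handshake and chain verification happen in wrap_socket


def tally(attempt: Callable[[], None], tries: int) -> Counter:
    """Run `attempt` repeatedly and count the outcomes by category."""
    results: Counter = Counter()
    for _ in range(tries):
        try:
            attempt()
            results["ok"] += 1
        except ssl.SSLCertVerificationError as exc:
            code = getattr(exc, "verify_code", None)
            key = "missing_issuer" if code in MISSING_ISSUER_CODES else "other_cert_error"
            results[key] += 1
        except OSError:
            results["network_error"] += 1
    return results


# Example call (placeholder host name, requires network access):
#   print(tally(lambda: handshake_once("service.example"), tries=20))
```

A mix of `ok` and `missing_issuer` results against the same host would match the “sometimes it works, sometimes it fails” picture from the report.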

We also have code that periodically checks the validity of our certificate and, just in case, checks its chain of trust using the X509Chain.Build() method. Previously it had worked fine, but during the period when users were having problems, this method sometimes failed, producing the following set of messages:

A certificate chain could not be built to a trusted root authority.
RevocationStatusUnknown The revocation function was unable to check the revocation for the certificate.
OfflineRevocation The revocation function was unable to check revocation because the revocation server was offline.

This looks suspiciously like an inability to reach the certificate authority's servers.

But what if a role instance is able to fetch the missing intermediate certificates from the certificate authority and carefully tuck them away so that IIS can serve them to client programs? That would be EXTREMELY unexpected, and making such an assumption without solid evidence is very reckless ~~; one might just as well suggest that Windows eats kittens~~.

We had to check. Attempts to configure a firewall or to alter name resolution by editing the hosts file on a role instance got us nowhere. That was to be expected: by the time it first becomes possible to perform these actions, the role instance has already been running for several minutes, and by then it may be too late, since the certificate has already been downloaded and carefully tucked away.

So we needed a way to completely cut the role instance off from the certificate authority's infrastructure. Conveniently, Azure makes this fairly easy.

All we needed was a virtual network in which we created a subnet with a network security group; we then added a NetworkConfiguration element to the service settings so that the service would be deployed into this virtual network. That was easy.

Network access restrictions are configured in the network security group. We added a rule denying outbound requests to the address range where the certificate authority's infrastructure is located.
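The effect of such a deny rule can be illustrated with a small model. This is a simplified sketch, not Azure's implementation: it assumes rules are evaluated in priority order with the first match deciding, and that unmatched outbound traffic is allowed. The `OutboundRule` type is hypothetical, and 203.0.113.0/24 is a documentation address range standing in for the real addresses, which the article does not give.

```python
from __future__ import annotations

import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class OutboundRule:
    priority: int     # lower number = evaluated first
    destination: str  # CIDR range
    allow: bool


def is_outbound_allowed(rules: list[OutboundRule], destination_ip: str,
                        default_allow: bool = True) -> bool:
    """Return the decision of the first matching rule in priority order."""
    address = ipaddress.ip_address(destination_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        if address in ipaddress.ip_network(rule.destination):
            return rule.allow
    return default_allow


# 203.0.113.0/24 is a placeholder for the certificate issuer's servers.
rules = [OutboundRule(priority=100, destination="203.0.113.0/24", allow=False)]

print(is_outbound_allowed(rules, "203.0.113.10"))  # False: the issuer's host is blocked
print(is_outbound_allowed(rules, "198.51.100.7"))  # True: everything else passes
```

Because the restriction sits in the network rather than on the instance, it is already in force when the instance first boots, which is what the firewall and hosts-file attempts could not achieve.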

If the test service is deployed after that, the problem reported by users is reproduced regardless of how fresh the operating system image is: IIS stops sending the intermediate certificate. If the rule in the network security group is changed so that requests are no longer restricted and the service is deployed again, the problem no longer reproduces.

After thoroughly probing this black box, we found that an attempt to obtain the intermediate certificate is made on the initial initialization of a role instance, on its reboot, on its reimaging from the operating system image, and on redeployment of the service package. Restarting the application pool in IIS does not trigger an attempt to obtain the missing certificate. So the attempt is tied to the moment the site is deployed in IIS; presumably, it happens when the site certificate is installed. The article “How Certificate Revocation Works” mentions a certain CryptoAPI disk cache of certificates. CryptoAPI is part of Windows Server.

Normally the attempt to obtain the missing certificate succeeds, the incorrect service configuration is masked, and nothing in the role instance logs warns about it. If the attempt fails at the wrong moment, that particular role instance starts up in a degraded state; again nothing appears in the logs, but now some client programs cannot establish a secure connection with it.

Now add scaling and a load balancer that distributes incoming requests among role instances. Different instances can start at different times and therefore with different availability of the certificate authority's infrastructure, so some instances end up in better shape than others. The load balancer sends different requests to different instances at its own discretion, so different requests can get a different set of certificates in the response. Low-activity users see “now it works, now it doesn't,” while very active users see that some share of their requests succeeds and the rest fail.
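The arithmetic behind this picture can be shown with a deliberately simplified simulation. It assumes a plain round-robin balancer (the real balancer's policy is not described here) and a client that cannot obtain the intermediate certificate by itself; the function name is hypothetical.

```python
from __future__ import annotations

from itertools import cycle


def failure_share(instances_have_intermediate: list[bool], requests: int) -> float:
    """Share of requests that land on an instance lacking the intermediate.

    Assumes round-robin distribution and a client that fails whenever the
    server does not send the intermediate certificate.
    """
    if not instances_have_intermediate:
        raise ValueError("at least one instance is required")
    if requests <= 0:
        raise ValueError("requests must be positive")
    picker = cycle(instances_have_intermediate)
    failures = sum(1 for _ in range(requests) if not next(picker))
    return failures / requests


# Four instances; one of them started while the issuer's servers were unreachable.
print(failure_share([True, True, False, True], requests=1000))  # 0.25
```

With one bad instance out of four, a busy client sees roughly a quarter of its requests fail, while a client that connects once in a while gets a coin-toss experience.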

The puzzle is complete. Now that is a WIN.

This is definitely not the best default behavior.

We now have code that, when checking our certificate, also checks that every certificate in the chain is present in the appropriate stores of the role instance. Thanks to the power of pull requests, the examples in the Microsoft Azure documentation will be fixed soon.
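A store check of this kind might look roughly like the sketch below. It is not our production code (which uses .NET), only an illustration in Python under stated assumptions: certificates are identified by the SHA-1 thumbprint of their DER encoding, the expected thumbprints are supplied by the caller, and the function names are hypothetical. `ssl.enum_certificates` is a standard library call that is available on Windows only.

```python
from __future__ import annotations

import hashlib
import ssl


def thumbprint(der_bytes: bytes) -> str:
    """SHA-1 thumbprint of a DER-encoded certificate, as an uppercase hex string."""
    return hashlib.sha1(der_bytes).hexdigest().upper()


def missing_thumbprints(expected: set[str], store_certs: list[bytes]) -> set[str]:
    """Return the expected thumbprints that are absent from the store."""
    present = {thumbprint(der) for der in store_certs}
    return {t.upper() for t in expected} - present


def read_windows_store(store_name: str = "CA") -> list[bytes]:
    """Read DER certificates from a Windows system store (Windows only)."""
    return [
        cert
        for cert, encoding, _trust in ssl.enum_certificates(store_name)
        if encoding == "x509_asn"
    ]


# Example call on a role instance (the thumbprint value is a placeholder):
#   missing = missing_thumbprints({"<intermediate thumbprint>"}, read_windows_store("CA"))
#   if missing:
#       raise RuntimeError(f"intermediate certificates not installed: {missing}")
```

The point of checking the store directly, rather than only building the chain, is that a chain build can succeed thanks to a silently downloaded certificate and so hide the misconfiguration.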

And if you run a service with web roles, it may well still be configured incorrectly.

Dmitry Mescheryakov,
product department for developers

Source: https://habr.com/ru/post/280840/
