Case study: how we built a cloud for a large company
I have long wanted to tell you about how we built a cloud solution for one of our customers.

So, our Customer is a large international company with hundreds of offices around the world. Its core infrastructure is concentrated in two high-end data centers in Europe, and there are no complaints about those. The local components in the regional offices, however, are run by a multitude of regional service providers, and this creates a management nightmare, both in resolving IT issues and in controlling how the IT budget is spent. The Customer decided that moving most of the non-critical regional services to Microsoft Azure would let them save on infrastructure maintenance, concentrate financial control in the central office and, at the same time, carry out several modernization projects. We had already implemented a hybrid Exchange solution for this Customer based on Office 365, with on-premises components in the few countries where legislation required them, so they turned to us and Microsoft to design and build a cloud platform to host approximately 3000 servers over three years.
All this happened at the end of 2015 and the beginning of 2016; by now the platform has been built and we have already migrated about 500 servers to it. The cloud is one of the most popular topics of recent years, and there is plenty of documentation and material describing what a particular service can do and how you can use it. So here I will talk about the other side of the cloud: the problems you can run into while moving your on-premises infrastructure there.
Update rate
While reading this article you may get the false impression that I am bashing Azure. I am not. Some of the problems simply stem from the fact that this cloud service is developing very actively and very quickly. This is not even specific to Azure; it is a common trait of clouds. You cannot learn how to do something once and then use it unchanged for years: you have to keep learning and evolving constantly, and the solutions you sell to customers have to evolve too. It is hard to blame Microsoft for that, but it does create considerable difficulties.
Besides the new services described below, PowerShell is a very vivid example. Working with the cloud involves a high degree of automation, and the larger your environment, the more this matters to you. Some operations cannot be done through the portal GUI at all. Updates to Azure PowerShell come out almost every quarter, and they often significantly extend or change the behavior of cmdlets: new parameters appear, existing ones change, the types of returned objects change, and so on. This means you have to constantly follow the news, re-test your scripts after every update, and check whether something can now be done more simply or better. A minimal sketch of the kind of check this turns into is shown below.
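For illustration, here is a minimal sketch of comparing the installed Azure PowerShell module against the latest published one. It assumes the AzureRM module installed via PowerShellGet; the exact module name depends on which generation of the tooling you use.

```powershell
# Compare the locally installed AzureRM module with the newest one in the PowerShell Gallery.
# Assumes the module was installed with Install-Module (PowerShellGet).
$installed = Get-Module -ListAvailable -Name AzureRM |
    Sort-Object Version -Descending | Select-Object -First 1
$latest = Find-Module -Name AzureRM -Repository PSGallery

if ($installed -and $installed.Version -lt $latest.Version) {
    Write-Warning ("AzureRM {0} is installed, {1} is available. Read the release notes and re-test your scripts before updating." -f $installed.Version, $latest.Version)
}
```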
We had a funny story related to one of those PowerShell updates. Our engineer wrote a fairly large piece of code to add missing functionality to a cmdlet for working with virtual disks, and at the beginning of the following week an update came out in which that very cmdlet got a new parameter doing exactly the same thing. It was nice (our view of what was missing coincided with the vendor's) and a little sad for the time spent.
Azure Versions
The reason for many of the difficulties at the end of 2015 was that the Microsoft Azure cloud was actively migrating from the Classic (Azure Service Management) model to ARM (Azure Resource Manager). In a sense this can be called a transition from version 1 to version 2. Since all Azure innovations are focused primarily on ARM, the Customer wanted ARM components to be used everywhere except in cases of absolute necessity, not to mention that only ARM makes it possible to properly configure access rights to individual components in Azure in line with IT service delivery standards. The problem was that, at the time, ARM lacked part of the functionality that was already available in Classic. On top of that, ARM and Classic components could not always fully interoperate.
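To illustrate the point about access rights: in ARM, a role can be scoped to a single resource group, which is exactly what delegating regional administration requires. A minimal sketch with the AzureRM cmdlets of that period; the group and resource group names are made up.

```powershell
# Grant a regional admin group Contributor rights on one resource group only,
# which the Classic model could not express. All names are illustrative.
$group = Get-AzureRmADGroup -SearchString "Regional-IT-Admins" | Select-Object -First 1

New-AzureRmRoleAssignment -ObjectId $group.Id `
    -RoleDefinitionName "Contributor" `
    -ResourceGroupName "RG-Regional-Services"
```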
This may seem insignificant; after all, different versions of server operating systems also differ in functionality, and that is normal. The difference here was that cloud services evolve much faster, and the architects who worked on this project on our side were used to reasoning about Azure solutions in terms of the Classic functionality, assuming that the newer versions of the components could do at least everything the old ones could. As it turned out, Microsoft's own architects ran into the same difficulty.
Network
Expanding the customer's IT infrastructure to the cloud begins with the network. Your first task is to create networks in the cloud and link them to your existing infrastructure.
This is where the first surprise awaited us. It turned out that the virtual network topology proposed by the Microsoft architect at the initial stage of the project was based on the idea that a single Azure Virtual Network could have two Virtual Network Gateways: one for the ExpressRoute connection and another for a VNet-to-VNet VPN. The idea was to provide additional isolation of the customer's internal networks from DMZ traffic.
As it turned out, ARM did not allow this. We had to change course on the fly and connect all the VNets to a single ExpressRoute circuit to get routing between them, using User Defined Routes to enforce isolation (a rough sketch of such a route is shown below).
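By User Defined Routes I mean something along these lines: a route table that forces DMZ-bound traffic through a firewall appliance and is then attached to the internal subnet. This is only a sketch using the AzureRM cmdlets; all names and address ranges are invented.

```powershell
# Force traffic destined for the DMZ range through a firewall appliance,
# then attach the route table to the internal subnet. Names and addresses are examples.
$route = New-AzureRmRouteConfig -Name "To-DMZ-via-FW" `
    -AddressPrefix "10.20.0.0/16" `
    -NextHopType VirtualAppliance `
    -NextHopIpAddress "10.0.0.4"

$rt = New-AzureRmRouteTable -Name "rt-internal" -ResourceGroupName "RG-Network" `
    -Location "westeurope" -Route $route

$vnet = Get-AzureRmVirtualNetwork -Name "vnet-internal" -ResourceGroupName "RG-Network"
Set-AzureRmVirtualNetworkSubnetConfig -VirtualNetwork $vnet -Name "Servers" `
    -AddressPrefix "10.10.1.0/24" -RouteTable $rt | Set-AzureRmVirtualNetwork
```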
Another unpleasant aspect of working with a virtual network was the limit on the number of rules in a single Network Security Group (NSG). A few technical aspects of Azure networking need to be mentioned here; each of them, taken separately, was only a minor inconvenience, but together they became a problem:
- You cannot create more than 500 rules in one NSG.
- Much of the functionality for virtual machines in Azure requires access to the IP addresses of Microsoft's public services on ports 80 and 443. Microsoft regularly updates and publishes this list; for some regions it already contains several hundred addresses.
- NSG rules can be created for a contiguous range of addresses or ports, but not for an arbitrary set. That is, you can open traffic to ports 80-443 with one rule, but if you want exactly 80 and 443, and nothing in between, you need two rules (the same goes for IP addresses).
As a result, we not only had to prepare scripts to update our NSG rules automatically (we were certainly not going to rewrite them by hand every week), but, worse, we had far fewer rules left for their intended purpose: controlling traffic between our own networks. A much simplified sketch of such a rule generator follows below.
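Here is roughly what generating those "allow Microsoft services" rules looks like, assuming you already have the published address ranges saved as one CIDR per line; the NSG name, file path and priorities are invented.

```powershell
# Rebuild the outbound "allow Microsoft services" rules of an existing NSG
# from a freshly downloaded list of address prefixes. Much simplified sketch.
$prefixes = Get-Content ".\azure-public-ips.txt"   # one CIDR per line

$nsg = Get-AzureRmNetworkSecurityGroup -Name "nsg-servers" -ResourceGroupName "RG-Network"
$priority = 1000

foreach ($prefix in $prefixes) {
    foreach ($port in 80, 443) {
        # One rule per port, since a rule cannot hold an arbitrary set of ports.
        $nsg | Add-AzureRmNetworkSecurityRuleConfig `
            -Name ("Allow-MS-{0}" -f $priority) `
            -Direction Outbound -Access Allow -Protocol Tcp `
            -SourceAddressPrefix "*" -SourcePortRange "*" `
            -DestinationAddressPrefix $prefix -DestinationPortRange $port `
            -Priority $priority | Out-Null
        $priority++
    }
}

$nsg | Set-AzureRmNetworkSecurityGroup | Out-Null
```

With several hundred prefixes and two rules per prefix (80 and 443 separately), it is easy to see how quickly the 500-rule ceiling is reached.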
Fortunately, this problem will soon be a thing of the past: Microsoft has announced changes to NSGs that will allow more flexible rules.
Restrictions
Since we have touched on quotas in Azure (500 rules per NSG), it is worth noting that they are a headache in themselves on a large project. The set of services in Azure keeps expanding and, naturally, each of them has its own limits. The problem is that there is no single snap-in that shows all the restrictions in one place; you have to rely on a hodgepodge of individual commands that gather this information for you and on several web pages listing the current limits. This is not very convenient, especially when some unexpected limit surfaces that you had not thought about in advance.
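In practice this turns into polling the limits service by service; roughly like this, with the AzureRM cmdlets of the time (the region name is just an example):

```powershell
# Flag per-region compute quotas that are more than 80% consumed.
Get-AzureRmVMUsage -Location "westeurope" |
    Where-Object { $_.Limit -gt 0 -and $_.CurrentValue -ge 0.8 * $_.Limit }

# Number of Storage Accounts used in the subscription vs. the allowed maximum.
Get-AzureRmStorageUsage
```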
Data storage
A good example of a rather subtle quota that few people think about is the performance of a resource such as a Storage Account. A VHD disk on Standard Storage, for most virtual machine sizes, has a maximum performance of 500 IOPS, while the Standard Storage Account as a whole is limited to 20,000 IOPS. At the same time, the maximum disk size is 1023 GB and the maximum capacity of a Storage Account is 500 TB. Do you see the catch yet? As soon as you place a 41st disk into a single Storage Account (41 × 500 IOPS exceeds the 20,000 IOPS cap), you can theoretically end up in a situation where, with all disks fully loaded, their performance starts to be artificially throttled. At that point you have not even used 10% of the maximum capacity (41 × 1023 GB is roughly 41 TB), and each additional disk only makes things worse.
The most annoying thing is that the system will not warn you about this in any way. You will only notice it if you think about such things in advance and either keep no more than 40 disks per Storage Account, or monitor throttling on the account and move heavily used disks elsewhere when it kicks in.
Given that your server deployment is most likely automated, you need to think about how your automation tools will choose where to place virtual disks, especially if several servers can theoretically be deployed at the same time. A rough sketch of such selection logic is shown below.
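As an illustration, the selection logic can be as simple as counting VHD blobs per Storage Account and skipping the full ones. A rough sketch, assuming Standard accounts with all disks in the default "vhds" container; the resource group name is invented, and note that the shape of the object returned by Get-AzureRmStorageAccountKey has itself changed between AzureRM versions.

```powershell
# Pick the first Storage Account in a resource group that still has room for
# another Standard disk (40 disks x 500 IOPS = the 20,000 IOPS account limit).
$maxDisksPerAccount = 40

$target = Get-AzureRmStorageAccount -ResourceGroupName "RG-Storage" | Where-Object {
    # Newer AzureRM builds return a list of keys; older builds exposed .Key1/.Key2 instead.
    $key = (Get-AzureRmStorageAccountKey -ResourceGroupName $_.ResourceGroupName `
                -Name $_.StorageAccountName)[0].Value
    $ctx = New-AzureStorageContext -StorageAccountName $_.StorageAccountName -StorageAccountKey $key
    $vhdCount = @(Get-AzureStorageBlob -Container "vhds" -Context $ctx -ErrorAction SilentlyContinue |
                  Where-Object { $_.Name -like "*.vhd" }).Count
    $vhdCount -lt $maxDisksPerAccount
} | Select-Object -First 1

if (-not $target) { throw "No Storage Account with free IOPS headroom; create a new one." }
```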
Marketplace
It is funny, but one of the difficulties of working with Azure is its main advantage: the extensive Marketplace. Having a lot of services, with the list constantly expanding, is great. The problem is that with such diversity the developers are physically unable to test how their product interacts with every other one, and if you start using something immediately after release, you may well be the first to try it in the particular combination you need.
Here are some interesting examples:
- Immediately after its release, Azure Site Recovery (a service that protects your servers by replicating them to the cloud) required opening all traffic to the Internet on ports 80 and 443 in order to perform a server failover, because the address list for this service had not yet been added to the Azure whitelist (they fixed it very quickly, of course, but at the time it cost us some head-scratching).
- Many virtual machine features in Azure are tied to VM Extensions, for example encryption and backup. There are many operations that wipe the set of Extensions on a virtual machine, and they are quite common ones, such as redeploying a server from a VHD (the main way of solving many server problems and a mandatory step when moving servers between Resource Groups) or even restoring a server from Azure VM Backup. Despite this, there is no convenient tool for saving the list of these Extensions; you have to do it yourself (a sketch of one approach follows after this list).
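A minimal example of what "do it yourself" means here: dumping a VM's extension list to JSON before a destructive operation so the extensions can be re-registered afterwards. The names are illustrative, and the exact property names on the extension objects depend on the AzureRM module version.

```powershell
# Save the list of extensions of a VM before an operation that wipes them
# (e.g. redeploying from VHD), so they can be re-added afterwards.
$vm = Get-AzureRmVM -ResourceGroupName "RG-App" -Name "app-server-01"

$vm.Extensions |
    Select-Object Name, Publisher, VirtualMachineExtensionType, TypeHandlerVersion |
    ConvertTo-Json |
    Out-File ".\app-server-01-extensions.json"
```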
Conclusion
What idea did I want to convey in this article? Personally, I am far from thinking that clouds will completely replace on-premises infrastructure in the foreseeable future, but hoping that you will manage to hide from them is rather naive. And you do not need to! Working with Azure is very interesting. If you like to keep learning new things and following the release of new functionality, thinking about how to use it to improve your solutions, you will not be disappointed.
PS Those of you who work with Azure may notice that most of the problems described in this article are no longer relevant. Microsoft follows community feedback very actively and keeps refining its services (although the NSG story has not been fixed yet!).