
How we built cloud infrastructure in Azure

Case study: building a cloud for a large company


I have long wanted to tell you about how we built a cloud solution for one of our customers.


So, our Customer is a large international company with hundreds of offices around the world. Its main infrastructure is concentrated in two high-end data centers in Europe, and there are no complaints about them. But the local components in the regional offices are managed by a multitude of regional service providers, which creates a nightmare at the management level, both in resolving IT issues themselves and in controlling how the IT budget is spent. The Customer decided that moving most of the non-critical regional services to Microsoft Azure would let them save on servicing their IT infrastructure, concentrate financial control in the central office and, at the same time, carry out several modernization projects. We had already implemented a hybrid Exchange solution based on Office 365 for this Customer, with local components in the several countries where legislation required them, so they turned to us and Microsoft to design and implement a cloud platform to host approximately 3000 servers over 3 years.

All this happened at the end of 2015 and the beginning of 2016; by now the platform has been built and we have already migrated about 500 servers to it. Clouds are one of the most popular topics lately, and there is plenty of documentation and material describing what a particular service can do and how to use it. So instead we will talk about the other side of the clouds: the problems you can run into along the way when moving your on-premises infrastructure.

Update rate


While reading this article you may get the false impression that I am criticizing Azure. I am not. It is just that some of the problems stem from how actively and rapidly this cloud service develops. That is not even a peculiarity of Azure but a common trait of clouds. You cannot learn how to do something once and then use it unchanged for years; you will have to keep learning and developing constantly, and the solutions you sell to customers have to keep evolving too. It is hard to blame Microsoft for that, but it does create considerable difficulties.
In addition to the new services described below, PowerShell is a very vivid example. Working with the cloud involves a high degree of automation, and the larger your environment, the more relevant that is. Besides, some operations cannot be done through the portal GUI at all. Updates to Azure PowerShell come out almost every quarter, and very often they significantly expand or change the functionality of cmdlets (new parameters are added, existing ones change, the types of returned objects change, and so on). This means you need to constantly follow the news, check that your scripts still work after each update, and see whether there is now a way to do something more simply or better.
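As a small illustration of what that housekeeping looks like in practice, here is a minimal sketch (the module name and version number are just an example; at the time we were working with the AzureRM module) that warns when the installed Azure PowerShell version differs from the one the scripts were last tested against:

```powershell
# Minimal sketch: warn if the installed Azure PowerShell module differs from
# the version the automation scripts were last tested against.
# "AzureRM" and the version number are examples, not a recommendation.
$testedVersion = [Version]"3.8.0"

$installed = Get-Module -ListAvailable -Name AzureRM |
    Sort-Object Version -Descending |
    Select-Object -First 1

if (-not $installed) {
    Write-Warning "AzureRM module is not installed."
}
elseif ($installed.Version -ne $testedVersion) {
    Write-Warning "Installed AzureRM $($installed.Version) differs from tested $testedVersion; re-check the scripts before running them."
}
```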

We had a funny story connected with one of these PowerShell updates. One of our engineers wrote a rather large piece of code to add missing functionality to a cmdlet for working with virtual disks. At the beginning of the following week an update came out in which that very cmdlet received a new parameter that did exactly the same thing. It was nice (our view of what was missing coincided with the vendor's) and a little sad for the time spent.

Azure Versions


The reason for many of the difficulties at the end of 2015 was that Microsoft Azure was actively migrating from the Classic model (Azure Service Management) to ARM (Azure Resource Manager). In a sense this can be called a transition from version 1 to version 2. Since all Azure innovations are primarily focused on ARM, the Customer wanted ARM components to be used everywhere except in cases of exceptional need. That is not to mention the fact that only ARM makes it possible to properly configure access rights to the various components in Azure in accordance with the standards for IT service delivery. The problem was that at the time ARM lacked part of the functionality that was already available in Classic. In addition, ARM and Classic components could work together far from always and not completely.
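To illustrate the point about access rights: in the ARM model, role-based access control lets you grant rights at the level of an individual resource group, which Classic had no real equivalent for. A hedged sketch, with the object ID, subscription and resource group names made up:

```powershell
# Sketch: grant an AD group "Reader" rights on a single resource group only.
# The group object ID, subscription ID and resource group name are placeholders.
New-AzureRmRoleAssignment `
    -ObjectId "00000000-0000-0000-0000-000000000000" `
    -RoleDefinitionName "Reader" `
    -Scope "/subscriptions/<subscription-id>/resourceGroups/RG-RegionalApps"
```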

This may seem insignificant; after all, different versions of server operating systems also differ in functionality, and that is normal. The difference here was that cloud services develop much faster, and the architects working on this project from our side were used to reasoning about Azure solutions in terms of the Classic functionality, assuming that the new versions of the components could do at least everything the old ones could. As it turned out, Microsoft's own architects ran into the same difficulty.

Network


Expanding the customer's IT infrastructure to the cloud begins with the network. Your first task is to create networks in the cloud and link them to your existing infrastructure.


This is where the first surprise awaited us. It turned out that the virtual network topology proposed by the Microsoft architect at the initial stage of the project was based on the idea that one Azure Virtual Network could have two Virtual Network Gateways: one for the ExpressRoute connection, the other for a VNet-to-VNet VPN. The idea was to provide additional isolation of the customer's internal networks from DMZ traffic.

As it turned out, ARM did not allow this. On the fly, we had to switch to connecting all the VNets to a single ExpressRoute circuit to get routing between them, and to User Defined Routing to ensure security.
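For context, User Defined Routing in ARM boils down to creating a route table, adding routes that push traffic through a virtual appliance, and attaching the table to a subnet. A rough sketch, with all names, prefixes and the appliance address invented for the example:

```powershell
# Sketch: force all traffic from a subnet through a virtual firewall appliance.
# Resource names, address prefixes and the appliance IP are placeholders.
$rt = New-AzureRmRouteTable -Name "rt-dmz" -ResourceGroupName "RG-Network" -Location "West Europe"

Add-AzureRmRouteConfig -Name "default-via-fw" `
    -AddressPrefix "0.0.0.0/0" `
    -NextHopType VirtualAppliance `
    -NextHopIpAddress "10.0.100.4" `
    -RouteTable $rt | Set-AzureRmRouteTable

# Attach the route table to an existing subnet.
$vnet = Get-AzureRmVirtualNetwork -Name "vnet-dmz" -ResourceGroupName "RG-Network"
Set-AzureRmVirtualNetworkSubnetConfig -VirtualNetwork $vnet -Name "subnet-app" `
    -AddressPrefix "10.0.1.0/24" -RouteTable $rt
Set-AzureRmVirtualNetwork -VirtualNetwork $vnet
```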

Another unpleasant feature of working with a virtual network was the limit on the number of rules in one Network Security Group (NSG). Here it is worth noting a few technical aspects of networking in Azure, each of which on its own was only a small inconvenience, but together they became a problem:


As a result, we not only had to prepare scripts to update our NSG rules automatically (we could hardly rewrite them by hand every week), but, worse, we did not have that many rules left to use for their intended purpose: controlling the traffic between our networks.
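Such an update script is not complicated in itself; a hedged sketch of its core, with the NSG, rule name, priority and prefix all made up, looks roughly like this:

```powershell
# Sketch: rewrite one existing NSG rule from a script instead of editing it by hand.
# NSG name, resource group, rule name, priority and the prefix are examples.
$nsg = Get-AzureRmNetworkSecurityGroup -Name "nsg-frontend" -ResourceGroupName "RG-Network"

Set-AzureRmNetworkSecurityRuleConfig -NetworkSecurityGroup $nsg `
    -Name "allow-azure-range-01" `
    -Priority 210 -Direction Outbound -Access Allow -Protocol "*" `
    -SourceAddressPrefix "VirtualNetwork" -SourcePortRange "*" `
    -DestinationAddressPrefix "13.65.0.0/16" -DestinationPortRange "443"

# Nothing changes in Azure until the modified object is pushed back.
Set-AzureRmNetworkSecurityGroup -NetworkSecurityGroup $nsg
```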

Fortunately, this problem should soon be a thing of the past: Microsoft has announced changes to NSGs that will allow more flexible work with the rules.

Restrictions


Since we have touched on the issue of quotas in Azure (500 rules per NSG), it is worth noting that quotas themselves are a headache on a large project. The set of services in Azure is constantly expanding and, logically, each of them has its own limits. The problem is that there is no single tool that lets you see all the restrictions in one place. You have to rely on a hodgepodge of individual commands that gather the information for you, plus several web pages listing the current limits. This, of course, is not very convenient, especially when some unexpected limit surfaces that you had not thought about in advance.
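Some of those individual commands do exist, at least for the most common quotas. A small sketch, with the caveat that the exact set of cmdlets depends on your module version and each one only covers its own resource provider:

```powershell
# Compute quotas (cores, availability sets, etc.) and current usage for one region.
Get-AzureRmVMUsage -Location "West Europe"

# Number of storage accounts used in the subscription vs. the allowed maximum.
Get-AzureRmStorageUsage
```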

Data storage


One example of a rather tricky quota that far from everyone thinks about is the performance of a Storage Account. On Standard Storage, a VHD disk for most virtual machine sizes has a maximum performance of 500 IOPS, while the Storage Account as a whole is limited to 20,000 IOPS. At the same time, the maximum disk size is 1023 GB and the maximum capacity of a Storage Account is 500 TB. Do you see the catch yet? 20,000 / 500 = 40, so as soon as you place the 41st disk in a single Storage Account, you can theoretically end up in a situation where, with all disks under maximum load, their performance starts to be artificially limited. At that point you have not used even 10% of the maximum capacity (41 disks of 1023 GB is roughly 41 TB out of 500 TB), and each subsequent disk only makes the situation worse.

The most annoying thing is that the system will not warn you about this in any way. You will only catch it if you think about such things in advance and either avoid placing more than 40 disks on one Storage Account, or monitor throttling on its side and, when it kicks in, move the actively used disks elsewhere.

Given that your server deployment is most likely automated, you need to think about how your automation tools will choose where to place virtual disks, especially if several servers can theoretically be deployed at the same time.
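Here is a hedged sketch of such a placement check, assuming unmanaged VHDs living in a "vhds" container and a self-imposed cap of 40 disks per Storage Account; all names are placeholders, and the key-retrieval details have changed between module versions:

```powershell
# Sketch: pick the first storage account in a resource group that still has
# room under a self-imposed cap of 40 VHDs. Names are placeholders; the object
# returned by Get-AzureRmStorageAccountKey has changed shape between module
# versions, so the key lookup may need adjusting.
$maxDisksPerAccount = 40
$resourceGroup = "RG-Storage"

$target = $null
foreach ($sa in Get-AzureRmStorageAccount -ResourceGroupName $resourceGroup) {
    $key = (Get-AzureRmStorageAccountKey -ResourceGroupName $resourceGroup `
                -Name $sa.StorageAccountName)[0].Value
    $ctx = New-AzureStorageContext -StorageAccountName $sa.StorageAccountName `
                -StorageAccountKey $key
    $vhdCount = @(Get-AzureStorageBlob -Container "vhds" -Context $ctx |
                  Where-Object { $_.Name -like "*.vhd" }).Count
    if ($vhdCount -lt $maxDisksPerAccount) { $target = $sa; break }
}

if ($target) { "Place the new disk in $($target.StorageAccountName)" }
else         { Write-Warning "All storage accounts are at the 40-disk cap." }
```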

Marketplace


Funnily enough, one of the difficulties of working with Azure is also its main advantage: the extensive Marketplace. Having a lot of services, with the list constantly expanding, is great. The problem is that with such diversity the developers are physically unable to test their product's interaction with every other one, and if you start using something immediately after release, you may well be the first person to try it in the particular combination you need.


Here are some interesting examples:


Conclusion


What thought did I want to convey with this article? Personally, I am far from believing that clouds will completely replace on-premises infrastructure in the foreseeable future, but hoping that you will manage to hide from them is rather silly. And there is no need to! Working with Azure is very interesting. If you enjoy constantly learning new things and following the release of new functionality, thinking about how you could use it to improve your solutions, you will not be disappointed.

P.S. Those of you who work with Azure may notice that most of the problems described in this article are no longer relevant. Microsoft follows community feedback very actively and keeps refining its services (although the NSG story has not been fixed yet!).

Source: https://habr.com/ru/post/321838/

