
I bring to your attention an interview with Sergey Kuznetsov, the head of the IT department at Intel in Moscow and St. Petersburg. Sergey shared many interesting details about his work and about the company's infrastructure as a whole; the conversation turned out to be quite wide-ranging and informative.
- Many of the “old-timers” of the Intel IT Galaxy know you from the “server room”, but they, and not only the new members of the community, will be interested to learn about your new job and position. Please tell us a little about your unit.
- Intel provides for a certain career growth and rotation for all its employees. In particular, I used to support the laboratory in the Moscow office and was the technical manager of the innovation center. Currently, I am the head of the IT department in the Moscow and St. Petersburg offices of Intel. The responsibilities of this department include coordinating the corporation's global programs on these local sites and their day-to-day operation. My responsibilities also include planning IT resources, developing the sites in accordance with corporate strategy and the needs of local business units, increasing the efficiency of server equipment on the sites, studying the usage patterns of under-utilized systems and the possibility of consolidating services on remote sites, as well as ways to use innovative products aimed at improving equipment utilization.
- The range of tasks is impressive. How is the work going in your new position, and what can you say about ways to increase the efficiency of your work?
- After moving to the position of head of the IT department, I certainly felt the change in the load and in the range of tasks that have to be addressed. If you are responsible for a single service, your scope of responsibility is relatively small and limited to that service. As soon as you start managing a group of people and a group of services, it turns out that you are responsible not only for the performance of the department itself, but also for planning its activities, for the reliability of the services the department provides, for the efficiency of the people who support those services, and for many of the operational issues mentioned above. Along with technical issues, you have to solve many diplomatic tasks. Compared with the duties of a technical specialist, the duties of a manager are more diverse, and there is a lot more work.
As for the effectiveness of my own work: when I was responsible for a particular service or for supporting a group of users, it was enough to carry out my daily duties. As soon as you become responsible for a serious area of work, for a branch, personal efficiency becomes scarce. You have to keep several notebooks noting the projects on the site, and it is very hard to keep track of your personal calendar, current tasks and requests from various groups of users of our services so that not a single request goes unheeded. Here, more than ever, what matters is discipline in recording every task that arises.
- Is the load heavy, is there enough time for everything? Do you and your employees have to work at night?
- I don't think I will reveal a big secret by saying that employees of the IT department at Intel, as at many other companies, work irregular schedules. If some service goes down, of course, we do not leave users in trouble: we must keep the services we provide working at any time of the day or night. To cope with such situations faster, we have disaster recovery plans that guide us regardless of when the incident occurred. At the same time, Intel has a notion of balance between employees' work and personal life, and we try to rest well. For example, we organize team events. And if an employee spent the second half of the day restoring an important service, then, in agreement with his management, he may take the next day off to rest and restore the balance between work and his own affairs.
- Let's talk about the structure of Intel's server park, about those servers that are in your department's area of responsibility.
- Unlike many large corporations that manufacture and sell on the consumer market, Intel is an R&D (Research & Development) company. The basis of our business is not only production, but also research and development. Accordingly, in Intel's server park the number of machines intended for research, development and various computations significantly exceeds the number of infrastructure servers that support the business and the corporate IT environment. We have three key server segments: global servers, responsible for the global infrastructure; local infrastructure servers that let users work in individual branches; and servers intended for research and development activities. The latter are in turn divided into two categories: servers for computation, the so-called compute servers, and servers for interactive work and for measuring the performance of various applications, the performance servers. There are also servers related to production, but there are no factories in Russia, so let's leave them aside. Global functions such as e-mail, Internet, IM services, the SharePoint infrastructure, Project Server, SAP maintenance and business process support are all assigned to global servers. Then, at each local site, we have groups that need servers to support their research activities: version control and software quality control systems, local database servers, and servers for local Web applications and service systems.
A few years ago, Intel organized a working group to develop strategies and solutions for optimizing server resources, based on information about the data centers that exist on all sites and their usage patterns. Small data centers were consolidated into larger ones, and small sites were able to work with them over the data network. It should be borne in mind that for small representative offices, where there are only sales and marketing employees and no large research groups, separate data centers are not created. Thus, over the past few years we have moved from a focus on consolidating data centers to a strategy focused on assessing and optimizing the value of a local data center for the business.
In Russia, for example, one data center was organized on each of the sites. We cannot do without them, because in each of the Intel branches in Russia (the research branches; we are not talking about sales offices) serious development work is conducted that requires a large amount of work with locally located servers: interactive servers, compute servers and servers for measuring software performance. However, at the moment IT is actively studying the requirements and possibilities of working with remotely located research and development servers. In particular, several groups in Russia are already using computational resources located at remote sites.
- Members of the Intel IT Galaxy community already know that the winners of the 3 Days with IT @ Intel contest will have the opportunity to see the Intel data center in Nizhny Novgorod. What will they see there? Is it really a serious, modern data center?
- The strategy for the development of data centers at Intel implies that each data center is organized and equipped with the latest technology, and serious investments are made in its creation. Therefore, any of our data centers is a serious facility, including industrial-scale ventilation and infrastructure for uninterrupted power supply (based both on batteries and on diesel generators, which let the site ride out an outage for as long as needed). We always evaluate the location where a data center is planned to be placed in terms of how efficiently power and cooling systems can be implemented there.
The state of the systems located in the data center is constantly monitored, and the temperature is kept at a level that is optimal in terms of energy consumption (not so low that extra money is spent on cooling, but completely safe for the operation of servers and other equipment). By the way, the temperature is held within a fairly narrow band, which in itself is evidence of a competently organized data center; in less efficiently organized data centers it is impossible to maintain such a narrow range of temperature variation. I want to emphasize the use of the hot and cold aisle approach, in which the racks are oriented according to the direction in which air is drawn away from them and the air flow is organized so that cooling is as efficient as possible. Many factors are taken into account: the physical distribution of servers in the racks is optimized in terms of electricity consumption, the load on the phases, and the volume occupied in the room. For example, for safety, heavy servers are not placed at the top of a rack with the bottom left unloaded; otherwise the rack would be unbalanced and could even tip over (say, in an earthquake) if its mounting fixtures were damaged.
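To illustrate the kind of narrow temperature band mentioned above, here is a minimal monitoring sketch in Python. The band limits, sensor names and readings are purely illustrative assumptions; the interview does not name Intel's actual thresholds or monitoring tools.

```python
# Minimal sketch of a cold-aisle (server inlet) temperature check.
# The allowed band and the sensor readings are illustrative assumptions only.

ALLOWED_INLET_BAND_C = (22.0, 27.0)  # hypothetical "narrow" operating band, in Celsius

def out_of_band(readings, band=ALLOWED_INLET_BAND_C):
    """Return the sensors whose inlet temperature falls outside the allowed band."""
    low, high = band
    return {sensor: t for sensor, t in readings.items() if not low <= t <= high}

# Hypothetical inlet-temperature readings from several racks.
readings = {"rack01-inlet": 23.4, "rack02-inlet": 28.1, "rack03-inlet": 21.0}

for sensor, temp in sorted(out_of_band(readings).items()):
    print(f"ALERT: {sensor} at {temp:.1f} C is outside {ALLOWED_INLET_BAND_C}")
```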
Our data centers always provide a high degree of security. Access to them is, of course, carefully controlled, and employees undergo a series of security trainings covering, on the one hand, the data center equipment and the data stored in it, which is the company's intellectual property, and, on the other, the physical safety of the data center personnel and the safety of all ongoing work.
- How many servers are approximately in your area of responsibility? Are there clusters?
- The number of servers at each site is determined by the needs of the local business. Usually it is a fairly small number: a few dozen infrastructure servers that keep the business functioning and are responsible for various IT functions. In addition, depending on the number of engineering and research groups on the site and the nature of the tasks they solve, there is a significant number of research servers; their number can reach several hundred or even thousands.
The server hardware available in Russia is used, among other things, for resource-intensive computing. Of course, to increase the efficiency of use, computers are combined into clusters, including clusters based on blade systems. If necessary, such solutions can be deployed within each branch, but recently there has been a tendency to use remote resources. We have, for example, built a very powerful computing cluster in Nizhny Novgorod, and for some resource-intensive batch computations it is preferable to work with it. Because we try to load this large computational pool with batch jobs from other sites, we are able to achieve fairly high utilization of the resources located there.
But geographical consolidation of resources does not eliminate the need for local data centers, because WAN latency is still too high for remote execution of interactive applications. Remote servers are much harder to use for interactive research work: users feel discomfort already at delays of 100 ms or more. However, the amount of local work does not always allow the servers' capacity to be used as efficiently as possible, so for interactive servers in the laboratories we are currently implementing measures to increase their energy efficiency, such as automatically shutting down unused machines at night and consolidating low-powered servers.
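As a rough sketch of the placement logic described in the last two paragraphs (batch jobs routed to the large remote compute pool, interactive work kept local once WAN latency becomes uncomfortable), the Python fragment below is illustrative only: the 100 ms figure comes from the interview, while the function and its names are assumptions.

```python
# Sketch of the placement rule: batch jobs can go to a large remote compute
# pool, while interactive work stays local once WAN latency exceeds ~100 ms.
# The function name and threshold constant are illustrative assumptions.

INTERACTIVE_LATENCY_LIMIT_MS = 100  # discomfort threshold mentioned in the interview

def choose_site(job_kind: str, wan_latency_ms: float) -> str:
    """Return 'remote-cluster' or 'local-lab' for a given job."""
    if job_kind == "batch":
        # Batch computing is latency-insensitive, so it helps load the
        # large remote pool and raise its utilization.
        return "remote-cluster"
    if wan_latency_ms >= INTERACTIVE_LATENCY_LIMIT_MS:
        # Interactive work over a high-latency WAN link is uncomfortable,
        # so it stays on local laboratory servers.
        return "local-lab"
    return "remote-cluster"

print(choose_site("batch", wan_latency_ms=140))        # remote-cluster
print(choose_site("interactive", wan_latency_ms=140))  # local-lab
```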
- How often is the server park updated, and how exactly does this happen? How does the introduction of new platforms and technologies affect the number of servers?
- Intel follows a strategy built around a four-year server life cycle. In the first quarter of its life, new equipment is installed in the data center and current services are migrated to it. Then comes normal server operation. Somewhere towards the end of the third year of the server's life, planning for its decommissioning begins. In the last quarter of the fourth year of the life cycle, new equipment is installed to replace the old systems, and migration of services is planned.
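The four-year cycle just described can be summarized as a small milestone calculator. This is only a sketch derived from the cycle as stated (install and migrate in the first quarter, plan decommissioning from the end of year three, install replacements in the last quarter of year four); the helper names and the sample date are illustrative.

```python
# Sketch of the four-year server life cycle described above.
# Dates follow the cycle as stated in the interview; the helper is illustrative
# and keeps the day of month as-is, so mid-month install dates are assumed.

from datetime import date

def lifecycle_milestones(install: date) -> dict:
    """Rough milestones of the four-year cycle, relative to the install date."""
    def shift(d: date, years: int, months: int = 0) -> date:
        extra_years, month_index = divmod((d.month - 1) + months, 12)
        return d.replace(year=d.year + years + extra_years, month=month_index + 1)
    return {
        "installed_and_migrated": install,               # first quarter of life
        "decommission_planning":  shift(install, 3),     # around the end of year 3
        "replacement_installed":  shift(install, 3, 9),  # last quarter of year 4
        "retired":                shift(install, 4),     # end of the cycle
    }

for name, when in lifecycle_milestones(date(2010, 1, 15)).items():
    print(f"{name:>24}: {when}")
```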
One of the interesting points is how we select the architectures and servers to be used. Every year brings new technologies, new models of server hardware and new brands. Every year the owners of services for particular projects study the new platforms and equipment from different manufacturers and conduct comparative tests. As a result, specific models and server configurations are selected and approved as the "corporate platform" for organizing various services. In other words, the selected optimal configurations become the recommended ones for purchase and for deployment of the relevant services throughout the year. A year later the procedure is repeated.
As for improving the efficiency of the company's computing equipment, this is another interesting topic. Work here goes in two directions. First, we reduce the number of servers by increasing their computing power. If our servers are busy with serious computational work, then we need computing power adequate to those tasks, and the fewer physical servers perform the same amount of work, the better for energy consumption, cooling and the amount of space they occupy. We are now actively buying and deploying servers with the new Intel Xeon processors, which are very effective in terms of consolidating computing resources, and they replace four-year-old servers at a ratio of about 1:10.

Secondly, the question arises of what to do with infrastructure servers. The fact is that most infrastructure servers do not use the full power of modern processors. These are, for example, file servers, which are usually limited by I/O and work actively with disk arrays, or hosting servers that are not used very intensively.
The company, of course, seeks to improve the efficiency of using such equipment, and virtualization is used for this: we take one powerful machine based on the new Intel Xeon processors and bring up several virtual servers on it. Moreover, if we take a second identical machine, deploy similar virtual servers on it and combine all this into a cluster, we get a fault-tolerant system. Even if one physical system fails, the virtual servers continue to work; and if a virtual machine goes down, we can easily either restore the infrastructure server from an image we keep or transfer its functions to another virtual server. In addition to virtualization, services are consolidated onto one machine. If some service is used by only one working group and not very intensively, we poll the other working groups and, for example, load that hosting web server with services from other groups. If the services do not conflict with each other, we simply install additional services on a single physical server without virtualization, in order to increase processor utilization where appropriate. There are certain infrastructure servers that we are not virtualizing yet, because for them this technology is still being tested, and for now it has been decided to keep some services on physical servers for reasons of efficiency.
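As a rough illustration of the consolidation logic just described (many lightly loaded infrastructure servers packed onto a few powerful virtualization hosts, broadly in line with the roughly 1:10 replacement ratio mentioned earlier), here is a minimal first-fit sketch. The server names, load figures and packing strategy are assumptions for demonstration, not Intel's actual data or tooling.

```python
# First-fit sketch of consolidating lightly loaded infrastructure servers onto
# a small number of powerful virtualization hosts. CPU demand figures and the
# host capacity are illustrative assumptions only.

def consolidate(demands, host_capacity):
    """Pack servers (name -> CPU demand, as a fraction of one new host) onto hosts."""
    hosts = []  # list of (free capacity, assigned server names)
    for name, demand in sorted(demands.items(), key=lambda kv: -kv[1]):
        for i, (free, assigned) in enumerate(hosts):
            if demand <= free:
                hosts[i] = (free - demand, assigned + [name])
                break
        else:
            hosts.append((host_capacity - demand, [name]))
    return [assigned for _, assigned in hosts]

# Ten old, lightly loaded infrastructure servers, each using a small share
# of what one new Xeon-based host could provide.
old_servers = {f"infra-{i:02d}": load for i, load in
               enumerate([0.10, 0.05, 0.15, 0.08, 0.12, 0.07, 0.09, 0.06, 0.11, 0.04])}

plan = consolidate(old_servers, host_capacity=1.0)
print(f"{len(old_servers)} old servers -> {len(plan)} new host(s)")
for i, group in enumerate(plan, 1):
    print(f"host {i}: {', '.join(group)}")
```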
The infrastructure we support has to be maximally stable, compatible with current applications and reliable in use. Special teams review the available solutions, test them for compatibility with existing infrastructure services, and approve particular models and configurations as the recommended corporate platform for particular services.
- Is virtualization used on the computing servers?
- When we talk about the infrastructure, we keep in mind that the company's business, the work of the offices and the productivity of our programmers and developers depend on the infrastructure servers. The laboratories, in most cases, serve as a kind of testing ground both for IT and for the developers.
And, of course, we try the most advanced solutions first of all in our laboratories, where we are happy to use the latest research in the field of virtualization. For batch processing within a single branch, virtualization is not very interesting: a certain amount of resources is spent just on keeping the cloud itself alive, whereas without virtualization we can use all the computing power to process the data. However, virtualization is very useful for quickly emulating any infrastructure, and it helps a lot to combine a large number of geographically distributed machines into a single cloud.
- The process of updating the server park is also connected with energy efficiency and energy conservation. Apparently the transition to new platforms is noticeable?
- Our infrastructure services include processes that differ in the computing capacity they need. For processes with high resource consumption we use high-performance servers based on the top processors of the latest line and use them to consolidate less efficient servers; as a result we get the maximum performance per unit of data center area. For less resource-intensive infrastructure applications we consider buying machines with the most energy-saving technologies in the Nehalem lineup, which give the best performance per watt of energy consumed. In addition, we are now exploring not only ways to increase equipment utilization, but also the automatic powering off of servers that, due to the specifics of our business processes, are not loaded with work at night. To this end, we actively use the remote access and management technologies provided by server manufacturers.
- Does Intel use so-called "data centers in containers"?
- We have conducted a study of "data centers in containers", including the question of how well such solutions are suited to working in a real corporate environment with our real infrastructure. A number of tests were carried out, but their final results have not yet been reported. However, there is no doubt that this technology is very interesting and can be useful when a data center has to be deployed urgently in a new location or during natural disasters. In particular, the container data center our engineers used to evaluate its effectiveness was sent to the Emergency Control Center in Haiti as humanitarian aid.
- We have touched on the topic of disasters... Unfortunately, even major accidents are becoming more and more likely. Tell us how important data and resources are backed up at Intel.
- I don't know of plans to use container data centers in case our data centers in Russia fail in some disaster. But if we talk about more realistic scenarios, about redundancy for servers, services and data center systems, then we certainly have such capabilities. We have backup power supply and ventilation systems. Each infrastructure server and service has an owner who regularly prepares and refines plans for recovery and for mitigating the consequences of accidents, in which he spells out what to do if his server or service fails. All server equipment in Intel data centers is serviced by the manufacturers: we buy servers together with a service contract, which allows us to be calm about the functioning of the hardware. As for troubles that may be caused by various conflicts within the systems, the company takes the backing up of services and data very seriously. We have recovery schemes that are well established and proven by years of experience with various incidents; everyone knows what to do and in which cases, and training is held regularly.
- The more computing power there is, the more data is processed, and it needs to be stored somewhere. Can you tell us something about data storage at Intel?
- Yes, this is one of the important components of any data center. Best practice says that a file array should be a separate specialized device whose main function is data storage. Such storage systems, of course, exist in each of our data centers. De facto these are large RAID arrays configured into specialized devices: disk shelves connected by high-speed optical data channels to a controller. The controller is equipped with network interfaces of considerable bandwidth and has enough processing power to handle a very large number of file requests. If ordinary servers packed with disks were used here instead of specialized devices, we would never achieve the same high request-processing speed and the same level of reliability. However, specialized industrial-grade solutions are not only very high-performance, but also very expensive, and they have a closed architecture. In essence, such a system is a conglomerate of hardware, architecture, interfaces and its own specialized operating system; the tight coupling of these components ensures high performance and reliability, but the price is correspondingly high. In practice, of course, we would like to reduce the cost of data storage. For example, our research groups often request additional space to store their information, but because of the high cost of solutions of this class we cannot always satisfy their needs: we have to weigh needs against available budgets. For storing less important data we can afford to use ordinary servers with disk arrays. Naturally, we are interested in analyzing the possibility of using different storage systems, including ones built on the new Intel Xeon C5500 and C3500 processors.
Backup is also an interesting question. It uses special hardware and software systems that can manage more than a hundred tapes. The company has a policy of off-site backup storage, so even in the event of an accident at one of the branches with a complete failure of its data center, the information can be restored from a backup kept in external storage. Yes, the amount of information is growing, and we cannot cope with it simply by increasing the capacity and the number of tapes, so we strive to optimize the set of data that gets backed up. For example, in agreement with the owners of the information and as a result of an IT risk analysis, we may not back up intermediate results of computations and other process data whose lifetime is measured in a few days. Such a decision saves significant money on the volume of data backed up.
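To make the backup-scoping idea above concrete, here is a small illustrative sketch: data whose agreed lifetime is only a few days is left out of the backup set. The category names, lifetimes, sizes and the seven-day cut-off are assumptions for illustration, not actual Intel policy values.

```python
# Sketch of backup scoping: datasets whose agreed lifetime is only a few days
# (intermediate computation results, scratch files) are excluded from backups.
# All names, lifetimes and sizes below are illustrative assumptions.

MIN_LIFETIME_DAYS_TO_BACKUP = 7  # assumed cut-off, not an actual policy value

datasets = [
    {"name": "design-repo",        "lifetime_days": 3650, "size_gb": 400},
    {"name": "build-artifacts",    "lifetime_days": 3,    "size_gb": 900},
    {"name": "simulation-scratch", "lifetime_days": 2,    "size_gb": 1500},
    {"name": "team-documents",     "lifetime_days": 1825, "size_gb": 120},
]

to_backup = [d for d in datasets if d["lifetime_days"] >= MIN_LIFETIME_DAYS_TO_BACKUP]
skipped = [d for d in datasets if d["lifetime_days"] < MIN_LIFETIME_DAYS_TO_BACKUP]

print("backed up:", [d["name"] for d in to_backup])
print("skipped  :", [d["name"] for d in skipped])
print("tape volume saved (GB):", sum(d["size_gb"] for d in skipped))
```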
- With such volumes of data moving between servers and storage, does the network become a bottleneck?
- Of course, the problem exists, and there are ways to solve it. For example, for quite a long time now we have used blade systems for serious cluster computing; they remove the bottleneck in server-to-server connections by having their own data exchange interface between the servers. Also, all the servers we purchase have two gigabit ports by default, and I can say that even with active use of the servers we will not soon create a data stream powerful enough to saturate two gigabit ports on one server. As for the storage systems, as we said, they use specialized solutions with much higher bandwidth.
- Thank you for the informative conversation!
Well, that's all :) Soon we will post something else interesting. In the meantime, while the heat lasts, I urge everyone to take part in Intel's "summer" contest, with summery pleasant prizes. Good luck!