
In our first article, as promised, we want to share our experience with a rarely discussed topic: the rapid deployment of hundreds of servers for a high-load project.
How do you deploy several hundred servers in a geographically remote data center with no physical access to the equipment? How does Badoo solve this problem?
We will show you with a concrete example.
Below we describe the very first stage of configuring server hardware and how we completed a specific task quickly and on time; it is not about writing optimal scripts. If this topic interests you, we will gladly follow up with an article on installing the OS and setting up the working environment, which has subtleties of its own.
So, we were tasked with deploying several hundred new servers that had just arrived at our two data centers.
If successful, we should get:
- a description of the entire hardware configuration of each server;
- confirmation that the received equipment matches the order (it happens that the delivered configuration differs from the ordered one);
- servers ready for installation of the OS and server software (with updated firmware for RAID controllers, ROM, etc., configured RAID arrays, and no hardware problems).
The information we had:
- the servers are mounted in racks/cabinets and powered on;
- the factory logins and passwords are known;
- every server has an IPMI (management) interface;
- the servers carry no presets from the manufacturer (in particular, no RAID configuration and no power settings);
- every server is connected to the network equipment by at least two interfaces and sits in a previously known VLAN;
- new servers receive their IP addresses dynamically and are therefore reachable as soon as they are switched on;
- it is known how many servers of which configuration should have been in the delivery.
Of course, there were additional difficulties. First, our engineers had no physical access to the servers. Second, the hardware in the delivery came in several different configurations. And third, all we really knew about our servers was their factory logins and passwords.
Tasks like this are most often solved with dd, a PXE server and rsync, or by filing tickets and involving data center staff. But we approached the problem differently.
The solution we found involves some automation. Please note that all the scripts below are for illustration only and do not claim to be perfect.
To complete the task we needed:
- several text files, deleted on completion;
- a few very simple scripts using expect;
- a configured network boot server (in our case, xCAT);
- a tuned and working OS image (any one will do, as long as it includes all the utilities we need);
- a configured hardware inventory system (in our case, the GLPI project).
First we needed to find out which IP addresses our new servers had obtained. To do this, we made a text file nodes with login and password data in the format
ILOHOSTNAME1 ILOPASSWORD1
ILOHOSTNAME2 ILOPASSWORD2
We collected this data with a barcode scanner from the labels present on each server. The default login, Administrator, was known in advance. A sticker example:
Now we could run a command that collected the hostname-to-IP matches for us:
for i in $(cat nodes | awk '{print $1}'); do j=$(cat nodes | grep $i | awk '{print $2}'); ssh DHCPD_SERVER_FQDN "sudo cat /var/log/messages | grep $i | tail -1 | sed 's/$/ '$j'/g'"; done
As a result, we got lines like the ones below and put them into the nodeswip file:
Jul 1 10:31:23 local@DHCPD_SERVER dhcpd: DHCPACK on 10.10.10.213 to 9c:8e:99:19:3a:68 (ILOUSE125NDBF) via 10.10.10.1 W3G554L7
Jul 1 10:31:35 local@DHCPD_SERVER dhcpd: DHCPACK on 10.10.10.210 to 9c:8e:99:19:b6:aa (ILOUSE125NDBA) via 10.10.10.1 BJCP691P
Jul 1 10:31:47 local@DHCPD_SERVER dhcpd: DHCPACK on 10.10.10.211 to 9c:8e:99:19:58:7c (ILOUSE125NDBG) via 10.10.10.1 67MG91SV
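As a side note, these log lines are easy to post-process. A minimal sketch, assuming exactly the syslog format shown above (the same field positions are relied on by the nodeswip processing in the next step):

```shell
# One raw DHCPACK line from nodeswip (sample taken from the log above):
line='Jul 1 10:31:23 local@DHCPD_SERVER dhcpd: DHCPACK on 10.10.10.213 to 9c:8e:99:19:3a:68 (ILOUSE125NDBF) via 10.10.10.1 W3G554L7'

# $11 is the hostname in parentheses, $8 the leased IP and $14 the
# password appended by the sed above; gsub strips the parentheses.
echo "$line" | awk '{gsub(/[()]/, "", $11); print $11, $8, $14}'
# prints: ILOUSE125NDBF 10.10.10.213 W3G554L7
```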
Now we needed to create a standard user with a defined set of rights on all IPMI interfaces of the new servers. We also needed to obtain the MAC addresses of the net and mgm interfaces for further bookkeeping. To do this we executed the command
for i in $(cat nodeswip | awk '{print $8}'); do j=$(grep $i nodeswip | awk '{print $14}'); expect expwip.sh $i $j | grep Port1NIC_MACAddress; done
where the expect script expwip.sh looked like this:
#!/usr/bin/expect
set timeout 600
set ip [lindex $argv 0]
set pass [lindex $argv 1]
spawn ssh Administrator@$ip
set answ "$pass"
set comm1 "create /map1/accounts1 username=deployer password=PASSWORD name=deployer group=admin,config,oemhp_vm,oemhp_power,oemhp_rc"
expect "Administrator@$ip's password:"
send "$answ\r"
expect "</>hpiLO->"
send "$comm1\r"
expect "</>hpiLO->"
send "show /system1/network1/Integrated_NICs\r"
expect "</>hpiLO->"
send "exit\r"
expect eof
The resulting list of MAC addresses of our servers' net interfaces was pasted into a spreadsheet, which let us see all the matches.
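Getting from the filtered expect output to a bare MAC list takes one line of sed. A sketch, assuming the iLO CLP prints lines of the form `Port1NIC_MACAddress=<mac>` (the exact formatting may differ between firmware versions):

```shell
# Sample line as printed by "show /system1/network1/Integrated_NICs"
# and filtered by the grep above (format is an assumption):
line='    Port1NIC_MACAddress=9c:8e:99:19:3a:68'

# Strip everything up to and including the key, leaving one MAC per line,
# ready to paste into a spreadsheet column.
echo "$line" | sed -n 's/.*Port1NIC_MACAddress=//p'
# prints: 9c:8e:99:19:3a:68
```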
Then we configured the DHCP server and sent the servers to netboot. After that we had to reboot the IPMI interfaces so that they would take their intended addresses. This was done with the command
expect reset_ilo.sh $i $j
where $i is the server address obtained earlier and $j is the factory administrator password.
The reset_ilo.sh script looked like this:
#!/usr/bin/expect
set timeout 600
set ip [lindex $argv 0]
set pass [lindex $argv 1]
spawn ssh Administrator@$ip
set answ "$pass"
set comm1 "reset /map1"
expect "Administrator@$ip's password:"
send "$answ\r"
expect "</>hpiLO->"
send "$comm1\r"
expect eof
Next we proceeded to the automatic creation of RAID arrays, updating every firmware we could on the hardware, and collecting comprehensive information about the servers' configuration in a convenient form. All these operations were performed during network boot.
First, an init script was launched that prepared the RAID array:
LD=`/usr/sbin/hpacucli ctrl slot=0 logicaldrive all show | awk '$0 ~ /RAID 5/ || /RAID 0/ || /RAID 1/ {print $1" "$2}'`
LD=${LD:-NULL}
if [ "$LD" != "NULL" ]; then /usr/sbin/hpacucli ctrl slot=0 $LD delete; fi
/usr/sbin/hpacucli ctrl slot=0 create type=ld drives=`/usr/sbin/hpacucli ctrl slot=0 physicaldrive all show | awk '$1 ~ /physicaldrive/ {split($2,arr,":"); print $2}' | tr "\n" "," | sed 's/,$//'` raid=1+0
if [ `/usr/sbin/hpacucli ctrl slot=0 physicaldrive all show | grep physicaldrive | wc -l` -gt 1 ]; then
    r=`/usr/sbin/hpacucli ctrl slot=0 physicaldrive all show | grep physicaldrive | wc -l`
    let t=$r%2
    if [ $t -ne 0 ]; then
        let tl=$r-1
        /usr/sbin/hpacucli ctrl slot=0 create type=ld drives=`/usr/sbin/hpacucli ctrl slot=0 physicaldrive all show | grep physicaldrive | head -$tl | awk '$1 ~ /physicaldrive/ {split($2,arr,":"); print $2}' | tr "\n" "," | sed 's/,$//'` raid=1+0
        /usr/sbin/hpacucli ctrl slot=0 array all add spares=`/usr/sbin/hpacucli ctrl slot=0 physicaldrive all show | grep physicaldrive | tail -1 | awk '$1 ~ /physicaldrive/ {split($2,arr,":"); print $2}' | tr "\n" "," | sed 's/,$//'`
    fi
fi
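The odd-drive handling in the script above boils down to simple arithmetic: with an even number of disks all of them go into the RAID 1+0 array, with an odd number one disk is left over as a hot spare. A standalone sketch of that decision (the helper name is ours, not part of the original script):

```shell
# plan_raid10 <drive count>: report how many drives go into the
# RAID 1+0 array and whether one remains as a hot spare.
plan_raid10() {
    r=$1
    if [ $((r % 2)) -ne 0 ]; then
        # odd count: all but the last drive form the array
        echo "array=$((r - 1)) spare=1"
    else
        # even count: every drive is mirrored and striped
        echo "array=$r spare=0"
    fi
}

plan_raid10 6   # prints: array=6 spare=0
plan_raid10 7   # prints: array=6 spare=1
```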
As a result, we got a 1+0 array or a mirror. Next, an agent was launched that sent information about the hardware to our inventory system. We use the FusionInventory agent, in whose settings we changed nothing except the address of the collection server. The result is visible in the FusionInventory interface:
The last thing to run was a script that updated all the firmware on the hardware. For this, several puppet classes were used that were applied to the new servers. Below is an example of a class that inspects the current server configuration and, if necessary, updates the RAID controller firmware to the required version. Other hardware firmware updates were performed in the same way.
class hp_raid_update_rom {
    exec { "updateraid":
        command => "wget -P /tmp/ http://WEBSERVER/install/soft/firmware/hp/raid/5_12/CP015960.scexe; wget -P /tmp/ http://WEBSERVER/install/soft/update_hp_raid_firmware_512.sh; chmod +x /tmp/CP015960.scexe; chmod +x /tmp/update_hp_raid_firmware_512.sh; /tmp/update_hp_raid_firmware_512.sh; echo '5.12' > /tmp/firmware_raid",
        onlyif  => "/usr/bin/test `/sbin/lspci | grep -i 'Hewlett-Packard Company Smart Array G6' | wc -l` != '0' && /usr/bin/test `/usr/sbin/hpacucli ctrl all show detail | grep -i firmware | awk {'print \$3'}` != '5.12' && ([ ! -f /tmp/firmware_raid ] || [ `cat /tmp/firmware_raid` != '5.12' ])",
        path    => "/usr/bin:/bin",
        require => Exec["remove_report_file", "remove_empty_report_file"],
    }
    exec { "remove_report_file":
        command => "/bin/rm /tmp/firmware_raid",
        onlyif  => "[ -f /tmp/firmware_raid ] && [ `cat /tmp/firmware_raid` == `/usr/sbin/hpacucli ctrl all show detail | grep -i firmware | awk {'print \$3'}` ]",
        path    => "/usr/bin:/bin",
    }
    exec { "remove_empty_report_file":
        command => "/bin/rm /tmp/firmware_raid",
        onlyif  => "[ -f /tmp/firmware_raid ] && [ `cat /tmp/firmware_raid | wc -l` == '0' ]",
        path    => "/usr/bin:/bin",
    }
}
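The onlyif guard of the updateraid resource reduces to one predicate: a Smart Array G6 is present, the controller reports a firmware version other than 5.12, and the marker file does not already record 5.12. The version part of that check can be sketched as a standalone shell function (the function name and version values are illustrative, not from the original manifest):

```shell
# needs_update <current> <target> <marker file>: succeed when the
# reported firmware differs from the target version and the marker
# file does not already record the target version.
needs_update() {
    current=$1; target=$2; marker=$3
    [ "$current" != "$target" ] && { [ ! -f "$marker" ] || [ "$(cat "$marker")" != "$target" ]; }
}

# Controller reports 3.66, target is 5.12, no marker file: flash it.
needs_update "3.66" "5.12" /tmp/firmware_raid && echo "flashing RAID firmware"
# Controller already reports the target version: nothing to do.
needs_update "5.12" "5.12" /tmp/firmware_raid || echo "already up to date"
```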
Thus we solved the task using only our own resources. All our machines were ready for installation of the production OS and software and for serving Badoo users.
Everything above covers only the preparatory stage of setting up the equipment; installing the OS and configuring the working environment are left outside the scope of this article. If this topic interests you, we will be happy to prepare material on xCAT and puppet and share our ways of solving specific problems with these tools.
Feel free to leave your suggestions, questions and remarks in the comments - we are always open to dialogue!
Badoo Company