
Chef surprises, or the story of an investigation

Not so long ago at Acronis we switched to provisioning part of our virtual machines with Chef. Nothing particularly novel: all virtual machines are created on ESXi, and the central chef-server distributes recipes to them, automatically setting up their environments based on their roles. This system worked without problems or failures for quite a long time. It freed us from a lot of manual work, from constantly monitoring each machine's environment, and from having to remember which software and settings each one has: it is enough to open the chef-server web console, select the node we need and see all its roles and settings.

Everything was fine until we were given the task of moving one site from external hosting to our own servers, which eventually turned into a bug hunt and an investigation in the style of Scooby-Doo.

If you are interested, welcome under the cut.

First of all, as always, I asked our IT service to set up an empty Debian virtual machine on ESXi. Standard settings, nothing extra. This had been done many times before and the process was well debugged, so I did not expect any surprises. Having received the IP and credentials for access, I ran the standard knife bootstrap command to install a chef-client on it and connect it to the central chef-server. Everything went smoothly, and I went into the web interface to assign all the necessary roles and set attributes for the new node.
Here it should be noted that during installation chef walks over the entire virtual machine and collects all the data it needs about the system: hardware parameters, OS settings, and so on. It then sends this data to the chef-server, where it can be seen in the node's settings. Usually there is quite a lot of it and I rarely look at it, but this time my attention was caught by the fact that the new node somehow had very little. Comparing its attributes with the attributes of other machines only confirmed this. But the strangest thing was that the last displayed attribute of the new node was called gce and was absent on the rest of the machines. Moreover, when expanded, it showed nothing at all.
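By the way, you do not even need a chef-server to see what this collection produces. Here is a minimal sketch, assuming only that the ohai gem is installed, of dumping locally the same data that chef-client reports to the server; the top-level keys are the attribute names shown in the web console:

    # Run the same detection plugins that chef-client runs during a converge
    # and print the top-level attribute names (the ones visible in the console).
    require 'ohai'

    ohai = Ohai::System.new
    ohai.all_plugins               # hardware, OS, network, cloud detection, ...
    puts ohai.data.keys.sort       # on the problem node this list also contained "gce"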

[screenshot: the new node's attribute list in the chef-server console, with the empty gce attribute]

Since the virtual machine was meant to go straight into production, this situation did not sit well with me, and I decided to dig into what was wrong with it. First of all, I looked into the browser console and saw a strange error there: the browser refused to load an iframe pointing at www2.searchnut.com/?subid=704003&domain= on a page served over https (the chef-server web console only opens via https). Searching for that address, I found a bunch of results saying it is some kind of malware, complete with offers to remove it. Looking into the source code of the page, I saw that in the place where the attribute value is usually displayed, six copies of the same HTML code were output:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <html>
    <head>
      <title>metadata.google.internal [6]</title>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <meta http-equiv="pragma" content="no-cache">
      <style>
        body { margin:0; padding:0; }
      </style>
    </head>
    <body scroll="no">
      <div style="position:absolute; top:0px; left:0px; width:100%; height:100%;">
        <div id="plBanner"><script id="parklogic" type="text/javascript" src="http://parking.parklogic.com/page/enhance.js?pcId=5&domain="></script></div>
        <iframe src="http://www2.searchnut.com/?subid=704003&domain=" style="height:100%; width:100%; border:none;" frameborder="0" scrolling="yes" id="park" />
      </div>
    </body>
    </html>

There were not many options at that point. I decided that either the virtual machine had already been compromised, or chef itself had been compromised and someone had injected rubbish into the package. In any case, nothing good was to be expected, and I decided to write directly to Chef support. Having described the problem and sent a ticket, I meanwhile deployed another virtual machine on my own workstation and performed the same steps on it. Imagine my surprise when I saw that there was no gce attribute on it! It worked without problems, like all the others.

While waiting for a reply from support, I decided to look up what the abbreviation gce could mean. It turned out that GCE stands for Google Compute Engine, Google's cloud that lets you host virtual machines, and Chef can provision them.

At this point support got back to me and suggested running the command ohai -l debug > ohai.log 2>&1 to see how this attribute appears in the first place. It is worth noting that Ohai is the part of the Chef system that collects all the information about the system. In the logs I saw the following:

    [2014-10-03T08:27:30-04:00] DEBUG: has_ec2_mac? == false
    [2014-10-03T08:27:30-04:00] DEBUG: looks_like_ec2? == false
    [2014-10-03T08:27:30-04:00] DEBUG: can_metadata_connect? == true
    [2014-10-03T08:27:30-04:00] DEBUG: looks_like_gce? == true
    [2014-10-03T08:27:31-04:00] DEBUG: has_euca_mac? == false
    [2014-10-03T08:27:31-04:00] DEBUG: has_euca_mac? == false
    [2014-10-03T08:27:31-04:00] DEBUG: has_euca_mac? == false
    [2014-10-03T08:27:31-04:00] DEBUG: looks_like_euca? == false

The most important lines here are can_metadata_connect? == true and looks_like_gce? == true. For some reason Ohai had decided that this VM was hosted on Google's cloud, although that was not the case. In addition, my attention was drawn to can_metadata_connect?, which showed false on all the other virtual machines. Having asked IT a reasonable question, I received a perfectly reasonable answer: this is an ordinary ESXi virtual machine on our own server, and no GCE is involved at all.

Chef, like Ohai, is an open-source product whose source code can be found on GitHub, which is where I went next. Searching the Ohai sources for the string looks_like_gce, I came across an interesting piece of code in the file gce_metadata.rb:

    GCE_METADATA_ADDR = "metadata.google.internal" unless defined?(GCE_METADATA_ADDR)
    GCE_METADATA_URL = "/computeMetadata/v1beta1/?recursive=true" unless defined?(GCE_METADATA_URL)

    def can_metadata_connect?(addr, port, timeout=2)
      t = Socket.new(Socket::Constants::AF_INET, Socket::Constants::SOCK_STREAM, 0)
      saddr = Socket.pack_sockaddr_in(port, addr)
      connected = false
      begin
        t.connect_nonblock(saddr)
      rescue Errno::EINPROGRESS
        r, w, e = IO::select(nil, [t], nil, timeout)
        if !w.nil?
          connected = true
        else
          begin
            t.connect_nonblock(saddr)
          rescue Errno::EISCONN
            t.close
            connected = true
          rescue SystemCallError
          end
        end
      rescue SystemCallError
      end
      Ohai::Log.debug("can_metadata_connect? == #{connected}")
      connected
    end

It followed from this that if Ohai can reach metadata.google.internal from the virtual machine and get an answer, the machine is automatically considered to be GCE. Googling for metadata.google.internal, I found that this is the address of Google Cloud's internal metadata API, which is accessible only from inside Google's network, so there should have been no way to reach it from my node.
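To make sure I understood the check correctly, it helps to strip it down to its essence. Below is a self-contained sketch (my own simplification, not the Ohai code above): resolve metadata.google.internal through the ordinary system resolver and try a short TCP connect to port 80. On a normal VM this fails; on this one it succeeded, which was all it took to "look like" GCE.

    # A simplified stand-in for Ohai's probe: a TCP connect to
    # metadata.google.internal:80 with a short timeout. The name goes
    # through the system resolver, so it is subject to /etc/resolv.conf.
    require 'socket'
    require 'timeout'

    def metadata_reachable?(host = 'metadata.google.internal', port = 80, timeout = 2)
      Timeout.timeout(timeout) { TCPSocket.new(host, port).close }
      true
    rescue StandardError, Timeout::Error
      false
    end

    puts metadata_reachable?   # false on an ordinary VM, true on the "broken" one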

Checking this assumption on the old VMs, I saw:

    $ wget http://metadata.google.internal
    --2014-10-04 13:47:29--  http://metadata.google.internal/
    Resolving metadata.google.internal... failed: nodename nor servname provided, or not known.
    wget: unable to resolve host address 'metadata.google.internal'

But on the new VM the request went through without problems:

    $ wget http://metadata.google.internal.com
    --2014-10-04 13:50:38--  http://metadata.google.internal.com/
    Resolving metadata.google.internal.com... 74.200.250.131
    Connecting to metadata.google.internal.com|74.200.250.131|:80... connected.
    HTTP request sent, awaiting response... 302 Found
    Location: http://ww2.internal.com [following]
    --2014-10-04 13:50:39--  http://ww2.internal.com/
    Resolving ww2.internal.com... 208.73.211.246, 208.73.211.166, 208.73.211.232, ...
    Connecting to ww2.internal.com|208.73.211.246|:80... connected.
    HTTP request sent, awaiting response... 200 (OK)
    Length: 1372 (1.3K) [text/html]
    Saving to: 'index.html'

    100%[==============================================>] 1,372       --.-K/s   in 0s

    2014-10-04 13:50:40 (28.4 MB/s) - 'index.html' saved [1372/1372]

Looking into the contents of index.html, I saw the same ill-fated HTML code. But how could that be? After all, all the VMs are set up exactly the same way and use the same DNS server, 8.8.8.8. And what is this ww2.internal.com that the request gets redirected to? Running nslookup, I saw:

    $ nslookup metadata.google.internal
    Server:     8.8.8.8
    Address:    8.8.8.8#53

    Non-authoritative answer:
    metadata.google.internal.com    canonical name = internal.com.
    Name:   internal.com
    Address: 74.200.250.131

For some reason metadata.google.internal was being resolved as metadata.google.internal.com, and that was the root of all the trouble. A quick look at /etc/resolv.conf revealed the line search com, which was not present on the other machines.

The answer turned out to be simple. When this VM was created, it was given the hostname new-site.com. The OS installer picked up this name, split off the com part and automatically added it to resolv.conf as a search domain.
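The same fallback is easy to reproduce outside of wget and nslookup. Here is a hedged illustration using Ruby's standard Resolv library instead of the glibc resolver, configured the way resolv.conf effectively configured the system (whether metadata.google.internal.com still resolves today depends entirely on whoever owns internal.com):

    # Emulate a resolver configured with "nameserver 8.8.8.8" and "search com":
    # when the bare name returns NXDOMAIN, the search suffix is appended and
    # metadata.google.internal silently becomes metadata.google.internal.com.
    require 'resolv'

    dns = Resolv::DNS.new(nameserver: ['8.8.8.8'], search: ['com'], ndots: 1)
    begin
      puts dns.getaddress('metadata.google.internal')
    rescue Resolv::ResolvError => e
      puts "not resolvable: #{e.message}"   # what a sane configuration would print
    end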

So when Ohai ran through the system and made a request to metadata.google.internal, it actually got a response from metadata.google.internal.com and concluded that it was running on a GCE machine. After that it made requests to the GCE API, receiving in response nothing but those pages with the iframe.

It turns out that anyone who names their virtual machine something.com automatically gets this problem. Naturally, I reported all this back to Chef support, where I was assured that the ticket would be passed straight to the people who write Ohai. So I hope this problem will be fixed soon.

This investigation has come to an end. The guilty have been punished, and good has triumphed once again. Thank you for staying with us!

Source: https://habr.com/ru/post/239335/

