
How to force a process to use the new DNS server address from an updated resolv.conf without restarting the process

I work as a Unix system administrator. One day our service operations department received a ticket from a programmer with an excerpt from the application-server log in the title: "pgbouncer cannot connect to server". Looking at the pgbouncer logs, I saw that DNS lookups were failing periodically. It turned out that the cause was not our DNS servers but the unreliability of UDP itself: packets are occasionally lost for various reasons.
As a result, we decided to install a caching BIND on every server that runs pgbouncer. And then an interesting problem arose: pgbouncer did not re-read the /etc/resolv.conf file on a HUP signal and kept querying the old DNS servers. Restarting the bouncers was out of the question: some problematic projects are very sensitive to broken database sessions.

In this article I will show how pgbouncer, or any other program that uses the getaddrinfo() library call, can be made to re-read resolv.conf and start using the new DNS server completely painlessly for clients (without downtime).

Let's get started

I should say up front that in my case pgbouncer was version 1.5.2, compiled with libevent-1.4 under FreeBSD.

If you look at the pgbouncer source, you can see the following comment in the dnslookup.c file:
 /*
  * Available backends:
  *
  * udns - libudns
  * getaddrinfo_a - glibc only
  * libevent1 - returns TTL, ignores hosts file.
  * libevent2 - does not return TTL, uses hosts file.
  */

This means that when pgbouncer is compiled against libevent1, the getaddrinfo_a() function from the standard libc library is used for asynchronous address resolution.
Experiment showed that the asynchronous getaddrinfo_a() relies on the ordinary getaddrinfo() function from libc, so that is where we will set a breakpoint. This spares us from compiling pgbouncer with debugging symbols, since gdb knows about the getaddrinfo function even though libc is built without debugging symbols.

Add to the pgbouncer config a test database entry that points to a non-existent domain (useful for tests):
 test = host=test.xaxa.blabla12313212.su user=pgsql dbname=template1 pool_size=10 

In a separate window, run pgbouncer:
 su -m pgbouncer -c '/usr/local/bin/pgbouncer /usr/local/etc/pgbouncer.ini' 

In another window, connect to the process using the gdb debugger:
 gdb /usr/local/bin/pgbouncer `cat /var/run/pgbouncer/pgbouncer.pid` 

Put a breakpoint and let the process continue:
 (gdb) b getaddrinfo
 Breakpoint 1 at 0x800f862a4
 (gdb) c
 Continuing.

In another window, try to connect to our database with the non-existent domain in order to trigger a name-resolution attempt:
 su -m pgbouncer -c 'export PGPASSWORD="123" && /usr/local/bin/psql -Utest test -h10.9.9.16 -p6000'; 

In gdb, we see that we hit the bull's eye:
 Breakpoint 1, 0x0000000800f862a4 in getaddrinfo () from /lib/libc.so.7
 (gdb)


How does getaddrinfo() work?

From the manuals and a search engine I learned that this function, on its first call, reads the resolv.conf file and initializes an in-memory structure holding a bunch of data, including the list of DNS servers. The function then tries to resolve the address using the first server in the list. If that DNS server does not respond, the function makes the next server in the list the active one, and so on round the list. The function reads resolv.conf only once.
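
As a rough illustration of the server list described above, the sketch below prints the nameserver entries of a resolv.conf-style file in the order they appear. It only mimics the parsing of nameserver lines and is not the libc implementation. The function name list_nameservers is made up for this example, and the cap of three entries assumes the usual MAXNS limit documented in resolver(5).

```shell
# Print the nameserver addresses from a resolv.conf-style file, in file order.
# $1: path to the file (defaults to /etc/resolv.conf).
# Assumption: at most 3 entries are honoured, matching the MAXNS limit
# mentioned in resolver(5); later entries are ignored.
list_nameservers() {
    awk '$1 == "nameserver" && n < 3 { print $2; n++ }' "${1:-/etc/resolv.conf}"
}
```

Running list_nameservers with no argument before and after editing /etc/resolv.conf shows which servers a freshly initialized resolver would pick up, and in which order.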

At first I wanted to patch the virtual memory of pgbouncer by finding the 4 bytes of the DNS server address in network-order or host-order format. I even wrote a memory dumper in C that could dump the process memory and search for a specific byte sequence. But, as it turned out, the addresses cannot be found in memory in this form. Understanding the getaddrinfo() source turned out to be beyond me: the sheer amount of code and all sorts of goto nearly broke my brain. Besides, I am not a programmer, and I had started learning C only a month earlier.

By the way, my program, which uses ptrace and procfs, would work for a pgbouncer compiled against libevent2: there the IP addresses of the DNS servers are stored as exactly four bytes. But describing that experiment is beyond the scope of this article.
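
For illustration, here is how the byte patterns for such a memory search could be derived in shell. The helper names ip_to_hex_net and ip_to_hex_host are hypothetical, and the host-order variant assumes a little-endian machine such as amd64.

```shell
# Print an IPv4 address as 8 hex digits in network (big-endian) byte order.
ip_to_hex_net() {
    # Turn "10.9.9.16" into "10 9 9 16" and format each octet as two hex digits.
    printf '%02x%02x%02x%02x\n' $(echo "$1" | tr '.' ' ')
}

# Print the same address as it would appear in host order on a
# little-endian machine (assumption: the four bytes are simply reversed).
ip_to_hex_host() {
    set -- $(echo "$1" | tr '.' ' ')
    printf '%02x%02x%02x%02x\n' "$4" "$3" "$2" "$1"
}
```

For example, ip_to_hex_net 10.9.9.16 prints 0a090910, and ip_to_hex_host 10.9.9.16 prints 1009090a.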

What to do?

Fortunately, with the help of a search engine, I found the saving res_init() function in the standard library:
The res_init() routine reads the configuration file (if any; see
resolver(5)) to get the default domain name, search list and the Internet
address of the local name server(s)

This is the function that getaddrinfo() calls on first use to initialize the structure we need!
Calling it again re-initializes the structure and re-reads resolv.conf.

Check in practice

Attach a tracer to our frozen pgbouncer and start grepping the trace dump:
 ktrace -f out.ktrace -p `cat /var/run/pgbouncer/pgbouncer.pid`
 kdump -l -f out.ktrace | grep resolv

In the window with gdb, call the res_init() function:
 (gdb) call res_init()
 Breakpoint 1, 0x0000000800f862a4 in getaddrinfo () from /lib/libc.so.7

In the window with the trace output we see:
 37933 pgbouncer NAMI "/etc/resolv.conf" 


Goal achieved

We managed to force the process to re-read resolv.conf without bringing the server down and without breaking active TCP sessions. Requests that arrive while the process is frozen are not lost either.

If we want the local caching DNS to be used immediately, we need to take the following steps:

  1. In the BIND forwarders, replace the servers with new (different) working DNS servers that have never been listed in resolv.conf and never will be, then run rndc reload
  2. Block access to the old DNS servers (everything except 127.0.0.1) in the local firewall
  3. Make pgbouncer attempt a connection to the non-existent database server:
     su -m pgbouncer -c 'export PGPASSWORD="123" && /usr/local/bin/psql -Utest test -h127.0.0.1 -p6000'; 

  4. Verify with tcpdump that pgbouncer is querying 127.0.0.1 on port 53:
     tcpdump -l -n -i lo0 port 53 | grep xaxa | grep "> 127.0.0.1.53"

    where xaxa is part of the server name from the pgbouncer config
  5. Unblock the old DNS servers in the firewall
  6. Restore the original BIND forwarders settings

One last thing

If you want to repeat my experiment, I strongly recommend practicing on a test machine first.
If you want to fire commands at gdb in batch mode, keep in mind that gdb must first be given time to read the symbols, and only then should functions be called: I once got badly burned by this and killed one of our working pgbouncers.
My batch mode for gdb now looks like this:
 printf 'shell sleep 3\ncall res_init()\ndetach\nquit\n' > /tmp/pb.gdb && gdb -batch -x /tmp/pb.gdb /usr/local/bin/pgbouncer `cat /var/run/pgbouncer/test.pid` 
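
The same batch call can be wrapped in a small script so that the delay and the paths are not retyped each time. This is only a sketch: the function names write_gdb_script and reload_resolver are invented here, and the liveness check with kill -0 is an extra precaution not present in the one-liner above. It assumes the script runs as a user allowed to signal the process.

```shell
# Write the gdb command file: wait, call res_init(), detach, quit.
# $1: path of the command file to create
# $2: delay in seconds before the call (defaults to 3, as in the one-liner)
write_gdb_script() {
    printf 'shell sleep %s\ncall res_init()\ndetach\nquit\n' "${2:-3}" > "$1"
}

# Attach gdb to the process named in a pid file and run the command file.
# $1: path to the binary, $2: path to the pid file (both chosen by the caller)
reload_resolver() {
    pid=$(cat "$2") || return 1
    # Refuse to attach if the pid does not belong to a live process
    # (kill -0 also fails if we lack permission to signal it).
    kill -0 "$pid" 2>/dev/null || { echo "cannot signal process: $pid" >&2; return 1; }
    script=$(mktemp) || return 1
    write_gdb_script "$script"
    gdb -batch -x "$script" "$1" "$pid"
    status=$?
    rm -f "$script"
    return $status
}
```

With the paths used in this article, the call would be reload_resolver /usr/local/bin/pgbouncer /var/run/pgbouncer/test.pid.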


I hope my experience helps someone understand a little better how processes work in operating systems.

Source: https://habr.com/ru/post/209356/

