
Pain(less) NGINX Ingress



So, you have a Kubernetes cluster, and to forward external traffic to services inside it you have already configured the NGINX Ingress Controller, or you are about to. Great!


I went through this too, and at first everything looked very simple: the NGINX Ingress controller was just one helm install away. After that, all that remained was to point DNS at the load balancer and create the necessary Ingress resources.


A few months later, all external traffic for all environments (dev, staging, production) was going through the Ingress servers. And everything was fine. And then it all went wrong.


We all know how it goes: first you get excited about this wonderful new thing and start using it everywhere, and then the trouble begins.


My first Ingress crash


First, let me warn you: if you are not yet worried about accept queue overflows, it is time to start.



Do not forget about the queues


What happened was that the application behind NGINX started responding with long delays, which in turn caused the NGINX listen backlog to fill up. Because of this, NGINX began dropping connections, including the ones Kubernetes was making to check the health of the pods (liveness/readiness probes).


And what happens when a pod does not respond to those probes? Kubernetes decides something went wrong and restarts it. The problem is that this is one of those situations where restarting the pod only makes things worse: the accept queue keeps overflowing, Kubernetes keeps restarting pods, and eventually everything is pulled into a spiral of crashes and restarts.



TCP listen queue overflows according to netstat
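
If you want to check these counters on a node yourself, the kernel already exposes everything you need. A minimal sketch (plain shell on the node; nothing assumed beyond the standard iproute2/net-tools utilities):

 # Current accept queue length (Recv-Q) vs. configured backlog (Send-Q) for each listening socket
 $ ss -lnt

 # Cumulative counters; values that keep growing are a bad sign
 $ netstat -s | grep -i -E 'overflowed|SYNs to LISTEN'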


What lessons can be learned from this situation?



The importance of monitoring


My advice number 0: never run a production Kubernetes cluster (or anything like it) without setting up quality monitoring. Monitoring by itself will not save you from problems, but the collected telemetry makes it much easier to find the root cause of a failure, and therefore to fix it.



Some useful node-level netstat metrics


If you have succumbed to the Prometheus craze, you can use node_exporter to collect node-level metrics. It is a handy tool that helps you spot, among other things, the problems just described.
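
For instance, the netstat collector of node_exporter exposes the listen queue counters mentioned above. A quick way to see them, assuming node_exporter is running on its default port 9100:

 $ curl -s http://localhost:9100/metrics | grep -E 'ListenOverflows|ListenDrops'
 # node_netstat_TcpExt_ListenOverflows and node_netstat_TcpExt_ListenDrops should stay flat;
 # alert if their rate becomes non-zero.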



Some metrics derived from the NGINX Ingress Controller


The NGINX Ingress controller can export Prometheus metrics itself. Do not forget to set up their collection.
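
A quick way to verify that the metrics are actually there, assuming the controller serves them on its default port 10254 and using a placeholder pod name:

 $ kubectl -n <namespace> port-forward <nginx-ingress-controller-pod-name> 10254:10254 &
 $ curl -s http://localhost:10254/metrics | head   # metric names vary between controller versions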


Know your config


The beauty of an Ingress controller is that you can rely on this wonderful piece of software to generate and reload the proxy configuration and never worry about it again. You do not even have to be familiar with the underlying technology (NGINX, in this case). Right? Wrong!


If you have not done so yet, be sure to look at the configuration that is generated for you. For the NGINX Ingress Controller, you can fetch the contents of /etc/nginx/nginx.conf with kubectl.


 $ kubectl -n <namespace> exec <nginx-ingress-controller-pod-name> -- cat /etc/nginx/nginx.conf > ./nginx.conf 

 # $ cat ./nginx.conf
 daemon off;
 worker_processes auto;
 pid /run/nginx.pid;
 worker_rlimit_nofile 1047552;
 worker_shutdown_timeout 10s;

 events {
     multi_accept        on;
     worker_connections  16384;
     use                 epoll;
 }

 http {
     real_ip_header      X-Forwarded-For;
     # ...
 }
 # ...

Now try to find something in it that does not fit your installation. Want an example? Let's start with worker_processes auto;


The optimal value depends on many factors, including (but not limited to) the number of processor cores, the number of hard drives with data and the load pattern. If you have difficulty in choosing the right value, you can start by setting it equal to the number of processor cores (the value “auto” tries to determine it automatically).

The first problem: at the moment (will it ever be fixed?) NGINX knows nothing about cgroups, which means that with auto the number of worker processes is derived from the number of host CPUs, not from the number of "virtual" CPUs defined by Kubernetes resource requests/limits.


Let's run an experiment. What happens if we load the following NGINX configuration on a dual-core server, in a container limited to a single CPU? How many worker processes will be started?


 # $ cat ./minimal-nginx.conf
 worker_processes auto;

 events {
     worker_connections 1024;
 }

 http {
     server {
         listen      80;
         server_name localhost;
         location / {
             root  html;
             index index.html index.htm;
         }
     }
 }

 $ docker run --rm --cpus="1" -v `pwd`/minimal-nginx.conf:/etc/nginx/nginx.conf:ro -d nginx
 fc7d98c412a9b90a217388a094de4c4810241be62c4f7501e59cc1c968434d4c

 $ docker exec fc7 ps -ef | grep nginx
 root       1   0  0 21:49 pts/0  00:00:00 nginx: master process nginx -g daemon off;
 nginx      6   1  0 21:49 pts/0  00:00:00 nginx: worker process
 nginx      7   1  0 21:49 pts/0  00:00:00 nginx: worker process

So, if you plan to limit the CPU resources available to NGINX Ingress, you should not let it spawn a large number of worker processes in a single container. It is best to state the desired number explicitly with the worker_processes directive.
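
With the NGINX Ingress Controller this does not require editing nginx.conf by hand; the ConfigMap it watches has a worker-processes key. A sketch, with placeholder namespace and ConfigMap names from your own installation:

 $ kubectl -n <namespace> patch configmap <nginx-ingress-configmap-name> \
     --type merge -p '{"data":{"worker-processes":"4"}}'

 # The controller regenerates nginx.conf when the ConfigMap changes; verify the result:
 $ kubectl -n <namespace> exec <nginx-ingress-controller-pod-name> -- grep worker_processes /etc/nginx/nginx.conf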


Now consider the listen directive. The backlog parameter is not set explicitly, so it defaults to 511 on Linux. If the kernel parameter net.core.somaxconn is, say, 1024, the backlog should be raised to a matching value. In other words, make sure the NGINX configuration agrees with the kernel parameters.
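
A quick check of the kernel side, with illustrative values (run on the Ingress node):

 # What the kernel allows per listening socket
 $ sysctl net.core.somaxconn
 net.core.somaxconn = 128

 # Raise it if needed (and persist the change via /etc/sysctl.d/ or your node provisioning)
 $ sudo sysctl -w net.core.somaxconn=1024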


But do not stop there. This exercise is worth doing for every line of the generated configuration file. Take a look at all the parameters the Ingress controller lets you change, and correct without hesitation anything that does not suit your case. Most NGINX parameters can be tuned through ConfigMap entries and/or annotations.


Kernel options


With or without Ingress, always check and tune the kernel parameters of your nodes according to the expected load.


This is a fairly deep topic, so I will not cover it in detail here. Additional material on the subject can be found in the Links section.
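
Still, as a starting point, these are the kinds of knobs worth reviewing on Ingress nodes; the list is illustrative, and the right values depend entirely on your traffic:

 $ sysctl net.core.somaxconn              # maximum listen backlog
 $ sysctl net.core.netdev_max_backlog     # packets queued on the receive side before the kernel starts dropping
 $ sysctl net.ipv4.ip_local_port_range    # ephemeral ports available for connections to upstreams
 $ sysctl fs.file-max                     # system-wide file descriptor limit
 $ sysctl net.netfilter.nf_conntrack_max  # see the conntrack section below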


Kube-Proxy: Conntrack Table


If you use Kubernetes, I probably do not need to explain what Services are and what they are for. However, it is worth looking at some details of how they work.


Every node of a Kubernetes cluster runs kube-proxy, which is responsible for implementing the virtual IP for Services of types other than ExternalName. In Kubernetes v1.0, proxying was done exclusively in user space. An iptables proxy was added in Kubernetes v1.1, but it was not the default mode. Since Kubernetes v1.2, the iptables proxy is the default.

In other words, packets sent to a Service IP are forwarded (directly or through a balancer) to the appropriate Endpoints (the address:port pairs of the pods matching the Service's label selector) by iptables rules managed by kube-proxy. Connections to Service IP addresses are tracked by the kernel with the nf_conntrack module, and this tracking information is stored in RAM.
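
You can see both the rules and the tracked connections directly on a node. A small sketch (the chain name is the one kube-proxy creates in iptables mode; the service IP is a placeholder):

 # One entry per Service IP/port, jumping to per-service chains
 $ sudo iptables -t nat -L KUBE-SERVICES -n | head

 # Conntrack entries involving a given service/pod IP (requires conntrack-tools)
 $ sudo conntrack -L -d 10.3.0.10 | head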


Since the various conntrack parameters must be consistent with each other (for example, nf_conntrack_max and nf_conntrack_buckets), kube-proxy sets reasonable defaults for them at startup.


 $ kubectl -n kube-system logs <some-kube-proxy-pod>
 I0829 22:23:43.455969       1 server.go:478] Using iptables Proxier.
 I0829 22:23:43.473356       1 server.go:513] Tearing down userspace rules.
 I0829 22:23:43.498529       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_max' to 524288
 I0829 22:23:43.498696       1 conntrack.go:52] Setting nf_conntrack_max to 524288
 I0829 22:23:43.499167       1 conntrack.go:83] Setting conntrack hashsize to 131072
 I0829 22:23:43.503607       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
 I0829 22:23:43.503718       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
 I0829 22:23:43.504052       1 config.go:102] Starting endpoints config controller
 ...

These are good defaults, but you may need to increase them if monitoring shows that you are running out of conntrack space. Keep in mind, however, that raising these values increases memory consumption, so be careful.
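
Both numbers are easy to check directly on a node, and node_exporter exposes the same data as node_nf_conntrack_entries and node_nf_conntrack_entries_limit (the values below are illustrative):

 $ sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
 net.netfilter.nf_conntrack_count = 181305
 net.netfilter.nf_conntrack_max = 524288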



Conntrack usage monitoring


Sharing is (not) caring


Until recently, we had only one instance of NGINX Ingress, responsible for proxying requests to all applications in all environments (dev, staging, production). I learned from my own experience that this is a bad idea. Do not put all your eggs in one basket.


I suppose the same could be said about using a single cluster for all environments, but we found that this approach gives a more efficient use of resources. We ran the dev/staging pods in the best-effort QoS tier, letting them use whatever resources were left over by the production applications.
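
As a reminder, the QoS class is derived from the resource spec: pods with no requests or limits at all end up in BestEffort, which is exactly what we used for dev/staging. A quick way to check where a pod landed (pod name is a placeholder):

 $ kubectl -n staging get pod <some-pod> -o jsonpath='{.status.qosClass}'
 BestEffort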


The flip side is that we are limited in what we can do to the cluster. For example, if we need to load-test a staging service, we have to be very careful not to affect the production services running in the same cluster.


Although containers generally provide a good level of isolation, they still depend on shared kernel resources that are subject to abuse.


One Ingress installation per environment


As already mentioned, there is no reason not to run a separate Ingress controller for each environment. It gives you an extra layer of protection in case something starts to go wrong with services in dev and/or staging.


Some advantages of this approach are:



Ingress classes to the rescue


One way to make different Ingress controllers manage different Ingress resources is to give each Ingress installation its own ingress class name, and then annotate the Ingress resources to specify which controller should handle them.


 # Ingress controller 1
 apiVersion: extensions/v1beta1
 kind: Deployment
 spec:
   template:
     spec:
       containers:
       - args:
         - /nginx-ingress-controller
         - --ingress-class=class-1
         - ...
 ---
 # Ingress controller 2
 apiVersion: extensions/v1beta1
 kind: Deployment
 spec:
   template:
     spec:
       containers:
       - args:
         - /nginx-ingress-controller
         - --ingress-class=class-2
         - ...
 ---
 # This Ingress resource will be managed by controller 1
 apiVersion: extensions/v1beta1
 kind: Ingress
 metadata:
   annotations:
     kubernetes.io/ingress.class: class-1
 spec:
   rules: ...
 ---
 # This Ingress resource will be managed by controller 2
 apiVersion: extensions/v1beta1
 kind: Ingress
 metadata:
   annotations:
     kubernetes.io/ingress.class: class-2
 spec:
   rules: ...

Ingress reload issues


At this point, we had a dedicated ingress controller running for the production environment. Everything was fine until we decided to transfer one WebSocket application to Kubernetes + ingress.


Soon I noticed a strange trend in the memory usage of the production Ingress.



What the hell is going on here ?!


Why was memory usage so high? Using kubectl exec, I went into one of the Ingress containers and found a bunch of worker processes stuck in the "shutting down" state.


 root     17755 17739  0 19:47 ?  00:00:00 /usr/bin/dumb-init /nginx-ingress-controller --default-backend-service=kube-system/broken-bronco-nginx-ingress-be --configmap=kube-system/broken-bronco-nginx-ingress-conf --ingress-class=nginx-ingress-prd
 root     17765 17755  0 19:47 ?  00:00:08 /nginx-ingress-controller --default-backend-service=kube-system/broken-bronco-nginx-ingress-be --configmap=kube-system/broken-bronco-nginx-ingress-conf --ingress-class=nginx-ingress-prd
 root     17776 17765  0 19:47 ?  00:00:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
 nobody   18866 17776  0 19:49 ?  00:00:05 nginx: worker process is shutting down
 nobody   19466 17776  0 19:51 ?  00:00:01 nginx: worker process is shutting down
 nobody   19698 17776  0 19:51 ?  00:00:05 nginx: worker process is shutting down
 nobody   20331 17776  0 19:53 ?  00:00:05 nginx: worker process is shutting down
 nobody   20947 17776  0 19:54 ?  00:00:03 nginx: worker process is shutting down
 nobody   21390 17776  1 19:55 ?  00:00:05 nginx: worker process is shutting down
 nobody   22139 17776  0 19:57 ?  00:00:00 nginx: worker process is shutting down
 nobody   22251 17776  0 19:57 ?  00:00:01 nginx: worker process is shutting down
 nobody   22510 17776  0 19:58 ?  00:00:01 nginx: worker process is shutting down
 nobody   22759 17776  0 19:58 ?  00:00:01 nginx: worker process is shutting down
 nobody   23038 17776  1 19:59 ?  00:00:03 nginx: worker process is shutting down
 nobody   23476 17776  1 20:00 ?  00:00:01 nginx: worker process is shutting down
 nobody   23738 17776  1 20:00 ?  00:00:01 nginx: worker process is shutting down
 nobody   24026 17776  2 20:01 ?  00:00:02 nginx: worker process is shutting down
 nobody   24408 17776  4 20:01 ?  00:00:01 nginx: worker process

To understand what happened, you need to take a step back and take a look at how the configuration reload process is implemented in NGINX.


Having received the reload signal, the master process checks the syntax of the new configuration file and tries to apply it. If it succeeds, the master process starts new worker processes and asks the old ones to shut down gracefully. Otherwise, the master process rolls back the changes and keeps working with the old configuration. Old worker processes, once told to shut down, stop accepting new requests but continue to serve the requests already in flight until they are all finished. Only then does an old worker process exit.

Remember that we are proxying WebSocket connections, which are by nature long-lived: a WebSocket connection can stay open for hours or even days, depending on the application. NGINX cannot know whether it is safe to terminate such connections during a reload, so we have to help it. (For example, you can enforce a policy of closing connections that have been idle for some amount of time, on both the client and the server. Do not put such things off for later.)
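
With the NGINX Ingress Controller such idle timeouts can be set per Ingress resource through annotations. A sketch; the annotation prefix shown here is the one used by recent ingress-nginx releases and may differ in older versions:

 # Close proxied connections that have been idle for more than an hour
 $ kubectl -n production annotate ingress <websocket-app-ingress> \
     nginx.ingress.kubernetes.io/proxy-read-timeout="3600" \
     nginx.ingress.kubernetes.io/proxy-send-timeout="3600" \
     --overwrite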


Back to our problem. If there were that many worker processes in the shutting-down state, then the Ingress configuration had been reloaded many times, and the worker processes could not finish because of the long-lived connections.


And that is exactly what it was. We found out that the NGINX Ingress controller kept generating different configuration files because the order of the upstream servers and their IP addresses kept changing.


 I0810 23:14:47.866939       5 nginx.go:300] NGINX configuration diff
 I0810 23:14:47.866963       5 nginx.go:301] --- /tmp/a072836772  2017-08-10 23:14:47.000000000 +0000
 +++ /tmp/b304986035  2017-08-10 23:14:47.000000000 +0000
 @@ -163,32 +163,26 @@
      proxy_ssl_session_reuse on;

 -  upstream production-app-1-80 {
 +  upstream upstream-default-backend {
      # Load balance algorithm; empty for round robin, which is the default
      least_conn;
 -    server 10.2.71.14:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.32.22:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.157.13:8080 max_fails=0 fail_timeout=0;
    }

 -  upstream production-app-2-80 {
 +  upstream production-app-3-80 {
      # Load balance algorithm; empty for round robin, which is the default
      least_conn;
 -    server 10.2.110.13:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.109.195:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.82.66:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.79.124:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.59.21:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.45.219:3000 max_fails=0 fail_timeout=0;
    }

    upstream production-app-4-80 {
      # Load balance algorithm; empty for round robin, which is the default
      least_conn;
 -    server 10.2.109.177:3000 max_fails=0 fail_timeout=0;
      server 10.2.12.161:3000 max_fails=0 fail_timeout=0;
 -  }
 -
 -  upstream production-app-5-80 {
 -    # Load balance algorithm; empty for round robin, which is the default
 -    least_conn;
 -    server 10.2.21.37:9292 max_fails=0 fail_timeout=0;
 -    server 10.2.65.105:9292 max_fails=0 fail_timeout=0;
 +    server 10.2.109.177:3000 max_fails=0 fail_timeout=0;
    }

    upstream production-app-6-80 {
 @@ -201,61 +195,67 @@
    upstream production-lap-production-80 {
      # Load balance algorithm; empty for round robin, which is the default
      least_conn;
 -    server 10.2.45.223:8000 max_fails=0 fail_timeout=0;
 +    server 10.2.21.36:8000 max_fails=0 fail_timeout=0;
      server 10.2.78.36:8000 max_fails=0 fail_timeout=0;
 +    server 10.2.45.223:8000 max_fails=0 fail_timeout=0;
      server 10.2.99.151:8000 max_fails=0 fail_timeout=0;
 -    server 10.2.21.36:8000 max_fails=0 fail_timeout=0;
    }

 -  upstream production-app-7-80{
 +  upstream production-app-1-80 {
      # Load balance algorithm; empty for round robin, which is the default
      least_conn;
 -    server 10.2.79.126:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.35.105:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.114.143:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.50.44:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.149.135:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.45.155:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.71.14:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.32.22:3000 max_fails=0 fail_timeout=0;
    }

 -  upstream production-app-8-80 {
 +  upstream production-app-2-80 {
      # Load balance algorithm; empty for round robin, which is the default
      least_conn;
 -    server 10.2.53.23:5000 max_fails=0 fail_timeout=0;
 -    server 10.2.110.22:5000 max_fails=0 fail_timeout=0;
 -    server 10.2.35.91:5000 max_fails=0 fail_timeout=0;
 -    server 10.2.45.221:5000 max_fails=0 fail_timeout=0;
 +    server 10.2.110.13:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.109.195:3000 max_fails=0 fail_timeout=0;
    }

 -  upstream upstream-default-backend {
 +  upstream production-app-9-80 {
      # Load balance algorithm; empty for round robin, which is the default
      least_conn;
 -    server 10.2.157.13:8080 max_fails=0 fail_timeout=0;
 +    server 10.2.78.26:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.59.22:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.96.249:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.32.21:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.114.177:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.83.20:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.118.111:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.26.23:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.35.150:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.79.125:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.157.165:3000 max_fails=0 fail_timeout=0;
    }

 -  upstream production-app-3-80 {
 +  upstream production-app-5-80 {
      # Load balance algorithm; empty for round robin, which is the default
      least_conn;
 -    server 10.2.79.124:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.82.66:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.45.219:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.59.21:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.21.37:9292 max_fails=0 fail_timeout=0;
 +    server 10.2.65.105:9292 max_fails=0 fail_timeout=0;
    }

 -  upstream production-app-9-80 {
 +  upstream production-app-7-80 {
      # Load balance algorithm; empty for round robin, which is the default
      least_conn;
 -    server 10.2.96.249:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.157.165:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.114.177:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.118.111:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.79.125:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.78.26:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.59.22:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.35.150:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.32.21:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.83.20:3000 max_fails=0 fail_timeout=0;
 -    server 10.2.26.23:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.114.143:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.79.126:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.45.155:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.35.105:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.50.44:3000 max_fails=0 fail_timeout=0;
 +    server 10.2.149.135:3000 max_fails=0 fail_timeout=0;
 +  }
 +
 +  upstream production-app-8-80 {
 +    # Load balance algorithm; empty for round robin, which is the default
 +    least_conn;
 +    server 10.2.53.23:5000 max_fails=0 fail_timeout=0;
 +    server 10.2.45.221:5000 max_fails=0 fail_timeout=0;
 +    server 10.2.35.91:5000 max_fails=0 fail_timeout=0;
 +    server 10.2.110.22:5000 max_fails=0 fail_timeout=0;
    }

    server {

Because of this, the NGINX Ingress controller was reloading the configuration several times a minute, filling up memory with terminating worker processes until it fell victim to the OOM killer.


After I updated the NGINX Ingress controller to a patched version and passed the --sort-backends=true command-line flag, things got much better.



The number of unnecessary reloads dropped to zero after installing the patched version.


Thank you @aledbf for helping me find and correct this error!


Keeping configuration reloads to a minimum


It is important to remember that configuration reloads are expensive and should be avoided, especially when dealing with WebSocket connections. That is why we decided to run a separate Ingress controller dedicated to long-lived connections.


Since then, the WebSocket applications have been served by their own dedicated Ingress installation, while the rest of the traffic goes through the other one.


This way, configuration reloads in one Ingress installation no longer affect the connections handled by the other.



Keep in mind that the NGINX Ingress controller reloads its configuration whenever the set of pod IP addresses behind a service exposed through an Ingress resource changes. So if you use autoscaling (HorizontalPodAutoscalers), every scale-up or scale-down can trigger a reload.



Horizontal pod autoscaler


In our case it turned out that one of the applications was being scaled up and down by the horizontal pod autoscaler several times an hour, and each scaling event triggered a configuration reload.


 Name:              <app>
 Namespace:         production
 Labels:            <none>
 Annotations:       <none>
 CreationTimestamp: Fri, 23 Jun 2017 11:41:59 -0300
 Reference:         Deployment/<app>
 Metrics:           ( current / target )
   resource cpu on pods (as a percentage of request): 46% (369m) / 60%
 Min replicas:      8
 Max replicas:      20
 Conditions:
   Type            Status  Reason            Message
   ----            ------  ------            -------
   AbleToScale     False   BackoffBoth       the time since the previous scale is still within both the downscale and upscale forbidden windows
   ScalingActive   True    ValidMetricFound  the HPA was able to succesfully calculate a replica count from cpu resource utilization (percentage of request)
   ScalingLimited  True    TooFewReplicas    the desired replica count was less than the minimum replica count
 Events:
   FirstSeen  LastSeen  Count  From                       SubObjectPath  Type    Reason             Message
   ---------  --------  -----  ----                       -------------  ----    ------             -------
   14d        10m       39     horizontal-pod-autoscaler                 Normal  SuccessfulRescale  New size: 10; reason: cpu resource utilization (percentage of request) above target
   14d        3m        69     horizontal-pod-autoscaler                 Normal  SuccessfulRescale  New size: 8; reason: All metrics below target

The --horizontal-pod-autoscaler-upscale-delay flag of kube-controller-manager defaults to 3 minutes.


It takes about 4 minutes (3 minutes of autoscaler delay plus roughly 1 minute for the new pods to become ready) for a scale-up to settle, after which the autoscaler may scale back down again, and every one of these changes triggers another configuration reload.


Your opinion?


Have you had similar problems running Ingress in production? Share your experience in the comments!


Links




Source: https://habr.com/ru/post/340238/

