
Optimizing the Django initialization stage

If your Django project runs on synchronous workers and you restart them periodically (in gunicorn this is the --max-requests option), it is useful to know that by default, after each worker restart, the first request it handles takes much longer than the subsequent ones.


In this article I will describe how I solved this and other problems that caused abnormal delays on random requests.


The article gives examples for the gunicorn wsgi server, but they are relevant for any way of running a project on synchronous workers. The same applies to uWSGI and mod_wsgi.


Recently we moved our Django project to a Kubernetes cluster. Kubernetes has readiness/liveness probes that poll every running instance of the wsgi server (in our case, gunicorn) at a specified http endpoint. Ours is /api/v1/status:


    import logging

    import django.db
    from django.core.cache import cache
    from rest_framework import status, views
    from rest_framework.response import Response

    log = logging.getLogger(__name__)


    class StatusView(views.APIView):
        @staticmethod
        def get(request):
            overall_ok = True

            try:
                with django.db.connection.cursor() as cursor:
                    cursor.execute('SELECT version()')
                    cursor.fetchone()
            except Exception:
                log.exception('Database failure')
                db = 'fail'
                overall_ok = False
            else:
                db = 'ok'

            try:
                cache.set('status', 1)
            except Exception:
                log.exception('Redis failure')
                redis = 'fail'
                overall_ok = False
            else:
                redis = 'ok'

            if overall_ok:
                s = status.HTTP_200_OK
            else:
                s = status.HTTP_500_INTERNAL_SERVER_ERROR

            return Response({
                'web': 'ok',
                'db': db,
                'redis': redis,
            }, status=s)

Before the move to Kubernetes, we had Zabbix, which requested /api/v1/status through the load balancer once a minute, and this health check had never particularly failed us. But after the move, when the checks started hitting each gunicorn instance individually and more frequently, it suddenly turned out that we sometimes did not fit into the 5-second timeout.


Nevertheless, everything worked fine and users were not affected, so I did not pay much attention to it; I just gave myself a background task to figure out what was going on. Here is what I was able to find out:


By default, gunicorn starts a master process, which forks the number of processes specified by the --workers argument. The wsgi module passed to gunicorn as the main argument is loaded by each worker after the fork. But there is a --preload option: if you specify it, the wsgi module is loaded BEFORE the fork. Hence the rule:


Always try to run gunicorn with the --preload option, which reduces the initialization time of each worker: most of the initialization happens only once in the master process, and the worker processes are then forked already initialized.

I repeat that most of these optimizations make sense only if your Django project runs on synchronous workers and you restart them periodically (--max-requests).
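
For reference, here is a minimal sketch of what this could look like in a gunicorn configuration file (gunicorn config files are plain Python; the module path and worker count below are placeholders, roughly equivalent to running `gunicorn --preload --workers 4 myproject.wsgi:application`):

    # gunicorn.conf.py -- minimal sketch; adjust the module path and numbers to your project
    wsgi_app = 'myproject.wsgi:application'  # placeholder path; in older gunicorn versions pass it on the command line instead
    workers = 4                              # number of worker processes the master will fork
    preload_app = True                       # load the wsgi module in the master BEFORE forking workers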


Nevertheless, it turned out that --preload alone is not enough: the first request to a freshly started worker still takes more time than the subsequent ones. Tracing showed that preloading the wsgi module does little, and most of Django is initialized only during the first request. So a head-on solution was born:


During wsgi initialization, make a fake request to the health/status endpoint so that as many subsystems as possible are initialized immediately.

For example, I added the following to wsgi.py:


    import os

    from django.core.wsgi import get_wsgi_application

    # The standard wsgi.py entry point; this line is normally already present.
    application = get_wsgi_application()


    # Make a request to /api/v1/status to prepare everything for the first user request.
    def make_init_request():
        from django.conf import settings
        from django.test import RequestFactory

        f = RequestFactory()
        request = f.request(**{
            'wsgi.url_scheme': 'http',
            'HTTP_HOST': settings.SITE_DOMAIN,
            'QUERY_STRING': '',
            'REQUEST_METHOD': 'GET',
            'PATH_INFO': '/api/v1/status',
            'SERVER_PORT': '80',
        })

        def start_response(*args):
            pass

        application(request.environ, start_response)


    if os.environ.get('WSGI_FULL_INIT'):
        make_init_request()

As a result, workers started to initialize an order of magnitude faster, because they are forked already fully prepared for the first request.


The problems with initialization stopped... almost. To my shame, I did not know about this feature: it turns out that by default Django reconnects to the database on every request. The culprit is the CONN_MAX_AGE setting, which, apparently only for historical reasons, defaults to making your Django application behave like a PHP script that starts from scratch each time. Hence the rule:


In the Django database connection settings, set CONN_MAX_AGE=None so that the connections are persistent.
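
For illustration, a sketch of the relevant part of settings.py (the engine, database name, credentials and host below are placeholders for your own values):

    # settings.py -- sketch; engine, name, credentials and host are placeholders
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.postgresql',
            'NAME': 'mydb',
            'USER': 'myuser',
            'PASSWORD': 'secret',
            'HOST': 'db',
            'PORT': '5432',
            # None keeps connections open indefinitely; the default of 0
            # closes the connection at the end of every request.
            'CONN_MAX_AGE': None,
        }
    }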

I might not even have noticed this, but for some reason the call to psycopg2.connect sometimes took exactly 5 seconds. I have not fully figured this out: a script running in parallel, which called this function every 10 seconds, worked stably and connected to the database in under a second for the entire couple of weeks it ran.


But these two rules conflict with each other: with preloading, connections to the database and the cache are created in the master process before the fork, and the child processes inherit the master's open sockets. This leads to undefined behavior, with several processes working with the same socket at once. Therefore, all connections must be closed before the fork:


    # Close connections to the database and the cache before (or right after) forking.
    # Without this, child processes would share these connections, which is not supported.
    def close_network_connections():
        from django import db
        from django.conf import settings
        from django.core import cache

        for conn in db.connections:
            db.connections[conn].close()

        # django-redis only actually closes its connections when
        # DJANGO_REDIS_CLOSE_CONNECTION is enabled, so enable it temporarily.
        django_redis_close_connection = getattr(settings, 'DJANGO_REDIS_CLOSE_CONNECTION', False)
        settings.DJANGO_REDIS_CLOSE_CONNECTION = True
        cache.close_caches()
        settings.DJANGO_REDIS_CLOSE_CONNECTION = django_redis_close_connection


    if os.environ.get('WSGI_FULL_INIT'):
        make_init_request()

        # in case the wsgi module is preloaded in the master process (i.e. `gunicorn --preload`)
        if os.environ.get('WSGI_FULL_INIT_CLOSE_CONNECTIONS'):
            close_network_connections()

So when --preload and WSGI_FULL_INIT are used together, WSGI_FULL_INIT_CLOSE_CONNECTIONS must be set as well.


As a result, the abnormal delays were completely eliminated. But there are a couple of edge cases where they can still occur:


If all workers restart at the same time. This is quite likely: if requests are distributed between the workers roughly evenly, they all reach max-requests at about the same moment. Therefore:


Start gunicorn with max-requests-jitter so that the workers do not all restart at the same moment, even if each individual restart is fast.
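
As a sketch, in a gunicorn configuration file this could look as follows (the numbers are arbitrary examples, not a recommendation):

    # gunicorn.conf.py -- sketch; the numbers are arbitrary examples
    max_requests = 1000        # restart a worker after it has handled this many requests
    max_requests_jitter = 100  # each worker's limit is offset by a random value up to this, so restarts spread out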

Also, a delay can still occur on the very first request, when connections to the database and other external systems are established.


This could be solved too, but I do not see how to do it in a way that is independent of the wsgi server being used. With gunicorn, you could call make_init_request() again from the post_worker_init hook, and then the worker would be 100% ready before receiving its first request. To keep things simple, we decided to do without this for now, since in practice the delays are already gone.
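
For completeness, here is a sketch of what that could look like (post_worker_init is a gunicorn server hook; importing make_init_request from the project's wsgi module is my assumption about the wiring, and this is not something we actually deployed):

    # gunicorn.conf.py -- sketch only; not the approach we ended up using
    def post_worker_init(worker):
        # Called in each worker process right after it has been forked and initialized.
        # Re-issuing the fake status request here makes the worker open its own
        # database and cache connections before it serves real traffic.
        from myproject.wsgi import make_init_request  # placeholder module path
        make_init_request()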



Source: https://habr.com/ru/post/345856/

