Many people choose Django for its simplicity: Django code is concise, and we spend less time on workarounds and more on business logic.
Gevent is chosen for the same reason: it is simple, very clever, and brings no callback hell with it.
So a great idea suggests itself: combine the two simple and convenient things. We monkey-patch Django and enjoy the same simplicity and brevity with far better concurrency: we fire off many requests to other sites, spawn subprocesses, and generally use our new asynchronous Django to the fullest.
But by combining them, we unwittingly lay a few traps in our own path.
Django ORM and database connection pool
Django was created as a framework for synchronous applications: each process receives a request from the user, processes it completely, sends the result, and only then can it pick up the next request. The principle is dead simple. It is a great architecture for a blog or a news site, but we want more.
Take a simple example that mimics busy HTTP activity: a link-shortening service.
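A minimal sketch of what such a view might look like (the LinkModel model, its short_code field, and the availability check via urllib2 are assumptions for illustration):

import urllib2

from django.http import HttpResponse

from myapp.models import LinkModel  # hypothetical model with url and short_code fields


def shorten(request):
    url = request.GET['url']
    # Blocking network call: check that the target page responds at all.
    # Without gevent this holds the whole worker hostage while it waits.
    urllib2.urlopen(url, timeout=10).read()
    link, _ = LinkModel.objects.get_or_create(url=url)
    return HttpResponse(link.short_code)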
Without gevent this code would be painfully slow and would barely serve two or three simultaneous requests, but with gevent everything flies.
We run our project under uwsgi (which has become the de facto standard for deploying Python sites):
uwsgi --http-socket 0.0.0.0:8000 --gevent 1000 -M -p 2 -w testproject.wsgi
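Note that the standard library still has to be monkey-patched before Django loads anything; a common approach (a sketch, assuming the testproject layout above) is to do it at the very top of wsgi.py:

# testproject/wsgi.py
from gevent import monkey
monkey.patch_all()  # must run before Django or anything that touches sockets

import os
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "testproject.settings")

from django.core.wsgi import get_wsgi_application
application = get_wsgi_application()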
We test with ten simultaneous link-shortening requests and rejoice: all of them are processed without errors in next to no time.
We launch our new service and sit back, watching it grow. The load climbs from 10 to 75 simultaneous requests, and the service shrugs it off.
Then, one night, several thousand emails arrive in the inbox with the following content:
Traceback:
...
> link = LinkModel.objects.get(url=url)
OperationalError: FATAL: remaining connection slots are reserved for non-replication superuser connections
And you are lucky if you set the en_US.UTF-8 locale in postgresql.conf, because if you used the default Ubuntu/Debian configuration, you will receive a thousand emails with a message like:
OperationalError: ?????: ?????????? ????? ??????????? ??????????????? ??? ??????????? ????????????????? (?? ???? ??????????)
The application created too many connections to the database (PostgreSQL allows a maximum of 100 by default), and it was punished for that.
This is the very first pitfall: Django has no database connection pool, because synchronous code simply does not need one. A synchronous Django process cannot handle requests in parallel; it serves exactly one request at a time, and therefore never needs more than one database connection.
In fact, Django can also work in multi-threaded mode, in which a single process serves several requests at once. That is the server the manage.py runserver command launches, and the documentation says outright that this mode is completely unsuitable for production use.
The solution: we urgently need a database connection pool.
There are relatively few pool implementations for Django, for example django-db-pool and django-psycopg2-pool. The first is based on psycopg2.pool.ThreadedConnectionPool, which throws an exception when you try to take a connection from an empty pool: the application will fail just as before, but at least other applications will still be able to connect to the database. The second is based on gevent.queue.Queue: a greenlet that tries to take a connection from an empty pool is blocked until another greenlet returns a connection to it.
Most likely you will pick the second solution as the more logical one.
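To make the difference concrete, here is a toy sketch (not the actual library code) of the gevent.queue-based approach, in which an empty pool blocks the calling greenlet instead of raising:

import gevent.queue
import psycopg2


class ToyConnectionPool(object):
    def __init__(self, maxsize, **connect_kwargs):
        # Eagerly create all connections; a real pool would be lazier.
        self._queue = gevent.queue.Queue(maxsize)
        for _ in xrange(maxsize):
            self._queue.put(psycopg2.connect(**connect_kwargs))

    def getconn(self):
        # Blocks (yields to the gevent hub) until a connection is free,
        # instead of raising like psycopg2.pool.ThreadedConnectionPool does.
        return self._queue.get()

    def putconn(self, conn):
        self._queue.put(conn)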
Database queries inside greenlets
We have already patched the application with gevent, and few blocking calls remain, so why not squeeze the most out of greenlets? We can make several HTTP requests at once or spawn subprocesses. We might also want to use the database inside a greenlet:
import gevent
from django.http import HttpResponse


def some_view(request):
    greenlets = [gevent.spawn(handler, i) for i in xrange(5)]
    gevent.joinall(greenlets)
    return HttpResponse("Done")


def handler(number):
    obj = MyModel.objects.get(id=number)
    obj.response = send_http_request_somewhere(obj.request)
    obj.save(update_fields=['response'])
A few hours pass, and suddenly our application stops working entirely: every request ends in a 504 Gateway Timeout. What happened this time? To find out, we will have to read some Django code.
All connections are stored in django.db.connections, an instance of the django.db.utils.ConnectionHandler class. When the ORM is about to run a query, it asks for a database connection by evaluating connections['default']. ConnectionHandler.__getitem__, in turn, looks for a connection in ConnectionHandler._connections and, if there is none, creates a new one.
All open connections must be closed after use. This is handled by the request_finished signal, which is fired from django.http.HttpResponseBase.close: Django closes database connections at the very last moment, when nothing can possibly need them anymore, which is quite logical.
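In the Django versions of that era this wiring was literally one line in django/db/__init__.py (quoted approximately):

# django/db/__init__.py (old versions, approximately)
signals.request_finished.connect(close_connection)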
The whole snag is in how ConnectionHandler stores the connections. It uses threading.local, which after monkey patching turns into gevent.local.local. Once declared, this data structure behaves as if it were unique in every greenlet. The some_view view started executing in one greenlet, where ConnectionHandler._connections already held a connection. We then spawned new greenlets, in which ConnectionHandler._connections was empty, so each of them took another connection from the pool. When those greenlets died, the contents of their local() vanished with them: the connections were irretrievably lost, and nobody will ever return them to the pool. Over time the pool drains completely.
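A tiny standalone demonstration of the mechanism, with no Django involved:

import gevent
import gevent.local

storage = gevent.local.local()  # what threading.local becomes after patching
storage.conn = "connection owned by the main greenlet"


def worker():
    # The attribute set above is invisible here: every greenlet sees
    # its own, initially empty, copy of storage.
    print(hasattr(storage, "conn"))  # False


gevent.joinall([gevent.spawn(worker)])
print(storage.conn)  # still set in the main greenlet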
When developing with Django + gevent you should always keep this nuance in mind and close the database connection at the end of every greenlet by calling django.db.close_connection. It must also be called when an exception occurs, which a small context-manager decorator can take care of.
An example of such a decorator:

class autoclose(object):
    def __init__(self, f=None):
        self.f = f

    def __call__(self, *args, **kwargs):
        with self:
            return self.f(*args, **kwargs)

    def __enter__(self):
        pass

    def __exit__(self, exc_type, exc_info, tb):
        from django.db import close_connection
        close_connection()
        return exc_type is None
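Applied to the handler from the earlier example, it looks like this:

@autoclose
def handler(number):
    obj = MyModel.objects.get(id=number)
    obj.response = send_http_request_somewhere(obj.request)
    obj.save(update_fields=['response'])
    # On exit, normal or exceptional, __exit__ calls close_connection(),
    # so this greenlet's connection goes back to the pool.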
It is wise to use this wrapper carefully: close all connections before every greenlet switch (for example, before urllib2.urlopen), and make sure that connections are never closed inside an unfinished transaction or while iterating over a queryset like Model.objects.all().
Using the Django ORM separately from Django
We can run into the same problems if we build an analogue of cron or Celery that queries the database from time to time. The same fate awaits us if we serve Django with gevent.pywsgi.WSGIServer while also running services speaking other protocols that use the Django ORM. The main thing is to return connections to the database pool in time; then the application will work stably and bring you joy.
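For example, a periodic worker along these lines (a sketch; the model and the interval are made up) needs exactly the same discipline:

import gevent
from django.db import close_connection

from myapp.models import MyModel  # hypothetical


def periodic_task():
    while True:
        MyModel.objects.filter(processed=False).update(processed=True)
        close_connection()  # hand the connection back to the pool right away
        gevent.sleep(60)


gevent.spawn(periodic_task).join()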
Conclusions
This post boiled down to two elementary rules: use a database connection pool, and return each connection to the pool as soon as you are done with it. You would follow them without thinking if you worked with gevent and psycopg2 directly. But the Django ORM operates at such a high level of abstraction that the developer never touches database connections, so over time these rules get forgotten and have to be painfully rediscovered.