Faced with significant performance losses when using the Django ORM, I began looking for a way out and experimented with different ways of using the ORM. What came of it is described below.
How is a typical piece of code using the Django ORM written? As a rule, it lives inside some function, for example a view: it receives parameters and produces a result based on them.
As an example, consider the following elementary situation: we want to get the list of names of the groups the current user belongs to. The simplest and most obvious approach, the one that comes to mind first, is to fetch the list of groups through the relationship and read their names:
def myview(request):
    u = request.user
    a = [g.name for g in u.groups.all()]
    ...
Let's check what the performance of this snippet is, keeping in mind that the request.user object has already been fetched during preliminary request processing.
Create a group named thetest and add the very first user to it:
>>> u = User.objects.all()[0]
>>> g = Group(name='thetest')
>>> g.save()
>>> u.groups.add(g)
>>> u.groups.all()
[<Group: thetest>]
I will use this setup in all further tests. Since everything is done in the shell, I will also reuse the u variable obtained at this step. So, test number one: execute the intended piece of code and check that it really returns the desired list:
>>> a = [g.name for g in u.groups.all()]
>>> a
[u'thetest']
To measure performance, we will execute it 1000 times.
>>> def test1():
...     import datetime
...     t1 = datetime.datetime.now()
...     for i in xrange(1000):
...         a = [g.name for g in u.groups.all()]
...     t2 = datetime.datetime.now()
...     print "%s" % (t2 - t1)
...
>>> test1()
0:00:01.437324
A thousand iterations of the loop completed in about a second and a half, which gives roughly 1.5 milliseconds per query.
Experienced Django developers are probably already preparing to point out that this snippet is far from optimal. Indeed, at first glance one can write a more optimal piece of code that does the same thing without constructing Group objects, retrieving from the database only the data we actually need:
>>> a = [g['name'] for g in u.groups.values('name')]
>>> a
[u'thetest']
Well, let's measure this piece too.
>>> def test2():
...     import datetime
...     t1 = datetime.datetime.now()
...     for i in xrange(1000):
...         a = [g['name'] for g in u.groups.values('name')]
...     t2 = datetime.datetime.now()
...     print "%s" % (t2 - t1)
...
>>> test2()
0:00:01.752529
It may seem counterintuitive, but could the second version of our code really be less optimal than the first? In fact, it is. The overhead of the values() call and the additional query analysis turns out to be higher than the potential savings on constructing Group objects and fetching all of their fields.
But wait a moment. Why should we re-construct and re-analyze the query every time at all, if in our view we always execute the same query, and only the user it is executed for differs?
Unfortunately, out of the box Django does not let you prepare a query in advance and then refer to the prepared query as needed. There are no corresponding calls, and the query-building syntax assumes that only concrete values are used as query parameters.
So we have to dig into the source a little. Taking this opportunity, I would like to thank the developers of django_extensions for their wonderful shell_plus command, which greatly simplifies introspecting objects.
It turns out that a QuerySet object (the one returned, for example, by objects.all()) has a query attribute, an instance of django.db.models.sql.query.Query, which in turn has a sql_with_params() method. This method returns data completely ready to be passed to cursor.execute(): an SQL statement string and its parameters. Most remarkably, those parameters are exactly the values passed to the QuerySet when it is built:
>>> u.groups.all().values('name').query.sql_with_params()
('SELECT `auth_group`.`name` FROM `auth_group` INNER JOIN `auth_user_groups` ON (`auth_group`.`id` = `auth_user_groups`.`group_id`) WHERE `auth_user_groups`.`user_id` = %s ', (1,))
Now, if we obtain the prepared SQL once and substitute different parameter values into it, we can execute the query without spending resources on building it each time.
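To illustrate the idea before wrapping it into a class: the prepared SQL can be fed straight to a database cursor with any parameter value, bypassing queryset construction entirely. A minimal sketch (not from the original article), assuming the default connection and the %s placeholder style shown above:

from django.db import connection

# obtain the SQL text and its parameters once
sql, params = u.groups.all().values('name').query.sql_with_params()

# later: run the same SQL for an arbitrary user id without rebuilding the queryset
cursor = connection.cursor()
cursor.execute(sql, [u.id])
names = [row[0] for row in cursor.fetchall()]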
To do this, let's create a special class that hides all the details of the hack we are about to make:
from django.db import connection
from django.db.models.query import QuerySet,ValuesQuerySet
import django
from threading import local

class PQuery(local):
    def __init__(self,query,connection=connection,**placeholders):
        self.query = query
        self.connection = connection
        self.placeholders = placeholders
        self.replaces = {}
        sql = None
        try:
            sql = self.query.query.sql_with_params()
        except AttributeError:
            # assumption: fall back to the SQL compiler on Django versions
            # that lack Query.sql_with_params()
            sql = self.query.query.get_compiler(self.query.db).as_sql()
        self.places = list(sql[1])
        self.sql = sql[0]
        self.is_values = isinstance(query,ValuesQuerySet)
        self.cursor = None
        for p in self.placeholders:
            self.replaces[p] = self.places.index(self.placeholders[p])

    def execute(self,**kw):
        # execute() follows the previous version of the class shown below
        for k in kw:
            self.places[self.replaces[k]] = kw[k]
        if not self.cursor:
            self.cursor = self.connection.cursor()
        self.cursor.execute(self.sql,self.places)
        if not hasattr(self,'fldnms'):
            self.fldnms = [col[0] for col in self.cursor.description]
        if self.is_values:
            return [dict(zip(self.fldnms,row)) for row in self.cursor.fetchall()]
        return [self.query.model(**dict(zip(self.fldnms,row))) for row in self.cursor.fetchall()]
UPD 2012-08-06 19:20 MSK: the code above has been amended for multithreading compatibility, minor bugs with complex queries have been fixed, and usability has been improved.
Previous code version:

from django.db import connection
from django.db.models.query import QuerySet,ValuesQuerySet

class ParametrizedQuery:
    def __init__(self,query,connection=connection,**placeholders):
        self.query = query
        self.connection = connection
        self.placeholders = placeholders
        self.replaces = {}
        sql = self.query.query.sql_with_params()
        self.places = list(sql[1])
        self.sql = sql[0]
        self.is_values = isinstance(query,ValuesQuerySet)
        self.cursor = None
        for p in self.placeholders:
            v = self.placeholders[p]
            self.replaces[p] = self.places.index(v)

    def execute(self,**kw):
        for k in kw:
            self.places[self.replaces[k]] = kw[k]
        if not self.cursor:
            self.cursor = self.connection.cursor()
        self.cursor.execute(self.sql,self.places)
        if not hasattr(self,'fldnms'):
            self.fldnms = [col[0] for col in self.cursor.description]
        if self.is_values:
            return [dict(zip(self.fldnms,row)) for row in self.cursor.fetchall()]
        return [self.query.model(**dict(zip(self.fldnms,row))) for row in self.cursor.fetchall()]
What does this class do? It takes a queryset and extracts the prepared SQL and parameters from it. We build the queryset so that each parameter we are going to substitute later has a special sentinel value known in advance. These sentinel values are then used to find the positions where the values passed at execution time should be substituted.
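In other words, the sentinel value is simply looked up in the parameter tuple to remember its position. A small sketch of that lookup outside the class (illustrative only, using the Group model and the u variable from the shell session above):

sql, params = Group.objects.filter(user__id=12345).values('name').query.sql_with_params()
places = list(params)        # e.g. [12345]
pos = places.index(12345)    # position of the user_id sentinel
places[pos] = u.id           # substitute the real value before cursor.execute()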
Several additional implementation details will also help us save resources.
- The fldnms property contains the list of field names obtained when the query is first executed. Subsequent calls reuse this cached list.
- The replaces property contains a mapping of substitution names to parameter positions.
- Each object of our class holds its own cursor. The potential speedup here comes, first, from the fact that creating a cursor is a fairly costly operation, and second, from the following note in the description of pyodbc, which can be used as a database backend: "It is also more efficient if you execute the same SQL repeatedly with different parameters. The SQL will be prepared only once."
- The is_values property tells us that the query should not return model objects, which saves on constructing those objects when returning results (a quick way to check this is shown right after this list).
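For reference, a short shell check (not from the original article) of how a values() queryset can be distinguished from a regular one in Django 1.4:

>>> from django.db.models.query import ValuesQuerySet
>>> isinstance(u.groups.values('name'), ValuesQuerySet)
True
>>> isinstance(u.groups.all(), ValuesQuerySet)
False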
We will slightly modify the original query so that it is convenient to plug the substitutions into it:
>>> q = Group.objects.filter(user__id=12345).values('name')
>>> q.query.sql_with_params()
('SELECT `auth_group`.`name` FROM `auth_group` INNER JOIN `auth_user_groups` ON (`auth_group`.`id` = `auth_user_groups`.`group_id`) WHERE `auth_user_groups`.`user_id` = %s ', (12345,))
The value 12345 is what we use as the substitution sentinel:
>>> p = ParametrizedQuery(q,user_id=12345)
>>> [g['name'] for g in p.execute(user_id=u.id)]
[u'thetest']
When p.execute() runs, the real user ID is substituted into the position marked by 12345.
Now let's try to see how the code performance will change:
>>> def test3():
...     import datetime
...     t1 = datetime.datetime.now()
...     for i in xrange(1000):
...         a = [g['name'] for g in p.execute(user_id=u.id)]
...     t2 = datetime.datetime.now()
...     print "%s" % (t2 - t1)
...
>>> test3()
0:00:00.217270
Now that's a result! Query execution time has dropped roughly sevenfold.
How to use it in real code?
First, you need a place where the prepared query can be stored. Second, at some point this variable has to be initialized, for example on the first execution of the function. And third, of course, the parametrized query is called instead of running the query directly.
def myview(request):
    if not hasattr(myview,'query'):
        myview.query = ParametrizedQuery(
            Group.objects.filter(user__id=12345).values('name'),
            user_id=12345)
    a = [g['name'] for g in myview.query.execute(user_id=request.user.id)]
    ...
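Note that myview.query is shared by every thread that serves requests, which is presumably why the updated PQuery class above inherits from threading.local: each thread then gets its own cursor and substitution state. The prepared query could just as well live at module level; a minimal sketch (the helper and variable names here are illustrative, not from the original code):

_group_names_query = None

def group_names_for(user):
    # build the parametrized query once, on first use
    global _group_names_query
    if _group_names_query is None:
        _group_names_query = PQuery(
            Group.objects.filter(user__id=12345).values('name'),
            user_id=12345)
    return [g['name'] for g in _group_names_query.execute(user_id=user.id)]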
All code was executed on:
- django.VERSION = (1, 4, 0, 'final', 0)
- mysql DBMS (django.db.backends.mysql)
- table engine = MYISAM
- connection through localhost
- Python 2.7.2+ (default, Oct 4 2011, 20:03:08) [GCC 4.6.1] on linux2
- Linux host seva 3.0.0-22-generic #36-Ubuntu SMP Tue Jun 12 17:13:04 UTC 2012 i686 athlon i386 GNU/Linux
Comments from experts are welcome.