It is written, because Another holivar appeared in the comments on SQL vs ORM in High-Load Project (HL)
Preamble
In the note you can find, in places, banal things. Most of them are available in the documentation, but a modern person often likes to miss everything superficially. And many simply did not have the opportunity to test themselves in HL projects.
Reading the article, remember:
- You can never implement an HL project based on just one ORM manipulation.
- Never put complicated things on the shoulders of the database. You need it to store info, not count factorials!
- If you cannot implement the idea that interests you using simple ORM tools, do not use ORM to directly solve the problem. And the more so do not climb into a lower level, crutches break. Find a more elegant solution.
- Sorry for the mockingly humorous tone of the article. Otherwise boring :)
- All information is based on Django version 1.3.4.
- Keep it simple!
And-and-and yes, the article will show the errors in understanding ORM, which I encountered in more than three years of working with Django.
ORM not understood
I'll start with the classic mistake that haunted me for a long time. In terms of beliefs in the Uruguayan monkey tribe. I strongly believed in the omnipotence of Django ORM, namely in
Klass.objects.all()
eg:
all_result = Klass.objects.all() result_one = all_result.filter(condition_field=1) result_two = all_result.filter(condition_field=2)
')
In my dreams, thinking went like this:
- I chose all that I was interested in, by one query on the first line.
- In the second line, I will no longer have a request, but will work with the result obtained on the first condition.
- In the third line, I also will not have a query to the database, and I will have an output of interest to me with the second condition based on the results of the first query.
You, probably, already guess that magic monkeys do not exist and in this case we have three requests. But, I will disappoint you. In this case, we still have two requests, and to be even more precise, then there is not a single request based on the results of the work of this script (but in the future, of course, we will not be doing so). Why, you ask?
I explain in order. Let us prove that in this code there are three requests:
Hooray! We proved that we have three requests. But the main phrase is “in calculations”. In fact, we come to the second part - proof that we have only two requests.
For this problem we will be helped by the following understanding of ORM (in 2 sentences):
- So far we have not calculated anything - we are only forming a query using ORM tools. As soon as we started to calculate, we calculate by the received generated query.
So, in the first line we designated the variable
all_result with the query of interest - select all.
In the second and third line, we refine our request for a sample of additional. conditions. Well, therefore, received 2 requests. What should prove
Attentive readers (why did you look at the previous paragraphs again?) Should have already guessed that we didn’t make any requests. And in the second and third lines, we just as easily formed a query that interests us, but we didn’t contact the database with it.
So we were engaged in nonsense. And the calculations will begin, for example, from the first line of the subordinate code:
for result in result_one: print result.id
Not always necessary functions and reasonable samples.
Let's try to play with the templates, and the favorite function of some
__unicode __ () .
You know - cool feature! In any place, at any time and under any circumstances we can get the name that interests us. Super! And it's super as long as we have in the output does not appear
ForeignKey . As soon as it appears, count everything is gone.
Consider a small example. We have news in one line. There are regions to which these news are attached:
class RegionSite(models.Model): name = models.CharField(verbose_name="", max_length=200,) def __unicode__(self): return "%s" % self.name class News(models.Model): region = models.ForeignKey(RegionSite, verbose_name="") date = models.DateField(verbose_name="", blank=True, null=True, ) name = models.CharField(verbose_name="", max_length=255) def __unicode__(self): return "%s (%s)" % (self.name, self.region)
We need to display the 10 latest news, with the name as defined in
News .__ unicode __ ()Uncover sleeves, and write:
news = News.objects.all().order_by("-date")[:10]
In the template:
{% for n in news %} {{ n }} {% endfor %}
And here we dug ourselves a hole. If this is not news or not 10 - but 10 thousand, then be prepared for the fact that you will receive 10 000 requests + 1. And all because of the
muddling ForeignKey .
An example of an extra 10 thousand queries (and say thank you for having a small model - so all the fields and values ​​of the model would be chosen, be it 10 or 50 fields):
SELECT `seo_regionsite`.`id`, `seo_regionsite`.`name` FROM `seo_regionsite` WHERE `seo_regionsite`.`id` = 1 SELECT `seo_regionsite`.`id`, `seo_regionsite`.`name` FROM `seo_regionsite` WHERE `seo_regionsite`.`id` = 1 SELECT `seo_regionsite`.`id`, `seo_regionsite`.`name` FROM `seo_regionsite` WHERE `seo_regionsite`.`id` = 2 SELECT `seo_regionsite`.`id`, `seo_regionsite`.`name` FROM `seo_regionsite` WHERE `seo_regionsite`.`id` = 1
Why it happens? All up to genital is simple. Every time you get the name of the news, we have a request to the
RegionSite to return its
__unicode __ () value, and substitute in brackets to display the name of the region of the news.
Similarly, a bad situation begins when we, for example, in a template using ORM, are trying to get to the value we need, for example:
{{ subgroup.group.megagroup.name }}
You will not believe what a hard request may be there :) I’m not even saying that you may have dozens of such samples in the template!
We are not so easy to take us - we sobbed and took advantage of the following excellent ORM feature -
.values ​​() .
Our line of code in the magic-keyboard way turns into:
news = News.objects.all().values("name", "region__name").order_by("-date")[:10]
A pattern:
{% for n in news %} {{ n.name }} ({{ n.region__name }}) {% endfor %}
Note the double underscore. It will be useful to us soon. (For those who do not know - double underscore, as if the connection between the models, if you speak roughly)
With such uncomplicated manipulations, we got rid of 10 thousand requests and left only one. By the way, yes, it will work out with JOIN and with the fields selected by us!
SELECT `news_news`.`name`, `seo_regionsite`.`name` FROM `news_news` INNER JOIN `seo_regionsite` ON (`news_news`.`region_id` = `seo_regionsite`.`id`) LIMIT 10
We are happy to madness! After all, we have just become ORM-optimizers :) Fig something there! I will tell you :) This optimization is optimization until we have 10 thousand news. But we can even faster!
To do this, take into account our prejudices in the number of requests and urgently increase the number of requests 2 times! Namely, let's prepare the data:
regions = RegionSite.objects.all().values("id", "name") region_info = {} for region in regions: region_info[region["id"]] = region["name"] news = News.objects.all().values("name", "region_id").order_by("-date")[:10] for n in news: n["name"] = "%s (%s)" % (n["name"], region_info[n["region_id"]])
And then the output in the template of our freshly variable is:
{% for n in news %} {{ n.name }} {% endfor %}
Yes, I understand ... With these lines we violated the concept of MVT. But this is just an example that can be easily converted into strings that do not violate the MVT standards.
What have we done?
- We have prepared data on regions and entered information about them in the dictionary:
SELECT `seo_regionsite`.`id`, `seo_regionsite`.`name` FROM `seo_regionsite`
- Chose from the news all that interests us + pay attention to the single underscore.
SELECT `news_news`.`name`, `news_news`.`region_id` FROM `news_news` LIMIT 10
It was with a single underscore that we chose the direct value of the bundle in the database. - Python tied two models.
Believe me, on a single ForeignKey you will hardly notice a gain in speed (especially if there are few selectable fields). However, if your model has a connection through foridzhn with more than one model - this is where the celebration of this decision begins.
Continue izgolyatsya over double and single underscore.
Consider a simple example to the banality:
item.group_id vs. item.group.id
Not only when building queries, but when processing the results, you can run into this feature.
Example:
for n in News.objects.all(): print n.region_id
The request will be only one - when selecting news
Example 2:
for n in News.objects.all(): print n.region.id
Requests will be 10 thousand + 1, because in each iteration we will have our id request. It will be similar to:
SELECT `seo_regionsite`.`id`, `seo_regionsite`.`name` FROM `seo_regionsite` WHERE `seo_regionsite`.`id` = 1
This is the difference due to one sign.
Many advanced dzhangovody now poke a finger in a voodoo doll with my code. And at the same time they ask me a question - what are you doing in a blizzard with the preparation of data, and where are the
values_list ("id", flat = True) ?
Consider a wonderful example showing the need for accuracy in working with
value_list :
regions_id = RegionSite.objects.filter(id__lte=10).values_list("id", flat=True) for n in News.objects.filter(region__id__in=regions_id): print n.region_id
With these code lines, we:
- Prepare a list of regions of interest to us id-schnick of regions for some abstract condition.
- The resulting result is inserted into our news request and we get:
SELECT `news_news`.`id`, `news_news`.`region_id`, `news_news`.`date`, `news_news`.`name` FROM `news_news` WHERE `news_news`.`region_id` IN (SELECT U0.`id` FROM `seo_regionsite` U0 WHERE U0.`id` <= 10 )
Request in request! Uuuuh, I love :) Especially choose 10 thousand news with nested select with IN (10 thousand aids)
You certainly understand what this means? :) If not, then understand - nothing, absolutely nothing good!
The solution to this issue is also simple to genius. Recall the beginning of our article - no query appears without calculating a variable. And we make a remark, for example, on the second line of the code:
for n in News.objects.filter(region__id__in=list(regions_id)):
And with this solution we get 2 simple requests. Without investment.
You have not yet captured the spirit of the padl, ORM reserved for us? Then drop even deeper. Consider the code:
regions_id = list(News.objects.all().values_list("region_id", flat=True)) print RegionSite.objects.filter(id__in=regions_id)
With these two lines, we select the list of regions for which we have news. Everything in this code is great, except for one moment, namely the resulting request:
SELECT `seo_regionsite`.`id`, `seo_regionsite`.`name` FROM `seo_regionsite` WHERE `seo_regionsite`.`id` IN (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9) LIMIT 21
Ahaha, ORM, stop it! What are you doing!
Not only that he from all the news (I have 256 in their example, sort of) he chose the regions id and just substituted them, so he also took from somewhere limit 21. About the limit everything is simple - this is how the print of a large number of array values ​​(I I did not find another excuse), but with the meanings there is clearly an ambush.
The solution, as in the previous example, is simple:
print RegionSite.objects.filter(id__in=set(regions_id)).values("id", "name")
By removing the extra elements via
set (), we received a completely adequate request, as expected:
SELECT `seo_regionsite`.`id`, `seo_regionsite`.`name` FROM `seo_regionsite` WHERE `seo_regionsite`.`id` IN (1, 2, 3, 4, 9) LIMIT 21
Everyone is happy, everyone is happy.
Looking slightly over the historically written code, I’ll highlight another pattern you should be aware of. And again the code example:
region = RegionSite.objects.get(id=1) t = datetime.datetime.now() for i in range(1000): list(News.objects.filter(region__in=[region]).values("id")[:10]) # list(News.objects.filter(region__id__in=[region.id]).values("id")[:10]) # list(News.objects.filter(region__in=[1]).values("id")[:10]) # list(News.objects.filter(region__id__in=[1]).values("id")[:10]) print datetime.datetime.now() - t
Each of the iteration lines was sequentially included (so that only one worked). Total we can get the following approximate numbers:
- 1 line - 6.362800 s
- 2 line - 6.073090 sec.
- 3 line - 6.431563 sec
- 4 line - 6.126252 sec
The differences are minimal, but visible. Preferred 2 and 4 options (I mostly use 4m). The main loss of time is how quickly we create the query. Trivial, but significant, I think. Every reader will make conclusions on their own.
And we will finish the article with a scary word - a
transaction .
Special case:
- You have InnoDB
- You need to update the data in the table in which customers do not write, but only read (for example, a list of products)
Update / insert is done once or twice.
- Prepare 2 dictionaries - to insert data and update data
- Each of the dictionaries throw in its function
- PROFIT!
An example of a real update feature:
@transaction.commit_manually def update_region_price(item_prices): """ """ from idea.catalog.models import CatalogItemInfo try: for ip in item_prices: CatalogItemInfo.objects.filter( item__id=ip["item_id"], region__id=ip["region_id"] ).update( kost=ip["kost"], price=ip["price"], excharge=ip["excharge"], zakup_price=ip["zakup_price"], real_zakup_price=ip["real_zakup_price"], vendor=ip["vendor"], srok=ip["srok"], bonus=ip["bonus"], rate=ip["rate"], liquidity_factor=ip["liquidity_factor"], fixed=ip["fixed"], ) except Exception, e: print e transaction.rollback() return False else: transaction.commit() return True
An example of a real add function:
@transaction.commit_manually def insert_region_price(item_prices): """ """ from idea.catalog.models import CatalogItemInfo try: for ip in item_prices: CatalogItemInfo.objects.create(**ip) except Exception, e: print e transaction.rollback() return False else: transaction.commit() return True
Knowing these moments, you can build effective applications using Django ORM, and not get into the SQL code.
Answers on questions:
Since such a dance has gone, then write when it is worth using ORM, and when it is not. (c)
lvoI think that ORM should be used whenever it is simple. It is not necessary to fold on the shoulders of ORM, and even more so the base requests of the type:
User.objects.values('username', 'email').annotate(cnt=Count('id')).filter(cnt__gt=1).order_by('-cnt')
Especially on HL-production. Get for yourself a separate system server in which so be izgolaytes.
If you are not able to write with simple “ORM requests”, then change the algorithm for solving the problem.
For example, the client in IM has filtering by characteristics, using regularizers. Cool, flexible stuff, until there are so many visitors to the site. Changed the approach, instead of the standard Client-ORM-Base-ORM-Client, rewrote to the Client-MongoDB-Python-Client. Data in MongoDB is generated by means of the ORM on the system server. As it was said before - HL cannot be achieved by ORM manipulations alone.
I wonder why Django. What are the advantages of this framework (and its ORM) compared to other frameworks / technologies. (c)
anjensanHistorically. Python began to study with Django. And the knowledge in the technology of its use I bring to the maximum. Now in a parallel study of Pyramid. I can while compare only with PHP, and their frameworks, CMS. Probably I will say a common phrase -
I spent my time inefficiently when I wrote in PHP .
Now I can name a couple of serious flaws in Django 1.3.4:
- Constant connection / disconnection with the base (corrected in older versions)
- The speed of the template-processor. According to the tests found in the network, it is sufficiently small. Need to change :)
In general, there is one cool technique, how to increase the speed of the generation of the template-processor.
Never pass variables to a template through
locals () - with volumetric functions and intermediate variables - you will get a silent slowly moving dying monster :)
What kind of programmer is it that makes it difficult to write a SQL query? (c)
andreynikishaevA programmer who values ​​his time on the program code, and not on the means of interaction between the Base-Code data processing. You need to know SQL - very often I work directly with the database console. But in the code - ORM. ORM is easier and faster subject to change, or addition. And also, if you write reasonably light queries, it is easy to read and understand.
Sorry, everyone! (Blah blah ... waiting for comments, suggestions, questions, suggestions)