
Celery: best practices

If you are working with Django, then at some stage of development you will probably need background processing of long-running tasks. Chances are you will reach for some kind of tool to manage the task queues. Celery is currently one of the most popular projects for solving this kind of problem in the Python and Django world, though other projects exist for the same purpose.

While working on several projects that used Celery to manage task queues, a set of best practices emerged that I decided to write down. Although "best practices" is a big term for what is really my view of the right approach to these problems, together with a few underused features that the Celery project's community offers.

No.1: Do not use a DBMS as your AMQP broker


Let me explain why I think this is wrong (besides the limitations described in the Celery documentation).

A DBMS was not designed to do the job of a full-fledged AMQP broker such as RabbitMQ. It will fall over under real-world load even on a project with fairly modest traffic and user base.
I assume the most popular reason people decide to use a DBMS is that they already have one serving the web application, so why not reuse it. Getting started that way is easy, and there is no need to worry about additional components (such as RabbitMQ).

Suppose a not-so-hypothetical scenario: you have 4 background workers processing tasks that you put into the database. That means 4 processes frequently polling the database for new tasks, not counting the competing threads each of them may have. At some point you notice that the task processing delay is growing: more new tasks arrive than get completed, so you increase the number of workers. Suddenly your database starts to buckle under the huge number of queries from the workers, disk I/O exceeds its limits, and all of this begins to affect your application, because the workers have effectively mounted a DDoS attack on your own database.

This would not happen with a full-fledged AMQP broker, because the queue lives in memory, which removes the heavy load on the disk. Consumers (the workers) do not need to poll for new work, since the queue has a mechanism for pushing new tasks to them, and even if the broker does become overloaded for some reason, it will not crash or slow down the user-facing web application.

I will go even further and say that you should not use a DBMS as a broker even during development, now that there are tools like Docker and plenty of pre-configured images that provide a ready-made RabbitMQ out of the box.
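For local development this can be as simple as one Docker command plus one line of configuration. A minimal sketch (the container name and image tag are my own choices; guest/guest is RabbitMQ's default local account):

    # Start a throwaway RabbitMQ for development:
    #   docker run -d --name dev-rabbitmq -p 5672:5672 rabbitmq:3

    # celeryconfig.py -- point Celery at RabbitMQ instead of the database
    BROKER_URL = 'amqp://guest:guest@localhost:5672//'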

No.2: Use more queues (i.e. not just the one you get by default)


Celery is very easy to start using, and it immediately provides one default queue where all tasks go until a different behavior is explicitly configured. The most common example you will see:

    @app.task()
    def my_taskA(a, b, c):
        print("doing something here...")

    @app.task()
    def my_taskB(x, y):
        print("doing something here...")


What happens is that both tasks end up in the same queue, unless specified otherwise in the celeryconfig.py file. I fully understand what might justify this approach: you have a single decorator that conveniently creates background tasks. But taskA and taskB, while sitting in the same queue, may do completely different things, and one of them may be far more important than the other, so why keep them all in the same basket? Even if you have just one worker, imagine a situation where the less important taskB arrives in such volume that the worker cannot give the more important taskA the attention it needs. This brings us to the next point.

No.3: Use Worker Priorities


The solution to the problem above is to place taskA in one queue and taskB in another, and then assign x workers to process Q1 and all the remaining workers to process Q2, since more tasks arrive there. This way you can be sure that taskB always has enough workers, while the rest handle the lower-priority task whenever it arrives, without creating long waits in processing. Therefore, define your queues yourself:

    from kombu import Queue, Exchange

    CELERY_QUEUES = (
        Queue('default', Exchange('default'), routing_key='default'),
        Queue('for_task_A', Exchange('for_task_A'), routing_key='for_task_A'),
        Queue('for_task_B', Exchange('for_task_B'), routing_key='for_task_B'),
    )

And your routes, which determine where each task is sent:
    CELERY_ROUTES = {
        'my_taskA': {'queue': 'for_task_A', 'routing_key': 'for_task_A'},
        'my_taskB': {'queue': 'for_task_B', 'routing_key': 'for_task_B'},
    }

This lets you run a dedicated worker for each queue:
    celery worker -E -l INFO -n workerA -Q for_task_A
    celery worker -E -l INFO -n workerB -Q for_task_B
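With this configuration in place, calling the tasks the usual way is enough for each one to land in its own queue. A quick sanity-check sketch (assuming the tasks above live in a tasks.py module, which is my assumption, not the article's):

    from tasks import my_taskA, my_taskB

    # .delay() publishes to the broker; CELERY_ROUTES decides the queue,
    # so this goes to for_task_A...
    my_taskA.delay(1, 2, 3)
    # ...and this goes to for_task_B
    my_taskB.delay(4, 5)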

No.4: Use Celery's mechanisms for error handling


Most of the tasks I have seen have no error handling mechanisms at all. If an error occurs inside a task, it simply fails. That may be fine for some tasks, but most of the tasks I saw talked to external APIs and failed because of network errors or other "resource availability" problems. The simplest approach to handling such errors is to retry the task, since the problem with the external API may well have cleared up by then.

    @app.task(bind=True, default_retry_delay=300, max_retries=5)
    def my_task_A(self):   # bind=True passes the task instance as self
        try:
            print("doing stuff here...")
        except SomeNetworkException as e:
            print("maybe do some cleanup here....")
            raise self.retry(exc=e)

I like to define defaults for how long a task waits before retrying and how many attempts it makes before finally giving up and raising the error (the default_retry_delay and max_retries parameters, respectively). This is the simplest form of error handling I can imagine, yet I see it used almost never. Of course, Celery offers more sophisticated error handling methods; they are described in the Celery documentation.
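As one example of those richer options: newer Celery versions (4.x and later) can declare retries in the decorator itself, so the try/except disappears. A sketch, reusing the SomeNetworkException placeholder from above:

    @app.task(autoretry_for=(SomeNetworkException,),
              retry_backoff=True,   # exponential delay between attempts
              max_retries=5)
    def my_task_A():
        # any SomeNetworkException raised here triggers an automatic retry
        print("doing stuff here...")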

No.5: Use Flower


Flower is a great tool for tracking the state of your tasks and Celery workers. It has a web interface and lets you do things like watch task progress and history, view the details of completed tasks, and shut down or restart workers.

You can see the full list of features in the Flower documentation.
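Getting it running takes one command; per the Flower docs, the broker URL can be passed through the standard Celery options (the URL below assumes the local RabbitMQ from point No.1):

    pip install flower
    celery flower --broker=amqp://guest:guest@localhost:5672//
    # then open http://localhost:5555 in a browser (5555 is Flower's default port)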

No.6: Track task status only if you need it.


A task's status is information about whether the task finished successfully or not. It can be useful for some statistical metrics. The important thing to understand here is that the status is not the result of the task's work; the results are most often implicit changes written to the database (such as changes to a user's friends list).
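To make the distinction concrete, here is a minimal sketch of what "status" means in code (note that reading it requires a result backend to be configured):

    result = my_taskA.delay(1, 2, 3)
    print(result.state)   # e.g. 'PENDING', later 'SUCCESS' or 'FAILURE'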

In most projects I have seen, nobody actually cared about a task's status after it finished, yet they stored it in the SQLite database offered by default, or, better yet, burned the time of a big PostgreSQL DBMS on it. Why load your application's database for nothing? Set CELERY_IGNORE_RESULT = True in your celeryconfig.py configuration file and discard this data.
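If a few tasks are exceptions and do need their status tracked, my understanding is that the global setting can be overridden per task via ignore_result; a sketch:

    @app.task(ignore_result=False)   # keep the status for this task only
    def my_important_task():
        print("this task's status will be stored")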

No.7: Do not pass database / ORM objects to tasks


After I presented the points above at meetups of local Python developer groups, some people suggested adding one more item to the list. What is it about? You should not pass database objects (for example, a user model) to a background task, because the serialized object may contain stale or incorrect data by the time it is processed. If you need one, pass the user ID to the task, and inside the task fetch that user fresh from the database.
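A sketch of both variants (the task name and the Django User model are purely illustrative):

    from django.contrib.auth.models import User

    # Risky: the user object is serialized when the task is queued and may
    # be stale or invalid by the time a worker picks it up
    @app.task()
    def send_welcome_email_bad(user):
        print("emailing %s" % user.email)

    # Better: pass the primary key and fetch fresh data inside the task
    @app.task()
    def send_welcome_email(user_id):
        user = User.objects.get(pk=user_id)
        print("emailing %s" % user.email)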

Source: https://habr.com/ru/post/269347/

