Cursors let you read a large result set from the database in portions and process it without exhausting the application's memory. I am sure every web developer has faced such a task at least once, and I have more than once. In this article I will show where cursors are useful and give ready-made code for working with them from PHP + Doctrine, using PostgreSQL as the example.
Let's imagine we have a big, heavyweight PHP project, surely written with the help of some framework, say Symfony. It uses a database, say PostgreSQL, which holds a table with 2,000,000 order records. The project itself is an interface to these orders that can display and filter them, and, let's say, copes with this very well.
Now we have been asked (you haven't been asked yet? You will be) to export the filtered orders to an Excel file. Fine, let's quickly add a button with a table icon that spits out a file with the orders to the user.
What does a programmer who has never met this task do? A SELECT against the database, read the query results, convert the answer into an Excel file, and send it to the user's browser. The feature works, the test passes, but problems begin in production.
Surely we have a PHP memory limit set to some reasonable (debatable) 1 GB per process, and as soon as those 2 million rows no longer fit into that gigabyte, everything breaks. PHP crashes with an "out of memory" error, and users complain that the file never downloads. This happens because we chose a rather naive way to export the data: all of it is first transferred from the database's memory (and the disk underneath it) into the PHP process's memory, then processed and streamed to the browser.
To make sure the data always fits in memory, we need to take it from the database in pieces: read, say, 10,000 records, process them, write them to the file, and repeat.
"Fine," thinks the programmer meeting this task for the first time, "I'll loop and drain the query results in chunks using LIMIT and OFFSET." It works, but large OFFSETs are very expensive operations for the database, so the export starts taking not 30 seconds but 30 minutes (still not that bad!). By the way, if nothing besides OFFSET comes to the programmer's mind at that moment, there are other ways to achieve the same thing without straining the database, such as keyset pagination.
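To make the difference concrete, here is a sketch (function and table names are my own, purely illustrative) contrasting the two pagination styles at the SQL level. OFFSET makes the database scan and discard all skipped rows on every page; keyset pagination resumes from the last seen key and is a cheap index range scan:

```php
// Hypothetical helper: OFFSET pagination. Each later page forces the
// database to walk past all previously returned rows before emitting any.
function offsetPageSql(string $baseSql, int $limit, int $offset): string
{
    return sprintf('%s LIMIT %d OFFSET %d', $baseSql, $limit, $offset);
}

// Hypothetical helper: keyset pagination. Resumes from the last seen id,
// so each page is an index range scan (requires a unique ordered column).
function keysetPageSql(string $table, int $lastId, int $limit): string
{
    return sprintf(
        'SELECT * FROM %s WHERE id > %d ORDER BY id LIMIT %d',
        $table, $lastId, $limit
    );
}
```

With keyset pagination the caller remembers the last id of each batch and feeds it into the next query, so page 200 costs the same as page 1.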
Meanwhile, the database itself has a built-in facility for reading data from it sequentially: cursors.
A cursor is a pointer to a row within a query's result set that lives inside the database. Using one, we run the SELECT not in immediate-fetch mode but open a cursor for it instead, and then receive data from the database as the cursor moves forward. The result is the same, we read the data in batches, but the database does not repeat the work of finding the row to start from, as it does with OFFSET.
A cursor can be opened only inside a transaction and lives as long as the transaction does (there is an exception, see WITH HOLD). This means that if we read a lot of data slowly, we will have a long-running transaction. That is sometimes bad; you need to understand and accept this risk.
Let's try to implement work with cursors in Doctrine. For a start, what does a query to open a cursor look like?
BEGIN;

DECLARE mycursor1 CURSOR FOR (
    SELECT * FROM huge_table
);
DECLARE creates and opens a cursor for a given SELECT query. After creating the cursor, you can start reading data from it:
FETCH FORWARD 10000 FROM mycursor1;
-- returns the first 10,000 rows
FETCH FORWARD 10000 FROM mycursor1;
-- returns the next 10,000 rows
...
And so on, until FETCH returns an empty list, which means we have scrolled through to the end.
COMMIT;
Let's outline a Doctrine-compatible class that encapsulates work with the cursor. To solve 80% of the problem in 20% of the time, it will only work with native queries (NativeQuery). So let's call it PgSqlNativeQueryCursor.
Constructor:
public function __construct(NativeQuery $query)
{
    $this->query = $query;
    $this->connection = $query->getEntityManager()->getConnection();
    $this->cursorName = uniqid('cursor_');

    assert($this->connection->getDriver() instanceof PDOPgSqlDriver);
}
Here I generate a name for the future cursor.
Since the class contains PostgreSQL-specific SQL, it is worth checking that our driver is in fact the PostgreSQL one.
We need three things from the class:
Open the cursor:
public function openCursor()
{
    if ($this->connection->getTransactionNestingLevel() === 0) {
        throw new \BadMethodCallException('Cursor must be used inside a transaction');
    }

    $query = clone $this->query;
    $query->setSQL(sprintf(
        'DECLARE %s CURSOR FOR (%s)',
        $this->connection->quoteIdentifier($this->cursorName),
        $this->query->getSQL()
    ));
    $query->execute($this->query->getParameters());

    $this->isOpen = true;
}
As I said, cursors only open inside a transaction, so here I check that we have not forgotten to call this method inside an already open one. (Thank goodness the days when I would have been tempted to open a transaction right here are behind me!)
To simplify creating and initializing a new NativeQuery, I simply clone the one fed into the constructor and wrap its SQL in DECLARE ... CURSOR FOR (original_query_here). Done.
Let's make a getFetchQuery method. It returns not the data itself, but another query, which the caller can use however it likes to fetch the data in batches of the specified size. This gives the calling code more freedom.
public function getFetchQuery(int $count = 1): NativeQuery
{
    $query = clone $this->query;
    $query->setParameters([]);
    $query->setSQL(sprintf(
        'FETCH FORWARD %d FROM %s',
        $count,
        $this->connection->quoteIdentifier($this->cursorName)
    ));

    return $query;
}
The method has one parameter: the batch size that becomes part of the query it returns. I apply the same cloning trick, clear the parameters, and replace the SQL with the FETCH ... FROM ... construct.
So as not to forget to open the cursor before the first getFetchQuery() call (you never know when I'll be short on sleep), I make it open implicitly right inside getFetchQuery():
public function getFetchQuery(int $count = 1): NativeQuery
{
    if (!$this->isOpen) {
        $this->openCursor();
    }
    …
And the openCursor() method itself becomes private. I see no case at all where it would need to be called explicitly.
Inside getFetchQuery() I hardcoded FORWARD to move the cursor forward by the given number of rows. But FETCH has many different modes. Let's add them too.
const DIRECTION_NEXT = 'NEXT';
const DIRECTION_PRIOR = 'PRIOR';
const DIRECTION_FIRST = 'FIRST';
const DIRECTION_LAST = 'LAST';
const DIRECTION_ABSOLUTE = 'ABSOLUTE';         // with count
const DIRECTION_RELATIVE = 'RELATIVE';         // with count
const DIRECTION_FORWARD = 'FORWARD';           // with count
const DIRECTION_FORWARD_ALL = 'FORWARD ALL';
const DIRECTION_BACKWARD = 'BACKWARD';         // with count
const DIRECTION_BACKWARD_ALL = 'BACKWARD ALL';
Half of them take a row count as a parameter, and the other half do not. Here is what I ended up with:
public function getFetchQuery(int $count = 1, string $direction = self::DIRECTION_FORWARD): NativeQuery
{
    if (!$this->isOpen) {
        $this->openCursor();
    }

    $query = clone $this->query;
    $query->setParameters([]);

    if (
        $direction == self::DIRECTION_ABSOLUTE
        || $direction == self::DIRECTION_RELATIVE
        || $direction == self::DIRECTION_FORWARD
        || $direction == self::DIRECTION_BACKWARD
    ) {
        $query->setSQL(sprintf(
            'FETCH %s %d FROM %s',
            $direction,
            $count,
            $this->connection->quoteIdentifier($this->cursorName)
        ));
    } else {
        $query->setSQL(sprintf(
            'FETCH %s FROM %s',
            $direction,
            $this->connection->quoteIdentifier($this->cursorName)
        ));
    }

    return $query;
}
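The branching can be factored out into a pure function, which makes it easy to see (and unit-test) which directions carry a count. This is my own sketch, not part of the class above; identifier quoting is simplified to PostgreSQL-style double quotes instead of going through Connection::quoteIdentifier():

```php
// Directions whose FETCH syntax includes a row count.
const DIRECTIONS_WITH_COUNT = ['ABSOLUTE', 'RELATIVE', 'FORWARD', 'BACKWARD'];

// Sketch of the SQL-building logic as a standalone helper (hypothetical name).
function buildFetchSql(string $direction, int $count, string $cursorName): string
{
    // Simplified PostgreSQL identifier quoting; real code should use
    // the DBAL connection's quoteIdentifier().
    $quoted = '"' . str_replace('"', '""', $cursorName) . '"';

    if (in_array($direction, DIRECTIONS_WITH_COUNT, true)) {
        return sprintf('FETCH %s %d FROM %s', $direction, $count, $quoted);
    }

    return sprintf('FETCH %s FROM %s', $direction, $quoted);
}
```

For example, buildFetchSql('FORWARD', 10000, 'cursor_abc') yields the same SQL the class would generate for the default direction.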
The cursor is closed with CLOSE; there is no need to wait until the transaction completes:
public function close()
{
    if (!$this->isOpen) {
        return;
    }

    $this->connection->exec('CLOSE ' . $this->connection->quoteIdentifier($this->cursorName));

    $this->isOpen = false;
}
Destructor:
public function __destruct()
{
    if ($this->isOpen) {
        $this->close();
    }
}
Here is the whole class. Let's try it in action.
I open some generic Writer for some generic XLSX file:
$writer->openToFile($targetFile);
Here I get a NativeQuery that pulls the list of orders from the database:
/** @var NativeQuery $query */
$query = $this->getOrdersRepository($em)
    ->getOrdersFiltered($dateFrom, $dateTo, $filters);
Based on this query, I declare the cursor:
$cursor = new PgSqlNativeQueryCursor($query);
And from it I get a query that fetches the data in batches of 10,000 rows:
$fetchQuery = $cursor->getFetchQuery(10000);
I iterate until I get an empty result. On each iteration I perform a FETCH, process the result, and write it to the file:
do {
    $result = $fetchQuery->getArrayResult();

    foreach ($result as $row) {
        $writer->addRow($this->toXlsxRow($row));
    }
} while ($result);
Close the cursor and Writer.
$cursor->close();
$writer->close();
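One thing the snippet glosses over: the cursor must live inside a transaction (openCursor() throws otherwise), so the whole export has to be wrapped in one. A sketch of that wrapping as a generic helper, where $tx is anything exposing beginTransaction()/commit()/rollBack(), such as a Doctrine DBAL Connection (the helper name is my own):

```php
// Run $work inside a transaction, committing on success and rolling back
// (and rethrowing) on failure.
function runInTransaction(object $tx, callable $work)
{
    $tx->beginTransaction();
    try {
        $result = $work();
        $tx->commit();
        return $result;
    } catch (\Throwable $e) {
        $tx->rollBack();
        throw $e;
    }
}

// With the cursor, the export would be wrapped like (hypothetical wiring):
//   runInTransaction($em->getConnection(), function () use ($cursor, ...) {
//       /* fetch loop from above, then $cursor->close() */
//   });
```

The try/catch also guarantees that a failure mid-export does not leave a dangling open transaction holding the cursor alive.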
I write the file to a temporary directory on disk and, once writing finishes, hand it over to the browser, again to avoid buffering it in memory.
REPORT READY! We used a constant amount of PHP memory while processing all the data and did not torment the database with a series of heavy queries. And the export itself took only slightly longer than the database needed to execute the query.
Take a look: are there places in your projects that a cursor could speed up or make more memory-efficient?
Source: https://habr.com/ru/post/455571/