
How to read a large file using PHP (without crashing a server)

A translation of the article by Christopher Pitt.


PHP developers do not often have to monitor memory consumption in their applications. The PHP engine does a good job of cleaning up after us, and the web-server model, in which the execution context “dies” after each request, means that even the sloppiest code rarely causes long-lived problems.


However, there are situations when we can run out of RAM: for example, when running Composer on a small VPS, or when opening a large file on a server with limited resources.




It is the latter problem that we will look at in this tutorial.


All code is available at https://github.com/sitepoint-editors/sitepoint-performant-reading-of-big-files-in-php


Measuring Success


Whenever we optimize code, we must measure how it runs before and after the change in order to judge how effective (or harmful) our optimizations are.


The usual metrics are CPU load and memory usage. It often happens that saving one comes at the cost of the other, and vice versa.


In an asynchronous application model (multi-process or multi-threaded), it is always important to keep an eye on both CPU and memory. In classic applications, resource usage only becomes a problem when we approach the server's limits.


Measuring CPU usage from inside PHP is impractical. It is better to use a utility such as top on Ubuntu or macOS. On Windows, you can use the Windows Subsystem for Linux to get access to top.


In this tutorial we will measure memory usage. We will look at how much memory "traditional" scripts consume, then apply a couple of optimization tricks and compare the results. I hope that by the end of the article you will have a basic understanding of the principles behind optimizing memory consumption when reading large amounts of data.


We will measure the memory like this:


// formatBytes is taken from the php.net documentation

memory_get_peak_usage();

function formatBytes($bytes, $precision = 2) {
    $units = array("b", "kb", "mb", "gb", "tb");

    $bytes = max($bytes, 0);
    $pow = floor(($bytes ? log($bytes) : 0) / log(1024));
    $pow = min($pow, count($units) - 1);

    $bytes /= (1 << (10 * $pow));

    return round($bytes, $precision) . " " . $units[$pow];
}

We will use this function at the end of each script and compare the values we get.


What are the options?


There are many approaches to reading data efficiently, but they all fall into two groups: either we read and process each portion of the data as soon as it is read (without first loading everything into memory), or we turn the data into a stream without worrying about its contents at all.


For the first option, imagine that we want to read a file and process it in chunks of 10,000 lines. We would need to keep at least 10,000 lines in memory and hand them over to a queue (in whatever form it is implemented), as sketched below.
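A minimal sketch of this idea, assuming a shakespeare.txt file in the current directory (the function name and chunk size are made up for illustration):

function readChunks($path, $chunkSize = 10000) {
    $handle = fopen($path, "r");
    $lines = [];

    while (!feof($handle)) {
        $lines[] = trim(fgets($handle));

        // once the buffer is full, hand the chunk to whoever consumes the generator
        if (count($lines) === $chunkSize) {
            yield $lines;
            $lines = [];
        }
    }

    fclose($handle);

    if ($lines) {
        yield $lines; // the final, possibly smaller, chunk
    }
}

foreach (readChunks("shakespeare.txt") as $chunk) {
    // push $chunk onto a queue, process it, etc.
}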


For the second scenario, suppose we want to compress the contents of a very large API response. We do not care what data it contains; what matters is returning it in compressed form.


In both cases we have to deal with large amounts of information. In the first case we know the format of the data; in the second, the format does not matter. Let's look at both options.


Reading Files Line by Line


PHP has many functions for working with files. Let's use some of them to write our own file reader:


// from memory.php

function formatBytes($bytes, $precision = 2) {
    $units = array("b", "kb", "mb", "gb", "tb");

    $bytes = max($bytes, 0);
    $pow = floor(($bytes ? log($bytes) : 0) / log(1024));
    $pow = min($pow, count($units) - 1);

    $bytes /= (1 << (10 * $pow));

    return round($bytes, $precision) . " " . $units[$pow];
}

print formatBytes(memory_get_peak_usage());

// from reading-files-line-by-line-1.php

function readTheFile($path) {
    $lines = [];
    $handle = fopen($path, "r");

    while (!feof($handle)) {
        $lines[] = trim(fgets($handle));
    }

    fclose($handle);
    return $lines;
}

readTheFile("shakespeare.txt");

require "memory.php";

Here we read a file containing the complete works of Shakespeare. The file is about 5.5MB, and peak memory usage is 12.8MB.


Now let's use a generator instead:


// from reading-files-line-by-line-2.php

function readTheFile($path) {
    $handle = fopen($path, "r");

    while (!feof($handle)) {
        yield trim(fgets($handle));
    }

    fclose($handle);
}

readTheFile("shakespeare.txt");

require "memory.php";

The file is the same, but peak memory usage drops to 393KB! This is of no practical use, though, until we actually do something with the data we read. For example, we can split the document into chunks whenever we encounter two blank lines:


// from reading-files-line-by-line-3.php

$iterator = readTheFile("shakespeare.txt");

$buffer = "";

foreach ($iterator as $iteration) {
    preg_match("/\n{3}/", $buffer, $matches);

    if (count($matches)) {
        print ".";
        $buffer = "";
    } else {
        $buffer .= $iteration . PHP_EOL;
    }
}

require "memory.php";

Although we split the document into 1,216 chunks, we used only 459KB of memory. This is thanks to the nature of generators: the memory required is roughly the size of the largest buffered chunk. In this case the largest chunk is 101,985 characters long.


Generators are useful in other situations too, but this example clearly demonstrates the performance benefit when reading large files. When we need to process data, generators are probably one of the best tools for the job.


Piping between files


In situations where we do not need to process the data, we can simply forward it from one file to another. This is called piping (perhaps because we do not see what happens inside a pipe, only what goes in and what comes out). We can achieve this with stream functions. But first, let's write a classic script that naively copies data from one file to another:


// from piping-files-1.php

file_put_contents(
    "piping-files-1.txt",
    file_get_contents("shakespeare.txt")
);

require "memory.php";

Unsurprisingly, this script uses far more memory than the file it copies. That is because it has to read the whole file into memory and keep it there until the copy is finished. For small files this is fine, but not for big ones...


Let's try streaming (or piping) one file into another:


// from piping-files-2.php

$handle1 = fopen("shakespeare.txt", "r");
$handle2 = fopen("piping-files-2.txt", "w");

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

require "memory.php";

The code looks a bit odd. We open both files, the first for reading and the second for writing. Then we copy the first into the second and close both files. It may come as a surprise, but we used only 393KB.


That number looks familiar. Isn't that what the generator-based reader used while reading each line? That is because the second argument to fgets determines how many bytes of each line to read (by default it reads up to the end of the line). The optional third argument to stream_copy_to_stream is a similar kind of parameter: it limits how much data is copied. Under the hood, stream_copy_to_stream reads a small chunk from the first stream at a time and writes it to the second, as illustrated below.
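A small illustration of these length parameters (the output file name is made up): fgets with an explicit length reads at most that many bytes of a line, and the third argument to stream_copy_to_stream caps how many bytes are copied in total.

// read at most 1023 bytes, or up to the first newline, whichever comes first
$handle = fopen("shakespeare.txt", "r");
$firstChunk = fgets($handle, 1024);
fclose($handle);

// copy only the first kilobyte of the source file
$source = fopen("shakespeare.txt", "r");
$target = fopen("first-kilobyte.txt", "w");

stream_copy_to_stream($source, $target, 1024);

fclose($source);
fclose($target);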


Piping this text is not particularly useful, so let's think of a more realistic example. Suppose we want to fetch an image from our CDN and pass it through to a file or to stdout. We could do it like this:


// from piping-files-3.php

file_put_contents(
    "piping-files-3.jpeg",
    file_get_contents(
        "https://github.com/assertchris/uploads/raw/master/rick.jpg"
    )
);

// ...or write this straight to stdout, if we don't need the memory info

require "memory.php";

Implemented this way, the operation took 581KB. Now let's try the same thing with streams.


// from piping-files-4.php

$handle1 = fopen(
    "https://github.com/assertchris/uploads/raw/master/rick.jpg",
    "r"
);

$handle2 = fopen(
    "piping-files-4.jpeg",
    "w"
);

// ...or write this straight to stdout, if we don't need the memory info

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

require "memory.php";

We used slightly less memory (400KB) for the same result. And if we did not need to save the image to a file, we could stream it straight to stdout:


$handle1 = fopen(
    "https://github.com/assertchris/uploads/raw/master/rick.jpg",
    "r"
);

$handle2 = fopen(
    "php://stdout",
    "w"
);

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

// require "memory.php";

Other streams


There are other streams we can pipe to and/or from:

- php://stdin (read-only)
- php://stderr (write-only, like php://stdout)
- php://input (read-only), which gives access to the raw request body
- php://output (write-only), which lets us write to the output buffer
- php://memory and php://temp (read-write), which let us store data temporarily; the difference is that php://temp spills the data to the file system once it grows large enough, while php://memory keeps everything in memory until memory runs out
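For example, here is a small sketch using php://temp as a scratch buffer (the data written into it is just a placeholder): we write into it, rewind, and then pipe it to stdout.

// buffer some data in php://temp (it spills to disk once it grows large)
$buffer = fopen("php://temp", "r+");

fwrite($buffer, "hello world" . PHP_EOL);

// rewind to the beginning, then stream the buffered data to stdout
rewind($buffer);

$stdout = fopen("php://stdout", "w");
stream_copy_to_stream($buffer, $stdout);

fclose($buffer);
fclose($stdout);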



Filters


There is one more trick we can use: filters. They are an in-between option that gives us a little control over the stream without having to dig into its contents in detail. Suppose we want to compress a file. We could use the zip extension:


// from filters-1.php

$zip = new ZipArchive();
$filename = "filters-1.zip";

$zip->open($filename, ZipArchive::CREATE);
$zip->addFromString("shakespeare.txt", file_get_contents("shakespeare.txt"));
$zip->close();

require "memory.php";

This is decent code, but it consumes almost 11MB. With filters we can do better:


// from filters-2.php

$handle1 = fopen(
    "php://filter/zlib.deflate/resource=shakespeare.txt",
    "r"
);

$handle2 = fopen(
    "filters-2.deflated",
    "w"
);

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

require "memory.php";

Here we use php://filter/zlib.deflate, which reads and compresses the incoming data. We can then pipe the compressed data into a file, or anywhere else. This code used only 896KB.


I know this is not exactly the same format as a zip archive. But think about it: if we can choose a different compression format and spend twelve times less memory, isn't it worth it?
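If a more standard container is needed, the same compression can, as far as I know, be produced in gzip format by appending the zlib.deflate filter with explicit parameters instead of using the php://filter shorthand (a sketch; the output file name is made up, and adding 16 to the window is what requests the gzip wrapper):

$input  = fopen("shakespeare.txt", "r");
$output = fopen("filters-2.gz", "w");

// compress everything written to $output; window 15 + 16 selects gzip framing
stream_filter_append($output, "zlib.deflate", STREAM_FILTER_WRITE, ["window" => 15 + 16]);

stream_copy_to_stream($input, $output);

fclose($input);
fclose($output);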


To decompress the data, we apply the reverse zlib filter:


// from filters-2.php

file_get_contents(
    "php://filter/zlib.inflate/resource=filters-2.deflated"
);
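Note that file_get_contents pulls the whole decompressed text into memory at once. A stream-based variant of the same idea (assuming filters-2.deflated exists from the previous step) keeps memory usage low:

$handle1 = fopen(
    "php://filter/zlib.inflate/resource=filters-2.deflated",
    "r"
);

$handle2 = fopen("php://stdout", "w");

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);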

Here are a couple of articles for those who would like to dive deeper into the topic of streams: "Understanding Streams in PHP" and "Using PHP Streams Effectively".


Stream customization


fopen and file_get_contents come with a set of default options, but we can tweak them however we like. To do this, we need to create a new stream context:


// from creating-contexts-1.php

$data = join("&", [
    "twitter=assertchris",
]);

$headers = join("\r\n", [
    "Content-type: application/x-www-form-urlencoded",
    "Content-length: " . strlen($data),
]);

$options = [
    "http" => [
        "method" => "POST",
        "header" => $headers,
        "content" => $data,
    ],
];

$context = stream_context_create($options);

$handle = fopen("http://example.com/register", "r", false, $context);
$response = stream_get_contents($handle);

fclose($handle);

In this example we are making a POST request to an API. We set a few headers and then access the API through a file handle. There are many other context options available, so it is worth getting acquainted with the documentation on the subject.
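For instance, here is a sketch of the same kind of context with a couple of extra options added (the URL is a placeholder; timeout and user_agent are standard http context options):

$context = stream_context_create([
    "http" => [
        "method" => "GET",
        "timeout" => 10.0,            // read timeout, in seconds
        "user_agent" => "my-app/1.0", // hypothetical user agent string
    ],
]);

$response = file_get_contents("http://example.com", false, $context);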


Creating your own protocols and filters


Before we finish, let's talk about creating custom protocols. Looking at the documentation, we can find an example class to implement:


Protocol {
    public resource $context;
    public __construct ( void )
    public __destruct ( void )
    public bool dir_closedir ( void )
    public bool dir_opendir ( string $path , int $options )
    public string dir_readdir ( void )
    public bool dir_rewinddir ( void )
    public bool mkdir ( string $path , int $mode , int $options )
    public bool rename ( string $path_from , string $path_to )
    public bool rmdir ( string $path , int $options )
    public resource stream_cast ( int $cast_as )
    public void stream_close ( void )
    public bool stream_eof ( void )
    public bool stream_flush ( void )
    public bool stream_lock ( int $operation )
    public bool stream_metadata ( string $path , int $option , mixed $value )
    public bool stream_open ( string $path , string $mode , int $options , string &$opened_path )
    public string stream_read ( int $count )
    public bool stream_seek ( int $offset , int $whence = SEEK_SET )
    public bool stream_set_option ( int $option , int $arg1 , int $arg2 )
    public array stream_stat ( void )
    public int stream_tell ( void )
    public bool stream_truncate ( int $new_size )
    public int stream_write ( string $data )
    public bool unlink ( string $path )
    public array url_stat ( string $path , int $flags )
}

Implementing this deserves an article of its own. But if you do put in the effort, you can register your stream wrapper quite easily:


if (in_array("highlight-names", stream_get_wrappers())) {
    stream_wrapper_unregister("highlight-names");
}

stream_wrapper_register("highlight-names", "HighlightNamesProtocol");

$highlighted = file_get_contents("highlight-names://story.txt");
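To give an idea of the shape such a class takes, here is a very rough, hypothetical sketch of a read-only HighlightNamesProtocol that proxies to a local file and wraps the name "Romeo" in asterisks (error handling is omitted, and a real implementation would also need to handle names split across chunk boundaries):

class HighlightNamesProtocol
{
    public $context;
    private $handle;

    public function stream_open($path, $mode, $options, &$opened_path)
    {
        // strip the "highlight-names://" prefix and open the underlying file
        $real = substr($path, strlen("highlight-names://"));
        $this->handle = fopen($real, "r");

        return $this->handle !== false;
    }

    public function stream_read($count)
    {
        $chunk = fread($this->handle, $count);

        return $chunk === false ? "" : str_replace("Romeo", "*Romeo*", $chunk);
    }

    public function stream_eof()
    {
        return feof($this->handle);
    }

    public function stream_stat()
    {
        return fstat($this->handle);
    }

    public function stream_close()
    {
        fclose($this->handle);
    }
}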

Similarly, you can create custom stream filters. The documentation gives an example filter class:


Filter {
    public $filtername;
    public $params;
    public int filter ( resource $in , resource $out , int &$consumed , bool $closing )
    public void onClose ( void )
    public bool onCreate ( void )
}

And it’s also easy to register:


 $handle = fopen("story.txt", "w+"); stream_filter_append($handle, "highlight-names", STREAM_FILTER_READ); 

The filtername property of the new filter class must be set to highlight-names. It is also possible to use the filter inline: php://filter/highlight-names/resource=story.txt. Filters are much easier to create than protocols, but protocols offer more flexibility and functionality. One reason for this is that protocols have to handle things like directory operations, whereas a filter only has to handle each chunk of data passing through it.
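For completeness, here is a rough sketch (the class name is made up, and the same chunk-boundary caveat applies) of what such a filter might look like when implemented by extending php_user_filter and registered with stream_filter_register:

class HighlightNamesFilter extends php_user_filter
{
    public function filter($in, $out, &$consumed, $closing): int
    {
        while ($bucket = stream_bucket_make_writeable($in)) {
            // transform each chunk of data as it flows through the filter
            $bucket->data = str_replace("Romeo", "*Romeo*", $bucket->data);
            $consumed += $bucket->datalen;
            stream_bucket_append($out, $bucket);
        }

        return PSFS_PASS_ON;
    }
}

stream_filter_register("highlight-names", "HighlightNamesFilter");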


I strongly recommend experimenting with custom protocols and filters. If you can apply filters to a stream_copy_to_stream workflow, you will get enormous memory savings even when working with huge amounts of data. Imagine writing a filter that resizes images, or one that encrypts data, or maybe something even more impressive.


Summary


Although this is not a problem we suffer from often, it is easy to make a mess when working with large files. In asynchronous applications it is also easy to bring down the entire server if you do not keep memory usage in your scripts under control.


I hope this tutorial has given you a few new ideas (or refreshed old ones) and that you can now work with large files far more efficiently. Once we are familiar with generators and streams (and stop relying on functions like file_get_contents), our applications are spared a whole class of errors. That seems like a good thing to aim for!



Source: https://habr.com/ru/post/345024/

