
Multi-threaded download in cURL for PHP

In this topic I present what I consider a convenient and functional implementation of multi-threaded downloading with cURL in PHP. Perhaps it will be useful to someone, and will earn me an invite ;)

Probably only the lazy have never downloaded anything via cURL, even if just out of curiosity, whether from the console or by embedding it in code in some programming language. Blocking solutions that download a single link can be found on every corner of the net, for example on php.net. However, if we are talking about PHP implementations, this approach is often unsuitable because of the time spent on auxiliary operations (DNS lookup, waiting for the response, and the like). For downloading a large number of pages the sequential version is not acceptable. If it suits you, you can stop reading here :)
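For reference, the blocking variant I mean looks roughly like this. This is only a minimal sketch of the usual php.net-style single-link download, not code from this article:

<?php
// a minimal blocking download of one page: one curl_exec() call per URL,
// so N pages cost N full round-trips, one after another
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "www.example.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the page instead of printing it
$content = curl_exec($ch);
curl_close($ch);
?>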

In Perl, for example, you can parallelize a single-threaded download with fork() or threads (use threads), not to mention the rich possibilities of that language's libraries. I personally used threads and LWP. However, we are talking about PHP, and parallelizing here is a big problem because the language lacks this feature in principle. If anyone knows how to create threads in PHP, please let me know, but I have not yet found a worthy solution. Yes, cURL has the curl_multi_* functions, but the example implementations built on them did not suit me. So, in the end, I decided to build my own bicycle.

I will start from the simplest example in the official manual. Let me quote it here :)
<?php
// create both cURL resources
$ch1 = curl_init();
$ch2 = curl_init();

// set URL and other appropriate options
curl_setopt($ch1, CURLOPT_URL, "www.example.com");
curl_setopt($ch1, CURLOPT_HEADER, 0);
curl_setopt($ch2, CURLOPT_URL, "www.php.net");
curl_setopt($ch2, CURLOPT_HEADER, 0);

// create the multiple cURL handle
$mh = curl_multi_init();

// add the two handles
curl_multi_add_handle($mh, $ch1);
curl_multi_add_handle($mh, $ch2);

$running = null;
// execute the handles
do {
    curl_multi_exec($mh, $running);
} while ($running > 0);

// close the handles
curl_multi_remove_handle($mh, $ch1);
curl_multi_remove_handle($mh, $ch2);
curl_multi_close($mh);
?>

The code differs from the single-threaded approach in its more complex organization of the interaction between the application code and the library:
1) For each connection its own curl_init() is executed and parameters are set via curl_setopt(). Everything here is standard and needs no explanation.
2) A separate handle for overall download control is created by calling curl_multi_init(); all further work goes through it.
3) The individual connections created at the beginning are attached to this handle by calling curl_multi_add_handle().
The preparatory stage is complete; now the download itself:
4) The library performs the download automatically; there is no explicit call as there was with curl_exec(). It is replaced by repeated calls to curl_multi_exec(). Despite the similar name, this function plays a slightly different role: it reports changes in the number of active transfers (as well as any errors that occurred). The second parameter is a reference to a numeric variable that holds the number of currently active connections. If that number has changed, some transfer has finished. For this reason the download cycle is implemented as
do {
    curl_multi_exec($mh, $running);
} while ($running > 0);

5) And finally, after the download, resources are released. Important! Although the connections created by curl_init() are "attached" to the main handle, it does not close them automatically; this must be done manually by calling curl_multi_remove_handle() in addition to curl_close().
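A side note from me (not part of the official example): a loop like the one above spins the CPU at full speed while waiting for data. If that bothers you, curl_multi_select() can put the script to sleep until one of the sockets has activity. A minimal sketch of such a loop:

<?php
// a sketch of a less CPU-hungry download loop: curl_multi_select() blocks
// (up to 1 second here) until one of the transfers has something to report
$running = null;
do {
    curl_multi_exec($mh, $running);
    if ($running > 0) {
        curl_multi_select($mh, 1.0); // wait for socket activity instead of busy-looping
    }
} while ($running > 0);
?>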

Someone may find such an implementation sufficient and can stop reading here. I will go further.
What is wrong with this implementation? A couple of the most obvious points:
  1. A hard-coded limit of two downloaded links, set right in the code
  2. The received pages are output directly to STDOUT

This is only part of it; the rest is discussed below.

Correcting these shortcomings, I get, for example, the following:
<?php
$urls = array("www.example.com", "www.php.net");

$mh = curl_multi_init();

$chs = array();

foreach ($urls as $url) {
    $chs[] = ($ch = curl_init());
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // CURLOPT_RETURNTRANSFER - return the page as a string instead of printing it to stdout
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_multi_add_handle($mh, $ch);
}

$prev_running = $running = null;

do {
    curl_multi_exec($mh, $running);

    if ($running != $prev_running) {
        // get information about the completed connection
        $info = curl_multi_info_read($mh);

        if (is_array($info) && ($ch = $info['handle'])) {
            // get the content of the loaded page
            $content = curl_multi_getcontent($ch);

            // here the page text could be processed somehow

            // for now, as in the original - output to STDOUT
            echo $content;
        }

        // update the cached number of currently active connections
        $prev_running = $running;
    }

} while ($running > 0);

foreach ($chs as $ch) {
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
?>


Further: in most cases it is unlikely to be enough to simply dump the pages to STDOUT. Moreover, they arrive in arbitrary order, depending on the order in which the downloads actually complete (and not on the order of the curl_multi_add_handle() calls). Also, if a large volume is being downloaded, there is no point in waiting for all pages to arrive - you can start processing them as they come in. But the option of getting everything in one batch should not be written off either.
For this: 1) I wrap everything in a function, 2) I add a parameter specifying a callback function that will be called for each received page. If no callback is specified, the option of getting all pages at once is used. Here is an example:
<?php
// an example of the simplest callback. almost a dummy func.
function my_callback($url, $content, $curl_status, $ch) {
    echo "Download of page [$url] ";
    if (!$curl_status) {
        echo "was successful. page text:\n$content\n";
    }
    else {
        echo "finished with error #$curl_status: " . curl_error($ch) . "\n";
    }
}

function http_load($urls, $callback = false) {
    $mh = curl_multi_init();

    $chs = array();
    foreach ($urls as $url) {
        $chs[] = ($ch = curl_init());
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        // CURLOPT_RETURNTRANSFER - return the page as a string instead of printing it to stdout
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_multi_add_handle($mh, $ch);
    }

    // if $callback is false, the function should not call a callback but return the pages as its result
    if ($callback === false) {
        $results = array();
    }

    $prev_running = $running = null;

    do {
        curl_multi_exec($mh, $running);

        if ($running != $prev_running) {
            // get information about the completed connection
            $info = curl_multi_info_read($mh);

            if (is_array($info) && ($ch = $info['handle'])) {
                // get the content of the loaded page
                $content = curl_multi_getcontent($ch);

                // the downloaded link
                $url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

                if ($callback !== false) {
                    // call the callback handler
                    $callback($url, $content, $info['result'], $ch);
                }
                else {
                    // add the result to the hash
                    $results[$url] = array('content' => $content, 'status' => $info['result'], 'status_text' => curl_error($ch));
                }
            }

            // update the cached number of currently active connections
            $prev_running = $running;
        }

    } while ($running > 0);

    foreach ($chs as $ch) {
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    // results
    return ($callback !== false) ? true : $results;
}

$urls = array("www.example.com", "www.php.net");

// simple variant returning the results
print_r(http_load($urls));

// variant with a callback
var_export(http_load($urls, 'my_callback'));
?>

Now it is much more interesting. An important point: with a callback the fourth parameter is the connection handle $ch, while when returning results as a hash it is simply a string description of the error that occurred (or an empty string if everything is fine). Why? Because curl_error() requires a handle to be passed, and the handles are closed at the end of the function. Inside the callback the handle still exists and we can use it, but in the hash it would no longer give anything useful. Alternatively, string descriptions of the error codes can be found here.
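A small aside from me (not in the original article): if you are on PHP 5.5 or newer, there is also curl_strerror(), which maps a numeric cURL error code to its text description without needing a live handle, so the error text can be reconstructed even after the handles are closed. A minimal sketch, assuming that PHP version:

<?php
// a sketch: convert the numeric status from $info['result'] to readable text
// without relying on the (possibly already closed) handle; needs PHP >= 5.5
function status_text($curl_status) {
    return $curl_status ? curl_strerror($curl_status) : '';
}
?>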

So, onward. I want to be able to pass the function not only an array of links, but also a single page to download. To do this, just one line is added:
<?php
function http_load($urls, $callback = false) {
...
// even if a single parameter is passed, I treat it as an element of an array
// this is equivalent to: $urls = is_array($urls) ? $urls : array($urls);
$urls = (array) $urls;
...
?>

Now links can also be downloaded one at a time: http_load('google.com'). A sort of return to the origins.

Then I needed to set many more request headers on the connections. Specifying them one by one via curl_setopt() is impractical; it is better to use curl_setopt_array(). Reworking it, I get (part of the code):
<?php
{ // headers common to all connections
    $ext_headers = array(
        'Expect:',
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9',
        'Accept-Language: ru,en-us;q=0.7,en;q=0.7',
        // 'Accept-Encoding: gzip,deflate', // would have to be unpacked later, so not for now
        'Accept-Charset: utf-8,windows-1251;q=0.7,*;q=0.5',
    );
    $curl_options = array(
        CURLOPT_PORT           => 80,
        CURLOPT_RETURNTRANSFER => 1,  // return the page as a string instead of printing it to stdout
        CURLOPT_BINARYTRANSFER => 1,  // binary-safe transfer
        CURLOPT_CONNECTTIMEOUT => 10, // connection timeout (lookup + connect)
        CURLOPT_TIMEOUT        => 30, // timeout for receiving data
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.1) Gecko/20090716 Ubuntu/9.04 (jaunty) Shiretoko/3.5.1',
        CURLOPT_VERBOSE        => 2,  // verbosity of diagnostic output
        CURLOPT_HEADER         => 0,  // do not include response headers in the output
        CURLOPT_FOLLOWLOCATION => 1,  // follow redirects
        CURLOPT_MAXREDIRS      => 7,  // maximum number of redirects
        CURLOPT_AUTOREFERER    => 1,  // on redirect, put the "Location:" value into "Referer:"
        // CURLOPT_FRESH_CONNECT => 0, // set to 1 to force a new connection every time (left disabled)
        CURLOPT_HTTPHEADER     => $ext_headers,
    );
}

function http_load($urls, $callback = false) {
    global $curl_options;

    $mh = curl_multi_init();
    if ($mh === false) return false;

    $urls = (array) $urls;

    $chs = array();
    foreach ($urls as $url) {
        $chs[] = ($ch = curl_init());

        curl_setopt_array($ch, $curl_options); // set all the common options at once
        curl_setopt($ch, CURLOPT_URL, $url);

        curl_multi_add_handle($mh, $ch);
    }
    ...
?>
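By the way (my own note, not from the original text): if you do want compressed transfers, you do not have to unpack them yourself - cURL can request and decode gzip/deflate responses on its own if you set CURLOPT_ENCODING instead of a manual Accept-Encoding header. A hedged sketch:

<?php
// a sketch: let cURL negotiate and decompress the response itself;
// an empty string means "all encodings this libcurl build supports"
$curl_options[CURLOPT_ENCODING] = ''; // or 'gzip,deflate'
?>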


Here I pretend to be Firefox. The headers are commented briefly; for a detailed explanation there is a post here.
And as a follow-up to these headers, a third parameter is added to the function:
<?php function http_load( $urls, $callback = false, $urls_params = array() ) {} ?>
in which you can specify your own options that will additionally be applied to the connections when they are initialized. This way you can, for example, send POST requests with parameters, or specify your own referrers and the format of the transmitted data (for example, with compression).
<?php
...
foreach ($urls as $ind => $url) {
    $chs[] = ($ch = curl_init());

    curl_setopt_array($ch, $curl_options); // set all the common options at once
    curl_setopt($ch, CURLOPT_URL, $url);

    // are there additional parameters for initializing this connection?
    if (isset($urls_params[$ind]) && is_array($urls_params[$ind])) {
        curl_setopt_array($ch, $urls_params[$ind]);
    }

    curl_multi_add_handle($mh, $ch);
}
...
?>
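For illustration, a hypothetical call with per-URL options might look like this. The URL and form field names below are made up for the example; the CURLOPT_* constants are standard cURL options:

<?php
// a sketch of calling http_load() with per-URL options:
// the second URL is fetched as a POST request with form fields
$urls = array("www.example.com", "www.example.com/search");
$urls_params = array(
    1 => array( // index matches the position of the URL in $urls
        CURLOPT_POST       => 1,
        CURLOPT_POSTFIELDS => array('q' => 'curl_multi', 'page' => 1),
        CURLOPT_REFERER    => 'http://www.example.com/',
    ),
);
print_r(http_load($urls, false, $urls_params));
?>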


That's the function. I could also write about working with cookies and POST requests, but only if I get an invite. I have written a lot as it is - how many of you made it this far? ;)

Source: https://habr.com/ru/post/68175/

