
Integrating JavaScript cookies into CURL requests

In this article I will describe an unusual problem I once ran into at work. A warning: this article is not for beginners. I assume the reader already has web programming experience and is familiar with PHP, the CURL library and the basics of HTTP.

Let us turn to the problem itself.
I needed to write a bot script for a certain site to automate a process consisting of several steps, not counting authorization. At first glance the task was nothing special, and at first that was indeed the case. Because the site required authorization and made heavy use of cookies, I decided to use CURL. Step by step, I sniffed the HTTP requests sent to the site's server and reproduced them in my script. The work, as they say, was moving along... The trouble began at the penultimate step, when the server quite unexpectedly refused to return the desired result. This left me in a stupor for quite a long time, comparing the logs of the “artificial” (script) and “natural” (browser) requests again and again in the hope of finding some discrepancy. Several hours passed before I realized I was wasting my time.

And then I turned my attention to so-called JavaScript (or non-HTTP) cookies, i.e. cookies set by scripts on the page rather than by the server's response headers. Of course, CURL could not track their appearance and therefore could not add them to its requests. But why would the server check JavaScript cookies? A good question; at another time it would have interested me, but at that moment I had other concerns.
Looking ahead, I will say that my guess was correct. The server really did check the JS cookies, and only at that ill-fated penultimate step. I do not know why this was done. Moreover, I am not sure it was meant as protection against bots. As the saying goes, another's soul (or server) is a mystery.

So, I had to solve the following two tasks:
  1. find "significant" JavaScript cookies that affect the response of the server;
  2. find a way to insert these cookies into CURL requests.
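
One possible way to approach the first task (a sketch only, not necessarily the plan used here) is to compare the cookies the browser actually sends with the cookies the server set through response headers: anything sent but never set by the server is a candidate JavaScript cookie. The helper names and the sample data below are made up for illustration.

```php
<?php
// Hypothetical helper: split a "Cookie:" request header value into name => value pairs.
function parseCookieHeader(string $header): array {
    $cookies = array();
    foreach (explode(';', $header) as $pair) {
        $pair = trim($pair);
        if ($pair === '' || strpos($pair, '=') === false) continue;
        list($name, $value) = explode('=', $pair, 2);
        $cookies[trim($name)] = trim($value);
    }
    return $cookies;
}

// Hypothetical helper: collect cookie names from "Set-Cookie:" lines in raw response headers.
function setCookieNames(string $rawHeaders): array {
    $names = array();
    if (preg_match_all('/^Set-Cookie:\s*([^=;\s]+)=/mi', $rawHeaders, $m)) {
        $names = $m[1];
    }
    return array_values(array_unique($names));
}

// Cookies that were sent but never set by a logged response: JavaScript cookie candidates.
function jsCookieCandidates(string $cookieHeader, string $rawResponseHeaders): array {
    $sent = parseCookieHeader($cookieHeader);
    $fromServer = array_flip(setCookieNames($rawResponseHeaders));
    return array_keys(array_diff_key($sent, $fromServer));
}

// Made-up sample data, for illustration only.
$sentHeader = 'sid=abc123; theme=dark; view=num';
$responses  = "HTTP/1.1 200 OK\r\nSet-Cookie: sid=abc123; path=/\r\nSet-Cookie: theme=dark; path=/\r\n";
print_r(jsCookieCandidates($sentHeader, $responses));
```

Such a comparison only narrows the search: it shows which cookies did not come from the server, not which of them the server actually checks.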

As for the first task, everything was more or less clear to me; at least I already had a rough plan of action. So I decided to start with the second one right away: to find a way to add "my" cookies to the CURL request, in addition to those that get there automatically from the file specified by the CURLOPT_COOKIEFILE option.

The first thing that came to mind was to set the CURLOPT_COOKIE option to a string composed of cookie name/value pairs. Like this:
curl_setopt($hc, CURLOPT_COOKIE, "name1=value1; name2=value2; ..."); 

So I did: I added this line to the code... and very soon saw that it did not work. Or rather, it worked, but not at all as I wanted. The HTTP header that was sent contained only the cookies added by this option, while the cookies from the CURLOPT_COOKIEFILE file had disappeared (from the header, not from the file). That is, the contents of the cookie storage file were ignored. From this I drew a simple and unhelpful conclusion: at least in my setup, the options CURLOPT_COOKIE and CURLOPT_COOKIEFILE / CURLOPT_COOKIEJAR could not be used together.
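
For reference, the cookie string for CURLOPT_COOKIE can be assembled from an array, and the header that is actually sent can be inspected through CURLINFO_HEADER_OUT. This is a minimal sketch: the helper buildCookieString and the URL are illustrative assumptions, and the request itself is left commented out.

```php
<?php
// Hypothetical helper: build the value for CURLOPT_COOKIE from name => value pairs.
function buildCookieString(array $cookies): string {
    $pairs = array();
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

$cookieString = buildCookieString(array('name1' => 'value1', 'name2' => 'value2'));
echo $cookieString, "\n";

if (function_exists('curl_init')) {
    $hc = curl_init('http://example.com/');          // placeholder URL
    curl_setopt($hc, CURLOPT_COOKIE, $cookieString); // custom cookies for this request
    curl_setopt($hc, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($hc, CURLINFO_HEADER_OUT, true);     // record the request headers as sent
    // Uncomment to send the request and look at the real "Cookie:" header:
    // curl_exec($hc);
    // echo curl_getinfo($hc, CURLINFO_HEADER_OUT);
}
```

Comparing the recorded header with the contents of the cookie file is how one can check which cookies actually made it into the request.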

In short, I failed to solve the problem quickly. Searching the Internet for ready-made solutions did not turn up anything either. Maybe I searched badly, but all I managed to find on this topic was a question asked on StackOverflow, apparently left unanswered.

Meanwhile the deadline was approaching, the customer demanded an explanation, and unfortunately all my eloquence had vanished. For some reason I recalled a catchphrase from a Soviet comedy: "Either I take her to the registry office, or she takes me to the prosecutor." I wanted neither the registry office nor the prosecutor. I really wanted to go somewhere and forget it all... but I had to find another way out.

By that time, I could already see two solutions:
  1. give up CURL's automatic cookie handling and take this “dirty” job upon myself, i.e. parse the cookies from the response headers, store them, and send them along with the requests. It sounds a bit scary, but it gives complete control over the cookies.
  2. keep the automatic cookie handling, but add the ability to insert my own (custom) cookies into the cookie file. This prospect was not pleasant either, since it meant editing the cookie file by hand.
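
To give an idea of what the first option involves, here is a minimal sketch of manual cookie handling. It assumes the raw response headers are available as a string (with CURL they can be obtained by enabling CURLOPT_HEADER), keeps only names and values, and ignores domain, path and expiry, which a real implementation would have to respect. The function names and sample data are made up.

```php
<?php
// Hypothetical minimal cookie store: name => value only.
// Reads "Set-Cookie:" lines from raw response headers and updates the store.
function storeResponseCookies(array $store, string $rawHeaders): array {
    $pattern = '/^Set-Cookie:\s*([^=;\s]+)=([^;\r\n]*)/mi';
    if (preg_match_all($pattern, $rawHeaders, $m, PREG_SET_ORDER)) {
        foreach ($m as $match) {
            $store[$match[1]] = trim($match[2]);
        }
    }
    return $store;
}

// Builds the value of the "Cookie:" request header from the store.
function cookieHeaderFromStore(array $store): string {
    $pairs = array();
    foreach ($store as $name => $value) {
        $pairs[] = "$name=$value";
    }
    return implode('; ', $pairs);
}

// Made-up response headers, for illustration only.
$rawHeaders = "HTTP/1.1 200 OK\r\n"
            . "Set-Cookie: sid=abc123; path=/; HttpOnly\r\n"
            . "Content-Type: text/html\r\n\r\n";

$store = storeResponseCookies(array(), $rawHeaders);
$store['view'] = 'num';                    // a custom ("JavaScript") cookie added by hand
echo cookieHeaderFromStore($store), "\n";  // this string could then be passed to CURLOPT_COOKIE
```

With this approach CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR are not used at all, so the conflict described above does not arise, at the cost of re-implementing the cookie rules yourself.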

After some hesitation, I chose the second option, since it seemed easier to implement. Besides, I had long been curious about the Netscape cookie file format, but had never had a reason to study it properly. Now the reason had appeared.
Finding information on this format did not take long. From the very first page of Google results I got to the official CURL site, to the archive of the users' correspondence with the creator of the library, where I found what I was looking for.

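The file format turned out to be quite simple: 7 fields (attributes) on each line, separated by tabs and going in this order:

 domain, tailmatch, path, secure, expires, name, value

The meaning of these fields is, IMHO, quite obvious. I will only note that tailmatch is the domain matching flag: TRUE means the cookie's domain is matched by its tail (so subdomains match too), FALSE means an exact match of the host name is required.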
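
As an illustration of the format, the sketch below splits one line of such a file into named fields. The function name is hypothetical; the field order is the Netscape one (domain, tailmatch, path, secure, expires, name, value). It also accounts for the "#HttpOnly_" prefix that curl puts in front of HttpOnly cookies, which would otherwise look like a comment line.

```php
<?php
// Hypothetical helper: parse one line of a Netscape-format cookie file.
// Returns an array of fields, or null for comments, blank lines and malformed lines.
function parseCookiejarLine(string $line): ?array {
    $line = rtrim($line, "\r\n");
    $httpOnly = false;
    if (strpos($line, '#HttpOnly_') === 0) {
        // curl marks HttpOnly cookies with this prefix; the rest is a normal record
        $httpOnly = true;
        $line = substr($line, strlen('#HttpOnly_'));
    } elseif ($line === '' || $line[0] === '#') {
        return null; // blank line or comment
    }
    $fields = explode("\t", $line);
    if (count($fields) !== 7) {
        return null; // not a valid cookie record
    }
    $keys = array('domain', 'tailmatch', 'path', 'secure', 'expires', 'name', 'value');
    $cookie = array_combine($keys, $fields);
    $cookie['tailmatch'] = ($cookie['tailmatch'] === 'TRUE');
    $cookie['secure']    = ($cookie['secure'] === 'TRUE');
    $cookie['expires']   = (int)$cookie['expires']; // 0 means a session cookie
    $cookie['httponly']  = $httpOnly;
    return $cookie;
}

// Made-up record, for illustration only.
print_r(parseCookiejarLine("example.com\tFALSE\t/\tFALSE\t0\tview\tnum"));
```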

Now that the file format was known, the rest was already a matter of technique.
As a result, I wrote a small class CookiejarEdit, the source code of which I quote right here:

<?php
class CookiejarEdit {
    protected $sFname = false;   // name of the cookies file
    protected $aPrefix = array(  // leading fields of a cookie record:
        '',       // #0: domain
        'FALSE',  // #1: tailmatch (domain matching flag)
        '/',      // #2: path
        'FALSE',  // #3: secure (https only)
    );
    protected $sPrefix = '';     // the leading fields joined into a record prefix

    function __construct($sFn, $sDomain = '', $aXtra = 0) {
        if (!$sFn) return;
        $this->sFname = $sFn;
        $this->setPrefix($sDomain, $aXtra);
    }

    function __clone() {
        $this->setPrefix();
    }

    /****
    ** Set/change the prefix of a cookie record:
    ** Parameters:
    ** 1) $sDomain - value of the 'domain' field
    ** 2) $aXtra - array of additional fields:
    **    $aXtra['tailmatch'] - value of the 'tailmatch' field
    **    $aXtra['path'] - value of the 'path' field
    **    $aXtra['secure'] - value of the 'secure' field
    */
    function setPrefix($sDomain = '', $aXtra = 0) {  // $sDomain is optional so that __clone() can call it
        if ($sDomain) $this->aPrefix[0] = $sDomain;
        if (is_array($aXtra)) {
            if (isset($aXtra['tailmatch'])) $this->aPrefix[1] = $aXtra['tailmatch'] ? 'TRUE' : 'FALSE';
            if (isset($aXtra['path'])) $this->aPrefix[2] = $aXtra['path'];
            if (isset($aXtra['secure'])) $this->aPrefix[3] = $aXtra['secure'] ? 'TRUE' : 'FALSE';
        }
        if ($this->aPrefix[0]) $this->sPrefix = implode("\t", $this->aPrefix) . "\t";
    }

    /****
    ** Export the contents of the cookies file:
    */
    function export() {
        return ($this->sFname) ? file_get_contents($this->sFname) : false;
    }

    /****
    ** Import contents into the cookies file:
    */
    function import($sCont) {
        if (!$sCont || strlen($sCont) < 10) return false;
        file_put_contents($this->sFname, $sCont);
        return true;
    }

    /****
    ** Add/modify/remove a cookie in the cookies file:
    ** Parameters:
    ** 1) $aFields - array of cookie fields:
    **    $aFields[0] - the 'name' field (required)
    **    $aFields[1] - the 'value' field (optional; if absent, the cookie is removed)
    **    $aFields[2] - lifetime in days (optional; 0 means a session cookie)
    ** Return value:
    ** 1) false - on error
    ** 2) true - on removal
    ** 3) string - the added/modified record
    */
    function setCookie($aFields) {
        if (!$this->sFname || !$this->sPrefix) return false;
        if (!is_array($aFields) || !($n_arr = count($aFields))) return false;
        $name = $aFields[0];
        $cont = file_exists($this->sFname) ? file_get_contents($this->sFname) : '';
        $cr = (strpos($cont, "\r\n") !== false) ? "\r\n" : "\n";
        $a_rows = explode($cr, trim($cont, $cr));
        // Look for an existing record with the same prefix and name:
        $i_row = -1;
        foreach ($a_rows as $i => $row) {
            if (strpos($row, "\t" . $name . "\t") === false) continue;
            if (strpos($row, $this->sPrefix) !== 0) continue;
            $i_row = $i;
            break;
        }
        $ret = true;
        if ($n_arr > 1) { // add/modify:
            $val = $aFields[1];
            // Fixed: the lifetime is the third element, $aFields[2], not $aFields[1]
            $life = ($n_arr > 2 && $aFields[2] >= 0) ? $aFields[2] : 1;
            if ($i_row < 0) $i_row = count($a_rows);
            $n_exp = ($life > 0) ? (time() + $life * 24 * 60 * 60) : 0;
            $a_rows[$i_row] = $ret = $this->sPrefix . implode("\t", array($n_exp, $name, $val));
        } else if ($i_row >= 0) { // remove:
            unset($a_rows[$i_row]);
        }
        file_put_contents($this->sFname, implode($cr, $a_rows) . $cr);
        return $ret;
    }

    /****
    ** Add/modify a cookie in the cookies file:
    */
    function addCookie($sName, $sVal, $nLife = 0) {
        return $this->setCookie(array($sName, $sVal, $nLife));
    }

    /****
    ** Remove a cookie from the cookies file:
    */
    function removeCookie($sName) {
        return $this->setCookie(array($sName));
    }
}
?>


The __clone(), export() and import() methods were added purely “for decoration”. Honestly, I do not see much sense in them, nor in the additional $aXtra argument of the setPrefix method, which I added just in case (although, IMHO, the need for it should never arise). In any case, the code works and is ready to use (PHP >= 5.0).
I do not claim the idea is original, and I do not rule out that this is yet another reinvented wheel. Perhaps similar and even simpler solutions have long existed. Nevertheless, my “wheel” helped me, and I will be glad if it helps someone else.

So, the tool for "advanced" work with cookies was ready. But my adventures did not end there. I still had to track down those “significant” JS cookies, work out what values they are assigned, and much more. But that is another story, and I will probably end this one here. Thank you for your attention.

PS:
Rereading the article once more, I realized that it still lacked a “real-life example”. It would be nice, I thought, to give a small demonstration on a real site. For certain reasons I could not use the site I had worked with, so I had to find a replacement: some well-known site that does not require authorization and that uses JavaScript cookies on the server side. To my surprise and delight, such a site turned up very quickly, right in my bookmarks. It is the well-known Yandex Catalog, the Freelance category (the one closest to me).

At first this page looks like this:


But if you go to the settings and select the item “standard with numbers” there:


and return to the catalog page, you get a “wonderful” effect: the thumbnails disappear from the page and only the “dry” numbers and text remain:


Let's try to write the simplest possible bot that downloads the first page of this catalog without thumbnails. I understand how silly it looks from the outside: writing a bot just to change the look of a page. But do not forget that this is only an example. Let's imagine that getting a page without thumbnails is our most cherished wish.

To begin, let's do some reconnaissance. Pay attention to the URL of the catalog page after changing the settings: it has not changed. True, the string "?rnd=xxx" was appended to it, but this is apparently just a hint to the browser not to take the page from the cache. From this we can conclude that the settings are most likely transmitted and stored through cookies.

Let's try to figure out exactly how this happens. A useful tool called Live HTTP Headers will help us with this:


This Firefox extension lets you track all incoming and outgoing HTTP traffic in the browser, including the cookies through which the settings are saved in our example. This obviously happens after clicking the "Save" button and moving to the catalog page.

Enable the Live HTTP Headers sniffer and go back to the settings page. Select the item “standard with numbers” again and press the “Save” button. Now look at the catch in the sniffer. We are interested in the details of the catalog page request. For me they look like this:

 GET /yca/cat/Employment/Freelance/?rnd=191 HTTP/1.1
 Host: yaca.yandex.ru
 User-Agent: ...
 Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
 Accept-Language: ru-ru,ru;q=0.8,en-us;q=0.5,en;q=0.3
 Accept-Encoding: gzip,deflate
 Accept-Charset: windows-1251,utf-8;q=0.7,*;q=0.7
 Keep-Alive: 115
 Connection: keep-alive
 Referer: http://yaca.yandex.ru/setup.xml
 Cookie: yandexuid=796954901281541279; fuid01=4c62c49f04c00e82. my=YwA=; L=eEAcXVFJR252Q0ADVkt9BW5wWmFyXXhXBkBYAwQaYmIRBgo6Ciw9ZggRFwUmNQwcOUs5LwQvVD42OjAPCmFfFQ==. yp=1636215158.sp.; yabs-frequency=/3/UOW2AQmAGyle0Ici2au0/; yaca_view=num


Here the fragment "yaca_view=num" immediately catches the eye. Most likely, this is the cookie we are looking for. But where is it set? Certainly not in the server response headers, since it does not appear there. It is then logical to assume that this is a JavaScript cookie, which means it is set somewhere in the JavaScript of the settings page ("setup.xml"). Let's try to find it in the source of that page. And there it is. Here is the line from the file "setup.xml":
 $.cookie('yaca_view', $('input[name="yaca_view"]:checked' ).val()); 

Apparently, this is where the "yaca_view" cookie is set, with the value taken from the form element of the same name (in our case this value is 'num').
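
Searching the page source by eye works for one page; for several pages the search can be roughly automated. The sketch below is only an assumption about how one might do it: it looks for two common patterns, the jQuery cookie plugin call seen above and direct assignments to document.cookie, and will miss cookies set in any other way. The function name is hypothetical.

```php
<?php
// Hypothetical helper: find names of cookies that a page's JavaScript appears to set.
function findJsCookieNames(string $source): array {
    $names = array();
    // jQuery cookie plugin style: $.cookie('name', value)
    if (preg_match_all('/\$\.cookie\(\s*([\'"])(.+?)\1\s*,/', $source, $m)) {
        $names = array_merge($names, $m[2]);
    }
    // Plain DOM style: document.cookie = "name=value; ..."
    if (preg_match_all('/document\.cookie\s*=\s*([\'"])([^=\'"]+)=/', $source, $m)) {
        $names = array_merge($names, $m[2]);
    }
    return array_values(array_unique($names));
}

// The line quoted from "setup.xml":
$js = <<<'JS'
$.cookie('yaca_view', $('input[name="yaca_view"]:checked' ).val());
JS;
print_r(findJsCookieNames($js));
```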

So, we have found out that in order to see the catalog page without thumbnails, you need to pass the server a cookie with the name 'yaca_view' and the value 'num'. Now that we have a means of adding cookies to CURL requests, writing the bot script is not difficult. Here is its code with a few comments:

<?php
require_once "cookiejaredit.inc";

// Fallback for PHP versions that lack curl_setopt_array():
if (!function_exists('curl_setopt_array')) {
    function curl_setopt_array(&$hc, $a_opts) {
        foreach ($a_opts as $name => $val)
            if (!curl_setopt($hc, $name, $val)) return false;
        return true;
    }
}

/****
** Fetch a page via CURL:
** Parameters:
** 1) $aOpts - array of CURL options
** 2) $sUrl - URL of the page
** 3) $sUrlRef - URL of the referer
** Return value:
** 1) false - on failure
** 2) string - the page content, on success
*/
function getByCurl($aOpts, $sUrl, $sUrlRef = '') {
    $hc = curl_init();
    curl_setopt_array($hc, $aOpts);
    curl_setopt($hc, CURLOPT_URL, $sUrl);
    curl_setopt($hc, CURLOPT_REFERER, $sUrlRef);
    $cont = curl_exec($hc);
    $b_ok = curl_errno($hc) == 0 && curl_getinfo($hc, CURLINFO_HTTP_CODE) == 200;
    echo "\nSent HTTP Header:\n" . curl_getinfo($hc, CURLINFO_HEADER_OUT) .
         "Content Length: " . strlen($cont) . "\n\n";
    curl_close($hc);
    return $b_ok ? $cont : false;
}

// (Temporary) cookies file:
$fn_cook = $_SERVER['DOCUMENT_ROOT'] . '/cookiejar-tmp.txt';

// Common CURL options:
$a_curl_opts = array(
    CURLOPT_NOBODY => 0,
    CURLOPT_RETURNTRANSFER => 1,
    CURLOPT_CONNECTTIMEOUT => 10,
    CURLOPT_TIMEOUT => 15,
    CURLOPT_USERAGENT => 'Mozilla/5.0 Gecko/20110920 Firefox/3.6.23',
    CURLINFO_HEADER_OUT => true,   // keep the sent request headers for the trace log
    CURLOPT_COOKIEFILE => $fn_cook,
    CURLOPT_COOKIEJAR => $fn_cook,
);

define('URL0', 'http://yaca.yandex.ru/yca/cat/Employment/Freelance/');
define('URL1', 'http://yaca.yandex.ru/setup.xml');
define('URL2', 'http://yaca.yandex.ru/yca/cat/Employment/Freelance/?rnd=');
define('FN_RESULT', 'result.htm');

echo '<h3>Trace Log:</h3><pre>';
$cookedit = new CookiejarEdit($fn_cook, 'yaca.yandex.ru');

// Load the settings page:
getByCurl($a_curl_opts, URL1, URL0);

// Add the JS cookie to the cookies file:
$rec = $cookedit->addCookie('yaca_view', 'num');
echo "addCookie:\n" . ($rec ? "$rec\n" : "Fail\n");

// Load the catalog page:
$cont = getByCurl($a_curl_opts, URL2 . rand(0, 999), URL1);

echo '</pre> <hr><h3>Result: ';
if ($cont) {
    file_put_contents(FN_RESULT, $cont);
    echo 'OK</h3><a href="' . FN_RESULT . '" target="_blank">Result page</a>';
}
else echo 'Fail</h3>';
?>


Now that really is everything. Thanks again for your attention, especially to those who read to the end.

Source: https://habr.com/ru/post/133191/

