Backstory
There is an online customer care system my partner has to work with. The system is probably functional, convenient for administrators, effective for management and so on, but how inconvenient it is in everyday use!
- It does not remember your login, password or city, so after logging in you have to wait for all the requests from the default city to load and only then switch to your own.
- Not all the necessary information is visible in the general list of requests. For some of it you have to open the request itself, and each one opens in a new window (driven by javascript, without even a normal href attribute, can you imagine?).
- All this beauty is built on ASP.NET, so every page transition drags its viewstate across the network.
- And a minimum site width of fifteen hundred odd pixels is no pleasure either.
The nature of the work sometimes means logging in from a mobile phone, over the mobile Internet.
If I worked with it myself, none of this would have happened: I would get used to it, adapt, and besides, the bosses insist on it... But I felt sorry for my loved one, and so the idea arose to write a parser for the requests.
Story
By occupation I'm a programmer. A web developer too, but my skill in that direction is not that high: I just make passable websites on WordPress. I had never dealt with serious curl requests before, nor with aspx sites.
But it's interesting, after all!
(The result was a month of evenings with PHP and several sleepless nights. And a lot of fun, of course.)
At first there were attempts at cross-domain requests using javascript, but nothing came of that approach.
Then some timid digging into phantomjs and other ways of emulating user behaviour, but it turned out my js skills were still not up to it.
In the end, everything runs on curl requests fired from a PHP page.
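The wrapper curlFunction() used in the snippets below is not shown in the original; here is a minimal sketch of what such a wrapper could look like, assuming a cookie jar to keep the ASP.NET session and the redirect-following option discussed later:
// Hypothetical curl wrapper: GET when $postdata is null, POST otherwise.
function curlFunction($url, $postdata = null) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);              // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);              // follow redirects (see below)
    curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');  // keep the session between calls
    curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');
    if ($postdata !== null) {
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata);      // already url-encoded key=value pairs
    }
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}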
Receiving the information
Authorization came together quickly enough and worked more or less without problems.
The nastiest problem was the limit on incorrect password attempts: two wrong tries and you have to call the admin to restore access...
But the switch to the desired city stubbornly refused to work. The transition happened, yet led somewhere else, even though the POST request was built by all the rules.
It turned out that preg_match does not work correctly on very long strings.
This directive saves the day:
ini_set("pcre.backtrack_limit", 10000000);
First we fetch the initial state of the page (since we are not logged in yet, we land on the login page) and tear the viewstate out of it:
// Anonymous GET of the login page: extract its __VIEWSTATE for the next request
$url = 'http://***/Default.aspx';
$content = curlFunction($url);
preg_match_all("/id=\"__VIEWSTATE\" value=\"(.*?)\"/", $content, $arr_viewstate);
$viewstate = urlencode($arr_viewstate[1][0]);
Now, with an up-to-date snapshot of the page state in hand, we submit the login and password.
(postdata holds the POST parameters of the request for the page; you can spy on them with Firebug.)
// POST the credentials together with the captured __VIEWSTATE;
// after the redirect we grab the viewstate of the page we are sent to
$url = 'http://***/Default.aspx?ReturnUrl=%2fHome%2fRoutes.aspx';
$postdataArr = array(
    '__LASTFOCUS=',
    '__EVENTTARGET=',
    '__EVENTARGUMENT=',
    '__VIEWSTATE='.$viewstate,
    'ctl00$cphMainContent$loginBox$loginBox$UserName='.$login,
    'ctl00$cphMainContent$loginBox$loginBox$Password='.$password,
    'ctl00$cphMainContent$loginBox$loginBox$LoginButton=',
);
$postdata = implode('&',$postdataArr);
$content = curlFunction($url, $postdata);
preg_match_all("/id=\"__VIEWSTATE\" value=\"(.*?)\"/iu", $content, $arr_viewstate);
$viewstate = urlencode($arr_viewstate[1][0]);
Because the initial link responds with a redirect, and curl is configured with
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
we end up with the viewstate of exactly the page we need.
It was at this moment that the problem with the silently failing regex cropped up, but the solution, thanks to Habr, was found.
Done! Now we can move on to the requests for the desired city and get down to parsing.
// Switch to the desired city: post the dropdown as the EVENTTARGET
// along with the viewstate obtained in the previous step
$url = 'http://***/Home/Routes.aspx';
$postdataArr = array(
    '__EVENTTARGET=ctl00$cphMainContent$ddlCityID',
    '__EVENTARGUMENT=',
    '__LASTFOCUS=',
    '__VIEWSTATE='.$viewstate,
    'ctl00$cphMainContent$ddlCityID='.$city,
    'ctl00$cphMainContent$tbConnectionDate='.$date,
);
$postdata = implode('&',$postdataArr);
$content = curlFunction($url, $postdata);
Once you finally understand what you are doing, everything is quite simple: you have to request exactly the page whose viewstate you received in the previous step.
Data processing
Page in hand, we start parsing.
The first attempt used regular expressions. Unfortunately, PHP on the hosting handled multi-line expressions very strangely and would not match everything, whichever modifiers I used and however I tried to persuade it (while locally everything worked fine).
The next step was the Simple HTML DOM library. All well and good: we fetch the page, follow the links and parse out the information... except that fetching one page takes 0.9 seconds, and pulling the same data from five fields on that page takes another 5 seconds. When you need to walk through nine such links, things get very sad.
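For reference, the Simple HTML DOM attempt looked roughly like this (a sketch of mine, not the original code; the selectors are made up):
// Parse the fetched page and walk the rows of the requests table
require_once 'simple_html_dom.php';
$dom = str_get_html($content);
foreach ($dom->find('table tr') as $row) {
    foreach ($row->find('td') as $cell) {
        $value = trim($cell->plaintext);   // text content of one cell
    }
}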
We google, we think, we read, and we find Nokogiri. You know, light and worthwhile! A genuinely fast and pleasant thing to work with:
$html = new nokogiri($content);
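Extracting the data then looks roughly like this (my sketch, assuming the get()/toArray() interface of the PHP nokogiri library; the selector and the array layout are assumptions):
// Select the cells of the requests table and read them as nested arrays
foreach ($html->get('table td')->toArray() as $cell) {
    // attributes come back as keys, text fragments under '#text'
    $value = isset($cell['#text']) ? implode(' ', $cell['#text']) : '';
}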
Beauty and design
Then a rather unexpected problem surfaced: the actual customer used the developer version, without css, js and other bells and whistles, with obvious displeasure. More precisely, he simply couldn't figure out how to use it.
So we read up on XHR requests.
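The original does not show this part, but the idea is roughly as follows: the hand-written page asks a small PHP endpoint via XHR, and the endpoint returns only the already-parsed data as JSON instead of a full ASP.NET page (the file and function names below are illustrative):
// ajax.php - hypothetical endpoint called from the hand-written page via XHR.
// getRequests() would wrap the curl and parsing steps shown above.
header('Content-Type: application/json; charset=utf-8');
echo json_encode(getRequests($_GET['city'], $_GET['date']));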
Profit! The user is happy, the user's phone is freed from hauling tons of viewstate over the mobile Internet, and the design of a hand-written page is somehow easier to manage.
P.S. I have just been asked whether this client could also be used to change data in the request management system. Sounds like that was a threat...