Recently I have become interested in web scraping (also called web mining), and as a result I decided to write an article for those who have heard that it exists but have not yet tried it.
So, in my understanding, web scraping is the transfer of data published on the Internet as HTML pages into some kind of storage. The storage can be a plain text file, an XML file, or a database (DB). In effect, this is the reverse process: a web application usually takes its data from a database in the first place.
From theory to practice
For example, let's take a simple case: parsing a page of the auto.ru site. Following the link
http://vin.auto.ru/resolve.html?vin=TMBBD41Z57B150932 we see some information displayed for the identification number TMBBD41Z57B150932 (make, model, modification, etc.). Imagine that we need to display this information in a window of, say, a Windows application. Working with databases in .NET is widely documented, so we will not dwell on the storage side; we will deal with the essence.
So, let's create a WinForms application project and drop onto the form a TextBox named tbLink, which will contain our address (link); a button btnStart, whose click will execute the request to the specified address; and a ListBox lbConsole, where we will display the received data. In a real application the links would also have to be taken from some external source, but remember that this is just an example.
That's all for the interface; now let's create the method that is called in response to the button click.
In this method we need to do the following things:
1. Contact the address given in our TextBox.
2. Get the page.
3. Select the required data from the page.
4. Display the data on the form.
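The four steps above can be sketched end to end as one button handler (a condensed preview of the code developed step by step below; the handler name btnStart_Click is just the default WinForms naming for our btnStart button, and the two extension methods are defined later in the article):

```csharp
// Condensed sketch of the whole flow; each step is explained below.
private void btnStart_Click(object sender, EventArgs e)
{
    // 1. Contact the address given in the TextBox
    var autoRequest = (HttpWebRequest)WebRequest.Create(tbLink.Text);
    autoRequest.Method = "GET";

    // 2. Get the page as a string
    string AutoResult = String.Empty;
    var autoResponse = (HttpWebResponse)autoRequest.GetResponse();
    if (autoResponse.StatusCode == HttpStatusCode.OK)
    {
        using (Stream autoStream = autoResponse.GetResponseStream())
        {
            AutoResult = new StreamReader(autoStream, Encoding.GetEncoding("windows-1251")).ReadToEnd();
        }
    }

    // 3. Select the required data (extension methods defined later)
    Dictionary<string, string> d = AutoResult.BetweenDL().BetweenDTDD();

    // 4. Display the data on the form
    foreach (var s in d)
        lbConsole.Items.Add(string.Format("{0} = {1}", s.Key, s.Value));
}
```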
Contacting the address
To begin with, we will create a variable in which the page retrieved by the query will be stored:
string AutoResult = String.Empty;
Next, create a request, passing the link we know as a parameter:
var autoRequest = (HttpWebRequest)WebRequest.Create(tbLink.Text);
Let's set the request properties that help us impersonate a browser. In this case it does not matter, but some sites analyze the request headers, so this is a hint for the future.
autoRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)";
autoRequest.Headers.Add("Accept-Language", "ru-RU");
autoRequest.Accept = "image/gif, image/jpeg, image/pjpeg, image/pjpeg, application/x-shockwave-flash, application/x-ms-application, application/x-ms-xbap, application/vnd.ms-xpsdocument, application/xaml+xml, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*";
Also indicate that the GET method will be used.
autoRequest.Method = "GET";
Now we execute the request and proceed to the next step:
Page retrieval
HttpWebResponse autoResponse = (HttpWebResponse)autoRequest.GetResponse();
The autoResponse variable now holds the server's response, which contains the page itself. We need to examine this response, and if everything is OK, we can read the page into a string:
if (autoResponse.StatusCode == HttpStatusCode.OK)
{
    using (Stream autoStream = autoResponse.GetResponseStream())
    {
        AutoResult = new StreamReader(autoStream, Encoding.GetEncoding("windows-1251")).ReadToEnd();
    }
}
And if everything really is OK, the AutoResult variable now holds the same text we would see in the browser via the View Page Source menu, just without any formatting.
This is all great, of course, but we would like to pick out of this jumble of tags exactly what we need. Here regular expressions come to our aid, and we will use them via extension methods. Extension methods, let me remind you, are static methods of a static class that can be called as if they were instance methods of another type: the object must be of the same type as the first parameter of the static method, which is marked with the this keyword. An example makes this clearer. Suppose we have a StringWithEq method in a StringOperations class:
static class StringOperations
{
    internal static string StringWithEq(this string s)
    {
        return string.Format("{0} =", s);
    }
}
Then we can call this method in the usual way (1) or as an extension method (2):
string test = "Test";
// (1)
Console.Write(StringOperations.StringWithEq(test));
// (2)
Console.Write(test.StringWithEq());
If you look at the source code of the HTML page in the browser, you will notice that the data we need is contained within a tag that is not used anywhere else:
<dl class="def-list md"><dt><strong> </strong></dt><dd>TMBBD41Z57B150932</dd><dt><strong></strong></dt><dd>SKODA</dd><dt><strong></strong></dt><dd>Octavia II (A5)</dd><dt><strong></strong></dt><dd>Elegance</dd><dt><strong> </strong></dt><dd>2007</dd><dt><strong> </strong></dt><dd></dd><dt><strong> </strong></dt><dd>5-</dd><dt><strong> , ..</strong></dt><dd>2000</dd><dt><strong> </strong></dt><dd>150</dd><dt><strong> </strong></dt><dd>BLR, BLX, BLY</dd><dt><strong> </strong></dt><dd> </dd><dt><strong> </strong></dt><dd>Solomonovo</dd><dt><strong> </strong></dt><dd></dd><dt><strong> </strong></dt><dd></dd><dt><strong></strong></dt><dd>Skoda Auto as</dd><dt><strong> </strong></dt><dd>50932</dd><dt><strong> </strong></dt><dd><span style='color: #FF0000;'>NOT OK!</span></dd></dl> - <a href="http://vinformer.su">vinformer.su</a></div>
We will take advantage of this: first extract the data from inside this tag, then break it apart and place it, for example, into a Dictionary object, and finally display the data in the ListBox lbConsole. I would like the final code to look like this:
string BetweenDL = AutoResult.BetweenDL();
Dictionary<string, string> d = BetweenDL.BetweenDTDD();
foreach (var s in d)
{
    lbConsole.Items.Add(string.Format("{0} = {1}", s.Key, s.Value));
}
The first line gives us a string containing the necessary data. Here we use the following extension method:
internal static string BetweenDL(this string dumpFile)
{
    var _regex = new Regex(@"<dl[^>]*>(?<value>[\s\S]+?)</dl>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
    Match _match = _regex.Match(dumpFile);
    return _match.Success ? _match.Groups["value"].Value : string.Empty;
}
Next, using another extension method, we select the required data and write it into a Dictionary object:
internal static Dictionary<string, string> BetweenDTDD(this string dumpFile)
{
    var _regex = new Regex(@"<dt[\s\S]+?strong>(?<valDT>[\s\S]+?)</strong></dt><dd[^>]*>(?<valDD>[\s\S]+?)</dd>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
    MatchCollection matches = _regex.Matches(dumpFile);
    Dictionary<string, string> d = new Dictionary<string, string>();
    foreach (Match match in matches)
    {
        GroupCollection groups = match.Groups;
        d.Add(groups["valDT"].Value, groups["valDD"].Value);
    }
    return d;
}
Next, in the foreach loop, we display the data in the ListBox.
Of course, you could get the same result using only the second extension method. In real-world applications, though, it is often more convenient to first select the part of the text containing the necessary data and then parse it. You could make other improvements and/or changes to this code, but I hope this small article achieved its goal: it gave you an idea of what web scraping is.
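One such improvement worth mentioning: in a real application the request can fail (no network connection, an error status from the server, and so on), and in that case GetResponse throws a WebException. A minimal sketch of handling it, reusing the autoRequest and lbConsole from the example above:

```csharp
try
{
    var autoResponse = (HttpWebResponse)autoRequest.GetResponse();
    // ... read the response stream as shown earlier ...
}
catch (WebException ex)
{
    // Thrown when the request fails or the server returns an error status
    lbConsole.Items.Add("Request failed: " + ex.Message);
}
```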