
Simple web scraping in F#

It is fair to ask why I picked such a hackneyed topic as web scraping, and why F#:
1. Web scraping is much more interesting in F# than in C#.
2. I wanted to find out how well F# suits programs that do real work rather than demo examples.
3. F# has an interactive console, which is a lifesaver when you are digging through the depths of HTML.

Today, with the help of F#, we are going to buy a VW Touareg.

Touareg


In my opinion it is the best car for a harsh winter and equally harsh roads. Suppose we have a million four hundred thousand, a great desire, and nothing more. There is also the auto.ru site, but my recent involvement in buying a used car revealed several shortcomings:
a. you constantly have to navigate to the right section;
b. you constantly have to fill out the search form, which is especially annoying on a mobile device on the way to inspect the next candidate. We used an iPad and it was far from great; on a smartphone I would have shot myself doing all of this.

Requirements in total: the program visits the offers page for a given query, looks for new offers and, if it finds any, sends an email with the parameters of the new offer(s) together with the list of all offers matching the query, so that they can be compared visually.

General methods


Auto.ru is quite tolerant of having its content collected, so there is no need for tricks such as emulating button clicks or passing cookies around: everything we need is available from a direct URL via GET. We will send mail through Gmail, which requires the client SMTP settings shown in the comment at the end of the code.

open System.IO
open System.Net
open System.Net.Mail

module WebUtil =
    // Download a page; the response is decoded as windows-1251 (code page 1251).
    let LoadPage (x: WebRequest) =
        use resp = x.GetResponse()
        use rs = resp.GetResponseStream()
        use rr = new StreamReader(rs, System.Text.Encoding.GetEncoding(1251))
        rr.ReadToEnd()

    let LoadPageByUrl (x: string) =
        let request = WebRequest.Create(x)
        LoadPage request

    // Send an HTML message; the SMTP settings are read from the application config file.
    let SentByMail (recipient: string) (subj: string) (content: string) =
        use client = new SmtpClient()
        client.DeliveryMethod <- SmtpDeliveryMethod.Network
        use message = new MailMessage()
        message.To.Add(recipient)
        message.Body <- content
        message.Subject <- subj
        message.IsBodyHtml <- true
        client.Send(message)

(*
SMTP settings for the application config file:
<system.net>
  <mailSettings>
    <smtp from="YourMail@gmail.com">
      <network host="smtp.gmail.com" port="587" enableSsl="true"
               password="$ecretPa$$w0rd" defaultCredentials="false"
               userName="YourMail@gmail.com" />
    </smtp>
  </mailSettings>
</system.net>
*)

What is F#-specific here? Functionally, nothing: these are standard platform methods. Still, the code is strikingly short, apart from the long run of MailMessage property setters. Readability is a matter of personal taste, but in my opinion few languages can compare with F# in this respect.

Data structures


Because we need to tell which offers were added since the last check, the result of the previous query is stored in a file. It would be more correct to store only the date of the last check, but that would not be interesting at all, and we would miss the topic of serializing complex objects. So, records:

module CarParser =
    [<DataContract>]
    type Car =
        {
            [<field: DataMember(Name = "Year")>]
            Year: int
            [<field: DataMember(Name = "Price")>]
            Price: int
            [<field: DataMember(Name = "Model")>]
            Model: string
            [<field: DataMember(Name = "Engine")>]
            Engine: string
            [<field: DataMember(Name = "Url")>]
            Url: string
        }

    [<DataContract>]
    type CarRequest =
        {
            [<field: DataMember(Name = "Email")>]
            Email: string
            [<field: DataMember(Name = "RequestUrl")>]
            RequestUrl: string
            [<field: DataMember(Name = "Cars")>]
            Cars: Car list
        }



Why the additional attributes? Standard XML serialization through XmlSerializer does not work, because F# records have no parameterless constructor, which XmlSerializer requires. DataContractSerializer comes to the rescue here. The methods for serializing to a file and deserializing from it look like this:

open System
open System.IO
open System.Xml
open System.Runtime.Serialization
open System.Text.RegularExpressions

module SerializationUtils =
    let SerializeToFile (req: 'T) (fileName: string) =
        let xmlSerializer = DataContractSerializer(typeof<'T>)
        // File.Create truncates an existing file; File.OpenWrite would leave
        // stale bytes behind whenever the new XML is shorter than the old one.
        use fs = File.Create(fileName)
        xmlSerializer.WriteObject(fs, req)

    // 'T is inferred automatically
    let Deserialize<'T> (fileName: string) =
        let xmlSerializer = DataContractSerializer(typeof<'T>)
        use fs = File.OpenRead(fileName)
        xmlSerializer.ReadObject(fs) :?> 'T
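A round trip through a file may make the two helpers easier to follow. This is only a minimal sketch: the `Offer` record, the lower-case helper names and the temporary file name are assumptions made for illustration, and the helpers are repeated here so the snippet can be pasted into F# Interactive on its own.

```fsharp
open System.IO
open System.Runtime.Serialization

// Hypothetical record, used only in this illustration.
[<DataContract>]
type Offer =
    { [<field: DataMember(Name = "Model")>]
      Model: string
      [<field: DataMember(Name = "Price")>]
      Price: int }

// Same idea as SerializeToFile: 'T is inferred from the argument.
let serializeToFile (value: 'T) (fileName: string) =
    let serializer = DataContractSerializer(typeof<'T>)
    use fs = File.Create(fileName)
    serializer.WriteObject(fs, value)

// Same idea as Deserialize: the caller names the expected type.
let deserialize<'T> (fileName: string) =
    let serializer = DataContractSerializer(typeof<'T>)
    use fs = File.OpenRead(fileName)
    serializer.ReadObject(fs) :?> 'T

// Assumed location: a scratch file in the temp directory.
let path = Path.Combine(Path.GetTempPath(), "offer-sketch.xml")
let saved = { Model = "VW Touareg"; Price = 1400000 }

serializeToFile saved path
let loaded = deserialize<Offer> path

// Records use structural equality, so the loaded copy can be compared directly.
printfn "Round trip preserved the record: %b" (saved = loaded)
```

The type argument on `deserialize<Offer>` can be dropped wherever the expected type is already known from context, for example when the result is bound to a value with a type annotation.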


Content parsing


As for the parameters of a particular car, my priorities are: price; engine (how much tax I will pay, and petrol or diesel); and year, which is only a very indirect indicator of condition. If everything suits me, I can look at the photos by following the link. Incidentally, it would be interesting to rework that part too and link to my own site, which would show the photos and the description of the car without any ads. But back to the more pressing task.


Using HtmlAgilityPack (it is really cool that any .NET library is available from F#), we get the table of offers, and the rest is routine. Again, parsing in F# looks very short and clear. I know for sure that I will write at least part of my next real content collection and analysis project in F#, because the result is much easier to read.

    // (continues module CarParser)
    // Turn one table row into a Car: model and link, price, year, engine.
    let private ParseCar (cnt: HtmlNode) =
        let columns = cnt.SelectNodes("td") |> Seq.toList
        let model = columns.[0].InnerText
        let txt = columns.[1].InnerText
        // Drop everything that is not a word character (spaces and so on) before parsing.
        let price = txt |> (fun x -> Regex.Replace(x, "\\W", System.String.Empty)) |> Int32.Parse
        let url =
            columns.[0].ChildNodes
            |> Seq.find (fun x -> x.Name.ToLower() = "a")
            |> (fun x -> x.Attributes)
            |> Seq.find (fun x -> x.Name = "href")
            |> (fun x -> x.Value)
        let year = columns.[2].InnerText |> Int32.Parse
        let engine = columns.[3].InnerText
        let c: Car = { Year = year; Price = price; Model = model; Url = url; Engine = engine }
        c

    // Note: SelectNodes returns null when nothing matches the XPath expression.
    let private ParsePage (node: HtmlNode) (parseCar: HtmlNode -> Car) =
        node.SelectNodes("//div[@id='cars_sale']/table[@class='list']/descendant::tr[not(@class='header first')]")
        |> Seq.map parseCar
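The price cleanup in `ParseCar` can be tried in isolation. This is a small sketch under assumptions: the helper name `parsePrice` and the sample strings are made up for illustration and are not taken from the real auto.ru markup.

```fsharp
open System
open System.Text.RegularExpressions

// Remove every non-word character (spaces, non-breaking spaces, punctuation)
// and parse the remaining digits as an integer.
let parsePrice (text: string) =
    let cleaned = Regex.Replace(text, "\\W", String.Empty)
    Int32.Parse(cleaned)

// Hypothetical cell contents: one with ordinary spaces, one with non-breaking spaces.
let withSpaces = parsePrice "1 400 000"
let withNbsp = parsePrice "1\u00A0350\u00A0000"

printfn "%d %d" withSpaces withNbsp
```

Letters count as word characters, so a cell that also carries a currency label would survive the replacement and make `Int32.Parse` throw. If that can happen, replacing `"\\D"` (every non-digit) would be the more defensive choice.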


A few more functions tie the parsed data together. The only interesting things here are probably currying and creating a record by copying an old one. F# generates CompareTo, Equals and GetHashCode for records, so comparing records works correctly here and you can simply write x = y.

    // (continues module CarParser)
    let private ParseCarNode x = ParsePage x ParseCar

    let private GetCars (cntnt: string) (pars: HtmlNode -> seq<Car>) =
        let doc = new HtmlDocument()
        doc.LoadHtml(cntnt)
        pars doc.DocumentNode

    let CreateCarRequest mail url =
        let cars = GetCars (LoadPageByUrl url) ParseCarNode
        { Email = mail; RequestUrl = url; Cars = cars |> List.ofSeq }

    let UpdateCarList (oldRequest: CarRequest) =
        let newCars = GetCars (LoadPageByUrl oldRequest.RequestUrl) ParseCarNode
        let isContains y = Seq.tryFind (fun x -> x = y)
        // Offers that were not present in the previous result.
        let diff = newCars |> Seq.filter (fun x -> (oldRequest.Cars |> isContains x) = None)
        let res = { oldRequest with Cars = newCars |> List.ofSeq }
        // Return the updated request together with the new offers.
        (res, diff)
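The record features used above (structural equality and copy-and-update) can be seen without any network access. The sketch below uses simplified, hypothetical stand-ins for `Car` and `CarRequest` and made-up sample values; it mirrors the idea of `UpdateCarList`, not its exact code.

```fsharp
// Simplified, hypothetical stand-ins for the article's records.
type Car = { Year: int; Price: int; Model: string }
type CarRequest = { Email: string; Cars: Car list }

let oldCar = { Year = 2008; Price = 1350000; Model = "VW Touareg" }
let sameCar = { Year = 2008; Price = 1350000; Model = "VW Touareg" }
let newCar = { Year = 2009; Price = 1400000; Model = "VW Touareg" }

// Records compare field by field, so two separately built values are equal.
printfn "%b" (oldCar = sameCar) // prints true

let oldRequest = { Email = "someone@example.com"; Cars = [ oldCar ] }

// Pretend this list came from a fresh download of the offers page.
let fetched = [ sameCar; newCar ]

// Offers that were not in the previous result.
let diff =
    fetched |> List.filter (fun c -> not (List.contains c oldRequest.Cars))

// Copy-and-update: every field except Cars is taken from the old record.
let updated = { oldRequest with Cars = fetched }

printfn "new offers: %d, total offers: %d" diff.Length updated.Cars.Length
```

Because equality is structural, an offer whose price changed between two checks would also show up as new, which is arguably what a buyer wants to see.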



Results


Left out of the article are the functions that format the e-mail messages and the function that ties everything together and runs on a timer, but their implementation is obvious. The tests can be written in C#; in particular, the detection of new cars was tested with Moles, and that is where the pitfalls described in this post surfaced.
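For readers who would still like to see it, the message formatting might look roughly like this. This is a guess, not the author's code: `formatCar`, `formatMessage`, the HTML layout and the sample values are all assumptions, and the `Car` record is repeated without its serialization attributes to keep the snippet self-contained.

```fsharp
open System.Net

// Same fields as the article's Car record, without the serialization attributes.
type Car = { Year: int; Price: int; Model: string; Engine: string; Url: string }

// One table row per offer; text taken from the page is HTML-encoded.
let formatCar (car: Car) =
    sprintf "<tr><td><a href=\"%s\">%s</a></td><td>%d</td><td>%d</td><td>%s</td></tr>"
        (WebUtility.HtmlEncode(car.Url))
        (WebUtility.HtmlEncode(car.Model))
        car.Price
        car.Year
        (WebUtility.HtmlEncode(car.Engine))

// New offers first, then the full list for visual comparison.
let formatMessage (newCars: Car list) (allCars: Car list) =
    let table cars =
        cars |> List.map formatCar |> String.concat "" |> sprintf "<table>%s</table>"
    sprintf "<h3>New offers</h3>%s<h3>All offers</h3>%s" (table newCars) (table allCars)

// Hypothetical sample data.
let sample =
    { Year = 2008; Price = 1350000; Model = "VW Touareg"
      Engine = "3.0 TDI"; Url = "http://example.com/offer/1" }

let body = formatMessage [ sample ] [ sample ]
printfn "%s" body
```

The resulting string could be passed as the `content` argument of `SentByMail`, which already sets `IsBodyHtml` to true.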
Main advantages:
1. 200 lines of code. In total. Across all 5 files. That is 30 percent less than an average single C# file in programs that carry at least some functional load rather than just calling framework methods with a different order of arguments.
2. Readability of the code.
3. Development speed: in content-collecting programs, the ability to execute code without compiling is a very big advantage.

In my opinion, F# is a more understandable and natural way of developing programs; familiar, routine tasks become interesting again.

The main drawback is, of course, my own clumsy hands: the program has no logging and no clear error handling, which is unacceptable (then again, we are just buying a car, not selling software).

In any case, real-world programs can and should be written in F#. This task is simply not complex enough, algorithmically or logically, to show all the capabilities and advantages of the language, but even tasks like this turn out well and, most importantly, are interesting and quick to solve, not counting the time spent learning the language.

PS: What remains is to write a web interface, ask for donations for subscriptions to pay for an SMS gateway, and start sending messages, if auto.ru does not ban me first :)

Source: https://habr.com/ru/post/112553/

