How I monitored Avito by SMS

As everyone knows, very good items occasionally show up on Avito at very low prices. But they appear rarely, stay up briefly, and disappear quickly.

So I had an idea: why not look for a service that checks the ads every few minutes and notifies me when something interesting turns up? SMS is the best channel for this, since I do not always check my email promptly.

Google turned up several such services, for "only" 3 rubles per SMS or 4 rubles per day and up.
In the end, I decided to write such a service myself, but more on that later ...

Out of curiosity, I registered with one of these services. Yesterday it checked the links every 15 minutes and, if anything changed, sent notifications by email. As for SMS, their website mentioned in passing that mail.ru mail can forward messages as SMS. In fact, it turned out that mail.ru can only send to MegaFon, which I do not have at all... And if you need Beeline or MTS, then by all means, the service will gladly help you, for an extra fee.

I should also mention that I have long been a user of a very convenient and free service, covered on Habr a long time ago, which lets you send an email with a specific subject to a specific mailbox, and the body of the email arrives as an SMS. I wanted to give the monitoring service my_box@sms.ru as the address, but could not figure out how to change the subject of its emails, and without the right subject no SMS is delivered.

On top of that, today the demo period ended, and the check interval became 720 minutes.

In short, having concluded that paying for a "service" of this level is, pardon me, the same as paying for air ~~Windows~~, I decided the easiest thing was to spend 3 hours of my valuable time and build such a service myself, since parsing the Avito page is trivial and, as it turned out, took exactly 1 line of code.

I used VPS hosting for this script. Ordinary web hosting will also do, provided it has Perl, outbound network access, and a scheduler. In a pinch, any computer connected to the Internet will work. I think many people have something like that.

What is the script written in?

I decided to write it in Perl; although my Perl is rather mediocre, it suits scripts of this kind best. Where I was too lazy to work out the Perl way, I did not strain myself and simply called shell commands through system. Still, it turned out quite decent, in my opinion, and I am not even ashamed to show my creation to the public.

The logic of the work, briefly

- Run the script every xxx minutes;
- Download the page using wget;
- Keep the page downloaded last time and compare it with the newly downloaded one; if some ads have changed or new ones have appeared, send an SMS about it.

The info extracted from each ad:

1. ad URL (which I use as a unique ad identifier);
2. Name;
3. Price.

Failures are also accounted for: if one of the page downloads fails, the old list is kept and the page is simply downloaded again on the next run; an SMS about the changes, if any, is sent then.
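The comparison step can be sketched in a few lines. This is only an illustration, assuming each snapshot has already been reduced to a hash of ad URL to price; the sample data and the diff_prices name are made up for the example and are not taken from the script.

```perl
use strict;
use warnings;

# Hypothetical sample data: ad URL => price, as two consecutive runs might see it.
my %old = ('/item/1' => '1000', '/item/2' => '2500');
my %new = ('/item/1' => '900',  '/item/2' => '2500', '/item/3' => '400');

# Compare two snapshots and build one message line per change.
sub diff_prices {
    my ($old, $new) = @_;
    my @lines;
    for my $uri (sort keys %$new) {
        if (!exists $old->{$uri}) {
            push @lines, "New: $new->{$uri} $uri";
        }
        elsif ($new->{$uri} ne $old->{$uri}) {
            push @lines, "Price $old->{$uri} -> $new->{$uri} $uri";
        }
    }
    return @lines;
}

# An empty new snapshot is treated as a failed download: report nothing
# and keep the old snapshot for the next run.
my @changes = %new ? diff_prices(\%old, \%new) : ();
print "$_\n" for @changes;
```

Using the URL as the hash key is what makes "new ad" and "changed price" easy to tell apart: a key missing from the old snapshot is new, a key present in both with a different value has changed.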

More details

Before use, check the paths and names of the mailer and wget, and make sure they are installed and working. In particular, on my CentOS the mailer is called mutt; mail or sendmail, with the same syntax, are more common. You may need to replace wget with /usr/local/bin/wget, etc.

You should also set your mailbox and the phone number on which you want to receive notifications.

Run the script with the command: ./avito.pl url_of_page_with_ads.

Note that the page URL should be for the "list with photos" view. In other words, the URL must not contain &view=list or &view=gallery.

Example URL: www.avito.ru/moskva?q=%D1%80%D0%B5%D0%B7%D0%B8%D0%BD%D0%BE%D0%B2%D1%8B%D0%B9+%D1%81%D0%BB%D0%BE%D0%BD

The page is downloaded to a file whose name is derived from the URL by replacing every character other than letters, digits and dots with an underscore, like this:

https___www.avito.ru_moskva_q__D1_80_D0_B5_D0_B7_D0_B8_D0_BD_D0_BE_D0_B2_D1_8B_D0_B9__D1_81_D0_BB_D0_BE_D0_BD

The name should be unique, valid on both Linux and Windows, and still reasonably readable.
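A minimal sketch of this renaming, using the same character class as the script. The url_to_filename wrapper and the shortened sample URL are only for illustration.

```perl
use strict;
use warnings;

# Turn a URL into a file name by replacing everything except
# ASCII letters, digits and dots with an underscore.
sub url_to_filename {
    my ($url) = @_;
    (my $name = $url) =~ s/[^A-Za-z0-9.]/_/g;
    return $name;
}

my $name = url_to_filename('https://www.avito.ru/moskva?q=%D1%81%D0%BB%D0%BE%D0%BD');
print "$name\n";   # https___www.avito.ru_moskva_q__D1_81_D0_BB_D0_BE_D0_BD
```

Strictly speaking, two different URLs can collapse into the same name (for example, ones that differ only in ? versus &), but for a handful of monitored searches that is unlikely to matter.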

If such a file already exists, the script tries to extract ads from it. If none are found, the script falls back to the previously saved copy, and wget then overwrites the file. If ads are found, the file is first saved under the same name with a -1 suffix:

https___www.avito.ru_moskva_q__D1_80_D0_B5_D0_B7_D0_B8_D0_BD_D0_BE_D0_B2_D1_8B_D0_B9__D1_81_D0_BB_D0_BE_D0_BD-1

Next, the page is downloaded again and the script handles the following cases:

1. If no ads are found in the newly downloaded page, the script simply exits and the old page stays under the -1 suffix. This covers the case where the network dropped or hung: the previous list of ads is not lost.
2. If the script is run for the first time (no previously downloaded page was found), the message simply reports the number of ads found:

Found 25 items, page www.avito.ru/moskva?q=%D1%80%D0%B5%D0%B7%D0%B8%D0%BD%D0%BE%D0%B2%D1%8B%D0%B9+%D1%81%D0%BB%D0%BE%D0%BD monitoring started

If this message arrives, the system is up; it is mainly a check that everything works.

Since the shorter an SMS is, the better, all messages are very terse.

3. If there is a new ad, info about it is added to the text of the pending SMS. Info about all ads then arrives in a single SMS.
4. If the price or the name of the item has changed, the info comes in the form: old price -> new price: name link. For a new ad: New: price name link.

I do not know whether the name can change, but an extra check cost nothing.

5. The console gets a separate plain-text list of everything found. This is mostly for debugging: the parser works today, but tomorrow, when they change the markup, it will stop, and the parsing will have to be changed.

About parsing and nuances

Actually, the whole parsing is in this line:

while($text=~/<div class=\"description\"> <h3 class=\"title\"> <a href=\"(.*?)\".*?>\n(.*?)\n.*?<div class=\"about\">\n\s*(\S*)/gs)

However, the price also contains a space in the form of &nbsp;, which I strip with another regexp:

 $price=~s/&nbsp;//g

So parsing, formally speaking, is still not in one, but in two lines.

g is the global search modifier; it lets you put the search inside the while condition, yielding the next ad on each iteration;
s lets . match newlines, so a single regexp can span several lines (on Avito the URL, name and price are spread over 4 lines, at least for now, until they change the layout).
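Here is a small self-contained sketch of the two modifiers at work. The markup is hypothetical and heavily simplified; it is not the real Avito page, and the pattern is not the one from the script.

```perl
use strict;
use warnings;

# Hypothetical, simplified markup; the real page looked different.
my $html = <<'HTML';
<div class="item"><a href="/moskva/1">
Rubber elephant
</a><span class="price">
 1&nbsp;500 rub.</span></div>
<div class="item"><a href="/moskva/2">
Rubber duck
</a><span class="price">
 300 rub.</span></div>
HTML

my %prices;
# /g resumes after the previous match on every loop iteration;
# /s lets "." cross newlines, so one pattern spans several lines.
while ($html =~ /<a href="(.*?)">\n(.*?)\n.*?<span class="price">\n\s*(\S*)/gs) {
    my ($uri, $name, $price) = ($1, $2, $3);
    $price =~ s/&nbsp;//g;          # "1&nbsp;500" becomes "1500"
    $prices{$uri} = $price;
    print "$uri | $name | $price\n";
}
```

The non-greedy .*? matters here: with a greedy .* and the s modifier, the first match would swallow everything up to the last ad on the page.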

Also note that, in order to read a multi-line file, the following is set at the beginning of the script:

 undef $/;

This is so that my $text=<MYFILE>; reads the entire file in one go.
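For comparison, a sketch of a slightly more cautious variant, which is not what the script does: local $/ undefines the separator only inside one sub, so the rest of a larger program would keep reading line by line. The slurp name and the temporary file are made up for the example.

```perl
use strict;
use warnings;

# Read a whole file into one string. "local" limits the change of $/
# to this sub, so code elsewhere still reads line by line.
sub slurp {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    local $/;                 # undefined only until the sub returns
    my $text = <$fh>;
    close($fh);
    return $text;
}

# Demo on a temporary file (the file name is invented for the example).
my $tmp = "slurp_demo_$$.txt";
open(my $out, '>', $tmp) or die "Cannot write $tmp: $!";
print $out "line one\nline two\n";
close($out);

my $text = slurp($tmp);
unlink $tmp;
print length($text), "\n";    # 18: both lines arrived as a single string
```

For a short script like this one, a global undef $/ works just as well; the scoped form only pays off once other parts of the program read files too.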

Another caveat: I put clickable URLs into every SMS. I have a normal smartphone, so being able to tap a URL inside an SMS and land on the right page is very convenient. For some reason, though, sms.ru mangles such an innocent character as the underscore, replacing it with %C2%A7. I cannot change that, but I can replace the underscore with its URL code, which passes through intact; the URL stays clickable in the SMS from sms.ru and works the same in regular mail: $text=~s/_/%5F/g;

Add a task to the scheduler

 #crontab -e
 */20 * * * * cd /scripts/avito && ./avito.pl 'https://www.avito.ru/moskva?q=%D1%80%D0%B5%D0%B7%D0%B8%D0%BD%D0%BE%D0%B2%D1%8B%D0%B9+%D1%81%D0%BB%D0%BE%D0%BD'

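Every 20 minutes, cron calls the script, which checks the page. Do not forget to wrap the URL in single quotes. Also keep in mind that Vixie cron and cronie turn an unescaped % in the command into a newline, so if the task gets cut off at the first %, write each one as \%.

You can set up as many such tasks as you like; they all work independently of each other.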

What I did not do for a production version, and what would be easy to add

1. A web interface for adding and removing users and tasks. URLs, check intervals, and each user's mailbox and sms.ru phone number would be stored in a MySQL database. The script would be called every minute, check which URLs are due, and send the SMS not to my hard-coded number but to the one specified by the user.

Then it would be possible to charge users 8 rubles a day or so. Maybe I should? Would you pay for such a thing?

2. A price filter: ignore prices above or below given limits. It is trivial, just one more if: next if($page_new{"price"}{$uri}>$max_price or $page_new{"price"}{$uri}<$min_price) . I simply did not need it.
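A hedged sketch of such a filter on made-up data. The limits, the sample prices, and the decision to skip non-numeric prices are all assumptions for the example, not part of the script.

```perl
use strict;
use warnings;

# Hypothetical limits and sample data, for illustration only.
my ($min_price, $max_price) = (500, 2000);
my %price = (
    '/item/1' => '300',
    '/item/2' => '1500',
    '/item/3' => '9000',
    '/item/4' => 'Free',
);

my @kept;
for my $uri (sort keys %price) {
    my $p = $price{$uri};
    next unless $p =~ /^\d+$/;                     # assumption: skip non-numeric prices
    next if $p > $max_price or $p < $min_price;    # outside the wanted range
    push @kept, $uri;
}
print "@kept\n";   # /item/2
```

The numeric check is there because the parser stores whatever non-space token follows the price markup, which is not guaranteed to be a number.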

3. By analogy with Avito, add other sites: Automotive News, irr, etc.

This is also trivial: right after that while(...){...} add a few more while loops, one per site. The main thing is that each of them fills $page{"name"}{$uri} and $page{"price"}{$uri} .

On each site, its own while will match, and the rest will simply return an empty result.
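One way this could look, as a sketch only: a list of per-site patterns applied in turn to the same text, all filling the same structure. Both patterns and the sample markup are invented for the example and do not correspond to any real site.

```perl
use strict;
use warnings;

# Hypothetical patterns for two made-up markups;
# each one must capture (uri, name, price) in that order.
my @patterns = (
    qr/<a class="site-a" href="(.*?)">(.*?)<\/a> <b>(\d+)<\/b>/s,
    qr/<li data-url="(.*?)" data-title="(.*?)" data-price="(\d+)">/s,
);

sub parse_any {
    my ($text) = @_;
    my %page;
    for my $re (@patterns) {
        # A pattern written for another site simply finds nothing here.
        while ($text =~ /$re/g) {
            $page{name}{$1}  = $2;
            $page{price}{$1} = $3;
        }
    }
    return %page;
}

my %page = parse_any('<li data-url="/ad/7" data-title="Bike" data-price="4200">');
print join(', ', map { "$_: $page{name}{$_} $page{price}{$_}" } sort keys %{ $page{name} }), "\n";
```

Because every pattern fills the same two sub-hashes, the comparison and SMS code does not need to know which site a page came from.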

And finally, the script code itself

 #!/usr/bin/perl
 use strict;
 undef $/;

 my $url=$ARGV[0];
 my $mailer="mutt";
 my $wget="wget";

 if($url eq ""){
     print "Usage: avito.pl <https://www.avito.ru/...url>";
     exit;
 }

 my $filename=$url;
 $filename=~s#[^A-Za-z0-9\.]#_#g;
 $url=~m#(^.*?://.*?)/#;
 my $site=$1;
 print "site:".$site."\n";

 sub sendsms {
     my $text=shift;
     $text=~s/_/%5F/g;
     $text=~s/&/%26/g;
     system("echo '$text' | $mailer -s 79xxxxxxxxx xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx\@sms.ru");
 }

 sub parse_page {
     open(MYFILE,"<".shift);
     my $text=<MYFILE>;
     close(MYFILE);
     my %page;
     while($text=~/<div class=\"description\"> <h3 class=\"title\"> <a href=\"(.*?)\".*?>\n(.*?)\n.*?<div class=\"about\">\n\s*(\S*)/gs)
     {
         my $uri=$1;
         my $name=$2;
         my $price=$3;
         $uri=~s/^\s+|\s+$//g;
         $name=~s/^\s+|\s+$//g;
         $price=~s/^\s+|\s+$//g;
         $price=~s/&nbsp;//g;
         $page{"name"}{$uri}=$name;
         $page{"price"}{$uri}=$price;
     }
     return %page;
 }

 my %page_old=parse_page($filename);
 if(scalar keys %{$page_old{"name"}}>0){
     system("cp $filename ${filename}-1");
 }
 else{
     %page_old=parse_page("${filename}-1");
 }

 system("$wget '$url' -O $filename");
 my %page_new=parse_page($filename);

 if(scalar keys %{$page_old{"name"}}>0){ # already have previous successful search
     if(scalar keys %{$page_new{"name"}}>0){ # both searches have been successful
         my $smstext="";
         foreach my $uri(keys %{$page_new{"name"}})
         {
             if(!defined($page_old{"price"}{$uri})){
                 $smstext.="New: ".$page_new{"price"}{$uri}." ".$page_new{"name"}{$uri}." $site$uri\n ";
             }
             elsif($page_new{"price"}{$uri} ne $page_old{"price"}{$uri}){
                 $smstext.="Price ".$page_old{"price"}{$uri}." -> ".$page_new{"price"}{$uri}." ".$page_new{"name"}{$uri}." $site$uri\n";
             }
             if(!defined($page_old{"name"}{$uri})){
                 # already done for price
             }
             elsif($page_new{"name"}{$uri} ne $page_old{"name"}{$uri}){
                 $smstext.="Name changed from ".$page_old{"name"}{$uri}." to ".$page_new{"name"}{$uri}." for $site$uri\n";
             }
         }
         if($smstext ne ""){
             sendsms($smstext);
         }
     }
     else{
         # previous search is successful, but current one is failed
         # do nothing, probably a temporary problem
     }
 }
 else{ # is new search
     if(scalar keys %{$page_new{"name"}}<=0){ # both this and previous have been failed
         sendsms("Error, nothing found for page '$url'");
     }
     else{ # successful search and items found
         sendsms("Found ".(scalar keys %{$page_new{"name"}})." items, page '$url' monitoring started");
     }
 }

 foreach my $uri(keys %{$page_new{"name"}})
 {
     print "uri: $uri, name: ".$page_new{"name"}{$uri}.", price: ".$page_new{"price"}{$uri}."\n";
     if($page_new{"price"}{$uri} eq $page_old{"price"}{$uri}){print "old price the same\n";}
     else{print "old price = ".$page_old{"price"}{$uri}."\n";}
     if($page_new{"name"}{$uri} eq $page_old{"name"}{$uri}){print "old name the same\n";}
     else{print "old name = ".$page_old{"name"}{$uri}."\n";}
 }

Source: https://habr.com/ru/post/268857/
