
Convenient downloading from Books.ru, or getting acquainted with WWW::Mechanize

When you have a book, that is good,
and when you don't, that is bad.
(In place of an epigraph)

As everyone knows, books.ru recently ran a promotion that let you buy a large number of e-books at a fair price. User icoz wrote a script for batch downloading, but it is not very convenient: the books are saved under awkward names and have to be downloaded by hand.
In short, I decided that everything should be convenient and automatic. No sooner said than done, which is especially timely given the sale coming up tomorrow.

Step 1. Load the required modules.
We will need:
    use WWW::Mechanize;
    use HTTP::Request::Common;
    use LWP;
    use LWP::UserAgent;

These are the module itself plus several helper modules it depends on. If, like me, you are on Ubuntu, installing WWW::Mechanize from CPAN is not advisable; install the distribution package instead:
    sudo apt-get install libwww-mechanize-perl
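
A quick way to confirm the installation is to try loading each module. The following is only a minimal sketch: the helper name missing_modules is hypothetical and is not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the names of the modules that fail to load.
sub missing_modules {
    my @names = @_;
    my @missing;
    for my $name (@names) {
        # Turn Foo::Bar into Foo/Bar.pm, the path form that require() accepts.
        (my $file = "$name.pm") =~ s{::}{/}g;
        push @missing, $name unless eval { require $file; 1 };
    }
    return @missing;
}

my @missing = missing_modules(
    qw(WWW::Mechanize HTML::TokeParser DateTime Compress::Raw::Zlib)
);
print @missing
    ? "Missing modules: @missing\n"
    : "All required modules are installed.\n";
```

Any module reported as missing can usually be installed from the distribution's package repository in the same way as above.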

Step 2. Create the Mechanize object and read the script parameters, login and password, from the command line.
    my $mech    = WWW::Mechanize->new();
    my $booklog = $ARGV[0];
    my $bookpsw = $ARGV[1];
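
The full script contains a commented-out usage check. A minimal sketch of how such a check could look is shown here; the helper name parse_credentials and the sample values are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: validates the argument list and returns (login, password).
sub parse_credentials {
    my @args = @_;
    die "Usage: $0 <login> <password>\n" if @args < 2;
    return @args[0, 1];
}

# In the real script the call would be parse_credentials(@ARGV);
# sample values are used here so the sketch runs on its own.
my ($booklog, $bookpsw) = parse_credentials('reader', 'secret');
print "Logging in as $booklog\n";
```

Failing early with a usage message avoids sending an empty login to the site when the arguments are forgotten.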

Step 3. Log in to the site.
    my $resp = $mech->get('http://www.books.ru/member/login.php');
    $mech->cookie_jar->set_cookie(0, 'cookie_first_timestamp', DateTime->now->epoch, '/', 'www.books.ru');
    $mech->cookie_jar->set_cookie(0, 'cookie_pages', '1', '/', 'www.books.ru');
    # $mail and $password are copies of $booklog and $bookpsw (see the full script)
    $resp = $mech->post('http://www.books.ru/member/login.php', [
        'login'    => $mail,
        'password' => $password,
        'go'       => 'login',
        'x'        => rand_from_to(40, 50),
        'y'        => rand_from_to(1, 20),
        'token'    => ''
    ]);

Note lines 2 and 3, the two set_cookie calls. On the site itself these cookies are set by JavaScript, but pulling in a JavaScript engine just to compute two values is not worth it; it is easier to rewrite that logic in Perl.
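
To see what seeding the cookie jar amounts to, here is a minimal sketch that works on a standalone HTTP::Cookies object instead of a live Mechanize session. The helper name seed_cookies is hypothetical; the cookie names and domain are the ones used above, and time() stands in for DateTime->now->epoch, since both give seconds since the epoch.

```perl
use strict;
use warnings;
use HTTP::Cookies;

# Hypothetical helper: stores the two cookies that the site's JavaScript
# would otherwise set in the browser.
sub seed_cookies {
    my ($jar) = @_;
    $jar->set_cookie(0, 'cookie_first_timestamp', time(), '/', 'www.books.ru');
    $jar->set_cookie(0, 'cookie_pages', '1', '/', 'www.books.ru');
    return $jar;
}

my $jar = seed_cookies(HTTP::Cookies->new);

# scan() calls the callback once per stored cookie;
# argument 1 is the cookie name and argument 2 is its value.
my %seen;
$jar->scan(sub { $seen{ $_[1] } = $_[2] });
print "$_ = $seen{$_}\n" for sort keys %seen;
```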

Step 4. Fetch the list of our orders and iterate over it:
    $resp = $mech->get('http://www.books.ru/member/orders/');
    # collect every order number linked from the page
    my @order_list = mkGunz($resp->content)
        =~ m{<a\shref="http://www\.books\.ru/order\.php\?order=(\d+)">}gi;
    foreach my $order_id (@order_list) {
        ...
    }
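
The extraction can be tried offline. The sketch below runs the same pattern against a hypothetical fragment of the orders page; the sample markup and the helper name order_ids are assumptions, and the real page may look different.

```perl
use strict;
use warnings;

# Hypothetical sample of the orders page; the real markup may differ.
my $html = <<'HTML';
<a href="http://www.books.ru/order.php?order=1001">Order 1001</a>
<a href="http://www.books.ru/member/">Profile</a>
<a href="http://www.books.ru/order.php?order=1002">Order 1002</a>
HTML

# In list context a /g match returns every captured group,
# so one expression yields all order numbers at once.
sub order_ids {
    my ($page) = @_;
    return $page =~ m{<a\shref="http://www\.books\.ru/order\.php\?order=(\d+)">}gi;
}

my @order_list = order_ids($html);
print "Found orders: @order_list\n";   # prints "Found orders: 1001 1002"
```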

Note the mkGunz function, which transparently decompresses the response body if the server sent it gzip-compressed.
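
The idea behind mkGunz can be tried in isolation. The sketch below is a hypothetical variant, not the author's function: it detects gzip data by its two magic bytes rather than by looking for the string "html", and it uses IO::Uncompress::Gunzip instead of Compress::Raw::Zlib.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical variant of mkGunz: returns the data unchanged unless it
# starts with the gzip magic bytes 0x1f 0x8b.
sub maybe_gunzip {
    my ($data) = @_;
    return $data unless substr($data, 0, 2) eq "\x1f\x8b";
    my $out;
    gunzip(\$data => \$out) or die "gunzip failed: $GunzipError\n";
    return $out;
}

# Round trip: compress a sample page, then run both forms through the helper.
my $plain = "<html><body>orders</body></html>";
my $packed;
gzip(\$plain => \$packed) or die "gzip failed: $GzipError\n";

print maybe_gunzip($packed) eq $plain ? "packed input: ok\n" : "packed input: mismatch\n";
print maybe_gunzip($plain)  eq $plain ? "plain input: ok\n"  : "plain input: mismatch\n";
```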

Step 5. Now we need to extract the book's authors and title from the order page. Since we parse the page with the HTML::TokeParser module, the easiest way is to walk the token stream and recognise the data we need by the URL in each link's href.
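    my $fname   = '';
    my $authors = '';
    while (my $token = $stream->get_token) {
        if ($token->[0] eq 'S' && $token->[1] eq 'a') {
            my $href = $token->[2]{'href'};
            next unless defined $href;    # skip anchors without an href
            # author links: build a comma-separated list
            $authors .= $stream->get_trimmed_text('/a').',' if ($href =~ /\/author\//);
            # the link that carries the book title
            if ($href =~ /show\=1/) {
                $fname = $stream->get_trimmed_text('/a');
                $fname =~ s/\(\sPDF\)//gi;
            }
            # the PDF download link: assemble the file name
            if ($href =~ /download\/\?file_type\=pdf/) {
                chop($authors);           # drop the trailing comma
                $fname = trim($authors.','.$fname);
                $fname =~ tr/\//_/;       # '/' cannot appear in a file name
                $fname .= '.pdf';
                ...                       # download and save the file (Step 6)
            }
        }
    }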
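
The same token walk can be tried offline on a small piece of markup. Everything in the sample HTML below is hypothetical and only mimics the three kinds of links the loop looks for; the helper name extract_file_name is also an assumption.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical markup, not copied from books.ru.
my $html = <<'HTML';
<p><a href="/author/ivanov/">Ivanov I.</a>, <a href="/author/petrov/">Petrov P.</a></p>
<p><a href="/books/123/?show=1">Perl/CGI Basics ( PDF)</a></p>
<p><a href="/download/?file_type=pdf&amp;id=123">Download</a></p>
HTML

# Returns the file name built from the authors and the title,
# or an empty list if no download link is found.
sub extract_file_name {
    my ($page) = @_;
    my $stream = HTML::TokeParser->new(\$page) or die "Cannot parse HTML\n";
    my (@authors, $title);
    while (my $token = $stream->get_token) {
        # Only start tags of <a> elements are interesting.
        next unless $token->[0] eq 'S' && $token->[1] eq 'a';
        my $href = $token->[2]{href};
        next unless defined $href;
        if ($href =~ m{/author/}) {
            push @authors, $stream->get_trimmed_text('/a');
        }
        elsif ($href =~ /show=1/) {
            $title = $stream->get_trimmed_text('/a');
            $title =~ s/\(\sPDF\)//gi;   # drop the "( PDF)" marker
            $title =~ s/\s+$//;          # and any space left behind
        }
        elsif ($href =~ m{download/\?file_type=pdf}) {
            my $name = join(',', @authors, defined $title ? $title : '');
            $name =~ tr{/}{_};           # '/' cannot appear in a file name
            return "$name.pdf";
        }
    }
    return;
}

print extract_file_name($html), "\n";   # prints "Ivanov I.,Petrov P.,Perl_CGI Basics.pdf"
```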

Step 6. Fetch and save the PDF. Several points are worth noting here. If you do not clone the Mechanize object, only one book gets downloaded, apparently because of a bug on the books.ru site. To save files with Russian letters in their names you cannot use the IO::File module, because of a bug in that module under Perl v5.14.2. Finally, call binmode so as not to corrupt the PDF files.
    my $gbm = $mech->clone();
    $resp = $gbm->get($href);
    # confirm the download form
    $resp = $gbm->submit_form(with_fields => {'agreed' => 'Y', 'go' => 1});
    my $pdfFile = $resp->content;
    $pdfFile = mkGunz($resp->content) unless ($resp->content =~ /^\%PDF/);
    print "Saving ".$fname." as ".length($pdfFile)." bytes.\n";
    # test open's return value rather than defined($fh)
    if (open(my $fh, ">", $fname)) {
        binmode($fh);
        print $fh $pdfFile;
        close($fh);
    }
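
The saving step can be exercised without the network. The sketch below is a hypothetical helper, not the author's code: it writes raw bytes under a name containing Cyrillic letters into a temporary directory. It assumes the file system expects UTF-8 encoded names, which is typical on Linux but is not guaranteed elsewhere.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: writes raw bytes to a file whose name may contain
# non-ASCII characters. Assumes UTF-8 file names on disk.
sub save_binary {
    my ($dir, $name, $bytes) = @_;
    my $path = File::Spec->catfile($dir, encode('UTF-8', $name));
    open(my $fh, '>', $path) or die "Unable to open $path: $!\n";
    binmode($fh);    # raw bytes: no newline translation, no encoding layer
    print {$fh} $bytes;
    close($fh) or die "Unable to close $path: $!\n";
    return $path;
}

my $dir  = tempdir(CLEANUP => 1);
my $name = "\x{41a}\x{43d}\x{438}\x{433}\x{430}.pdf";   # Cyrillic for "Book.pdf"
my $path = save_binary($dir, $name, "%PDF-1.4\n\x00\xff");
print "Saved ", (-s $path), " bytes\n";
```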


And finally, everything put together.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;
    use HTTP::Request::Common;
    use LWP;
    use LWP::UserAgent;
    use URI::Escape;
    use HTML::TokeParser;
    use DateTime;
    use Compress::Raw::Zlib;
    use Encode qw(decode encode);

    sub trim($);

    my $mech    = WWW::Mechanize->new();
    my $booklog = $ARGV[0];
    my $bookpsw = $ARGV[1];
    #die "Usage: books.su.pl <login> <password> \n" if (scalar @ARGV < 2);
    my $mail     = $booklog;
    my $password = $bookpsw;
    $mech->agent_alias("Linux Mozilla");
    #$mech->proxy('https', 'http://127.0.0.1:8888/');
    #$mech->proxy('http', 'http://127.0.0.1:8888/');

    # Log in, seeding the cookies that the site normally sets via JavaScript
    my $resp = $mech->get('http://www.books.ru/member/login.php');
    $mech->cookie_jar->set_cookie(0, 'cookie_first_timestamp', DateTime->now->epoch, '/', 'www.books.ru');
    $mech->cookie_jar->set_cookie(0, 'cookie_pages', '1', '/', 'www.books.ru');
    #print mkGunz($resp->content)."\n";
    $resp = $mech->post('http://www.books.ru/member/login.php', [
        'login'    => $mail,
        'password' => $password,
        'go'       => 'login',
        'x'        => rand_from_to(40, 50),
        'y'        => rand_from_to(1, 20),
        'token'    => ''
    ]);
    #print mkGunz($resp->content)."\n";

    # Collect the order numbers
    $resp = $mech->get('http://www.books.ru/member/orders/');
    my @order_list = mkGunz($resp->content)
        =~ m{<a\shref="http://www\.books\.ru/order\.php\?order=(\d+)">}gi;

    foreach my $order_id (@order_list) {
        $resp = $mech->get('http://www.books.ru/order.php?order='.$order_id);
        my $hcont  = mkGunz($resp->content);
        my $stream = HTML::TokeParser->new(\$hcont);
        $stream->empty_element_tags(1);
        my $fname   = '';
        my $authors = '';
        while (my $token = $stream->get_token) {
            # the authors follow the title, after a <br>, up to the closing </p>
            if ($authors eq '' && $fname ne "" && $token->[0] eq 'S' && $token->[1] eq 'br') {
                $authors .= cnv($stream->get_trimmed_text('/p')).',';
            }
            if ($token->[0] eq 'S' && $token->[1] eq 'a') {
                my $href = $token->[2]{'href'};
                next unless defined $href;    # skip anchors without an href
                if ($href =~ /show\=1/) {
                    $fname = cnv($stream->get_trimmed_text('/a'));
                    $fname =~ s/\(\sPDF\)//gi;
                }
                if ($href =~ /download\/\?file_type\=pdf/) {
                    chop($authors);
                    $fname = trim($authors.','.$fname);
                    $fname =~ tr/\//_/;
                    $fname .= '.pdf';
                    # a fresh clone per book, otherwise only one book is downloaded
                    my $gbm = $mech->clone();
                    $resp = $gbm->get($href);
                    $resp = $gbm->submit_form(with_fields => {'agreed' => 'Y', 'go' => 1});
                    my $pdfFile = $resp->content;
                    $pdfFile = mkGunz($resp->content) unless ($resp->content =~ /^\%PDF/);
                    print "Saving ".$fname." as ".length($pdfFile)." bytes.\n";
                    # test open's return value rather than defined($fh)
                    if (open(my $fh, ">", $fname)) {
                        binmode($fh);
                        print $fh $pdfFile;
                        close($fh);
                    } else {
                        die "Unable to open:".$fname.": ".$!."\n";
                    }
                    $authors = '';
                    $fname   = '';
                }
            }
        }
    }

    # Encoding hook; a pass-through by default
    sub cnv { return shift; } #encode('cp1251', decode('UTF-8', shift));}

    sub rand_from_to {
        my ($from, $to) = @_;
        return int(rand($to - $from)) + $from;
    }

    # Inflate gzip-compressed content; pass HTML through untouched
    sub mkGunz {
        my ($ind) = @_;
        return $ind if ($ind =~ /html/);
        my $gun = Compress::Raw::Zlib::Inflate->new(WindowBits => WANT_GZIP);
        my $out;
        my $status = $gun->inflate($ind, $out);
        if ($status == Z_OK || $status == Z_STREAM_END) {
            return $out;
        } else {
            die $status.":".$ind;
        }
    }

    sub trim($) {
        my $string = shift;
        $string =~ s/^\s+//;
        $string =~ s/\s+$//;
        return $string;
    }



Note for Windows users:
Most likely you will need to change the line $fname =~ tr/\//_/; to $fname =~ tr/\/\:\*\?\\/_/; since NTFS forbids more characters in file names than ext4 does, and to tinker with the encoding, which is what the cnv function is for.
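
A slightly broader replacement is sketched here. Windows rejects the characters \ / : * ? " < > | in file names, so this hypothetical helper, sanitize_filename, replaces all of them; it is an illustration and is not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: replaces every character that Windows forbids
# in file names with an underscore.
sub sanitize_filename {
    my ($name) = @_;
    $name =~ tr{\\/:*?"<>|}{_};
    return $name;
}

print sanitize_filename('Ivanov I.,Perl/CGI: what? why*.pdf'), "\n";
# prints "Ivanov I.,Perl_CGI_ what_ why_.pdf"
```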
Obligatory wishes for the reader: may you not miss the sale, buy plenty of books, load them onto your tablet, and read them in peace over the weekend at the cottage, without the Internet.

Legal disclaimer: since the license agreement forbids renaming the files after download, they should be saved under a correct and convenient name right away, which is exactly what this script does!

Source: https://habr.com/ru/post/236519/

