
Grabber for one book site

One day I decided to write a grabber for a book website, and now I want to share the subtleties of implementing this kind of software. All information is provided for informational purposes only.


The grabber is built around QWebEngineView so that it does not have to handle authorization itself: the user simply logs in through the embedded browser.


Sharing cookies between QNetworkAccessManager and QWebEngineView


For this, Qt provides QWebEngineCookieStore and QNetworkCookieJar:


MainWindow::MainWindow(QWidget *parent) :
    QMainWindow(parent),
    m_ui(new Ui::MainWindow),
    m_store(nullptr),
    m_cookieJar(new QNetworkCookieJar(this)),
    m_networmManager(new QNetworkAccessManager(this)),
    m_try(0),
    m_currentPage(0),
    m_capches(1)
{
    m_ui->setupUi(this);

    // Mirror every cookie the web view receives into our own cookie jar.
    m_store = m_ui->webView->page()->profile()->cookieStore();
    Q_ASSERT(m_store != nullptr);
    connect(m_store, &QWebEngineCookieStore::cookieAdded, this, &MainWindow::handleCookieAdded);
    m_store->loadAllCookies();
    m_ui->webView->load(QUrl("http://./"));

    // The network manager sends the shared cookies with every page request.
    m_networmManager->setCookieJar(m_cookieJar);
    connect(m_networmManager, &QNetworkAccessManager::finished, this, &MainWindow::handleImage);
}

void MainWindow::handleCookieAdded(const QNetworkCookie &cookie)
{
    m_cookieJar->insertCookie(cookie);
}

When we open a book for reading and click the Grab button, the current URL of the web view is taken. It has the form:


 http://./static/or3/view/or.html?art_type=4&file=26599915&bname= -  ReactJS&cover=%2Fstatic%2Fbookimages%2F26%2F59%2F99%2F26599923.bin.dir%2F26599923.cover.jpg&art=22880082&user=-&uuid=- 

From it we extract the file id and the book name:


void MainWindow::onGrabButtonClicked()
{
    if (!parseUrl(m_ui->webView->url())) {
        return;
    }
    const auto paths = QStandardPaths::standardLocations(QStandardPaths::DownloadLocation);
    if (paths.isEmpty()) {
        qWarning() << "There is no standard path to download";
        return;
    }
    downloadTo(*paths.begin());
}

bool MainWindow::parseUrl(const QUrl &url)
{
    const auto query = QUrlQuery(url.query(QUrl::FullyDecoded));
    if (query.isEmpty()) {
        return false;
    }
    // The reader URL must carry all of these parameters.
    static const QVector<QString> fields = { "file", "bname", "uuid" };
    for (const auto &key : fields) {
        if (!query.hasQueryItem(key)) {
            qWarning() << "Query hasn't param" << key;
            return false;
        }
    }
    m_name = query.queryItemValue("bname", QUrl::FullyDecoded);
    m_file = query.queryItemValue("file");
    m_format = "jpg"; // the first request is made in jpg
    return true;
}
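One detail worth keeping in mind when parsing such URLs: if the query is decoded before it is split, an encoded `&` or `=` inside a value (for example in the book name) can be mistaken for a separator. Below is a minimal sketch in plain C++, without Qt, of the safer order: split first, decode afterwards. The names `hexValue`, `percentDecode` and `parseQuery` are made up for this illustration and are not part of the grabber; the sketch also ignores URL fragments.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Convert one hex digit to its value, or -1 if the character is not hex.
int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Decode %XX escapes; malformed escapes are copied through unchanged.
std::string percentDecode(const std::string &in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()) {
            const int hi = hexValue(in[i + 1]);
            const int lo = hexValue(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out.push_back(static_cast<char>(hi * 16 + lo));
                i += 2;
                continue;
            }
        }
        out.push_back(in[i]);
    }
    return out;
}

// Split the raw query on '&' and '=' first, then decode each key and value.
std::map<std::string, std::string> parseQuery(const std::string &url) {
    std::map<std::string, std::string> items;
    const std::size_t q = url.find('?');
    if (q == std::string::npos) return items;
    std::size_t pos = q + 1;
    while (pos < url.size()) {
        std::size_t amp = url.find('&', pos);
        if (amp == std::string::npos) amp = url.size();
        const std::string pair = url.substr(pos, amp - pos);
        if (!pair.empty()) {
            const std::size_t eq = pair.find('=');
            const std::string key = percentDecode(pair.substr(0, eq));
            const std::string value =
                eq == std::string::npos ? "" : percentDecode(pair.substr(eq + 1));
            items[key] = value;
        }
        pos = amp + 1;
    }
    return items;
}
```

With this order, a value such as `A%26B` comes back as `A&B` instead of being cut at the ampersand.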

MainWindow::downloadTo configures QPdfWriter and QPainter:


void MainWindow::downloadTo(const QString &path)
{
    QDir dir(path);
    m_writer = std::make_unique<QPdfWriter>(dir.absoluteFilePath(m_name + ".pdf"));

    // A4 portrait without margins, so each page image fills the whole sheet.
    QPageLayout layout(QPageSize(QPageSize::A4), QPageLayout::Portrait, QMarginsF(0, 0, 0, 0));
    m_writer->setPageLayout(layout);
    m_writer->setResolution(96);
    m_writer->setTitle(m_name);

    m_painter = std::make_unique<QPainter>();
    m_painter->begin(m_writer.get());

    nextImage(); // start downloading pages
}
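To get a feeling for what A4 at 96 DPI means in pixels, here is a small worked calculation in plain C++. The helper names `mmToPixels`, `PageSizePx` and `a4InPixels` are invented for this sketch, and rounding to the nearest pixel is an assumption: Qt may round the device size slightly differently.

```cpp
#include <cassert>
#include <cmath>

// Convert a length in millimetres to pixels at the given resolution.
// Rounding to nearest is an assumption of this sketch.
int mmToPixels(double mm, int dpi) {
    assert(dpi > 0);
    constexpr double mmPerInch = 25.4;
    return static_cast<int>(std::lround(mm / mmPerInch * dpi));
}

struct PageSizePx {
    int width;
    int height;
};

// A4 is 210 x 297 mm.
PageSizePx a4InPixels(int dpi) {
    return { mmToPixels(210.0, dpi), mmToPixels(297.0, dpi) };
}
```

For 96 DPI this gives roughly 794 x 1123 pixels, so a page image requested at a width of 640 is stretched when it is scaled to the page width, while a 1280-wide one would be shrunk.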

Download page


Pages are downloaded by url of the form:


 http://./pages/read_book_online/?file=26599915&page=2&rt=w1280&ft=gif 

Parameter   Description
rt          image size; takes the values w640 or w1280
ft          image format: gif or jpg
page        page number
file        file id

The jpg format is used for pages with graphics, and gif for pages with text.
If a page does not exist at http://./pages/read_book_online/?file=26599915&page=0&rt=w1280&ft=gif, request http://./pages/read_book_online/?file=26599915&page=0&rt=w1280&ft=jpg instead.


We get:


void MainWindow::nextImage()
{
    QUrlQuery query;
    query.addQueryItem("file", m_file);
    query.addQueryItem("rt", "w640");
    query.addQueryItem("ft", m_format);
    query.addQueryItem("page", QString::number(m_currentPage));
    QUrl url(BasePath);
    url.setQuery(query);
    m_networmManager->get(QNetworkRequest(url));
    ++m_currentPage;
}

void MainWindow::handleImage(QNetworkReply *reply)
{
    reply->deleteLater();
    if (reply->error() != QNetworkReply::NoError) {
        qWarning() << "Network error" << reply->errorString();
        // After three failed attempts in a row, finish the PDF and stop.
        if (m_try == 3) {
            m_painter->end();
            m_painter.reset();
            m_writer.reset();
            return;
        }
        // Otherwise retry the same page in the other format.
        if (m_format == "gif") {
            m_format = "jpg";
        } else {
            m_format = "gif";
        }
        --m_currentPage;
        ++m_try;
        nextImage();
        return;
    }
    m_try = 0;
    qDebug() << "Write page" << m_currentPage << reply->url();

    std::string f;
    if (m_format == "jpg") {
        f = "JPEG";
    } else {
        f = "GIF";
    }
    const auto data = reply->readAll();
    const auto source = QImage::fromData(data, f.c_str());
    if (source.isNull()) {
        // Not an image (most likely a captcha page): request the same page again.
        //handleCapcha(data, reply->url());
        --m_currentPage;
        nextImage();
        return;
    }
    m_ui->pages->setText(QString::number(m_currentPage));

    // Scale the image to the page width, draw it and move on to the next PDF page.
    const auto dest = source.scaledToWidth(m_writer->width()/*, Qt::SmoothTransformation */);
    m_painter->drawImage(QPoint(0, 0), dest);
    m_writer->newPage();
    nextImage();
}
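The request building and the gif/jpg fallback from the code above can be looked at in isolation. The following is only a sketch in plain C++ without Qt: `kBasePath` stands in for the `BasePath` constant, whose real value is not shown in the article, and `buildPageUrl`, `otherFormat`, `RetryState` and `retryWithOtherFormat` are names invented for the illustration.

```cpp
#include <cassert>
#include <string>

// Assumed stand-in for BasePath; the host is elided exactly as in the article.
const std::string kBasePath = "http://./pages/read_book_online/";

// Build the page URL from the four parameters described in the table.
std::string buildPageUrl(const std::string &file, int page,
                         const std::string &size, const std::string &format) {
    assert(page >= 0);
    return kBasePath + "?file=" + file + "&page=" + std::to_string(page) +
           "&rt=" + size + "&ft=" + format;
}

// Pages come either as gif (text) or jpg (graphics).
std::string otherFormat(const std::string &format) {
    return format == "gif" ? "jpg" : "gif";
}

struct RetryState {
    std::string format; // format used for the last request
    int tries;          // failed attempts in a row
};

// After a failed request: switch format and report whether to retry.
// Returns false once maxTries attempts in a row have failed.
bool retryWithOtherFormat(RetryState &state, int maxTries = 3) {
    if (state.tries >= maxTries) {
        return false;
    }
    state.format = otherFormat(state.format);
    ++state.tries;
    return true;
}
```

Keeping the retry counter next to the format makes the stop condition explicit: a run of failures in both formats is treated as the end of the book, which is also how the grabber decides to close the PDF.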

Captcha


The site does have a captcha, but it does not show up every time. When it does, the page says:


We noticed a strange activity from your computer. Perhaps we were mistaken, and this activity does not come from you. In this case, confirm that you are not a robot and continue to use our site.

It turned out that you can simply re-request the page and carry on downloading images. If you would rather not pretend to be a robot, you can handle the captcha instead:


void MainWindow::handleCapcha(const QByteArray &page, const QUrl &url)
{
    ++m_capches;
    // Show the captcha page in the web view so the user can solve it.
    m_ui->webView->page()->setHtml(page, url);
    m_ui->captches->setText(QString::number(m_capches));

    // Keep a local event loop running for five minutes before continuing.
    QEventLoop loop;
    constexpr int duration = 1000 * 60 * 5;
    QTimer::singleShot(duration, &loop, &QEventLoop::quit);
    loop.exec();
}

Here we load the captcha page into the web view and keep a local event loop running for five minutes, which gives us time to enter the captcha.
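The grabber notices a captcha only indirectly, when the response fails to decode as an image. One possible refinement, shown here as a sketch in plain C++ rather than as part of the original code, is to look at the first bytes of the response: JPEG data starts with the bytes FF D8 FF and GIF data with the text GIF87a or GIF89a, so anything else is probably an HTML page. `Payload` and `sniffPayload` are names invented for this example.

```cpp
#include <string>

enum class Payload { Jpeg, Gif, Other };

// Classify a response body by its leading signature bytes.
Payload sniffPayload(const std::string &data) {
    if (data.size() >= 3 &&
        static_cast<unsigned char>(data[0]) == 0xFF &&
        static_cast<unsigned char>(data[1]) == 0xD8 &&
        static_cast<unsigned char>(data[2]) == 0xFF) {
        return Payload::Jpeg;
    }
    if (data.compare(0, 6, "GIF87a") == 0 || data.compare(0, 6, "GIF89a") == 0) {
        return Payload::Gif;
    }
    return Payload::Other; // e.g. an HTML page with a captcha
}
```

The same check could also pick the decoder from the actual content instead of the requested format, in case the server answers with a different format than the one asked for.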


Summary


A 256-page book saved as an A4 PDF at 96 DPI weighs 51.7 MB, against 5.8 MB for the encrypted document.


The code is available as a GitHub Gist.



Source: https://habr.com/ru/post/334412/

