Good day, Habr residents!

I think many of you have at some point had the idea of saving articles from Habr "just in case." The same thought occurred to me a little over a year ago.
I present a new version of the program for downloading articles from Habrahabr, GeekTimes and Megamozg in PDF format.
The new project is called HabraParse.
The project consists of a library that parses the sites and a script that uses only part of the library's capabilities. The script is written in Python 3 and requires the docopt, requests and weasyprint modules (all of them install easily with pip install <name>).
Currently, the script has the following features:
- Download an article by its ID;
- Save the list of URLs from a given user's favorites;
- Download the articles in a user's favorites to a folder in PDF or HTML format (the HTML output is not yet up to par, so PDF is the default, although it is much slower to generate).
The --gt / --mm options let you save articles from GeekTimes.ru and Megamozg.ru.
Brief description of script parameters

Usage:
    ./habraparse.py save_favs_list [--gt|--mm] <username> <out_file>
    ./habraparse.py save_favs [--gt|--mm] [-cn --save-html --limit=N] <username> <out_dir>
    ./habraparse.py save_post [--gt|--mm] [-c --save-html] <topic_id> <out_file>
By default, all commands work with HabraHabr.ru.
With the --gt / --mm options, the script works with GeekTimes.ru / Megamozg.ru instead.
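Commands:
    save_favs_list - save to <out_file> the list of URLs from the favorites of <username>
    save_favs - save to <out_dir> the favorite articles of <username>
    save_post - save to <out_file> the article with the given ID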
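To illustrate how a usage text like the one above maps onto commands and sites, here is a minimal sketch built around docopt. The helper names select_site, command_name and main are hypothetical and are not taken from habraparse.py; the dictionary keys follow docopt's convention of naming entries after the commands and options in the usage text.

```python
USAGE = """Usage:
  ./habraparse.py save_favs_list [--gt|--mm] <username> <out_file>
  ./habraparse.py save_favs [--gt|--mm] [-cn --save-html --limit=N] <username> <out_dir>
  ./habraparse.py save_post [--gt|--mm] [-c --save-html] <topic_id> <out_file>
"""


def select_site(args):
    """Map the --gt / --mm flags to a site; Habrahabr is the default."""
    if args.get("--gt"):
        return "geektimes.ru"
    if args.get("--mm"):
        return "megamozg.ru"
    return "habrahabr.ru"


def command_name(args):
    """Return the name of the command that was given on the command line."""
    for name in ("save_favs_list", "save_favs", "save_post"):
        if args.get(name):
            return name
    raise ValueError("no command given")


def main(argv=None):
    # Imported lazily so the helpers above work without docopt installed.
    from docopt import docopt  # third-party: pip install docopt

    args = docopt(USAGE, argv=argv)
    print(command_name(args), "on", select_site(args))
```

Calling main(["save_post", "--gt", "<some id>", "post.pdf"]) with a real topic ID would parse the arguments and report the chosen command and site; the actual script's dispatch logic may well differ.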
Enjoy. If you run into errors, please send me a private message or file a bug on the project's GitHub page.
If you are missing a feature, write a feature request in the comments and I will try to implement it as far as I can.
Technical details
In fact, Habraparse is first and foremost a library for working with information from the websites Habrahabr.ru, GeekTimes.ru and MegaMozg.ru. It lets you:
- get a user's profile information by user name;
- get from the user profile the articles the user wrote and the articles added to "Favorites";
- fetch an article by its ID and parse it.
The library's name is extremely original: habr.
User information is represented by the classes HabraUser, GeektimesUser and MegamozgUser of the habr.user module and includes:
- full name and nickname;
- date of registration;
- date of birth;
- karma data (the karma itself and the number of votes);
- rating and place in the rating;
- country, region, city;
- the number of followers;
- number of posts;
- number of comments;
- subscriptions to hubs, companies.
Article information is represented by the classes HabraTopic, MegamozgTopic and GeektimesTopic of the habr.topic module and includes:
- article id;
- title;
- author's name;
- rating;
- the article text (left unconverted; links to images and other resources are not touched);
- comments: their number and list with the text of comments;
- the list of hubs the article belongs to.
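The fields listed above can be pictured as a simple record. The sketch below is only an illustration of that data shape: TopicSketch and its attribute names are hypothetical and do not reproduce the real HabraTopic class or its API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TopicSketch:
    """Illustrative stand-in for an article record; NOT the real habr.topic class."""
    post_id: int
    title: str
    author: str
    rating: int
    text: str  # raw article HTML, left unconverted
    comments: List[str] = field(default_factory=list)
    hubs: List[str] = field(default_factory=list)

    @property
    def comments_count(self) -> int:
        # The number of comments is derived from the stored comment list.
        return len(self.comments)


# Made-up sample values, purely for demonstration.
topic = TopicSketch(
    post_id=1,
    title="Example title",
    author="example_user",
    rating=10,
    text="<p>Body</p>",
    comments=["First comment"],
    hubs=["Python"],
)
```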
The script uses the habr library for parsing and the weasyprint library for generating PDF. Weasyprint was chosen because it has the simplest interface and was the only library I tried that managed to produce a decent PDF file. As it turned out, however, it is very slow.
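As a rough sketch of the PDF step, the snippet below wraps article HTML in a minimal page and hands it to WeasyPrint through its HTML(string=...).write_pdf(...) call. The helpers build_page and save_pdf are hypothetical names chosen for this example; the script's actual page template and function layout are not shown in this post.

```python
import html


def build_page(title: str, author: str, body_html: str) -> str:
    """Wrap raw article HTML in a minimal standalone page."""
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<title>{html.escape(title)}</title></head><body>"
        f"<h1>{html.escape(title)}</h1>"
        f"<p>Author: {html.escape(author)}</p>"
        f"{body_html}</body></html>"
    )


def save_pdf(page_html: str, out_file: str) -> None:
    """Render the page to a PDF file with WeasyPrint."""
    # Imported lazily so build_page works without weasyprint installed.
    from weasyprint import HTML  # third-party: pip install weasyprint

    HTML(string=page_html).write_pdf(out_file)
```

The title and author are escaped because they are plain text, while the article body is inserted as is, matching the library's behaviour of leaving the article text unconverted.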
If you know of other PDF generation libraries that work better, write in the comments or send me a private message. I will say right away, though, that the project was developed for Python 3 from the start, so there is no need to tell me about excellent PDF libraries for Python 2.
That is all. If you like it, enjoy it! And if anyone wants to build their own script with all the bells and whistles on top of this library, it is all in your hands!
UPD. By popular demand, I have updated the Docker container image icoz/habraparse. Usage instructions are here.