Progrobot: programming language help bot

When you write code, you regularly need to look up help on a specific function, module, and so on. I usually go to cppreference.com or docs.python.org for this, but it is rarely instantaneous: at a minimum you have to click through several pages, and in the Python documentation it is often hard to find the relevant information on a page. On top of that, Google often sends you to the documentation for Python 2 rather than Python 3, and you have to switch manually.

So I thought a Telegram bot could be useful: one that knows all this information and returns help on a specific function, class, module, etc. on request.

That is how the @Progrobot bot came about. You can send it the name of a function and get a short description, or send the name of a module (in Python) or a header file (in C++) and get a list of all the functions in it, and so on. For now there is help for C++ (from cppreference) and Python 3 (from docs.python.org). I also planned to add a Stack Overflow search, but it turned out that the search in its API works poorly and there is a hard limit on the number of requests. In short, it is disabled for now; maybe later I will download an offline dump and finish it.

About the bot itself

The data is stored in MongoDB, in two tables per language. The first holds the actual reference for objects (functions, classes, modules, etc.): the “canonical” name, a link to the page the documentation was taken from, the module (Python module or C++ header) the object belongs to, the usage format, the description, a list of child elements (methods of a class, etc.) and a copyright string. A brief description is also stored for each child element; I took it to be the first sentence of that element's description. (Detecting the first sentence turned out to be a non-trivial task as well.)
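As a rough illustration, one reference record could be modelled as below. This is only a sketch: the class name and all field names are hypothetical and are not taken from the bot's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ReferenceEntry:
    """One documented object; every field name here is illustrative."""
    name: str          # "canonical" name, e.g. "std::vector::push_back"
    url: str           # page the documentation was taken from
    module: str        # Python module or C++ header the object belongs to
    usage: str         # usage format (signature)
    description: str   # full description, stored as HTML
    copyright: str     # copyright string for the source documentation
    # (child name, brief description) pairs, e.g. the methods of a class
    children: List[Tuple[str, str]] = field(default_factory=list)
```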
In the second table I keep the index: for each object I store its possible names. For example, for std::vector::push_back the index contains “push_back”, “push_back vector” and “push_back std vector”, each with a reference to the entry in the first table. Specifically, I split the full name of the object into tokens, take all suffixes of the resulting list, and sort the tokens of each suffix alphabetically. Several documents can correspond to one index row (for example, push_back exists not only in vector).
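The key generation described here can be sketched in a few lines. The function names are hypothetical, and splitting on "::", "." and whitespace is an assumption about how names are tokenized.

```python
import re


def tokenize(full_name):
    """Split a qualified name into tokens (assumed separators: '::', '.', spaces)."""
    return [t for t in re.split(r"::|\.|\s+", full_name) if t]


def index_keys(full_name):
    """Return one key per suffix of the token list, with tokens sorted alphabetically."""
    tokens = tokenize(full_name)
    return [" ".join(sorted(tokens[i:])) for i in range(len(tokens))]


print(index_keys("std::vector::push_back"))
# ['push_back std vector', 'push_back vector', 'push_back']
```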

The bot's logic is now quite simple: split the request into tokens, sort them alphabetically, and look up the corresponding entry in the index. If it is found, great; if not, then apparently there is no such object. If there are several matching entries, we pick the most suitable one (for simplicity I decided to pick, roughly, the one whose “canonical” name contains the fewest tokens; for example, the “get” query returns std::get rather than some xml.etree.ElementTree.Element.get). All matching entries can be viewed with the /list command.
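A minimal sketch of this lookup, assuming the index is available as an in-memory dictionary from key to a list of canonical names (a stand-in for the real database query; all names are hypothetical):

```python
import re


def split_tokens(text):
    """Split a query or a canonical name into tokens."""
    return [t for t in re.split(r"::|\.|\s+", text) if t]


def best_match(query, index):
    """Return the candidate whose canonical name has the fewest tokens, or None."""
    key = " ".join(sorted(split_tokens(query)))
    candidates = index.get(key, [])
    if not candidates:
        return None
    return min(candidates, key=lambda name: len(split_tokens(name)))


index = {
    "get": ["xml.etree.ElementTree.Element.get", "std::get"],
    "push_back vector": ["std::vector::push_back"],
}
print(best_match("get", index))               # std::get
print(best_match("vector push_back", index))  # std::vector::push_back
```

Because both the stored keys and the query are sorted, the order of words in the request does not matter.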

In the database I store descriptions as HTML in order to preserve code formatting and the like. Telegram also allows a simple subset of HTML in messages, so I wrote a converter that throws out all unsupported tags and puts line breaks in appropriate places. One surprise here: the descriptions contained local links (<a href="#anchor">). I left them in and everything worked; such links simply did nothing in the Telegram client, which was not a big deal. Then one day I discovered that the bot could not send almost any messages. Apparently Telegram had added an extra check on the validity of link addresses and stopped accepting local links. I had to keep only links with a full address.
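A converter of this kind might look roughly like the sketch below, built on the standard html.parser module. It is not the bot's actual code: the set of kept tags and the set of tags that produce a line break are assumptions, and links are kept only when their address starts with http:// or https://.

```python
from html import escape
from html.parser import HTMLParser

KEPT_TAGS = {"b", "strong", "i", "em", "code", "pre"}         # assumed whitelist
BREAK_AFTER = {"p", "div", "li", "tr", "h1", "h2", "h3"}      # assumed block tags


class TelegramHtml(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.link_stack = []  # True for every <a> that was actually emitted

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            keep = href.startswith(("http://", "https://"))
            self.link_stack.append(keep)
            if keep:
                self.out.append('<a href="%s">' % escape(href, quote=True))
        elif tag in KEPT_TAGS:
            self.out.append("<%s>" % tag)
        elif tag == "br":
            self.out.append("\n")

    def handle_endtag(self, tag):
        if tag == "a":
            if self.link_stack and self.link_stack.pop():
                self.out.append("</a>")
        elif tag in KEPT_TAGS:
            self.out.append("</%s>" % tag)
        elif tag in BREAK_AFTER:
            self.out.append("\n")

    def handle_data(self, data):
        # Text is re-escaped so that '<' and '&' stay valid in the output.
        self.out.append(escape(data, quote=False))


def to_telegram_html(source):
    parser = TelegramHtml()
    parser.feed(source)
    parser.close()
    return "".join(parser.out).strip()
```

For a local link the text is kept and only the surrounding tag is dropped, so the message stays readable.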

I also had to tinker a bit because the length of a Telegram message is limited to 4096 characters (I had a hard time finding the constant itself in the Telegram documentation), and the descriptions of some objects are longer. I added some rather convoluted code that cuts long messages into shorter ones at suitable places, plus the /cont command to get the continuation. One unexpected catch: I made sure that all brackets in the cut-off part of the message were balanced. And then I ran into the Python module random, whose description contains the phrase “... generates a random float uniformly in the semi-open range [0.0, 1.0)”. I had to treat square and round brackets as equivalent.
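One possible way to do such splitting is sketched below. It assumes that a "suitable place" is a line break at which every bracket opened so far in the chunk has been closed, and it counts all bracket kinds together, so “[0.0, 1.0)” is treated as balanced. The function name and the fallback to a hard cut are illustrative choices, not the bot's actual behavior.

```python
LIMIT = 4096  # maximum message length mentioned above

OPENING = "([{"
CLOSING = ")]}"


def split_message(text, limit=LIMIT):
    """Cut text into chunks of at most `limit` characters.

    Prefers the last newline inside the window at which the bracket depth
    is zero; if there is none, falls back to a hard cut at `limit`.
    """
    chunks = []
    while len(text) > limit:
        depth = 0
        cut = None
        for pos, ch in enumerate(text[:limit]):
            if ch in OPENING:
                depth += 1
            elif ch in CLOSING:
                depth = max(depth - 1, 0)  # any closer matches any opener
            elif ch == "\n" and depth == 0:
                cut = pos + 1  # cut right after this newline
        if cut is None:
            cut = limit
        chunks.append(text[:cut])
        text = text[cut:]
    if text:
        chunks.append(text)
    return chunks
```

The first chunk would be sent right away and the remaining ones kept for the /cont command.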

About parsing

Parsing the HTML from cppreference turned out to be a pleasure. One page per entity, good text in a strictly reference style, sensible classes and ids in the HTML tags, a list of child objects right on the page, and so on. I took three pages as examples, wrote fairly simple code using BeautifulSoup that parsed these pages well, and it all worked. After that I only tweaked small things; there are still some rough spots I have not gotten around to fixing, but on the whole everything works. The non-trivial tweaks were filling in the description and child elements for header files (so that the “algorithm” request returns a list of all the functions in that header), and more careful handling of template specializations (initially std::vector<bool> was split into the tokens std, vector and bool, so it was found by the plain “bool” request; I had to strip the specialization before tokenizing).
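Stripping the specialization before tokenizing can be sketched with a small regular-expression loop. The function name is hypothetical, and the sketch deliberately ignores operator names that contain angle brackets.

```python
import re


def strip_template_args(name):
    """Remove <...> groups, including nested ones, from a C++ name.

    Note: this naive version would also mangle names such as operator<=>.
    """
    previous = None
    while previous != name:
        previous = name
        # Each pass removes the innermost bracket groups.
        name = re.sub(r"<[^<>]*>", "", name)
    return name


print(strip_template_args("std::vector<bool>"))             # std::vector
print(strip_template_args("std::hash<std::vector<bool>>"))  # std::hash
```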

Parsing the Python documentation, on the other hand, was much more “fun”. It is written like a book meant to be read from start to finish. As a result, design philosophy, usage tips, examples and the actual reference that I need are all mixed together, and on top of that there are linking phrases such as “The pprint module defines one class:” that cannot be told apart from the description of the module itself. So after everything worked on three example pages, the Python parsing still took a long time to finish, and even now it has more problems than the C++ one. For example, that pprint phrase still shows up in the bot's answer, where it looks odd.

Among the problems I had to fix: the descriptions of a number of entities begin with the words "New in version xx" or "Source code: ...", and I was taking the first sentence as the brief description. I found no better solution than to hardcode that strings of this kind cannot be a brief description. For decorators, the @ symbol had to be stripped in some places. The start of a new entity's description is marked by a tag with the class “class”, “classmethod”, “exception” or something else, 9 variants in total, and I did not find them all right away (in C++ each page is a separate entity, so the problem does not arise). My script detected some entities in two places at once (the unittest.mock module, for example). The texts contain tables and other structures that translate poorly into the Telegram message format (and that I would rather not translate); itertools is the leader in such structures. There, on finding a line that was entirely in bold, I assumed that the description had ended. Finally, on docs.python.org it is very hard to work out which license applies to the documentation itself; I even had to write to docs@python.org. On the other hand, Python has none of the problems with template specializations, and no notion of a “header file” at all: every object has a unique and natural “parent”.
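The hardcoded rule for brief descriptions could look something like the sketch below. The prefix list comes from the two examples above; the function name, the paragraph-list input and the naive first-sentence regular expression are assumptions, and the latter will misfire on abbreviations such as "e.g.".

```python
import re

SKIP_PREFIXES = ("New in version", "Source code:")


def brief_description(paragraphs):
    """Return the first sentence of the first paragraph that is not boilerplate."""
    for paragraph in paragraphs:
        text = paragraph.strip()
        if not text or text.startswith(SKIP_PREFIXES):
            continue
        # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
        match = re.match(r"(.+?[.!?])(\s|$)", text, flags=re.S)
        return match.group(1) if match else text
    return ""


print(brief_description([
    "New in version 3.4.",
    "Source code: Lib/pathlib.py",
    "This module offers classes. More text.",
]))
# This module offers classes.
```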

About the framework

To avoid calling the Telegram API directly, I use telepot, a Python framework for Telegram bots. It can do a lot, including maintaining conversations with users, and writing a bot with it turned out to be fairly simple. However, it is updated regularly and supports an overwhelming number of use cases, so it is rather hard to figure out which option you need in a particular case.

It turned out that different messages from Telegram have noticeably different structures. Some objects have just an id field, while others also state what kind of id it is (message_id or file_id). Or, for example, the Message object has the chat and text fields, while the CallbackQuery object has no chat field and has a data field instead of text. I would like to handle Message and CallbackQuery in the same way, but that does not work, so I had to add small hacks. Admittedly, I wrote this at the beginning of the summer, and the framework was under active development over the summer, so things may be better now.
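One small hack of this kind might be a helper that reduces both object types to the same pair of values. The sketch below assumes that updates arrive as plain dictionaries shaped like the Bot API JSON objects; the helper name is hypothetical and is not part of telepot.

```python
def extract_request(update):
    """Return (chat_id, text) for either a Message or a CallbackQuery dict."""
    if "data" in update:
        # CallbackQuery: the chat, if any, is nested inside the original message.
        message = update.get("message") or {}
        chat_id = message.get("chat", {}).get("id", update["from"]["id"])
        return chat_id, update["data"]
    # Message: chat and text are top-level fields.
    return update["chat"]["id"], update.get("text", "")


message = {"message_id": 1, "chat": {"id": 42}, "text": "vector push_back"}
callback = {
    "id": "77",
    "from": {"id": 5},
    "message": {"message_id": 1, "chat": {"id": 42}},
    "data": "/cont",
}
print(extract_request(message))   # (42, 'vector push_back')
print(extract_request(callback))  # (42, '/cont')
```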

Code

GitHub: github.com/petr-kalinin/progrobot. The code there is pretty ugly, the result of my many attempts to get to grips with the telepot interface.

Source: https://habr.com/ru/post/310162/
