How changelog search and parsing work in AllMyChanges

Want to look inside and find out how AllMyChanges.com works? Today, I’ll tell you a little about how our robot works and why it manages to find release information so well.

In fact, our whole robot is just a set of functions.
Search and processing of changelogs consists of several stages:

1. work out how to obtain the data for a given URL;
2. use the selected method to download the data to disk;
3. go through the downloaded files and extract the pieces that have a version number and a description;
4. work out which pieces are really part of a changelog and which are just rubbish;
5. add the good pieces to the database.

Stages 1, 2 and 5 are quite mechanical and do not require any special intelligence from the robot.
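To make the "set of functions" idea concrete, here is a minimal sketch of how the five stages could be chained together. All names in it (Chunk, run_pipeline and the callables passed in) are hypothetical and chosen for illustration; the real robot may be organized differently.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Chunk:
    """A candidate changelog entry: a version number plus its description."""
    version: str
    description: str
    source_file: str


def run_pipeline(
    url: str,
    choose_downloader: Callable[[str], Callable[[str], str]],
    extract: Callable[[str], Iterable[Chunk]],
    postprocess: Callable[[List[Chunk]], List[Chunk]],
    save: Callable[[List[Chunk]], None],
) -> List[Chunk]:
    """Run the five stages in order; every stage is a plain function."""
    download = choose_downloader(url)   # stage 1: pick a download method
    workdir = download(url)             # stage 2: fetch the data to disk
    chunks = list(extract(workdir))     # stage 3: find version + description
    good = postprocess(chunks)          # stage 4: throw away the rubbish
    save(good)                          # stage 5: store what is left
    return good
```

Because each stage is passed in as a function, any of them can be swapped out without touching the others.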

Data acquisition

In the previous article, I mentioned that AllMyChanges supports several different data sources. First, it can fetch data from Git and Mercurial repositories. Second, it can download HTML pages, either one at a time or by recursively crawling an entire site. And third, our robot can download some information from the App Store and Google Play.

At the first stage, heuristic rules determine which function to call at stage two. If the choice works, it is stored in the database so that it does not have to be made again every time.
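As an illustration of such a choice, here is a small sketch. The rules, the downloader names ("git", "hg", "http" and so on) and the use of a dict in place of the database are all assumptions made for this example, not the service's actual rules.

```python
from typing import Dict
from urllib.parse import urlparse


def guess_downloader(url: str) -> str:
    """Pick a downloader name from the URL using a few simple rules."""
    host = urlparse(url).netloc.lower()
    if host in ("itunes.apple.com", "apps.apple.com"):
        return "appstore"
    if host == "play.google.com":
        return "google_play"
    if host.startswith("hg."):
        return "hg"
    if host == "github.com" or url.endswith(".git"):
        return "git"
    return "http"  # fall back to downloading plain HTML pages


def choose_downloader(url: str, cache: Dict[str, str]) -> str:
    """Remember the choice (here in a dict standing in for the database)."""
    if url not in cache:
        cache[url] = guess_downloader(url)
    return cache[url]
```

On later runs the stored choice is reused, so the heuristics only run once per URL.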

During the download, the robot either simply saves the files to disk or performs additional processing. For example, when it takes data through the GitHub Releases API, it generates a Markdown file from it. Roughly the same happens with data taken from an app store or from VCS history.
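The Markdown generation might look roughly like the sketch below. It takes release objects already fetched from the GitHub Releases API and relies on their tag_name, name, published_at and body fields; the function name and the exact layout of the output are assumptions for this example.

```python
from typing import Dict, List


def releases_to_markdown(releases: List[Dict]) -> str:
    """Turn a list of GitHub release objects into one Markdown document."""
    lines = []
    for release in releases:
        # Fall back to the tag when the release has no human-readable name.
        title = release.get("name") or release["tag_name"]
        date = (release.get("published_at") or "")[:10]  # keep YYYY-MM-DD
        lines.append(f"## {title} ({date})" if date else f"## {title}")
        lines.append("")
        lines.append((release.get("body") or "").strip())
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

After this step the rest of the pipeline can treat the result like any other Markdown changelog found on disk.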

Extract

Extracting data is not easy, especially from large repositories with lots of files. Each file has to be visited and, if there is a suitable parser for it, parsed and searched for parts that look like release notes.

At the moment, the service supports three main file formats:

Markdown;
reStructuredText;
HTML.

But because the processing stages are independent of each other, it is easy both to add new functions for downloading data and to extend the list of supported formats.
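One way to keep the list of formats extensible is a registry of parsers keyed by file extension. The sketch below is only an illustration: the register decorator, the PARSERS table and the deliberately naive Markdown parser are hypothetical and much simpler than what a real parser has to handle.

```python
import re
from pathlib import Path
from typing import Callable, Dict, List, Tuple

VERSION_RE = re.compile(r"\d+\.\d+(?:\.\d+)*")

Parser = Callable[[str], List[Tuple[str, str]]]
PARSERS: Dict[str, Parser] = {}


def register(*extensions: str):
    """Decorator that makes a parser available for the given extensions."""
    def decorator(func: Parser) -> Parser:
        for ext in extensions:
            PARSERS[ext] = func
        return func
    return decorator


@register(".md", ".markdown")
def parse_markdown(text: str) -> List[Tuple[str, str]]:
    """Return (version, description) pairs found under Markdown headings."""
    chunks = []
    version, body = None, []
    for line in text.splitlines():
        match = VERSION_RE.search(line) if line.startswith("#") else None
        if match:
            if version is not None:
                chunks.append((version, "\n".join(body).strip()))
            version, body = match.group(0), []
        elif version is not None:
            body.append(line)
    if version is not None:
        chunks.append((version, "\n".join(body).strip()))
    return chunks


def extract(filename: str, text: str) -> List[Tuple[str, str]]:
    """Parse a file if a parser is registered for its extension."""
    parser = PARSERS.get(Path(filename).suffix.lower())
    return parser(text) if parser else []  # no suitable parser: skip the file
```

Supporting a new format then means writing one more function and decorating it with register, without changing the download or post-processing stages.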

To make the robot's job easier, you can point it to the right place in the file system using the advanced settings. When adding a new package to the service, you can specify a list of directories and files in which to look for a changelog, or a list of directories to avoid.

For example, suppose you know that the project has a subdirectory with documentation and that the changelog should be looked for there. In this case, enter docs/ in the “search list” field, and the robot will only search there. If you don't know exactly where to look, but you do know that the src/ folder contains nothing about release notes, add that directory to the “ignore list” field.
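The effect of these two fields can be sketched as a simple path filter. The function and parameter names below are assumptions for this example; only the “search list” and “ignore list” ideas come from the service itself.

```python
from typing import Iterable, List, Sequence


def _inside(path: str, entry: str) -> bool:
    """True if path is the given file or lies inside the given directory."""
    entry = entry.rstrip("/")
    return path == entry or path.startswith(entry + "/")


def filter_paths(
    paths: Iterable[str],
    search_list: Sequence[str] = (),
    ignore_list: Sequence[str] = (),
) -> List[str]:
    """Keep only the files the robot is allowed to look at."""
    selected = []
    for path in paths:
        # A non-empty search list restricts the robot to those places only.
        if search_list and not any(_inside(path, e) for e in search_list):
            continue
        # Anything under an ignored directory is skipped.
        if any(_inside(path, e) for e in ignore_list):
            continue
        selected.append(path)
    return selected
```

With paths docs/changes.md, src/main.py and README.md, a search list of docs/ leaves only docs/changes.md, while an ignore list of src/ leaves docs/changes.md and README.md.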

Post processing

The most interesting part begins once you have some pieces of data that might be a changelog. This is where the main magic happens. The difficulty is that it is not always possible to say for sure whether the extracted text belongs to a changelog or not. Of course, if its metadata says that it was extracted from a file named ChangeLog.md, the decision is easy. But unfortunately, this is not always the case.

Therefore, to determine whether a version number and description is part of a changelog, or a data piece has been extracted by mistake, the AllMyChanges robot uses a number of heuristic rules.

To begin with, all extracted pieces are grouped by the files from which they were extracted, then by directories, and so on. For each group, an estimate is made of the likelihood that it can be a changelog. File and directory names are evaluated, as well as the total number of pieces that they contain.
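A toy version of this grouping and scoring is sketched below. The name hints and the weights are invented for illustration; the real heuristics are not described in this article and are surely more elaborate.

```python
import posixpath
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical hints: names that suggest a file or directory holds a changelog.
NAME_HINTS = ("changelog", "changes", "history", "news", "release")


def score_group(path: str, chunk_count: int) -> int:
    """Estimate how likely a file or directory is to be a changelog."""
    score = min(chunk_count, 10)  # more versions found: more convincing
    if any(hint in path.lower() for hint in NAME_HINTS):
        score += 10               # a telling name counts for a lot
    return score


def group_and_score(chunks: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """chunks: (source_file, version) pairs. Returns a score per group."""
    counts: Dict[str, int] = defaultdict(int)
    for source_file, _version in chunks:
        counts[source_file] += 1                  # group by file
        directory = posixpath.dirname(source_file) or "."
        counts[directory + "/"] += 1              # and by directory
    return {path: score_group(path, n) for path, n in counts.items()}
```

Scoring directories as well as files is what lets a folder of one-version-per-file release notes win over a single source file that merely mentions a version number.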

This is necessary in order to correctly handle release notes scattered across different files, as in Django, for example: ... tree / master / docs / releases

Further, duplicates are removed, which can occur for a variety of reasons. The most common reason is the complex structure within the document, where the same version is mentioned in the headers of different levels.
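A simple way to collapse such duplicates is sketched below. Which of the duplicates the robot actually keeps is not stated here; preferring the entry with the longest description is an assumption made for this example.

```python
from typing import Dict, List, Tuple


def remove_duplicates(chunks: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """chunks: (version, description) pairs. Keeps one entry per version."""
    best: Dict[str, str] = {}
    for version, description in chunks:
        # Assumption: the longest description is the most useful one.
        if version not in best or len(description) > len(best[version]):
            best[version] = description
    return list(best.items())  # dicts keep the order of first appearance
```

This handles the nested-heading case: an empty entry from a top-level heading is replaced by the fuller entry found under a lower-level heading for the same version.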

Also, at the post-processing stage, release dates are determined, provided of course that they are given in the file. Another heuristic rule applies here: if every version in the file has a release date except one, that version is considered not yet released and is marked as "Unreleased".
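The rule can be written down in a few lines. This is a sketch of the rule as described above; the function name and the tuple layout are assumptions for this example.

```python
from typing import List, Optional, Tuple


def mark_unreleased(
    versions: List[Tuple[str, Optional[str]]],
) -> List[Tuple[str, Optional[str], bool]]:
    """versions: (number, date or None). Adds an 'unreleased' flag to each."""
    missing = [number for number, date in versions if date is None]
    # Only when exactly one version lacks a date, and others have one,
    # do we treat that version as not yet released.
    rule_applies = len(versions) > 1 and len(missing) == 1
    return [
        (number, date, rule_applies and date is None)
        for number, date in versions
    ]
```

If several versions lack dates, nothing is flagged: in that case the missing dates are more likely a property of the file than a sign of an upcoming release.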

Speaking of dates. AllMyChanges keeps two dates for each version of a library: the release date and the discovery date. Very often both dates appear in the notification emails, and they differ. The “release date” is always the date given in the original source. Sometimes it is missing, sometimes it is even in the future; anything can happen. The “discovery date” is the moment our robot first found this version. The discovery date is always present, and if the original source gives no release date, the discovery date is used instead.

What are the two dates for? The first records what is written in the document; the second tells us whether users need to be notified about the release of a new version.

Imagine this situation. A Russian developer of some library finished a new version with the beautiful number 1.2.3 just in time for the New Year. He wrote down its release date as 2015-01-01 and, without even pushing to GitHub, went off to bed. Having woken up and sobered up on January 7, he decides to do a git push.

What happens when the AllMyChanges robot sees version 1.2.3 in the repository? Right! It will record that its release date is 2015-01-01, but will put 2015-01-07 as the discovery date. This is needed so that when the digest is built the next day, version 1.2.3 gets into it, even though its release date is old. Thus, you will receive a notification about the update no matter what :)
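The same story in code: a sketch of a version record with both dates and a digest selection driven by the discovery date. The class, field and function names are hypothetical and chosen for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Version:
    number: str
    release_date: Optional[date]  # as written in the source, may be missing
    discovered_at: date           # when the robot first saw this version

    @property
    def shown_date(self) -> date:
        """Fall back to the discovery date when the source gives no date."""
        return self.release_date or self.discovered_at


def versions_for_digest(versions: List[Version], since: date) -> List[Version]:
    """Select versions by discovery date, not by release date."""
    return [v for v in versions if v.discovered_at >= since]


late_push = Version("1.2.3", date(2015, 1, 1), date(2015, 1, 7))
digest = versions_for_digest([late_push], since=date(2015, 1, 7))
```

Here digest contains version 1.2.3 even though its release date is a week old; a filter on release_date would have silently dropped it.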

All this allows AllMyChanges to cope with the search for information about releases in all hard-to-reach places and deliver it to your inboxes on time.

Questions?

If you have any questions, feel free to ask them in the hall, or write to support@allmychanges.com. I will be happy to answer them.

Source: https://habr.com/ru/post/261341/
