Automatic visualization of the python code. Part Two: Implementation

In the first part of the article, the flowcharts and the available tools for working with them were briefly discussed. Then all the graphic primitives needed to create a graphical representation of the code were discussed. An example of an environment that supports such a graphical representation is shown in the picture below.

Environment that supports graphical code presentation

In the second part we will talk about the implementation, performed mainly on Python. The implemented and planned functionality will be discussed, as well as the proposed micro markup language.

general information

The implementation of the technology was carried out as part of an open source project called Codimension . All source codes are available under the GPL v.3 license and are located in three repositories on github : two Python expansion modules cdm-pythonparser and cdm-flowparser plus the IDE itself . Expansion modules are mainly written in C / C ++, and the IDE is on Python series 2. For the graphic part, the Python QT library wrapper is used - PyQT.
')
Development is made on Linux and for Linux. Mostly used distribution Ubuntu.

The environment is designed to work with projects written in Python series 2.

Architecture

The diagram below shows the IDE architecture.

IDE Architecture

Blue in the diagram indicates parts developed in the project. Yellow - third-party Python modules, and green - third-party binary modules.

From the very beginning of the project, it was obvious that one developer could not do enough to develop all the required components from a clean slate within a reasonable time. Therefore, existing Python packages were used wherever possible and reasonable. This diagram perfectly reflects the decision.

Only three parts are developed within the project. The IDE is written in Python to speed up development and simplify experiments. Expansion modules are written in C / C ++ for better responsiveness of the IDE. The task of the brief parser is to report all entities found in the Python file (or buffer), such as imports, classes, functions, global variables, documentation lines, etc. The availability of such information makes it possible to implement, for example, such functionality: structuredly display the contents of the file and provide navigation, provide analysis of certain, but never used, functions, classes and global variables in the project. The task of flow parser to provide in a convenient form the data necessary for drawing the diagram.

All other components are third-party. PyQT was used for the interface and network part. QScintilla as the main text editor and some other elements, such as redirected I / O of debugging scripts or showing the svn version of a specific revision file. Graphviz was used to calculate the location of graphic elements in a dependency diagram, etc. Other pre-built Python packages were also used: pyflakes, pylint, filemagic, rope, gprof2dot, etc.

Conveyor from code to graphic

The implementation of the transition from text to graphics is based on the principle of a pipeline. At each stage, a certain part of the work is performed, and the results are transferred to the next stage. The diagram below shows all the stages of the conveyor, on the left of which the input text arrives, and on the right on the output a drawn diagram is obtained.

Conveyor from code to graphic

Work begins with the construction of the syntax tree on the source text. Then the tree is bypassed, and all the detected code blocks, loops, functions, etc. decomposed into a hierarchical data structure. After that, one more pass is made on the source code, as a result of which information about comments is collected. At further stages of the pipeline, it is more convenient to have comments already associated with recognized language constructs, rather than as separate data. Therefore, the next step is to merge comments with a hierarchical structure representing the code. The described actions are performed by the flow parser extension module, which is written in C / C ++ to achieve the best performance.

Subsequent stages are already implemented inside the IDE and written in Python. This provides greater ease of experimentation with rendering compared to a C ++ implementation.

First, in a data structure called a virtual canvas, graphic elements are placed in accordance with the data structure received from the expansion module. Then comes the rendering phase, the main task of which is to calculate the size of all graphic elements. Finally, all graphic elements are drawn properly.

We will discuss all these stages in detail.

Building a syntax tree

This is the very first stage on the way from text to graphics. Its task is to parse the source text and create a hierarchical data structure. This is convenient to do with the help of the syntax tree built on the source text. Obviously, there was no wish to develop a new Python parser specifically for the project, but on the contrary, there was a desire to use something already prepared. Fortunately, a suitable function was found right in the shared library of the Python interpreter. This C is a function that builds a tree in memory for the specified code. To visualize the operation of this function, a utility was written, which visually shows the constructed tree. For example, for source text:

#!/bin/env python # encoding: latin-1 def f(): # What printed? print 154

Such a tree will be built (for brevity, only a fragment is shown):

  $ ./tree test.py
 Type: encoding_decl line: 0 col: 0 str: iso-8859-1
   Type: file_input line: 0 col: 0
     Type: stmt line: 4 col: 0
       Type: compound_stmt line: 4 col: 0
         Type: funcdef line: 4 col: 0
           Type: NAME line: 4 col: 0 str: def
           Type: NAME line: 4 col: 4 str: f
           Type: parameters line: 4 col: 5
             Type: LPAR line: 4 col: 5 str: (
             Type: RPAR line: 4 col: 6 str:)
           Type: COLON line: 4 col: 7 str::
           Type: suite line: 4 col: 8
             Type: NEWLINE line: 4 col: 8 str:
             Type: INDENT line: 6 col: -1 str:
             Type: stmt line: 6 col: 4
               Type: simple_stmt line: 6 col: 4
                 Type: small_stmt line: 6 col: 4
                   Type: print_stmt line: 6 col: 4
                   .  .  .

In the output, each line corresponds to a tree node, nesting is shown by indents, and for nodes, all available information is shown.

In general, the tree looks very good: there is information about the row and column, there is a type of each node that corresponds to the formal grammar of Python. But there are problems. First, there were comments in the code, but there is no information about them in the tree. Secondly, the information on the line and column numbers for the encoding is not true. Thirdly, the very name of the encoding has changed. The code was latin-1, and the syntax tree reports iso-8859-1. In the case of multiline string literals, there is also a problem: there is no information about line numbers in the tree. All these surprises should be taken into account in the code that traverses the tree. However, all the problems described are rather trivial compared to the complexity of the whole parser.

The extension module defines the types that will be visible in the Python code in the subsequent stages. Types correspond to all recognizable elements, for example Class, Import, Break, etc. In addition to the case-specific data members and functions, they are all intended to describe an element in terms of fragments: where a piece of text begins and where it ends.

When traversing a tree, a hierarchical structure is formed, an instance of the ControlFlow class, in which all recognized elements are laid as needed.

Collect comments

Due to the fact that the information about the comments was not in the syntax tree (obviously, the interpreter does not need them), and they are needed to display the code without loss, I had to introduce an additional pass through the source code. For this pass, information about each comment line is retrieved. It's easy to do this, thanks to the simple Python grammar and the absence of both multi-line comments and a preprocessor.

Comments are collected in the form of a list of fragments, where each fragment describes one line of a comment with a set of attributes: line number, column number of beginning and end, absolute positions in the file of the beginning and end of a comment.

For example for the code:

 #!/bin/env python # encoding: latin-1 def f(): # What printed? print 154

Three fragments of the form will be collected:

  Line: 1 Pos: 1 ...
 Line: 2 Pos: 1 ...
 Line: 5 Pos: 5 ...

Merging Comments with Code Snippets

When building a chart, it is more convenient to deal not with two different data structures — the executable code and comments — but with one. This simplifies the process of placing elements on a virtual canvas. Therefore, another step is performed in the extension module: merging comments with recognized elements.

Consider an example:

 # leading comment a = 10 # side comment 1 # side comment 2

Merge comments with code

As a result of passing through the syntax tree for the code above, an instance of the CodeBlock class will be formed, among others. It has, among other fields, the body, leadingComment and sideComment fields, which describe the corresponding elements in terms of fragments. The body field will be filled with information from the syntax tree, and the comment fields will contain None.

According to the results of the passage to collect comments, a list of three fragments will be formed. When merging, the first fragment will be used to fill in the leadingBillment field in CodeBlock, and the second and third fragments for the sideComment field. The merge is based on line numbers available for both sources of information.

Thus, at the output of the merge stage, there is a completely ready unified hierarchical data structure about the contents of a file or buffer in memory.

Module performance

The pipeline stages described above are written in C / C ++ and decorated in the form of a Python module. From general considerations, we wanted to make the work of all these stages quick in order to avoid unpleasant delays when redrawing diagrams during text modification pauses. To check the performance of the module was launched on the existing platform:

Intel Core i5-3210M laptop
Ubuntu 14.04 LTS

for processing all files of the standard installation of Python 2.7.6. With 5707 files, processing took about 6 seconds. Of course, files have a different size and the module operation time depends on it, but the result is on average about 1 millisecond per file not on the fastest hardware, it seems to me more than acceptable. In practice, the text that needs to be processed is often already contained in memory and the time of disk operations is completely gone, which also has a positive effect on performance.

Placement on a virtual canvas

The purpose of this stage of the pipeline is to place on the virtual canvas all the necessary elements, taking into account the relationships between them. Virtual canvas can be imagined as a canvas with rectangular cells. A cell can be empty, it can contain one graphic element or a nested virtual canvas. At this stage, only the relative placement of the elements is important, not the exact dimensions.

At the same time, the canvas does not have a fixed size and can grow down and to the right as needed. This approach corresponds to the prepared data structure and method of drawing a chart. The process starts in the upper left corner. As needed, new rows and columns are added. For example, a new line will be added for the next block of code, and if the block has a side comment, then a new column.

The set of graphic elements that can be placed in the cells corresponds approximately to recognizable elements of the language with a small extension. For example, in a cell, a connector leading from top to bottom may be needed, but there is no such element in the language.

For a virtual canvas, a list of lists (two-dimensional array) is used, empty at the beginning of the process.

Consider a simple example to illustrate the process.

 a = 10 # side comment 1 # side comment 2

Placing graphic elements on canvas

The left of the picture above shows the data structure, formed by the results of code analysis. A ControlFlow instance contains several attributes and a suite container, in which only one element is a CodeBlock instance.

In the initial state, the canvas is empty, and the ControlFlow bypass begins. It was decided to draw the module as a scope, i.e. like a rectangle with rounded edges. For convenience of the subsequent calculations of the sizes of graphic elements taking into account indents, the rectangle of the visibility area is conventionally divided into components: faces and corners. In the uppermost corner of the diagram, there will be the upper left corner of the module's rectangle; therefore, we add a new line to the canvas, and in the line add one column and place an element in the scope corner cell. The upper edge of the rectangle can not be placed to the right, since the vertical reserve is already at the expense of the first cell, and the rectangle will still be drawn as a single shape at the moment when the scope corner is detected at the canvas traversal stage.

Go to the next line. The module has a header in which the start line and encoding are specified. The values of the corresponding fields are None, but the title still needs to be drawn. Therefore, we add a new line to the canvas. On the diagram, the heading should be inside the rectangle with a small indentation, therefore you should not immediately place the heading in the created line. In the first cell should be placed the left side, and then the title. Therefore, we add two columns and in the first place the scope left, and in the second - the scope header. It is not necessary to place the right side in the third cell. First, it is not yet known how many columns will be in the widest row. And secondly, the scope rectangle will still be drawn in its entirety when the scope corner is detected.

The module could have a documentation line, and one more line would be needed for it. But there is no documentation line, so go to the suite. The first element in the suite is a block of code. We add a new line, and in it we add a column for the left side. Then we add another column and place the graphic element for the code block in the cell. In the example, the block has a side comment, which should be located on the right, so we add another column and place a side comment in the cell.

There are no more elements in the suite, so the process of preparing a virtual canvas ends. The lower left corner of the rectangle and the lower edge can be omitted for reasons similar to those described above for the upper and right edges. It is only necessary to take into account what they mean when calculating the geometric dimensions of graphic elements.

Rendering

The task of this stage of the pipeline is to calculate the sizes of all graphic primitives for drawing on the screen. This is done by traversing all the placed cells, calculating the required dimensions and storing them in the attributes of the cell.

Each cell has two calculated widths and two heights - the minimum required and actually required, taking into account the size of neighboring cells.

We first discuss how height is calculated. It happens line by line. Consider the line in which the elements corresponding to the assignment operator are placed. Here you can see that the assignment operator occupies one text line, and the comment - two. This means that a cell with a comment will require more vertical pixels on the screen when drawing. On the other hand, all cells in the row must be the same height, otherwise the cells below will be drawn with an offset. This implies a simple algorithm. It is necessary to bypass all the cells in the row and calculate for each the minimum height. Then select the maximum value from those just calculated and save it as the actually required height.

A little more complicated is the situation with the calculation of the width of the cells. From this point of view there are two types of lines:

those whose width must be calculated taking into account the width of the cells in the adjacent row
such, the width of the cells which can be calculated without taking into account adjacent rows

A good example of dependent strings is the if statement. The branch that will be drawn under the condition block can be of arbitrary complexity, and therefore of arbitrary width. And the second branch, which should be drawn to the right, has a connector from the condition block, located in the line above. This means that the width of the connector cell must be calculated depending on the rows below.

Thus, for independent lines, the width is calculated in one pass and the minimum width coincides with the actual.

For regions of dependent lines, the algorithm is more complicated. First, the minimum widths of all cells in the region are calculated. And then for each column, the actual width is calculated as the maximum value of the minimum required width for all cells in a region column. That is very similar to what was done to calculate the height in the line.

Calculations are performed recursively for nested virtual canvases. The calculations take into account various settings: the metric of the selected font, fields for different cells, text indents, etc. Upon completion of the phase, everything necessary is already prepared for placing graphic primitives already on the screen canvas.

Drawing

The drawing stage is very simple. Since the QT library is used for the user interface, a graphic scene is created with dimensions calculated at the previous stage. Then a recursive traversal of the virtual canvas is performed and the necessary elements with the necessary coordinates are added to the graphic scene.

The bypassing process starts from the upper left corner and the current coordinates are set to 0, 0. Then the string is bypassed, after processing each cell, the width is added to the current x coordinate. When moving to the next line, the x coordinate is reset to 0, and the height of the just processed line is added to the y coordinate.

This process of obtaining graphics for the code ends.

Present and future

We now discuss the functionality that has already been implemented and what can be added in the future.

The list of what is done is rather short:

Automatic update of the chart in the pauses of code changes.
Manual synchronization of visible text and graphics in both directions. If the input focus is in a text editor, and a special key combination is pressed, then the diagram searches for the primitive that most closely matches the current cursor position. Next, the primitive is highlighted, and the chart is scrolled to make the primitive visible. In the opposite direction, synchronization is performed by double-clicking on the primitive, which causes the text cursor to move to the desired line of code and scroll the text editor, if necessary.
Scaling chart. The current implementation uses the built-in scaling tools of the QT library. In the future, it is planned to replace it with scaling by changing the font size and recalculating the sizes of all elements.
Export charts to PDF, PNG and SVG. The quality of output documents is determined by the implementation of the QT library, since it is precisely its means that are used for this functionality.
Navigation bar scopes. Graphics intensively use the idea of scope, so a typical diagram contains a lot of nested areas. The navigation bar shows the current path in terms of nested scopes for the current position of the mouse cursor over the graphic scene.
Individual branch placement switching for the if statement. By default, the N branch is drawn below, and the Y branch is to the right. The diagram allows you to swap the location of branches using the context menu of the condition block.
Individual replacement of the text of any graphic primitive. Sometimes there is a desire to replace the text of any block with arbitrary. For example, a condition in terms of variables and function calls can be long and not at all obvious, whereas a phrase in natural language can better describe what is happening. The diagram allows you to replace the displayed text with arbitrary text and shows the source text in the tooltip.
Individual replacement of colors of any primitive. Sometimes there is a desire to draw attention to any part of the code by highlighting graphic primitives. For example, a potentially dangerous area can be highlighted in red or select a common color for elements that are responsible for common functionality. The diagram allows you to change the colors of the background, font and stroke of primitives.

The combined use of already implemented individual changes of primitives, as practice shows, can significantly change the type of diagram.

Opportunities that can complement the existing foundation are limited only by imagination. Here are the most obvious.

Automatic synchronization of text and graphics when scrolling.
Editing support on the chart: delete, move and add blocks. Editing text inside individual blocks. Copy and paste blocks.
Support for group operations with blocks.
Visualize debugging on the chart. At a minimum, the highlight of the current block and the current line in it.
Support search on the chart.
Print support.
Managing the hiding / showing of various elements: comments, lines of documentation, function bodies, classes, cycles, etc.
Highlighting various types of imports: within the project, system, unidentified.
Support for additional non-language blocks or pictures on charts.
Smart scaling. You can enter several fixed levels of scale: all elements, without comments and documentation lines, only class and function headers, dependencies between files in a subdirectory, and external link pointers. If you fix this behavior for the mouse wheel with any modifier, you can get overview information very quickly.
Convolution of several blocks into one and deployment back. A group of blocks that perform a joint task can be highlighted in the diagram and collapsed into one block with its own graphics and provided text. The natural limitation here is that the group should have one entrance and one exit. Such functionality can be useful when working with unknown code. When it comes to understanding what multiple blocks are doing, you can reduce the complexity of the diagram by combining the group and replacing it with one block with the desired signature, for example, “MD5 calculation”. Of course, at any time it will be possible to return to the details. This function can be considered as the introduction of the third dimension in the diagram.

CML v.1

The features listed in the previous section can be divided into two groups:

Required to save information in connection with the code.
Completely independent of the code.

For example, the scaling functionality is completely independent of the code. The current scale should rather be saved as the current IDE setting.

On the other hand, the switching of the if branches is associated with a specific operator and information about this must be somehow preserved. After all, at the next session of the work, if should be drawn as it was prescribed earlier.

Obviously, there are two ways to save additional information: either directly in the source text, or in a separate file or even multiple files. When making a decision, such considerations were taken into account:

Imagine that a team of developers is working on a project, and some of them successfully use graphical code presentation capabilities. And the other part of the principles for working with the code uses only vim. In this case, if the additional information is stored separately from the code, then it will be extremely difficult, if not impossible, to maintain its consistency when alternately editing the code by members of different camps. Surely additional information will be inappropriate to reality at some point in time.
If the approach of additional files is chosen, they are more likely to litter the contents of the project and require additional efforts from the team when working with version control systems.
When a developer enters additional markup - for example, replacing a long code of an incomprehensible condition with a suitable phrase - this is not done for entertainment, but because such a mark has value. It would be good to keep this value available for those who do not use graphics, even if not in such a beautiful form.

Thus, if there is a compact solution for storing additional information directly in files with source text, then it is better to use it. Such a solution was found and was called CML: Codimension markup language.

CML is a micro markup language that uses python comments. Each CML comment consists of one or more adjacent lines. The format of the first line is selected as follows:

 # cml <> <> [ =]

And the format of the continuation lines is as follows:

 # cml+ <  cml >

The cml and cml + literals are what distinguishes a CML comment from others. The integer version is introduced for future use if CML evolves. The type determines how exactly the comment will affect the chart. The type is a string identifier, for example rt (short for replace text). And key pairs = values provide an opportunity to describe all the necessary parameters.

As you can see, the format is simple and easy to read. Thus, the requirement to extract additional useful information is satisfied, not only when using the IDE, but also when viewing the text. In this case, the only voluntary agreement between those who use and do not use graphics is this: do not break the CML comments.

CML: replace text

Recognizing CML comment for text replacement has already been implemented. It may appear as a leading comment before a block, for example:

 # cml 1 rt text="Believe me, I do the right thing here" False = 154

Block of code with modified text

There is no support from the graphical interface. You can add such a comment now only from a text editor. The purpose of the text parameter is quite obvious, and the type rt is chosen based on the abbreviation of the words replace text.

CML: switching branches if

Recognizing the CML comment for switching branches if is also already implemented. There is support both from the side of the text and from the side of graphics. Optionally, a comment will be inserted in the right place of the code from the context menu of the condition block, and then the diagram is generated anew.

 # cml 1 sw if False == 154: print("That's expected") else: print("Hmmm, has the code above been run?")

The if statement with the N branch to the right

In the above example, the N branch is drawn to the right of the condition.

It is seen that this CML comment does not require any parameters. And its sw type is selected based on the abbreviation of the word switch.

CML: color change

Recognizing the CML comment for a color change is already implemented. It may appear as a leading comment before a block, for example:

 # cml 1 cc background="255,138,128" # cml+ foreground="0,0,255" # cml+ border="0,0,0" print("Danger! Someone has damaged False")

Block with individual colors

Supported color change for the background (parameter background), font color (parameter foreground) and color of faces (parameter border). The type cc is chosen based on the abbreviation of the words custom colors.

Support from the graphical interface yet. You can add such a comment now only from a text editor.

CML: group convolution

There is no support for this type of comment now, but it is already clear how the functionality can be implemented.

The sequence of actions may be as follows. The user selects a group of blocks in the diagram, taking into account the restriction: the group has one input and one output. Next, select the item from the context menu: merge into a group. Then the user enters the title for the block that will be drawn instead of the group. At the end of the input, the diagram is redrawn in the updated form.

Obviously, for a group as a single entity, there are points in the code where it starts and where it ends. This means that you can insert CML comments into these places, for example:

 # cml gb uuid="..." title="..." . . . # cml ge uuid="..."

Here, the uuid parameter is automatically generated at the time of creating the group and is needed in order to correctly find the pair of the beginning of the group and its end, since the levels of nesting can be any number of times. The purpose of the title parameter is obvious - this is the text entered by the user. Record types gb and ge are introduced for reasons of abbreviations of the words group begin and group end, respectively.

The presence of uuid also allows you to diagnose various errors. For example, as a result of editing the text, one of a couple of CML comments could be deleted. Another possible use case for uuid is memorization in IDE groups, which should be shown as folded in the next session.

Side effects

The practice of using the tool has shown that the technology has interesting side effects that were not considered at all during the initial development.

First, it turned out to be convenient to use the generated charts for documentation and for discussion with colleagues, often not related to programming, but being subject matter experts. In such cases, a code was prepared that was not intended to be executed, but conforming to the logic of actions and corresponding to the formal syntax of python. The generated diagram was either inserted into the documentation, or printed and discussed. Additional convenience was manifested in the ease of making changes to the logic - the diagram instantly redraws with all the necessary indents and aligns without the need for manual operations with graphics.

Secondly, an interesting purely psychological effect was manifested. The developer, opening his own long-written code using Codimension, noticed that the diagram looks ugly - too complicated or confusing. And in order to get a more elegant diagram, I made changes to the text, actually performing refactoring and simplifying the code. Which in turn leads to a decrease in the complexity of understanding and further support of the code.

Thirdly, in spite of the fact that development is executed for the Python, the technology can quite be extended to other programming languages. Python was chosen as a testing ground for several reasons: the language is popular, but at the same time syntactically simple.

Thanks

I would like to thank everyone who helped in the work on this project. Thanks to my colleagues Dmitry Kazimirov, Ilya Loginov, David Makelkhani and Sergey Fukanchik for help with various aspects of development at different stages.

Special thanks to the authors and developers of Python open source packages that were used in the work on the Codimension IDE.

Source: https://habr.com/ru/post/320674/

All Articles