We need less powerful programming languages.

Today, many systems and programming languages are advertised as "powerful." There is nothing wrong with that as such, and almost all of us consider it a positive property. But in this post I want to argue that in many cases we need less powerful programming languages and systems. Before continuing, I should clarify that very little here is original thinking of my own. I will follow a line of thought that arose from reading Douglas Hofstadter’s book “Gödel, Escher, Bach”, which helped me pull together scattered ideas that had been wandering around in my head. A post by Philip Wadler and a video from a Scala conference also had a great influence on the material below. The key thought is:

Each increase in expressiveness places an additional burden on anyone who wants to understand the message.

I simply want to illustrate this point with examples that will be more familiar to the Python community.

A few words about definitions. What do I mean by more or less powerful programming languages? For the purposes of this post, power can be roughly described as “the freedom and ability to do whatever you want” from the point of view of the person writing the code or entering data into the system. This roughly correlates with the idea of “expressiveness,” although it is not a formal definition. In the formal sense, many languages are equally expressive because they are all Turing complete. But we developers still regard some of them as more powerful, because they let us achieve a given result with less code, or in several different ways, which gives more freedom.

But this freedom is not free. Every bit of “power” you demand when writing in a language has to be paid for by whoever “consumes” what you have written. The examples below are formally about programming, but I hope they will also strike a chord.

Someone will ask: “Does this really matter?” It matters to the extent that the output of your system needs to be “consumed”. The consumers may be software users, compilers and other developer tools, so you almost always care, not only about performance and correctness, but also about people.

Databases and schemas

At the low end of the “expressiveness” scale is something that is really data rather than a programming language. But both data and programs should be seen as “a message received by someone,” so the same reasoning applies here.

Over the years I have often been asked to add a free-form text box to a form. From the point of view of the end user, such a field is the height of “power”, because anything can be entered into it. By that logic, it is the "most useful" kind of field.

But precisely because any text can be entered, it is the least useful kind of field, since its contents are so hard to structure. Even searching such fields is unreliable because of typos and the different ways people describe the same thing. The longer I work with databases, the more I want to constrain everything that can be constrained. When I manage to do so, the resulting data becomes far more valuable. In other words, the more I limit the “power” (that is, the freedom) of whoever enters data into the system, the more I can do when that data is “consumed”.

The same can be said of database technology. Schemaless databases offer great freedom and flexibility when putting data in, and are correspondingly less useful when getting it out. A key-value store is the analogue of “free-form text”, with the same drawbacks: it is of little use if you want to extract information or do something with the data, because you cannot be sure that any particular keys are present.
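The trade-off can be seen in a few lines of Python. This is only a sketch: the `Colour` enum and the sample values are invented for illustration. It compares a free-text column with a constrained one from the consumer's point of view.

```python
from collections import Counter
from enum import Enum

# Free-form text: anything is accepted, so equal things get different spellings.
free_text_values = ["Black", "black ", "blck", "Tabby"]


class Colour(Enum):
    """A constrained field: only these values can be stored."""
    BLACK = "black"
    TABBY = "tabby"


constrained_values = [Colour.BLACK, Colour.BLACK, Colour.BLACK, Colour.TABBY]

# The consumer's view: grouping works reliably only for the constrained field.
free_counts = Counter(free_text_values)
constrained_counts = Counter(constrained_values)

print(len(free_counts))                  # number of distinct free-text spellings
print(constrained_counts[Colour.BLACK])  # all black entries land in one group

# A typo is rejected at the point of entry instead of polluting the data.
try:
    Colour("blck")
    typo_accepted = True
except ValueError:
    typo_accepted = False
```

The person entering data loses the freedom to type anything, and the person querying it gains a reliable grouping.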

HTML

In part, the success of the web is due to the deliberately limited power of its key technologies, HTML and CSS. Yes, these are markup languages rather than programming languages, but they did not end up that way by accident. It was a deliberate design principle, and one of its authors was Tim Berners-Lee. I will simply quote one passage: “From the 1960s to the 1980s, computer science put a lot of effort into creating ever more powerful programming languages. But today there are reasons to choose less powerful tools. One of them is that the “weaker” the language, the more you can do with the data stored in it. If you write it in a simple declarative form, anyone can write a program that analyzes it in a variety of ways. Broadly speaking, the Semantic Web is an attempt to map large amounts of existing data onto a common language, so that the data can be analyzed in ways its creators never dreamed of.

For example, if a web page uses RDF to describe a weather forecast, users can retrieve the data as a table, process it, average it, plot it, and compare it with other information. Compare this with a Java applet: the information can be presented very attractively, but it cannot be analyzed. A search engine has no idea what is shown on the page. The only way to find out what a Java applet does is to run it in front of a person sitting at the screen.”

The W3C took the same position: “Good practice: use the least powerful language suitable for expressing information, constraints or programs on the World Wide Web”.

This is almost the exact opposite of Paul Graham's advice (with the caveat that definitions of “power” are often far from formal): “If you have a choice of several languages, then, all other things being equal, it is a mistake to program in anything but the most powerful one”.

The MANIFEST.in file format

We now turn to "real" programming languages. As an example, I have chosen the format of the MANIFEST.in file used by distutils and setuptools. If you have ever packaged a Python library, you are probably familiar with it.

It is essentially a very small language that describes which files should be included in a Python package (relative to the directory containing MANIFEST.in, which I will call the working directory). For example:

    include README.rst
    recursive-include foo *.py
    recursive-include tests *
    global-exclude *~
    global-exclude *.pyc
    prune .DS_Store

There are two types of directives:

inclusion directives (include, recursive-include, global-include and graft);
exclusion directives (exclude, recursive-exclude, global-exclude and prune).

The question arises: how are these directives interpreted? What are their semantics?
One possible interpretation is: “A file from the working directory (or a subdirectory) is included in the package if it matches at least one inclusion directive and does not match any exclusion directive”.

This would make the language declarative. Unfortunately, that is not how it works. According to the distutils documentation, the directives in MANIFEST.in are to be understood as follows:

start with an empty list of files to include in the package (more precisely, with the default list);
process the directives in MANIFEST.in in order;
for each inclusion directive, add all matching files from the working directory to the list;
for each exclusion directive, remove all matching files from the list.

This interpretation makes the language imperative: each line of MANIFEST.in is a command whose execution has side effects. That makes the language more powerful than the declarative reading given above. Consider an example:

    recursive-include foo *
    recursive-exclude foo/bar *
    recursive-include foo *.png

As a result of these commands, .png files below foo/bar are included in the package, but all other files below foo/bar are not. Achieving the same result in the declarative language would be harder, for example:

    recursive-include foo *
    recursive-exclude foo/bar *.txt *.rst *.gif *.jpeg *.py ...
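The difference between the two readings can be sketched in Python. This is a simplified model, not the distutils implementation: it handles only recursive-include and recursive-exclude, works on an in-memory list of paths, and the function names `matches`, `declarative` and `imperative` are invented for this sketch.

```python
from fnmatch import fnmatch

# Hypothetical, simplified directives: (kind, directory, file name pattern).
DIRECTIVES = [
    ("recursive-include", "foo", "*"),
    ("recursive-exclude", "foo/bar", "*"),
    ("recursive-include", "foo", "*.png"),
]

FILES = ["foo/a.py", "foo/bar/b.txt", "foo/bar/c.png", "README.rst"]


def matches(path, directory, pattern):
    """True if path lies below directory and its file name matches pattern."""
    head, _, name = path.rpartition("/")
    below = head == directory or head.startswith(directory + "/")
    return below and fnmatch(name, pattern)


def declarative(files, directives):
    """Order-independent reading: some include matches and no exclude matches."""
    result = []
    for path in files:
        included = any(kind == "recursive-include" and matches(path, d, p)
                       for kind, d, p in directives)
        excluded = any(kind == "recursive-exclude" and matches(path, d, p)
                       for kind, d, p in directives)
        if included and not excluded:
            result.append(path)
    return result


def imperative(files, directives):
    """Order-dependent reading: run each directive against a growing list."""
    selected = []
    for kind, directory, pattern in directives:
        hits = [f for f in files if matches(f, directory, pattern)]
        if kind == "recursive-include":
            selected.extend(f for f in hits if f not in selected)
        else:
            selected = [f for f in selected if f not in hits]
    return selected


print(declarative(FILES, DIRECTIVES))  # ['foo/a.py']
print(imperative(FILES, DIRECTIVES))   # ['foo/a.py', 'foo/bar/c.png']
```

Under the imperative reading the final directive re-adds the .png file, and shuffling the directives changes the result. Under the declarative reading the order of the lines is irrelevant.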

Since the imperative language is more powerful, there is a temptation to choose it. However, the imperative version has several important flaws:

Difficulties with optimization. When it comes to interpreting MANIFEST.in and building the list of files to include in the package, there is essentially one reasonably efficient approach: first build an immutable list of all files in the directory and its subdirectories, and then apply the rules to it. Files are copied to the output list by the inclusion rules, and then removed from it by the exclusion rules. This is how it is currently implemented in Python.

This works well if you have relatively few files. But when the list runs to thousands of entries, most of which should be excluded from the package, building the final list takes a long time.

The solution looks obvious: do not descend into directories that will be excluded by some directive. But this is only possible if the exclusion directives come after all the inclusion directives.
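As a rough illustration of what "not descending" means, here is a sketch built on the standard `os.walk`. The `list_files` helper and its `pruned_dirs` parameter are assumptions made for this example, not part of distutils. It is only valid when the pruned directories are known to stay excluded, which is exactly what the imperative semantics cannot guarantee.

```python
import os
import tempfile


def list_files(root, pruned_dirs=()):
    """Return relative file paths under root, skipping pruned directories entirely."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Removing entries from dirnames in place stops os.walk from descending into them.
        dirnames[:] = [
            d for d in dirnames
            if os.path.relpath(os.path.join(dirpath, d), root) not in pruned_dirs
        ]
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)


# Build a tiny throwaway tree: one source file and a bulky directory to skip.
with tempfile.TemporaryDirectory() as root:
    for rel in ("foo/a.py", ".tox/py/lib/big.py"):
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    everything = list_files(root)
    without_tox = list_files(root, pruned_dirs={".tox"})

print(everything)
print(without_tox)
```

With pruning, the contents of the excluded directory are never listed at all, which is where the time would be saved on a tree with thousands of files.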

This problem is not theoretical. I have found that, because of the large number of files in the working directory (for example, if you use the tox tool), running setup.py sdist and a number of other commands can take up to 10 minutes. That means tox itself, which uses setup.py, becomes very slow. I am currently trying to solve this problem, but I think it will be very difficult.

At first glance an optimization is possible (touching the file system less by applying exclusion directives after all the inclusion ones), but it complicates everything, and such a patch is unlikely to be accepted. The number of code paths would grow, and with it the likelihood of bugs.

Probably the only practical solution is to avoid MANIFEST.in altogether and to optimize only the case when it is completely empty.

The flip side of “power”: MANIFEST.in files are harder to understand. First of all, the rules of the language are harder to learn. For the declarative version, the documentation would be considerably shorter than it actually is.

In addition, when reading a particular MANIFEST.in, you have to execute the commands in your head to work out the result. It would be much easier if you could arrange the lines in whatever order was convenient for you.

All this leads to packaging mistakes. For example, it is easy to believe that a global-exclude *~ directive at the beginning of MANIFEST.in means that all files whose names end in ~ (temporary files created by some editors) will be excluded from the package. In fact, the directive does nothing there, because at that point the list does not yet contain such files. If any later directive includes such files, they end up in the package by mistake. I found the following examples of this error (exclusion directives that do not work as intended):
- hgview (the directive is placed at the very beginning of the file, where it has no effect);
- django-mailer (a global exclusion directive at the beginning of the file that does nothing).

You cannot group the lines in MANIFEST.in to make them easier to read, since changing their order changes the contents of the package.

Routing

URL routing is one of the core components of Django. It is the component that parses a URL and dispatches it to the handler for that URL, possibly extracting some components from the URL along the way.

In Django, this is done with regular expressions. Suppose we have an application that displays information about kittens, and kittens/urls.py contains this code:

    from django.conf.urls import url
    from kittens import views

    urlpatterns = [
        url(r'^kittens/$', views.list_kittens, name="kittens_list_kittens"),
        url(r'^kittens/(?P<id>\d+)/$', views.show_kitten, name="kittens_show_kitten"),
    ]

and views.py:

    def list_kittens(request):
        # ...
        pass

    def show_kitten(request, id=None):
        # ...
        pass

Regular expressions have a built-in capture facility, which is used to obtain the parameters passed to the view function. Suppose our application runs at cuteness.com. Then a request for www.cuteness.com/kittens/23/ results in the call show_kitten(request, id="23").
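The capture step itself can be reproduced with the standard `re` module alone. A minimal sketch, matching the pattern against the path without its leading slash:

```python
import re

# The same pattern as in urls.py above.
pattern = re.compile(r'^kittens/(?P<id>\d+)/$')

match = pattern.match('kittens/23/')
captured = match.groupdict()
print(captured)  # {'id': '23'}
```

Note that the captured value is the string "23", not an integer: the pattern only says which characters are allowed.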

Besides routing URLs to specific functions, web applications almost always need to generate those URLs as well. Suppose the page listing the kittens needs to link to each kitten's own page, that is, to show_kitten. We will certainly want to do this by reusing the URL routing configuration.

However, we will be using it in the opposite direction. URL routing performs this mapping:

 URL path -> (handler function, arguments)

When generating a URL, we know the handler function and the arguments we want. We have to produce a URL that, after routing, will take the user to the desired page:

 (handler function, arguments) -> URL path

To do this, we need to be able to predict the behavior of the routing mechanism. We are asking: "What input would produce this output?" In the very early days of Django, this functionality did not exist. But it turned out that in most cases a URL pattern can be “run backwards”: the regular expression can be parsed to find its static and captured parts.
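To make the idea concrete, here is a deliberately naive sketch of reversal. It is not Django's implementation: `reverse_pattern` and `NAMED_GROUP` are names invented here, and the function only handles patterns made of literal text plus named groups without nested parentheses.

```python
import re

# Matches one named group such as (?P<id>\d+), without nested parentheses.
NAMED_GROUP = re.compile(r'\(\?P<(\w+)>([^()]*)\)')


def reverse_pattern(pattern, **kwargs):
    """Naively reverse a pattern made only of literal text and named groups."""
    body = pattern
    if body.startswith('^'):
        body = body[1:]
    if body.endswith('$'):
        body = body[:-1]

    def fill(match):
        name, sub_pattern = match.group(1), match.group(2)
        value = str(kwargs[name])
        # Refuse values the router would not accept on the way back in.
        if not re.fullmatch(sub_pattern, value):
            raise ValueError(f"{value!r} does not match {sub_pattern!r}")
        return value

    return NAMED_GROUP.sub(fill, body)


route = r'^kittens/(?P<id>\d+)/$'
path = reverse_pattern(route, id=23)
print(path)  # kittens/23/
```

Feeding the generated path back through `re.match(route, path)` recovers the same `id`, which is the round trip that URL generation relies on.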

Note that this is possible only because the language used to define URL routes, regular expressions, is limited. A much more powerful language could have been used instead. For example, URLs could be defined using functions that:

take a URL path as input;
raise NoMatch if the path does not match;
return the truncated path and a set of captured parameters if it does match.

Then our urls.py would look like this:

    # Hypothetical API: NoMatch does not exist in Django
    from django.conf.urls import url, NoMatch
    from kittens import views

    def match_kitten(path):
        # Consume the static prefix and capture nothing
        KITTEN = 'kitten/'
        if path.startswith(KITTEN):
            return path[len(KITTEN):], {}
        raise NoMatch()

    def capture_id(path):
        # Capture the first path segment as an integer id
        part = path.split('/')[0]
        try:
            id = int(part)
        except ValueError:
            raise NoMatch()
        return path[len(part)+1:], {'id': id}

    urlpatterns = [
        url([match_kitten], views.list_kittens, name='kittens_list_kittens'),
        url([match_kitten, capture_id], views.show_kitten, name="kittens_show_kitten"),
    ]

Of course, match_kitten and capture_id could be made more concise:

    # Hypothetical API: m and c do not exist in Django
    from django.conf.urls import url, m, c
    from kittens import views

    urlpatterns = [
        url([m('kitten/')], views.list_kittens, name='kittens_list_kittens'),
        url([m('kitten/'), c(int)], views.show_kitten, name="kittens_show_kitten"),
    ]

Assuming that m and c are functions that return matcher functions, this is a more powerful language for URL routing than the real one based on regular expressions. The interface for matching and capturing offers far more possibilities: for example, a matcher could look up an ID in a database.
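The hypothetical `m` and `c` helpers are easy to write in plain Python, which shows how little it takes to get this extra power. Everything below (`NoMatch`, `m`, `c`, `resolve`, and the `name` parameter) is invented for this sketch and is not part of Django.

```python
class NoMatch(Exception):
    """Raised by a matcher when the path does not fit."""


def m(prefix):
    """Build a matcher that consumes a fixed prefix and captures nothing."""
    def matcher(path):
        if path.startswith(prefix):
            return path[len(prefix):], {}
        raise NoMatch()
    return matcher


def c(convert, name="id"):
    """Build a matcher that captures the next path segment via convert()."""
    def matcher(path):
        part = path.split("/")[0]
        try:
            value = convert(part)
        except ValueError:
            raise NoMatch()
        return path[len(part) + 1:], {name: value}
    return matcher


def resolve(matchers, path):
    """Run the matchers in order; succeed only if the whole path is consumed."""
    captured = {}
    for matcher in matchers:
        path, params = matcher(path)
        captured.update(params)
    if path:
        raise NoMatch()
    return captured


route = [m("kittens/"), c(int)]
print(resolve(route, "kittens/23/"))  # {'id': 23}
```

From the outside, each matcher is an opaque function. Nothing in `route` tells a URL generator which string would make it succeed.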

But there is a fly in the ointment: we would not be able to reverse URLs. In a Turing-complete language, you cannot ask: “What input would produce this output?” In theory, one could inspect the source code of the function and look for known patterns, but that is completely impractical.

With regular expressions, whose capabilities are limited, more options are available. Strictly speaking, regex-based URL configuration is not reversible in general: even something as simple as . (a single dot) cannot be reversed uniquely. And if we want to generate canonical URLs, we need a unique answer. If a . does occur, Django picks an arbitrary character for it; other wildcards are not handled this way. But as long as such wildcards occur only inside capture groups, the regular expression can be reversed.

So if we want to reverse URL routes reliably, we need something less powerful than regular expressions. They were chosen at the time simply because they were powerful enough, without anyone realizing that they were too powerful.

On top of that, defining mini-languages for tasks like this is not easy in Python. Implementing and using them takes a fair amount of boilerplate and verbosity, much more than a “string-based” language such as regular expressions. By the way, in languages like Haskell this is much easier, thanks to fairly simple features such as algebraic data types and pattern matching.

Regular expressions

The previous section reminded me of another problem. In most cases, using regular expressions is fairly simple. But whenever you reach for a regex, you get its full power, whether you need it or not. One consequence is that some patterns require backtracking to explore all possible matches. That means it is possible to deliberately craft an input string that takes a VERY long time for the regular expression to process.

This, incidentally, has given rise to a whole class of DoS vulnerabilities, one of which was found in Django: CVE-2015-5145.
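The effect is easy to reproduce with a textbook pathological pattern. This is not the expression involved in the Django CVE, and the sizes are kept small so that the sketch finishes quickly. The helper name `time_failed_match` is invented here.

```python
import re
import time

# Nested quantifiers give the engine exponentially many ways to split
# a run of 'a's before it can conclude that there is no match.
PATTERN = re.compile(r'(a+)+$')


def time_failed_match(n):
    """Time a match attempt against n 'a's followed by a character that cannot match."""
    text = 'a' * n + 'b'
    start = time.perf_counter()
    result = PATTERN.match(text)
    return result, time.perf_counter() - start


result_short, seconds_short = time_failed_match(10)
result_long, seconds_long = time_failed_match(20)
print(f"n=10: {seconds_short:.6f}s, n=20: {seconds_long:.6f}s")
```

Both attempts fail to match, but the time roughly doubles with every extra character, so an input only a little longer would keep the process busy for minutes.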

Templates: Django vs Jinja

The creators of the Jinja template engine were inspired by the Django template language, but changed its philosophy and syntax somewhat.

Performance is one of the main advantages of Jinja2. It compiles templates directly to Python code, instead of running an interpreter written in Python as Django does, which gives a 5x to 20x speedup.

Jinja's author, Armin Ronacher, tried to apply the same approach to speed up the rendering of Django templates. When he proposed the project, he already knew that Django's extension API makes the Jinja approach very difficult to implement. Django lets you write your own template tags, which have almost complete control over the compilation and rendering steps. This makes possible such powerful tags as addtoblock in django-sekizai, which at first glance seems impossible. But even if a slower fallback were used in those (infrequent) cases, a fast implementation would still be beneficial.

There was another important difference, one that affects many templates. In Django, the context object passed in (which holds the data the template needs) can be modified while the template is being rendered. Template tags can assign to the context, and some of them (for example, url) do exactly that.

All this made it impossible to implement in Django a key part of the compilation to Python that Jinja performs.

Note that in both cases the problem lies in the power of the Django template engine: it allows template authors to do things that are impossible in Jinja2. As a result, compiling templates to fast code becomes very difficult.

This matters, because in some cases template rendering speed becomes the main problem of a project. Because of this, quite a few projects have switched to Jinja. And that is an unhealthy situation!

Often it is only in hindsight that you can see what exactly makes optimization difficult. And it would be disingenuous to claim that adding restrictions to a language necessarily makes optimization easier. Still, there are languages that quite successfully embrace the idea of limiting the power of authors for the benefit of consumers!

One could say that making the context object writable was quite natural, since data structures in Python are mutable by default. Which brings us to Python itself...

Python

There are many ways to think about the power of this language, for example, how it complicates life for every developer and every program that has to make sense of Python code.

Compilation and performance come to mind first. The almost complete lack of restrictions, which among other things allows classes and modules to be rewritten at runtime, makes all sorts of useful things possible, but it also badly hurts performance. The authors of PyPy have achieved impressive results, but judging by the trend, they are unlikely to make significant further gains. Moreover, the performance gains came at the cost of increased memory consumption. Python code can simply only be optimized up to a certain limit.

In case you got that impression: I am not at all an opponent of Python or Django. I am one of the core developers of Django, and I use it and Python in almost all of my professional projects. With this post I only want to illustrate the problems caused by powerful programming languages.

Now let's talk about refactoring and maintenance. If you work on serious projects, a lot of your time surely goes into maintenance. It is very important to be able to do it quickly and with a minimum of errors.

Suppose that, in Python, using a VCS (for example, Git or Mercurial), you move a ten-line function to another place. You get a 20-line diff, even though from the point of view of the program nothing has changed. And if something has changed (the function was not only moved but also modified), it will be very hard to spot.
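The arithmetic can be checked with the standard `difflib` module. The file contents below are made up for the illustration: a ten-line function and twelve unrelated lines, with the function moved from the top of the file to the bottom.

```python
import difflib

# A ten-line function followed by twelve unrelated lines.
function = ["def moved():"] + [f"    step_{i}()" for i in range(9)]
rest = [f"setting_{i} = {i}" for i in range(12)]

before = function + rest   # function defined at the top
after = rest + function    # the same function, moved to the bottom

diff = list(difflib.unified_diff(before, after, lineterm=""))
changed = [
    line for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print(len(changed))  # 20
```

The diff reports ten removed lines and ten added lines, and nothing in it says that the two groups are the same function.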

Every time I run into this, I wonder: why do we treat our highly structured code as a bunch of lines of plain text? It is madness!

You may think that this can be solved with more advanced diff tools. But the trouble is that in Python, changing the order of functions can actually change the behavior of the program at runtime.

Here are some examples. Use a previously defined function as a default argument:

    def foo():
        pass

    def bar(a, callback=foo):
        pass

If you swap these two definitions, defining bar raises a NameError for foo.
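This can be verified without touching any files by executing both orderings from strings. A minimal sketch:

```python
original = "def foo(): pass\ndef bar(a, callback=foo): pass\n"
reordered = "def bar(a, callback=foo): pass\ndef foo(): pass\n"

exec(original, {})  # runs without error

try:
    exec(reordered, {})
    error = None
except NameError as exc:
    # The default value is evaluated when the def statement runs,
    # and at that moment foo does not exist yet.
    error = exc

print(type(error).__name__)  # NameError
```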
Use a decorator:

    @decorateit
    def foo():
        pass

    @decorateit
    def bar():
        pass

Because of possible side effects in @decorateit, you cannot reorder these functions and be sure that the program will behave the same way. The same goes for calling code in a function's argument list:

    def foo(x=Something()):
        pass

    def bar(x=Something()):
        pass
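Both effects are easy to observe. In the sketch below, `decorateit` and `Something` are given hypothetical bodies that simply record when they run. The `label` argument is added only to tell the two calls apart.

```python
events = []  # records side effects in the order they happen


def decorateit(func):
    """Hypothetical decorator with a side effect: it registers the function."""
    events.append(f"decorated {func.__name__}")
    return func


class Something:
    """Hypothetical default value whose constructor has a side effect."""
    def __init__(self, label):
        events.append(f"created {label}")


@decorateit
def foo(x=Something("default for foo")):
    pass


@decorateit
def bar(x=Something("default for bar")):
    pass


for event in events:
    print(event)
```

All four events happen while the module is being imported, before either function is called, and swapping the two definitions swaps the order of the events.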

Class attributes cannot be reordered either:

    class Foo():
        a = Bar()
        b = Bar()

Here, the definition of b cannot be moved above a, because of possible side effects in the Bar constructor. This may seem theoretical, but Django really does use it in Model and Form definitions to provide a default field order, by means of a cunning class-level counter in the constructor of the base Field class.
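The counter trick can be sketched in a few lines. This is an illustration of the idea only, not Django's actual code: the `Field` and `KittenForm` classes and the `order` attribute are invented for this example.

```python
class Field:
    """Illustrative stand-in for a framework field class."""
    creation_counter = 0  # shared by every Field instance

    def __init__(self):
        # Each new field remembers how many fields were created before it.
        self.order = Field.creation_counter
        Field.creation_counter += 1


class KittenForm:
    name = Field()
    age = Field()


fields = {k: v for k, v in vars(KittenForm).items() if isinstance(v, Field)}
declared = sorted(fields, key=lambda k: fields[k].order)
print(declared)  # ['name', 'age']
```

Swapping the two attribute lines in `KittenForm` changes `declared`, so a tool that reorders class attributes would silently change the order of fields in a rendered form.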

You have to accept that in Python a sequence of function statements is a sequence of actions in which objects (functions and default arguments) are created, possibly manipulated, and so on. Unlike in some other languages, it is not a set of declarations that can be reordered.

This gives you tremendous power when writing code, but it severely limits what automated tools can do with the code once it is written. Refactorings are almost impossible to perform safely: because of features of the language such as “duck typing”, you cannot reliably rename methods. And because of reflection and dynamic attribute access (getattr and friends), you cannot safely do any automated rename at all.
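A small sketch shows why dynamic attribute access defeats an automated rename. The classes and the `call_by_name` helper are invented for the illustration.

```python
class Kitten:
    def meow(self):
        return "meow"


def call_by_name(obj, prefix, suffix):
    """Look up a method whose name only exists as a string at runtime."""
    return getattr(obj, prefix + suffix)()


sound = call_by_name(Kitten(), "me", "ow")
print(sound)  # meow


class RenamedKitten:
    def speak(self):  # the same method after a hypothetical rename
        return "meow"


# The call site never spells the method name out, so a rename tool has
# nothing to update, and the failure shows up only at runtime.
try:
    call_by_name(RenamedKitten(), "me", "ow")
    broke = False
except AttributeError:
    broke = True
```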

So do not rush to blame VCS or refactoring tools for all the trouble. The cause is the power of Python itself. Despite the large amount of structure in our code, very little can be done with it in terms of manipulation. So, on closer inspection, line-based diffs do not look so bad.

In practice, we rarely use decorators that make the order of definitions matter, and this makes life a little easier for consumers. But in rare cases our tools still turn out to be practically useless. Some consumers can optimize for the common case and detect when their assumptions fail, as a JIT compiler does with its checks. But for other tools (say, VCS or refactoring tools), by the time a failure is detected it is too late. By the time you discover the problem, you may already have released the “broken” code, so it is better to be safe than sorry.

In an ideal language, when a function is renamed, the diff in the VCS should look like "The function foo has been renamed to bar".

Results

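Taken as a whole, with every consumer of the code in mind, less powerful languages turn out to be the more powerful ones: “Slavery is freedom.”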

Source: https://habr.com/ru/post/271585/
