Data Types Strike Back

This is the second part of my thoughts on “the Python I would like to see,” and in it we take a closer look at the type system. To do this, we again have to dig into the implementation details of the Python language and its interpreter, CPython.

If you are a Python programmer, data types mostly stay behind the scenes for you. They exist somewhere and somehow interact with each other, but most of the time you think about them only when an error occurs and the exception tells you that some type behaves differently than you expected.

Python has always been proud of its type system. I remember reading the documentation many years ago, which contained a whole section on the benefits of duck typing. Let's be honest: yes, for practical purposes duck typing is a good solution. As long as nothing constrains you and you do not have to fight with data types, because they stay out of your way, you can create very beautiful APIs. Python makes it especially easy to solve everyday problems.

Practically none of the APIs I implemented in Python would work in other programming languages. Even something as simple as a command-line interface (the click library) just does not work in other languages, and the main reason is that you constantly have to fight the data types.

Not long ago, the question of adding static typing to Python was raised, and I sincerely hope the matter is finally settled. I will try to explain why I am against explicit typing and why I hope Python never goes down that path.

What is a "type system"?

A type system is a set of rules according to which types interact with each other. There is a whole branch of computer science devoted exclusively to data types, which is impressive in itself, but even if you are not interested in theory, you will find the type system hard to ignore.

I will not go too deep into type systems, for two reasons. First, I do not fully understand this area myself. Second, you do not actually need to understand everything in order to “feel” the relationships between data types. For me, their behavior matters because it affects interface architecture, and I will talk about typing not as a theorist but as a practitioner (using the example of building a beautiful API).

Type systems can have many characteristics, but the most important difference between them is the amount of information a type provides about itself when you try to work with it.

Take, for example, Python. There are types in it. Here is the number 42, and if you ask this number what type it has, it will answer that it is an integer. This is exhaustive information, and it allows the interpreter to define a set of rules according to which integers can interact with each other.

However, there is one thing that Python lacks: composite data types. All data types in Python are primitive, which means that you can basically work with only one of them at a time, in contrast to composite types.

The simplest composite data type found in most programming languages is the structure. Python has none as such, but in many cases libraries need to define their own structures, for example the ORM models in Django and SQLAlchemy. Each database column is represented by a Python descriptor that corresponds to a field in the structure, and when you say that the primary key is called id and is an IntegerField(), you define the model as a composite data type.

Composite types are not limited to structures. When you need to work with more than one number, you use collections (arrays). In Python, there are lists for this, and each element of the list can have a completely arbitrary data type, as opposed to lists in other programming languages that have a given element type (for example, a list of integers).

The phrase "list of integers" always makes more sense than just "a list". You can argue with that, because you can always walk through the list and look at the type of each element, but what do you do with an empty list? When you have an empty list in Python, you cannot determine its element type.
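To make the point about lists concrete, here is a minimal Python 3 sketch. The variable names are made up for illustration; the only thing it relies on is the built-in type() function.

```python
# A list carries no information about the type of its elements.
mixed = [42, "forty-two", 42.0]
element_types = [type(item).__name__ for item in mixed]
print(element_types)               # ['int', 'str', 'float']

# An empty list only reports "list"; nothing says what it was meant to hold.
empty = []
print(type(empty).__name__)        # list
print(type(empty) is type(mixed))  # True
```

Both objects are simply "a list" as far as the interpreter is concerned; the intended element type exists only in the programmer's head.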

The same problem occurs with the value None. Suppose you have a function that takes a “User” argument. If you pass None to it, you will never know that it was supposed to be a user object.
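A small sketch of this situation, assuming a hypothetical User class and greet function that are not part of any real library:

```python
class User:
    """Hypothetical user type, invented for this illustration."""

    def __init__(self, username):
        self.username = username


def greet(user):
    # Nothing here records that `user` was supposed to be a User.
    return "Hello, " + user.username


print(greet(User("alice")))  # Hello, alice

try:
    greet(None)
except AttributeError as exc:
    # The error talks about NoneType; the intended User type is never mentioned.
    error_message = str(exc)
    print(error_message)
```

The failure surfaces only at the attribute access, and the message describes None rather than the type the caller should have supplied.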

What is the solution to this problem? Do not have null pointers, and have arrays with explicitly specified element types. Everyone knows that Haskell has all of this, but there are other languages that are less hostile to developers. Rust, for example, is a programming language that is closer and more understandable to us, since it is very similar to C++. And Rust has a very powerful type system.

How can you pass a "user not set" value if there are no null pointers? Rust, for example, has optional types for this. An Option<User> is a tagged enum that wraps the value (a particular user in this case), which means that either Some(user) or None can be passed. Since the variable may or may not hold a value, all code working with it must correctly handle the None case, otherwise it will not compile.

Gray future

Previously, there was a clear separation between interpreted languages with dynamic typing and compiled languages with static typing. New trends change the current rules of the game.

The first sign that we are stepping into uncharted territory was the appearance of C#. It is a compiled language with static typing, and at first it was very similar to Java. As C# developed, new features began to appear in its type system. The most important event was the arrival of generics, which made it possible to strongly type collections that are not built into the compiler (lists and dictionaries). It went further: the creators of the language introduced the ability to give up static typing of variables for entire blocks of code. This is very convenient, especially when working with data provided by web services (JSON, XML, etc.), because it lets you perform potentially unsafe operations, catch exceptions from the type system, and tell users about incorrect data.

Today the C# type system is very powerful and supports generic types with covariance and contravariance specifications. It also supports nullable types. For example, to define default values for objects represented as null, the null-coalescing operator ("??") was added. Although C# has gone too far to get rid of null altogether, all the pain points are under control.

Other compiled languages with static typing are also trying new approaches. C++ has always been a statically typed language, but its developers have begun experimenting with type deduction at many levels. The days of MyType<X, Y>::const_iterator are a thing of the past, and in almost all cases you can now use auto, and the compiler will fill in the necessary type for you.

In the Rust programming language, type inference is also very well implemented, and this allows you to write programs with static typing without specifying the types of variables at all:

    use std::collections::HashMap;

    fn main() {
        let mut m = HashMap::new();
        m.insert("foo", vec!["some", "tags", "here"]);
        m.insert("bar", vec!["more", "here"]);

        for (key, values) in m.iter() {
            // join replaces the older, deprecated connect method
            println!("{} = {}", key, values.join("; "));
        }
    }

I believe the future holds powerful type systems. But in my opinion, this will not mean the end of dynamic typing; rather, these systems will evolve toward static typing with local type inference.

Python and explicit typing

Some time ago, at one of the conferences, someone argued convincingly that static typing is great and that Python badly needs it. I don't remember exactly how this discussion ended, but the result was a project called mypy, which, together with the annotation syntax, was proposed as the gold standard for typing in Python 3.

In case you have not seen this recommendation, it suggests the following:

    from typing import List

    def print_all_usernames(users: List[User]) -> None:
        for user in users:
            print(user.username)

I am sincerely convinced that this is not the best solution. There are many reasons, but the main problem is that the type system in Python is, unfortunately, not that good. In essence, the language has different semantics depending on how you look at it.

For static typing to make sense, the type system must be implemented well. If you have two types, you should always know how these types need to interact with each other. In Python, this is not the case.

Python Type Semantics

If you read the previous article about the slot system, you should remember that types in Python behave differently depending on the level at which they are implemented (C or Python). This is a very specific feature of the language that you will not see anywhere else. That said, many programming languages implement their fundamental data types at the interpreter level at an early stage of development.

In Python there are simply no “fundamental” types; however, there is a whole group of data types implemented in C. And these are not only primitives and fundamental types: it can be anything, without any logic to it. For example, the collections.OrderedDict class is written in Python, while the collections.defaultdict class from the same module is written in C.

This causes a lot of problems for the PyPy interpreter, which has to emulate the original types as closely as possible. This is necessary in order to get a good API in which no differences from CPython are noticeable. It is very important to understand the main difference between the interpreter level, written in C, and the rest of the language.

Another example is the re module in versions of Python up to 2.7. In later versions it was completely rewritten, but the main problem is still relevant: the interpreter does not work the way the language does.

In the re module, there is a compile function for compiling a regular expression into a pattern. This function takes a string and returns a pattern object. It looks like this:

    >>> re.compile('foobar')
    <_sre.SRE_Pattern object at 0x1089926b8>

We see that the pattern object is specified in the _sre module, which is an internal module, and yet it is available to us:

    >>> type(re.compile('foobar'))
    <type '_sre.SRE_Pattern'>

Unfortunately, this is not the case, because the _sre module does not actually contain this object:

    >>> import _sre
    >>> _sre.SRE_Pattern
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'module' object has no attribute 'SRE_Pattern'

Well, this is not the first or the only time when a type deceives us about its location, and in any case it is an internal type. Moving on. We know the type of the pattern (_sre.SRE_Pattern), and this is a descendant of the object class:

    >>> isinstance(re.compile(''), object)
    True

We also know that all objects implement some of the most common methods. For example, instances of such classes have a __repr__ method:

    >>> re.compile('').__repr__()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: __repr__

What is going on? The answer is rather unexpected. For reasons unknown to me, in Python prior to version 2.7 the SRE pattern object had its own tp_getattr slot. This slot implemented its own attribute lookup logic, which provided access only to the object's own attributes and methods. If you examine this object with the dir() function, you will notice that many things are simply missing:

    >>> dir(re.compile(''))
    ['__copy__', '__deepcopy__', 'findall', 'finditer', 'match',
     'scanner', 'search', 'split', 'sub', 'subn']

This small study of the behavior of the pattern object leads us to rather unexpected results. This is what actually happens.

The type declares that it inherits from object. This is true in CPython, but not in Python itself. At the Python level, this type does not conform to the interface of the object type. Every call that goes through the interpreter will work, unlike calls that go through the Python language. So, for example, type(x) will work, but x.__class__ will not.

What is a subclass

The example above shows that in Python there can be a class that inherits from another class but whose behavior does not match that of the base class. This is an important issue when we talk about static typing. For instance, in Python 3 you cannot implement the interface of the dict type unless you write it in C. The reason for this restriction is that the type promises behavior for its view objects that simply cannot be implemented in Python. It is impossible.

Therefore, when you use a type annotation to declare that a function takes a dictionary with string keys and integer values, the annotation alone cannot tell you whether the function accepts only a dictionary, any object that behaves like a dictionary, or also a dictionary subclass.
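A minimal sketch of why this matters, assuming CPython and Python 3. DoublingDict and lookup are hypothetical names made up for this illustration; the sketch relies on the fact that CPython's C-implemented dict.get does not go through a __getitem__ overridden in a subclass.

```python
from typing import Dict


class DoublingDict(dict):
    """Hypothetical dict subclass that doubles values on item access."""

    def __getitem__(self, key):
        return super().__getitem__(key) * 2


def lookup(mapping: Dict[str, int], key: str) -> int:
    # The annotation does not say how a dict subclass is expected to behave.
    return mapping.get(key)


d = DoublingDict(answer=21)
via_getitem = d["answer"]          # uses the Python-level override
via_get = lookup(d, "answer")      # dict.get in C bypasses the override
print(via_getitem, via_get)        # 42 21
```

The object passes isinstance(d, dict), and yet two ways of reading the same key disagree, which is exactly the kind of ambiguity the annotation cannot express.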

Undefined behavior

The strange behavior of the regular expression pattern object was changed in Python 2.7, but the problem remained. As shown by the example of dictionaries, the language behaves differently, depending on how the code is written, and it is simply impossible to fully understand the exact semantics of the type system.

The very strange behavior of the internals of the interpreter of the second version of Python can be seen when comparing types of class instances. In the third version, the interfaces have been changed, and this behavior is no longer relevant for it, but the fundamental problem can still be detected at many levels.

Let's take the sorting of sets as an example. Python sets are a very useful data type, but they behave very strangely when compared. In Python 2, we have the cmp() function, which takes two objects as arguments and returns a numeric value that indicates which of the two is greater.

Here's what happens if you try to compare two instances of a set object:

    >>> cmp(set(), set())
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: cannot compare sets using cmp()

Why is that? To be honest, I have no idea. Perhaps the reason lies in how the comparison operators work with sets, which does not carry over to cmp(). And at the same time, frozenset instances compare just fine:

    >>> cmp(frozenset(), frozenset())
    0

Except when one of these sets is not empty: then we get an exception again. Why? The answer is simple: this is an optimization in the CPython interpreter, not behavior of the Python language. An empty frozenset always has the same value (it is an immutable type and we cannot add elements to it), so it is always the same object. When two objects have the same address in memory, the cmp() function immediately returns 0. Why exactly this happens I could not figure out right away, since the code of the comparison function in Python 2 is too complicated and confusing, but the function has several code paths that can lead to this result.
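The guess above about comparison operators can be illustrated in Python 3, where cmp() no longer exists. This is only a sketch of how set comparisons behave; whether it is the actual reason cmp() refuses sets in Python 2 is an assumption.

```python
a = {1}
b = {2}

# For sets, < and > mean "proper subset" and "proper superset",
# so two unrelated sets are neither smaller, larger, nor equal.
results = (a < b, a > b, a == b)
print(results)                # (False, False, False)

# When one set really is contained in the other, the comparison is meaningful.
proper_subset = {1} < {1, 2}
print(proper_subset)          # True
```

A three-way function such as cmp() has to answer -1, 0 or 1, and for a and b above none of the three would be a truthful answer.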

The point is not only that this is a bug. The point is that in Python there is no clear understanding of the principles of the interaction of types with each other. Instead, there was always one answer to all the features of the type system behavior in Python: “this is how CPython works”.

It is difficult to overestimate the amount of work that was done in PyPy to reconstruct the behavior of CPython. Given that PyPy is written in Python, an interesting problem emerges. If the Python programming language were described in the way the current Python part of the language is implemented, PyPy would have far fewer problems.

Instance level behavior

Now let's imagine that we hypothetically have a version of Python in which all the problems described above are fixed. Even then, we could not add static types to the language. The reason is that at the Python level types do not play a significant role; what matters much more is how objects interact with each other.

For example, datetime objects can, in general, be compared with other objects. But if you want to compare two datetime objects with each other, this can be done only if their time zone information is compatible. Likewise, the result of many operations can be unpredictable until you carefully examine the objects involved. The result of concatenating two strings in Python 2 can be either a unicode string or a byte string. Different encoding or decoding APIs from the codec system can return different objects.
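The datetime case can be sketched with the standard library alone (Python 3; the concrete date is arbitrary):

```python
from datetime import datetime, timezone

naive = datetime(2014, 11, 1, 12, 0)
aware = datetime(2014, 11, 1, 12, 0, tzinfo=timezone.utc)

# Both objects have exactly the same type...
same_type = type(naive) is type(aware)
print(same_type)              # True

# ...yet ordering them fails, because one carries time zone info and one does not.
try:
    naive < aware
    ordering_failed = False
except TypeError as exc:
    ordering_failed = True
    print(exc)
```

A static annotation of `datetime` on both arguments would accept this call, while the outcome depends entirely on the state of the two instances.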

Python, as a language, is too dynamic for type annotations to work well. Just imagine the important role that generators play in the language, and they can perform many type conversion operations in each iteration.
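As a rough illustration of the generator point, here is a hypothetical generator (normalized is an invented name) whose yielded type changes from one iteration to the next:

```python
def normalized(values):
    """Hypothetical generator that converts each item differently."""
    for value in values:
        if isinstance(value, bytes):
            yield value.decode("utf-8")     # bytes -> str
        elif isinstance(value, str) and value.isdigit():
            yield int(value)                # numeric str -> int
        else:
            yield value                     # everything else unchanged


result = list(normalized([b"abc", "42", 3.5]))
print(result)                               # ['abc', 42, 3.5]
print([type(item).__name__ for item in result])
```

Code like this is perfectly ordinary Python, and the most precise thing an annotation can say about the yielded items is a union of everything that might come out.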

The introduction of type annotations will, at best, have an ambiguous effect. More likely, however, it will adversely affect API architecture. At a minimum, if these annotations are not stripped before programs are run, they will slow down the execution of the code. Type annotations will never make efficient static compilation possible without turning Python into something that Python is not.

Baggage and semantics

I think my personal negative attitude towards Python comes from the absurd complexity this language has reached. It simply lacks a specification, and today the interaction between types has become so confusing that we may never be able to sort it all out. There are so many hacks and small behavioral quirks in it that the only possible specification of the language today is a detailed description of how the CPython interpreter works.

In my opinion, in view of the above, introducing type annotations makes almost no sense.

If anyone in the future wants to develop a new programming language with predominantly dynamic typing, they should spend extra time on a clear description of how the type system should work. JavaScript does this quite well: the semantics of all built-in types are described in detail, even in cases where they make no sense, and in my opinion this is good practice. If you have clearly defined how the semantics of the language work, it will later be easy to optimize the speed of the interpreter or even to add optional static typing.

Maintaining a slim and well-documented language architecture avoids many problems. Architects of future programming languages should definitely avoid all the mistakes that were made by developers of PHP, Python and Ruby, when the behavior of the language is ultimately explained by the behavior of the interpreter.

I believe that Python is unlikely to change for the better. It would take too much time and effort to rid the language of all this heavy legacy.

Translated by Dreadatour.

Source: https://habr.com/ru/post/242305/
