
Why Python is so good at scientific computing.

A few days ago (the original post was published on September 12, 2017; this and the following parenthetical notes are the translator's), I noticed this tweet in my feed:



I 'still' program in C. Why? Hint: it's not about performance. Wrote an essay with analysis ... will appear on Onward!

(Onward! is one of the conferences at SPLASH, dedicated to discussing new ideas and paradigms in programming and reflection on software.)


It looked like good weekend reading, and it was. The author's main argument: the C language remains unsurpassed as a system integration language, because it allows you to interact with "foreign" code, that is, code written independently and possibly even in other languages, all the way down to assembly. In fact, C is one of the few programming languages that lets you deal with arbitrary data at the byte level. Most of the more "modern" languages prohibit such interaction in the name of safety: the only memory you can access is memory allocated through the safe language runtime. As a result, you are stuck in its closed universe.


System integration is undoubtedly an important but often overlooked aspect of working with software. This is especially true for scientific computing, where application software with a fixed set of functions is rare. Solving a scientific problem often requires assembling many pieces of software into a problem-specific whole that may be run only a couple of times (see also my earlier post on this topic). This is exactly the task of system integration: assembling a single whole out of pieces, writing glue code where necessary. In computational science, glue code takes the form of scripts, workflows and, more recently, notebooks. Technically this is quite different from system integration at the operating-system level, which Stephen Kell discusses, but functionally it is the same.


Stephen's article reminded me of my long-standing plan to write a blog post about why Python is so successful in scientific computing, despite its reputation as a language with poor performance. So... here it is.


Of course, there are many reasons for Python's success, but one of them is that it does an excellent job at system integration tasks. Python has two features that I consider important here and that many other languages lack. One is data types explicitly designed for interfacing; the other is duck typing combined with a small but flexible set of standard interfaces.


The first Python data type designed for interfacing, in the context of scientific computing, is the good old NumPy array, which is actually older than NumPy itself, having been introduced in 1995 by its predecessor, Numeric. Arrays are one of the "bread and butter" data types of scientific computing, to the point of being the only data type available in languages such as Fortran 77 or APL. Numeric's array implementation was designed to use the same memory layout as Fortran and C, in order to interoperate with the Fortran and C libraries that dominated scientific computing in 1995 (and still do today, albeit to a lesser extent). Behind Numeric and, later, NumPy there was always the idea of using Python as a glue language for Fortran and C libraries, achieving speed by delegating time-critical operations to code written in those languages.
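A minimal sketch of this idea (assuming NumPy is installed): a NumPy array owns a contiguous block of memory laid out the way C or Fortran expects, and its raw pointer can be handed to compiled code without copying.

```python
import ctypes

import numpy as np

# A NumPy array is a thin Python wrapper around a contiguous block of
# memory whose layout C or Fortran code can consume directly.
a = np.arange(6, dtype=np.float64).reshape(2, 3)  # C (row-major) layout
f = np.asfortranarray(a)                          # Fortran (column-major) copy

print(a.flags['C_CONTIGUOUS'])   # True
print(f.flags['F_CONTIGUOUS'])   # True

# ctypes can expose the raw buffer to a C library without copying:
ptr = a.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
print(ptr[0], ptr[1])            # 0.0 1.0 -- the same bytes C would see
```

This is why passing a NumPy array to a wrapped Fortran or C routine costs essentially nothing: only a pointer and shape information cross the language boundary.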


The second Python data type designed for interfacing is the memoryview, associated with the buffer protocol. This is where Python comes closest to C-style memory access. The buffer protocol lets different Python data types access each other's internals at the byte level. A typical use case is an image data type (for example, from Pillow) whose in-memory representation is accessible through an array type (for example, from NumPy), making it possible to implement image-processing algorithms as operations on arrays.
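A small, self-contained sketch using only the standard library: a bytearray stands in for image data, and a memoryview gives zero-copy, byte-level access to it.

```python
# The buffer protocol lets two objects share one block of memory.
# memoryview provides byte-level, zero-copy access to anything that
# implements the protocol -- here a bytearray standing in for image data.
data = bytearray(b'\x00\x01\x02\x03\x04\x05')
view = memoryview(data)

# A slice of a memoryview is another view, not a copy:
half = view[2:5]
print(half.tolist())   # [2, 3, 4]

# Writing through the view mutates the original buffer in place:
half[0] = 255
print(data)            # bytearray(b'\x00\x01\xff\x03\x04\x05')
```

NumPy arrays, Pillow images, `bytes`, `bytearray`, and `array.array` all speak this same protocol, which is what lets them wrap each other's memory without copying.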


The third and least known Python data type for interfacing is the capsule, which replaced the earlier CObject. Capsules exist solely for the benefit of Python modules written in C, which can exchange opaque data with one another through glue code written in Python, even though the glue code itself cannot inspect or process the data in any way. A typical example: wrapping a pointer to a C function in a Python object, so that Python glue code (a script, for example) can pass a C function from one module to C code in another module.
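Capsules are created and consumed only on the C side, but they are visible from Python as opaque objects. The standard library's datetime module, for instance, exports its C API as a capsule:

```python
# Capsules are opaque handles that let one C extension hand raw C
# pointers to another through Python code that never inspects them.
# The stdlib datetime module exposes its C API this way:
import datetime

capsule = datetime.datetime_CAPI
print(type(capsule).__name__)   # PyCapsule

# Python code can only pass the capsule along; extracting the pointer
# inside it is possible only from C, via PyCapsule_GetPointer().
```

From the Python side there is nothing more to do with a capsule than store it and hand it over, which is exactly the point: the glue code moves the pointer without ever being able to misuse it.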


All of these interfacing data types serve as intermediaries between Python code and C code, although the Python system integrator is often not even aware that C code is involved. The other Python feature that helps with system integration, duck typing combined with standard interfaces, facilitates gluing together independently written Python modules. By "standard interfaces" I mean the sequence and dictionary interfaces, as well as the standard method names for operator overloading.


To see why this matters, let's look at statically typed languages, where this feature is intentionally absent. As a concrete example, take multidimensional arrays in Java. They are not part of the language or its standard library, but they can be implemented on top of it with reasonable effort. In fact, there are several Java implementations to choose from. That is exactly the problem. Suppose you want to use a fast Fourier transform (FFT) library built on array implementation A together with a linear algebra library built on array implementation B. Bad luck: the arrays from A and B have different types, so you cannot feed the output of the FFT into the linear equation solver. It doesn't matter that they are based on the same abstractions, or even that the implementations have much in common. To the Java compiler the types don't match, and that's that.


Python is not completely free of this problem either. It is perfectly possible to write Python code, or code in a C module, that expects one precise data type as input and throws an exception otherwise. But in Python code this is considered bad style, and in C modules for Python as well, with exceptions made for performance or for compatibility with other C code. Wherever possible, Python programmers are expected to use the standard interfaces for working with data. Iteration and indexing, for example, work the same way for arrays and for built-in lists. For operations not covered by the standard interfaces, Python programmers are expected to use Python methods, which are likewise subject to duck typing. In practice, independently implemented Python types are far more interoperable than independently implemented Java types. In the specific case of n-dimensional arrays, Python has had the luck of the overwhelming adoption of a single implementation, for reasons that are social and historical rather than technical.
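As an illustration (a hypothetical minimal class, not taken from any real library): any object that implements `__len__` and `__getitem__` satisfies the sequence interface, so iteration, `sum`, and the `in` operator all work on it, whatever its concrete type.

```python
# Duck typing with a standard interface: implementing the sequence
# protocol (__len__ and __getitem__) is enough to make an object
# usable anywhere a sequence is expected -- no inheritance needed.
class Squares:
    """A read-only sequence of the first n squares (hypothetical example)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        return i * i

s = Squares(5)
print(list(s))   # [0, 1, 4, 9, 16] -- iteration falls back on __getitem__
print(sum(s))    # 30
print(4 in s)    # True -- containment also works via the protocol
```

This is the Python counterpart to the Java scenario above: a function written against "anything sequence-like" accepts lists, tuples, NumPy arrays, and `Squares` alike, because only behavior, not declared type, is checked.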


Finally, even though Python is a pretty good choice for system integration in scientific computing, it does have limits, exactly the kind Stephen Kell describes in his essay: combining Python code with code in other languages with managed memory, say R or Julia, takes a lot of work, and even then the result remains fragile, because it relies on tricks based on undocumented implementation details. I suspect the only solution would be the emergence of language-neutral data objects, with garbage collection support, provided as an operating-system-level service while preserving the possibility of unmanaged byte-level access a la C. The closest existing technology I know of is Microsoft's CLR, better known under its commercial name .NET. Its implementation is now open source and runs on a wide variety of platforms, but its "Windows only" origins and strong ties to Microsoft's enormous libraries are an obstacle to adoption in the traditionally Unix-centric scientific computing community.



Source: https://habr.com/ru/post/349482/

