
The most difficult problem in computer science

... is, of course, naming things. And I don't just mean the names of variables or new technologies. We cannot even agree on the most basic terms.



A Thousand Dialects



Did you know that the C language specification uses the term “object” all the time? No, this is not an object in the OOP sense: in C, an object is defined as a “region of data storage in the execution environment, the contents of which can represent values”. With that definition, it makes sense to talk about, for example, a “char object”.



The term “method” is quite common, but you can meet programmers who insist on saying “member function” instead. Java therefore either has functions or does not, depending on whom you ask. The terms “procedure” and “subroutine” are sometimes used as synonyms for “function”, but in some languages (for example, Pascal) a procedure is not the same thing as a function at all.


Even within a single programming language, we sometimes get confused.



Python programmers can be caught saying “property” when they mean “attribute”, although both terms exist in the language and they are not identical. There is a difference between an “argument” and a “parameter”, but who cares: we just say whichever word comes to mind first. I often use the term “signature” for a function's interface, but other people rarely do, so I sometimes wonder whether anyone understands what I am talking about.
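To make the two distinctions concrete, here is a small sketch. The names Circle and scale are invented for this illustration and do not come from any library.

```python
import math


class Circle:
    def __init__(self, radius):
        self.radius = radius      # a plain attribute, stored on the instance

    @property
    def area(self):               # a property: computed on every access
        return math.pi * self.radius ** 2


def scale(circle, factor):        # `circle` and `factor` are parameters
    circle.radius *= factor


c = Circle(1)
scale(c, 2)                       # `c` and `2` are arguments
```

Both c.radius and c.area are read with the same dot syntax, which is probably why the two words get mixed up so easily.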



When we say “float”, a C programmer hears “single-precision floating-point type”, while a Python programmer is sure that double precision was meant. And this is not even the worst case: the term “word” can mean at least four different sizes.
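The float mismatch can be seen from the Python side with nothing but the standard library. This is a rough sketch that assumes an ordinary IEEE 754 platform; the variable names are made up for the example.

```python
import struct
import sys

# Python's float carries 53 bits of mantissa, i.e. it is a C double.
mantissa_bits = sys.float_info.mant_dig

# Sizes in bytes of the standard single ("f") and double ("d") formats.
single_size = struct.calcsize("<f")
double_size = struct.calcsize("<d")

# Squeezing a Python float through a C-style single-precision float
# and back does not return the same number.
as_single = struct.unpack("<f", struct.pack("<f", 0.1))[0]
survives_round_trip = (as_single == 0.1)
```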



Part of the problem is that when we talk “about computer science”, we are not really talking about computer science. We are doing practical programming in some subset (out of hundreds!) of imperfect programming languages, each with its own peculiarities and quirks. Meanwhile, we have a limited stock of familiar terms that we apply to different features of different languages, sometimes appropriately and sometimes not. Someone who started programming with JavaScript will have a certain idea of what a “class” is, and it will be very different from the idea held by someone whose first language was Ruby. People come from one language to another and start accusing it of, say, lacking proper closures, because in their language the term “closure” means something completely different.



Sometimes you can live with all this. And sometimes real confusion results. Here are my (least?) favorite examples of such situations.



Arrays, Vectors and Lists



In C, an array is a contiguous block of memory holding a fixed, explicitly stated number of values of the same type. int[5] describes an array that stores five values of type int, directly one after another.



C++ introduces the vector as an analogue of the array that can automatically change its size to fit current needs. There is also a standard list type, which in this case means a doubly linked list (strictly speaking, the standard does not mandate a specific implementation, but the functional requirements make it natural to build vector on top of an array and list as a doubly linked list). But wait! C++11 introduces initializer_list, which has the word “list” in its name but is essentially an array.



Lisp lists are, of course, linked lists, which makes them easy to process in terms of head and tail. Haskell works the same way, and it also has Data.Array for fast access to elements by index.



In Perl, the sequence type is the array, although the word “type” is not quite appropriate here: it is rather one of the forms a variable can take. Perl also has the concept of a “list”, but that is only a transient thing that exists while some expression is being evaluated, not a classic container data type. It is rather strange and explaining it would take more than one paragraph, so I will not even start here.



In Python, list is a fundamental data type with properties similar to a C++ vector, and (in CPython) it is implemented on top of a C array. The standard library also provides the rarely used array type, which packs numbers into C arrays to save space and disorients programmers who came to Python from C: they assume “array” is what you should use by default. Oh, and there is also the built-in type bytearray, which is not the same thing as an array that stores bytes.
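The three Python types can be put side by side in a few lines. This is only a sketch; the variable names are invented for the illustration.

```python
from array import array

items = [1, "two", 3.0]            # list: a growable sequence of any objects
items.append(None)

numbers = array("i", [1, 2, 3])    # array.array: packed C ints, one type only
rejected = False
try:
    numbers.append("four")         # a str cannot be packed as a C int
except TypeError:
    rejected = True

raw = bytearray(b"abc")            # bytearray: a mutable sequence of bytes
raw[0] = ord("x")

byte_array = array("B", b"abc")    # an array that stores bytes: another type
```

Note that raw and byte_array hold similar data but are instances of unrelated types with different methods.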



JavaScript has an array type, but it is built on top of a hash table with string (!) keys. There is also ArrayBuffer for storing numbers in C arrays (very similar to the array type in Python).



In PHP, the data type called array is actually an ordered hash table with string (!) keys. PHP also has lists, but list is not a data type, only a bit of syntactic sugar. People who move from PHP to other languages are sometimes surprised that classic hash tables, it turns out, do not preserve order.



Lua sweeps all tradition aside and uses none of the terms array, vector, or list. Its only data structure type is called a table.



While we are at it, let's go through the names of associative container types:



C++: map (in fact a binary tree; C++11 adds unordered_map, which is a hash table)

JavaScript: object (!) (not really a classic associative array, but you can store values in it under string keys; there is also a Map type)

Lua: table

PHP: array (!) (and only string keys)

Perl: hash (also a “form” rather than a type; ambiguous, because “hash” also names something completely different; and again string keys only)

Python: dict

Rust: map (although it exists as two separate types, BTreeMap and HashMap)



Pointers, references and aliases



C has pointers, which are addresses of data in memory. For C this is natural, since everything in C is about managing data in memory and representing all data as addresses within one large block (well, more or less). A pointer is just an index into that large block.



C++, having inherited pointers from C, immediately warns you against overusing them. As an alternative it offers references, which are just like pointers except that you do not need the “*” operator to get at the value. This immediately creates a new (and very strange) possibility that C did not have: two local variables can name the same block of memory, so the line a = 5; can perfectly well change the value of the variable b.



Rust has references, and they even use the C++ syntax, but they are really “borrowed pointers” (that is, pointers, but transparent). The language also has the less common “raw pointers”, which use C's pointer syntax.



Perl has references. Two separate kinds of them, in fact. Hard references resemble C pointers, except that the address is not exposed and is not meant to be used directly; soft references use the contents of one variable as the name of another. Perl also has aliases, which work much like C++ references, but they do not work for local variables and are not really a data type at all, just manipulation of the symbol table.



PHP has references, but despite Perl's influence their syntax was taken from C++. In C++, being a reference is part of a variable's declared type. PHP has no variable declarations, so a variable starts to count as a reference from the moment it takes part in one of a specific set of operations involving the & operator. This magic symbol “infects” the variable with reference-ness.



Python, Ruby, JavaScript, Lua, Java, and a bunch of other languages have no pointers, references, or aliases. This makes these languages somewhat harder to explain to people coming from C and C++, because when describing higher-level things one often has to say “it points to ...” or “it refers to ...”, which misleads people into thinking there really is some kind of pointer or reference to a region of memory whose contents can be accessed directly. For this reason I call the behavior of C++ references aliasing: it reflects what is going on more clearly and leaves the word “reference” free for more general use.



Pass by reference and by value



Speaking of references. I have already explained this for Python elsewhere, but here is an abbreviated version. The whole dichotomy makes no sense in most languages, because the key question is what C considers to be a value, and that question is meaningless outside the C family of languages.



The fundamental problem is that C has syntax for describing structures, but the language itself sees no structure semantics behind it, only a bunch of bytes. A struct looks like a container, a nice solid one: the contents are enclosed in curly braces, and you need the “.” operator to reach the members. But to C your struct is just a blob of binary data, not much different from an int, only somewhat bigger. Oh, and you can look at individual parts of it. If you put one struct inside another, C simply sets aside a block inside the outer struct for the inner one. When you assign one struct to another, a plain byte-for-byte copy takes place, the same as when assigning, say, double variables. The boundary is illusory. As a result, the only truly “real” container in C is the pointer!



If you pass a struct to a function, C copies it in full, just like a variable of any other type. If you want the function to modify a struct, you have to pass a pointer to it. If you want to pass a very large struct, you again need a pointer, this time for performance.



C++ introduced references, just in case C with its pointers had been too simple and clear. You can still pass a struct “by value” as before, but if the called function accepts a reference, you are now passing your struct “by reference” and the function can modify it. The function's parameter becomes an alias for the variable passed in, so even simple types like int can be overwritten. This “pass by reference” would be better called “pass by alias”.



Java, Python, Ruby, Lua, JavaScript, and many other languages treat containers as entities in their own right. If you have a variable holding a structure and you assign it to another variable, no copying actually takes place. Both variables now simply refer to... no, not “refer”; point to... (no, not “point” either)...



And here it is: the terminology problem! When someone asks whether language X passes parameters by value or by reference, that person is most likely thinking in terms of the C model and imagining every other language as something that must map onto that fundamental model one way or another. If I say “both variables refer to”, you might think I mean a C++ reference (aliasing). If I say “both variables point to”, you might decide we are talking about C-style pointers. In many cases the language has neither. But English has no other words for what we want to say.



Semantically, these languages behave as if the contents of variables (their values) exist on their own, in some abstract world, and variables are just names. Assignment associates a name with a value. It is tempting to explain this to newcomers as “now a points to b” or “now they refer to the same object”, but such explanations add an indirection that does not actually exist in the language. a and b simply both name the same object.



A function call is then a form of assignment: the parameters inside the function now name the same values the caller passed in. You can modify those values, and the calling code will see the modifications, since it names the same values. What the called function cannot do is reassign the caller's variables: a parameter is not an alias, so assigning to it only binds the name (inside the function) to a new value and does not affect variables in the calling code. All of this falls somewhat outside the classic “pass by reference” and “pass by value”. There is no settled terminology here at all; I have heard it called pass by object, pass by name, and pass by sharing.
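In Python the whole model fits in a dozen lines. The function names add_item and replace are invented for this sketch.

```python
def add_item(inventory):
    inventory.append("sword")     # modifies the value the caller also names


def replace(inventory):
    inventory = ["shield"]        # rebinds the local name only
    return inventory


a = []
b = a                             # no copy: both names name one list
add_item(a)                       # visible through both a and b
replaced = replace(a)             # a itself is left untouched
```

After this runs, a and b still name the same one-element list, while replaced names a separate list that only ever existed inside replace and in its return value.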



In principle, C++-style pass by reference can be implemented in other languages too (as I mentioned, PHP can pass by alias using the C++ reference syntax). But pass by alias exists only as an alternative to pass by value, and pass by value exists because in low-level C, half a century ago, nothing else could be done.



Everything you can do with pass by value you can also do by passing by name and then copying explicitly. Passing by alias, for its part, is mostly used to return several values from a function, which high-level languages can do in plenty of other ways.
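Both halves of that claim are easy to sketch in Python. The helper names sort_in_place and min_max are hypothetical.

```python
import copy


def sort_in_place(values):
    values.sort()                 # would modify the caller's list
    return values


def min_max(values):
    return min(values), max(values)   # several results, no alias needed


original = [3, 1, 2]
result = sort_in_place(copy.copy(original))   # the caller copies explicitly
low, high = min_max(original)
```

A shallow copy is enough here because the list holds only numbers; nested containers would call for copy.deepcopy instead.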



Loose typing



This is, of course, a matter of interpretation, but personally I am convinced that there is no such thing as “loose typing”. At least, I have never heard a specific definition of the term.

A reminder:





The concepts of “strong” and “weak” typing form a coherent picture. “Static” and “dynamic” typing are likewise clear and complementary. A language can have elements of both strong and weak typing, and of both static and dynamic, although one position usually dominates. For example, although Go is considered statically typed, interface{} shows signs of dynamic typing. Conversely, Python is formally statically typed, with every variable having the type object, but good luck with that.



Since “strong” versus “weak” typing concerns values, while “static” versus “dynamic” concerns names, all four combinations exist. Haskell is strong and static, C is weak and static, Python is strong and dynamic, shell is weak and dynamic.
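Python's corner of that grid can be shown in a few lines. This is a sketch with made-up variable names.

```python
x = 1
x = "one"        # dynamic: the name x can be rebound to a value of any type

mixed_types_rejected = False
try:
    1 + "1"      # strong: values are not silently converted to make this work
except TypeError:
    mixed_types_rejected = True

widened = 1 + 1.5    # ...although int to float is still converted implicitly
```

The last line is a small reminder that even a strongly typed language has a few weak spots.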



So what is “loose typing”? Some say it is a synonym for “weak”, but plenty of people call Python “loosely typed”, although Python is a strongly typed language. (At least, more strongly than C!)



And since I mostly encounter the term “loosely typed” in a derogatory sense, I assume people mean “not typed the way C++ is”. Here it should be noted that this is the pot calling the kettle black. The C++ type system is far from flawless. What, for example, is the type of a pointer to T? No, it is not T*, because a T* can hold a null pointer (which is not a pointer to a variable of type T) or random garbage (which is also unlikely to be a pointer to a variable of type T). What is the point of being proud of static typing if a variable of some type may not actually contain a value of that type?



Caching



Caching is the funniest case of all, and it is not even a feature of any particular language but a well-known concept. A cache stores the results of some computation so that it need not be repeated later. It is a classic optimization, or rather a trade of memory for speed. I believe the most important property of a cache is that when it is cleared, destroyed, or its data becomes unavailable for any other reason, everything keeps working as before, only a little slower.
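That property is easy to check with the standard library's functools.lru_cache. The function square and the calls list are invented for this sketch.

```python
from functools import lru_cache

calls = []                 # records every real computation


@lru_cache(maxsize=None)
def square(n):
    calls.append(n)
    return n * n


first = square(12)         # computed
second = square(12)        # served from the cache
square.cache_clear()       # throw the cache away
third = square(12)         # computed again, same answer
```

Clearing the cache costs one extra computation and changes nothing else, which is what makes it a cache in the sense described above.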



And everywhere I see programmers and code calling any data kept around for reuse a “cache”. This is very confusing. A good example is a piece of code I have often come across in Python projects. I first noticed it in the Pyramid project, where the decorator was called reify. It performs lazy initialization of an object attribute, something like this:



    from pyramid.decorator import reify

    class Monster:
        def think(self):
            # do something smart
            pass

        @reify
        def inventory(self):
            return []


Here, monster.inventory does not actually exist until you try to read it. At that point the function wrapped by reify is called (once only), and the list it returns becomes an attribute. Everything is completely transparent: once the value has been created, it is an ordinary attribute with no further overhead for indirect access. You can add things to it, and you will see the same result on every access. The attribute did not exist until you brought it to life by trying to look at it.
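For the curious, a decorator with this behavior can be sketched as a non-data descriptor. The name lazy_attribute is hypothetical, and this is not Pyramid's actual source, just one plausible way to get the effect described above.

```python
class lazy_attribute:
    """Hypothetical stand-in for reify; not Pyramid's actual code."""

    def __init__(self, func):
        self.func = func
        self.name = func.__name__

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        value = self.func(instance)
        # Store the result on the instance. Because this descriptor defines
        # no __set__, the instance attribute shadows it on later lookups.
        instance.__dict__[self.name] = value
        return value


class Monster:
    @lazy_attribute
    def inventory(self):
        return []


monster = Monster()
existed_before = "inventory" in vars(monster)
monster.inventory.append("sword")      # first read creates the attribute
exists_after = "inventory" in vars(monster)
```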



Such an approach can make sense for objects that describe several related but still separate aspects of some entity (where, for whatever reason, these cannot be split into separate classes). If initializing one part of such an object can take a long time and the other parts can work without it, it is quite reasonable to use lazy initialization and create that part only when it is actually needed.



For a long time reify was not available on PyPI as a separate package, probably because it can be implemented from scratch in about ten lines. When I said I had seen reify in many projects, I meant that many projects had copied or hand-rolled a reify implementation. And finally the component was added to the repository under the name... cached-property. The documentation even showed how to “invalidate the cache”, that is, by tampering with the object's internal state.



The big problem I see here is that literally every use of this decorator I have seen was not a cache in the classic sense. The example above is rather simple, but even for it, “invalidating” the cache has irreversible consequences: we completely lose the state of monster.inventory. Real uses of @reify often open files or database connections, and in those cases “invalidation” amounts to destroying data. This is absolutely not a cache, whose loss should only slow things down, not corrupt data in memory or on disk.
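The loss of state is easy to reproduce. The sketch below uses functools.cached_property from the standard library (Python 3.8+) as a stand-in, on the assumption that it behaves like the decorator described here; deleting the attribute is the documented way to clear its stored value.

```python
from functools import cached_property


class Monster:
    @cached_property
    def inventory(self):
        return []


monster = Monster()
monster.inventory.append("sword")
held_before = list(monster.inventory)

del monster.inventory            # "invalidating the cache"
held_after = monster.inventory   # a brand new empty list; the sword is gone
```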



Yes, you can build a cache with @reify. You can also build one with a dict, and in various other ways too.



I tried to propose renaming cached-property to reify early on, when the package first appeared in the repository (this mattered, especially given the author's wish to get it into the standard library), but nobody liked the name reify, and the conversation quickly turned into discussing and criticizing other alternative names. So naming things really is the most difficult problem in computer science.

Source: https://habr.com/ru/post/318618/


