
Maximum overload: adventures of JavaScript in the C++ world

How to properly extend the capabilities of a programming language using operator overloading.

Creators and maintainers of programming languages are often asked to add new features to the language. The most frequent answer heard from them is:

"But why? What you are proposing can already be done with the existing means of the language."
Operator overloading appeared in C++ at the request of physicists and mathematicians, who wanted a convenient way to work with their own data types: big numbers, matrices.

Although physicists and mathematicians liked this feature, programmers, including the creators of C++, never really warmed to operator overloading. It is too complicated and too implicit, so the opinion took hold that operator overloading is something harmful, to be used only in rare cases.

Today I will try to show why it is so difficult, and how to use overloading properly, by creating one new type called var whose behavior will be as close as possible to the similar type in JavaScript.

That is, we will try to create a class that will be able to contain either a number, or a string, or an array, or an object. A type that can be initialized with language literals. A type that is correctly converted where appropriate.

First, let's declare the class itself:

struct var { }; 


(Why struct and not class? The only difference between them is that the members and base classes of a struct are public by default. To keep the code easier to read, we will use struct.)

Let's try to put a numeric value and a string value into var:

 struct var { char *str; double num; }; 


Now we need to write constructors. They are called when you write:

 var i = 100; var s = "hello";

The constructors themselves:

 struct var {
     char *str;
     double num;
     var (double initial) { num = initial; }
     var (char *initial) { str = initial; }
 };


Great, now that everything comes alive, we need to display the value on the screen:

 var i = 100, s = "hello"; log(i); log(s); 


How to achieve this?

 void log(var x) { ....     ? } 


How do we know which of the two contents is used in a given var instance?

Clearly, you need to add an internal type. But how to do that? It is logical to use enum:

 enum varType { varNum, varStr }; 


Change the class definition:

 struct var { varType type; char *str; double num; var (double initial); var (char *initial); }; 


Now in the constructors you need to assign the type:

 var::var (double initial) { type = varNum; num = initial; } var::var (char *initial) { type = varStr; str = initial; } 


Well, now we can return to log():

 void log(var x) { if (x.type == varNum) printf("%f\n", x.num); if (x.type == varStr) printf("%s\n", x.str); } 


And now we need to overload the assignment operator:

 void var::operator = (double initial) { type = varNum; num = initial; } void var::operator = (char *initial) { type = varStr; str = initial; } 


Now you can write:

 var a = 10, b = "hello"; 


Interestingly, the assignment operator is a complete copy of the constructor. Maybe it is worth reusing? Let's do so: everywhere in a constructor we can simply call the matching assignment operator.

Here is our complete working code at the moment:

 #include <stdio.h>

 enum varType { varNum, varStr };

 struct var {
     varType type;
     char *str;
     double num;
     var (double initial);
     var (char *initial);
     void operator = (double initial);
     void operator = (char *initial);
 };

 var::var (double initial) { (*this) = initial; }
 var::var (char *initial) { (*this) = initial; }

 void var::operator = (double initial) { type = varNum; num = initial; }
 void var::operator = (char *initial) { type = varStr; str = initial; }

 void log(var x) {
     if (x.type == varNum) printf("%f\n", x.num);
     if (x.type == varStr) printf("%s\n", x.str);
 }

 int main() {
     var x = 100, s = "hello";
     log(x);
     log(s);
 }


And what if we just write:

 int main() { var y; } 


The compiler will curse at us! We cannot declare a variable without initializing it. That is not right, so what is the matter? The matter is that all of our constructors require an initial value.

We need an “empty” constructor, also known as the default constructor. But what is the variable equal to when it has not been given a value yet? It is not yet known whether it will be a number, a string, or something else.

To do this, we introduce the concept of "empty value", known as null or undefined.

 enum varType { varNull, varNum, varStr }; var::var() { type = varNull; } 


Now you can simply declare variables without thinking about the type.

 var a, b, c; 


And assign the values later in the code:

 a = 1; b = "foo"; 


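What about assigning one var to another?

 a = b;

The compiler generates a member-wise copy assignment for us, but we will soon need to control what happens here, so let's write the var = var assignment operator explicitly: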

 void var::operator= (var src) { type = src.type; num = src.num; str = src.str; } 


The type changes on assignment! After “a = b”, the variable "a" holds a string.

Let's try to move on. Temporarily forget that our numbers and strings are unfinished. Let's try to do an array.

First we need a new type in enum:

 enum varType { varNull, varNum, varStr, varArr }; 


Now we need a pointer to the buffer of elements, and a size:

 struct var { ... int size; var *arr; ... } 


Now we will overload the access operator for the element:

 struct var { ... var operator [](int i); ... } 


Such an operator is called a “subscript operator” or index operator.

Our goal: to store in the array elements of type var. That is, we are talking about recursion.

By the way, by the same operator, we will have to refer to individual characters in the string, and to the properties of the object. But in the case of an object, the input will be a string. After all, the key is a string value:

 var operator [](char *key); 


No, that is no good. We do not need a pointer to a character buffer, but a string; we do this:

 struct var { ... var operator [](var key); ... } 


Then, when everything works, we can write:

 x[1] 


or

 x["foo"] 


The compiler converts the argument to var! Why? Because we already have constructors from numeric and string literals.

It will be possible to write like this:

 y = "foo"; x[y]; 
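The mechanism at work is the converting constructor: a constructor callable with one argument lets the compiler build the parameter type implicitly. Here is a small sketch; Key and describe() are hypothetical stand-ins for var and its operator[], and the pointer constructor takes const char * so that string literals convert cleanly.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for var: holds either a number or a string key.
struct Key {
    bool isNumber;
    double num;
    std::string str;

    // Converting constructors: the compiler may call these implicitly.
    Key(int n) : isNumber(true), num(n) {}
    Key(const char *s) : isNumber(false), num(0), str(s) { assert(s != nullptr); }
};

// One function taking Key accepts both kinds of argument.
std::string describe(Key k) {
    return k.isNumber ? "number " + std::to_string(static_cast<int>(k.num))
                      : "string " + k.str;
}
```

A single declaration therefore covers describe(1), describe("foo") and describe(someKey), just as one operator[](var) covers x[1], x["foo"] and x[y].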


By the way, a literal is a “literal value”, that is, a value typed directly in the code. For example, “int a = b;” assigns by name, while “int a = 123;” assigns the literal 123.

One thing is unclear: how does a var become an array? Suppose we create a variable "a"; how do we say that it is an array?

 var a ???; 


JavaScript uses several methods:

 var a = new Array; var a = []; 


Let's try both:

 var newArray() { var R; R.type = varArr; R.size = 0; R.arr = new var [10]; return R; } 


So far, in order to focus on more significant things, we will pretend that 10 elements are all we need.

Now for an interesting point, try to do something like:

 var a = []; 


You cannot use [] in C++, but you can use any identifier, that is, a name. For example Array.

 var a = Array; 


How do we do it? By using a "syntactic type", like this:

 enum varSyntax { Array }; 


Wherever we mention the word “Array”, the compiler will figure out that the “varSyntax” type is needed. But the compiler chooses by type what function, constructor or operator to use.

 struct var { ... var (varSyntax initial) { if (initial == Array) { type = varArr; size = 0; arr = new var[10]; } } ... } var a = Array; 


Of course, where there is a constructor there is an assignment, so we immediately write the assignment operator for the varSyntax type as well.

 void var::operator=(varSyntax initial) { ... } 


In the following code, “a” is first initialized by the constructor var(varSyntax), and then “b” is initialized by the empty constructor and assigned through operator=(varSyntax).

 var a = Array, b; b = Array; 


Since the constructor and the assignment through the "=" always go as a pair, it is logical to apply the same trick, and in the constructor reuse the code from the assignment.

 struct var {
     ...
     var (varSyntax initial) { (*this) = initial; }
     void operator= (varSyntax initial);
     ...
 };

 void var::operator= (varSyntax initial) {
     if (initial == Array) { type = varArr; size = 0; arr = new var[10]; }
     // else if (initial == Object) {
     //     ...
     // }
 }


Somewhere in there we will later be able to create empty objects too. But that comes later.

Well, it's time to try:

 int main() { var a = Array; a[0] = 100; log(a[0]); } 


 error: conversion from 'int' to 'var' is ambiguous a[0] = 100.0; 


Wow, what is going on? We declared operator[] taking a var, yet the compiler complains about an int. If you change a[0] to a[1], everything compiles. What?

 int main() { var a = Array; a[1] = 100; log(a[1]); } 


So, with 1 it compiles...

Only this code does not do anything yet, because we have not written operator[] yet.

We have to write it! Probably something like this:

 var var::operator [](var key) { return arr[key]; } 


 error: no viable overloaded operator[] for type 'var *' return arr[i]; ~~~^~ 


Oh, compiler, what else is wrong?

It turns out that index access to the pointer requires an int, and the compiler does not know how to turn a var into an int.

Well, we could define a conversion operator to int; C++ does have such a thing. But where it is possible to avoid adding yet another operator, it is better to avoid it (long story), so we will do this:

 struct var { ... int toInt() { return num; } ... } var var::operator[] (var i) { return arr[i.toInt()]; } 
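For comparison, here is a minimal sketch of the alternative mentioned above: a conversion operator to int next to the named toInt() method. The Num struct and the pick() helper are hypothetical stand-ins, not part of the article's code, and the `explicit` keyword on a conversion operator assumes C++11 or later.

```cpp
#include <cassert>

// Hypothetical reduced var that only stores a number.
struct Num {
    double num;
    Num(double d) : num(d) {}

    // Named conversion, as in the article: always visible at the call site.
    int toInt() const { return static_cast<int>(num); }

    // Conversion operator. Marked explicit (C++11), so it is usable only
    // through a cast and cannot silently take part in overload resolution.
    explicit operator int() const { return static_cast<int>(num); }
};

// Index a plain buffer using the conversion operator.
int pick(const int *buffer, const Num &index) {
    assert(index.toInt() >= 0);
    return buffer[static_cast<int>(index)];
}
```

Without `explicit`, the operator would add one more implicit conversion path to a class that already has several converting constructors, which is one plausible reason to prefer the named method.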


It compiles, but prints nothing when run. What is the matter?

And how, in general, can it work? How can you read and write the contents of an element through the same operator?

After all, both lines should work:

 a[1] = 100; log(a[1]); 


One line writes, the other reads. It turns out that operator[] should return a reference to the element. Note the & symbol; it is the whole point in this case:

 var& var::operator[] (var i) { return arr[i.toInt()]; } 
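The difference between returning a copy and returning a reference can be seen in isolation. The following is only an illustrative sketch: the Cells struct and the writeThenRead() helper are hypothetical and use plain double elements instead of var.

```cpp
#include <cassert>

// Hypothetical fixed-size container used only to contrast the two return styles.
struct Cells {
    double items[4] = {0, 0, 0, 0};

    // Returns a copy: the caller cannot modify the stored element through it.
    double byValue(int i) { return items[i]; }

    // Returns a reference: the result is the stored element itself.
    double &operator[](int i) {
        assert(i >= 0 && i < 4);
        return items[i];
    }
};

// Writes through operator[] and reads the element back.
double writeThenRead(Cells &c, int i, double value) {
    c[i] = value;   // writes into c.items[i]
    return c[i];    // reads the same element
}
```

Because operator[] yields the element itself, the same operator serves on both sides of an assignment.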


But although “a[1]” now works, “a[0]” still refuses to compile. Why is that?

The point is that the literal 0 can be treated both as a number and as a null pointer, and our var has two constructors: one for a number (double) and one for a pointer (char *). Neither conversion from int is better than the other, so perfectly normal-looking code suddenly fails to compile as soon as 0 is used as a literal. This is one of the particularly refined tortures of C++, from the “ambiguous call” series.

The compiler itself considers zero to be an integer first, that is, an int.

Fortunately, it is enough to teach our var to initialize from int. As usual, we immediately write the constructor and operator =.

 var::var (int initial) { (*this) = (double) initial; } void var::operator = (int initial) { (*this) = (double) initial; } 
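The effect of the extra int overload can be checked with a small probe. This is a sketch under assumptions: Probe and which() are hypothetical names, and the pointer constructor takes const char * (rather than the article's char *) so that string literals convert without warnings.

```cpp
#include <cassert>

// Hypothetical class that records which constructor overload resolution picked.
struct Probe {
    char picked;
    Probe(double)       : picked('d') {}
    Probe(const char *) : picked('p') {}
    // Without this overload, `Probe p = 0;` is ambiguous: the literal 0
    // converts equally well to double and to a null pointer.
    Probe(int)          : picked('i') {}
};

// Returns a letter naming the constructor that was used.
char which(Probe p) { return p.picked; }
```

With the int constructor present, an int argument is an exact match, so both 0 and 1 select it and the ambiguity disappears.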


Here, in order to reuse the code, both simply forward to operator=(double).

So, here is what we have at the moment:

 #include <stdio.h>

 enum varType { varNull, varNum, varStr, varArr };
 enum varSyntax { Array };

 struct var {
     varType type;
     char *str;
     double num;
     var *arr;
     int size;

     var ();
     var (double initial);
     var (int initial);
     var (char *initial);
     var (varSyntax initial) { (*this) = initial; }

     void operator = (double initial);
     void operator = (int initial);
     void operator = (char *initial);
     void operator= (varSyntax initial);
     // Copy every field, including size, so that a copied array stays usable.
     void operator= (var src) { type = src.type; num = src.num; str = src.str; arr = src.arr; size = src.size; }

     var &operator [](var i);
     int toInt() { return num; }
 };

 var::var() { type = varNull; }
 var::var (double initial) { (*this) = initial; }
 var::var (int initial) { (*this) = (double)initial; }
 var::var (char *initial) { (*this) = initial; }

 void var::operator = (double initial) { type = varNum; num = initial; }
 void var::operator = (int initial) { (*this) = (double) initial; }
 void var::operator = (char *initial) { type = varStr; str = initial; }

 void var::operator= (varSyntax initial) {
     if (initial == Array) { type = varArr; size = 0; arr = new var[10]; }
 }

 var &var::operator[] (var i) { return arr[i.toInt()]; }

 void log(var x) {
     if (x.type == varNum) printf("%f\n", x.num);
     if (x.type == varStr) printf("%s\n", x.str);
 }

 int main() {
     var x = 100, s = "hello";
     var a = Array;
     a[0] = 200;
     log(a[0]);
     log(x);
     log(s);
 }


By the way, what if we want to display an array on the screen?

 void log(var x) { if (x.type == varNum) printf("%f\n", x.num); if (x.type == varStr) printf("%s\n", x.str); if (x.type == varArr) printf("[Array]\n"); } 


So far this is the only way.

But I want more.

First, the array length has to adjust itself:

 var &var::operator[] (var i) { int pos = i.toInt(); if (pos >= size) size = pos+1; return arr[pos]; } 


And you need to do push () - adding one element to the end:

 var var::push(var item) {
     if (type != varArr) { var nil; return nil; }
     (*this)[size] = item;   // operator[] already grows size by one
     return item;
 }


Since we are working with a pointer, checking the type is not superfluous. While this article was being prepared, the program crashed in exactly this way. We are not checking the buffer size yet, since we are busy with the overall design, but we will return to this issue.

Now you can rewrite the log () function to display the entire array:

 void log(var x) {
     if (x.type == varNum) printf("%f ", x.num);
     if (x.type == varStr) printf("%s ", x.str);
     if (x.type == varArr) {
         printf("[");
         for (int i = 0; i < x.size; i++) log(x[i]);
         printf("]");
     }
 }


How little work it took, thanks to life-giving recursion!

 int main() { var a = Array; a[0]=100; a.push(200); log(a[0]); log(a[1]); log(a); } 


Data output after launch:

 100.000000 200.000000 [100.000000 200.000000] 


Well, great, we have some basic polymorphism.

You can even put an array inside an array, interspersed with strings and numbers.

 int main() { var a = Array; a.push(100); a.push("foo"); a[2] = Array; a[2][0] = 200; a[2][1] = "bar"; log(a); } 


 [100.000000 foo [200.000000 bar ]] 


I wonder what will happen if we try to write this:

 var a = Array; var b = a.push(Array); b.push(200); b.push("foo"); log(a); 


Here's what:

 [[]] 


Why did this happen?

We can check it in a simple way:

 printf("%d\n", a.arr[0].size); printf("%d\n", b.size); 


Logically, we should see the same number in both cases: 2.

But in fact a.arr[0].size == 0!

The thing is that a[0] and b are two DIFFERENT variables, two different instances. At the moment a.push() returned and its result was assigned, their fields matched, that is, size and arr were identical; but b.push() then increased b.size, while a[0].size was not increased.
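The same effect can be reproduced in a few lines. This is a reduced sketch, not the article's var: the Arr struct and the sizeAfterPushIntoCopy() function are hypothetical, and the buffer holds plain doubles.

```cpp
#include <cassert>

// Hypothetical reduced array: a shared buffer pointer plus a per-instance size.
struct Arr {
    double *items;
    int size;
    void push(double v) { items[size++] = v; }
};

// Copies an array member by member, then pushes into the copy.
// Returns the size seen through the ORIGINAL afterwards.
int sizeAfterPushIntoCopy() {
    double buffer[10] = {0};
    Arr original = {buffer, 0};
    Arr copy = original;   // copies the pointer and the current size
    copy.push(200);        // writes into the shared buffer, bumps copy.size only
    assert(copy.size == 1);
    assert(original.items[0] == 200);  // the data is shared...
    return original.size;              // ...but the size is not: still 0
}
```

The pushed value is physically there, yet the original never learns about it, because the element count lives in each instance rather than next to the data.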

This is a mind-bending problem that is hard even to put into words, and perhaps the reader got completely confused reading the last few lines. It goes by the name “pass by reference”.

In C++, “passing by reference” usually refers to an argument declared with &, but that is a special case. In general it means that changing the copy changes the original.

Let's see how to solve this problem. First, everything related to the array moves into a separate class; for historical reasons I called it lst. Don't dig into its internals too much, just grasp the general idea:

 class lst {
     typedef var** P;
     P p;
     int capacity, size;
     void zeroInit();
 public:
     lst();
     ~lst();
     int length();
     void resize(int newsize);
     var pop();
     void push(const var &a);
     var& operator [](int i);
     void delIns(int pos, int delCount, var *item, int insCount);
 };


To explain: this is a small class that stores a list of pointers, can change its size dynamically, and additionally offers push/pop/delIns.
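The internals of lst are not shown here, so the following is only a sketch of how such a growable list is commonly built, assuming a capacity that doubles on demand. IntList and all of its members are hypothetical, and it stores plain ints instead of var pointers.

```cpp
#include <cassert>

// Hypothetical growable list of ints; the real lst stores var pointers.
class IntList {
    int *items;
    int capacity, count;

    // Double the capacity until `needed` elements fit, preserving contents.
    void reserve(int needed) {
        if (needed <= capacity) return;
        int newCapacity = capacity ? capacity : 4;
        while (newCapacity < needed) newCapacity *= 2;
        int *grown = new int[newCapacity];
        for (int i = 0; i < count; i++) grown[i] = items[i];
        delete[] items;
        items = grown;
        capacity = newCapacity;
    }

public:
    IntList() : items(nullptr), capacity(0), count(0) {}
    ~IntList() { delete[] items; }
    IntList(const IntList &) = delete;             // keep the sketch simple
    IntList &operator=(const IntList &) = delete;

    int length() const { return count; }
    void push(int v) { reserve(count + 1); items[count++] = v; }
    int pop() { assert(count > 0); return items[--count]; }

    // Indexing past the end grows the list, as JavaScript arrays do.
    int &operator[](int i) {
        assert(i >= 0);
        if (i >= count) {
            reserve(i + 1);
            for (int k = count; k <= i; k++) items[k] = 0;  // fill the gap
            count = i + 1;
        }
        return items[i];
    }
};
```

This removes the "10 elements are all we need" limitation: the buffer is reallocated only when it runs out of room.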

This is all we need to ensure that our arrays closely match the JavaScript Array.

Now, let's forget how “var” was arranged before, and try to write “lst” into it correctly:

 struct Ref { int uses; void *data; Ref () { uses = 1; } }; struct var { varType type; union { double num; Ref* ref; }; ... }; 


First, we merged num and ref into a union, because we never need both of these fields at the same time. This saves memory.
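The saving can be made visible by comparing two layouts. This is a sketch with hypothetical struct names (Separate, Overlapped) and a hypothetical tag value; the exact sizes depend on the platform.

```cpp
#include <cassert>

// Layout without a union: every field occupies its own storage.
struct Separate {
    int type;
    double num;
    void *ref;
};

// Layout with an anonymous union: num and ref overlap, as in the article.
struct Overlapped {
    int type;
    union {
        double num;
        void *ref;
    };
};

static_assert(sizeof(Overlapped) <= sizeof(Separate),
              "the union must not make the struct larger");

// Only the member written last may be read; `type` records which one that is.
double storeNumber(Overlapped &v, double d) {
    v.type = 1;   // hypothetical tag meaning "number"
    v.num = d;
    return v.num;
}
```

The price of the overlap is that the type field becomes the only way to know which member is currently valid.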

Second, instead of holding everything related to the array directly, we hold a reference with a counter inside. This is called reference counting.

Later we will store the Object behind the same reference.

Note that the counter is immediately set to 1.

Whenever reference counting is implemented, two basic methods get written right away: one that attaches and one that detaches.

The first does “ref = src.ref; ref->uses++”; it is usually called copy, link, attach or, indeed, reference.

 void var::copy(const var &a) {
     // Numbers and booleans are copied by value; everything else shares the reference.
     type = a.type;
     if (type == varNum || type == varBool) num = a.num;
     else {
         if (a.type == varNull) { return; }
         ref = a.ref;
         if (ref) ref->uses++;
     }
 }


The second does the reverse: the usage counter is decremented, and if it reaches zero, the memory is freed.

It is usually called unlink, unreference or detach. I am used to calling it unref().

 void var::unref() {
     if (type == varNum || type == varNull || type == varBool) return;
     else if (type == varStr) {
         ref->uses--;
         if (ref->uses == 0) { delete (chr*)ref->data, delete ref; }
     }
     else if (type == varArr) {
         ref->uses--;
         if (ref->uses == 0) { deleteLst(); }
     }
     else if (type == varObj) {
         ref->uses--;
         if (ref->uses == 0) { deleteObj(); }
     }
     type = varNull;
     ref = 0;
 }
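Stripped of the type dispatch, the attach/detach pair can be sketched on its own. Block and Handle below are hypothetical names with an int payload; attach() and detach() play the roles of copy() and unref(), and the copy constructor, assignment and destructor simply call them.

```cpp
#include <cassert>

// Hypothetical shared block: a use counter next to the payload.
struct Block {
    int uses;
    int payload;
};

// Hypothetical handle; attach/detach play the roles of copy() and unref().
struct Handle {
    Block *ref;

    explicit Handle(int value) : ref(new Block{1, value}) {}
    Handle(const Handle &other) : ref(nullptr) { attach(other); }
    Handle &operator=(const Handle &other) {
        if (ref != other.ref) { detach(); attach(other); }
        return *this;
    }
    ~Handle() { detach(); }

    // Share the other handle's block and count the new user.
    void attach(const Handle &other) {
        ref = other.ref;
        if (ref) ref->uses++;
    }
    // Drop our use; free the block when nobody else holds it.
    void detach() {
        if (!ref) return;
        assert(ref->uses > 0);
        if (--ref->uses == 0) delete ref;
        ref = nullptr;
    }
    int uses() const { return ref ? ref->uses : 0; }
};
```

Copying a Handle never copies the payload; it only bumps the counter, and the last handle to go away frees the block.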


The data field in the Ref structure has type void *, that is, it is just a pointer, and it will store the address of the actual instance of the array (lst) or object (obj). By object we mean the thing that stores key/value pairs, like a JavaScript object ([object Object]).

In essence, reference counting is a form of garbage collection.

The words “garbage collector” (GC) usually suggest a collector that runs periodically, on a timer, but technically reference counting is the simplest kind of garbage collector, even by Wikipedia's classification.

And, as you can see, it is not that simple after all; it can twist your brain at times.

Just so that the reader is not confused, I will repeat everything from the beginning:

We make a class var and in it we encapsulate either a double, or lst (for an array), or chr (for strings), or keyval (for objects).

Here is our class for working with strings:

 struct chr {
     int size;
     wchar_t *s;
     chr ();
     ~chr();
     void set(double i);
     void set(wchar_t *a, int length = -1);
     void setUtf(char *a, int length = -1);
     void setAscii(char *a, int length = -1);
     char * getAscii();
     char * getUtf();
     wchar_t operator [](int i);
     double toNumber ();
     int intToStr(int i, char *s);
     void dblToStr (double d, char *s);
     int cmp (const chr &other);
     int find (int start, wchar_t *c, int subsize);
     chr substr(int pos, int count = -1);
     int _strcount(const chr &substring);
     void _cpto(int from, const chr &dest, int to, int count);
     chr clone();
     void replace(chr &A, chr &B, chr &dest);
 };


And here is a class for objects:

 struct keyval { var keys, vals; keyval (); void set(var key, var val); var &get(var key); }; 


Here we get full recursion and polymorphism: note that keyval uses arrays in the form of var, in order to become part of var itself. And it works!

One of the most important features of using reference counting is that if you want to change an object, you must understand that all who refer to it will also receive a modified object.

For example:

 void f(var t) { t += "world"; log(t); }

 var s = "hello";
 f(s);
 log(s);


Output:

 world world 


When s is passed to f(), instead of copying all the characters of the string, only one pointer is copied and one counter is incremented.

But after the string t is changed, the string s changes too. That is what we need for arrays, but not for strings! This is pass-by-reference again.

When a variable shared through reference counting must change independently of its source, we have to call a detach / unref / unlink function before every change.

This is how Delphi strings work, for example. The technique is called copy-on-write.
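A minimal copy-on-write sketch follows. It is not the library's implementation: CowString and its members are hypothetical, and std::shared_ptr stands in for the hand-written Ref counter. The use_count() check is only adequate for single-threaded code.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical copy-on-write string; std::shared_ptr stands in for the
// article's hand-written Ref counter.
class CowString {
    std::shared_ptr<std::string> data;

    // Make sure this instance owns its buffer exclusively before a change.
    void detach() {
        if (data.use_count() > 1)
            data = std::make_shared<std::string>(*data);
        assert(data.use_count() == 1);
    }

public:
    CowString(const char *s) : data(std::make_shared<std::string>(s)) {}

    // Copying shares the buffer: one pointer copy, one counter increment.
    CowString(const CowString &) = default;

    CowString &operator+=(const char *s) {
        detach();   // copy only if somebody else still uses the buffer
        *data += s;
        return *this;
    }
    const std::string &str() const { return *data; }
    bool sharesBufferWith(const CowString &o) const { return data == o.data; }
};
```

Copies stay cheap until the first modification, at which point the writer silently gets a private buffer and the other holders keep the old contents.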

This is considered a bad solution. But how can we give up copy-on-write and still keep pass-by-reference and copy-pointer-and-increment (reference counting)?

The answer has become a standard of modern programming: instead of changing the variable, make it unchangeable! This is called immutability.

Following the principle of immutability, a string in JavaScript is set only once and can never be changed afterwards. Every string function that “changes” something returns a new string. This greatly eases the hard work of carefully placing all the copy / unref calls, pointer checks and other memory bookkeeping.
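The immutable approach can be sketched as follows. ImmutableString, concat() and appendWorld() are hypothetical names chosen for illustration, and std::shared_ptr again stands in for manual reference counting.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical immutable string: the shared buffer is const, so no holder
// can change it and copies never need to detach.
class ImmutableString {
    std::shared_ptr<const std::string> data;

public:
    ImmutableString(const char *s) : data(std::make_shared<std::string>(s)) {}

    // "Modifying" operations build and return a new string.
    ImmutableString concat(const ImmutableString &other) const {
        return ImmutableString((*data + *other.data).c_str());
    }
    const std::string &str() const { return *data; }
};

// Mirrors f() from the example above, but the caller's string is untouched.
ImmutableString appendWorld(ImmutableString t) { return t.concat("world"); }
```

Passing the string still costs one pointer copy and one counter increment, yet there is no detach step to forget, because nothing can modify a buffer once it exists.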

Here I suddenly have to stop, because the article has exceeded the 20K characters that are comfortable for a reader. Yet there are still about 20 operators left to overload, even operator, (comma). We still need to combine objects and arrays, write JSON.parse, implement comparison of strings and booleans, write a constructor for Boolean, invent and implement a notation for initializing arrays and objects, solve the problem of a multi-argument log(...), decide what to do with undefined and typeof, implement replace / slice correctly, and so on. And all of this without a single template, using only operator overloading and functions.

So, if you are interested, we will soon continue.

For the most curious, a link to the library repository:

github.com/exebook/jslike

Source: https://habr.com/ru/post/261351/

