Elementary types and operations on them. Part I: data types, sizes, limits.

The building blocks of any language are its elementary data types. Once we know them, we always understand what is stored in a given variable, what a given function returns, and what operations we can perform on our data. This is the foundation. So that is what I want to focus on in this article in general, with examples of working with binary data in particular.

The material is aimed primarily at those who have just started, or want to start, writing in Erlang. But I tried to cover this aspect of the language as fully as possible, so I hope it will also be useful to a more advanced audience.

The original material had to be split into three parts. This part covers the basic types of the language, the ways of creating them, and the resources each type consumes.

Introduction

First, I want to express my deep gratitude to the members of the Russian-language Erlang mailing list on Google for raising my karma and thus giving me the opportunity to publish this article on Habr.
The examples in this article are taken from the Erlang shell, so you need to know a few simple rules for working with it. Expressions in the shell are separated by commas, and it does not matter at all whether they are typed on one line or on several.

1> X=1, Y = 2,
1> Z = 3,
1> S=4.
4

A period marks the end of input and triggers evaluation. The shell then prints the value returned by the last expression; in the example above that is the value of the variable S. The values of all bound variables are remembered, and since a bound variable cannot be rebound in Erlang, an attempt to do so results in an error:

2> Z=2.
** exception error: no match of right hand side value 2

Therefore, if you need the shell to “forget” variable bindings during the current session, you can use the f() function. Called without arguments, it removes all bindings. If a variable name is given as an argument, only that variable is forgotten (a list of variables cannot be passed):

3> f(Z).
ok
4> X = 4.
** exception error: no match of right hand side value 4
5> f().
ok
6> X = 4.
4

To exit, simply enter halt(), or press Ctrl+G to open the user switch command interface and enter q (the h command displays help). When the shell prints numeric data, it converts it to decimal form.

The material refers to the latest version at the time of writing, 5.6.5. Strings are encoded in ISO-8859-1 (Latin-1), so all numeric character codes are taken from this encoding. The first half of the encoding (codes 0-127) matches US-ASCII, so there are no problems with the Latin alphabet.

Although the developers state that Latin-1 is used in the internal representation, “outside” the virtual machine this is often far from obvious. The reason is that Erlang sends and receives characters as codes. If a locale is set in the terminal, the codes are interpreted according to the active code page and, where possible, displayed as printable characters. Here is an example from an SSH session:

# setenv | grep LANG
LANG=
# erl
Erlang (BEAM) emulator version 5.6.5 [source] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.6.5 (abort with ^G)
1> [255].
"\377"
2> halt().
# setenv LANG ru_RU.CP1251
# erl
Erlang (BEAM) emulator version 5.6.5 [source] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.6.5 (abort with ^G)
1> [255].
"я"

It does not matter at all what exactly the LANG environment variable contains; what matters is that it is set.

1. Elementary types

There are not many basic types in the language: number (integer or floating point), atom, binary data, bit string, function object (as in JavaScript), port identifier, process identifier (an Erlang process, not an operating-system one), tuple, and list. There are also several pseudo-types: record, boolean, string. Any piece of data of any type (not necessarily elementary) is called a term.

1.1 Number

Two kinds of numbers are supported: floating-point (float) and integer (integer). In addition to the conventional way of writing numbers, there are two special notations:

$char
the ASCII code (locale dependent) of the character char.
base#value
an integer value written in the given base; the base can be in the range 2...36

For example:

1> 42.
42
2> $A.
65
3> $\n.
10
4> 2#101.
5
5> 16#1f.
31
6> 2.3.
2.3
7> 2.3e3.
2.3e3
8> 2.3e-3.
0.0023
9> $я.
255

Remember that numbers in the output are converted to decimal.
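The notations above can also be checked from compiled code. The following is a minimal sketch; the module name num_notation and the function demo/0 are made up for illustration, while integer_to_list/2 and list_to_integer/2 are standard BIFs.

```erlang
-module(num_notation).
-export([demo/0]).

%% Compares the same integer written in several notations and
%% converts between integers and strings in a given base.
demo() ->
    Same = (16#1f =:= 31) andalso (2#11111 =:= 31) andalso ($A =:= 65),
    Hex  = integer_to_list(31, 16),      % integer -> string in base 16
    Back = list_to_integer("1f", 16),    % string in base 16 -> integer
    {Same, Hex, Back}.
```

Calling num_notation:demo() should return {true,"1F",31}: whatever notation is used in the source, the value is the same integer.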

Memory consumption and limitations. An integer occupies one machine word, which is 4 bytes on 32-bit and 8 bytes on 64-bit processors. Big integers occupy 3...N machine words. Floating-point numbers occupy 4 machine words on a 32-bit architecture and 3 on a 64-bit one.
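All sizes in this article are given in machine words. As a rough aid, the word size of the running emulator can be queried with erlang:system_info(wordsize). The sketch below assumes a hypothetical helper module word_size that converts a word count into bytes.

```erlang
-module(word_size).
-export([bytes/0, words_to_bytes/1]).

%% Size of one machine word in bytes on the running emulator (4 or 8).
bytes() ->
    erlang:system_info(wordsize).

%% Converts a size given in machine words into bytes.
words_to_bytes(Words) when is_integer(Words), Words >= 0 ->
    Words * bytes().
```

For example, word_size:words_to_bytes(4) gives the size in bytes of a value that occupies four words on the current architecture.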

1.2 List

A list (List) lets you group data into a single structure. A list is created with square brackets, and its items are separated by commas. An item can be of any type. The first item of a list is called the head, and the rest is the tail.

1> Var = 5.
5
2> [1,45, atom, Var,"string", $Z, 2#101].
[1,45,atom,5,"string",90,5]
3> List = [1, 9.2, 3], % head: 1, tail: [9.2, 3]
3> List.
[1,9.2,3]

The size of a list is equal to the number of items in it. In the example above, the value of the List variable is a list of size 3.
A list is a dynamic structure: items can be added and removed. Inside the virtual machine a list is a singly linked list, which affects how it should be processed, but more on that later.
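The head and tail of a list can be taken apart with the [Head | Tail] pattern, and the same syntax prepends an item. A minimal sketch, in which the module name list_demo is hypothetical:

```erlang
-module(list_demo).
-export([demo/0]).

demo() ->
    List = [1, 9.2, 3],
    [Head | Tail] = List,        % Head = 1, Tail = [9.2, 3]
    Longer = [0 | List],         % prepending reuses List as the tail
    {Head, Tail, length(List), Longer}.
```

list_demo:demo() should return {1,[9.2,3],3,[0,1,9.2,3]}. Because the list is singly linked, prepending is cheap, while length/1 has to walk the whole list.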

Memory consumption and limitations. Each list item occupies one machine word (4 or 8 bytes depending on the architecture) plus the size of the data stored in the item. Thus, on a 32-bit architecture the value of the List variable occupies (1 + 1) + (1 + 4) + (1 + 1) = 9 words, or 36 bytes.

1.3 String

Strictly speaking, there are no strings in Erlang. A string is simply syntactic sugar that lets you write a list of integers in a more convenient form. Each item of this list is the ASCII code of the corresponding character.

1> "Surprise".
"Surprise"
2> [83,117,114,112,114,105,115,101].
"Surprise"
3> "строка".
"строка"
4> [$с,$т,$р,$о,$к,$а].
"строка"
5> [$с, $т, $р, $о, $к, $а, 1].
[241,242,240,238,234,224,1]

So when the virtual machine sees a list whose elements can all be interpreted as codes of printable characters, it assumes that it is a string and displays it in character form. Unlike in many other languages, strings in Erlang are written only with double quotes, never with single quotes, because single quotes are used for atoms. Escape sequences are allowed inside strings (see below).

Memory consumption and limitations. Since a string is a list of integers and each character is one list element, a character takes 2 machine words, i.e. 8 or 16 bytes.
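That a string really is a list of character codes can be checked directly. A small sketch, with the hypothetical module name string_demo:

```erlang
-module(string_demo).
-export([demo/0]).

demo() ->
    Str   = "Surprise",
    Codes = [83,117,114,112,114,105,115,101],
    {Str =:= Codes,        % the literal and the list are the same term
     length(Str),          % one list element per character
     hd(Str) =:= $S,       % the head is the code of the first character
     is_list(Str)}.
```

string_demo:demo() should return {true,8,true,true}.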

1.4 Atom

An atom (Atom) is just a literal. Unlike a constant in other languages, it cannot be associated with any numeric value: the value of an atom is the atom itself. If an atom begins with a lowercase letter and consists only of digits, Latin letters, the underscore _ and the at sign @, it does not need to be enclosed in single quotes. If it contains any other characters, it must be enclosed in single quotes. Double quotes cannot be used for this, because they delimit strings.

For example:

hello
phone_number
'Monday'
'phone number'

The following escape sequences can be used in strings and in quoted atoms:

Sequence	Description
\b	backspace
\d	delete
\e	escape
\f	form feed
\n	newline
\r	carriage return
\s	space
\t	horizontal tab
\v	vertical tab
\XYZ, \YZ, \Z	octal character code
\^a ... \^z, \^A ... \^Z	Ctrl+A ... Ctrl+Z
\'	single quote
\"	double quote
\\	backslash

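The name of an unquoted atom cannot be a reserved word. The reserved words are:
after and andalso band begin bnot bor bsl bsr bxor case catch cond div end fun if let not of or orelse query receive rem try when xor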

Memory consumption and limitations. Every atom is unique, and its textual representation is stored in an internal structure of the virtual machine called the atom table. An atom occupies 4 or 8 bytes (one machine word) and is simply a reference to the atom table entry that holds its text. The garbage collector does not clean up the atom table, and the table itself also takes up memory. An atom may be up to 255 characters long, and at most 1,048,576 atoms may be used in total. Thus an atom of 255 characters will occupy 255 * 2 + 1 * N machine words, where N is the number of references to the atom in the program.
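Because the atom table is never garbage collected, creating atoms from arbitrary input can exhaust it. One common precaution, sketched below under the assumption of a hypothetical module atom_demo, is to convert strings with list_to_existing_atom/1, which fails instead of creating a new atom.

```erlang
-module(atom_demo).
-export([to_known_atom/1]).

%% Converts a string to an atom only if that atom already exists,
%% so untrusted input cannot grow the atom table.
to_known_atom(Str) when is_list(Str) ->
    try
        {ok, list_to_existing_atom(Str)}
    catch
        error:badarg -> {error, unknown_atom}
    end.
```

For instance, atom_demo:to_known_atom("ok") returns {ok,ok}, while a string that does not name an existing atom yields {error,unknown_atom}.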

1.5 Tuple

A tuple is similar to a list: it consists of a set of elements and has a size equal to the number of elements. Unlike a list, however, its size is fixed. A tuple is created with curly brackets, its elements can be of any type, and tuples can be nested.

1> {1,2, 2#110, n, $r, [1, 5], "abc"}.
{1,2,6,n,114,[1,5],"abc"}
2> Man = {man,
2> {name, "Alexey"},
2> {height, {meter, 1.86}},
2> {age, 27}}.
{man,{name,"Alexey"},{height,{meter,1.86}},{age,27}}
3> Man2 = {man,
3> {name, "Ivan"},
3> {height, {meter, 1.80}},
3> {age, 25}}.
{man,{name,"Ivan"},{height,{meter,1.8}},{age,25}}
4> [Man, Man2].
[{man,{name,"Alexey"},{height,{meter,1.86}},{age,27}},
 {man,{name,"Ivan"},{height,{meter,1.8}},{age,25}}]
5> {20, 100}.
{20,100}

Tuples are convenient because they let you not only put specific data into a structure but also describe it. This, together with the fixed size of a tuple, makes tuples very effective in pattern matching. It is good practice to start a tuple with an atom that describes what the tuple represents. To draw an analogy with an RDBMS: a list is a table, each list element is a row of the table, and each tuple inside that element is the value in the corresponding column.

Memory consumption and limitations. A tuple takes 2 machine words plus the size needed to store the data itself. For example, the tuple in line 5 occupies 2 + 1 + 1 = 4 machine words, or 16 bytes on a 32-bit architecture. The maximum number of elements in a tuple is 67,108,863.
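Tagged tuples like the ones above are usually taken apart with a pattern and "changed" by building a new tuple. A minimal sketch, with the hypothetical module name tuple_demo; element/2, setelement/3 and tuple_size/1 are standard BIFs:

```erlang
-module(tuple_demo).
-export([demo/0]).

demo() ->
    Man = {man, {name, "Alexey"}, {height, {meter, 1.86}}, {age, 27}},
    %% Destructure by pattern: the atoms must match, the variables get bound.
    {man, {name, Name}, _Height, {age, Age}} = Man,
    %% setelement/3 returns a new tuple; Man itself is not modified.
    Older = setelement(4, Man, {age, Age + 1}),
    {Name, tuple_size(Man), element(1, Man), element(4, Older)}.
```

tuple_demo:demo() should return {"Alexey",4,man,{age,28}}.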

1.6 Record

A record (Record) is another example of syntactic sugar: internally it is stored as a tuple. Records are converted to tuples at compile time, so they cannot be defined directly in the shell with the usual syntax. Instead, you can use the rd() function to declare a record structure (line 1). A record declaration always consists of two elements. The first must be an atom, called the record name. The second is always a tuple, possibly empty, whose elements are pairs field_name = field_value; the field name must be an atom, and the value can be of any valid type (including a record, line 11).

A tuple is created from a record (line 2) with the # operator followed by the record name and a tuple of field values. The tuple of values may be empty, but it must not contain field names that are not declared in the record definition.

1> rd(person, {name = "", phone = [], address}).
person
2> #person{}.
#person{name = [],phone = [],address = undefined}
3> #person{phone=[1,2,3], name="Joi", address="Earth"}.
#person{name = "Joi",phone = [1,2,3],address = "Earth"}

The advantage of a record over a tuple is that elements are accessed by field name (the name must be an atom and must be declared in the record definition, line 1) rather than by position. Accordingly, the order of assignment does not matter, and there is no need to remember whether the first element is the name, the phone or the address. You can change the order of the fields and add new ones. In addition, a field can be given a default value that is used when the field is not specified.

4> rd(person, {name="Smit", phone}).
person
5> P = #person{}.
#person{name = "Smit",phone = undefined}
6> J = #person{phone = [1,2,3], name = "Joi"}.
#person{name = "Joi",phone = [1,2,3]}
7> P#person.name.
"Smit"
8> J#person.name.
"Joi"
9> W = J#person{name="Will"}.
#person{name = "Will",phone = [1,2,3]}

If no default value is defined for a field when the record is declared (the phone field in line 4), its value is the atom undefined. A field of a variable created from a record is accessed with the syntax shown in lines 7 and 8. The value of a variable can be copied into a new variable (line 9). If no field values are given, the result is an exact copy; if some are given, the corresponding fields are replaced in the new variable. None of these manipulations affects either the record definition or the field values in the old variable.
To me personally this looks a lot like declaring a class and creating its instances, although I will stress once again that it is just a way of storing a tuple in a variable.

10> rd(name, {first = "Robert", last = "Ericsson"}).
name
11> rd(person, {name = #name{}, phone}).
person
12> P = #person{name = #name{first="Robert",last="Virding"}, phone=123}.
#person{name = #name{first = "Robert",last = "Virding"},
phone = 123}
13> First = (P#person.name)#name.first.
"Robert"

The example above illustrates nested records and the syntax for accessing their inner fields.
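In a module, a record is declared with the -record attribute rather than rd(). The sketch below (the module name record_demo is hypothetical) also shows that the resulting value is an ordinary tuple whose first element is the record name.

```erlang
-module(record_demo).
-export([demo/0]).

-record(person, {name = "", phone = [], address}).

demo() ->
    P = #person{name = "Joi", phone = [1,2,3]},
    W = P#person{name = "Will"},                   % copy with one field changed
    {P#person.name,
     W#person.name,
     P =:= {person, "Joi", [1,2,3], undefined},    % the underlying tuple
     #person.phone}.                               % index of the field in the tuple
```

record_demo:demo() should return {"Joi","Will",true,3}: the phone field is the third element of the tuple, after the record name and the name field.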

1.7 Binary data and bit strings

Both binaries (Binaries) and bit strings (Bit strings) let you work with binary data directly. The difference is that a binary must consist of a whole number of bytes, i.e. its length in bits is a multiple of eight, while bit strings let you work with data at the level of individual bits. In other words, a binary is a special case of a bit string whose length in bits is a multiple of eight. You can either create data by describing its structure or use this syntax in patterns. Binary data is described by the following structure:

<<E1, E2, ... En>>

A single element of such a structure is called a segment. Segments describe the logical structure of binary data and can consist of an arbitrary number of bits or bytes. This makes for a very powerful and convenient tool when used in patterns (an example of such use will be discussed in the third part).

1> <<20, $W, 50:8, "abc">>.
<<20,87,50,97,98,99>>
2> <<400>>.
<<144>>
3> <<400:16>>.
<<1,144>>
4> Var = 30.
30
5> <<(Var + 30), (20+5)>>.
<<60,25>>

To understand why the binary created in line 2 contains 144 (that is, 10010000; remember that the shell prints all numeric data in decimal) rather than the expected 400, let us look at the bit syntax of a segment description.

 Ei = Value |
      Value:Size |
      Value/TypeSpecifierList |
      Value:Size/TypeSpecifierList

The full form of a segment description consists of a value (Value), a size (Size) and a type specifier list (TypeSpecifierList). The size and the specifier list are optional; if they are omitted, default values are used.

Value in a constructor can be a number (integer or floating point), a bit string, or a string, which, as we remember, is actually a list of integers. However, the value of a segment cannot be a list, not even a list of integers: inside the constructor a string is syntactic sugar for a character-by-character conversion to integers, not for a list. That is, <<"abc">> is syntactic sugar for <<$a, $b, $c>>, not for <<[$a, $b, $c]>>.
In patterns, the value can be a literal or an unbound variable. Nested patterns are not allowed. Expressions can also be used in Value, but then the expression must be enclosed in parentheses (line 5).

Size determines the size of the segment in units (Unit, described just below) and must be an integer. If Size is not given explicitly, its default depends on the type (Type, see below) of Value: 8 for integers, 64 for floating-point numbers, the whole number of bytes for binaries, and the whole number of bits for bit strings. The total size of the segment in bits is Size * Unit.
In patterns, the Size of a binary or bit string segment must be given explicitly; it may be omitted only for the last segment (line 7), which then receives the rest of the data (much like reading from a given character to the end of a line without saying how many characters you want).

6> Bin = <<30>>.
<<30>>
7> <<X:2, Y:3, Z/bits>> = Bin.
<<30>>
8> Z. % the remaining 3 bits: 110
<<6:3>>

The specifier list (TypeSpecifierList) consists of options separated by hyphens and written in any order (for better readability, I recommend writing the unit last).

Type = integer | float | binary | bytes | bitstring | bits
Specifies the type of Value. bytes is a short form of binary, and bits is a short form of bitstring. The default is integer.
Signedness = signed | unsigned
Indicates whether the integer is signed or unsigned. It is meaningful only for the integer type. The default is unsigned.
Endianness = big | little | native
Byte order. One byte can encode the integers 0...255, so larger numbers need two or more bytes. For example, the number 400 encoded in two bytes looks like 00000001 10010000, with the high byte first. This is network byte order (big-endian). When the low byte comes first, we speak of Intel byte order (little-endian). The value native means that the byte order is chosen at load time, depending on which order is native for the CPU on which the virtual machine is running. Byte order is meaningful only for numbers. The default is big.
Unit = unit:IntegerLiteral
The unit size, a number in the range 1...256. Together with Size it determines the size of the segment in bits, and it cannot be specified without an explicit Size. The default is 1 for numbers and bit strings, and 8 for binaries.

Now the example in line 2 should be clearer. By default, the binary constructor in Erlang creates one-byte segments unless a different size is given explicitly. So line 2 is equivalent to <<400:8/integer-unsigned-big-unit:1>>, and the virtual machine truncates the value to its low-order eight bits, 10010000, which is 144 in decimal. For a one-byte segment the byte order makes no difference; it only affects the layout of segments that span several bytes (line 10). If the segment is large enough to hold the value, no truncation occurs.

9> <<400:16>>.
<<1,144>>
10> <<400:16/little>>.
<<144,1>>
11> <<400:8/unit:2>>.
<<1,144>>

With the bit syntax, the same data can be described in different ways (lines 9 and 11 describe the same two-byte structure).
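The construction and matching rules above can be put together in a small module. This is only a sketch; the module name bits_demo and its function names are invented for illustration.

```erlang
-module(bits_demo).
-export([encode/1, decode_little/1, split/1]).

%% Encodes V as one byte (truncating), two big-endian bytes
%% and two little-endian bytes.
encode(V) when is_integer(V) ->
    {<<V>>, <<V:16>>, <<V:16/little>>}.

%% Reads a two-byte little-endian unsigned integer.
decode_little(<<N:16/little-unsigned-integer>>) ->
    N.

%% Splits a byte into a 2-bit field, a 3-bit field and the remaining bits.
split(<<X:2, Y:3, Rest/bits>>) ->
    {X, Y, Rest}.
```

For example, bits_demo:encode(400) should return {<<144>>,<<1,144>>,<<144,1>>}, bits_demo:decode_little(<<144,1>>) should return 400, and bits_demo:split(<<30>>) should return {0,3,<<6:3>>}.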

Memory consumption and limitations. 3...6 machine words plus the data itself. On a 32-bit architecture a binary can hold up to 536,870,911 bytes, on a 64-bit system up to 2,305,843,009,213,693,951 bytes. To handle larger structures, you have to write the processing functions yourself.
Attention. The expression B=<<1>> is parsed as B =< <1>> (i.e. B less than or equal to <1>>) and causes a syntax error. The correct form uses spaces: B = <<1>>.

1.8 Reference

A reference (Reference) is a term created by the function make_ref/0 that can be considered unique. It can be used, for example, as a primary key in a data structure.
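Besides serving as a key, a reference is often used to tag a request so that the matching reply can be recognized. A minimal sketch, assuming a hypothetical module ref_demo with a throwaway echo process:

```erlang
-module(ref_demo).
-export([call/1]).

%% Sends Request to a freshly spawned echo process and waits for the
%% reply that carries the same reference.
call(Request) ->
    Parent = self(),
    Ref = make_ref(),
    spawn(fun() -> Parent ! {Ref, {echo, Request}} end),
    receive
        {Ref, Reply} -> Reply      % only a message tagged with Ref matches
    after 1000 ->
        timeout
    end.
```

ref_demo:call(ping) should return {echo,ping}; messages carrying any other reference stay in the mailbox.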

Memory consumption and limitations. On a 32-bit architecture a reference requires 5 machine words for the current local node and 7 words for a remote one; on a 64-bit architecture, 4 and 6 words respectively. In addition, a reference is tied to the node table, which also consumes memory.

1.9 Boolean

The boolean type (Boolean) is a pseudo-type. In fact, it is just two atoms, true and false.

1.10 Function Object

 fun
     (Pattern11, ..., Pattern1N) [when GuardSeq1] ->
         Body1;
     ...;
     (PatternK1, ..., PatternKN) [when GuardSeqK] ->
         BodyK
 end

A function declaration begins with the keyword fun and ends with the keyword end. The parameters are listed in parentheses, separated by commas, and each parameter is a pattern. If the arguments match the patterns of a clause, the instructions between the -> sign and the ; are executed. In essence, the patterns act as input filters. If none of the clauses matches, an error is raised.

1> Fun = fun({centimetre, X}) -> {meter, X/100} end.
#Fun<erl_eval.6.13229925>
2> Fun(10).
** exception error: no function clause matching
erl_eval:'-inside-an-interpreted-fun-'(10)
3> Fun({centimetre, 10}).
{meter,0.1}

But if the function had been defined differently, the error in line 2 would not have occurred:

4> f().
ok
5> Fun = fun({centimetre, X}) ->
5> {meter, X / 100};
5> (X) ->
5> X / 100
5> end.
#Fun<erl_eval.6.13229925>
6> Fun(10).
0.1

So the arguments must match one of the patterns declared in the function. Between the keyword when and the -> sign you can place a guard, an expression that evaluates to true or false. The body of a clause is executed only if its guard evaluates to true. If no clause matches, an error is raised (line 12).

7> F = fun(X) when X<0 ->
7> X+1;
7> (X) when X>0 ->
7> X-1;
7> (0) -> 0
7> end.
#Fun<erl_eval.6.13229925>
8> F(5).
4
9> F(-5).
-4
10> F(0).
0
11> X = fun(Y) when Y>0 -> Y + 1 end.
#Fun<erl_eval.6.13229925>
12> X(-5).
** exception error: no function clause matching erl_eval:'-inside-an-interpreted-fun-'(-5)
13> X(5).
6
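The same clauses and guards work for named functions in a module, where a failed match can be caught as a function_clause error. A sketch with hypothetical names (the module guard_demo, the functions step/1 and safe_step/1, and the is_number/1 checks are additions for illustration):

```erlang
-module(guard_demo).
-export([step/1, safe_step/1]).

%% Moves a number one step towards zero; the guards select the clause.
step(X) when is_number(X), X < 0 -> X + 1;
step(X) when is_number(X), X > 0 -> X - 1;
step(0) -> 0.

%% Turns the "no function clause matching" error into a return value.
safe_step(X) ->
    try
        step(X)
    catch
        error:function_clause -> {error, no_matching_clause}
    end.
```

Here guard_demo:step(5) returns 4 and guard_demo:step(-5) returns -4, while guard_demo:safe_step(abc) returns {error,no_matching_clause} instead of raising.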

A function can also return another function:

14> Adder = fun(X) -> fun(Y) -> X + Y end end.
#Fun<erl_eval.6.72228031>
15> Add6 = Adder(6).
#Fun<erl_eval.6.72228031>
16> Add6(10).
16
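A fun that remembers a value from its environment can be passed to higher-order functions such as lists:map/2. A minimal sketch; the module name closure_demo is hypothetical:

```erlang
-module(closure_demo).
-export([adder/1, demo/0]).

%% Returns a fun that remembers X from the enclosing call.
adder(X) ->
    fun(Y) -> X + Y end.

demo() ->
    Add6 = adder(6),
    {Add6(10), lists:map(Add6, [1, 2, 3])}.
```

closure_demo:demo() should return {16,[7,8,9]}.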

Memory consumption and limitations. 9…13 machine words + the size of the environment.

1.11 Process identifier

A process identifier (Pid) is returned by the functions spawn/1,2,3,4, spawn_link/1,2,3,4 and spawn_opt/4 when a process is created.

1> spawn(m, f, []).
<0.51.0>
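The call above assumes that a module m with a function f/0 exists. A self-contained sketch that spawns a fun instead and uses the returned Pid to recognize the reply (the module name pid_demo is hypothetical):

```erlang
-module(pid_demo).
-export([demo/0]).

demo() ->
    Parent = self(),
    %% spawn/1 returns the Pid of the new process immediately.
    Pid = spawn(fun() -> Parent ! {self(), hello} end),
    receive
        {Pid, Msg} -> {is_pid(Pid), Msg}   % match only messages from Pid
    after 1000 ->
        timeout
    end.
```

pid_demo:demo() should return {true,hello}.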

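Memory consumption and limitations. 1 machine word for an identifier from the local node, 5 for an identifier from a remote node.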

1.12 Port identifier

A port identifier (Port Identifier) is returned by the function open_port/2 when a port is created. A port is the basic mechanism for communication between Erlang processes and the outside world. It provides a byte-oriented interface to external programs. The process that created the port is called the port owner or the connected process.

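Memory consumption and limitations. 1 machine word for an identifier from the local node, 5 for an identifier from a remote node.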

PS . , , , „ Erlang “ „ Getting Started With Erlang “

Source: https://habr.com/ru/post/52493/
