
The HV data storage format: an attempt to solve the problem of readable storage of text fields



Not long ago I needed to store data as text, so that not only a program could work with it, but a person could also read and edit it (or create it from scratch) in a text editor. Many good and convenient formats already exist for this, such as JSON, YAML and XML. But each of the formats I looked at had aspects that I did not much like.

I will pay particular attention to what I see as the most glaring inconvenience of most of these formats, including very powerful and popular ones: storing text. How do you write a text field that may contain any characters so that its contents are not changed and do not interfere with parsing? After all, the text may contain substrings that match control sequences, as well as unusual indentation. In XML, for example, the text must not contain a literal "<" or "&" (and ">" is usually escaped too); they have to be replaced with "&lt;", "&amp;" and "&gt;". Many other formats require text to be quoted. But what if the text already contains quotes? Use a different kind of quote? Escape them? All of this means changing the text, and it is far from certain that the result will still be convenient to read and edit in a plain text editor such as Notepad or a browser input field (textarea). YAML does not require text to be quoted, but correct indentation is essential there, which seems inconvenient for multi-line and deeply nested data. Indentation also raises the proportion of non-data characters: several service spaces at the left of every line noticeably increase the size.
Besides text, I needed to store two more basic data types, integers and fractional numbers, as well as two composite types: data structures (blocks) and arrays. That gives 5 types in total: integer, floating-point number, text, structure and array. There was no need for macros, expressions or other extensions; I just needed numbers and texts distributed over blocks and arrays. Given such simple requirements, I wanted a trivial format that could also store text fields with the above points in mind. I also wanted as few control characters as possible, so that the syntax is easier to understand and remember.

So I reinvented the wheel, though with an unusual steering design: the HV format (the original internal name was "human values"). I will use it to show a practical solution to this problem as I see it. The format turned out to be unpretentious, which is essentially what was required. As already mentioned, it supports only three simple data types (integer, floating-point number and text) and two composite types (structure and array, which may contain both simple and composite types). There are 3 main control characters and 3 additional ones; the additional ones are for special cases of text formatting and for comments. These cases relate directly to the question raised in this article (convenient storage of text fields) and are discussed with examples below.

Fields can be single-line (integer, floating-point number, single-line text) or multi-line (structure, array, multi-line text). First comes the field name, then a control character indicating whether the value occupies one line or several, and then the value itself. If the value spans several lines, a terminating line marks its end. That is the essence of the format in brief.

The features of the HV format are best shown with examples.
I will start with a general description so that the syntax becomes clear, and gradually move on to my solution to the problem posed in this article.

 a: 1
 b: 2.2
 c: abcd


Here are 3 simple data types:
a is an integer equal to 1
b is a floating point number equal to 2.2
c is a single-line text field equal to "abcd"
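
A minimal Python sketch of how such single-line fields might be split into a name and a raw value. It is only an illustration, not the actual HV handler: the function name parse_single_line is hypothetical, and the sketch assumes that the first ":" separates the name from the value and that whitespace around both is insignificant.

```python
def parse_single_line(line):
    """Split 'name: value' into (name, raw_value), both stripped.

    Assumption: only the first ':' is a separator, so the value itself
    may contain further colons.
    """
    name, sep, value = line.partition(":")
    if not sep:
        raise ValueError("not a single-line field: %r" % line)
    return name.strip(), value.strip()


# Every value is still a string here; deciding that "1" is an integer
# and "2.2" is a float is left to a later step.
fields = dict(parse_single_line(l) for l in ["a: 1", "b: 2.2", "c: abcd"])
print(fields)
```

The values are deliberately left as strings, since the format itself carries no type information.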

The next example shows a data structure and an array inside that structure:

 xxq +
   a: 12.33
   b: -15
   x +
     : ab
     : cd
     : ef
   ^
 ^


Here the structure xxq contains three fields:
a is a floating point number equal to 12.33
b is an integer equal to -15
x is an array of text fields equal to "ab", "cd" and "ef"
For array elements, the field name is not written.
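
To make the nesting rules concrete, here is a rough Python sketch of a recursive reader for structures and arrays. It is not the real HV handler. The name parse_fields is hypothetical, multi-line text fields are left out, and the sketch assumes that a block containing only unnamed entries is an array and that any line containing ":" is a single-line field.

```python
def parse_fields(lines, pos=0, nested=False):
    """Parse HV-like lines from lines[pos]; return (value, next_pos)."""
    named, items = {}, []
    while pos < len(lines):
        line = lines[pos].strip()  # indentation carries no meaning
        pos += 1
        if not line:
            continue
        if line == "^":
            if not nested:
                raise ValueError("unexpected '^' at top level")
            # Assumption: a block with only unnamed entries is an array.
            return (items if items and not named else named), pos
        if ":" in line:
            # Single-line field: split at the first ':'.
            name, _, value = line.partition(":")
            name, value = name.strip(), value.strip()
        elif line.endswith("+"):
            # Multi-line field: parse the nested block recursively.
            name = line[:-1].strip()
            value, pos = parse_fields(lines, pos, nested=True)
        else:
            raise ValueError("cannot parse line: %r" % line)
        if name:
            named[name] = value
        else:
            items.append(value)
    if nested:
        raise ValueError("missing closing '^'")
    return named, pos


doc = """
xxq +
  a: 12.33
  b: -15
  x +
    : ab
    : cd
    : ef
  ^
^
"""
data, _ = parse_fields(doc.splitlines())
print(data)
```

All values come back as strings; converting them to numbers is a separate step.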

Note right away that indentation has no meaning, so the data in the following example is identical to the data in the previous one:

 xxq +
 a: 12.33
 b: -15
 x +
 : ab
 : cd
 : ef
             ^
 ^


And here is the same data again, this time without any spaces at all:

 xxq+
 a:12.33
 b:-15
 x+
 :ab
 :cd
 :ef
 ^
 ^


So, the most important control characters are ":" (if the value takes one line) and "+" (if the value takes several lines).

And now to the point: my solution to the problem of representing multi-line text that contains arbitrary characters:

 t +
   ABCD
   EFGH <12> @@
   ijklmnopq
   "ABC" + "DEF" = "ABCDEF"
   "A ('a')" = // = "B" ('' '') \
   abcd
 ^


In this example, the text is as follows:

 ABCD
 EFGH <12> @@
 ijklmnopq
 "ABC" + "DEF" = "ABCDEF"
 "A ('a')" = // = "B" ('' '') \
 abcd


The quotation marks, slashes and other characters in the text are not replaced or escaped in any way; there is no need. The text stays exactly as it was originally and requires no additional transformations.

The text is delimited by a terminating line, which by default is the control character "^". The same line closes every multi-line field, including structures and arrays (as in the examples above). The value is read line by line, ignoring indentation, until the terminating line is encountered. The terminator must match the entire line, not a substring (indentation, as I said, is ignored and can be anything).

Two quite reasonable questions arise when writing text fields:

1) What if the source text contains a line equal to the terminating line, that is, "^"?
2) What if indentation in the text matters and cannot be ignored?

For the first case, the HV format lets you override the terminating line. You simply specify it before the field value and then use it after the value:

 eee + END
   hello
   ^
   ^
   ^
   ^
   abcd
 END


The text contained in the “eee” field is:

 hello
 ^
 ^
 ^
 ^
 abcd


An important nuance: the terminating line can be overridden only for text fields. Other multi-line values (structures and arrays) always end with the control character "^".
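
The reading rule for text fields (collect lines until one equals the terminating line, which the header may override) could be sketched in Python roughly as follows. This is an illustration only, not the actual handler. The name read_text_field is hypothetical, and the sketch assumes that the header is split at its first "+" and that field names never contain "+".

```python
def read_text_field(lines, pos=0):
    """Read a multi-line text field whose header is lines[pos].

    Returns (name, text, next_pos). Indentation is ignored, as in the
    default "+" mode.
    """
    name, sep, term = lines[pos].partition("+")
    if not sep:
        raise ValueError("not a multi-line header: %r" % lines[pos])
    term = term.strip() or "^"  # default terminator unless overridden
    body = []
    for i in range(pos + 1, len(lines)):
        stripped = lines[i].strip()
        if stripped == term:  # whole-line match, never a substring
            return name.strip(), "\n".join(body), i + 1
        body.append(stripped)
    raise ValueError("terminating line %r not found" % term)


source = ["eee + END", "  hello", "  ^", "  ^", "  abcd", "END"]
name, text, next_pos = read_text_field(source)
print(name)
print(text)
```

Because only a whole line can end the field, the "^" lines inside the body are read as ordinary text.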

For the second case (indentation matters), HV offers two options.
Option A. Preserve all whitespace to the left and right of the text in each line:

 text @
   This is an indented first line.
 And this is the usual line.
 All indents from the beginning of the line will be preserved.
    in the text.
                       Like this.
 ^


Option B. Preserve whitespace starting from the first non-whitespace character in each line; that first character itself is a marker and is not part of the text:

 text%
   -BUT
    * B
     = In where
 ^


The text will be as follows:

 BUT
  B
  In where
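
The three ways of treating whitespace in a text line can be compared side by side in a short Python sketch. The function name decode_text_line and the idea of passing the control character as a mode are my own illustration, not the handler's API. For the "%" mode the sketch assumes that a blank line simply yields an empty string.

```python
def decode_text_line(line, mode):
    """Return the text carried by one body line of a multi-line text field.

    mode is the control character from the field header:
      "+"  indentation is ignored
      "@"  the line is kept exactly as written
      "%"  leading whitespace and the first visible character (the
           marker) are dropped, everything after the marker is kept
    """
    if mode == "+":
        return line.strip()
    if mode == "@":
        return line
    if mode == "%":
        return line.lstrip()[1:]
    raise ValueError("unknown mode: %r" % mode)


marked = ["  -BUT", "   * B", "    = In where"]
for raw in marked:
    print(repr(decode_text_line(raw, "%")))
```

The marker can be any visible character, which is why "-", "*" and "=" all work in the example.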


An interesting feature: encapsulating serialized data as text.
I want to draw attention to one more feature which, although interesting, useful and almost unique, is rarely needed. It comes for free from the ability to override the terminating line and thereby leave the original text unchanged. It means that data in HV format can be inserted as a text field into other data in HV format without causing any syntax errors during parsing. This can be useful when several text processors work at different levels and know nothing about each other's formats; each simply passes the text on to the next level.
For example, suppose the first level needs to receive two arrays in HV format:

 a +
   : one
   : 2
 ^
 b +
   : 3
   : four
 ^


But they have to be passed through the second level as text:

 level_2 +
   for_level_1 + &
     a +
       : one
       : 2
     ^
     b +
       : 3
       : four
     ^    
   &
 ^


The "for_level_1" field is text; its terminating line has simply been replaced by "&".
Under the conditions of this example, the second level cannot parse the data intended for the first level: it does not know how this text should be processed. It might be HV, it might be JSON, or it might be plain text not meant for parsing at all. That is for the first level to decide.

In other words, any serialized data can be carried in an HV text field: HV itself, JSON, XML, YAML and so on. I have not found this ability to encapsulate data safely without editing the text in any of the formats I looked at. The feature is rarely needed, but it is there.
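
On the writing side, encapsulation comes down to choosing a terminating line that does not occur in the payload. A possible Python sketch is given below. The names pick_terminator and write_text_field and the "END1", "END2", ... naming scheme are assumptions made for this illustration and are not part of HV.

```python
import itertools


def pick_terminator(text, default="^"):
    """Return a terminating line that equals no line of the text."""
    used = {line.strip() for line in text.splitlines()}
    if default not in used:
        return default
    # Hypothetical scheme: try END1, END2, ... until one is free.
    for n in itertools.count(1):
        candidate = "END%d" % n
        if candidate not in used:
            return candidate


def write_text_field(name, text):
    """Serialize text as a multi-line text field without altering it."""
    term = pick_terminator(text)
    header = name + " +" if term == "^" else "%s + %s" % (name, term)
    return "\n".join([header] + text.splitlines() + [term])


inner = "a +\n: one\n: 2\n^\nb +\n: 3\n: four\n^"
print(write_text_field("for_level_1", inner))
```

Since the default mode ignores indentation on reading, a payload whose indentation matters would presumably have to be written in one of the indentation-preserving modes instead.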


So there are 3 main control characters:

: - value in one line
+ - value in multiple lines
^ - end of multiline value

And 3 additional:

@ - multi-line text with all indentation preserved
% - multi-line text with a marker character at the start of each line
# - comment

There are no mandatory brackets, quotes or explicit type annotations. Typing and all validity checks are performed by the HV handler, which knows in advance which field names may occur and what type and format their values must have. This extreme simplicity makes the format portable to almost any programming language.
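
Since types live in the handler rather than in the data, the typing step might look like the following Python sketch. The SCHEMA dictionary and the apply_schema function are hypothetical and only illustrate the idea; the real handler may describe its fields quite differently.

```python
# Hypothetical schema: which fields are allowed and how to convert them.
SCHEMA = {"a": int, "b": float, "c": str}


def apply_schema(raw_fields, schema):
    """Convert raw string values to the types named in the schema."""
    typed = {}
    for name, raw in raw_fields.items():
        if name not in schema:
            raise KeyError("unexpected field: %s" % name)
        # int() and float() raise ValueError on malformed input,
        # which doubles as the format check.
        typed[name] = schema[name](raw)
    return typed


typed = apply_schema({"a": "1", "b": "2.2", "c": "abcd"}, SCHEMA)
print(typed)
```

A value that does not fit its declared type, such as "2.2" for an integer field, is rejected at this step rather than during parsing.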

At first glance HV may seem similar to YAML: it is also minimalistic and also stores text without quotes. But since HV was created from scratch rather than derived from an existing format, it has more differences from YAML than similarities. HV does not care about indentation. The overall share of service text is smaller in HV, because YAML requires indentation and often uses sequences of 2 or more characters, for example "---", ": |-" and ": >", whereas HV always uses single characters. And I have not seen a mechanism for delimiting text with an overridable terminating line in any of the formats I looked at. It seems to me a rather convenient and readable mechanism.

All in all, the result is a concise format for storing simple data in a form that is, in my opinion, convenient for a human to read. Of course, it cannot store functions, associative arrays, macros, preprocessor directives, closures, arithmetic expressions and other cool things that many other formats can boast of. But these features are not required, because HV does the job it was designed for, as discussed above: it does not require text to be enclosed in quotes or brackets, does not require character escaping, does not require explicit type annotations, looks quite simple, supports the most basic set of types, uses few control characters, and so on.

I hope I have managed to explain the reasons for creating the HV format and its features. If anything is still unclear, I will be glad to answer reasonable questions.

For those who want to get to know the HV format better, http://vaomark.com/z23F0Cz has a more detailed description and a number of examples covering all its aspects.

There you can also download the up-to-date source code of the HV handler and a test module, written in Python 2.7. In the near future I plan to port the handler to C++, Java, PHP and other languages; everything will be available via the same link.

PS: The HV format is built on my vision of how to store text fields in serialized form so that values stay in their original, unchanged form and can be conveniently read and modified in any simple editor. Some will consider it a successful solution, others the opposite; perhaps someone will propose their own. Some may think the problem described in the article is not a problem at all and that everything is convenient enough as it is. I would like to know your opinion.

Source: https://habr.com/ru/post/271501/

