Universal Binary JSON - yet another binary JSON format

This article is a loose translation of the information published on the official website.

Introduction

JSON is a widespread and popular data exchange format. Its elegance, ease of processing, and relatively rich type system have made it a natural choice for many developers who need to store data or transfer it between systems quickly and simply.

Unfortunately, packing and unpacking native programming language structures through a textual representation suitable for network transmission carries a significant resource cost. In heavily loaded systems, skipping the JSON text-processing stage can speed up data handling and reduce the size of stored data.

In such cases, a binary JSON format becomes useful.

Why?

Attempts to speed up JSON with binary specifications such as BSON, BJSON, or Smile do exist, but they fall short for two reasons:

Internal data types. These formats use data types that exist only in the binary format and are not part of the JSON standard. This makes them unsuitable for widespread use, because each such type can be interpreted differently depending on the implementation.
Implementation complexity. Some formats achieve higher performance and others a more compact representation, but at the price of a more complex, confusing specification. That, in turn, slows down or prevents their adoption and implementation. Ease of use is what drives the success of JSON.

For example, BSON defines data types for regular expressions, blocks of JavaScript code, and other constructs that have no corresponding data type in JSON. BJSON also defines its own data types, leaving ample room for two implementations to interpret those types differently. Smile defines more complex data types, along with generation and parsing rules, in order to use space efficiently.

All existing binary JSON specifications suffer from incompatibility and implementation complexity, which undermines the main advantage that made JSON so popular: simplicity.

The simplicity of JSON made it possible to implement it in many programming languages and made the format convenient and immediately understandable to everyone who works with the data directly.

Any successful specification of binary JSON must adopt these properties in order to be really useful for the community as a whole.

Why not JSON + gzip?

Compressing JSON may look like a better solution than using a binary format. But there are two problems:

Data handling speed drops by 50%.
There is no way to inspect the data and work with it directly.

Compression reduces the size of the transmitted data by 75% on average, but it greatly increases the processing overhead.
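
The trade-off can be observed with a small Python sketch that uses only the standard json and gzip modules. The sample data is invented for illustration, and real size and speed figures depend heavily on the payload, so this does not attempt to reproduce the percentages quoted above.

```python
import gzip
import json

# Invented sample payload with repetitive keys, which compresses well.
data = [{"id": i, "name": "item", "active": True} for i in range(200)]

raw = json.dumps(data).encode("utf-8")
packed = gzip.compress(raw)
print(len(packed) < len(raw))  # True

# The compressed bytes cannot be inspected directly: they must be
# decompressed and parsed as text before any value can be read.
restored = json.loads(gzip.decompress(packed).decode("utf-8"))
print(restored[5]["id"])  # 5
```

Every read therefore pays for decompression on top of ordinary JSON parsing.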

Goals

The Universal Binary JSON specification is designed solely around full compatibility with JSON, simplicity, speed, and ease of understanding. Reading and writing the format are trivial. As a side effect, the data takes up 30% less space on average.

Full compatibility. 100% compatibility with JSON and exclusive use of data types supported by all modern programming languages. This allows data to be converted efficiently between JSON and Universal Binary JSON without requiring sophisticated data structures that a given programming language may or may not support.
Ease of use. The JSON specification is taken as the basis, and a single binary structure is used to describe all types. This keeps the format accessible and easy for developers to understand.
Speed and efficiency. The motivation for using a binary format is fast, efficient data parsing. As a side effect, space consumption also drops by 30%.

Data format

The specification uses a single binary structure to describe all supported types:

  [type, 1-byte char] ([length, 1 or 4-byte integer]) ([data])

type - 1 byte, an ASCII character. It indicates the type of the data that follows.
length (optional) - a 1-byte or 4-byte integer, depending on the length or size of the value. For an array, it is the number of elements; for an object, the number of key/value pairs. If the length or number of elements is between 0 and 254 inclusive, 1 byte is used. The value 255 in this field is reserved for objects and arrays of unknown length.
data (optional) - a sequence of bytes that directly represents the value.
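
As a rough illustration of the length rule, the following Python sketch writes the length field using 1 byte for values from 0 to 254 and 4 bytes otherwise. The function name encode_length is made up for this example, and the 4-byte form is assumed to be an unsigned big-endian integer. How a reader distinguishes the two widths is not covered in this article, so it is left out.

```python
import struct

UNKNOWN_LENGTH = 255  # reserved 1-byte value: container of unknown length


def encode_length(n: int) -> bytes:
    """Encode a length or element count as 1 or 4 bytes (sketch)."""
    if n < 0:
        raise ValueError("length must be non-negative")
    if n <= 254:
        # Short form: a single byte holds values 0..254.
        return bytes([n])
    # Long form: assumed here to be a 4-byte big-endian unsigned integer.
    return struct.pack(">I", n)


print(encode_length(3).hex())     # 03
print(encode_length(254).hex())   # fe
print(encode_length(1000).hex())  # 000003e8
```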

Whether the length and data fields are present depends on the data type. For example, a 32-bit integer has a fixed size of 4 bytes. Writing a value of this type takes 1 byte for the type marker and 4 bytes for the value itself. The length field is omitted in this case because it carries no information.
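
The 32-bit integer case can be sketched in Python as follows. The marker character I is an assumption chosen for illustration (marker letters should be checked against the specification on the official site), and encode_int32 and decode_int32 are hypothetical helper names, not part of any existing library.

```python
import struct

INT32_MARKER = b"I"  # assumed marker; verify against the specification


def encode_int32(value: int) -> bytes:
    """Return 1 type byte followed by 4 big-endian data bytes."""
    return INT32_MARKER + struct.pack(">i", value)


def decode_int32(buf: bytes) -> int:
    """Read back a value written by encode_int32."""
    if len(buf) != 5 or buf[:1] != INT32_MARKER:
        raise ValueError("not an int32 record")
    return struct.unpack(">i", buf[1:])[0]


record = encode_int32(1025)
print(len(record))       # 5
print(record[1:].hex())  # 00000401
```

The whole record is 5 bytes, with no length field between the marker and the value.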

This compact, uniform representation is what allows the format to meet the goals above.

Features

The Universal Binary JSON specification supports:

basic data types
arrays
objects
integers, arrays and objects of unknown length or size
streaming data

Important details: numeric values are written in big-endian byte order (most significant byte first), and the main encoding for text is UTF-8.
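
To make the byte-order and encoding notes concrete, here is a small Python sketch using only the standard library. The sample values are arbitrary and chosen only so that the difference is visible.

```python
import struct

value = 258  # 0x0102

big = struct.pack(">h", value)     # high byte first, as UBJSON requires
little = struct.pack("<h", value)  # low byte first, shown for contrast
print(big.hex())     # 0102
print(little.hex())  # 0201

text = "na\u00efve"  # five characters, one of them non-ASCII
utf8 = text.encode("utf-8")
# UTF-8 uses two bytes for U+00EF, so the byte count exceeds the
# character count.
print(len(text), len(utf8))  # 5 6
```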

Community

Official site of the UBJSON specification - Universal Binary JSON
Participation and discussion
Java implementation - Universal Binary JSON Java Library
Simple C# implementation - Ubjson.NET

Implementations in other programming languages will be listed at this link as they appear.

P.S. Riyad Kalla, the author of the specification and the Java implementation, and I, the author of this article and the C# implementation, would welcome any participation in the work on the specification.

Source: https://habr.com/ru/post/130112/
