
We recently published ZSON on GitHub. ZSON is a PostgreSQL extension for transparent compression of JSONB documents. Compression works by identifying the strings that occur most frequently in your documents and building a dictionary from them. These strings can be not only document keys but also values, elements of nested arrays, and so on. In some cases ZSON can cut the database size roughly in half and increase transactions per second by about 10%. Documents are stored compressed in shared buffers as well, so memory is saved too.
Interested? Read on and you will learn how to use all of this in practice.
Remarks
Before getting to the main content of the article, I would like to note a few points:
- This article does not include ZSON benchmarks; interested readers can find them here. Keep in mind that in practice everything depends heavily on your data, configuration, hardware, DBMS version, and other factors. Don't trust synthetic benchmarks; check everything yourself!
- Writing extensions for PostgreSQL is beyond the scope of this article. Interested readers can refer to a separate article on that topic and the links it contains. If anyone is curious how ZSON works internally, I could devote a separate article to that as well.
- PostgreSQL has a built-in compression algorithm, PGLZ. ZSON does not replace it but complements it. PGLZ compresses each document individually, so it cannot detect that the same strings appear across different documents. ZSON finds these strings and replaces them with 16-bit codes. The documents are then compressed with PGLZ as usual (or left uncompressed if the resulting document is small and/or compresses poorly).
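To make the idea concrete, here is a toy sketch of this two-stage scheme in Python. It is purely illustrative, not ZSON's actual on-disk format: the dictionary, the escape byte, and the use of zlib as a stand-in for PGLZ are all my own simplifications.

```python
import zlib

# Toy two-stage compression: substitute "learned" common strings with
# 2-byte codes (marker byte 0xFF + dictionary index), then apply a
# generic compressor (zlib stands in for PGLZ here).
DICT = ["created_at", "user_id", "status"]  # pretend these were learned
CODES = {s: bytes([0xFF, i]) for i, s in enumerate(DICT)}

def encode(doc: str) -> bytes:
    data = doc.encode().replace(b"\xff", b"\xff\xff")  # escape the marker byte
    for s, code in CODES.items():
        data = data.replace(s.encode(), code)
    return zlib.compress(data)

def decode(blob: bytes) -> str:
    data = zlib.decompress(blob)
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0xFF:
            nxt = data[i + 1]
            out += b"\xff" if nxt == 0xFF else DICT[nxt].encode()
            i += 2
        else:
            out.append(data[i])
            i += 1
    return out.decode()

doc = '{"user_id": 42, "status": "ok", "created_at": "2017-01-01"}'
assert decode(encode(doc)) == doc  # lossless round trip
```

The point of the first stage is that identical strings shared by many documents shrink to two bytes each, something a per-document compressor like PGLZ cannot do on its own.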
Installation
Building ZSON from sources and installing it is done as follows:
git clone https://github.com/afiskon/zson.git
cd zson
sudo make install
After installation, run the tests:
make installcheck
Enable ZSON for your database:
create extension zson;
Congratulations, ZSON is installed!
Deletion
When and if you want to remove ZSON, just turn it off:
drop extension zson;
And then completely cut from PostgreSQL:
sudo make uninstall
All types, tables, and other objects created by ZSON will be cleaned up automatically.
Using
First of all, you should train ZSON on your typical documents. Training builds a dictionary of the strings that occur most frequently in your documents, which is then used for compression. Training is performed with the following procedure:
zson_learn(
    tables_and_columns text[][],
    max_examples int default 10000,
    min_length int default 2,
    max_length int default 128,
    min_count int default 2
)
For example:
select zson_learn('{{"table1", "col1"}, {"table2", "col2"}}');
You can view the resulting dictionary as follows:
select * from zson_dict;
Now you can use ZSON as a transparent replacement for the JSONB type:
zson_test=# create table zson_example(x zson);
CREATE TABLE
zson_test=# insert into zson_example values ('{"aaa": 123}');
INSERT 0 1
zson_test=# select x -> 'aaa' from zson_example;
-[ RECORD 1 ]-
?column? | 123
All operators and procedures will work with ZSON in the same way as with JSONB.
Migration to the new dictionary
As the layout of your documents changes, compression may become less effective because some strings disappear and new ones appear. In that case you can retrain ZSON on fresh data:
select zson_learn('{{"table1", "col1"}, {"table2", "col2"}}');
This creates a new version of the dictionary. New and updated documents will be compressed with it, while old documents are decompressed using the dictionary version they were compressed with. Note that dictionaries are cached in memory, so ZSON picks up a new dictionary not immediately but roughly a minute after it is created.
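If you want existing rows to be recompressed with the new dictionary, one common trick is a no-op UPDATE, which rewrites each row and therefore compresses it with the current dictionary. This is a sketch; "table1" and "col1" are placeholders for your own table and column:

```sql
-- Rewrites every row of table1, recompressing col1 with the
-- current dictionary. Placeholders: substitute your own names.
update table1 set col1 = col1;
```

Note that this rewrites the whole table, so on a large table it should be done during a maintenance window or in batches.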
You can determine which version of the dictionary a particular document was compressed with using the zson_info procedure:
zson_test=# select zson_info(x) from test_compress where id = 1;
-[ RECORD 1 ]
If you are absolutely sure that all the documents in your database are compressed with the new version of the dictionary, you can safely delete the old one:
delete from zson_dict where dict_id = 0;
In practice, however, there is little point: you would save only a couple of kilobytes of disk space, and the paranoid part of me believes that risking data loss through a moment's carelessness is not worth it.
How do you know when to update the dictionary?
Unfortunately, it is difficult to recommend a universal approach. For example, you can periodically check the average document size in your database:
select pg_table_size('tt') / (select count(*) from tt)
If it starts to grow, it is time to update the dictionary. You can also simply retrain ZSON periodically, for example once a year.
Finally, application developers themselves know when they substantially change the layout of their documents, so ZSON retraining can be included in migration scripts or in the application's upgrade instructions.
Other approaches are possible as well; in the end, it all depends on your situation.
Conclusion
As you can see, thanks to ZSON you get the best of both worlds - the flexibility of schemaless data combined with the compactness of the relational model. At the same time, nothing needs to be done on the side of the application; everything happens transparently, on the side of the DBMS itself.
If you have questions, I will be happy to answer them in the comments. Any feedback and pull requests are most welcome.