📜 ⬆️ ⬇️

Advanced work with JSON in MySQL

MySQL does not have the ability to directly index JSON documents, but there is an alternative: generated columns.


Since the introduction of JSON data type support in MySQL 5.7.8, one thing is missing: the ability to index JSON values. To circumvent this limitation, you can use the generated columns. This feature, introduced in MySQL 5.7.5, allows developers to create columns containing information derived from other columns, predefined expressions or calculations. By generating a column from JSON values ​​and then indexing it, you can practically index the field with JSON.


The JSON data set used in this article can be downloaded from Github . It contains a list of players with the following elements: the player ID, his name, and the games he played (Battlefield, Crazy Tennis and Puzzler).


{ "id":1, "name":"Sally", "games_played":{ "Battlefield":{ "weapon":"sniper rifle", "rank":"Sergeant V", "level":20 }, "Crazy Tennis":{ "won":4, "lost":1 }, "Puzzler":{ "time":7 } } }, … 

The Battlefield field contains the player’s favorite weapon, its current rank and the level of that rank. Crazy Tennis includes the number of games won and lost, and Puzzler contains the time taken by the player to complete the game. Create the initial table:


 CREATE TABLE `players` ( `id` INT UNSIGNED NOT NULL, `player_and_games` JSON NOT NULL, PRIMARY KEY (`id`) ); 

This query creates a table of players consisting of an identifier and JSON data, and also sets the primary key in the id field.


You need to build an index on the field with JSON. Let's see what needs to be added to the CREATE TABLE command.


Column generation


The following syntax is used to create generated columns in the CREATE TABLE statement:


 `column_name` datatype GENERATED ALWAYS AS (expression) 

The key words here are GENERATED ALWAYS and AS . The phrase GENERATED ALWAYS optional. It is necessary only if you want to explicitly indicate that this column of the table is generated. It is necessary that the word AS accompanied by an expression that returns the value for the generated column.


Let's start with this:


 `names_virtual` VARCHAR(20) GENERATED ALWAYS AS ... 

Create a column with the name names_virtual up to 20 characters in length, which will store the value of the "name" field from the JSON object. We refer to the “name” field in JSON using the MySQL operator ->> , which is equivalent to writing JSON_UNQUOTE (JSON_EXTRACT (...)) . This construct will return the value of the "name" field from the JSON object as a result.


 `names_virtual` VARCHAR(20) GENERATED ALWAYS AS (`player_and_games` ->> '$.name') 

This code means that we take the field with JSON player_and_games and retrieve the value from JSON by the key “name” - a child of the root.


As with most column definitions, there are a number of limitations and parameters that can be applied to a column.


 [VIRTUAL|STORED] [UNIQUE [KEY]] [[NOT] NULL] [[PRIMARY] KEY] 

The keywords VIRTUAL and STORED that are unique to the generated columns indicate whether values ​​will be stored in the table.


The keyword VIRTUAL used by default. It means that the column values ​​are not saved and do not take up storage space. They are calculated each time a line is read. If you create an index with a virtual column, the value is still stored - in the index.


The keyword STORED indicates that values ​​are calculated when writing data to a table: when inserting or updating. In this case, the index does not need to save the value.


Other parameters are optional constraints that ensure that the field values ​​are NULL or NOT NULL , as well as adding restrictions to the index, for example, UNIQUE or PRIMARY KEY . To ensure that a value exists, use NOT NULL when creating a column, but the restrictions depend on the use case. In the example, NOT NULL will be used, since the players must have a name.


Query creating table:


 CREATE TABLE `players` ( `id` INT UNSIGNED NOT NULL, `player_and_games` JSON NOT NULL, `names_virtual` VARCHAR(20) GENERATED ALWAYS AS (`player_and_games` ->> '$.name') NOT NULL, PRIMARY KEY (`id`) ); 

Filling the table with test data:


 INSERT INTO `players` (`id`, `player_and_games`) VALUES (1, '{ "id": 1, "name": "Sally", "games_played":{ "Battlefield": { "weapon": "sniper rifle", "rank": "Sergeant V", "level": 20 }, "Crazy Tennis": { "won": 4, "lost": 1 }, "Puzzler": { "time": 7 } } }' ); ... 

The contents of the players table on Gist or ...


With go formatting
 SELECT * FROM `players`; +----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+ | id | player_and_games | names_virtual | +----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+ | 1 | {"id": 1, "name": "Sally", "games_played": {"Puzzler": {"time": 7}, "Battlefield": {"rank": "Sergeant V", "level": 20, "weapon": "sniper rifle"}, "Crazy Tennis": {"won": 4, "lost": 1}}} | Sally | | 2 | {"id": 2, "name": "Thom", "games_played": {"Puzzler": {"time": 25}, "Battlefield": {"rank": "Major General VIII", "level": 127, "weapon": "carbine"}, "Crazy Tennis": {"won": 10, "lost": 30}}} | Thom | | 3 | {"id": 3, "name": "Ali", "games_played": {"Puzzler": {"time": 12}, "Battlefield": {"rank": "First Sergeant II", "level": 37, "weapon": "machine gun"}, "Crazy Tennis": {"won": 30, "lost": 21}}} | Ali | | 4 | {"id": 4, "name": "Alfred", "games_played": {"Puzzler": {"time": 10}, "Battlefield": {"rank": "Chief Warrant Officer Five III", "level": 73, "weapon": "pistol"}, "Crazy Tennis": {"won": 47, "lost": 2}}} | Alfred | | 5 | {"id": 5, "name": "Phil", "games_played": {"Puzzler": {"time": 7}, "Battlefield": {"rank": "Lt. Colonel III", "level": 98, "weapon": "assault rifle"}, "Crazy Tennis": {"won": 130, "lost": 75}}} | Phil | | 6 | {"id": 6, "name": "Henry", "games_played": {"Puzzler": {"time": 17}, "Battlefield": {"rank": "Captain II", "level": 87, "weapon": "assault rifle"}, "Crazy Tennis": {"won": 68, "lost": 149}}} | Henry | +----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+ 

The table includes the names_virtual column in which all player names are inserted. Players table structure:


 SHOW COLUMNS FROM `players`; +------------------+------------------+------+-----+---------+-------------------+ | Field | Type | Null | Key | Default | Extra | +------------------+------------------+------+-----+---------+-------------------+ | id | int(10) unsigned | NO | PRI | NULL | | | player_and_games | json | NO | | NULL | | | names_virtual | varchar(20) | NO | | NULL | VIRTUAL GENERATED | +------------------+------------------+------+-----+---------+-------------------+ 

Since we did not specify whether the generated column is VIRTUAL or STORED , by default MySQL automatically made the column VIRTUAL . To check whether the columns are VIRTUAL or STORED , simply run the above query SHOW COLUMNS , and it will show either VIRTUAL GENERATED or STORED GENERATED .


Now that we have set up the table and the virtual column, add four more columns using the ALTER TABLE and ADD COLUMN operations. They will contain Battlefield levels, won and lost tennis games and time in Puzzler.


 ALTER TABLE `players` ADD COLUMN `battlefield_level_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played.Battlefield.level') NOT NULL AFTER `names_virtual`; ALTER TABLE `players` ADD COLUMN `tennis_won_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played."Crazy Tennis".won') NOT NULL AFTER `battlefield_level_virtual`; ALTER TABLE `players` ADD COLUMN `tennis_lost_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played."Crazy Tennis".lost') NOT NULL AFTER `tennis_won_virtual`; ALTER TABLE `players` ADD COLUMN `times_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played.Puzzler.time') NOT NULL AFTER `tennis_lost_virtual`; 

Again, by running the query SHOW COLUMNS FROM players; , we see that next to them all columns are listed as VIRTUAL GENERATED . This means that we have successfully configured the new generated VIRTUAL columns.


Gista code or ...


With go formatting
 +---------------------------+------------------+------+-----+---------+-------------------+ | Field | Type | Null | Key | Default | Extra | +---------------------------+------------------+------+-----+---------+-------------------+ | id | int(10) unsigned | NO | PRI | NULL | | | player_and_games | json | NO | | NULL | | | names_virtual | varchar(20) | NO | | NULL | VIRTUAL GENERATED | | battlefield_level_virtual | int(11) | NO | | NULL | VIRTUAL GENERATED | | tennis_won_virtual | int(11) | NO | | NULL | VIRTUAL GENERATED | | tennis_lost_virtual | int(11) | NO | | NULL | VIRTUAL GENERATED | | times_virtual | int(11) | NO | | NULL | VIRTUAL GENERATED | +---------------------------+------------------+------+-----+---------+-------------------+ 

Running a SELECT query shows us all the values ​​from VIRTUAL COLUMNS , which should look like this:


Gista code or ...


With go formatting
 SELECT `names_virtual`, `battlefield_level_virtual`, `tennis_won_virtual`, `tennis_lost_virtual`, `times_virtual` FROM `players`; +---------------+---------------------------+--------------------+---------------------+---------------+ | names_virtual | battlefield_level_virtual | tennis_won_virtual | tennis_lost_virtual | times_virtual | +---------------+---------------------------+--------------------+---------------------+---------------+ | Sally | 20 | 4 | 1 | 7 | | Thom | 127 | 10 | 30 | 25 | | Ali | 37 | 30 | 21 | 12 | | Alfred | 73 | 47 | 2 | 10 | | Phil | 98 | 130 | 75 | 7 | | Henry | 87 | 68 | 149 | 17 | +---------------+---------------------------+--------------------+---------------------+---------------+ 

After adding data and creating generated columns, we can create an index for each of them to optimize the search ...


Indexing Generated Columns


When secondary indexes are set to the values ​​of the generated VIRTUAL columns, the values ​​are stored in the index. This gives advantages: the size of the table does not increase, it becomes possible to use indices in MySQL.


Let's make a simple query on the generated column to see what it looks like before indexing it. Having studied the details of the request when choosing the names_virtual and the name "Sally", we get the following:


 EXPLAIN SELECT * FROM `players` WHERE `names_virtual` = "Sally"\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: players partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 6 filtered: 16.67 Extra: Using where 

For this query, MySQL searches each line to find "Sally." However, you can get a completely different result by adding an index to the column:


 CREATE INDEX `names_idx` ON `players`(`names_virtual`); 

Now, performing the same query, we get:


 EXPLAIN SELECT * FROM `players` WHERE `names_virtual` = "Sally"\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: players partitions: NULL type: ref possible_keys: names_idx key: names_idx key_len: 22 ref: const rows: 1 filtered: 100.00 Extra: NULL 

As you can see, the index in the column sped up the query, looking only at one line instead of six, using the names_idx index. Let's create indexes for the remaining virtual columns, following the same syntax as names_idx :


 CREATE INDEX `times_idx` ON `players`(`times_virtual`); CREATE INDEX `won_idx` ON `players`(`tennis_won_virtual`); CREATE INDEX `lost_idx` ON `players`(`tennis_lost_virtual`); CREATE INDEX `level_idx` ON `players`(`battlefield_level_virtual`); 

You can check whether all our columns have been indexed by running:
Gista code or ...


With go formatting
 SHOW INDEX ON `players`; +---------+------------+-----------+--------------+---------------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | +---------+------------+-----------+--------------+---------------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ | players | 0 | PRIMARY | 1 | id | A | 6 | NULL | NULL | | BTREE | | | | players | 1 | names_idx | 1 | names_virtual | A | 6 | NULL | NULL | | BTREE | | | | players | 1 | times_idx | 1 | times_virtual | A | 5 | NULL | NULL | | BTREE | | | | players | 1 | won_idx | 1 | tennis_won_virtual | A | 6 | NULL | NULL | | BTREE | | | | players | 1 | lost_idx | 1 | tennis_lost_virtual | A | 6 | NULL | NULL | | BTREE | | | | players | 1 | level_idx | 1 | battlefield_level_virtual | A | 6 | NULL | NULL | | BTREE | | | +---------+------------+-----------+--------------+---------------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ 

Now that several indexes have been created in the generated columns, let's complicate the search. In this example, identifiers, names, won tennis games, Battlefield level and Puzzler time are selected for players who have a level above 50, as well as 50 winning tennis games. All results will be sorted in ascending order according to the time in Puzzler. The SQL command and results will look like this:


 SELECT `id`, `names_virtual`, `tennis_won_virtual`, `battlefield_level_virtual`, `times_virtual` FROM `players` WHERE (`battlefield_level_virtual` > 50 AND `tennis_won_virtual` > 50) ORDER BY `times_virtual` ASC; +----+---------------+--------------------+---------------------------+---------------+ | id | names_virtual | tennis_won_virtual | battlefield_level_virtual | times_virtual | +----+---------------+--------------------+---------------------------+---------------+ | 5 | Phil | 130 | 98 | 7 | | 6 | Henry | 68 | 87 | 17 | +----+---------------+--------------------+---------------------------+---------------+ 

Let's see how MySQL executed this query:


 *************************** 1. row *************************** id: 1 select_type: SIMPLE table: players partitions: NULL type: range possible_keys: won_idx,level_idx key: won_idx key_len: 4 ref: NULL rows: 2 filtered: 66.67 Extra: Using where; Using filesort 

When using the win_idx and level_idx indexes, MySQL had to refer to two columns to return the desired result. If a query has to perform a full table scan with a million entries, it will take a very long time. However, using the generated columns and their indexing, MySQL showed a very fast result and a convenient way to search for elements in JSON data.


Nevertheless, one question remains: why do we need STORED generated columns? How to use them and how they work?


Storing values ​​in generated columns


Using the STORED when setting up the generated column is usually not preferable, since basically the values ​​in the table are saved twice: the field with JSON and in the STORED column. However, there are three scenarios when you need to use the STORED column in MySQL:


  1. indexing primary keys
  2. full text index / R-tree index,
  3. column that is often selected.

The syntax for adding a generated STORED column is the same as creating generated VIRTUAL columns, with the exception that you need to add the STORED keyword:


 `id` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.id') STORED NOT NULL, 

To see how to use STORED , create another table. She will take the id from the JSON data and store it in the STORED column. Set the PRIMARY KEY for the id column:


 CREATE TABLE `players_two` ( `id` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.id') STORED NOT NULL, `player_and_games` JSON NOT NULL, `names_virtual` VARCHAR(20) GENERATED ALWAYS AS (`player_and_games` ->> '$.name') NOT NULL, `times_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played.Puzzler.time') NOT NULL, `tennis_won_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played."Crazy Tennis".won') NOT NULL, `tennis_lost_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played."Crazy Tennis".lost') NOT NULL, `battlefield_level_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played.Battlefield.level') NOT NULL, PRIMARY KEY (`id`), INDEX `times_index` (`times_virtual`), INDEX `names_index` (`names_virtual`), INDEX `won_index` (`tennis_won_virtual`), INDEX `lost_index` (`tennis_lost_virtual`), INDEX `level_index` (`battlefield_level_virtual`) ); 

Add the same data set to player_two , except that we delete the id that was previously added to the INSERT operation:


 INSERT INTO `players_two` (`player_and_games`) VALUES ('{ "id": 1, "name": "Sally", "games_played":{ ... ); 

After the data has been inserted into the table, run SHOW COLUMNS in the new table to find out how MySQL created the columns. Notice that the id field is now STORED GENERATED and contains a PRIMARY KEY index.


 SHOW COLUMNS FROM `players_two`; +---------------------------+-------------+------+-----+---------+-------------------+ | Field | Type | Null | Key | Default | Extra | +---------------------------+-------------+------+-----+---------+-------------------+ | id | int(11) | NO | PRI | NULL | STORED GENERATED | | player_and_games | json | NO | | NULL | | | names_virtual | varchar(20) | NO | MUL | NULL | VIRTUAL GENERATED | | times_virtual | int(11) | NO | MUL | NULL | VIRTUAL GENERATED | | tennis_won_virtual | int(11) | NO | MUL | NULL | VIRTUAL GENERATED | | tennis_lost_virtual | int(11) | NO | MUL | NULL | VIRTUAL GENERATED | | battlefield_level_virtual | int(11) | NO | MUL | NULL | VIRTUAL GENERATED | +---------------------------+-------------+------+-----+---------+-------------------+ 

Note on using PRIMARY KEY with generated columns: MySQL will not allow you to create primary keys for generated VIRTUAL columns. In fact, if you do not specify STORED in the id field, MySQL gives the following error:


 ERROR 3106 (HY000): 'Defining a virtual generated column as primary key' is not supported for generated columns. 

At the same time, if you do not set the primary key index and try to insert data, MySQL gives an error message:


 ERROR 3098 (HY000): The table does not comply with the requirements by an external plugin. 

This means that the table does not have a primary key. Therefore, you need to return and re-create the table, or delete the id column and add the generated primary-key column to STORED , for example:


 ALTER TABLE `players_two` ADD COLUMN `id` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.id') STORED PRIMARY KEY; 

Conclusion


The article shows how to effectively store JSON data in MySQL, as well as how to create indexes thanks to the generated columns. Using generated columns will allow indexes to be placed on specific JSON data elements. It is this flexibility that makes MySQL very attractive for using JSON.


')

Source: https://habr.com/ru/post/348854/


All Articles