MySQL does not have the ability to directly index JSON documents, but there is an alternative: generated columns.
Since the introduction of JSON data type support in MySQL 5.7.8, one thing is missing: the ability to index JSON values. To circumvent this limitation, you can use the generated columns. This feature, introduced in MySQL 5.7.5, allows developers to create columns containing information derived from other columns, predefined expressions or calculations. By generating a column from JSON values and then indexing it, you can practically index the field with JSON.
The JSON data set used in this article can be downloaded from Github . It contains a list of players with the following elements: the player ID, his name, and the games he played (Battlefield, Crazy Tennis and Puzzler).
{ "id":1, "name":"Sally", "games_played":{ "Battlefield":{ "weapon":"sniper rifle", "rank":"Sergeant V", "level":20 }, "Crazy Tennis":{ "won":4, "lost":1 }, "Puzzler":{ "time":7 } } }, …
The Battlefield field contains the player’s favorite weapon, its current rank and the level of that rank. Crazy Tennis includes the number of games won and lost, and Puzzler contains the time taken by the player to complete the game. Create the initial table:
CREATE TABLE `players` ( `id` INT UNSIGNED NOT NULL, `player_and_games` JSON NOT NULL, PRIMARY KEY (`id`) );
This query creates a table of players
consisting of an identifier and JSON data, and also sets the primary key in the id
field.
You need to build an index on the field with JSON. Let's see what needs to be added to the CREATE TABLE
command.
The following syntax is used to create generated columns in the CREATE TABLE
statement:
`column_name` datatype GENERATED ALWAYS AS (expression)
The key words here are GENERATED ALWAYS
and AS
. The phrase GENERATED ALWAYS
optional. It is necessary only if you want to explicitly indicate that this column of the table is generated. It is necessary that the word AS
accompanied by an expression that returns the value for the generated column.
Let's start with this:
`names_virtual` VARCHAR(20) GENERATED ALWAYS AS ...
Create a column with the name names_virtual
up to 20 characters in length, which will store the value of the "name" field from the JSON object. We refer to the “name” field in JSON using the MySQL operator ->>
, which is equivalent to writing JSON_UNQUOTE (JSON_EXTRACT (...))
. This construct will return the value of the "name" field from the JSON object as a result.
`names_virtual` VARCHAR(20) GENERATED ALWAYS AS (`player_and_games` ->> '$.name')
This code means that we take the field with JSON player_and_games
and retrieve the value from JSON by the key “name” - a child of the root.
As with most column definitions, there are a number of limitations and parameters that can be applied to a column.
[VIRTUAL|STORED] [UNIQUE [KEY]] [[NOT] NULL] [[PRIMARY] KEY]
The keywords VIRTUAL
and STORED
that are unique to the generated columns indicate whether values will be stored in the table.
The keyword VIRTUAL
used by default. It means that the column values are not saved and do not take up storage space. They are calculated each time a line is read. If you create an index with a virtual column, the value is still stored - in the index.
The keyword STORED
indicates that values are calculated when writing data to a table: when inserting or updating. In this case, the index does not need to save the value.
Other parameters are optional constraints that ensure that the field values are NULL
or NOT NULL
, as well as adding restrictions to the index, for example, UNIQUE
or PRIMARY KEY
. To ensure that a value exists, use NOT NULL
when creating a column, but the restrictions depend on the use case. In the example, NOT NULL
will be used, since the players must have a name.
Query creating table:
CREATE TABLE `players` ( `id` INT UNSIGNED NOT NULL, `player_and_games` JSON NOT NULL, `names_virtual` VARCHAR(20) GENERATED ALWAYS AS (`player_and_games` ->> '$.name') NOT NULL, PRIMARY KEY (`id`) );
Filling the table with test data:
INSERT INTO `players` (`id`, `player_and_games`) VALUES (1, '{ "id": 1, "name": "Sally", "games_played":{ "Battlefield": { "weapon": "sniper rifle", "rank": "Sergeant V", "level": 20 }, "Crazy Tennis": { "won": 4, "lost": 1 }, "Puzzler": { "time": 7 } } }' ); ...
The contents of the players
table on Gist or ...
SELECT * FROM `players`; +----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+ | id | player_and_games | names_virtual | +----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+ | 1 | {"id": 1, "name": "Sally", "games_played": {"Puzzler": {"time": 7}, "Battlefield": {"rank": "Sergeant V", "level": 20, "weapon": "sniper rifle"}, "Crazy Tennis": {"won": 4, "lost": 1}}} | Sally | | 2 | {"id": 2, "name": "Thom", "games_played": {"Puzzler": {"time": 25}, "Battlefield": {"rank": "Major General VIII", "level": 127, "weapon": "carbine"}, "Crazy Tennis": {"won": 10, "lost": 30}}} | Thom | | 3 | {"id": 3, "name": "Ali", "games_played": {"Puzzler": {"time": 12}, "Battlefield": {"rank": "First Sergeant II", "level": 37, "weapon": "machine gun"}, "Crazy Tennis": {"won": 30, "lost": 21}}} | Ali | | 4 | {"id": 4, "name": "Alfred", "games_played": {"Puzzler": {"time": 10}, "Battlefield": {"rank": "Chief Warrant Officer Five III", "level": 73, "weapon": "pistol"}, "Crazy Tennis": {"won": 47, "lost": 2}}} | Alfred | | 5 | {"id": 5, "name": "Phil", "games_played": {"Puzzler": {"time": 7}, "Battlefield": {"rank": "Lt. Colonel III", "level": 98, "weapon": "assault rifle"}, "Crazy Tennis": {"won": 130, "lost": 75}}} | Phil | | 6 | {"id": 6, "name": "Henry", "games_played": {"Puzzler": {"time": 17}, "Battlefield": {"rank": "Captain II", "level": 87, "weapon": "assault rifle"}, "Crazy Tennis": {"won": 68, "lost": 149}}} | Henry | +----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+
The table includes the names_virtual
column in which all player names are inserted. Players table structure:
SHOW COLUMNS FROM `players`; +------------------+------------------+------+-----+---------+-------------------+ | Field | Type | Null | Key | Default | Extra | +------------------+------------------+------+-----+---------+-------------------+ | id | int(10) unsigned | NO | PRI | NULL | | | player_and_games | json | NO | | NULL | | | names_virtual | varchar(20) | NO | | NULL | VIRTUAL GENERATED | +------------------+------------------+------+-----+---------+-------------------+
Since we did not specify whether the generated column is VIRTUAL
or STORED
, by default MySQL automatically made the column VIRTUAL
. To check whether the columns are VIRTUAL
or STORED
, simply run the above query SHOW COLUMNS
, and it will show either VIRTUAL GENERATED
or STORED GENERATED
.
Now that we have set up the table and the virtual column, add four more columns using the ALTER TABLE
and ADD COLUMN
operations. They will contain Battlefield levels, won and lost tennis games and time in Puzzler.
ALTER TABLE `players` ADD COLUMN `battlefield_level_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played.Battlefield.level') NOT NULL AFTER `names_virtual`; ALTER TABLE `players` ADD COLUMN `tennis_won_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played."Crazy Tennis".won') NOT NULL AFTER `battlefield_level_virtual`; ALTER TABLE `players` ADD COLUMN `tennis_lost_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played."Crazy Tennis".lost') NOT NULL AFTER `tennis_won_virtual`; ALTER TABLE `players` ADD COLUMN `times_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played.Puzzler.time') NOT NULL AFTER `tennis_lost_virtual`;
Again, by running the query SHOW COLUMNS FROM players;
, we see that next to them all columns are listed as VIRTUAL GENERATED
. This means that we have successfully configured the new generated VIRTUAL
columns.
Gista code or ...
+---------------------------+------------------+------+-----+---------+-------------------+ | Field | Type | Null | Key | Default | Extra | +---------------------------+------------------+------+-----+---------+-------------------+ | id | int(10) unsigned | NO | PRI | NULL | | | player_and_games | json | NO | | NULL | | | names_virtual | varchar(20) | NO | | NULL | VIRTUAL GENERATED | | battlefield_level_virtual | int(11) | NO | | NULL | VIRTUAL GENERATED | | tennis_won_virtual | int(11) | NO | | NULL | VIRTUAL GENERATED | | tennis_lost_virtual | int(11) | NO | | NULL | VIRTUAL GENERATED | | times_virtual | int(11) | NO | | NULL | VIRTUAL GENERATED | +---------------------------+------------------+------+-----+---------+-------------------+
Running a SELECT
query shows us all the values from VIRTUAL COLUMNS
, which should look like this:
Gista code or ...
SELECT `names_virtual`, `battlefield_level_virtual`, `tennis_won_virtual`, `tennis_lost_virtual`, `times_virtual` FROM `players`; +---------------+---------------------------+--------------------+---------------------+---------------+ | names_virtual | battlefield_level_virtual | tennis_won_virtual | tennis_lost_virtual | times_virtual | +---------------+---------------------------+--------------------+---------------------+---------------+ | Sally | 20 | 4 | 1 | 7 | | Thom | 127 | 10 | 30 | 25 | | Ali | 37 | 30 | 21 | 12 | | Alfred | 73 | 47 | 2 | 10 | | Phil | 98 | 130 | 75 | 7 | | Henry | 87 | 68 | 149 | 17 | +---------------+---------------------------+--------------------+---------------------+---------------+
After adding data and creating generated columns, we can create an index for each of them to optimize the search ...
When secondary indexes are set to the values of the generated VIRTUAL
columns, the values are stored in the index. This gives advantages: the size of the table does not increase, it becomes possible to use indices in MySQL.
Let's make a simple query on the generated column to see what it looks like before indexing it. Having studied the details of the request when choosing the names_virtual
and the name "Sally", we get the following:
EXPLAIN SELECT * FROM `players` WHERE `names_virtual` = "Sally"\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: players partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 6 filtered: 16.67 Extra: Using where
For this query, MySQL searches each line to find "Sally." However, you can get a completely different result by adding an index to the column:
CREATE INDEX `names_idx` ON `players`(`names_virtual`);
Now, performing the same query, we get:
EXPLAIN SELECT * FROM `players` WHERE `names_virtual` = "Sally"\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: players partitions: NULL type: ref possible_keys: names_idx key: names_idx key_len: 22 ref: const rows: 1 filtered: 100.00 Extra: NULL
As you can see, the index in the column sped up the query, looking only at one line instead of six, using the names_idx
index. Let's create indexes for the remaining virtual columns, following the same syntax as names_idx
:
CREATE INDEX `times_idx` ON `players`(`times_virtual`); CREATE INDEX `won_idx` ON `players`(`tennis_won_virtual`); CREATE INDEX `lost_idx` ON `players`(`tennis_lost_virtual`); CREATE INDEX `level_idx` ON `players`(`battlefield_level_virtual`);
You can check whether all our columns have been indexed by running:
Gista code or ...
SHOW INDEX ON `players`; +---------+------------+-----------+--------------+---------------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | +---------+------------+-----------+--------------+---------------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ | players | 0 | PRIMARY | 1 | id | A | 6 | NULL | NULL | | BTREE | | | | players | 1 | names_idx | 1 | names_virtual | A | 6 | NULL | NULL | | BTREE | | | | players | 1 | times_idx | 1 | times_virtual | A | 5 | NULL | NULL | | BTREE | | | | players | 1 | won_idx | 1 | tennis_won_virtual | A | 6 | NULL | NULL | | BTREE | | | | players | 1 | lost_idx | 1 | tennis_lost_virtual | A | 6 | NULL | NULL | | BTREE | | | | players | 1 | level_idx | 1 | battlefield_level_virtual | A | 6 | NULL | NULL | | BTREE | | | +---------+------------+-----------+--------------+---------------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
Now that several indexes have been created in the generated columns, let's complicate the search. In this example, identifiers, names, won tennis games, Battlefield level and Puzzler time are selected for players who have a level above 50, as well as 50 winning tennis games. All results will be sorted in ascending order according to the time in Puzzler. The SQL command and results will look like this:
SELECT `id`, `names_virtual`, `tennis_won_virtual`, `battlefield_level_virtual`, `times_virtual` FROM `players` WHERE (`battlefield_level_virtual` > 50 AND `tennis_won_virtual` > 50) ORDER BY `times_virtual` ASC; +----+---------------+--------------------+---------------------------+---------------+ | id | names_virtual | tennis_won_virtual | battlefield_level_virtual | times_virtual | +----+---------------+--------------------+---------------------------+---------------+ | 5 | Phil | 130 | 98 | 7 | | 6 | Henry | 68 | 87 | 17 | +----+---------------+--------------------+---------------------------+---------------+
Let's see how MySQL executed this query:
*************************** 1. row *************************** id: 1 select_type: SIMPLE table: players partitions: NULL type: range possible_keys: won_idx,level_idx key: won_idx key_len: 4 ref: NULL rows: 2 filtered: 66.67 Extra: Using where; Using filesort
When using the win_idx
and level_idx
indexes, MySQL had to refer to two columns to return the desired result. If a query has to perform a full table scan with a million entries, it will take a very long time. However, using the generated columns and their indexing, MySQL showed a very fast result and a convenient way to search for elements in JSON data.
Nevertheless, one question remains: why do we need STORED
generated columns? How to use them and how they work?
Using the STORED
when setting up the generated column is usually not preferable, since basically the values in the table are saved twice: the field with JSON and in the STORED
column. However, there are three scenarios when you need to use the STORED
column in MySQL:
The syntax for adding a generated STORED
column is the same as creating generated VIRTUAL
columns, with the exception that you need to add the STORED
keyword:
`id` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.id') STORED NOT NULL,
To see how to use STORED
, create another table. She will take the id
from the JSON data and store it in the STORED
column. Set the PRIMARY KEY
for the id
column:
CREATE TABLE `players_two` ( `id` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.id') STORED NOT NULL, `player_and_games` JSON NOT NULL, `names_virtual` VARCHAR(20) GENERATED ALWAYS AS (`player_and_games` ->> '$.name') NOT NULL, `times_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played.Puzzler.time') NOT NULL, `tennis_won_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played."Crazy Tennis".won') NOT NULL, `tennis_lost_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played."Crazy Tennis".lost') NOT NULL, `battlefield_level_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played.Battlefield.level') NOT NULL, PRIMARY KEY (`id`), INDEX `times_index` (`times_virtual`), INDEX `names_index` (`names_virtual`), INDEX `won_index` (`tennis_won_virtual`), INDEX `lost_index` (`tennis_lost_virtual`), INDEX `level_index` (`battlefield_level_virtual`) );
Add the same data set to player_two
, except that we delete the id
that was previously added to the INSERT
operation:
INSERT INTO `players_two` (`player_and_games`) VALUES ('{ "id": 1, "name": "Sally", "games_played":{ ... );
After the data has been inserted into the table, run SHOW COLUMNS
in the new table to find out how MySQL created the columns. Notice that the id
field is now STORED GENERATED
and contains a PRIMARY KEY
index.
SHOW COLUMNS FROM `players_two`; +---------------------------+-------------+------+-----+---------+-------------------+ | Field | Type | Null | Key | Default | Extra | +---------------------------+-------------+------+-----+---------+-------------------+ | id | int(11) | NO | PRI | NULL | STORED GENERATED | | player_and_games | json | NO | | NULL | | | names_virtual | varchar(20) | NO | MUL | NULL | VIRTUAL GENERATED | | times_virtual | int(11) | NO | MUL | NULL | VIRTUAL GENERATED | | tennis_won_virtual | int(11) | NO | MUL | NULL | VIRTUAL GENERATED | | tennis_lost_virtual | int(11) | NO | MUL | NULL | VIRTUAL GENERATED | | battlefield_level_virtual | int(11) | NO | MUL | NULL | VIRTUAL GENERATED | +---------------------------+-------------+------+-----+---------+-------------------+
Note on using PRIMARY KEY
with generated columns: MySQL will not allow you to create primary keys for generated VIRTUAL
columns. In fact, if you do not specify STORED
in the id
field, MySQL gives the following error:
ERROR 3106 (HY000): 'Defining a virtual generated column as primary key' is not supported for generated columns.
At the same time, if you do not set the primary key index and try to insert data, MySQL gives an error message:
ERROR 3098 (HY000): The table does not comply with the requirements by an external plugin.
This means that the table does not have a primary key. Therefore, you need to return and re-create the table, or delete the id
column and add the generated primary-key column to STORED
, for example:
ALTER TABLE `players_two` ADD COLUMN `id` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.id') STORED PRIMARY KEY;
The article shows how to effectively store JSON data in MySQL, as well as how to create indexes thanks to the generated columns. Using generated columns will allow indexes to be placed on specific JSON data elements. It is this flexibility that makes MySQL very attractive for using JSON.
Source: https://habr.com/ru/post/348854/
All Articles