Introduction
At a recent interview I was asked about the shortcomings of the EAV (Entity–Attribute–Value) data model. I could not come up with much to say: in my opinion it is an ideal way to store arbitrary data. After a moment's reflection I answered that the only problem is the impossibility of building indexes for queries.
The question nagged me for a few days after the interview. I reached some conclusions and, to clear my conscience, did a little googling. Google confirmed my thoughts, but that was not enough for me: I wanted to back the confirmation with numbers.
If you are interested in the conclusions I reached and the gains that optimization can yield, welcome under the cut.
Table of contents for the series of articles "Ideal catalog"
- Ideal catalog, architecture sketch
- Ideal catalog, an implementation option
- Ideal catalog, data sample optimization
Brief conclusions
For those who are tired and do not want to wade through a wall of text, the conclusion is this: to read the data, use materialized views with indexes built on all columns.
A query against the materialized view runs 5–15% faster than the same query against a similar table with similar indexes.
That concludes the conclusions; a systematic account follows.
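For reference, building the per-column indexes is plain `CREATE INDEX` work; the view and index names below (`tea_mv` and friends) are illustrative, not the article's actual generated names:

```sql
-- Illustrative sketch: one index per attribute column of the materialized view.
CREATE INDEX tea_mv_tiger_idx ON tea_mv (tiger);
CREATE INDEX tea_mv_bowl_idx  ON tea_mv (bowl);
CREATE INDEX tea_mv_clock_idx ON tea_mv (clock);
-- ...and so on for every column that appears in WHERE clauses.
```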
Methods of working with the EAV model
Data can be fetched with direct queries; you can create a view for each Entity type (catalog rubric); you can create a similar materialized view. As another option, you can create a plain table, but then every Attribute change requires an ALTER TABLE with DROP COLUMN or ADD COLUMN, which is exactly what people adopt EAV to avoid. Still, if the data lives in EAV and the table is used only for reading, it is a workable option.
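A rough sketch of the schema churn a plain table implies (the table and column names here are made up for illustration):

```sql
-- Removing an attribute forces DDL on the flat read-only table:
ALTER TABLE tea_flat DROP COLUMN pudding;
-- Adding a new attribute likewise:
ALTER TABLE tea_flat ADD COLUMN kettle numeric;
```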
You can google other ways to optimize EAV, but I researched only these.
Testing method
PHP scripts were written to generate the contents of the "rubrics", along with scripts that generate the DDL for creating the table, the view, and the materialized view (PostgreSQL).
With these scripts the corresponding data sources were created.
Now we pick an entity (catalog rubric) that has many items and few characteristics (attributes):
```sql
SELECT r.code,
       (SELECT COUNT(*) FROM rubric_property rp WHERE rp.rubric_id = r.id) property_count,
       (SELECT COUNT(*) FROM rubric_item ri WHERE ri.rubric_id = r.id) item_count,
       (SELECT COUNT(*) FROM rubric_property rp WHERE rp.rubric_id = r.id)
     + (SELECT COUNT(*) FROM rubric_item ri WHERE ri.rubric_id = r.id) summary
FROM rubric r
GROUP BY r.code, r.id
ORDER BY summary DESC, property_count DESC, item_count DESC;
```
The number of attributes, in my opinion, does not matter much, but if someone wants to evaluate its impact, all the scripts are in the repository.
We run two kinds of queries: the first returns one row, the second returns several rows.
Of course, thorough testing would require more query types.
For each query type we hit each data source in turn: the direct query, the view, the table, the materialized view, then the direct query again, and so on in a circle. In theory this rotation reduces the impact of query caching; running the same query 1000 times in a row strikes me as a poor way to test.
Thus we take five time measurements, discard the extremes (the longest and the shortest run), and tabulate the rest.
Testing was conducted on the following data set:
- rubrics (entity): 323
- product characteristics (attribute): 47,229
- product items: 6,989
- values: 1,102,279
The chosen rubric itself contained 41 items and 22 characteristics.
Test results
The query returns one row (many filter conditions)
```sql
WHERE mv.tiger IN ('poor', 'white', 'orange', 'red')
  AND mv.bowl BETWEEN 1000 AND 4000
  AND mv.clock > 3000
  AND mv.legs < 2000
  AND mv.snake = 'crazy'
```
| iteration | SELECT | VIEW | TABLE | MAT VIEW |
|---|---|---|---|---|
| 1 | 61 | 53 | 22 | 20 |
| 2 | 49 | 50 | 20 | 15 |
| 3 | 50 | 49 | 20 | 14 |
| mean time, ms | 53 | 50 | 20 | 16 |
The query returns several rows (fewer filter conditions)
```sql
WHERE mv.tiger IN ('poor', 'white', 'orange', 'red')
  AND mv.bowl BETWEEN 1000 AND 4000
```
| iteration | SELECT | VIEW | TABLE | MAT VIEW |
|---|---|---|---|---|
| 1 | 92 | 70 | 31 | 19 |
| 2 | 63 | 56 | 19 | 14 |
| 3 | 54 | 54 | 22 | 18 |
| mean time, ms | 69 | 60 | 24 | 17 |
Even the direct queries show decent results here. During testing I changed the data set several times, and the direct query time could exceed one second (1000 ms), while the query against the materialized view took no more than 30 ms.
As you can see, even when the data is stored in a table, selecting from the materialized view is faster.
Possible optimizations
Materialized view
A materialized view works many times faster than direct queries.
When the data changes, the view must be refreshed (REFRESH MATERIALIZED VIEW).
When an attribute (characteristic) is added, the view must be re-created (CREATE MATERIALIZED VIEW).
These operations take a matter of seconds and can be automated and performed without involving qualified personnel.
But if we have a million users constantly adding new catalog items, and the administrators also change the list of characteristics several times an hour, this can become a problem.
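For the record, the maintenance boils down to two statements (`tea_mv` is an assumed name here). Note that since PostgreSQL 9.4, a unique index on the view additionally allows refreshing without locking out readers:

```sql
-- Full refresh; takes an exclusive lock on the view while it runs:
REFRESH MATERIALIZED VIEW tea_mv;

-- With a unique index, the view can be refreshed without blocking readers:
CREATE UNIQUE INDEX tea_mv_item_uq ON tea_mv (item_id);
REFRESH MATERIALIZED VIEW CONCURRENTLY tea_mv;
```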
Table
Queries against the table, like queries against the materialized view, run many times faster than direct queries, although the table is slightly slower than the materialized view.
When the attribute list changes, a table has the same problems as a materialized view; but when the value of a characteristic changes, there is no need to rebuild everything: it is enough to update only the changed rows and columns.
While a foreign key cannot be created on a view, on a table it can and should be created to link the catalog item to the row in our characteristics table, improving data consistency. A table also gives you everything tables offer: triggers, logging, backups.
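A minimal sketch of both points, with illustrative names (`tea_flat` stands in for the flat characteristics table; the foreign key assumes `rubric_item.item_id` is unique):

```sql
-- A point update instead of a full rebuild when one value changes:
UPDATE tea_flat SET bowl = 1500 WHERE item_id = 42;

-- Unlike a view, the table can carry a foreign key back to the catalog:
ALTER TABLE tea_flat
    ADD CONSTRAINT tea_flat_item_fk
    FOREIGN KEY (item_id) REFERENCES rubric_item (item_id);
```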
View
Data Definition Language:
```sql
CREATE VIEW tea_v AS
SELECT ri.item_id item_id,
  (SELECT sm.string FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN string_matter sm ON sm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'snake') "snake",
  (SELECT sm.string FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN string_matter sm ON sm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'tiger') "tiger",
  (SELECT sm.string FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN string_matter sm ON sm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'beans') "beans",
  (SELECT sm.string FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN string_matter sm ON sm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'sweater') "sweater",
  (SELECT sm.string FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN string_matter sm ON sm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'pudding') "pudding",
  (SELECT sm.string FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN string_matter sm ON sm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'bumper') "bumper",
  (SELECT sm.string FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN string_matter sm ON sm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'train') "train",
  (SELECT sm.string FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN string_matter sm ON sm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'alligator') "alligator",
  (SELECT sm.string FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN string_matter sm ON sm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'trousers') "trousers",
  (SELECT dm.digital FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN digital_matter dm ON dm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'bowl') "bowl",
  (SELECT dm.digital FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN digital_matter dm ON dm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'cabbage') "cabbage",
  (SELECT dm.digital FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN digital_matter dm ON dm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'ship') "ship",
  (SELECT dm.digital FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN digital_matter dm ON dm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'undershirt') "undershirt",
  (SELECT dm.digital FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN digital_matter dm ON dm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'simmer') "simmer",
  (SELECT dm.digital FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN digital_matter dm ON dm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'squirrel') "squirrel",
  (SELECT dm.digital FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN digital_matter dm ON dm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'ceiling') "ceiling",
  (SELECT dm.digital FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN digital_matter dm ON dm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'legs') "legs",
  (SELECT dm.digital FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN digital_matter dm ON dm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'jacket') "jacket",
  (SELECT dm.digital FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN digital_matter dm ON dm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'book') "book",
  (SELECT dm.digital FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN digital_matter dm ON dm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'stomach') "stomach",
  (SELECT dm.digital FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN digital_matter dm ON dm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'clock') "clock",
  (SELECT dm.digital FROM item_content ic JOIN content c ON ic.content_id = c.id JOIN property p ON c.property_id = p.id JOIN digital_matter dm ON dm.content_id = c.id WHERE ic.item_id = ri.item_id AND p.code = 'furniture') "furniture"
FROM rubric_item ri
WHERE ri.rubric_id = (SELECT id FROM rubric WHERE code = 'tea');
```
A view works slightly faster than a direct query, and using a view reduces the amount of code: where the raw query can easily balloon to 1000 lines or more, querying the view takes a single line, so less data travels between the DBMS server and the application server, and the DBMS parser has less work to do.
Data changes are reflected in the view immediately, with no extra actions.
When the list of characteristics (attributes) changes, the view must be re-created.
Using EAV Model
For a typical online store that sells both refrigerators and auto parts, using EAV is more than justified: you will not create a new table, or a new column in a characteristics table, for every product. Moreover, each rubric usually holds no more than a hundred items, so a separate materialized view per rubric is a good idea: searching among hundreds of rows is faster than searching among millions of attribute rows.
Characteristic values change once in a lifetime, at the moment the product item is added. The list of characteristics changes with each turn of fashion, no more than once every six months. Product items are added with each delivery, no more than once a week.
In other words, the data is read often and almost never changes.
The EAV model is a great fit for this use case, and optimization through materialized views gives read speeds no lower than reads from ordinary tables.
Bonus
In the previous article I promised to share the classes for generating EAV queries. I still have not gotten around to polishing them, but I can share a
repository that contains all of it; the only thing missing is documentation.
If someone undertakes to sort out this graveyard of good intentions, get in touch and I will help (my contacts are listed in my profile; VK is best).
If you want to repeat the experiment
For those who find my home-grown testing of these optimizations dubious, I suggest testing them yourself. All the scripts for this
are laid out in the repository.
The sequence of actions is described in
use_case.sql.
In addition, using the following bundle:
- noun.txt
- adjective.txt
- get_pdo.php
- words_input.php
- data_generation.php
- ddl_generation.php
You can create your own data set and conduct your experiments.
There is not much code in the scripts (about 1,000 lines across three PHP files); tweak it as you like.
The code runs on PHP 7.1 and PostgreSQL 10.1; with some editing of the files it will work on earlier versions.
Conclusion
If you like, subscribe to our channel, put likes, write comments.