
Postgre(no)SQL, or once again about storing data with a flexible structure

When it comes to storing flexible (not known in advance, frequently modified) data structures in a database, developers usually turn either to the "great and terrible" EAV pattern or to the now-fashionable NoSQL databases.

Not so long ago, I faced exactly this task.

- EAV . I have a long-standing hostility toward it, and plenty of negative things have been said and written about it (Kyte, Fowler, Karwin, Gorman). The main disadvantage is that when writing queries you have to operate not on real entities ("Employee", "House", "Client", which is exactly what SQL is intended for), but on objects of a lower level (sorry, that's the clearest way I can put it). So it was not the most desirable option.

- NoSQL . At first, this option interested me a lot (MongoDB in particular). After years of working with relational databases, you initially feel a breathtaking sense of total freedom: documents of any structure, instant creation of new collections and of queries against them, a thing of beauty! But after a short while the euphoria began to subside, and problems showed up:

- A poor query language (IMHO), and no joins;

- No schema (a good article on this topic, among others, appeared recently: habrahabr.ru/post/164361 );

- No built-in support for referential integrity;

- No goodies such as stored procedures/functions, triggers, views, and much more.

- In my application, besides data with a flexible (changeable) structure, I also need to store ordinary static data: tables of users, visits, employees, and so on. Working with these (again, IMHO) is much simpler and, most importantly, safer in a conventional relational database (referential integrity again, etc.).
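To make the EAV complaint above concrete, here is a minimal sketch (all table, column, and attribute names are mine, purely illustrative) of what even a simple question turns into once entities are decomposed into generic rows:

```sql
-- A minimal EAV schema: one generic table each for entities, attributes,
-- and values, instead of a real "employees" table.
CREATE TABLE entities    (id integer PRIMARY KEY, entity_type text);
CREATE TABLE attributes  (id integer PRIMARY KEY, name text);
CREATE TABLE attr_values (entity_id    integer REFERENCES entities(id),
                          attribute_id integer REFERENCES attributes(id),
                          value        text);

-- "Employees named Smith with salary above 1000": one self-join of the
-- values table per attribute, and the salary is stored as text.
SELECT e.id
FROM entities e
JOIN attr_values v1 ON v1.entity_id = e.id
JOIN attributes  a1 ON a1.id = v1.attribute_id AND a1.name = 'name'
JOIN attr_values v2 ON v2.entity_id = e.id
JOIN attributes  a2 ON a2.id = v2.attribute_id AND a2.name = 'salary'
WHERE e.entity_type = 'employee'
  AND v1.value = 'Smith'
  AND v2.value::numeric > 1000;
```

Every additional attribute in the question adds another join pair, which is exactly the "operating on objects of a lower level" problem.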







I tried to solve the first problem (partially) with an ORM (Spring Data, in my case): it let me write tolerable queries against objects, but for that you have to create and compile all the classes (corresponding to the necessary collections) in advance and then operate on them. That did not suit me, because collections were supposed to be created and changed frequently and quickly, on the fly.

The second, by creating a separate collection that stores the structures of all the other collections, validating input data against it, and so on.

I never got as far as solving the remaining problems; I gave up...

Already at this stage, my database was starting to look like a very fragile structure, entirely dependent on the application, and I had to implement by hand many things that most relational databases do out of the box. Maybe that is normal, but I was not used to it, and it felt uncomfortable.


Next, I thought about how great it would be to combine a relational and a NoSQL DBMS: on one side, the full power of a relational database with everything that comes with it; on the other, the ease and elegance of a document-oriented solution. Indeed, what prevents us from storing objects with a flexible structure in a separate, special table (or tables), for example in XML format, and accessing them with XPath, especially since many modern DBMSs have advanced XML support (including indexing)?

I decided to try this on a small example in PostgreSQL, to see what happens and what the queries would look like.



For a start, two service tables are enough; I think comments would be superfluous:



CREATE TABLE classes (
    id integer NOT NULL,
    name text,
    is_closed boolean,
    obects_count integer,
    CONSTRAINT classes_pk PRIMARY KEY (id)
);

-- The sequence used below to generate object ids
CREATE SEQUENCE objects_id_seq;

CREATE TABLE objects (
    id integer NOT NULL DEFAULT nextval('objects_id_seq'),
    body xml,
    id_classes integer,
    CONSTRAINT objects_pk PRIMARY KEY (id),
    CONSTRAINT classes_objects FOREIGN KEY (id_classes)
        REFERENCES classes (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION
);

CREATE INDEX fki_classes_objects ON objects USING btree (id_classes);
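One of the relational "goodies" missed in NoSQL can be applied to the flexible objects right away. As a sketch (the function and trigger names are mine, and the convention that the XML root element is the capitalized class name is an assumption carried over from the examples below), a trigger can reject any object whose body does not match its class:

```sql
-- Hypothetical guard trigger: the root element of the stored XML must
-- match the name of the object's class (e.g. class 'orders' -> <Orders>).
CREATE OR REPLACE FUNCTION check_object_body() RETURNS trigger AS $$
DECLARE
    class_name text;
BEGIN
    SELECT name INTO class_name FROM classes WHERE id = NEW.id_classes;
    IF NOT xpath_exists('/' || initcap(class_name), NEW.body) THEN
        RAISE EXCEPTION 'Body of object % does not match class %',
            NEW.id, class_name;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER objects_body_check
    BEFORE INSERT OR UPDATE ON objects
    FOR EACH ROW EXECUTE PROCEDURE check_object_body();
```

This is only a root-element check, not full schema validation, but it already gives the flexible data some of the safety net the article misses in document stores.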




Let's create the two classes we will experiment with:



INSERT INTO classes (id, name, is_closed, obects_count) VALUES (1, 'customers', FALSE, 0);
INSERT INTO classes (id, name, is_closed, obects_count) VALUES (2, 'orders', FALSE, 0);




Let's prepare two helper functions for generating random test data (found on the Internet):



CREATE OR REPLACE FUNCTION random(numeric, numeric)
  RETURNS numeric AS
$BODY$
    SELECT ($1 + ($2 - $1) * random())::numeric;
$BODY$
  LANGUAGE sql VOLATILE
  COST 100;

CREATE OR REPLACE FUNCTION random_string(length integer)
  RETURNS text AS
$BODY$
declare
    chars text[] := '{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}';
    result text := '';
    i integer := 0;
begin
    if length < 0 then
        raise exception 'Given length cannot be less than 0';
    end if;
    for i in 1..length loop
        result := result || chars[1 + random() * (array_length(chars, 1) - 1)];
    end loop;
    return result;
end;
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100;
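A quick sanity check of the helpers (the output is random, so no fixed values are shown):

```sql
SELECT random(1, 1000);   -- a random numeric between 1 and 1000
SELECT random_string(10); -- a random 10-character alphanumeric string
```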




Now let's fill the table with random data: objects of the "Customer" and "Order" classes (a one-to-many relationship; each customer has made five orders):



DO $$
DECLARE
    customer_pk integer;
    order_pk    integer;
BEGIN
    FOR i IN 1..10000 LOOP
        customer_pk := nextval('objects_id_seq');
        INSERT INTO objects (id, body, id_classes)
        VALUES (customer_pk, ('<Customers>
            <Customer>
                <ID>' || customer_pk || '</ID>
                <Name>' || random_string(10) || '</Name>
                <Partners>' || random_string(10) || '</Partners>
            </Customer>
        </Customers>')::xml, 1);
        FOR j IN 1..5 LOOP
            -- a fresh id per order, so each order gets a distinct key
            order_pk := nextval('objects_id_seq');
            INSERT INTO objects (id, body, id_classes)
            VALUES (order_pk, ('<Orders>
                <Order>
                    <ID>' || order_pk || '</ID>
                    <Customer_id>' || customer_pk || '</Customer_id>
                    <Cost>' || random(1, 1000) || '</Cost>
                </Order>
            </Orders>')::xml, 2);
        END LOOP;
    END LOOP;
END$$;




The first query selects the maximum order cost:



explain
select max(((xpath('/Orders/Order/Cost/text()', O.body))[1])::text::float) as cost_of_order
from Objects O
where O.id_classes = 2;


/*
Aggregate  (cost=2609.10..2609.11 rows=1 width=32)
  ->  Seq Scan on objects o  (cost=0.00..2104.50 rows=50460 width=32)
        Filter: (id_classes = 2)
*/


The query turned out a bit quirky, but it is still quite readable: it is immediately clear which entity is being queried and by which attribute. Unsurprisingly, it is a full scan, but nothing prevents us from building an index on the Cost attribute:



create index obj_orders_cost_idx on objects using btree
    (((xpath('/Orders/Order/Cost/text()', body))[1]::text::float));




And now the query works much faster and uses the index:

/*
Result  (cost=0.15..0.16 rows=1 width=0)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.00..0.15 rows=1 width=32)
          ->  Index Scan Backward using obj_orders_cost_idx on objects o  (cost=0.00..7246.26 rows=50207 width=32)
                Index Cond: ((((xpath('/Orders/Order/Cost/text()'::text, body, '{}'::text[]))[1])::text)::double precision IS NOT NULL)
                Filter: (id_classes = 2)
*/




Now let's try to select information about the orders of a few specific customers, i.e., a join of two tables:



explain
select (xpath('/Customers/Customer/Name/text()', C.body))[1] as customer,
       (xpath('/Orders/Order/Cost/text()', O.body))[1] as cost_of_order
from objects C, objects O
where C.id_classes = 1
  and O.id_classes = 2
  and (xpath('/Orders/Order/Customer_id/text()', O.body))[1]::text::int =
      (xpath('/Customers/Customer/ID/text()', C.body))[1]::text::int
  and ((xpath('/Customers/Customer/ID/text()', C.body))[1])::text::int between 1997585 and 1997595;




/*
Hash Join  (cost=1873.57..6504.85 rows=12867 width=64)
  Hash Cond: ((((xpath('/Orders/Order/Customer_id/text()'::text, o.body, '{}'::text[]))[1])::text)::integer = (((xpath('/Customers/Customer/ID/text()'::text, c.body, '{}'::text[]))[1])::text)::integer)
  ->  Seq Scan on objects o  (cost=0.00..2104.50 rows=50460 width=32)
        Filter: (id_classes = 2)
  ->  Hash  (cost=1872.93..1872.93 rows=51 width=32)
        ->  Bitmap Heap Scan on objects c  (cost=196.38..1872.93 rows=51 width=32)
              Recheck Cond: (id_classes = 1)
              Filter: (((((xpath('/Customers/Customer/ID/text()'::text, body, '{}'::text[]))[1])::text)::integer >= 1997585) AND ((((xpath('/Customers/Customer/ID/text()'::text, body, '{}'::text[]))[1])::text)::integer <= 1997595))
              ->  Bitmap Index Scan on fki_classes_objects  (cost=0.00..196.37 rows=10140 width=0)
                    Index Cond: (id_classes = 1)
*/




A full scan, as expected; now let's add a few indexes:



create index obj_customers_id_idx on objects using btree
    (((xpath('/Customers/Customer/ID/text()', body))[1]::text::int));

create index obj_orders_id_idx on objects using btree
    (((xpath('/Orders/Order/ID/text()', body))[1]::text::int));

create index obj_orders_customerid_idx on objects using btree
    (((xpath('/Orders/Order/Customer_id/text()', body))[1]::text::int));




Now it turns out more fun:



/*
Hash Join  (cost=380.52..5011.80 rows=12867 width=64)
  Hash Cond: ((((xpath('/Orders/Order/Customer_id/text()'::text, o.body, '{}'::text[]))[1])::text)::integer = (((xpath('/Customers/Customer/ID/text()'::text, c.body, '{}'::text[]))[1])::text)::integer)
  ->  Seq Scan on objects o  (cost=0.00..2104.50 rows=50460 width=32)
        Filter: (id_classes = 2)
  ->  Hash  (cost=379.88..379.88 rows=51 width=32)
        ->  Bitmap Heap Scan on objects c  (cost=204.00..379.88 rows=51 width=32)
              Recheck Cond: (((((xpath('/Customers/Customer/ID/text()'::text, body, '{}'::text[]))[1])::text)::integer >= 1997585) AND ((((xpath('/Customers/Customer/ID/text()'::text, body, '{}'::text[]))[1])::text)::integer <= 1997595) AND (id_classes = 1))
              ->  BitmapAnd  (cost=204.00..204.00 rows=51 width=0)
                    ->  Bitmap Index Scan on obj_customers_id_idx  (cost=0.00..7.35 rows=303 width=0)
                          Index Cond: (((((xpath('/Customers/Customer/ID/text()'::text, body, '{}'::text[]))[1])::text)::integer >= 1997585) AND ((((xpath('/Customers/Customer/ID/text()'::text, body, '{}'::text[]))[1])::text)::integer <= 1997595))
                    ->  Bitmap Index Scan on fki_classes_objects  (cost=0.00..196.37 rows=10140 width=0)
                          Index Cond: (id_classes = 1)
*/




This query has not lost its readability either, but it could be polished further: sort out the type conversions, optimize the XML structure, and so on. There is plenty of work left; this is just a small example.
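One way to polish such queries is to hide the XPath plumbing behind views. The view and column names below are my own, a sketch rather than part of the original example:

```sql
-- Hypothetical views that expose the XML objects as ordinary relations.
CREATE VIEW v_customers AS
SELECT id,
       ((xpath('/Customers/Customer/ID/text()', body))[1])::text::int AS customer_id,
       ((xpath('/Customers/Customer/Name/text()', body))[1])::text    AS name
FROM objects
WHERE id_classes = 1;

CREATE VIEW v_orders AS
SELECT id,
       ((xpath('/Orders/Order/ID/text()', body))[1])::text::int          AS order_id,
       ((xpath('/Orders/Order/Customer_id/text()', body))[1])::text::int AS customer_id,
       ((xpath('/Orders/Order/Cost/text()', body))[1])::text::float      AS cost
FROM objects
WHERE id_classes = 2;

-- The join above then reads like plain SQL; since views are expanded
-- inline, the expression indexes built earlier still apply.
SELECT c.name, o.cost
FROM v_customers c
JOIN v_orders o ON o.customer_id = c.customer_id
WHERE c.customer_id BETWEEN 1997585 AND 1997595;
```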



What else could be done:



1. Flexible search by attributes of objects of any class;

2. The objects table can be partitioned (at least partially), for example, to store objects of large classes physically separately.
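For point 2, a sketch using table inheritance, the partitioning mechanism PostgreSQL offered at the time this was written (child table names are assumed):

```sql
-- Hypothetical per-class partitions of the objects table.
CREATE TABLE objects_customers (
    CHECK (id_classes = 1)
) INHERITS (objects);

CREATE TABLE objects_orders (
    CHECK (id_classes = 2)
) INHERITS (objects);

-- With constraint exclusion enabled, a query on the parent table only
-- scans the children whose CHECK constraint matches the filter:
SET constraint_exclusion = partition;
SELECT count(*) FROM objects WHERE id_classes = 2;
```

A BEFORE INSERT trigger on the parent would still be needed to route new rows into the right child table.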



Storing data in a database in XML format is, of course, nothing new, but while searching for a solution I found very little information about this approach, let alone concrete examples. I hope it will come in handy for someone, and I would also like to hear in the comments the opinions of people who have worked with a similar scheme.

Source: https://habr.com/ru/post/164803/


