
PostgreSQL: Accelerating with intarray

About six years ago, when the elephant was only at version 8.0 and I was still sitting tight on MySQL, I often heard calls to switch databases. I remember how painful it was to get started, but once I made up my mind I never regretted it, and I am hardly likely to go back. There are plenty of advantages, but this post is not about that.



Then a task came in: to write a store, one expected to grow big, à la Photos or Hotline. And the standard feature for such sites is a filter.



How is this done in the usual "engines"? Right: parameters = filters. That is simple and clear, but not always pretty, especially when you need filters with ranges (for example, a diagonal of 10"-11"). And then you realize that filters are an entity independent of the parameters. In my version the filters are "fitted" onto the parameters. I do not know how it is done on Hotline itself (I suspect their filters are attached directly to the goods), but that option requires a lot of manual attention. I will not go into the details of the architecture, that is a topic for another post; here we will look at a simplified version so as not to get lost.



The tricky part of such filters is building them: for the currently selected (or not yet selected) filters, you need to calculate how many products each one matches.


So…



Create a table for products:



CREATE TABLE public.products (
    id SERIAL,
    title VARCHAR NOT NULL,
    CONSTRAINT products_pkey PRIMARY KEY (id)
);


Filter table:



CREATE TABLE public.filters (
    id SERIAL,
    title VARCHAR NOT NULL,
    CONSTRAINT filters_pkey PRIMARY KEY (id)
);


Table for links between filters and products:



CREATE TABLE public.products_ref_filters (
    id_product INTEGER NOT NULL,
    id_filter INTEGER NOT NULL,
    CONSTRAINT products_ref_filters_pkey PRIMARY KEY (id_product, id_filter),
    CONSTRAINT products_ref_filters_fk FOREIGN KEY (id_product)
        REFERENCES public.products(id) ON DELETE CASCADE ON UPDATE CASCADE,
    CONSTRAINT products_ref_filters_fk1 FOREIGN KEY (id_filter)
        REFERENCES public.filters(id) ON DELETE CASCADE ON UPDATE CASCADE
);


So far this is the standard layout. Trivial.



Moving on…



Add a filters column to the products table:



ALTER TABLE public.products
    ADD COLUMN filters INTEGER[] DEFAULT ARRAY[]::integer[] NOT NULL;


As you may have noticed, this is an integer array; the filter IDs will be duplicated here, so yes, this is denormalization. To keep it consistent, we write procedures and hang them on triggers. Insert:



CREATE OR REPLACE FUNCTION public.products_ref_filters__insert_tr () RETURNS trigger AS '
BEGIN
    UPDATE products
    SET filters = filters + NEW.id_filter  -- push element onto array
    WHERE id = NEW.id_product;
    RETURN NEW;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER products_ref_filters__insert_tr
    AFTER INSERT ON public.products_ref_filters
    FOR EACH ROW EXECUTE PROCEDURE public.products_ref_filters__insert_tr();


Delete:



CREATE OR REPLACE FUNCTION public.products_ref_filters__delete_tr () RETURNS trigger AS '
BEGIN
    UPDATE products
    SET filters = filters - OLD.id_filter  -- remove entries matching right argument from array
    WHERE id = OLD.id_product;
    RETURN OLD;
END;
' LANGUAGE 'plpgsql';

-- attach the function to the link table, same as for inserts
CREATE TRIGGER products_ref_filters__delete_tr
    AFTER DELETE ON public.products_ref_filters
    FOR EACH ROW EXECUTE PROCEDURE public.products_ref_filters__delete_tr();


Update:



CREATE OR REPLACE FUNCTION public.products_ref_filters__update_tr () RETURNS trigger AS '
BEGIN
    UPDATE products
    SET filters = filters - OLD.id_filter
    WHERE id = OLD.id_product;

    UPDATE products
    SET filters = filters + NEW.id_filter
    WHERE id = NEW.id_product;

    RETURN NEW;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER products_ref_filters__update_tr
    AFTER UPDATE ON public.products_ref_filters
    FOR EACH ROW EXECUTE PROCEDURE public.products_ref_filters__update_tr();


Truncate:



CREATE OR REPLACE FUNCTION public.products_ref_filters__truncate_tr () RETURNS trigger AS '
BEGIN
    UPDATE products SET filters = ARRAY[]::INTEGER[];
    RETURN NULL;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER products_ref_filters__truncate_tr
    AFTER TRUNCATE ON public.products_ref_filters
    FOR EACH STATEMENT EXECUTE PROCEDURE public.products_ref_filters__truncate_tr();


Now, whenever rows are inserted into, updated in, or deleted from the link table, the array of filter IDs in the products table is kept up to date, and we have gotten rid of JOINs. There is no longer any need to glue tables together in queries, which brings a lot of advantages: queries are simpler to build, they run faster, they use less memory, and so on. But isn't the article about intarray? Yes.
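As a quick sanity check, here is a sketch (the IDs are hypothetical and depend on what is already in your tables). Note that the + and - operators used in the trigger bodies come from intarray, so the extension installed in the next step has to be in place before these triggers can actually fire:

-- hypothetical IDs, for illustration only
INSERT INTO products_ref_filters (id_product, id_filter) VALUES (1, 1), (1, 2);

SELECT filters FROM products WHERE id = 1;
-- expected: {1,2}

DELETE FROM products_ref_filters WHERE id_product = 1 AND id_filter = 2;

SELECT filters FROM products WHERE id = 1;
-- expected: {1}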



Install the extension:



 CREATE EXTENSION intarray; 


After executing this command, the database gains functions, operators, and index operator classes for working with INTEGER arrays, and working with them becomes much faster and more convenient.
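To get a feel for what the extension adds, here is a small sketch of a few of its operators and functions applied to plain array literals (nothing below touches our tables):

SELECT ARRAY[1,2,3] + 4;              -- {1,2,3,4}  push an element onto the array
SELECT ARRAY[1,2,3,2] - 2;            -- {1,3}      remove all entries equal to 2
SELECT ARRAY[1,2,3] & ARRAY[2,3,4];   -- {2,3}      intersection
SELECT ARRAY[1,2,3] | ARRAY[3,4];     -- {1,2,3,4}  union
SELECT icount(ARRAY[1,2,3]);          -- 3          number of elements
SELECT ARRAY[1,2,3] @> ARRAY[2,3];    -- true       left array contains right array
SELECT ARRAY[1,2,3] && ARRAY[3,4];    -- true       arrays overlap
SELECT sort(ARRAY[3,1,2]);            -- {1,2,3}
SELECT uniq(ARRAY[1,1,2]);            -- {1,2}      remove adjacent duplicates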



Let's fill our database: 10,000 filters, 100,000 products, 10 filters per product, which gives a link table of 1,000,000 rows. This block of queries took about 8 minutes, so be patient.



INSERT INTO filters (title)
SELECT 'filter_' || num FROM generate_series(1, 10000) as num;

INSERT INTO products (title)
SELECT 'product_' || num FROM generate_series(1, 100000) as num;

DO $$
DECLARE
    idp INTEGER;
BEGIN
    FOR idp IN SELECT id FROM products LOOP
        INSERT INTO products_ref_filters
        SELECT idp, f.id FROM filters f ORDER BY random() LIMIT 10;
    END LOOP;
END$$;


Create an index on an array of numbers:



 CREATE INDEX products_idx ON public.products USING gin (filters public.gin__int_ops); 


That's it, our database is populated. Let's see what we can do with it now. First, let's find the most popular filters and note them down.



SELECT id_filter
FROM products_ref_filters
GROUP BY id_filter
ORDER BY count(*) DESC
LIMIT 10;


My result is: 7267, 4889, 6364, 5376, 3556, 7292, 11188, 2643, 9005, 10235.



Let's find products with a specific filter, let it be 7267:



-- JOIN
SELECT t1.*
FROM products t1, products_ref_filters t2
WHERE t1.id = t2.id_product AND t2.id_filter = 7267;
-- 140 rows returned (execution time: 125 ms; total time: 141 ms)

-- SUB SELECT
SELECT * FROM products
WHERE id IN (SELECT id_product FROM products_ref_filters WHERE id_filter = 7267);
-- 140 rows returned (execution time: 125 ms; total time: 141 ms)

-- INTARRAY
SELECT * FROM products WHERE filters @> ARRAY[7267];
-- 140 rows returned (execution time: 0 ms; total time: 0 ms)


Products matching at least one of these filters:



-- JOIN
SELECT DISTINCT t1.*
FROM products t1, products_ref_filters t2
WHERE t1.id = t2.id_product
  AND t2.id_filter IN (7267,4889,6364,5376,3556,7292,11188,2643,9005,10235);
-- 1347 rows returned (execution time: 297 ms; total time: 297 ms)

-- SUB SELECT
SELECT * FROM products
WHERE id IN (
    SELECT id_product FROM products_ref_filters
    WHERE id_filter IN (7267,4889,6364,5376,3556,7292,11188,2643,9005,10235)
);
-- 1347 rows returned (execution time: 234 ms; total time: 250 ms)

-- INTARRAY
SELECT * FROM products
WHERE filters && ARRAY[7267,4889,6364,5376,3556,7292,11188,2643,9005,10235];
-- 1347 rows returned (execution time: 16 ms; total time: 16 ms)


Forgive me for being too lazy to run every other variant. But when you need to find products that match several filters at once, this approach leaves joins and subselects far behind.



-- JOIN / SUB SELECT: I will not even write these ;)

-- INTARRAY (contains both filters)
SELECT * FROM products WHERE filters @> ARRAY[9844,9957];
-- 1 rows returned (execution time: 0 ms; total time: 16 ms)
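Which brings us back to the original task: counting how many products each filter would leave. With the array column this also collapses into a single pass over the matching products. A sketch (the selected IDs are just an example, and unnest is used here only to flatten each product's array back into rows for the aggregation):

-- products that already match the currently selected filters (example IDs)
WITH matching AS (
    SELECT filters
    FROM products
    WHERE filters @> ARRAY[7267,4889]
)
-- for every filter that occurs among them, how many products it would keep
SELECT f.id, f.title, count(*) AS product_count
FROM matching m
CROSS JOIN LATERAL unnest(m.filters) AS u(id_filter)
JOIN filters f ON f.id = u.id_filter
GROUP BY f.id, f.title
ORDER BY product_count DESC;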


A bit of documentation is here: https://www.postgresql.org/docs/current/intarray.html



Pay attention to the operators; they are simply a fairy tale! I have read that there is even a patch which lets you declare a foreign key on an array so that the database itself maintains integrity, and that the developers plan to support this eventually without any patches at all. That would be great. When I first found this solution for myself, it was pure joy...



Seeing that there is some interest in the topic, I decided to add a few tests at a larger volume.

Let's try 10 million products and slightly change the query that fills in the pseudo-products.

-- filters, same as before (only roughly the first 1000 actually get used below)
INSERT INTO filters (title)
SELECT 'filter_' || num FROM generate_series(1, 10000) as num;

-- 10 000 000 products
INSERT INTO products (title)
SELECT 'product_' || num FROM generate_series(1, 10000000) as num;

-- populating through the link table would take too long,
-- so we write the filter IDs into the array directly
DO $$
DECLARE
    idp INTEGER;
BEGIN
    FOR idp IN SELECT id FROM products LOOP
        UPDATE products
        SET filters = (
            SELECT array_agg(id)
            FROM (SELECT id FROM filters OFFSET random()*1000 LIMIT 10) as foo
        )
        WHERE id = idp;
    END LOOP;
END$$;
-- this took ~20 minutes




DB size ~ 2.9GB



BTREE Index:

CREATE INDEX products_idx ON public.products USING btree (filters);
-- execution time: 00:02:36
-- DB size ~ 3.8Gb

SELECT * FROM products WHERE filters @> '{842}'::INTEGER[];
-- Seq Scan on public.products (cost=0.00..357908.00 rows=10000 width=80) ...
-- 99798 rows returned (execution time: 5.594 sec; total time: 5.594 sec)

SELECT * FROM products WHERE filters = '{842,843,844,845,846,847,848,849,850,851}'::INTEGER[];
-- Bitmap Heap Scan on public.products (cost=487.94..32940.13 rows=9726 width=80)
-- 9940 rows returned (execution time: 46 ms; total time: 62 ms)

SELECT * FROM products WHERE filters && '{842,843,844,845,846,847,848,849,850,851}'::INTEGER[];
-- Seq Scan on public.products (cost=0.00..357908.00 rows=10000 width=80)
-- 189853 rows returned (execution time: 6.281 sec; total time: 6.296 sec)




GIST Index:

CREATE INDEX products_idx ON public.products USING gist (filters public.gist__int_ops);
-- execution time: 00:26:55
-- DB size ~ 4.5Gb

SELECT * FROM products WHERE filters @> '{842}'::INTEGER[];
-- Bitmap Heap Scan on public.products (cost=833.92..34097.44 rows=10000 width=80)
-- 99798 rows returned (execution time: 2.234 sec; total time: 2.234 sec)

SELECT * FROM products WHERE filters = '{842,843,844,845,846,847,848,849,850,851}'::INTEGER[];
-- Bitmap Heap Scan on public.products (cost=811.79..33263.99 rows=9726 width=80)
-- 9940 rows returned (execution time: 234 ms; total time: 234 ms)

SELECT * FROM products WHERE filters && '{842,843,844,845,846,847,848,849,850,851}'::INTEGER[];
-- Bitmap Heap Scan on public.products (cost=833.92..34097.44 rows=10000 width=80)
-- 189853 rows returned (execution time: 5.234 sec; total time: 5.234 sec)




GIN Index:

CREATE INDEX products_idx ON public.products USING gin (filters public.gin__int_ops);
-- execution time: 56.344 sec

SELECT * FROM products WHERE filters @> '{842}'::INTEGER[];
-- Bitmap Heap Scan on public.products (cost=97.50..33361.02 rows=10000 width=80)
-- 99798 rows returned (execution time: 2.204 sec; total time: 2.219 sec)

SELECT * FROM products WHERE filters = '{842,843,844,845,846,847,848,849,850,851}'::INTEGER[];
-- Bitmap Heap Scan on public.products (cost=211.37..32663.57 rows=9726 width=80)
-- 9940 rows returned (execution time: 297 ms; total time: 312 ms)

SELECT * FROM products WHERE filters && '{842,843,844,845,846,847,848,849,850,851}'::INTEGER[];
-- Bitmap Heap Scan on public.products (cost=213.50..33477.02 rows=10000 width=80)
-- 189853 rows returned (execution time: 4.500 sec; total time: 4.515 sec)
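For anyone who wants to reproduce plans and timings like the ones above, wrapping any of these queries in EXPLAIN ANALYZE will print the actual plan together with the run time, for example:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM products
WHERE filters @> '{842}'::INTEGER[];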




Yes, on a really huge database this alone will not save that much. But since a search across the whole table like this hardly ever happens in practice, and there is always an additional restriction, for example by category, the approach is still a good one. In any case, it is about 100 times faster than the JOIN.
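For illustration, assuming a hypothetical id_category column on products (it is not part of the schema above), such a combined query could look like this, with the category condition narrowing the set before the array check:

-- id_category is a hypothetical column, shown only to illustrate combining conditions
SELECT *
FROM products
WHERE id_category = 42
  AND filters @> ARRAY[7267,4889];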

Source: https://habr.com/ru/post/269823/


