
A little about query optimization

I want to use a simple example to show how queries that look quite simple at first glance can sometimes be optimized substantially. Take this code as an example; it is written for PostgreSQL 9.3, but the principle applies to any DBMS that supports hash joins.

The task is simple: join two tables, one quite large and the other small, but the join condition is not a plain equality, it contains an OR. (A real-life case: joining a table of account postings to the accounts table, given that each posting carries two account fields, one for the debit side and one for the credit side.)
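To make that real-life case more concrete, such a query might look roughly like this (a sketch of mine; the names posting, account, debit_account_id and credit_account_id are hypothetical, not taken from the article):

-- Hypothetical schema: each posting references two accounts,
-- one on the debit side and one on the credit side.
select p.*, a.*
from posting p
inner join account a
    on p.debit_account_id = a.id
    or p.credit_account_id = a.id;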

Preparing test data:

create table public.test1 ( id bigint, id2 bigint, value varchar(100) );
create table public.test2 ( id bigint, value varchar(100) );

insert into public.test1 select generate_series(1,2000000), 1, 'abcdef';
insert into public.test2 select generate_series(1,100), 'abcdefssdf';

create index ix_test2_id on public.test2 (id);
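One detail that is not in the original article: after bulk-loading the data it is worth refreshing the planner statistics, otherwise the row estimates in the plans below may be off. In PostgreSQL that is simply:

-- Refresh planner statistics for both tables after the bulk insert.
analyze public.test1;
analyze public.test2;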

And here is our query in its original form, with the OR condition:
select *
from public.test1 t1
inner join public.test2 t2
    on t1.id = t2.id or t1.id2 = t2.id;

With this condition the join is guaranteed to be executed as a nested loop. The plan:

Nested Loop  (cost=0.00..3532741.25 rows=2000099 width=42) (actual time=0.043..61627.189 rows=2000099 loops=1)
  Join Filter: ((t1.id = t2.id) OR (t1.id2 = t2.id))
  Rows Removed by Join Filter: 197999901
  ->  Seq Scan on test1 t1  (cost=0.00..32739.00 rows=2000000 width=23) (actual time=0.010..385.658 rows=2000000 loops=1)
  ->  Materialize  (cost=0.00..2.50 rows=100 width=19) (actual time=0.000..0.007 rows=100 loops=2000000)
        ->  Seq Scan on test2 t2  (cost=0.00..2.00 rows=100 width=19) (actual time=0.005..0.018 rows=100 loops=1)
Total runtime: 61717.751 ms

Notice the two million passes over the small table: that is the main drawback of a nested loop. Even if those had been two million index lookups, it would still be bad.
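As a side note of mine (the article only quotes the plans): output like the above is produced by prefixing the query with EXPLAIN ANALYZE, which both plans and actually executes it:

-- Show the chosen plan together with real execution timings and row counts.
explain analyze
select *
from public.test1 t1
inner join public.test2 t2
    on t1.id = t2.id or t1.id2 = t2.id;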

What to do?

A hash join would help us here: build a hash of the small table, which easily fits in memory, then make a single pass over the large one. Ideal. But with an OR in the join condition we will not get a hash join. There is a way out, though!

Join the small table twice, once on each field, and move the OR into the filter:

select *
from public.test1 t1
left join public.test2 t2 on t1.id = t2.id
left join public.test2 t3 on t1.id2 = t3.id
where t2.id is not null
   or t3.id is not null;

This plan is much faster. A 5 kB hash table is exactly what we need: even taking into account that there are now two joins, the gain is tens of times.

Hash Left Join  (cost=6.50..67746.50 rows=2000000 width=61) (actual time=0.124..2230.636 rows=2000000 loops=1)
  Hash Cond: (t1.id2 = t3.id)
  Filter: ((t2.id IS NOT NULL) OR (t3.id IS NOT NULL))
  ->  Hash Left Join  (cost=3.25..40243.25 rows=2000000 width=42) (actual time=0.073..1065.822 rows=2000000 loops=1)
        Hash Cond: (t1.id = t2.id)
        ->  Seq Scan on test1 t1  (cost=0.00..32739.00 rows=2000000 width=23) (actual time=0.012..338.759 rows=2000000 loops=1)
        ->  Hash  (cost=2.00..2.00 rows=100 width=19) (actual time=0.041..0.041 rows=100 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 5kB
              ->  Seq Scan on test2 t2  (cost=0.00..2.00 rows=100 width=19) (actual time=0.004..0.015 rows=100 loops=1)
  ->  Hash  (cost=2.00..2.00 rows=100 width=19) (actual time=0.039..0.039 rows=100 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 5kB
        ->  Seq Scan on test2 t3  (cost=0.00..2.00 rows=100 width=19) (actual time=0.004..0.016 rows=100 loops=1)
Total runtime: 2318.009 ms

Thanks for your attention.

Source: https://habr.com/ru/post/269511/

