
The story of a single SQL investigation

Last December I received an interesting bug report from the VWO support team: the load time of one of the analytical reports for a large enterprise client was prohibitive. Since this area is my responsibility, I immediately focused on the problem.


Background


To give some context, a little about VWO. It is a platform for running targeted campaigns on your sites: A/B experiments, visitor and conversion tracking, sales funnel analysis, heat maps, and session recordings.


But the most important part of the platform is reporting. All of the features above are interconnected, and for enterprise clients the huge volume of collected data would simply be useless without a powerful layer presenting it as analytics.


The platform lets you run arbitrary queries over a large dataset. A simple example:


  Show all clicks on the page "abc.com"
  FROM <date d1> TO <date d2>
  for people who
    used Chrome OR
    (were in Europe AND used an iPhone)

Note the boolean operators: they are exposed in the query interface so that clients can build arbitrarily complex queries over their data.
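
Purely as an illustration, here is a hypothetical sketch of how such a filter might compile to SQL; the table and column names are invented for the example, and the literal dates stand in for <d1> and <d2>:

-- Hypothetical translation of the example filter above;
-- table and column names are made up for illustration.
SELECT *
FROM clicks
WHERE page = 'abc.com'
  AND clicked_at BETWEEN '2019-01-01' AND '2019-01-31'
  AND (
    browser = 'Chrome'
    OR (region = 'Europe' AND device = 'iPhone')
  );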


The slow query


The client in question tried to do something that, intuitively, should work quickly:


  Show all session recordings
  for users who visited any page
  whose URL contains "/jobs"

This site had a huge amount of traffic, and we stored over a million unique URLs for it alone. The client simply wanted to find a URL pattern relevant to their business model.


Preliminary investigation


Let's see what was happening in the database. Here is the original slow SQL query:


SELECT count(*)
FROM acc_{account_id}.urls AS recordings_urls,
     acc_{account_id}.recording_data AS recording_data,
     acc_{account_id}.sessions AS sessions
WHERE recording_data.usp_id = sessions.usp_id
  AND sessions.referrer_id = recordings_urls.id
  AND (urls && array(
        SELECT id FROM acc_{account_id}.urls
        WHERE url ILIKE '%enterprise_customer.com/jobs%'
      )::text[])
  AND r_time > to_timestamp(1542585600)
  AND r_time < to_timestamp(1545177599)
  AND recording_data.duration >= 5
  AND recording_data.num_of_pages > 0;

And here are the timings:


  Planning time: 1.480 ms
  Execution time: 1431924.650 ms

The query touched about 150 thousand rows. The query plan showed a couple of interesting details, but no obvious bottleneck.


Let's explore the query further. As you can see, it joins three tables:


  1. sessions: session information such as browser, user agent, country, and so on.
  2. recording_data: recorded URLs, pages, and visit durations.
  3. urls: to avoid duplicating extremely long URLs, we store them in a separate table.

Also note that all our tables are already partitioned by account_id, which rules out the situation where one particularly large account causes problems for everyone else. A rough sketch of what these per-account tables might look like is below.
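
To make the queries below easier to follow, here is a simplified, hypothetical sketch of the table shapes; the columns are inferred from the queries in this article, and the real schema certainly has more of them:

-- Simplified, hypothetical table shapes inferred from the queries below.
CREATE SCHEMA acc_12345;  -- one schema per account

CREATE TABLE acc_12345.urls (
    id  bigint PRIMARY KEY,  -- long URLs are deduplicated into this table
    url text
);

CREATE TABLE acc_12345.sessions (
    usp_id      bigint PRIMARY KEY,
    referrer_id bigint            -- references acc_12345.urls(id)
    -- browser, user agent, country, ...
);

CREATE TABLE acc_12345.recording_data (
    usp_id       bigint,          -- references acc_12345.sessions(usp_id)
    urls         text[],          -- IDs of visited URLs, stored as text
    r_time       timestamptz,     -- recording time (inferred from the plan)
    duration     double precision,
    num_of_pages integer
);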


In search of evidence


On closer inspection, we can see that something in this particular query is off. This line is worth a look:


urls && array(
  SELECT id FROM acc_{account_id}.urls
  WHERE url ILIKE '%enterprise_customer.com/jobs%'
)::text[]

My first thought was that the ILIKE over all these long URLs (we have over 1.4 million unique URLs collected for this account) might be what was hurting performance.


But no, that's not it!


SELECT id FROM urls WHERE url ILIKE '%enterprise_customer.com/jobs%';

   id
--------
 ...
(198661 rows)

Time: 5231.765 ms

The pattern search itself takes only about 5 seconds; pattern-matching over a million-plus unique URLs is clearly not the problem.


The next suspects on the list were the JOINs. Perhaps their overuse caused the slowdown? JOINs are usually the most obvious candidates for performance problems, but I didn't believe our case was typical.


analytics_db=# SELECT count(*)
FROM acc_{account_id}.urls AS recordings_urls,
     acc_{account_id}.recording_data_0 AS recording_data,
     acc_{account_id}.sessions_0 AS sessions
WHERE recording_data.usp_id = sessions.usp_id
  AND sessions.referrer_id = recordings_urls.id
  AND r_time > to_timestamp(1542585600)
  AND r_time < to_timestamp(1545177599)
  AND recording_data.duration >= 5
  AND recording_data.num_of_pages > 0;

 count
-------
  8086
(1 row)

Time: 147.851 ms

That wasn't it either: the JOINs turned out to be quite fast.


Narrowing down the suspects


I was ready to start rewriting the query to squeeze out any possible performance gains. The team and I came up with two main ideas: wrap the URL subquery in EXISTS, and pull it out into a CTE. Here is the first attempt, with EXISTS:

SELECT count(*)
FROM acc_{account_id}.urls AS recordings_urls,
     acc_{account_id}.recording_data AS recording_data,
     acc_{account_id}.sessions AS sessions
WHERE recording_data.usp_id = sessions.usp_id
  AND (1 = 1)
  AND sessions.referrer_id = recordings_urls.id
  AND (EXISTS(
        SELECT id FROM acc_{account_id}.urls
        WHERE url ILIKE '%enterprise_customer.com/jobs%'
      ))
  AND r_time > to_timestamp(1547585600)
  AND r_time < to_timestamp(1549177599)
  AND recording_data.duration >= 5
  AND recording_data.num_of_pages > 0;

 count
-------
 32519
(1 row)

Time: 1636.637 ms

Well, yes: wrapping the subquery in EXISTS makes everything super fast. The next logical question was why the query with the JOINs is fast and the subquery on its own is fast, but together they are terribly slow.

The second idea was to pull the subquery out into a CTE:

WITH matching_urls AS (
  SELECT id::text
  FROM acc_{account_id}.urls
  WHERE url ILIKE '%enterprise_customer.com/jobs%'
)
SELECT count(*)
FROM acc_{account_id}.urls AS recordings_urls,
     acc_{account_id}.recording_data AS recording_data,
     acc_{account_id}.sessions AS sessions,
     matching_urls
WHERE recording_data.usp_id = sessions.usp_id
  AND (1 = 1)
  AND sessions.referrer_id = recordings_urls.id
  AND (urls && array(SELECT id FROM matching_urls)::text[])
  AND r_time > to_timestamp(1542585600)
  AND r_time < to_timestamp(1545107599)
  AND recording_data.duration >= 5
  AND recording_data.num_of_pages > 0;

But it was still very slow.


Finding the culprit


All this time one little detail kept flashing before my eyes, and I kept waving it away. But with nothing else left, I decided to take a look at it: the && operator. While EXISTS merely improved performance, && was the only common factor remaining across all versions of the slow query.


The documentation tells us that && is the array overlap operator: it is used to find the common elements of two arrays.
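
As a quick standalone illustration of the operator's semantics (not from the original investigation):

-- && returns true if the two arrays share at least one element.
SELECT ARRAY['a', 'b', 'c'] && ARRAY['c', 'd'];  -- true
SELECT ARRAY['a', 'b'] && ARRAY['x', 'y'];       -- false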


In the original query it appears as:


AND (urls && array(
  SELECT id FROM acc_{account_id}.urls
  WHERE url ILIKE '%enterprise_customer.com/jobs%'
)::text[])

That is, we run a pattern search over our URL table and then intersect the result with the URLs of each recording. Somewhat confusingly, urls here refers not to the table containing all the URLs but to the urls column of the recording_data table.


With suspicion of && growing, I looked for confirmation in the query plan generated by EXPLAIN ANALYZE (I already had a saved plan, but I'm usually more comfortable experimenting in SQL than trying to decipher the opacity of query planners).
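
For reference, such a plan is obtained by prefixing the query with EXPLAIN ANALYZE, which actually executes it and prints per-node row counts and timings; a generic sketch:

-- EXPLAIN ANALYZE executes the query and reports the actual plan,
-- including how many rows each filter discarded.
EXPLAIN ANALYZE
SELECT count(*)
FROM acc_{account_id}.recording_data
WHERE num_of_pages > 0;

The relevant fragment of our saved plan looked like this: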


Filter: ((urls && ($0)::text[])
     AND (r_time > '2018-12-17 12:17:23+00'::timestamp with time zone)
     AND (r_time < '2018-12-18 23:59:59+00'::timestamp with time zone)
     AND (duration >= '5'::double precision)
     AND (num_of_pages > 0))
Rows Removed by Filter: 52710

There were several such filter lines involving && alone, which meant the operation was not only expensive but also executed multiple times.


I verified this by isolating the condition:


SELECT 1
FROM acc_{account_id}.urls AS recordings_urls,
     acc_{account_id}.recording_data_30 AS recording_data_30,
     acc_{account_id}.sessions_30 AS sessions_30
WHERE urls && array(
  SELECT id FROM acc_{account_id}.urls
  WHERE url ILIKE '%enterprise_customer.com/jobs%'
)::text[];

This query was slow on its own. Since the JOINs were fast and the subquery was fast, only the && operator was left.


And it really is the key operation: we always have to scan the entire main URL table for the pattern, and we always have to compute intersections. We cannot pattern-match on the recordings directly, because what they store are just IDs referring to urls.


Towards a solution


&& is slow because both sets are huge. The operation would be relatively quick if I replaced urls with a small literal array such as { "http://google.com/", "http://wingify.com/" }.


I started looking for a way to compute the set intersection in Postgres without &&, without much success.
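
As an aside, a set intersection can be expressed generically by unnesting both arrays and using INTERSECT; a standalone sketch (this does not avoid the underlying cost of scanning both sets, which is presumably why no alternative helped):

-- Generic array intersection via unnest + INTERSECT.
SELECT unnest(ARRAY['a', 'b', 'c'])
INTERSECT
SELECT unnest(ARRAY['c', 'd']);  -- returns 'c'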


In the end we decided to solve the problem in isolation: give me all rows of urls whose URL matches the pattern. Without the additional conditions it looks like this:


SELECT urls.url
FROM acc_{account_id}.urls AS urls,
     (SELECT unnest(recording_data.urls) AS id) AS unrolled_urls
WHERE urls.id = unrolled_urls.id
  AND urls.url ILIKE '%jobs%'

Instead of JOIN syntax, I used a subquery with unnest (seen in the intersection sketch above) to unroll the recording_data.urls array, so that the condition can be applied directly in WHERE.


The key observation here is that && is used to check whether a recording contains a matching URL. If you squint a little, you can see this operation walking through array elements (or table rows) and stopping as soon as the condition (a match) is met. Remind you of anything? Yep, EXISTS.
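
A minimal standalone illustration of that short-circuit behavior (hypothetical, not one of our production queries):

-- EXISTS returns as soon as the first matching row is found,
-- rather than scanning the whole set and materializing every match.
SELECT EXISTS(
  SELECT 1 FROM generate_series(1, 1000000) AS g(n) WHERE n = 5
);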


Since recording_data.urls can be referenced from outside the subquery context, we can return to our old friend EXISTS and wrap the subquery in it.


Putting everything together, we get the final optimized query:


SELECT count(*)
FROM acc_{account_id}.urls AS recordings_urls,
     acc_{account_id}.recording_data AS recording_data,
     acc_{account_id}.sessions AS sessions
WHERE recording_data.usp_id = sessions.usp_id
  AND (1 = 1)
  AND sessions.referrer_id = recordings_urls.id
  AND r_time > to_timestamp(1542585600)
  AND r_time < to_timestamp(1545177599)
  AND recording_data.duration >= 5
  AND recording_data.num_of_pages > 0
  AND EXISTS(
    SELECT urls.url
    FROM acc_{account_id}.urls AS urls,
         (SELECT unnest(urls) AS rec_url_id
          FROM acc_{account_id}.recording_data) AS unrolled_urls
    WHERE urls.id = unrolled_urls.rec_url_id
      AND urls.url ILIKE '%enterprise_customer.com/jobs%'
  );

And the final timing: Time: 1898.717 ms. Time to celebrate?!


Not so fast! First we had to verify correctness. I was extremely suspicious of the EXISTS optimization, since it changes the logic by terminating early; we had to be sure we had not introduced a subtle error into the query.


A simple check was to compare count(*) between the slow and fast queries over a large number of different datasets. Then, for a small subset of the data, I verified all the results manually.
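
A minimal sketch of such a comparison, assuming the slow and fast variants have been saved as views named slow_variant and fast_variant (hypothetical names, purely for illustration):

-- Any mismatch in counts means the optimization changed the semantics.
SELECT (SELECT count(*) FROM slow_variant) AS slow_count,
       (SELECT count(*) FROM fast_variant) AS fast_count,
       (SELECT count(*) FROM slow_variant) =
         (SELECT count(*) FROM fast_variant) AS counts_match;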


All the checks came back consistently positive. We had fixed it!


Lessons Learned


Many lessons can be learned from this story:


  1. Query plans don't tell the whole story, but they can give hints.
  2. The prime suspects are not always the real culprits.
  3. Slow queries can be broken apart to isolate bottlenecks.
  4. Not all optimizations are reductive by nature.
  5. Using EXISTS where possible can yield dramatic performance gains.

Conclusion


We went from a query time of ~24 minutes down to 2 seconds, a very serious performance gain! Although this article came out long, all the experiments happened on the same day and, by my estimate, took 1.5 to 2 hours of optimization and testing.


SQL is a wonderful language if you are not afraid of it but try to learn and use it. With a good understanding of how SQL queries are executed, how the database builds query plans, how indexes work, and simply how large the data you are dealing with is, you can be very successful at query optimization. It is equally important, though, to keep trying different approaches and slowly break the problem apart to find the bottlenecks.


The best part of achieving results like these is the visible, tangible improvement in speed: a report that previously would not load at all now loads almost instantly.


Special thanks to my teammates Aditya Misra, Aditya Gauru, and Varun Malhotra for brainstorming, and to Dinkar Pandera for finding an important mistake in our final query before we finally said goodbye to it!



Source: https://habr.com/ru/post/455832/
