In a large project, for various reasons (often historical, though not always), different parts may use different DBMSs to store and retrieve critical data. Among other things, this diversity is driven by competition and by the evolution of technology; one way or another, interaction between DBMSs is described by the SQL/MED (Management of External Data) part of the SQL:2003 standard, which introduces the notions of Foreign Data Wrappers (FDW) and Datalink.
The first part of the standard provides the means to read data from one or more external sources as a set of relational tables; an FDW can also offer a SQL interface to non-SQL data, such as files or, say, the list of messages in a mailbox. The second part, Datalink, allows a remote SQL server to be managed.
These two parts were implemented in PostgreSQL 9.1 and are called FDW and dblink respectively. FDW support in PostgreSQL is made as flexible as possible, allowing wrappers to be designed for a great many external sources. Today there are FDWs for PostgreSQL, Oracle, SQL Server, MySQL, Cassandra, Redis, RethinkDB, and LDAP, as well as FDWs for file formats such as CSV, JSON, and XML.
In this article we will look at how to connect PostgreSQL to MySQL and how to make the resulting queries run efficiently.
To begin with, build and install mysql_fdw:
git clone https://github.com/EnterpriseDB/mysql_fdw.git
cd mysql_fdw
# on RHEL-like systems pg_config lives in /usr/pgsql-9.5/bin
PATH=$PATH:/usr/pgsql-9.5/bin USE_PGXS=1 make install
Create the extension in the database to load the required libraries:
CREATE EXTENSION mysql_fdw;
Create a server:
CREATE SERVER mysql_server_data FOREIGN DATA WRAPPER mysql_fdw OPTIONS (host '127.0.0.1', port '3306');
And create a mapping from the current PostgreSQL user to a MySQL user:
CREATE USER MAPPING FOR user SERVER mysql_server_data OPTIONS (username 'data', password 'datapass');
After that we can attach a MySQL table as a foreign table in PostgreSQL:
CREATE FOREIGN TABLE orders_2014 (
    id int,
    customer_id int,
    order_date timestamp
) SERVER mysql_server_data OPTIONS (dbname 'data', table_name 'orders');
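To make sure the link actually works, we can run a quick sanity check against the foreign table (this check is our addition, not part of the original article):

-- should pull rows straight from MySQL through the wrapper
SELECT * FROM orders_2014 LIMIT 5;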
Suppose we store the customers directory in PostgreSQL:
CREATE TABLE customers (id serial, name text);
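The article does not show its test data, so to follow along you could populate both tables with a few rows, for instance like this (a minimal sketch of our own; writing through the foreign table assumes a mysql_fdw build with INSERT support, otherwise load the orders directly in MySQL):

INSERT INTO customers (name)
SELECT 'customer_' || i FROM generate_series(1, 12) AS i;

-- test orders written through the foreign table into MySQL
INSERT INTO orders_2014 (id, customer_id, order_date)
SELECT i, 1 + (i % 12), timestamp '2014-01-01' + (i || ' days')::interval
FROM generate_series(1, 66) AS i;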
Let's try to select the five most active customers of January 2014:
explain (analyze, verbose)
select count(o2014.id), c.name
from orders_2014 o2014
inner join customers c on c.id = o2014.customer_id
where extract('month' from o2014.order_date) = 1
  and extract('year' from o2014.order_date) = 2014
group by 2
order by 1 desc
limit 5;
Limit  (cost=1285.32..1285.34 rows=5 width=36) (actual time=0.276..0.276 rows=5 loops=1)
  Output: (count(o2014.id)), c.name
  ->  Sort  (cost=1285.32..1285.82 rows=200 width=36) (actual time=0.275..0.275 rows=5 loops=1)
        Output: (count(o2014.id)), c.name
        Sort Key: (count(o2014.id)) DESC
        Sort Method: quicksort  Memory: 25kB
        ->  HashAggregate  (cost=1280.00..1282.00 rows=200 width=36) (actual time=0.270..0.271 rows=5 loops=1)
              Output: count(o2014.id), c.name
              Group Key: c.name
              ->  Merge Join  (cost=1148.00..1248.25 rows=6350 width=36) (actual time=0.255..0.264 rows=8 loops=1)
                    Output: o2014.id, c.name
                    Merge Cond: (o2014.customer_id = c.id)
                    ->  Sort  (cost=1059.83..1062.33 rows=1000 width=8) (actual time=0.240..0.241 rows=8 loops=1)
                          Output: o2014.id, o2014.customer_id
                          Sort Key: o2014.customer_id
                          Sort Method: quicksort  Memory: 25kB
                          ->  Foreign Scan on public.orders_2014 o2014  (cost=10.00..1010.00 rows=1000 width=8) (actual time=0.065..0.233 rows=8 loops=1)
                                Output: o2014.id, o2014.customer_id
                                Filter: ((date_part('month'::text, o2014.order_date) = '1'::double precision) AND (date_part('year'::text, o2014.order_date) = '2014'::double precision))
                                Rows Removed by Filter: 58
                                Local server startup cost: 10
                                Remote query: SELECT `id`, `customer_id`, `order_date` FROM `data`.`orders`
                    ->  Sort  (cost=88.17..91.35 rows=1270 width=36) (actual time=0.011..0.011 rows=9 loops=1)
                          Output: c.name, c.id
                          Sort Key: c.id
                          Sort Method: quicksort  Memory: 25kB
                          ->  Seq Scan on public.customers c  (cost=0.00..22.70 rows=1270 width=36) (actual time=0.004..0.005 rows=12 loops=1)
                                Output: c.name, c.id
As we can see, the query is inefficient: the contents of the entire table were fetched from the MySQL side (Remote query: SELECT `id`, `customer_id`, `order_date` FROM `data`.`orders`). Due to natural limitations of the MySQL driver, the server cannot transform the query so that the extract() conditions could be evaluated on the MySQL side to produce the correct result, so it first receives the whole table and only then performs the filtering. However, by rewriting the query we can make the date filtering happen on the MySQL side:
explain (analyze, verbose)
select count(o2014.id), c.name
from orders_2014 o2014
inner join customers c on c.id = o2014.customer_id
where o2014.order_date between '2014-01-01' and ('2014-02-01'::timestamptz - '1 sec'::interval)
group by 2
order by 1 desc
limit 5;
Strictly speaking, comparing order_date with ('2014-02-01'::timestamptz - '1 sec'::interval) is incorrect, because timestamptz is stored with finer-than-second precision; but this value was not chosen at random, as we are about to see:
Limit  (cost=1285.32..1285.34 rows=5 width=36) (actual time=0.130..0.130 rows=0 loops=1)
  Output: (count(o2014.id)), c.name
  ->  Sort  (cost=1285.32..1285.82 rows=200 width=36) (actual time=0.129..0.129 rows=0 loops=1)
        Output: (count(o2014.id)), c.name
        Sort Key: (count(o2014.id)) DESC
        Sort Method: quicksort  Memory: 25kB
        ->  HashAggregate  (cost=1280.00..1282.00 rows=200 width=36) (actual time=0.114..0.114 rows=0 loops=1)
              Output: count(o2014.id), c.name
              Group Key: c.name
              ->  Merge Join  (cost=1148.00..1248.25 rows=6350 width=36) (actual time=0.111..0.111 rows=0 loops=1)
                    Output: o2014.id, c.name
                    Merge Cond: (o2014.customer_id = c.id)
                    ->  Sort  (cost=1059.83..1062.33 rows=1000 width=8) (actual time=0.110..0.110 rows=0 loops=1)
                          Output: o2014.id, o2014.customer_id
                          Sort Key: o2014.customer_id
                          Sort Method: quicksort  Memory: 25kB
                          ->  Foreign Scan on public.orders_2014 o2014  (cost=10.00..1010.00 rows=1000 width=8) (actual time=0.093..0.093 rows=0 loops=1)
                                Output: o2014.id, o2014.customer_id
                                Local server startup cost: 10
                                Remote query: SELECT `id`, `customer_id` FROM `data`.`orders` WHERE ((`order_date` >= '2014-01-01 00:00:00+00')) AND ((`order_date` <= ('2014-01-02 00:00:00+00' - '00:00:01')))
                    ->  Sort  (cost=88.17..91.35 rows=1270 width=36) (never executed)
                          Output: c.name, c.id
                          Sort Key: c.id
                          ->  Seq Scan on public.customers c  (cost=0.00..22.70 rows=1270 width=36) (never executed)
                                Output: c.name, c.id
And here a problem awaits us, one that makes it worth using mysql_fdw with great care:
SELECT `id`, `customer_id` FROM `data`.`orders` WHERE ((`order_date` >= '2014-01-01 00:00:00+00')) AND ((`order_date` <= ('2014-01-02 00:00:00+00' - '00:00:01')))
As we can see, BETWEEN, being syntactic sugar, was expanded into two conditions, one of which, ('2014-02-01'::timestamptz - '1 sec'::interval), was not evaluated on the PostgreSQL side but was sent to MySQL as the difference of two strings (rather than of a timestamp and an interval):
mysql> select '2014-01-02 00:00:00+00' - '00:00:01';
+---------------------------------------+
| '2014-01-02 00:00:00+00' - '00:00:01' |
+---------------------------------------+
|                                  2014 |
+---------------------------------------+
1 row in set, 2 warnings (0.00 sec)
Consequently, the query returns an incorrect result.
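For reference, the computation MySQL was expected to perform looks like this in MySQL's own INTERVAL syntax (our own check, not part of the original session):

SELECT TIMESTAMP('2014-01-02 00:00:00') - INTERVAL 1 SECOND;
-- returns 2014-01-01 23:59:59, a proper DATETIME rather than the number 2014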
One of our clients ran into exactly this problem. It was fixed in the Postgres Professional fork, https://github.com/postgrespro/mysql_fdw, and a pull request was submitted to the main EnterpriseDB repository. Install the fixed version:
git clone https://github.com/postgrespro/mysql_fdw.git mysql_fdw_pgpro
cd mysql_fdw_pgpro
PATH=$PATH:/usr/pgsql-9.5/bin USE_PGXS=1 make install
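Note that a backend loads the wrapper's shared library on first use, so open a new session (or restart the server) to pick up the rebuilt library; on a RHEL-like system the restart could look like this (the service name is an assumption):

systemctl restart postgresql-9.5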
Now the query plan looks like this:
Limit  (cost=1285.32..1285.34 rows=5 width=36) (actual time=0.219..0.219 rows=5 loops=1)
  Output: (count(o2014.id)), c.name
  ->  Sort  (cost=1285.32..1285.82 rows=200 width=36) (actual time=0.218..0.218 rows=5 loops=1)
        Output: (count(o2014.id)), c.name
        Sort Key: (count(o2014.id)) DESC
        Sort Method: quicksort  Memory: 25kB
        ->  HashAggregate  (cost=1280.00..1282.00 rows=200 width=36) (actual time=0.199..0.201 rows=5 loops=1)
              Output: count(o2014.id), c.name
              Group Key: c.name
              ->  Merge Join  (cost=1148.00..1248.25 rows=6350 width=36) (actual time=0.183..0.185 rows=8 loops=1)
                    Output: o2014.id, c.name
                    Merge Cond: (o2014.customer_id = c.id)
                    ->  Sort  (cost=1059.83..1062.33 rows=1000 width=8) (actual time=0.151..0.151 rows=8 loops=1)
                          Output: o2014.id, o2014.customer_id
                          Sort Key: o2014.customer_id
                          Sort Method: quicksort  Memory: 25kB
                          ->  Foreign Scan on public.orders_2014 o2014  (cost=10.00..1010.00 rows=1000 width=8) (actual time=0.116..0.120 rows=8 loops=1)
                                Output: o2014.id, o2014.customer_id
                                Local server startup cost: 10
                                Remote query: SELECT `id`, `customer_id` FROM `data`.`orders` WHERE ((`order_date` >= '2014-01-01 00:00:00+00')) AND ((`order_date` <= ('2014-01-02 00:00:00+00' - INTERVAL 1 SECOND)))
                    ->  Sort  (cost=88.17..91.35 rows=1270 width=36) (actual time=0.030..0.030 rows=9 loops=1)
                          Output: c.name, c.id
                          Sort Key: c.id
                          Sort Method: quicksort  Memory: 25kB
                          ->  Seq Scan on public.customers c  (cost=0.00..22.70 rows=1270 width=36) (actual time=0.018..0.020 rows=12 loops=1)
                                Output: c.name, c.id
This query runs faster than the first one, since MySQL now answers a much more targeted query:
SELECT `id`, `customer_id` FROM `data`.`orders` WHERE ((`order_date` >= '2014-01-01 00:00:00+00')) AND ((`order_date` <= ('2014-01-02 00:00:00+00' - INTERVAL 1 SECOND)))
The filtering is now performed on the MySQL side, and, under suitable conditions, an index on order_date will be used if one exists.
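For instance, the index can be created on the MySQL side, and the one-second subtraction can be avoided altogether with a half-open range; both steps below are our additions to the article's setup, and whether the plain comparisons are pushed down unchanged is worth confirming with EXPLAIN VERBOSE:

-- on the MySQL side: an index the pushed-down range condition can use
CREATE INDEX idx_orders_order_date ON orders (order_date);

-- on the PostgreSQL side: a half-open range, correct regardless of
-- sub-second precision
select count(o2014.id), c.name
from orders_2014 o2014
inner join customers c on c.id = o2014.customer_id
where o2014.order_date >= '2014-01-01'
  and o2014.order_date < '2014-02-01'
group by 2
order by 1 desc
limit 5;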
Thus we sped up the query. On a simple example we have seen the power of open source and of PostgreSQL's extensibility.
Thanks for your attention!
» Learn more about SQL-MED
» Download the revised version of mysql_fdw
Source: https://habr.com/ru/post/308690/