
Integrating SAS and Greenplum

Introduction

This article may be of interest to those who use SAS ETL tools to build a data warehouse. We recently completed the active phase of a project to migrate our warehouse to a Greenplum database. Before that, SAS datasets served as the storage, i.e. the tables were in fact files on the file system. At some point it became clear that data volumes were growing faster than we could scale the performance of the file system, and the decision was made to move to a specialized database.

When we started the project, it was impossible to find anything on the Internet about combining SAS DIS and Greenplum. In this article I would like to highlight the main points of the migration and the difficulties we ran into along the way.

The project was further complicated by the fact that we could not build processes from scratch but had to rework existing ones, otherwise the time and cost of the project would have been unacceptable. Historically we use SAS, and in particular SAS Data Integration Studio, as our ETL tool. Each ETL process here is a so-called job, which has input tables, the logic for processing them, and one or more output tables. The daily warehouse load consists of roughly 800 such jobs. We had to rework them in a way that would let us benefit from moving the input and output tables to Greenplum.


SAS/Access for Greenplum

How does code written in SAS Base work with tables in a database? For various databases there is a family of SAS products, usually called SAS/Access for <database name>. The connector for Greenplum is an ODBC driver with a special wrapper that makes it usable from SAS. The connector provides two ways to access data:

- implicit access through a library (LIBNAME): database tables are visible as members of a SAS library, and SAS translates the code applied to them into SQL where it can;
- explicit SQL pass-through: SQL written by the developer is sent to the database as-is.

In SAS Data Integration Studio only the "SQL Join" transform is able to generate code that uses the pass-through mechanism; the other transforms generate ordinary SAS code, which has to be checked (via the log) to see what it ultimately sends to the connector. We encountered examples of incorrectly generated code.
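
For illustration, a minimal sketch of both access modes, assuming the SAS/ACCESS to Greenplum engine (greenplm); the server, credentials, schema and table names are illustrative assumptions, and a plain ODBC libname with a DSN would look the same apart from the engine name:

```sas
/* Implicit access: Greenplum tables are exposed as members of a SAS library. */
/* Connection parameters here are illustrative assumptions.                   */
libname GP greenplm server="gp-master" port=5432
        user=etl password="***" database=dwh schema="dds";

proc means data=GP.sales;   /* SAS pushes down to Greenplum what it can translate */
  var amount;
run;

/* Explicit pass-through: the SQL inside the parentheses is sent to Greenplum as-is. */
proc sql;
  connect to greenplm as gp
    (server="gp-master" port=5432 user=etl password="***" database=dwh);
  create table work.top_clients as
    select * from connection to gp
      (select client_id, sum(amount) as amount
         from dds.sales
        group by client_id
        order by amount desc
        limit 100);
  disconnect from gp;
quit;
```
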
Work -> Greenplum Work

By default, jobs in SAS use Work as the place for temporary tables: a directory created on disk at the start of a session, accessible only to the owner of the process and deleted immediately after it finishes. Work contains SAS datasets, to which the full SAS Base language is applicable, which makes ETL development very fast. This isolation also makes it easy to clean up after failed jobs and to control the amount of disk space used. When jobs were migrated to Greenplum, some (or all) of the intermediate tables moved from Work to Greenplum, and the question arose of where to put these tables in the database.

SAS does not allow working with Greenplum temporary tables, so one option was to use a single dedicated schema for them. This approach, however, had several significant drawbacks.


As a result, another approach was chosen: reproduce the SAS behavior in the database. Each SAS session creates a schema in Greenplum into which all intermediate tables of that session are placed. Some details:

DI Studio creates an ODBC library pointing at the &LOAD_ETL_WRK_SCH_NAM schema, to which intermediate tables are assigned when necessary.
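
A minimal sketch of this approach, assuming the session schema name lives in the macro variable LOAD_ETL_WRK_SCH_NAM; the way the name is derived and the connection parameters are illustrative assumptions:

```sas
/* Derive a per-session schema name, e.g. from the SAS process id (an assumption). */
%let LOAD_ETL_WRK_SCH_NAM = etl_wrk_&sysjobid;

/* Create the session schema in Greenplum via pass-through SQL. */
proc sql;
  connect to greenplm as gp
    (server="gp-master" port=5432 user=etl password="***" database=dwh);
  execute (create schema &LOAD_ETL_WRK_SCH_NAM) by gp;
  disconnect from gp;
quit;

/* A library that plays the role of Work but lives in Greenplum. */
libname GP_WORK greenplm server="gp-master" port=5432
        user=etl password="***" database=dwh schema="&LOAD_ETL_WRK_SCH_NAM";
```

At the end of the session the schema can be dropped with a matching drop schema ... cascade, mirroring the way SAS deletes Work when the session ends.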

Data transfer between SAS and Greenplum

With this scheme of work, the speed of data transfer between SAS and Greenplum becomes one of the main issues. Data transfer from Greenplum to SAS always goes through the master node and is limited by the speed of the driver. The speed depends on the width and composition of the fields of the tables being unloaded; on average we get about 50 MB/s.

Loading data from SAS into Greenplum is much more interesting. Greenplum supports bulk loading from text files. The essence of the mechanism is that an external file is declared as an external table (Greenplum accesses it through a special utility installed on the ETL hosts) and is loaded directly onto the data hosts, bypassing the master. This greatly increases the load speed. From the SAS side the process looks like this: the table is unloaded into a csv file, and then Greenplum pulls that file into the database. It turned out, however, that the speed of the whole process depends heavily on how fast the table can be unloaded from SAS into the csv file. Unloading to a file runs at up to 20-30 MB/s (it is CPU-bound), while loading the csv into Greenplum exceeds 150 MB/s. For large tables this gave a completely unsatisfactory overall load speed. We got a speed-up by splitting the table being loaded into parts: several parallel threads are started, each working with its own piece of the table, unloading it into csv and inserting it into Greenplum. This made it possible to raise the load speed into Greenplum to roughly 90-100 MB/s, as sketched below.
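
A minimal sketch of one such loading thread, assuming the "special utility" is gpfdist already running on the ETL host; the file paths, host names and table names are illustrative:

```sas
/* 1. Unload a slice of the SAS table into a csv file served by gpfdist. */
proc export data=work.big_table(firstobs=1 obs=5000000)
            outfile="/data/gpfdist/big_table_part1.csv"
            dbms=csv replace;
run;

/* 2. Declare the file as an external table and insert it into the target, */
/*    so the data flows to the segment hosts and bypasses the master.      */
proc sql;
  connect to greenplm as gp
    (server="gp-master" port=5432 user=etl password="***" database=dwh);
  execute (
    create external table &LOAD_ETL_WRK_SCH_NAM..ext_big_table_part1
      (like dwh.big_table)
      location ('gpfdist://etl-host01:8081/big_table_part1.csv')
      format 'csv' (header)
  ) by gp;
  execute (
    insert into dwh.big_table
    select * from &LOAD_ETL_WRK_SCH_NAM..ext_big_table_part1
  ) by gp;
  disconnect from gp;
quit;
```

Running several such threads, each with its own firstobs/obs range and its own csv file, is what raised the overall load speed to the 90-100 MB/s mentioned above.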

ETL primitives

To work in DI Studio we had to rewrite some transforms whose standard generated code either worked suboptimally or produced errors: the Table Loader and the SCD Type 2 Loader. In some places jobs had to be edited because of the move of tables to Greenplum: the standard "Lookup" transform, for example, works inefficiently when both input tables are in the database.
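
For the "Lookup" case, the kind of replacement that keeps the work inside the database is a pass-through join; a sketch with made-up table and column names:

```sas
/* When both the fact table and the lookup table are already in Greenplum,  */
/* join them in the database instead of pulling both sides back into SAS.   */
proc sql;
  connect to greenplm as gp
    (server="gp-master" port=5432 user=etl password="***" database=dwh);
  execute (
    create table &LOAD_ETL_WRK_SCH_NAM..sales_enriched as
    select s.*, c.segment
      from dds.sales s
      left join dds.clients c
        on c.client_id = s.client_id
  ) by gp;
  disconnect from gp;
quit;
```
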

Instead of a conclusion

This article describes the main tasks we had to solve during the migration. Much remains beyond its scope: problems with hash joins, locking of tables by user queries, partitioning and compression. If there is interest, we will describe these in more detail in future posts.

Source: https://habr.com/ru/post/155763/

