Introduction
This article may be of interest to those who use SAS ETL tools to build a data warehouse. We recently completed the active phase of a project to migrate our warehouse to a Greenplum database. Before that, SAS datasets served as the storage layer, i.e. the tables were effectively files on the file system. At some point it became clear that data volumes were growing faster than we could scale file system performance, and we decided to switch to a specialized database.
When we started the project, it was impossible to find anything on the Internet about using SAS DIS together with Greenplum. In this article I would like to highlight the main points of the transition and the difficulties encountered along the way.
An additional complication was that we could not build the processes from scratch; we had to rework existing ones, otherwise the timeline and cost of the project would have been unacceptable. Historically we use SAS, and in particular SAS Data Integration Studio, as our ETL tool. Each ETL process here is a so-called job, which has input tables, the logic for processing them, and output table(s). The daily warehouse load consists of roughly 800 such jobs. We had to rework them in such a way as to move the input/output tables to Greenplum.

SAS/Access for Greenplum
How does code written in SAS Base work with tables in a database? For various databases there is a family of SAS products, usually called SAS/Access for <database name>. The connector for Greenplum is an ODBC driver with a special wrapper that allows it to be used from SAS. The connector provides two ways to access data (a short sketch of both follows the list):
- A database schema is mapped to a SAS libname, and its tables can be worked with like ordinary SAS datasets. In this case the SAS code is implicitly translated into instructions for the database, but if such translation is not possible (for example, if some SAS function is used that has no equivalent in Greenplum), the query is processed on the SAS side.
- SQL pass-through. In proc sql you can write code that is passed to the ODBC driver essentially as is.
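A minimal sketch of both access modes, assuming a licensed SAS/Access for Greenplum engine (greenplm); the server, credentials, schema and table names below are placeholders:

/* libname access: a Greenplum schema mapped to a SAS library */
libname gp greenplm server="gp-master" port=5432
        user=etl_user password="XXXX" database=dwh schema=stage;

/* used like an ordinary SAS dataset; the where clause is pushed
   down to Greenplum when it can be translated */
data work.active_customers;
    set gp.customers;
    where status = 'ACTIVE';
run;

/* explicit SQL pass-through: the query text reaches the driver as is */
proc sql;
    connect to greenplm (server="gp-master" port=5432
                         user=etl_user password="XXXX" database=dwh);
    create table work.top_customers as
    select * from connection to greenplm (
        select customer_id, sum(amount) as total
        from stage.sales
        group by customer_id
        order by total desc
        limit 100
    );
    disconnect from greenplm;
quit;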
In SAS Data Integration Studio, only the “SQL Join” transform is able to generate code that uses the pass-through mechanism; the rest generate ordinary SAS code, which needs to be checked (via the log) to see what it finally sends to the connector. We ran into cases of incorrectly generated code.
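One way to see in the log exactly which SQL statements reach the database is to switch on tracing of the SAS/Access engine; a minimal sketch using the SASTRACE system option:

/* write the SQL sent to the DBMS into the SAS log */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

With this set, the statements passed to the connector appear in the job log, which makes it possible to catch transforms whose generated code falls back to SAS-side processing or is simply wrong.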
Work -> Greenplum Work
By default, jobs in SAS use Work as the place for temporary tables - a directory created on disk at the start of a session, accessible only to the owner of the current process and deleted immediately after it finishes. Work contains SAS datasets, to which the full SAS Base language is applicable, which makes ETL development very fast. This isolation also makes it easy to clean up after failed jobs and to control the amount of disk space used. When migrating jobs to Greenplum, some (or all) of the intermediate tables moved from Work to Greenplum, and the question arose: where should these tables live in the database?
SAS does not allow working with temporary tables in Greenplum, so one option was to use a single dedicated schema for them. But this approach had several significant drawbacks:
- The code automatically generated by SAS DI Studio does not delete tables that are no longer needed. Some kind of cleanup procedure would have to be embedded in every job, since keeping the full set of intermediate ETL tables in the database all the time is too wasteful.
- Name conflicts may occur when jobs run in parallel.
- If any problems arise, it is difficult to identify which job a table belongs to.
As a result, another approach was chosen: reproduce the SAS behavior in the database. Each SAS session creates a schema in Greenplum into which all intermediate tables of that session are written. Some details:
- One such “work” schema corresponds to one SAS session. At the start of the SAS session a schema is created in Greenplum and a SAS library is assigned to it.
- The name of the created schema encodes the necessary information, such as the path to the SAS Work directory, the host, the user and the process id:
%let LOAD_ETL_WRK_SCH_NAME=work_%substr(%sysfunc(pathname(work)),%sysfunc(prxmatch(/(?<=SAS_work)./,%sysfunc(pathname(work)))),12)_&SYSUSERID._srv%substr(&SYSHOSTNAME.,%eval(%length(&SYSHOSTNAME.)-1))_&SYSJOBID.;
- A “cleaner” of temporary Greenplum schemas runs on cron. For each “work_” schema it checks whether the corresponding SAS session still exists and, if it does not, drops the schema.
In DI Studio an ODBC library is created over the &LOAD_ETL_WRK_SCH_NAME schema, and intermediate tables are assigned to it when needed (a sketch of this start-up logic is shown below).
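A minimal sketch of the session start-up logic, assuming the macro variable above has already been resolved; connection parameters and the library name are placeholders:

/* create the per-session "work" schema in Greenplum */
proc sql;
    connect to greenplm (server="gp-master" port=5432
                         user=etl_user password="XXXX" database=dwh);
    execute (create schema &LOAD_ETL_WRK_SCH_NAME.) by greenplm;
    disconnect from greenplm;
quit;

/* assign a SAS library to the freshly created schema;
   intermediate tables of the job are written here */
libname gpwork greenplm server="gp-master" port=5432
        user=etl_user password="XXXX" database=dwh
        schema=&LOAD_ETL_WRK_SCH_NAME.;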
Data transfer between SAS and Greenplum
With such a scheme of work, the speed of data transfer between SAS and Greenplum becomes one of the main issues. Data transfer from Greenplum to SAS always goes through the master node and is limited by the speed of the driver. The speed depends on the width and composition of the fields of the unloaded tables; on average we get about 50 MB/s.
Loading data from SAS into Greenplum is much more interesting. Greenplum allows bulk loading from text files. The essence of this mechanism is that an external file is defined as an external table (Greenplum accesses it via a special utility installed on the ETL hosts) and is loaded directly onto the segment hosts, bypassing the master. This greatly increases the load speed. From the SAS side the process looks like this: the table is unloaded into a csv file, and then Greenplum pulls this file into the database. However, it turned out that the speed of this process depends heavily on how fast the table can be unloaded from SAS to the csv file. Unloading to a file runs at up to 20-30 MB/s (CPU-bound), while loading the csv into Greenplum exceeds 150 MB/s. For large tables this resulted in a completely unsatisfactory overall load speed. We got a speed-up by splitting the loaded table into parts: several parallel threads are started, each working with its own piece of the table - unloading it to csv and inserting it into Greenplum. This allowed us to raise the load speed into Greenplum to ~90-100 MB/s.
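A sketch of the single-thread variant of this path, assuming the gpfdist utility is already serving a landing directory on the ETL host; all host names, ports, paths and table names are placeholders:

/* 1. Unload the SAS table to csv on the ETL host */
proc export data=work.big_table
    outfile="/data/gpfdist/big_table.csv"
    dbms=csv replace;
run;

/* 2. Define an external table over the file and load it,
      bypassing the master node */
proc sql;
    connect to greenplm (server="gp-master" port=5432
                         user=etl_user password="XXXX" database=dwh);
    execute (
        create readable external table stage.ext_big_table (like stage.big_table)
        location ('gpfdist://etl-host:8081/big_table.csv')
        format 'CSV' (header)
    ) by greenplm;
    execute (
        insert into stage.big_table select * from stage.ext_big_table
    ) by greenplm;
    execute (drop external table stage.ext_big_table) by greenplm;
    disconnect from greenplm;
quit;

If the licence includes it, SAS/Access for Greenplum can also drive this path through its bulk-load dataset options (BULKLOAD=YES and the BL_* options); the manual sketch above simply makes the moving parts explicit.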
ETL primitives
To work in DI Studio we had to rewrite some transforms whose standard generated code either worked suboptimally or worked with errors. These are the Table Loader and the SCD Type 2 Table Loader. In some places jobs had to be edited because of the move of tables to Greenplum: the standard “Lookup” transform, for example, works inefficiently if both input tables are in the database.
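When both tables already live in Greenplum, a pass-through join keeps the work inside the database instead of pulling both tables to the SAS side; a minimal sketch, with illustrative table and column names:

proc sql;
    connect to greenplm (server="gp-master" port=5432
                         user=etl_user password="XXXX" database=dwh);
    execute (
        create table stage.fact_enriched as
        select f.*, d.customer_name
        from stage.fact f
        left join stage.dim_customer d
          on f.customer_id = d.customer_id
        distributed by (customer_id)
    ) by greenplm;
    disconnect from greenplm;
quit;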
Instead of a conclusion
The article describes the main tasks that had to be solved during the migration. Much remains beyond its scope: problems with hash joins, table locking by user queries, partitioning and compression. If there is interest, we will describe these in more detail in future posts.