The ETL capabilities of IBM DataStage cover a fairly wide range of requirements that arise in data integration tasks, but, sooner or later, there is a need to extend the functionality by implementing C Parallel Routines or creating Java classes that are later used in Java Transformer or java client. The rather limited capabilities of the built-in Basic language have long been outdated and cannot be regarded as a serious help (for example, it is impossible to use XML structures, or another example - try writing MD5 hashing using Basic. This is possible, but it will take considerable time to develop and debug ).
Anyway, I would like to have a fairly flexible tool that allows you to work with a data stream that does not require recompilation of its source codes and which could be used in the DataStage Client editor. My colleague and close friend was asked to develop a Groovy Transformer. About him and will be discussed in this article.
Why groovy? Because this language is quite flexible and has all the features of Java, as it is an add-on to this language, but in addition it offers the following advantages to developers:
- Native syntax for hashes (associative arrays) and lists, so instead of the following Java code
import java.util.* ; … HashMap<String,String> someMap=new HashMap<String,String>() ; someMap.put("Key1", "Value") ; … String valFromMap=someMap.get("Key1") ;
You can use the following Groovy code (java.util. * in it is already imported by default):
HashMap someMap=new HashMap(); someMap.Key1= "Value" ; … valFromMap=someMap.Key1;
- Interpreting strings (why is everyone called interpolation?):
def NumberOfItems= 10; println "Number of items=$NumberOfItems";
- Dynamic code execution, you can execute code stored in a string variable, in a stream or in files without compilation:
def GroovyCode='sum=a+b; println "Sum is $sum"'; Binding binding=new Binding(); binding.setVariable("a", 10); binding.setVariable("b", 20); GroovyShell shell = new GroovyShell(binding); shell.evaluate(GroovyCode);
- Native support for XML and JSON:
def writer=new StringWriter() ; def xmlOut=new groovy.xml.MarkupBuilder(writer) ; xmlOut.JobsInfo{ Job(name : 'ODS_MOUVEMENT_C') { precedants { precedant "ODS_ECRITURE_C" precedant "ODS_TEC_DEB" } } } println writer;
This example should print the following XML:
<JobsInfo> <Job name='ODS_MOUVEMENT_C'> <precedants> <precedant>ODS_ECRITURE_C</precedant> <precedant>ODS_TEC_DEB</precedant> </precedants> </Job> </JobsInfo>
if you want to generate JSON, you need to replace groovy.xml.MarkupBuilder with groovy.json.JsonBuilder.
The main idea of ​​Groovy Transformer is to use Groovy code in Java Transformer. Since the code can be written directly in this stage, you can decide for yourself what to do - execute the Groovy code that is stored in the file, comes from the job parameters, or which you write yourself.
So we need to learn how to create a Java Transformer. Those who already know how to do this can skip this section. But I will try briefly, since the documentation for this part is written in sufficient detail.
So, to create a Java transformer, we need to create a class that inherits from the Stage class:
import com.ascentialsoftware.jds.* ; class MyJavaTransformer extends Stage{ }
And it is necessary to implement the three most frequently used methods: initialize (), process () and terminate ().
The initialize () method is executed before the page processes the stream and can contain declarations of objects that you intend to use throughout the life of the transformer.
The process () method is executed for each line of the input stream and must contain the logic of your processing.
The terminate () method is executed at the end of the transformer’s existence and it can contain actions for deleting temporary objects (yes, I know that there are no destructors in Java, any garbage you used: files, tables, whatever it is).
Note for parallel mode transformer: DataStage runs a separate Java machine for each node. In other words, if you have four nodes, then DataStage will launch four JVMs. Since the virtual machines are isolated, you do not have acceptable ways to exchange data between the threads running in each of them.
')
Now we are ready to create a template for our Java-transformer:
import com.ascentialsoftware.jds.*; public class MyJavaTransformer extends Stage { public void initialize() { trace("Init"); } public void terminate() { trace("Terminate"); } public int process() { return 0; } }
To read the rows entering and leaving the transformer, you can use the Row object and two methods: readRow () to access the values ​​of the input stream and writeRow () to write to the output.
The Row object also allows you to get the metadata of each column and allows you to get the values ​​of these columns. The following example demonstrates how to replace the values ​​of all columns that are of type VarChar with the value “Hello from the Java”; all other columns are “pushed” without any further changes:
public int process() { Row inputRow=readRow() ; if (inputRow == null)
Note : To compile the transformer class, do not forget to import the tr4j.jar library into the class path or into your IDE.
Now we can formulate the requirements for our Groovy Transformer.
Groovy Transformer is a JavaTransformer that compiles Groovy code on the fly. It contains syntactic sugar to facilitate the routine operations that have to be performed when working with the Stage class.
So, our transformer should:
- Get Groovy code from the Stage-> Properties tab of the Java Transformer Stage;
- Allow access to input and output stream metadata;
- Allow to manipulate column data by name (instead of column number).
In accordance with these requirements, Groovy Transformer was created, which you can download here:
http://geckelberryfinn.ru/fr/GroovyTransformer.html . (Caution! This Java Transformer is also written in Groovy =), there will be problems with decompiling).
Groovy Transformer predetermines the following objects:
An object | Description | Example |
---|
Gtransformer | Object. Link to this class Stage. Contains all the methods and attributes of this class. | GTransformer.createOutputRow () |
OutputMatching | HashMap. Contains matching column names and their indexes. | OutputMatching.get (k); OutputMatching.ID; OutputMatching.LIBL;
|
Metadata | HashMap. Contains information about the input flow method columns. | MetaData.ID.Description; MetaData.ID.Derivation; MetaData.ID.SQLType: MetaData.ID.DataElementName;
|
OutputMetaData | HashMap. Contains information about the method output stream | OutputMetaData.ID.Description; OutputMetaData.ID.Derivation; OutputMetaData.ID.SQLType: OutputMetaData.ID.DataElementName;
|
InputColumns | HashMap. Contains all input columns | InputColumns.ID; InputColumns.LIBL;
|
OutputRows | List <, HashMap>. List of lines of the output stream. You can use this object when the number of lines leaving more than the number of incoming lines. | HashMap curRow = new HashMap (); outputRows [0] = curRow; outputRows [0] .ID = 0; outputRows [0] .LIBL = "First item";
|
Thus, there are two ways to create an output stream:
- Fill in the list OutputRows;
- Call the createOutputRow () method and then the writeRow of the GTransformer object.
Which way to use depends on the specific situation.
To start using Groovy Transformer in your projects, it would be nice to fill in the properties of the Java Transformer steage:
- Transformer Class Name: groovytransformer.GroovyTransformer
- User's Classpath: /path/to/jar/GroovyTransformer.jar
Below I will give a few examples of using Groovy Transformer:
- Easily extract data from a JDBC source.
In this example, we will extract data from the ODS.AXE_LOCAL table using Oracle as a DBMS, although there is a separate stage for it. But, in fact, this code can be used in cases when it is necessary to obtain data from “exotic data sources” (Interbase, FoxPro, VectorWise, Derby, SQLite, H2, etc.).
This Groovy code is quite versatile, since it does not use specific information other than an SQL query. In other words, it can be used in any cases.
In order to run this example, do not forget to add a list of columns in the Output tab of the transformer with case-sensitive names.
sql = Sql.newInstance( 'jdbc:oracle:thin:@oraclehost:1533/nomdebdd', 'user', 'password', 'oracle.jdbc.OracleDriver' ); sql.eachRow('select id, LIBL_AXE_CENTRAL, SYS_CREATED_DAT as DAT_DEB_VALID from ODS.ODS_AXE_CENTRAL') { it -> GRow=GTransformer.createOutputRow(); i=0; it.toRowResult().each{k,v-> i++; if (v instanceof java.sql.Date) GRow.setValueAsSQLTyped(OutputMatching.get(k),it.getTimestamp(i)); else GRow.setValueAsSQLTyped(OutputMatching.get(k),v) ; } GTransformer.writeRow(GRow); }
- Use regular expressions.
In this example, we will try to extract information about the contract (its number and date) from the lines of the form:
"Contract FU-1009 dated 10/01/1960"
“The Service Agreement FU-1019 for 10/01/1961”
“Comp. Number FU-1001 dated 10/01/1962”
and we would like to receive two output columns at the exit: the contract number and the date of its conclusion.
In this case, you can use the following regular expressions:
- For contract number: [AZ] + (- [AZ]) + (- [0-9] +)
- For its date: [0-3]? [0-9] (/ | \\.) [0-1]? [0-9] (/ | \\.) (19 | 20)? [0-9 ] {2}
Then for this example, the Groovy code might be like this (note that there is another way to generate the output stream):
String StrIn=InputColumns.contrat contRegExp="[AZ]+(-[AZ])+(-[0-9]+)" datRegExp="[0-3]?[0-9](/|\\.)[0-1]?[0-9](/|\\.)(19|20)?[0-9]{2}" def matchRegEx(str, regExp) { matcher=(str=~regExp); if (matcher.getCount()!=0) return matcher[0][0]; else return null; } def cont=matchRegEx(StrIn, contRegExp); def dat =matchRegEx(StrIn, datRegExp); HashMap curRow=new HashMap(); OutputRows[0]=curRow; OutputRows[0].contrat=StrIn; OutputRows[0].numero=cont; OutputRows[0].date=dat;
Some useful links:
- Groovy Reference: http://groovy.codehaus.org/Documentation
- Java Transformer Example: http://www.ibm.com/developerworks/data/library/techarticle/dm-1106etljob/index.html