The morning was quite ordinary and lazy: shower, coffee, cigarette ... It's time to get ready for work ...
Upon arrival at the office, I was greeted with the news about the transition to a new project as an ETL engineer (what it means I did not know, but oh well). Well, I think we will try. Before me, one guy worked there, but as always, I didn’t wait for a lot of help about his work. And so begin.
What is ETL? This is what the
wiki says:
ETL (from the English. Extract, Transform, Load - literally "extraction, transformation, loading") - one of the main processes in the management of data warehouses, which includes:
- extract data from external sources;
- their transformation and purification (eng. Data cleansing), so that they correspond to the needs of the business model;
- and loading them into the data warehouse.
I think this more or less clearly. The essence of my task was to pull data from some .xls document, convert it by type (most of the fields were defined as string), calculate something (depending on the case), and write the whole thing into the database. After lead to the
star schema mind, what is it you can read on the same
wiki .
So. Deal with the task. For work,
Talent Open Studio was used , and an example of how and what to do was made. And began a very long and boring job copy-paste.
Here is what a simplified example of data conversion looks like and writing to the database looks like this:
- A project is being created;
- A data source is added to it; (in our case .xml doc);
- Added data recipient (table in MySQL);
- Create a new job;
- We drop the source and the receiver into it, as well as the tMap component of the environment;
- For each field, we write in it a data conversion (for example, an envelope toInt || toFloat, delete extra characters, for example, "%", and so on;
- We press our Job to execute and see what happens;
')
so visually looks like a simple example of a job at TOS
There were cases where I had to write a fairly long check on the validity of the data, their parsing, and so on. After several such tasks and an annoying copy-peist, the idea of ​​optimizing the process matured, as it turned out in this IDE there is an opportunity to write your own static public classes. The first class for data conversion was written in a couple of minutes. And immediately the amount of marudnaya work decreased significantly. If, for example, every time I had to copy-paste:
obj.toString().equals("#") || obj.toString().equals("") || obj==null ? null
Integer.parseInt(StringHandling.EREPLACE(obj.toString().replace(" ", "").substring(0,obj.toString().replace(" ", "").indexOf(".")),"\\xA0", ""))
then it all went on in one line
routines.Convert.toInteger(obj)
But as they say, not everything is so simple, especially if you have to perform some operations with some data, and even more so if you have to work with data from the previous osprey, as in Talent Open Studio the data is transferred in rows, rather than as a whole array, and accessed to the previous osprey is not so easy, but rather I haven’t found a way except to use the buffer components.
More specifically, there is the following task: we have a set of values ​​with the following fields: “Year”, “Quarter”, “StartsFromJan1”, “SomeValue1”, “SomeValue2”, “SomeValue3”; the StartsFromJan1 field can take the values ​​“Y” (true) or “N” (false).
If the “StartsFromJan1” field is set to “Y” (true), then this line must be processed according to a set of rules, if “N” is left as it is. For example, the processing can occur as follows: the corresponding cell of the previous line is subtracted from the cell in the current line (if Q4 is subtracted Q3), taking into account the fact that if it has the value of the StartsFromJan1 == N field then subtract one more minus one line, and so on let's get to Q1, but if the StartsFromJan1 == Y field is in the row to be taken away, then we stop there, for clarity, I will give an example of the data:
Year | Quarter | StartsFromJan1 | Somevalue | Somevalue |
---|
2009 | Q1 | N | 3000 | 4,000 |
2009 | Q2 | Y | 3500 | 5000 |
2009 | Q3 | N | 4,000 | 6000 |
2009 | Q4 | Y | 5000 | 7,000 |
2009 | Q1 | N | 3500 | 4400 |
2009 | Q2 | Y | 3400 | 5600 |
2009 | Q3 | N | 4500 | 6500 |
2009 | Q4 | Y | 5600 | 7800 |
in our case, the first remains unchanged, we take away the first from the second, do not touch the third, subtract 3 and 2 from the fourth
the development of the problem the following problems
- Calculating data if the “StartsFromJan1” field is set to “Y”
- Weeding empty lines
- Minimize the amount of manual work
To solve the first problem, the same static classes came to the rescue. I used 3 one-dimensional arrays (three, because the string in which Quarter == Q1 is always “N” for the “StartsFromJan1” field), created some Int type variable, which after each processed cell was increased by one, and after processing the last cell in the line and the transition to the next reset. Next, a method was created that converted the incoming data and wrote it into an array, it also reset the array when switching from Q4 to Q1. And in the latter method, I was engaged in pulling data out of the array, whether the incoming parameter needed to be counted, he later called the method to add to the array and returned the calculated or uncalculated value (depending on the “StartsFromJan1” parameter).
Then, in the tMap component, this method was called into which the values ​​“StartsFromJan1”, “Quarter”, “someValue” were transferred ...
The problem with empty lines ... at first we just cut the number of readable lines from the file, but this solution was temporary, as the amount of data increases over time. The solution came in the form of a tFilterRow component, which simply indicated some expression for filtering, for example Year! = null || Year! = "", All values ​​that met the condition were run through tMap.
after carrying out the described optimizations, the development time of a single Osprey data was reduced by several orders of magnitude, the amount of work was also significantly less, although for the most part everything was done because of the enormous reluctance to do the same type and long-term work, especially when it concerns copy-peist.