📜 ⬆️ ⬇️

Data conversion, or deepening in Talent Open Studio

The morning was quite ordinary and lazy: shower, coffee, cigarette ... It's time to get ready for work ...
Upon arrival at the office, I was greeted with the news about the transition to a new project as an ETL engineer (what it means I did not know, but oh well). Well, I think we will try. Before me, one guy worked there, but as always, I didn’t wait for a lot of help about his work. And so begin.


What is ETL? This is what the wiki says:
ETL (from the English. Extract, Transform, Load - literally "extraction, transformation, loading") - one of the main processes in the management of data warehouses, which includes:

I think this more or less clearly. The essence of my task was to pull data from some .xls document, convert it by type (most of the fields were defined as string), calculate something (depending on the case), and write the whole thing into the database. After lead to the star schema mind, what is it you can read on the same wiki .
So. Deal with the task. For work, Talent Open Studio was used , and an example of how and what to do was made. And began a very long and boring job copy-paste.

Here is what a simplified example of data conversion looks like and writing to the database looks like this:

')
so visually looks like a simple example of a job at TOS
TOS Job

There were cases where I had to write a fairly long check on the validity of the data, their parsing, and so on. After several such tasks and an annoying copy-peist, the idea of ​​optimizing the process matured, as it turned out in this IDE there is an opportunity to write your own static public classes. The first class for data conversion was written in a couple of minutes. And immediately the amount of marudnaya work decreased significantly. If, for example, every time I had to copy-paste:

obj.toString().equals("#") || obj.toString().equals("") || obj==null ? null
Integer.parseInt(StringHandling.EREPLACE(obj.toString().replace(" ", "").substring(0,obj.toString().replace(" ", "").indexOf(".")),"\\xA0", ""))


then it all went on in one line

routines.Convert.toInteger(obj)

But as they say, not everything is so simple, especially if you have to perform some operations with some data, and even more so if you have to work with data from the previous osprey, as in Talent Open Studio the data is transferred in rows, rather than as a whole array, and accessed to the previous osprey is not so easy, but rather I haven’t found a way except to use the buffer components.

More specifically, there is the following task: we have a set of values ​​with the following fields: “Year”, “Quarter”, “StartsFromJan1”, “SomeValue1”, “SomeValue2”, “SomeValue3”; the StartsFromJan1 field can take the values ​​“Y” (true) or “N” (false).

If the “StartsFromJan1” field is set to “Y” (true), then this line must be processed according to a set of rules, if “N” is left as it is. For example, the processing can occur as follows: the corresponding cell of the previous line is subtracted from the cell in the current line (if Q4 is subtracted Q3), taking into account the fact that if it has the value of the StartsFromJan1 == N field then subtract one more minus one line, and so on let's get to Q1, but if the StartsFromJan1 == Y field is in the row to be taken away, then we stop there, for clarity, I will give an example of the data:
YearQuarterStartsFromJan1SomevalueSomevalue
2009Q1N30004,000
2009Q2Y35005000
2009Q3N4,0006000
2009Q4Y50007,000
2009Q1N35004400
2009Q2Y34005600
2009Q3N45006500
2009Q4Y56007800

in our case, the first remains unchanged, we take away the first from the second, do not touch the third, subtract 3 and 2 from the fourth



the development of the problem the following problems


To solve the first problem, the same static classes came to the rescue. I used 3 one-dimensional arrays (three, because the string in which Quarter == Q1 is always “N” for the “StartsFromJan1” field), created some Int type variable, which after each processed cell was increased by one, and after processing the last cell in the line and the transition to the next reset. Next, a method was created that converted the incoming data and wrote it into an array, it also reset the array when switching from Q4 to Q1. And in the latter method, I was engaged in pulling data out of the array, whether the incoming parameter needed to be counted, he later called the method to add to the array and returned the calculated or uncalculated value (depending on the “StartsFromJan1” parameter).
Then, in the tMap component, this method was called into which the values ​​“StartsFromJan1”, “Quarter”, “someValue” were transferred ...

The problem with empty lines ... at first we just cut the number of readable lines from the file, but this solution was temporary, as the amount of data increases over time. The solution came in the form of a tFilterRow component, which simply indicated some expression for filtering, for example Year! = null || Year! = "", All values ​​that met the condition were run through tMap.

after carrying out the described optimizations, the development time of a single Osprey data was reduced by several orders of magnitude, the amount of work was also significantly less, although for the most part everything was done because of the enormous reluctance to do the same type and long-term work, especially when it concerns copy-peist.

Source: https://habr.com/ru/post/86847/


All Articles