Hello! With the right information you can do a lot of useful (or wildly harmful) things; it depends on who holds the information and what motivates them. To work with information, run the exports you need, and compile reports, you have to store it somewhere. So we built a huge data lake for marketing.
My name is Andrey Naumov. I work in the corporate data management team and build a product for marketing and sales. Our task is to fill this lake with data (what kind of data lake is it without data?) so that both business people and the employees who need to build detailed analytics can work with it productively.
Below the cut: why we needed such a lake at all, how we built it, how it helps us enter new sales markets inside and outside the country, and what we plan next.
Why is it needed at all
Before we had a single data lake, information processing left much to be desired. Everything worked, but it could have been much better. To begin with, here is how our marketing people work.
They work with a tremendous amount of information from many data sources, both inside SIBUR and outside: some freely available, some by subscription only, some free, some paid. In short, it is quite a zoo. Most of this information comes as huge flat files that need specialized software to work with, often a separate tool for each type of data. Naturally, this software is often unstable or simply slow.
For example, much of marketing's work is tied to studying customs statistics, which show which products leave Russia and which come in. We are interested specifically in products that SIBUR can directly or indirectly sell or produce. This data arrives in batches, month by month. Building any intelligible analytics over, say, a year or a decade was impossible because we ran into software limits: Excel, for instance, has a maximum number of rows, and we were pulling tables of more than a million rows. The work PCs did not take such abuse well.
And customs statistics are just one source among many: there are also railway statistics, information from internal systems about company sales, expert sources, reports ordered from external agencies, and much, much more.
What to do
The task was to create a single version of the data in one place, so that every user could work with it through one visualization tool and build analytics. In the "before" state, our marketers were badly distracted by the data preparation stage itself. De facto, they spent a lot of their time working as data engineers. That is not right.
It was very hard to analyze data over spans longer than a year, because even after a year of data had been prepared and exported, it still had to be thoroughly cleaned of duplicates, mistakes, and incorrect names. Some rows required unification. For example, one table called our vast homeland "Russia", another "Russian Federation", and a third tersely "RF". All of this had to be brought to one form, and, as you can guess, the country name is far from the only example and hardly the trickiest one.
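The kind of unification described above can be sketched as a simple alias lookup. This is only an illustration: the three spellings come from the text, while the `COUNTRY_ALIASES` table, the `normalize_country` helper, and the choice of "Russia" as the canonical form are assumptions, not our actual rules.

```python
# Hypothetical alias table: keys are lowercased spellings seen in source files,
# values are the single canonical name we choose to keep.
COUNTRY_ALIASES = {
    "russia": "Russia",
    "russian federation": "Russia",
    "rf": "Russia",
}


def normalize_country(raw: str) -> str:
    """Return the canonical country name, or the trimmed input if unknown."""
    # Lowercase and collapse repeated whitespace before the lookup.
    key = " ".join(raw.strip().lower().split())
    return COUNTRY_ALIASES.get(key, raw.strip())


rows = ["Russia", " Russian  Federation", "RF", "Kazakhstan"]
cleaned = [normalize_country(r) for r in rows]
print(cleaned)
```

Unknown values pass through unchanged, so new spellings stay visible and can be added to the table later rather than being silently dropped.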
The thing is, we are a holding: we have many organizations, and not all of them have the word "SIBUR" in their name. So filtering a list of names in a couple of clicks to show only the holding's companies was not easy.
On top of that, there are as many approaches to a task as there are people. Each employee had their own method of processing, filtering, mapping, and combining data. The problem is that the method lived in that employee's head, so a lot depended on a specific person. That is not much fun either: you need an export, and the person is on vacation. So you sit and wait, because without them the job will either take much longer or be done wrong.
In general, we decided to remove the dependence on any particular person, so that all information would be shared and equally accessible to any user who might need it.
To do this, we first went to the business and asked which data sources interested them most. We selected those and prepared a pilot data warehouse built on data lake technologies (we described this lake in detail, with diagrams, in this post). Then, using a number of ETL tools, we loaded all the required sources in one go: customs statistics, railway statistics, product data, and so on, and carefully put it all into the database (Vertica). The task was to integrate everything we could, and we did.
For data visualization we use Tableau; we hooked its server version up to the warehouse and gave users access to all the data at once. Users, I must say, were delighted: before, you sat staring at tables (huge tables), and now everything is neatly and conveniently visualized.
Product Flow Analysis
Product analysis
Competitor analysis
Of course, what our analysts see on screen is not a bunch of blurred lines but real numbers and counterparty names; we just cannot show them here.
Then useful feedback started coming in from users. They let us know that raw data was not very interesting to them, because each of them had been doing their own preprocessing. So we began building in the most frequent mappings and renamings, rewrote counterparty names, and fixed many errors: there could be duplicates and stray punctuation in columns, and someone might have entered extra information alongside the company name. In general, there was plenty of garbage.
We brought the countries to a common form, which made it possible to collapse and expand them by region: in a couple of clicks employees can pull data for the CIS, or for South or North America, which matters for proper analytics. Collapsing is convenient, so we extended the practice to legal entities: the same as with countries, only at the scale of holdings and individual legal entities.
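Collapsing by region, as described above, amounts to mapping each country to a group and summing within the group. The sketch below assumes records are `(country, tonnes)` pairs; the `COUNTRY_TO_REGION` table is a deliberately incomplete, hypothetical mapping and all volumes are made up.

```python
from collections import defaultdict

# Hypothetical and incomplete mapping, for illustration only.
COUNTRY_TO_REGION = {
    "Russia": "CIS",
    "Kazakhstan": "CIS",
    "Belarus": "CIS",
    "Brazil": "South America",
    "Argentina": "South America",
    "USA": "North America",
    "Canada": "North America",
}


def rollup_by_region(records):
    """Sum volumes per region; countries missing from the mapping go to 'Other'."""
    totals = defaultdict(float)
    for country, tonnes in records:
        region = COUNTRY_TO_REGION.get(country, "Other")
        totals[region] += tonnes
    return dict(totals)


# Made-up shipment volumes.
shipments = [
    ("Kazakhstan", 120.0),
    ("Belarus", 80.0),
    ("Brazil", 50.0),
    ("Canada", 30.0),
    ("Chile", 10.0),  # not in the mapping on purpose
]
totals = rollup_by_region(shipments)
print(totals)
```

The same pattern would apply to legal entities: swap the country-to-region table for an entity-to-holding table and the grouping logic stays the same.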
Why analysis is important for working with the market
Thanks to this work, it became possible to show import and export reports for the last 15-20 years without going crazy or burning out a couple of work PCs. Now you can take that period and break it down by year or drill down to months.
So. Customs statistics use TNVED, the Commodity Nomenclature of Foreign Economic Activity. A code has at most 10 digits; the more digits, the more specific the product.
Take coffee as an example.
09 - coffee, tea, mate (Paraguayan tea), and spices. A fairly general category.
0901 2 - this already tells us we are talking about roasted coffee.
0901 21 - roasted coffee with caffeine (unroasted and decaffeinated coffee have different codes).
0901 21 000 2 - the full 10 digits; this is robusta (Coffea canephora).
The same goes for the products that matter to us, that is, the ones we sell and produce. Coffee is important too, of course, but we do not yet drink enough of it to pull import statistics.
What matters to us is polymers, plastics, and the raw materials needed to make them.
Here the codes look like this.
39-40 - plastics and articles thereof; rubber and articles thereof.
3901 - ethylene polymers in primary forms
3901 1 - polyethylene with a specific gravity of less than 0.94
3901 10 100 0 - linear polyethylene.
And so, for each polymer or type of raw material, we drill down from the general to the specific. Why look at this at all? Customs data shows in some detail that a certain amount of polymers, or raw materials, was imported into the Russian Federation over the year. That is, someone is buying abroad products that we, among others, make here in Russia. We can then see what volumes they buy, work out the right price with the help of the advanced analytics team, and ultimately approach that customer with the same product, made here, at a reasonable price, given what they spend on customs duties and transportation.
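Because a shorter TNVED code is a prefix of the longer ones beneath it, the general-to-specific drill-down described above can be sketched as grouping by the first N digits. In this example only `3901 10 100 0` and `0901 21 000 2` are codes quoted in the text; the other codes are placeholders that were not checked against the nomenclature, and all volumes are invented.

```python
from collections import defaultdict


def normalize_code(code: str) -> str:
    """Keep digits only, so '3901 10 100 0' becomes '3901101000'."""
    return "".join(ch for ch in code if ch.isdigit())


def totals_by_prefix(records, digits):
    """Sum volumes grouped by the first `digits` digits of the code."""
    totals = defaultdict(float)
    for code, volume in records:
        totals[normalize_code(code)[:digits]] += volume
    return dict(totals)


# (code, volume) pairs; volumes are made up.
records = [
    ("3901 10 100 0", 100.0),  # linear polyethylene (from the text)
    ("3901 10 999 9", 40.0),   # placeholder code, illustration only
    ("3901 99 999 9", 25.0),   # placeholder code, illustration only
    ("0901 21 000 2", 5.0),    # robusta (from the text)
]

by_group = totals_by_prefix(records, 2)     # broad commodity groups
by_heading = totals_by_prefix(records, 4)   # e.g. 3901, ethylene polymers
by_subheading = totals_by_prefix(records, 6)
print(by_group, by_heading, by_subheading)
```

Changing the `digits` argument is the whole drill-down: 2 gives the broad group, 4 the heading, and 10 the individual product.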
It is the same with exports. Say a product we are interested in is frequently exported. That means there is steady, solid demand for it. So we can see what it is, who it goes to, and how much they pay for it, and then work out whether we can do the same given logistics costs, and whether it makes sense.
It also helps to track competitors' activity in the same field and, if necessary, adjust our own numbers.
But it would be too simple if the TNVED code always made it clear exactly what goods were being shipped, right?
Some importers bring in polyethylene under a different TNVED code, but our analysts can study the other fields in the customs statistics and, from the combination of attributes, work out that it is in fact polyethylene and not what the code says. This reveals additional export and import volumes that might escape attention at first glance. With such data we can already estimate whether it makes sense for us to open additional production that, judging by the numbers and volumes, would pay off.
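A very rough sketch of that idea: flag rows whose free-text description mentions polyethylene while the declared code falls outside heading 3901. The field names `code` and `description`, the keyword list, and the sample rows are all assumptions for illustration; real analysts rely on a broader combination of attributes and on their own expertise.

```python
POLYETHYLENE_PREFIX = "3901"  # ethylene polymers in primary forms, as listed earlier
KEYWORDS = ("polyethylene", "ldpe", "hdpe")  # hypothetical keyword list


def looks_like_polyethylene(row) -> bool:
    """True if the description suggests polyethylene but the code is not under 3901."""
    code = "".join(ch for ch in row["code"] if ch.isdigit())
    text = row["description"].lower()
    mentions = any(keyword in text for keyword in KEYWORDS)
    return mentions and not code.startswith(POLYETHYLENE_PREFIX)


# Made-up rows; '3999 99 999 9' is a placeholder code, not a real classification.
rows = [
    {"code": "3901 10 100 0", "description": "Linear polyethylene, granules"},
    {"code": "3999 99 999 9", "description": "HDPE granules for pipe extrusion"},
    {"code": "0901 21 000 2", "description": "Roasted coffee, robusta"},
]

suspects = [row for row in rows if looks_like_polyethylene(row)]
print(suspects)
```

A keyword match only produces candidates for an expert to review; it does not by itself prove that the declared code is wrong.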
We can further enrich such reports with the employees' own analysis and expertise: a new field appears in the database, for example "product", which can then be used for selections and reports. And for each specific product (determined by both the TNVED code and colleagues' expert knowledge) we can see that there are a couple of potential customers inside the country and several more outside it. So we can start making raw materials for them, or even the final product.
We need to go deeper
We can go further: having picked out such recipients within the country, we can see what else they order from among the goods we deal in. What if they are interested not only in polyethylene but also in polypropylene and some types of BOPP film? The result is fairly extensive knowledge about a particular consumer; having studied it, you can offer them the right goods at the right price on comfortable terms straight away.
What we have now
We continue to work iteratively: we load data, collect feedback from users, and refine our analytical rules. It is a kind of teamwork. We learn from them and they learn from us, because they have very good expert knowledge and we have the technical knowledge.
After loading the most critical sources and doing the basic data preparation, we are finally moving from the test storage (yes, all this time we have been in test) to production. This will remove many problems, because production means certified, and it can hold a lot of data that could not be loaded into test (trade secrets and other things that also matter for analytics). It will then truly be a single data lake with a huge number of sources, including price quotation data. Our colleagues from advanced analytics can forecast prices for a given product by analyzing many factors: a company's share price, natural disasters in producing regions, rumors of mergers and acquisitions, and even an ill-judged tweet from someone in management.
Predictive analytics consumes data and produces forecasts; those forecasts are loaded back into the data lake, where marketing can use them for reports and analytics.
The result is a data cycle within a single lake. So far everyone is happy: the business, whose reviews are as positive as can be because they understand how much time and effort this project saves, and the analysts themselves.
So we keep working.