
Big Data at Raiffeisenbank

Hello! In this article we will talk about Big Data at Raiffeisenbank. But before getting to the point, I would like to clarify the definition of Big Data itself. In recent years the term has been used in so many different contexts that its boundaries have blurred and much of its substance has been lost. At Raiffeisenbank we have identified three areas that we consider Big Data:



(Note that although this scheme looks quite simple, there are many "borderline" cases. When they occur, we resort to peer review to decide whether Big Data technologies are needed to solve the incoming problem, or whether we can get by with "classic" RDBMS technologies.)

In this article, we will focus primarily on the technologies used and the solutions developed with their help.
First, a few words about what sparked our interest in the technology. By the time work on Big Data began, the bank already had several solutions for working with data:


What made us look in the direction of Big Data?


IT was expected to deliver a universal solution that would allow the most efficient analysis of all data available to the bank, in order to create digital products and improve customer experience.

At that time, DWH and ODS had limitations that prevented them from developing into universal tools for analyzing all data:

  1. The stringent data quality requirements imposed on DWH strongly affect the freshness of data in the repository (data becomes available for analysis only the next day).
  2. Lack of historical data in ODS (by definition).
  3. The relational DBMSs behind ODS and DWH can work only with structured data. The need to define a data model before writing to DWH/ODS (schema on write) incurs additional development costs.
  4. No possibility of horizontal scaling; vertical scaling is limited.
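The schema-on-write limitation in item 3 is worth a small illustration. The sketch below (pure Python, with made-up field names, not the bank's actual code) contrasts the two approaches: with schema on write, records must match a declared structure before they can be stored; with schema on read, raw payloads are stored as-is and structure is imposed only when a query runs.

```python
import json

# Schema on write: the structure is declared up front, and records that
# do not match the declared columns are rejected at load time.
DWH_COLUMNS = ("account_id", "balance")

def write_to_dwh(table, record):
    if set(record) != set(DWH_COLUMNS):
        raise ValueError("record does not match the declared schema")
    table.append(record)

# Schema on read: raw payloads are stored untouched; structure is imposed
# only at query time, so new fields cost nothing at ingestion time.
def read_from_lake(raw_rows, fields):
    for raw in raw_rows:
        doc = json.loads(raw)
        yield {f: doc.get(f) for f in fields}

lake = ['{"account_id": 1, "balance": 100, "currency": "EUR"}',
        '{"account_id": 2, "balance": 250}']
rows = list(read_from_lake(lake, ("account_id", "currency")))
```

Note that the second lake record lacks a `currency` field, yet it was ingested without any schema change; the gap only surfaces (as a missing value) when the data is read.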

Once we understood these limitations, we decided to look toward Big Data technologies. It was already clear that competence in this area would give a competitive advantage in the future, so we needed to build up internal expertise. Since the bank had no practical experience with it at the time, we effectively had two options:

- either build a team from the market (hire externally);
- or find enthusiasts through internal transfers, without actually expanding headcount.

We chose the second option, because it seemed more conservative to us.

We then came to understand that Big Data is just a tool, and that any specific task can be solved with it in many ways. The task at hand imposed the following requirements:

  1. We needed to analyze data jointly, across a variety of forms and formats.
  2. We needed to solve a wide range of analytical tasks, from flat deterministic reports to exotic visualizations and predictive analytics.
  3. We needed a compromise between large data volumes and the need to analyze them interactively.
  4. We needed an (ideally) unlimitedly scalable solution, ready to serve requests from a large number of employees.

After studying the literature, forums, and other available information, we found that a solution meeting these requirements already exists as a well-established architectural pattern called "Data Lake". By deciding to implement a Data Lake, we aimed to build a self-sufficient "DWH + ODS + Data Lake" ecosystem capable of solving any data-related task, be it management reporting, operational integration, or predictive analytics.

Our Data Lake implements a typical lambda architecture, in which incoming data is split between two layers:



- the "fast" (speed) layer, which processes mainly streaming data: volumes are small and transformations minimal, but the latency between an event occurring and appearing in the analytical system is minimized. We use Spark Streaming for processing and HBase for storing the results.

- the "batch" layer, which processes data in batches that may contain several million records at once (for example, balances on all accounts after the trading day closes); this can take some time, but it lets us process large volumes of data (high throughput). Data in the batch layer is stored in HDFS, and we access it with Hive or Spark, depending on the task.
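The query-time behavior of this lambda pattern can be sketched in a few lines (a toy illustration in plain Python, with hypothetical names, standing in for the HBase and HDFS/Hive views described above): a slowly recomputed batch view plus a per-event speed view, merged at query time so that recent events override stale batch results.

```python
# Batch view: recomputed from the full HDFS dataset (e.g. nightly).
batch_view = {"acc-1": 100, "acc-2": 250}

# Speed view: updated per streaming event (the HBase role); it only holds
# keys seen since the last batch recomputation.
speed_view = {"acc-1": 120}

def query_balance(account_id):
    # The speed layer wins for keys it has seen; everything else falls
    # back to the (complete but less fresh) batch view.
    if account_id in speed_view:
        return speed_view[account_id]
    return batch_view.get(account_id)
```

When the next batch run completes, its view absorbs the recent events and the corresponding speed-view entries can be discarded.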

Spark deserves a separate mention. We use it widely for data processing, and the most significant benefits for us are the following:


We try to store data in the Data Lake in its original, "raw" form, following the "schema on read" approach. We use Oozie as the task scheduler that manages our processes.
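For a sense of what such scheduling looks like, here is a minimal, hypothetical Oozie workflow definition with a single Spark action; the workflow name, class, and paths are illustrative only, not the bank's actual configuration.

```xml
<!-- Hypothetical workflow: run one Spark job that loads a daily batch
     into the lake, failing the workflow if the job errors out. -->
<workflow-app name="load-daily-batch" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-load"/>
    <action name="spark-load">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <name>load-daily-batch</name>
            <class>com.example.LoadDailyBatch</class>
            <jar>${nameNode}/apps/load-daily-batch.jar</jar>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Load failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```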

Structured input data is stored in the Avro format, which gives us several advantages:
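For illustration, a minimal Avro schema of the kind that might describe such input (the record and field names here are hypothetical): the schema is stored alongside the data in a compact binary encoding, and fields added later with a `default` can be read by consumers that still use the old schema.

```json
{
  "type": "record",
  "name": "AccountBalance",
  "namespace": "com.example.lake",
  "fields": [
    {"name": "account_id", "type": "string"},
    {"name": "balance",    "type": "double"},
    {"name": "currency",   "type": "string", "default": "EUR"}
  ]
}
```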


For the data marts that users will access via BI tools, we plan to use the Parquet or ORC formats, since in most cases their columnar storage will speed up data retrieval.
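Why columnar storage helps can be shown with a toy contrast (pure Python, illustrative only, not how Parquet/ORC are actually implemented): when a query touches a single field, a columnar layout reads just that column, while a row layout must scan whole records.

```python
# Row layout: each record is stored as a whole; any query scans full rows.
rows = [("acc-1", 100.0, "EUR"), ("acc-2", 250.0, "USD")]

def sum_balances_row(rows):
    return sum(r[1] for r in rows)   # reads every record end to end

# Column layout: each field is stored contiguously and read independently,
# which is the property Parquet and ORC exploit for typical BI scans.
columns = {
    "account_id": ["acc-1", "acc-2"],
    "balance": [100.0, 250.0],
    "currency": ["EUR", "USD"],
}

def sum_balances_col(columns):
    return sum(columns["balance"])   # only one column is touched

total = sum_balances_col(columns)
```

Real columnar formats add per-column compression and encoding on top of this layout, which shrinks I/O further.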

As the Hadoop distribution, we considered Cloudera and Hortonworks. We chose Hortonworks because its distribution contains no proprietary components. In addition, Spark 2 is available out of the box in Hortonworks, while Cloudera offered only version 1.6.

Among the analytical applications that use Data Lake data, two are worth noting.

The first is JupyterHub with Python and installed machine learning libraries, which our data scientists use for predictive analytics and model building.

For the second role, we are now considering a Self-Service BI application with which users can independently prepare most standard retrospective reports: tables, graphs, pie charts, histograms, and so on. IT's role will then be reduced to adding data to the Data Lake and granting the application and its users access to it, and that's all. Users will be able to do the rest themselves, which, among other things, should shorten the time it takes them to find answers to the questions that interest them.

In conclusion, I would like to tell you what we have achieved so far:

Source: https://habr.com/ru/post/332496/

