Hello! In this article we will talk about Big Data at Raiffeisenbank. But before we get to the point, I would like to clarify the definition of Big Data itself. In the past few years this term has been used in such a variety of contexts that its boundaries have blurred and much of its substance has been lost. At Raiffeisenbank we identified three areas that we classify as Big Data:

(Note that although this scheme looks quite simple, there are plenty of borderline cases. When they arise, we resort to peer review to decide whether an incoming problem really requires Big Data technologies or whether we can get by with "classic" RDBMS technologies.)
In this article, we will focus primarily on the technologies used and the solutions developed with their help.
First, a few words about what sparked our interest in the technology. By the time we started working with Big Data, the bank already had several solutions for working with data:
- Corporate Data Warehouse (DWH)
- Operational Data Store (ODS)
What made us look in the direction of Big Data?
The business expected IT to deliver a universal solution that would allow efficient analysis of all data available to the bank in order to create digital products and improve customer experience.
At that time, DWH and ODS had limitations that prevented them from evolving into universal tools for analyzing all of our data:
- The stringent data quality requirements imposed on the DWH strongly affect the freshness of the data in the warehouse (data becomes available for analysis only the next day).
- Lack of historical data in ODS (by definition).
- The relational DBMSs behind the ODS and DWH can work only with structured data, and the need to define a data model before writing to them (schema on write) incurs additional development costs.
- The solution cannot scale horizontally, and vertical scaling is limited.
Once we recognized these limitations, we decided to look at Big Data technologies. It was already clear that competence in this area would give a competitive advantage in the future, so we needed to build up internal expertise. Since the bank had no practical experience at the time, we effectively had two options:
- hire a team from the market (from the outside); or
- find enthusiasts through internal transfers, without actually expanding headcount.
We chose the second option because it seemed more conservative to us.
We then came to understand that Big Data is just a tool, and a specific task can be solved with it in many different ways. The problem we had to solve imposed the following requirements:
- It must be possible to analyze data of various forms and formats together.
- It must be possible to solve a wide range of analytical tasks, from flat deterministic reports to exotic visualizations and predictive analytics.
- A compromise must be found between large data volumes and the need to analyze them online.
- The solution should (ideally) scale without limit and be ready to serve requests from a large number of employees.
After studying the literature, forums, and other available information, we found that a solution meeting these requirements already exists as a well-established architectural pattern called the "Data Lake". By deciding to build a Data Lake, we aimed to end up with a self-sufficient "DWH + ODS + Data Lake" ecosystem capable of solving any data-related task, be it management reporting, operational integration, or predictive analytics.
Our Data Lake implements a typical lambda architecture, in which the incoming data is split into two layers:

- the "speed" layer, which mainly processes streaming data; the data volumes are small and the transformations minimal, but the latency between an event occurring and its appearance in the analytical system is kept to a minimum. For processing we use Spark Streaming, and for storing the results, HBase (a minimal sketch of such a job follows this list).
- the "batch" layer, in which data is processed in batches that can contain several million records at once (for example, the balances of all accounts after the trading day is closed); this can take some time, but it lets us process large volumes of data (throughput). The data in the batch layer is stored in HDFS, and we access it with Hive or Spark, depending on the task.
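Below is a minimal sketch of the speed-layer job mentioned above, assuming events arrive as JSON lines from a socket source and are written to HBase through the happybase client; the hosts, table name, column family, and fields are all hypothetical, not the actual implementation.

```python
import json

import happybase
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="speed-layer-sketch")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Hypothetical source: JSON events, one per line, from a socket.
events = ssc.socketTextStream("stream-host", 9999).map(json.loads)

def write_partition(records):
    # One HBase connection per partition; table and column family are illustrative.
    connection = happybase.Connection("hbase-host")
    table = connection.table("events")
    for rec in records:
        table.put(rec["event_id"].encode(),
                  {b"cf:amount": str(rec["amount"]).encode(),
                   b"cf:account": rec["account_id"].encode()})
    connection.close()

# Push every micro-batch into HBase with minimal transformation.
events.foreachRDD(lambda rdd: rdd.foreachPartition(write_partition))

ssc.start()
ssc.awaitTermination()
```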
Spark deserves a separate mention. We use it widely for data processing, and for us its most significant benefits are the following:
- It can be used as an ETL tool.
- It is faster than standard MapReduce jobs.
- Development is faster than with Hive / MapReduce because the code is less verbose, in particular thanks to DataFrames and the Spark SQL library (see the sketch after this list).
- It is more flexible and supports more complex processing pipelines than the MapReduce paradigm.
- It supports Python as well as JVM languages.
- It comes with a built-in machine learning library.
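As a rough illustration of that conciseness, here is a sketch of the same aggregation expressed with the DataFrame API and with Spark SQL; the table and column names are invented for the example and are not the bank's actual model.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("spark-conciseness-sketch")
         .enableHiveSupport()
         .getOrCreate())

# DataFrame API: aggregate card transactions per client.
txns = spark.table("raw.card_transactions")
per_client = (txns
              .groupBy("client_id")
              .agg(F.count("*").alias("txn_count"),
                   F.sum("amount").alias("turnover")))

# The same aggregation in Spark SQL; both run on the same engine,
# so the choice is a matter of readability for the task at hand.
per_client_sql = spark.sql("""
    SELECT client_id, COUNT(*) AS txn_count, SUM(amount) AS turnover
    FROM raw.card_transactions
    GROUP BY client_id
""")
```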
We try to store data in the Data Lake in its original, "raw" form, following the "schema on read" approach. As a task scheduler for managing our processes we use Oozie.
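Here is a minimal sketch of what schema on read looks like in practice, assuming a raw JSON dump on HDFS; the path and field names are illustrative. The schema lives in the reading code, so the same raw files can later be re-read with a different schema without rewriting anything in the lake.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# The schema is applied only at read time, not in the storage layer.
schema = StructType([
    StructField("account_id", StringType()),
    StructField("currency", StringType()),
    StructField("balance", DecimalType(18, 2)),
])

balances = spark.read.schema(schema).json("/data/raw/balances/2017-09-01")
balances.createOrReplaceTempView("balances")

spark.sql("SELECT currency, COUNT(*) AS accounts FROM balances GROUP BY currency").show()
```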
Structured input data is stored in the Avro format. This gives us several advantages:
- The data schema can change over its life cycle without breaking the applications that read these files.
- The schema is stored together with the data, so it does not need to be described separately.
- It is natively supported by many frameworks.
For data marts that users will work with via BI tools, we plan to use Parquet or ORC, since in most cases columnar storage will speed up data retrieval (a short sketch covering both formats follows below).
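As a hedged sketch of this storage split: a raw extract is read from Avro without declaring a schema (the schema travels with the data), and a mart is published in Parquet partitioned by day. The paths, column names, and the spark-avro package coordinate are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("formats-sketch").getOrCreate()

# No schema is declared here: each Avro file carries its writer schema.
payments = (spark.read
            .format("com.databricks.spark.avro")  # spark-avro package for Spark 2.x
            .load("/data/raw/payments"))
payments.printSchema()

# Publish a columnar mart so BI queries scan only the columns and
# partitions they actually need.
(payments
 .withColumn("day", F.to_date("payment_ts"))
 .write.mode("overwrite")
 .partitionBy("day")
 .parquet("/data/marts/payments_daily"))
```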
As for the Hadoop distribution, we considered Cloudera and Hortonworks. We chose Hortonworks because its distribution contains no proprietary components. In addition, Spark 2 is available in Hortonworks out of the box, while Cloudera offered only 1.6.
Among the analytical applications that use Data Lake data, two are worth mentioning.
The first is Jupyter Hub with Python and a set of installed machine learning libraries, which our data scientists use for predictive analytics and model building.
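To give an idea of the kind of notebook work this enables, here is a minimal sketch of training a regression model, assuming a feature mart has already been prepared in the lake and pulled into pandas; the ATM-demand framing, the mart path, and the feature names are purely illustrative and not the actual production model.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical feature mart exported from the Data Lake.
df = pd.read_parquet("/data/marts/atm_withdrawals.parquet")

features = df[["day_of_week", "is_payday", "prev_day_amount"]]
target = df["withdrawn_amount"]

# Keep the chronological order when splitting time-series-like data.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, shuffle=False)

model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05)
model.fit(X_train, y_train)

print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```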
For the second role, we are currently considering a Self-Service BI application with which users can prepare most standard retrospective reports on their own: tables, graphs, pie charts, histograms, and so on. In this setup, IT's role is to load data into the Data Lake and provide data and access for the application and its users, and... that's it. Users will be able to do the rest themselves, which, among other things, should reduce the time it takes them to find answers to the questions that interest them.
In conclusion, I would like to tell you what we have achieved so far:
- We brought the batch layer branch to production and load data that is used both for retrospective analysis (i.e., analysts use the data to answer the question "how did we get to where we are") and for predictive analysis: a daily machine-learning forecast of demand for cash withdrawals at ATMs and optimization of the cash collection service.
- We set up Jupyter Hub and gave users the ability to analyze data with modern tools: scikit-learn, XGBoost, Vowpal Wabbit.
- We are actively developing the speed layer branch and preparing to launch it in production, implementing a real-time decision-making solution on top of the Data Lake.
- We put together a product backlog whose implementation will let us increase the maturity of the solution as quickly as possible. Among the planned items:
- Disaster recovery. The solution is currently deployed in a single data center, so in effect we do not guarantee continuity of service, and if something happens to that data center we could irreversibly lose the accumulated data (the probability is small, but it exists). We ran into a problem here: out of the box, HDFS cannot guarantee data storage across different data centers. There is work in progress upstream to address this, but its fate is not yet clear, so we plan to implement a solution of our own.
- Metadata enrichment (Atlas), metadata-driven data management / governance, and metadata-based role access.
- Exploring alternatives to some of the chosen architectural components. The first candidates: Airflow as an alternative to Oozie (a sketch of what a DAG could look like follows this list), and more advanced CDC tools as an alternative to Sqoop for loading data from relational databases.
- Introducing a CI/CD pipeline. Given the variety of technologies and tools we use, we want any code change to be rolled out to production automatically and as quickly as possible, while guaranteeing the quality of the delivery.
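To give an idea of why Airflow is attractive as a scheduler, here is a hedged sketch of what a nightly load could look like as an Airflow DAG (pipelines are defined directly in Python code); the DAG id, schedule, and job scripts are hypothetical.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "datalake",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(dag_id="daily_balances_load",
         start_date=datetime(2017, 1, 1),
         schedule_interval="0 3 * * *",  # nightly, after day-end closing
         default_args=default_args) as dag:

    ingest = BashOperator(
        task_id="ingest_raw_avro",
        bash_command="spark-submit /opt/jobs/ingest_balances.py")

    build_mart = BashOperator(
        task_id="build_parquet_mart",
        bash_command="spark-submit /opt/jobs/build_balance_mart.py")

    # Run the mart build only after the raw ingest succeeds.
    ingest >> build_mart
```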
We still have many plans for using Big Data at Raiffeisenbank, and we will definitely tell you about them.
Thank you for your attention!