Cases of practical application of Big Data in financial-sector companies
Why this article?
This review examines real-world cases of Big Data implementation and application, drawn from "live" projects. For some cases, especially interesting ones in every sense, I venture to add my own comments.
The range of cases reviewed is limited to the examples publicly available on the Cloudera website.
What is "Big Data"?

There is a joke in technical circles that "Big Data" is data that Excel 2010 on a powerful laptop can no longer handle. That is, if solving your problem requires a million or more rows per sheet, or 16 thousand or more columns, then congratulations: your data belongs to the "Big" category.
Among the many more rigorous definitions there is, for example, this one: "Big Data" means data sets so voluminous and complex that traditional processing tools cannot cope with them. The term usually describes data to which predictive analytics or other value-extraction methods are applied, and it rarely refers to the sheer volume of data alone.
Wikipedia's definition: Big Data is a designation for structured and unstructured data of huge volume and significant variety, processed effectively by the horizontally scalable (scale-out) software tools that appeared in the late 2000s as an alternative to traditional database management systems and business-intelligence solutions.
Big Data Analytics
In other words, to get value from data, it must be processed in a certain way: analyzed, with valuable information extracted from it, preferably information of a predictive nature (that is, information that helps us look into and predict the future).
The analysis and extraction of valuable information from data, or simply "Big Data Analytics", is realized by means of analytical models.
The quality of the models, that is, how well they detect dependencies in the available data, is largely determined by the amount and variety of data on which a particular model runs or is "trained".
Big Data infrastructure. The technological storage and processing platform
Exceptions aside, the main rule of Big Data Analytics is: the more data, and the more diverse it is, the better the models. Quantity turns into quality. This raises the acute question of growing the technological capacity for storing and processing large volumes of data. That capacity is provided by a combination of computing clusters and GPU graphics accelerators.
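To make the "quantity turns into quality" point concrete, here is a minimal sketch of my own (an illustration of the principle, not taken from any of the cases below) showing how validation accuracy typically grows with training-set size, using scikit-learn's learning_curve:

```python
# Minimal sketch: model quality usually improves as the training set
# grows. Purely illustrative; synthetic data, arbitrary model choice.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=20_000, n_features=30, random_state=0)

sizes, _, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y,
    train_sizes=np.linspace(0.05, 1.0, 5),  # 5% .. 100% of the data
    cv=3,
)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:>6} training rows -> validation accuracy {score:.3f}")
```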
Big Data infrastructure. Data collection and preparation
According to Viktor Mayer-Schönberger's Big Data: A Revolution That Will Transform How We Live, Work, and Think, up to 80% of all labor in Big Data projects goes into the infrastructure tasks: providing the technological platform, collecting and transforming data, and controlling data quality. Only 20% goes directly into analytics, modeling, and machine learning.

The cases themselves: applications of Big Data in financial-sector companies


1. Intercontinental Exchange (ICE), New York Stock Exchange (NYSE) - exchanges
Company description: ICE operates the largest platform for trading futures, stocks, and options, and provides clearing services and data services. NYSE is part of the ICE group.
Project topic: # Infrastructure, # Compliance, # Data Management.
Objective of the project: The exchange generates huge data sets in the course of its operations, and using this data is critical for optimizing and satisfying the ever-growing demands of the market and of customers. At some point the existing DataLake infrastructure no longer allowed timely processing of new and existing data, and the problem of data silos arose.
Results of the project: The updated technology platform (Cloudera Enterprise Data Hub) gives internal and external users access to more than 20 petabytes of real-time data (30 terabytes are added daily), thereby improving market surveillance and the monitoring of members' compliance with trading rules. The legacy database was replaced by Apache Impala from CDH, which lets data subscribers analyze the contents of the DataLake.
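The case does not show how subscribers actually query the DataLake; as a hedged sketch only (host, port, and table names below are invented, not taken from the case), an analytical query through Impala's Python client impyla might look like this:

```python
# Sketch of querying a DataLake through Apache Impala via the impyla
# client. Host and table names are hypothetical examples.
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

# Aggregate traded volume per instrument for one trading day.
cur.execute("""
    SELECT symbol, SUM(quantity) AS total_qty
    FROM market_data.trades
    WHERE trade_date = '2016-03-01'
    GROUP BY symbol
    ORDER BY total_qty DESC
    LIMIT 10
""")
for symbol, total_qty in cur.fetchall():
    print(symbol, total_qty)

cur.close()
conn.close()
```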
Comment: The topic of this case is interesting and undoubtedly in demand: many venues generate real-time data streams that need operational analytics.
The case is purely infrastructural; there is not a word about analytics. The description says nothing about the project timeline or the difficulties of moving to the new technologies. It would be interesting to study the details and get to know the participants.

2. Cartao Elo - plastic card issuing and servicing company in Brazil
Company description: Cartao Elo owns 11% of all plastic payment cards issued in Brazil and handles more than one million transactions per day.
Project topic: # Infrastructure, # Marketing, # Business Development.
Objective of the project: The company set out to bring customer relationships to the level of personalized offers, and even to predict customers' wishes over a short horizon so that an additional product or service can be offered in time. This requires an analytical platform able to process, in real time, data from sources such as customers' geolocation from their mobile devices, weather, traffic, social networks, payment-card transaction history, and the marketing campaigns of shops and restaurants.
Results of the project: The company implemented a DataLake on the Cloudera platform which, in addition to transaction data, stores other "unstructured" information: social-network data, geolocation from customers' mobile devices, weather, and traffic. The DataLake holds 7 TB of information and grows by up to 10 GB daily. Customers receive personalized product offers.
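The case gives no implementation details, so purely as an illustration of the idea (all field names, thresholds, and the offer below are invented), a personalized-offer rule combining several of the listed signals might look like this:

```python
# Toy sketch of a personalized-offer rule combining several signals.
# All fields, thresholds, and the offer text are invented; the case
# does not describe how Cartao Elo actually does this.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CustomerContext:
    near_restaurant: bool         # from mobile geolocation
    is_raining: bool              # from weather data
    heavy_traffic: bool           # from traffic data
    avg_restaurant_spend: float   # from card-transaction history

def pick_offer(ctx: CustomerContext) -> Optional[str]:
    """Return a personalized offer, or None if no rule fires."""
    if ctx.near_restaurant and ctx.avg_restaurant_spend > 50.0:
        if ctx.is_raining or ctx.heavy_traffic:
            # Customer is nearby, spends on dining, and is likely to stay put.
            return "10% cashback at partner restaurants in the next hour"
    return None

print(pick_offer(CustomerContext(True, True, False, 75.0)))
```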
Comment: Personalized product offers are a very popular topic, especially on the Russian financial-services market; many players are working on them. It is unclear how the project managed to process transaction data (plus social-network and geolocation data) in real time. Data from the core transaction-accounting system would have to land in the DataLake instantly; the case is modestly silent on this, although it is very difficult given the data volumes and the requirements for protecting card data. The topic of "Big Data ethics" is not covered either: when a person receives such an offer, they understand from it that they are being "watched" and may intuitively reject the product out of sheer irritation, or even switch to another credit card. My conclusion: the DataLake most likely stores 95% transaction data and 5% social-network and similar data, and the models are built on that. Personally, I do not believe in real-time product offers.

3. Bank Mandiri. Indonesia's largest bank
Company description: Bank Mandiri is the largest bank in Indonesia.
Project topic: # Infrastructure, # Marketing, # Business Development.
Objective of the project: Gain a competitive advantage by implementing a technology solution that generates personalized, data-driven product offers for customers, and reduce the total cost of the IT infrastructure as a result of the implementation.
Results of the project: As the case states, after the implementation of the data-driven Cloudera analytical solution, IT infrastructure costs were reduced by 99% (!). Customers receive targeted product offers, which improves the results of cross-sell and upsell campaigns. Campaign costs are significantly reduced thanks to more targeted modeling. The scale of the Big Data solution is 13 TB.
Comment: The case seems to hint that, following the implementation, the company completely abandoned its relational-database infrastructure for modeling product offers, and even cut IT costs by as much as 99%.
Yet the data sources for the solution are still 27 relational databases, customer profiles, plastic-card transaction data, and (as one would expect) social-network data.

4. MasterCard. International payment system
Company description: MasterCard earns money not only as a payment system uniting 22 thousand financial institutions in 210 countries, but also as a data provider that helps financial institutions (payment-system participants) assess the credit risk of counterparties (merchants) when considering their applications for acquiring services.
Project topic: # Fraud, # Data Management.
Objective of the project: Help its customers, the financial institutions, identify counterparties that previously proved insolvent and are trying to return to the payment system under a changed identity (name, address, or other attributes). For this purpose MasterCard maintains the MATCH (MasterCard Alert to Control High-Risk Merchants) database, which holds the history of "hundreds of millions" of unreliable organizations (fraudulent businesses). Participants in the MasterCard payment system (acquirers) make up to a million queries to the MATCH database every month.
The product's competitive advantage rests on low query latency and on the quality of matching the subject of each query. As the volume and complexity of the historical data grew, the existing relational DBMS ceased to meet these requirements against the backdrop of the growing number and sophistication of client queries.
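The case does not describe the matching algorithm itself; as a hedged sketch of the general idea only, fuzzy comparison of merchant attributes might look like the following (difflib and the 0.85 threshold are stand-ins, not MasterCard's actual method):

```python
# Sketch of fuzzy merchant matching, the general idea behind a
# MATCH-style lookup: a merchant returning under a slightly changed
# name/address should still be recognized. Illustrative only.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity of two strings in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def looks_like_known_merchant(name, address, blacklist, threshold=0.85):
    """True if name+address closely resembles a blacklisted merchant."""
    return any(
        similarity(name, bad_name) >= threshold
        and similarity(address, bad_addr) >= threshold
        for bad_name, bad_addr in blacklist
    )

blacklist = [("Acme Trading LLC", "12 Main St, Springfield")]
print(looks_like_known_merchant(
    "ACME Traiding LLC", "12 Main Street, Springfield", blacklist))
```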
Results of the project: A distributed data storage and processing platform (CDH) was introduced, providing dynamic scaling and control over the load and the complexity of the search algorithms.
Comment: The case is interesting and very much in demand. The important and labor-intensive component, the access-control and security infrastructure, is well thought out. Nothing is said about the timeline of the move to the new platform. Overall, a very practical case.

5. Experian. One of the three largest credit bureaus in the world
Company description: Experian is one of the three largest credit bureaus in the world (the so-called "Big Three"). It stores and processes credit-history information for roughly 1 billion borrowers (individuals and legal entities). In the USA alone it holds credit records for 235 million individuals and 25 million legal entities.
In addition to credit-scoring services proper, the company sells organizations marketing-support services, online access to borrowers' credit histories, and protection against fraud and identity theft.
The competitive advantage of its marketing-support services (Experian Marketing Services, EMS) rests on the company's main asset: the accumulated data on individual and corporate borrowers.
Project topic: # Infrastructure, # Marketing, # Business Development.
Objective of the project: EMS helps marketers gain unique access to their target audience by modeling it with accumulated geographic, demographic, and social data. The goal was to correctly apply the accumulated (big) borrower data to the modeling of marketing campaigns, including such real-time data as recent purchases, social-network activity, and so on.
Accumulating and using such data requires a technology platform that can quickly process, store, and analyze this diverse data.
Results of the project: After several months of research, the choice fell on developing their own platform: the Cross Channel Identity Resolution (CCIR) engine, built on HBase, a non-relational distributed database. Experian loads data into the CCIR engine via ETL scripts from numerous in-house mainframe servers and from relational databases such as IBM DB2, Oracle, SQL Server, and Sybase IQ.
At the time the case was written, more than 5 billion rows were stored in Hive, with tenfold growth expected in the near future.
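The case does not show the loading code; as a sketch of the pattern only (table, column-family, and host names are invented, and Experian's actual schema is not disclosed), writing identity records into HBase from Python via the happybase client might look like this:

```python
# Sketch of writing identity records into HBase with happybase.
# Table, column families, and host are hypothetical examples.
import happybase

connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("identity_resolution")

# Row key: a stable person identifier; columns grouped by source channel.
table.put(
    b"person:000123456",
    {
        b"email:primary": b"jane.doe@example.com",
        b"postal:zip": b"94105",
        b"device:cookie_id": b"a1b2c3d4",
    },
)

row = table.row(b"person:000123456")
print(row[b"postal:zip"])
connection.close()
```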
Comment: The case is captivatingly specific, which is very rare:
- the number of cluster nodes (35),
- the project implementation timeline (under 6 months),
- the solution architecture, presented concisely and competently:
  Hadoop components: HBase, Hive, Hue, MapReduce, Pig
  Cluster servers: HP DL380 (more than just commodity hardware)
  Data warehouse: IBM DB2
  Data marts: Oracle, SQL Server, Sybase IQ
- the load characteristics (100 million records processed per hour, a 500% productivity increase).
P.S. The advice to "be patient and practice a lot" before embarking on an industrial-grade Hadoop/HBase solution made me smile!
A very well presented case! I recommend reading it in full; there is a lot between the lines for people in the field.

6. Western Union. Market leader in international money transfers
Company description: Western Union is the largest operator on the international money-transfer market. Initially the company provided telegraph services (Samuel Morse, the inventor of Morse code, stood at the origins of its foundation).
In the course of money-transfer transactions, the company receives data on both senders and recipients. On average it processes 29 transfers per second, totaling 82 billion dollars (as of 2013).
Project topic: # Infrastructure, # Marketing, # Business Development.
Objective of the project: Over the years the company has accumulated large volumes of transactional information, which it plans to use to improve the quality of its product and strengthen its competitive advantage in the market.
The plan was to implement a platform for consolidating and processing structured and unstructured data from multiple sources (Cloudera Enterprise Data Hub). The unstructured data includes sources that are still exotic for Russia (in the author's view), such as clickstream data (how customers navigate the company's website), sentiment data (natural-language processing and chat bots), customer surveys on product and service quality, social-network data, and so on.
Results of the project: The data hub was created and is populated with structured and unstructured data through streaming (Apache Flume), batch loading (Apache Sqoop), and good old ETL (Informatica Big Data Edition).
The data hub serves as the repository of customer data and makes it possible to create verified and precise product offers. For example, in San Francisco WU creates separate targeted product offers for:
- customers of Chinese culture at local Chinatown branches,
- immigrants from the Philippines living in the Daly City area,
- Latinos and Mexicans from the Mission District.
For instance, the timing of an offer is tied to a favorable exchange rate of these groups' home-country currencies against the US dollar.
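As a toy sketch of such a trigger (segment names, rates, and the 2% margin are all invented; the case gives no implementation details):

```python
# Toy sketch of rate-triggered targeting: offer a transfer promotion to
# a customer segment when the dollar is unusually strong against that
# segment's home-country currency. All values are invented.
SEGMENT_HOME_CURRENCY = {
    "chinatown_sf": "CNY",
    "daly_city_ph": "PHP",
    "mission_district": "MXN",
}

def segments_to_target(usd_rates, baseline, margin=0.02):
    """Segments whose home currency trades >2% weaker than its baseline,
    i.e. a dollar sent home now buys noticeably more."""
    return [
        seg for seg, cur in SEGMENT_HOME_CURRENCY.items()
        if usd_rates[cur] > baseline[cur] * (1 + margin)
    ]

current = {"CNY": 6.9, "PHP": 52.0, "MXN": 19.8}   # units per USD today
avg_30d = {"CNY": 6.8, "PHP": 50.0, "MXN": 19.7}   # 30-day baseline
print(segments_to_target(current, avg_30d))        # ['daly_city_ph']
```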
Comment: Cluster characteristics are mentioned: 64 nodes, with plans to grow to 100 (the nodes are Cisco Unified Computing System servers); data volume: 100 TB.
Separate emphasis is placed on security and user access control (Apache Sentry and Kerberos), which speaks to a thoughtful implementation and real practical use of the results.
Overall, I suspect the project is not yet running at full capacity: the data-accumulation phase is under way, and there are even some attempts to develop and apply analytical models, but in general the ability to apply unstructured data correctly and systematically when building models is greatly exaggerated.

7. Transamerica. A group of life insurance and asset management companies
Company description: Transamerica is a group of insurance and investment companies headquartered in San Francisco.
Project topic: # Infrastructure, # Marketing, # Business Development.
Objective of the project: Because of the diversity of its businesses, data on the same customer may reside in the accounting systems of several group companies, which complicates analytical processing.
The goal was to implement a marketing analytics platform (Enterprise Marketing & Analytics Platform, EMAP) that can store both the group companies' own client data and client data from third-party suppliers (for example, the ever-present social networks), and to generate verified product offers on top of this EMAP platform.
Results of the project: As stated in the case, the following data is loaded into EMAP and analyzed:
- the companies' own customer data,
- CRM customer data,
- data on past insurance solicitations (solicitation data),
- customer data from third-party partners (commercial data),
- logs from the company's web portal,
- social-network data.
Comment:
- Data is loaded solely with Informatica BDM, which is suspicious given the variety of sources and the diversity of the architecture.
- The scale of the "Big Data" is 30 TB, which is very modest (especially given the mention of data on 210 million customers at the group's disposal).
- Nothing is said about the cluster characteristics, the project timeline, or the difficulties encountered during implementation.

8. mBank. Poland's 4th largest bank by assets
Company description: mBank, founded in 1986, today has 5 million retail and 20 thousand corporate clients in Europe.
Project topic: # Infrastructure.
Objective of the project: The existing IT infrastructure could not cope with ever-increasing data volumes. Because of delays in data integration and systematization, the data scientists were forced to work in T-1 mode (i.e., with yesterday's data).
Results of the project: A data warehouse was built on the Cloudera platform; it is loaded daily with 300 GB of information from various sources.
Data sources:
• flat files from the source systems (mainly OLTP systems)
• Oracle DB
• IBM MQ (a reading sketch follows below)
The time to integrate data into the warehouse decreased by 67%.
ETL tool: Informatica.
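The case shows no ingestion code, and the actual loading there is done with Informatica; purely as a sketch of reading one of the named sources (queue-manager, channel, host, and queue names below are invented), consuming messages from IBM MQ in Python via the pymqi client might look like this:

```python
# Sketch of draining messages from IBM MQ with the pymqi client.
# Connection details and queue names are hypothetical examples.
import pymqi

qmgr = pymqi.connect("QM1", "DEV.APP.SVRCONN", "mq.example.com(1414)")
queue = pymqi.Queue(qmgr, "BANK.EVENTS.IN")

# Read a few messages; in a real pipeline these would land in the
# warehouse staging area.
for _ in range(10):
    try:
        message = queue.get()
        print(message.decode("utf-8", errors="replace"))
    except pymqi.MQMIError as err:
        # MQRC_NO_MSG_AVAILABLE: the queue is empty, stop reading.
        if err.reason == pymqi.CMQC.MQRC_NO_MSG_AVAILABLE:
            break
        raise

queue.close()
qmgr.disconnect()
```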
Comment: One of the rare cases of building a bank data warehouse on Hadoop. There are no pretentious phrases "for the shareholders" about large-scale infrastructure TCO reduction or the conquest of new markets through pinpoint analysis of customer data. A pragmatic and believable case from real life.