
How to build a business intelligence system without making a mess of it

To make sound business decisions, you need the most complete and detailed information about the state of the company. Often, though, this information is limited to annual and quarterly reports.

That is certainly not enough. To analyze their data effectively, companies often deploy business intelligence systems (hereinafter BI systems). Today we want to share a few tips that can help you build a BI system in your company (and that would have helped us a year ago).




Design


Keep the original data, not slices of it


Never assume that a fixed set of charts and reports will be enough. On the contrary, managers will keep asking for more detailed charts, and there will be many follow-up refinements.

For example, we once received a request to find out how much money single heterosexual users from California under the age of 25 spent in each month of the previous year. To be ready to answer such a question, we need to have at hand not only a table with full user profiles but also a table with their payments.
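
A request like this can only be answered if both raw tables are available. The following is a minimal sketch using Python's built-in `sqlite3` module; the table names, column names and sample rows are hypothetical, and the segment is simplified to state, age and marital status.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        state TEXT, age INTEGER, marital_status TEXT
    );
    CREATE TABLE payments (
        user_id INTEGER, payment_date TEXT, money REAL
    );
    INSERT INTO users VALUES
        (1, 'CA', 23, 'single'),
        (2, 'CA', 31, 'single'),
        (3, 'NY', 22, 'single');
    INSERT INTO payments VALUES
        (1, '2011-01-15', 10.0),
        (1, '2011-01-20', 5.0),
        (1, '2011-03-02', 7.5),
        (2, '2011-01-16', 99.0),
        (3, '2011-02-01', 20.0);
""")

def monthly_spend(conn, state, max_age, year):
    """Total spend per month for one ad hoc user segment."""
    return conn.execute(
        """
        SELECT strftime('%m', p.payment_date) AS month, SUM(p.money)
        FROM payments AS p
        JOIN users AS u ON u.user_id = p.user_id
        WHERE u.state = ? AND u.age < ? AND u.marital_status = 'single'
          AND strftime('%Y', p.payment_date) = ?
        GROUP BY month
        ORDER BY month
        """,
        (state, max_age, str(year)),
    ).fetchall()

rows = monthly_spend(conn, "CA", 25, 2011)
print(rows)
```

Because the segment is defined in the query rather than baked into a stored report, the next refinement (another state, another age limit) only changes the parameters.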

Analyze raw data, not ready-made slices


Try to analyze the raw data and do not pre-aggregate it. Remember: as soon as you aggregate data, you lose information.

For example, suppose you need statistics on the number of new contacts per day among New Yorkers. If you analyze the raw data itself, you can back up the results with specific examples: who contacted whom, and when.
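
As a rough illustration, the sketch below computes the daily count and then drills down to the individual rows behind one of the numbers. The `contacts` table, its columns and the sample data are assumptions made for this example.

```python
import sqlite3

# Hypothetical raw event table: one row per new contact.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (
        from_user TEXT, to_user TEXT, city TEXT, created_at TEXT
    );
    INSERT INTO contacts VALUES
        ('alice', 'bob',   'New York', '2012-06-01 09:15:00'),
        ('carol', 'dave',  'New York', '2012-06-01 18:40:00'),
        ('erin',  'frank', 'Boston',   '2012-06-01 11:00:00'),
        ('alice', 'gina',  'New York', '2012-06-02 08:05:00');
""")

def contacts_per_day(conn, city):
    """The aggregate a manager would see on a chart."""
    return conn.execute(
        "SELECT date(created_at) AS day, COUNT(*) FROM contacts "
        "WHERE city = ? GROUP BY day ORDER BY day",
        (city,),
    ).fetchall()

def contacts_on(conn, city, day):
    """The raw rows behind one point of that chart: who, with whom, when."""
    return conn.execute(
        "SELECT from_user, to_user, created_at FROM contacts "
        "WHERE city = ? AND date(created_at) = ? ORDER BY created_at",
        (city, day),
    ).fetchall()

summary = contacts_per_day(conn, "New York")
details = contacts_on(conn, "New York", "2012-06-01")
print(summary)
print(details)
```

If only the daily totals had been stored, the second query would be impossible.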

Forget the “Not Invented Here” syndrome


Remember that you are not the first to build a BI system, and ready-made solutions already exist for many tasks. Much of the development can therefore be reduced to collecting data and configuring analysis tools.

Today Badoo uses the Vectorwise columnar database and Pentaho's analytical frontend, so almost all of the work comes down to loading data into the database.

Remember your users


The system you design will be used by ordinary managers, for whom the words “first time derivative” can cause unbearable heartburn. The interface to the data should be extremely simple and unambiguous. That is why you should not invent your own interface; it is better to look at what has already been invented.

Many BI tools have demo pages where you can see the tool in action. It is a good idea to ask future users of the BI system to rate how clear each tool is to them.

Do not put off building a BI system


Designing, developing and deploying a BI system is a long and complicated process. This is a case where nine women cannot deliver a baby in one month. The rollout of the BI system at Badoo is not yet complete, but the first significant results came about 9 months after the start. The BI team consisted of 3 people plus 1 consultant.

Development


Collect data asynchronously


If you want to start collecting data about user behavior, do it asynchronously. You can write to logs, or you can write to Scribe. Remember that collecting data about an object should not noticeably interfere with that object's behavior. As a matter of principle, problems in the BI system should not affect the object under study.

When we developed the infrastructure for collecting user behavior data, we knew that the volume of data would be large and that all of it had to end up in a single data warehouse. Of course, any problems with this warehouse had to remain invisible to users of the site. We therefore decided to write the primary data to logs and only then transfer it to the warehouse with a separate background script. Later, the logs and parsers were replaced by the Scribe service.
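
The log-then-load approach can be sketched roughly as follows. This is not Badoo's actual code: the function names, the JSON-lines log format and the `events` table are assumptions, and an in-memory SQLite database stands in for the warehouse.

```python
import json
import os
import sqlite3
import tempfile

def log_event(log_path, event):
    """Called from the site code: append one JSON line and return.
    It never talks to the warehouse, so warehouse outages cannot slow the site."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

def load_log(log_path, warehouse):
    """Run by a separate background job: copy logged events into the warehouse."""
    with open(log_path, encoding="utf-8") as f:
        events = [json.loads(line) for line in f if line.strip()]
    warehouse.executemany(
        "INSERT INTO events (user_id, action, ts) VALUES (?, ?, ?)",
        [(e["user_id"], e["action"], e["ts"]) for e in events],
    )
    warehouse.commit()
    return len(events)

# The site only appends to a local file.
log_path = os.path.join(tempfile.mkdtemp(), "events.log")
log_event(log_path, {"user_id": 1, "action": "login", "ts": "2012-06-01 09:00:00"})
log_event(log_path, {"user_id": 2, "action": "message", "ts": "2012-06-01 09:00:05"})

# Later, and independently, the background job loads the file.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE events (user_id INTEGER, action TEXT, ts TEXT)")
loaded = load_log(log_path, warehouse)
print(loaded)
```

A production loader would also need to rotate logs and remember how far it has read, which this sketch leaves out.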

Forget normalization


Do not be afraid to denormalize the data. If you have a table of users and a table of their payments, you may find a table of user-payment pairs (the result of joining the two tables) useful. On the one hand, you get heavy data duplication. On the other hand, instead of a complex join on every query, you get a simpler operation of grouping by distinct values over a single table.

As an example of how effective this table is, take a query that returns the amount of money spent by women and by men last year:

SELECT sum(money), gender
FROM UserPayment
WHERE gender IN ('M', 'F')
  AND year(payment_date) = year(now()) - 1
GROUP BY gender


This query is easy to parallelize, because all the DBMS has to do to execute it is scan the table in a single pass.
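
To make the trade-off concrete, here is a small sketch that builds such a table once and then queries it without a join. It uses SQLite through Python rather than a columnar database, and the `Users` and `Payments` source tables, their columns and the sample rows are hypothetical; only the `UserPayment` name comes from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical normalized source tables.
    CREATE TABLE Users (user_id INTEGER PRIMARY KEY, gender TEXT);
    CREATE TABLE Payments (user_id INTEGER, payment_date TEXT, money REAL);
    INSERT INTO Users VALUES (1, 'F'), (2, 'M'), (3, 'F');
    INSERT INTO Payments VALUES
        (1, '2011-02-10', 10.0),
        (2, '2011-05-03', 4.0),
        (3, '2011-11-20', 6.0),
        (1, '2010-12-31', 100.0);

    -- Pay the cost of the join once, at load time.
    CREATE TABLE UserPayment AS
    SELECT u.user_id, u.gender, p.payment_date, p.money
    FROM Payments AS p
    JOIN Users AS u ON u.user_id = p.user_id;
""")

def spend_by_gender(conn, year):
    """Single pass over the denormalized table, no join at query time."""
    return conn.execute(
        "SELECT gender, SUM(money) FROM UserPayment "
        "WHERE gender IN ('M', 'F') AND strftime('%Y', payment_date) = ? "
        "GROUP BY gender ORDER BY gender",
        (str(year),),
    ).fetchall()

totals = spend_by_gender(conn, 2011)
print(totals)
```

The gender value is now stored once per payment instead of once per user, which is the duplication the text warns about.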

Watch data streams


Be sure to draw a diagram of how data moves through the system, and make sure it contains no cycles (feedback loops).

Do not allow the objects under study to receive information from the BI system. For example, suppose a manager analyzes the data and decides to email a reminder to a specific group of users. The list of recipients should not be generated directly in the BI system.
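
Once the data flow diagram is written down as a graph, the check for cycles can be automated. The sketch below uses a standard depth-first search; the component names (`site`, `logs`, `warehouse`, `reports`, `mailer`) are made up for illustration.

```python
def find_cycle(flows):
    """Return one cycle as a list of nodes, or None if the graph is acyclic.

    flows maps each component to the components it sends data to.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in flows.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # nxt is already on the current path: we closed a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in flows:
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None

# One-way flow: the site never reads from the BI system.
safe = {"site": ["logs"], "logs": ["warehouse"], "warehouse": ["reports"], "reports": []}

# Feedback: a mailer takes its recipient list from the warehouse and acts on the site.
looped = dict(safe, warehouse=["reports", "mailer"], mailer=["site"])

print(find_cycle(safe))
print(find_cycle(looped))
```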

Implementation


Check the collected data


When deploying a BI system, you need to check the incoming data, and do so very carefully. For example, if you import user attributes, be sure to check the distribution of registration dates, birthdays and so on. Ideally, you should check the distribution of values in each column, or even in pairs of columns.

Often, when new data is added, every row ends up with the same value in a column. The cause is almost always human error: the developer simply forgot about that column.
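
A basic version of this check is easy to script. The following sketch counts the values in every column and flags columns that hold a single value in all rows; the row format (a list of dicts) and the sample user records are assumptions for the example.

```python
from collections import Counter

def column_profiles(rows):
    """Count how often each value occurs in each column.

    rows is a list of dicts that share the same keys.
    """
    profiles = {}
    for row in rows:
        for column, value in row.items():
            profiles.setdefault(column, Counter())[value] += 1
    return profiles

def constant_columns(rows):
    """Columns where every row holds the same value, a common sign
    that the column was never filled in properly."""
    return sorted(
        column
        for column, counts in column_profiles(rows).items()
        if len(counts) == 1
    )

# Hypothetical import batch in which "country" was forgotten.
users = [
    {"user_id": 1, "registered": "2012-01-03", "birth_year": 1985, "country": None},
    {"user_id": 2, "registered": "2012-01-03", "birth_year": 1990, "country": None},
    {"user_id": 3, "registered": "2012-02-11", "birth_year": 1985, "country": None},
]

print(constant_columns(users))
print(column_profiles(users)["birth_year"].most_common(1))
```

The same per-column counters can be plotted or compared against the previous load to spot suspicious shifts in a distribution.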

There is no superfluous data, only repetitions


When you decide which data to import into the system, remember that no data is unnecessary; some of it is merely repeated. It is the repetitions that deserve suspicion. It is better to import the extra data and verify that the values really are the same than to discard the repeats in advance. This helps uncover errors in the source systems.

This is how many bugs were found and fixed while the BI system was being deployed at Badoo: errors in user profiles, errors in city data, and even a few errors in financial data.
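
One simple way to exploit repeated data is to compare the same field for the same record across two sources and list the disagreements. The sketch below assumes each source is a dict keyed by user id; the source names and sample values are invented.

```python
def find_mismatches(source_a, source_b, fields):
    """Compare the same fields for the same keys in two sources.

    Returns a list of (key, field, value_in_a, value_in_b) tuples.
    Keys present in only one source are ignored here.
    """
    mismatches = []
    for key in sorted(source_a.keys() & source_b.keys()):
        for field in fields:
            a = source_a[key].get(field)
            b = source_b[key].get(field)
            if a != b:
                mismatches.append((key, field, a, b))
    return mismatches

# Hypothetical data: the same attributes stored by two different systems.
profiles = {
    1: {"city": "London", "gender": "F"},
    2: {"city": "Paris", "gender": "M"},
}
billing = {
    1: {"city": "London", "gender": "F"},
    2: {"city": "Lyon", "gender": "M"},
    3: {"city": "Rome", "gender": "F"},
}

diffs = find_mismatches(profiles, billing, ["city", "gender"])
print(diffs)
```

Each reported tuple is a candidate bug in one of the source systems, which would have gone unnoticed if only one copy of the field had been imported.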

Do not strive for a 100% match


When comparing and reconciling data from different sources, do not chase a 100% match. If you have reached a 95% match, that is probably enough. After all, you are not designing an accounting system in which every penny has to be tracked.

Very often, data discrepancies have objective causes, such as clocks that are out of sync: for example, the time a payment is recorded in your own billing system versus the time recorded in the payment system. A difference of 1 second on December 31 can mean that the same payment is dated to different years.
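
The December 31 case is easy to reproduce. In the sketch below, the two timestamps and the `match_rate` helper are illustrative assumptions; the point is that an exact comparison reports a mismatch that a small tolerance explains away.

```python
from datetime import datetime, timedelta

# The same payment as seen by two systems whose clocks differ by one second.
billing_time = datetime(2011, 12, 31, 23, 59, 59)
gateway_time = billing_time + timedelta(seconds=1)  # 2012-01-01 00:00:00

def match_rate(pairs, tolerance=timedelta(0)):
    """Share of timestamp pairs that agree within the given tolerance."""
    if not pairs:
        return 1.0
    matched = sum(1 for a, b in pairs if abs(a - b) <= tolerance)
    return matched / len(pairs)

pairs = [
    (datetime(2011, 6, 1, 12, 0, 0), datetime(2011, 6, 1, 12, 0, 0)),
    (billing_time, gateway_time),
]

print(billing_time.year, gateway_time.year)
print(match_rate(pairs))
print(match_rate(pairs, tolerance=timedelta(seconds=5)))
```

A yearly report built from each source would place this payment in a different year, even though neither system is wrong.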

Conclusion


These tips are not universal, and you can find an exception to each of them. Do not treat them as absolute truth. On the contrary, the more counterexamples you can come up with, the better you will understand the essence of these rules. And if you have specific questions, we will try to answer them.

Alexey (alexxz) Eremihin, developer at Badoo.

Source: https://habr.com/ru/post/146928/

