
Myths and legends about Big Data



One of our clusters for pilot tasks (data nodes: 18 servers / 2 CPUs, 12 cores, 64 GB RAM / 12 disks × 3 TB SATA; HP DL380g)



- What is Big Data in general?

Everyone knows it means processing huge volumes of data. But working with an Oracle database of 20 GB, or even 4 petabytes, is not yet Big Data; it is just a high-load database.



- So what is the key difference between Big Data and “ordinary” highload systems?

It lies in the ability to build flexible queries. A relational database, by virtue of its architecture, is designed for short, fast queries arriving in a uniform stream. If you suddenly need to go beyond such queries and assemble a new, complex one, you will have to redesign the database, or it will die under load.


- Where does this new load come from?

If you dig a little into the architecture, you will see that traditional databases store information in a highly dispersed way. For example, a subscriber's number sits in one table on one server, and their balance in another table. Performance requires maximum data partitioning, and as soon as we start doing complex joins, performance drops sharply.



- Is there an example of such a task?

Here is one: we became curious about when people switch from old "brick" phones to modern smartphones. It turned out there is a fairly clear threshold: for a Moscow resident, the combination of an old 2007-era handset, the use of certain services, and high traffic consumption could signal a readiness to switch to a smartphone. Having found that boundary, we looked at exactly which models people move to. Then we went further and decided to find those ready to buy such a phone today. From there it is a question of technology. We have about 250 sales offices in Moscow. Using our data, we single out the group of people likely to decide on a smartphone within the week. We watch for the moment one of them comes within 50 meters of one of our stores, and then send an SMS suggesting they "come have a look", provided that store has the smartphone in stock and ready for demonstration. A query like this simply blows up a traditional system.
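The proximity trigger above can be sketched in a few lines. This is a hypothetical illustration, not VimpelCom's actual code: the class, the coordinates, and the `demoUnitReady` flag are assumptions; only the 50-meter threshold and the two conditions (phone in stock, ready for demonstration) come from the text.

```java
// Hypothetical sketch of the 50-meter store-proximity trigger.
// Coordinates and field names are illustrative assumptions.
public class GeofenceSketch {
    static final double EARTH_RADIUS_M = 6_371_000.0;

    // Haversine great-circle distance between two lat/lon points, in meters.
    static double distanceMeters(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }

    // Send the SMS only when the subscriber is inside the geofence AND
    // the store has a demo unit ready (both conditions from the text).
    static boolean shouldSendSms(double subLat, double subLon,
                                 double storeLat, double storeLon,
                                 boolean demoUnitReady) {
        return demoUnitReady
            && distanceMeters(subLat, subLon, storeLat, storeLon) <= 50.0;
    }

    public static void main(String[] args) {
        // Two points roughly 30 m apart (illustrative Moscow coordinates).
        System.out.println(shouldSendSms(55.7558, 37.6173, 55.75607, 37.6173, true));
    }
}
```

The point of the example is the shape of the query, not the math: it touches location events, campaign membership, and store inventory at once, which is exactly the kind of cross-source join a partitioned relational schema handles poorly.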



- And how is this solved?

You need a different database architecture. If you need flexible queries, the easiest way is to store the data unstructured, because each new query would otherwise require building a new optimal structure. Regular databases aim at maximum performance within limited computing resources.



- So let's just scale them - and the problem will be solved?

In general, if there is room to scale, yes, that is the way out. Agreed, it is easier to buy a couple of servers or storage systems than to rewrite the entire database structure. With big data, however, it is not that simple. Relational databases are hard to scale, usually running into a bottleneck in a single storage system. There is a horizontal-scaling threshold beyond which it becomes easier to write a new structure than to deploy extremely complex hardware configurations.



- So what is the result?

As a result, you have a set of raw data that lends itself perfectly to analysis. It is processed by robots written in Java code. They parallelize perfectly, because there are no specific requirements on the hardware architecture: if we need more computing power, we simply allocate more resources in the virtual environment, or plug in more hardware. You cannot do that with a production relational database.
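A minimal sketch of what such a "Java robot" might look like: a stateless pass over raw event records that parallelizes trivially because no record depends on another. The record format (`subscriber;eventType`) and the class name are assumptions for illustration, not the actual VimpelCom code.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of a "Java robot" over raw, unstructured logs.
public class EventRobot {
    // Count events per subscriber across the raw log, in parallel.
    // Each line is independent, so the work splits across cores freely.
    static Map<String, Long> eventsPerSubscriber(List<String> rawLog) {
        return rawLog.parallelStream()
                .map(line -> line.split(";")[0])  // subscriber id before ';'
                .collect(Collectors.groupingBy(s -> s, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> raw = List.of("A;call", "B;sms", "A;data", "A;call");
        System.out.println(eventsPerSubscriber(raw)); // e.g. {A=3, B=1}
    }
}
```

Because the robot keeps no shared state, adding capacity really is just "giving it more resources": the same code runs unchanged on more cores or more machines.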



- But this is monstrously slow, isn't it?

Not always. There are two situations worth comparing:

  1. Short queries with a small number of joins. Here the relational architecture often wins on time, but we pay for it with a sharp increase in load from queries that do not fit the chosen architecture.
  2. Complex but uniform queries. Sometimes it is easier to spend 10 minutes writing a query against a relational database and wait two hours for the answer than to spend 4 hours writing a Java robot for Big Data and wait 5 minutes for a response. It depends on the task.
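The trade-off in point 2 is easy to make explicit: what matters is total turnaround, development time plus wait time. A toy calculation using the figures from the text (in minutes):

```java
// Toy comparison of total turnaround for the two approaches,
// using the illustrative figures from the text (minutes).
public class Turnaround {
    static int total(int devMinutes, int waitMinutes) {
        return devMinutes + waitMinutes;
    }

    public static void main(String[] args) {
        int relational = total(10, 120); // 10 min query + 2 h wait  = 130
        int bigData    = total(240, 5);  // 4 h robot  + 5 min wait = 245
        // For a one-off uniform query, the relational path wins here;
        // rerun the robot many times and the balance flips.
        System.out.println(relational < bigData);
    }
}
```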


Therefore, as a rule, Big Data is used for tasks that do not repeat often and are rather out of the ordinary. If we decide to make one of the Big Data tasks permanent, and it can be solved with a relational database, we simply "hard-code" it into a regular database with a fixed schema.



- What are some well-known examples of using Big Data?

Sooner or later, almost everyone who works with big data arrives at a similar approach: structured, well-partitioned databases with a competent architecture for fast queries, and unstructured raw data for complex one-off ones. Habr has mentioned Twitter's approach to architecture; they use something similar.



- Why, then, is everyone at conferences talking about Big Data?

Because the trend is fashionable and journalists love the word. Even if you process 50 thousand records for an online store and call it Big Data, it sounds impressive enough to put in a press release.



- So one of the goals of Big Data is the chance to get away from long project cycles?

Yes, and from the mass of workarounds bolted onto traditional databases. The Java toolset has long been known, but it began to be used in analytics relatively recently. When solving some analytical problems with traditional methods, you cannot say in advance what will go wrong; analyzing one nasty situation with a relational database can take several months. The Big Data approach is completely different: data is collected in real time, stored without processing, and then processed on demand according to current tasks, which can change constantly.



- Are there examples of already solved problems where this was visible?

Yes. Partners in Europe looked for devices inside the LTE coverage area that were never actually connected to LTE: for example, an iPhone owner with an old SIM, or the owner of a Chinese phone who does not even know LTE exists. The project took our partners 6 months; we implemented the same thing as a pilot on our platform in 3 weeks. In our implementation we found quite a few people with LTE-capable devices who do not use that functionality and perhaps do not even know about it. Or another example. VimpelCom has a unit that analyzes the devices subscribers use, for the sake of optimizing service delivery. For example, once the number of devices of one model exceeds 500, we begin to support them, in particular by requesting information from the manufacturer so that we can answer potential customer questions. Our toolkit lets us quickly build such one-off reports on device usage in our network.
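The LTE pilot boils down to a simple filter over subscriber records: the device supports LTE, but no LTE session was ever observed. A hypothetical sketch (the record type and field names are assumptions, not the pilot's actual data model):

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the LTE pilot: find subscribers whose device
// supports LTE but who have never had an LTE session.
public class LteAudit {
    record Subscriber(String id, boolean deviceSupportsLte, long lteSessions) {}

    static List<String> lteCapableButUnused(List<Subscriber> subs) {
        return subs.stream()
                .filter(s -> s.deviceSupportsLte() && s.lteSessions() == 0)
                .map(Subscriber::id)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Subscriber> subs = List.of(
                new Subscriber("A", true, 0),   // LTE phone, never used LTE
                new Subscriber("B", true, 42),  // LTE phone, active LTE user
                new Subscriber("C", false, 0)); // no LTE support
        System.out.println(lteCapableButUnused(subs)); // [A]
    }
}
```

The logic is trivial; the six-month versus three-week difference in the text comes from how quickly the two platforms could join device capabilities with session history, not from the filter itself.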



- What is the structure of the platform?

  1. Data sources. We put all the data in one place without even asking whether it will be needed; we just collect the raw array. For example: events (calls, drops, etc.), geolocation, CRM data, billing data, account top-up data, and so on.
  2. An idea factory: commercial ideas that can be realized if developer time and computing power are allocated to them.
  3. A pilot-project zone: demo versions implemented on a small sample of subscribers. And there is production (more on it below).
  4. The platform itself. VimpelCom's Big Data system already integrates a significant number of so-called "nodes", machines that collect and analyze data almost in real time, and we will keep expanding this network. Verizon, for example, already has about five hundred sites in a similar system. Our platform is split into a cluster for pilots (the sandbox) and a cluster for production. In principle they hardly differ, except in power and support. The production cluster is more powerful; there is no room for experimentation there, it is a "mill" for concentrating and processing data. New ideas are worked out in the sandbox, and all data sources are connected to it as well as to the production cluster.




- Are there examples of predictive tasks of this kind?

Yes: we are very interested in knowing when a person is about to travel to another country. Not to tell them on arrival "Welcome to Kazakhstan", but to offer tariff options and local prices before the trip even starts. One of the projects is analyzing the flow of customers at airports. You need to cut off those returning home (that part is simple), cut off airport staff and taxi drivers (by their previous visits), and identify those who will board a plane in an hour or two. While they wait for the plane, they receive an SMS with a suggestion: sometimes a general one (where and how to find information), in some cases one tailored to a specific region or country (if the terminal serves only that direction).
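The airport filter described above is essentially three rules applied in order. A hypothetical sketch with illustrative thresholds (the field names and cut-off values are assumptions, not the real model):

```java
// Hypothetical sketch of the airport-flow filter: classify people
// seen at the airport by a few simple rules. Thresholds and field
// names are illustrative assumptions.
public class AirportFilter {
    record Visitor(boolean homeNetworkAbroadRecently, // just landed, returning home
                   int airportVisitsLast30Days,       // staff/taxi show up daily
                   long minutesAtAirport) {}

    // True only for likely departing passengers: not a returnee, not a
    // regular (staff or taxi driver), and lingering the way departures do.
    static boolean likelyDeparting(Visitor v) {
        if (v.homeNetworkAbroadRecently()) return false; // returning home
        if (v.airportVisitsLast30Days() > 10) return false; // staff/taxi
        return v.minutesAtAirport() >= 60; // waiting for a flight
    }

    public static void main(String[] args) {
        System.out.println(likelyDeparting(new Visitor(false, 1, 90)));  // passenger
        System.out.println(likelyDeparting(new Visitor(false, 25, 90))); // taxi driver
    }
}
```

Each rule needs a different data source (roaming events, visit history, current location dwell time), which again is why a raw, centrally collected store makes this kind of one-off task cheap.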



- Who does this at Beeline?

My colleague Victor Bulgakov and I. I started in telecom in the Netherlands, where I was an architect for real-time billing and MNP. Then came HP and KPN; I was responsible for implementing business intelligence and data mining at Adidas, and now Beeline. Victor worked at Inkombank with ERP. Since 1999 we have been doing business analytics (in fact, I built it all from scratch), and Big Data is now a natural continuation.



You might want to learn how we work with data at a low level, or hear about the hardware. If the topic is interesting, write.

Source: https://habr.com/ru/post/218669/


