Hello! Today we want to tell you about our first encounter with Big Data, which began in 2012, before the wave of hype around big data had swept the market.

By that time we already had expertise in building data warehouses. We were looking at ways to improve standard DWH architectures, because the customer wanted to process large volumes of data quickly and on a limited budget. We knew that large data volumes for a standard warehouse are handled perfectly well by MPP platforms, but in practice that is expensive. So we needed an inexpensive distributed system, and it turned out to be Hadoop. It requires minimal initial investment, and the first results can be obtained very quickly. In the longer term it offers horizontal, almost linear scaling, an open platform, and many interesting extras: NoSQL, fast data search, and an SQL-like data access language.
The test task was to study data enrichment on Hadoop: we measured how long standard data joins took. For example, intersecting 100 GB with 10 GB is a serious volume by relational database standards (indexes are of little help there, and a full scan is required). On our test servers such tasks completed in minutes, versus tens of minutes in relational storage. Considering the money spent on relational storage and the cost of a mid-range disk array for a data warehouse (on average an order of magnitude more expensive than local disks), the choice of platform for these calculations and for data storage was obvious.
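To give an idea of what such a test looks like in code, here is a minimal sketch of a classic reduce-side join in MapReduce. It is an illustration only: the class names, the tab-separated input format, and the position of the join key are our assumptions, not the actual pilot job.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {

    // Tags records of the large table with "L"; assumes tab-separated lines with the join key first.
    public static class LargeTableMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t", 2);
            ctx.write(new Text(fields[0]), new Text("L" + value));
        }
    }

    // Tags records of the small table with "S".
    public static class SmallTableMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t", 2);
            ctx.write(new Text(fields[0]), new Text("S" + value));
        }
    }

    // For each key, pairs every large-table record with every small-table record.
    // A production job would stream the larger side with a secondary sort instead of
    // buffering it, as described in "MapReduce Design Patterns".
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> large = new ArrayList<>();
            List<String> small = new ArrayList<>();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("S")) { small.add(s.substring(1)); } else { large.add(s.substring(1)); }
            }
            for (String l : large) {
                for (String s : small) {
                    ctx.write(key, new Text(l + "\t" + s));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reduce-side join");
        job.setJarByClass(ReduceSideJoin.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, LargeTableMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, SmallTableMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```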
To test this approach, we needed:
- Hadoop development skills
- a test cluster
We did a pilot project on the Hadoop stack, drawing on the books we had read: "Hadoop: The Definitive Guide" and "MapReduce Design Patterns". Our team already had Java expertise, and the transition to the MapReduce paradigm was not a problem even for those coming from Oracle Database. At the time, reading and working through a couple of books was enough to get started.
To speed up testing, we used Amazon EC2 cloud services, which let us get the hardware without delay and start installing the Hadoop stack from Cloudera. Within two days the test environment was ready. We used 10 instances, each with 8 GB of RAM, 2 CPUs and a 50 GB disk; even with the default triple data replication, that was enough, with room to spare, for the pilot task. The number of 10 instances was found experimentally: with fewer instances, performance dropped dramatically. Today, with the evolution of vendor distributions, a cluster can be set up "in a couple of clicks."
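A quick back-of-the-envelope check (our own arithmetic, not an exact figure from the project): 10 instances × 50 GB = 500 GB of raw disk; divided by the default replication factor of 3, that gives roughly 165 GB of usable HDFS capacity, comfortably above the ~110 GB of pilot data (100 GB + 10 GB) plus intermediate job output.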
However, joins are not Hadoop's main calling. Its strength lies in analytics. With that in mind, we took on our first real case. The pilot task was to detect subscribers entering the departure area at Moscow airports and send them a relevant mobile offer. The only input data were the subscriber traffic and a list of the cell towers serving the airport's departure zone. On its own, that is not yet Big Data.
Big Data enters the picture with the second requirement: to identify and exclude from the final sample everyone seeing passengers off or meeting them, taxi drivers, shop staff, and so on. A cell tower's coverage is not limited by the notional boundaries of the departure zone, so any nearby subscribers can end up in the sample, including people outside the airport building.
Everything was great, except that the Amazon cluster could not be used for this: we were dealing with the mobile operator's personal data. It became obvious that adopting Big Data was only a matter of time, and the customer decided to buy their first cluster. The cluster was sized a year ahead, taking the Big Data development strategy into account, and 20 HP 380 G8 machines were purchased (2 CPUs / 48 GB RAM / 12 × 3 TB disks).
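As a rough calculation (again our own, not a figure from the project documentation): 20 nodes × 12 × 3 TB = 720 TB of raw disk; with triple replication that is on the order of 240 TB of usable HDFS capacity, before subtracting space for the OS, temporary files and intermediate job output.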
Six months after starting to work with Big Data, our team had grown to 5 people, and by the end of 2013 it already numbered 14. We had to get a thorough understanding of everything related to the Hadoop stack. Our employees completed certified Cloudera courses: cluster administration, MapReduce development, and HBase. This background allowed us to quickly grasp all the subtleties of Hadoop, get a feel for good MapReduce development practices, and get down to business. By the way, there are now plenty of good online courses (for example, on Coursera).
Implementing the first business task meant running a permanent job acting as a trigger: picking out, from the incoming data stream, the records with the base station parameters we needed. Subscriber profiles were computed in Hadoop on a daily basis: manually at first, and then with machine learning. The profile data was then loaded into Redis, an in-memory key/value store. The incoming data stream was processed with Apache Storm; at this stage we matched each event against the subscriber profile and the cell tower and sector of interest. The stream then passed through the subscriber contact policy (for example, so that a subscriber would not receive SMS more often than the assigned limit) and went into the SMS sending queue.
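To make the pipeline a bit more concrete, here is a rough sketch of what the enrichment step could look like: a Storm bolt that filters events by tower and looks up the daily-built profile in Redis. The class, the field names, the Redis key layout and the segment value are illustrative assumptions, not the production code.

```java
import java.util.Map;
import java.util.Set;

import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import redis.clients.jedis.Jedis;

public class ProfileEnrichmentBolt extends BaseBasicBolt {

    private final Set<String> departureZoneCells;  // tower/sector ids serving the departure area
    private transient Jedis redis;

    public ProfileEnrichmentBolt(Set<String> departureZoneCells) {
        this.departureZoneCells = departureZoneCells;
    }

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        // One Redis connection per bolt instance; profiles are refreshed daily by the Hadoop job.
        redis = new Jedis("redis-host", 6379);
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String msisdn = tuple.getStringByField("msisdn");
        String cellId = tuple.getStringByField("cell_id");

        // Keep only events coming from towers that cover the departure zone.
        if (!departureZoneCells.contains(cellId)) {
            return;
        }

        // The daily-built profile tells us whether this looks like a real traveller
        // rather than someone seeing a passenger off, a taxi driver or airport staff.
        String segment = redis.get("segment:" + msisdn);
        if ("traveller".equals(segment)) {
            collector.emit(new Values(msisdn, cellId, segment));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("msisdn", "cell_id", "segment"));
    }
}
```

Downstream of a bolt like this, another component would apply the contact policy and push the surviving events into the SMS queue.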
As an experiment, we tried to solve the problem with MapReduce alone, but it worked out badly: a high load on the cluster and a long JVM start-up for every run. Don't do this.
Today the customer has its own production cluster, and we test new technologies on virtual machines and assess how they can be applied to real tasks.
And that is roughly how our acquaintance with Big Data began.
Oh, and by the way: my name is Aleksey Bednov, and I am happy to answer your questions in the comments.