In my talk, I will describe how we relaunched Rambler / Top 100, survey the tools available on the market, and share our experience of moving from a batch architecture to real-time data processing. I will cover the architecture of both solutions and their components. I will briefly discuss the specifics of processing data with Python in Hive and the fundamental problems of storing aggregates, and touch on the advantages and disadvantages of an alternative approach. We will then look in detail at how to handle changing events with PySpark, how to work with the various components of the system from PySpark, and which problems arise and how to solve them. Finally, we will look at the results, the speed of the new system, and some pitfalls.
Spark ML includes an implementation of the ALS algorithm for recommendations, which performs quite well on most real-world tasks. In this talk, I want to present my Spark implementation of the iTALS algorithm, which generalizes ALS matrix factorization to tensors. This algorithm makes it possible to take context into account in recommendations, making them more accurate and flexible. The talk will discuss the results of an experiment comparing ALS and iTALS.
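To make the difference between the two models concrete, here is a minimal pure-Python sketch of the prediction rules only. It is not the training procedure and not the speaker's implementation. It assumes the commonly described formulation in which the matrix model scores a user-item pair by the dot product of two factor vectors, and the tensor model adds a third factor vector for the context and sums the element-wise product of all three. All names and numbers (`user_factors`, `item_factors`, `context_factors`) are hypothetical.

```python
# Hypothetical rank-2 factor vectors; in a real system these are learned
# by alternating least squares, which this sketch does not implement.
user_factors = {"alice": [1.0, 2.0]}
item_factors = {"book": [3.0, 0.5]}
context_factors = {"morning": [1.0, 0.0], "evening": [0.0, 2.0]}


def predict_matrix(user, item):
    """Matrix model: the score is the dot product of user and item factors."""
    u, v = user_factors[user], item_factors[item]
    return sum(a * b for a, b in zip(u, v))


def predict_tensor(user, item, context):
    """Three-way model: sum of the element-wise product of all three factors."""
    u, v, c = user_factors[user], item_factors[item], context_factors[context]
    return sum(a * b * w for a, b, w in zip(u, v, c))
```

With the matrix model, the same user-item pair always gets the same score. With the three-way model, the score for the same pair can differ between `"morning"` and `"evening"`, which is the sense in which context enters the recommendation.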
Dataset and DataFrame have become the preferred interfaces for working with Spark, largely thanks to the active development of the Catalyst query optimizer. In this talk, we will look at the motivation behind Spark SQL and understand why it is so critical for PySpark. We will also examine in detail how Catalyst works internally and how its functionality can be extended.
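As a rough illustration of the idea behind a rule-based optimizer, the toy sketch below represents an expression as a tree and applies one rewrite rule, constant folding, from the bottom up. This is an assumption-laden teaching example in pure Python: the classes `Literal`, `Column`, `Add` and the function `fold_constants` are hypothetical and are not Catalyst's API, which is written in Scala and is far richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def fold_constants(node):
    """Rewrite rule: replace Add(Literal, Literal) with one Literal, bottom-up."""
    if isinstance(node, Add):
        left = fold_constants(node.left)
        right = fold_constants(node.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    # Leaves (columns and literals) are returned unchanged.
    return node


# x + (1 + 2) is rewritten to x + 3 before anything is executed.
plan = Add(Column("x"), Add(Literal(1), Literal(2)))
optimized = fold_constants(plan)
```

Extending such an optimizer amounts to adding more rules of the same shape: a function that matches a tree pattern and returns an equivalent, cheaper tree.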
With dynamic resource allocation in Spark, a job can receive additional resources whenever they are available in the free pool. As a result, it can at times use the full capacity of the cluster and finish its computations faster. In this talk, I will describe how dynamic resource allocation made it possible for 30-40 students to work at the same time as a lab assignment deadline approached, and to live happily.
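For orientation, the sketch below collects the standard Spark properties that control dynamic allocation and renders them as `spark-submit` arguments. The property names are Spark's own. The helper functions `dynamic_allocation_conf` and `to_submit_args` and the executor bounds are hypothetical and are not the settings used in the talk.

```python
def dynamic_allocation_conf(min_executors, max_executors, idle_timeout="60s"):
    """Return Spark properties that enable dynamic allocation within bounds."""
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Executors idle for longer than this are released back to the pool.
        "spark.dynamicAllocation.executorIdleTimeout": idle_timeout,
        # The external shuffle service keeps shuffle files available
        # after the executors that wrote them have been removed.
        "spark.shuffle.service.enabled": "true",
    }


def to_submit_args(conf):
    """Render the properties as `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Illustrative bounds only: each application may grow from 1 to 4 executors.
conf = dynamic_allocation_conf(min_executors=1, max_executors=4)
submit_args = to_submit_args(conf)
```

Capping `maxExecutors` per application is one plausible way to keep a shared cluster usable for many concurrent users, since no single job can hold the whole pool while others wait.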
Source: https://habr.com/ru/post/332546/