In this talk I will review what I consider the three most important new features in Apache Spark: continuous streaming, streaming ML, and vectorized UDFs. Using examples, we will look at how continuous streaming differs from micro-batch processing, how much faster it is, and what restrictions come with it. We will examine a pressing problem for every machine learning specialist: how to roll a model out to production, and how to do it with the new unified Streaming ML interface. Finally, we will see how the developers have overcome what seems to be the last performance pain point of PySpark by vectorizing UDFs.
2. MOOC for Big Data: give everyone a cluster and check the solutions! - Oleg Ivchenko, Assistant @ MIPT / Data Wizard @ BigDataTeam, and Pavel Akhtyamov, Developer Analyst @ Vicman Development / Data Wizard @ BigDataTeam
Last year our team (BigDataTeam), together with Yandex, launched the Big Data for Data Engineers specialization. What makes this specialization unique is that students' solutions are tested on a real cluster. Launching such an infrastructure and integrating it with Coursera turned out to be laborious and posed many interesting engineering challenges. We will cover them in this talk, namely:
1) how to build a Spark cluster with Jupyter inside a Docker container;
2) how to embed your own pipeline for testing assignments into Coursera via the LTI interface;
3) how to transfer a Jupyter notebook to a production cluster and test it there.
3. Apache Spark on Kubernetes the easy way - Dmitry Lakhvich [KrivdaTheTriewe], Senior Research Engineer @ Tookitaki / Data Engineer @ Maximtelecom
One of the new features in Apache Spark 2.3 was experimental Kubernetes support in the main branch. In this talk I will cover the architecture of Kubernetes itself, its deployment and basic setup in a minimal configuration, and the deployment of Apache Spark applications on Kubernetes. We will look at some configuration subtleties, as well as the question of why we need yet another scheduler and what benefits it brings.
The event is free, but registration is required.
Source: https://habr.com/ru/post/352772/