
Hi, Habrozhiteli! Previously we translated the article "Introduction to Apache Spark." Now we would like to introduce you to the book of the same name, written by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills.
In this practical book, four data-analytics experts from Cloudera describe self-contained patterns for performing large-scale data analysis with Spark. The authors give a comprehensive overview of Spark, statistical methods, and data sets collected under real-world conditions, and use these examples to demonstrate solutions to common analytical problems.
Foreword
Since we started the Spark project at Berkeley, I have cared not only about building fast parallel systems, but also about helping more and more people make use of large-scale computation. That is why I am so pleased with the release of this book, written by four data science experts and devoted to advanced analytics with Spark. Sandy, Uri, Sean, and Josh have worked with Spark for a long time and have put together a wonderful collection of material with equal parts theory and examples.
What I like most about this book is its focus on examples drawn from real applications running on real data sets. It is hard to find even one example, let alone a dozen, that covers a large amount of data and can still be run on your laptop; yet the authors managed to assemble just such a collection and set everything up so that the examples run on Spark. Moreover, they describe not only the core algorithms but also the subtle details of data preparation and model tuning that are needed to achieve good results in practice. You can take fragments of these examples and use them to solve your own problems.
Big data processing is undoubtedly one of the most exciting areas of computing today, still developing rapidly and full of new ideas. I hope this book helps you get started in this exciting new field.
Matei Zaharia,
CTO of Databricks and Vice President, Apache Spark
Introduction
Sandy Ryza
I am not one of those people who have many regrets, but it is hard to believe that anything good came of a certain lazy moment in 2011, when I was looking into how best to distribute hard discrete-optimization problems across clusters of computers. My advisor told me about this newfangled Spark he had heard of, and I essentially wrote the idea off as too good to be true and hurried back to writing my undergraduate thesis in MapReduce. Since then both of us, Spark and I, have matured somewhat, but only one of us has seen a take-off so meteoric that it is nearly impossible to resist puns about fire¹. Two years on, it has become absolutely clear that Spark deserves attention.
Spark's predecessors, an extensive family tree stretching from MPI to MapReduce, make it possible to write programs that harness enormous resources while hiding the fiddly details of how distributed systems work. Whatever data-processing needs motivated the development of these frameworks, the field of big data has become so intertwined with them that its scope is defined by what they can handle. Spark promises the next step in this evolution: to make writing distributed programs feel like writing ordinary programs.
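To give a flavor of that promise, here is a minimal sketch (not from the book) of counting word frequencies with Spark's RDD API in Scala; the file path and application name are illustrative. Note how closely the distributed version mirrors ordinary Scala collection code:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*] runs the job on all cores of a single machine, e.g. a laptop
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("data/corpus.txt")  // a distributed collection of lines
      .flatMap(_.split("\\s+"))                  // the same operators as on local Scala collections
      .map(word => (word, 1))
      .reduceByKey(_ + _)                        // but the aggregation runs across the cluster

    counts.take(10).foreach(println)
    sc.stop()
  }
}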
Spark dramatically improves the performance of ETL pipelines and relieves the pain that drives MapReduce programmers to daily desperate appeals to the gods of Hadoop ("Why?"). But for me the most exciting thing about it has always been what it opens up for complex analytics. With a paradigm that supports iterative algorithms and interactive exploration, Spark has finally become the open source framework that lets data scientists work productively with large data sets.
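As a small illustration of why this paradigm suits iterative algorithms, the hedged sketch below (again not from the book; the path, data, and iteration count are made up) caches a parsed data set in cluster memory once and reuses it on every pass, rather than re-reading it from disk on each iteration as a chain of MapReduce jobs would:

import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("IterativeSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val points = sc.textFile("data/points.txt")
      .map(_.toDouble)
      .cache()  // keep the parsed numbers in memory across iterations

    var estimate = 0.0
    for (_ <- 1 to 10) {
      // each pass reuses the cached RDD instead of re-reading the file
      estimate = (estimate + points.mean()) / 2
    }
    println(s"estimate after 10 passes: $estimate")
    sc.stop()
  }
}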
In my opinion, data science is best taught by example. To that end, my colleagues and I wrote this book, trying to capture the interplay between the most common algorithms, data sets, and design patterns in large-scale analytics. This book is not meant to be read cover to cover: flip to the chapter that describes what you are trying to do, or that simply piques your interest.
What you will find in this book
Chapter 1 places Spark in the broader context of data science and big data analytics. After that, each chapter contains a self-contained analysis performed with Spark. Chapter 2 introduces the basics of data processing with Spark and Scala through an example of data cleaning. The next few chapters cover the most important topics of machine learning with Spark, including some of the most common algorithms in canonical applications. The remaining chapters are more of a grab bag and demonstrate Spark in slightly more exotic applications, such as querying Wikipedia through latent semantic relationships in text or analyzing genomic data.
Using source code examples
Supplemental materials (source code examples, exercises, etc.) are available for download
at . This book is here to help you get your job done. In general, if example code is provided with it, you may use it in your programs and documentation. You do not need to contact us for permission unless you are reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD of examples from the publisher's books does, of course, require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product's documentation does require permission.
More information about the book can be found on the publisher's website.
Table of Contents
Excerpt
For Habrozhiteli, a 25% discount coupon: Spark.