
Spark: data mining up to 30x faster than Hadoop

The University of California at Berkeley has developed Spark, a framework for distributed computing on clusters. On some tasks it outperforms Hadoop by a factor of 10-30 while retaining the scalability and reliability of MapReduce.

Speedups of up to 30x are possible on workloads that repeatedly access the same data set, such as interactive data mining and the iterative algorithms widely used in machine learning systems. In fact, the project was created for these two use cases. However, Spark outperforms Hadoop not only in machine learning systems but also in traditional data processing applications.

The main innovation in Spark is a new abstraction called resilient distributed datasets (RDDs): read-only collections of objects distributed across the machines of a cluster. They can be rebuilt after a failure and can reside in memory. For example, on RDDs of up to 39 GB, access times under one second are achieved.

To simplify programming, Spark is integrated with the Scala 2.8.1 programming language, so that RDDs can be manipulated as easily as local collections. In addition, Spark runs on top of the Mesos cluster manager, so it can share a cluster with Hadoop and other frameworks.
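
The examples below use a handle named spark; in the Spark API of that era this is a SparkContext, created with a master string (a Mesos master for cluster mode, or a local mode for testing). A minimal sketch, assuming the Spark 0.x Scala API; the exact constructor arguments, master string format, and the parallelize call are assumptions here, not taken from the article:

import spark.SparkContext

// "local[2]" runs Spark in-process with two worker threads; a Mesos
// master URL would be passed here to run on a real cluster instead.
val spark = new SparkContext("local[2]", "SparkExamples")

// RDDs can also be created from ordinary Scala collections; every
// transformation produces a new, read-only RDD.
val numbers = spark.parallelize(1 to 1000000)
val evens = numbers.filter(n => n % 2 == 0)
evens.count()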
Here are some examples.

Text search

val file = spark.textFile( "hdfs://..." )
val errors = file. filter ( line => line.contains("ERROR") )
// Count all the errors
errors. count ()
// Count errors mentioning MySQL
errors. filter ( line => line.contains("MySQL") ). count ()
// Fetch the MySQL errors as an array of strings
errors. filter ( line => line.contains("MySQL") ). collect ()


This example searches log files for error messages. The closures passed as arguments (such as line => line.contains("ERROR")) are ordinary Scala functions that are automatically shipped to the cluster machines, while filter, count, and collect are Spark operators.
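
Such closures can also capture local variables from the driver program; the captured values are shipped to the workers along with the closure. A small sketch reusing the file RDD from above (the pattern variable is purely illustrative):

// The captured value of `pattern` travels to the workers with the closure.
val pattern = "ERROR"
file.filter(line => line.contains(pattern)).count()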

Search for text in memory
Spark can cache an RDD in memory to speed up repeated access to the same data set. For the previous example, we can simply add one line that caches only the error messages in memory.

errors.cache()

After this, processing of this data is greatly accelerated.
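
The cached data is materialized the first time the RDD is computed; once that has happened, further queries against it are served from memory. A small sketch reusing only the operations already shown (the search strings are just examples):

errors.filter(line => line.contains("MySQL")).count()   // served from the in-memory cache
errors.filter(line => line.contains("timeout")).count() // served from the in-memory cache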

Count the number of words
This example chains several operations to build a data set of (String, Int) pairs and write it to a file.

val file = spark.textFile( "hdfs://..." )
val counts = file. flatMap ( line => line.split(" ") )
. map ( word => (word, 1) )
. reduceByKey ( _ + _ )
counts. saveAsText ( "hdfs://..." )


Logistic regression
This is a statistical model used to predict the likelihood of an event by fitting data to a logistic curve. The iterative training algorithm is widely used in machine learning systems, but also in other applications such as spam recognition. Because it makes repeated passes over the same input data, it benefits especially from caching that data in RAM.

val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)


The diagram compares the performance of Spark and Hadoop when computing a logistic regression model on a 30 GB data set on an 80-core cluster.



Spark is published under the free BSD license. The project site provides a download page, documentation, and a mailing list for questions.

Source: https://habr.com/ru/post/122514/

