
ML on Scala with a smile, for those who are not afraid of experiments



Hello! Today we will talk about implementing machine learning in Scala. I will start by explaining how we got here. For a long time our team used all the possibilities of machine learning in Python. It is convenient: there are many useful libraries for data preparation and a good development infrastructure, by which I mean Jupyter Notebook. Everything was fine until we ran into the problem of parallelizing computations in production and decided to use Scala there. Why not, we thought: there are plenty of libraries out there, even Apache Spark is written in Scala! So today we develop models in Python and then repeat the training in Scala for further serialization and use in production. But, as they say, the devil is in the details.

Let me make one thing clear right away, dear reader: this article was not written to shake Python's reputation in machine learning. No, the main goal is to open the door to the world of machine learning in Scala, to give a small overview of an alternative approach based on our experience, and to tell you what difficulties we encountered.

In practice, everything turned out to be not so rosy: there are not many libraries implementing the classical machine learning algorithms, and those that exist are often open-source projects without the backing of large vendors. Yes, of course, there is Spark MLlib, but it is strongly tied to the Apache Hadoop ecosystem, and I really did not want to drag it into a microservice architecture.
We needed a solution that would save the world and bring back peaceful sleep, and it was found!

What do you need?


When we chose a tool for machine learning, we proceeded from the following criteria:


What did we see?



Environment preparation


So, the choice is made ;). I will show how to run it in Jupyter Notebook using the k-means clustering algorithm as an example. The first thing we need is a Jupyter Notebook with Scala support. It can be installed via pip, or you can use an already built and configured Docker image. I am for the simpler, second option.

To get Jupyter and Scala working together, I wanted to use BeakerX, which ships as a Docker image available in the official BeakerX repository. This image is recommended in the Smile documentation, and you can start it like this:

    # Start the BeakerX image
    docker run -p 8888:8888 beakerx/beakerx

But here the first trouble was waiting: at the time of this writing, the beakerx/beakerx image shipped BeakerX 1.0.0, while version 1.4.1 is already available in the project's official GitHub (to be precise, the latest release is 1.3.0, but 1.4.1 is in master and it works :-)).

Naturally, I wanted to work with the latest version, so I built my own image based on BeakerX 1.4.1. I will not bore you with the contents of the Dockerfile; here is a link to it.

    # Create a working directory and start the container
    mkdir -p /tmp/my_code
    docker run -it \
        -p 8888:8888 \
        -v /tmp/my_code:/workspace/my_code \
        entony/jupyter-scala:1.4.1

By the way, there is a small bonus for those who use my image: the examples directory contains a k-means example for a random sequence, with plotting (which is not quite a trivial task in Scala notebooks).

Downloading Smile to Jupyter Notebook


Great, the environment is ready! Create a new Scala notebook in our working directory; then we need to download the Smile library from Maven in order to work with it.

    %%classpath add mvn
    com.github.haifengl smile-scala_2.12 1.5.2

After the code has been executed, a list of downloaded jar files will appear in its output block.

Next step: import the necessary packages for the example.

    import java.awt.image.BufferedImage
    import java.awt.Color
    import javax.imageio.ImageIO
    import java.io.File
    import smile.clustering._

Prepare data for clustering


Now let us solve the following problem: generate an image consisting of zones of three primary colors, red, green and blue (R, G, B), with one of the colors predominating. We will cluster the pixels of the image, take the cluster with the most pixels, change the color of its pixels to gray, and build a new image from all the pixels. The expected result: the zone of the predominant color will turn gray, the other zones will keep their color.

    // Image size: 640 x 360
    val width = 640
    val hight = 360

    // Create a new image
    val testImage = new BufferedImage(width, hight, BufferedImage.TYPE_INT_RGB)

    // Fill the image with zones of the three colors
    for {
      x <- (0 until width)
      y <- (0 until hight)
      color = if (y <= hight / 3 && (x <= width / 3 || x > width / 3 * 2)) Color.RED
              else if (y > hight / 3 * 2 && (x <= width / 3 || x > width / 3 * 2)) Color.GREEN
              else Color.BLUE
    } testImage.setRGB(x, y, color.getRGB)

    // Display the image
    testImage

As a result of the execution of this code, the following image is displayed:



Next step: convert the image into a set of pixels. By a pixel we mean an entity with the following properties:

- the x and y coordinates in the image;
- the color, as an array of three values (red, green, blue);
- optionally, the number of the cluster the pixel is assigned to.


It is convenient to model such an entity with a case class:

 case class Pixel(x: Int, y: Int, rgbArray: Array[Double], clusterNumber: Option[Int] = None) 

Here the rgbArray field holds the three color components red, green and blue (for example, for pure red it is Array(255.0, 0, 0)).
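For illustration, here is how a single pixel could be constructed by hand (the values are purely hypothetical):

    // A hypothetical pure red pixel in the top-left corner of the image
    val redPixel = Pixel(x = 0, y = 0, rgbArray = Array(255.0, 0.0, 0.0))

    // clusterNumber stays None until clustering assigns a label
    redPixel.clusterNumber  // None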

    // Convert the image into an array of entities (Pixel)
    val pixels = for {
      x <- (0 until testImage.getWidth).toArray
      y <- (0 until testImage.getHeight)
      color = new Color(testImage.getRGB(x, y))
    } yield Pixel(x, y, Array(color.getRed.toDouble, color.getGreen.toDouble, color.getBlue.toDouble))

    // Show the first 10 pixels
    pixels.take(10)

This completes the data preparation.

Clustering pixels by color


So, we have a collection of pixels in three primary colors; accordingly, we will cluster the pixels into three classes.

    // Number of colors (clusters)
    val countColors = 3

    // Run k-means clustering
    val clusters = kmeans(pixels.map(_.rgbArray), k = countColors, runs = 20)

The documentation recommends setting the runs parameter in the range from 10 to 20.
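Each run restarts k-means from a different random initialization and keeps the best result. If you want to see the effect yourself, a minimal sketch along these lines could be used; it assumes the returned KMeans object exposes the distortion value, as the output below suggests:

    // Hypothetical experiment: more restarts should give an equal or lower
    // distortion, at the cost of proportionally longer training time
    for (r <- Seq(1, 10, 20)) {
      val model = kmeans(pixels.map(_.rgbArray), k = countColors, runs = r)
      println(s"runs = $r, distortion = ${model.distortion}")
    }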

Executing this code creates an object of type KMeans. The output block will contain information about the clustering results:

    K-Means distortion: 0.00000
    Clusters of 230400 data points of dimension 3:
      0     50813 (22.1%)
      1     51667 (22.4%)
      2    127920 (55.5%)

One of the clusters does contain more pixels than the rest. Now we need to mark our collection of pixels with classes from 0 to 2.

    // Label the pixels with cluster numbers
    val clusteredPixels = (pixels zip clusters.getClusterLabel()).map {
      case (pixel, cluster) => pixel.copy(clusterNumber = Some(cluster))
    }

    // Show the first 10 labeled pixels
    clusteredPixels.take(10)

Repaint the image


Only a small step remains: select the cluster with the largest number of pixels and repaint all the pixels of that cluster gray (by changing the value of the rgbArray array).

    // Gray color
    val grayColor = Array(127.0, 127.0, 127.0)

    // Find the cluster with the largest number of pixels
    val blueClusterNumber = clusteredPixels.groupBy(pixel => pixel.clusterNumber)
      .map { case (clusterNumber, pixels) => (clusterNumber, pixels.size) }
      .maxBy(_._2)._1

    // Repaint the pixels of the found cluster gray
    val modifiedPixels = clusteredPixels.map {
      case p: Pixel if p.clusterNumber == blueClusterNumber => p.copy(rgbArray = grayColor)
      case p: Pixel => p
    }

    // Show the first 10 repainted pixels
    modifiedPixels.take(10)

There is nothing complicated here: we group by cluster number (our Option[Int]), count the number of elements in each group, and pull out the cluster with the maximum number of elements. Then we change the color to gray only for those pixels that belong to the found cluster.

Create a new image and save the results


We assemble a new image from the collection of pixels:

    // Create a new empty image
    val modifiedImage = new BufferedImage(width, hight, BufferedImage.TYPE_INT_RGB)

    // Fill it from the collection of pixels
    modifiedPixels.foreach {
      case Pixel(x, y, rgbArray, _) =>
        val r = rgbArray(0).toInt
        val g = rgbArray(1).toInt
        val b = rgbArray(2).toInt
        modifiedImage.setRGB(x, y, new Color(r, g, b).getRGB)
    }

    // Display the image
    modifiedImage

This is what we ended up with:



Save both images.

    ImageIO.write(testImage, "png", new File("testImage.png"))
    ImageIO.write(modifiedImage, "png", new File("modifiedImage.png"))
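Since the original motivation mentioned in the introduction was serializing a trained model for use in production, here is a minimal sketch of how that could look. It assumes the Smile model classes implement java.io.Serializable (as they do in the versions I have seen); the file name kmeans-model.ser is arbitrary:

    import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

    // Save the trained model with plain Java serialization
    val out = new ObjectOutputStream(new FileOutputStream("kmeans-model.ser"))
    out.writeObject(clusters)
    out.close()

    // Load it back, for example inside a microservice, and classify a new color
    val in = new ObjectInputStream(new FileInputStream("kmeans-model.ser"))
    val restored = in.readObject().asInstanceOf[KMeans]
    in.close()

    // Which cluster would a pure red pixel fall into?
    val clusterOfRed = restored.predict(Array(255.0, 0.0, 0.0))
    println(s"A pure red pixel falls into cluster $clusterOfRed")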

Conclusion


Machine learning in Scala exists. To implement the basic algorithms, you do not have to drag in some huge library. The example above also shows that you do not have to give up your usual tools: Jupyter Notebook can be made to work with Scala without much difficulty.

Of course, one article is not enough for a full review of all the possibilities ;), and that was not the plan. I consider the main task, opening the door to the world of machine learning in Scala, accomplished. Whether to use these tools, and especially whether to drag them into production, is up to you!

Links


Source: https://habr.com/ru/post/452914/

