
Hello! Today we will talk about implementing machine learning in Scala. I will start by explaining how we ended up here. For a long time our team used Python for everything machine-learning related. It is convenient: there are many useful libraries for data preparation and a good development infrastructure, namely Jupyter Notebook. Everything was fine until we ran into the problem of parallelizing computations in production and decided to use Scala there. Why not, we thought: there are plenty of libraries, and even Apache Spark is written in Scala! So today we develop models in Python and then repeat the training in Scala for further serialization and use in production. But, as they say, the devil is in the details.
Let me clarify right away, dear reader: this article was not written to shake Python's reputation in machine learning. Its main goal is to open the door to the world of machine learning in Scala, to give a brief overview of an alternative approach based on our experience, and to describe the difficulties we encountered along the way.
In practice, everything turned out to be less rosy: there are not that many libraries implementing classical machine learning algorithms, and those that exist are often open-source projects without the backing of large vendors. Yes, of course, there is Spark MLlib, but it is strongly tied to the Apache Hadoop ecosystem, and we really did not want to drag it into a microservice architecture.
We needed a solution that would save the world and restore our peaceful sleep, and it was found!
What do you need?
When we chose a tool for machine learning, we proceeded from the following criteria:
- it should be simple;
- despite the simplicity, wide functionality is still required;
- we really wanted to be able to develop models in a web interpreter rather than through the console or endless builds and compilations;
- documentation is important;
- ideally, there should be support, or at least someone responding to GitHub issues.
What did we see?
- Apache Spark MLlib : did not suit us. As mentioned above, this set of libraries is strongly tied to the Apache Hadoop stack and to Spark Core itself, which is too heavy to build microservices on.
- Apache PredictionIO : an interesting project with many contributors and documentation with examples. Essentially, it is a REST server that serves models. There are ready-made models, for example for text classification, whose launch is described in the documentation, and the documentation also explains how to add and train your own models. It did not suit us either, since Spark is used under the hood and this is more of a monolithic solution than a microservice architecture.
- Apache MXNet : an interesting framework for working with neural networks, with support for both Scala and Python. This is convenient: you can train a neural network in Python and then load the saved result from Scala when building a production solution. We use it in our production systems, and there is a separate article about it.
- Smile : very similar to Python's scikit-learn package. There are many implementations of classical machine learning algorithms, good documentation with examples, responsive GitHub support, a built-in visualizer (based on Swing), and Jupyter Notebook can be used to develop models. This is exactly what we needed!
Environment preparation
So, we have made our choice ;). I will show how to run it in Jupyter Notebook using the k-means clustering algorithm as an example. The first thing we need to do is install a Jupyter Notebook with Scala support. This can be done via pip, or you can use an already compiled and configured Docker image. I am for the simpler, second option.
To make Jupyter work with Scala, I wanted to use BeakerX, which comes as a Docker image available in the official BeakerX repository. This image is recommended in the Smile documentation, and you can start it like this:
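The standard launch command from the BeakerX documentation (8888 is Jupyter's default port) looks like this:

```shell
docker run -p 8888:8888 beakerx/beakerx
```

After the container starts, the notebook is available at http://localhost:8888.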
But here the first trouble was waiting: at the time of writing, BeakerX 1.0.0 was installed inside the beakerx/beakerx image, while version 1.4.1 was already available in the project's official GitHub repository (more precisely, the latest release is 1.3.0, but 1.4.1 is in master and it works :-)).
Naturally, I wanted to work with the latest version, so I built my own image based on BeakerX 1.4.1. I will not bore you with the Dockerfile contents; here is a link to it.
By the way, those who use my image get a small bonus: the examples directory contains a k-means example for a random sequence with plotting (not quite a trivial task in Scala notebooks).
Downloading Smile to Jupyter Notebook
Great, the environment is ready! Create a new Scala notebook in our working directory; then we need to download the Smile library from Maven:
%%classpath add mvn com.github.haifengl smile-scala_2.12 1.5.2
After the code has been executed, a list of downloaded jar files will appear in its output block.
Next step: import the necessary packages for the example.
import java.awt.image.BufferedImage
import java.awt.Color
import javax.imageio.ImageIO
import java.io.File
import smile.clustering._
Prepare data for clustering
Now we will solve the following problem: generate an image consisting of zones of the three primary colors (red, green and blue, i.e. R, G, B), with one color predominating. We will cluster the pixels of the image, take the cluster with the most pixels, change their color to gray, and build a new image from all the pixels. Expected result: the zone of the predominant color turns gray, while the other zones keep their color.
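A minimal sketch of such a generator. The 640×360 size and the exact band proportions are assumptions: 640 × 360 = 230,400 pixels, which matches the point count reported in the clustering output below, and green is made the predominant color.

```scala
import java.awt.Color
import java.awt.image.BufferedImage

// 640 x 360 = 230,400 pixels, matching the point count in the
// clustering output below. Three vertical bands of primary colors;
// green is the predominant zone (roughly 55% of the pixels).
val width  = 640
val height = 360
val testImage = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB)

for (x <- 0 until width; y <- 0 until height) {
  val color =
    if (x < width * 22 / 100) Color.RED       // ~22% of the columns
    else if (x < width * 45 / 100) Color.BLUE // ~23% of the columns
    else Color.GREEN                          // the predominant zone
  testImage.setRGB(x, y, color.getRGB)
}
```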
As a result of the execution of this code, the following image is displayed:

Next step: convert the image into a set of pixels. By a pixel we mean an entity with the following properties:
- coordinate on the wide side (x);
- coordinate on the narrow side (y);
- color value;
- optional class / cluster number (empty until clustering is done).
It is convenient to use a case class as this entity:
case class Pixel(x: Int, y: Int, rgbArray: Array[Double], clusterNumber: Option[Int] = None)
Here the rgbArray field holds the three color values for red, green and blue (for example, Array(255.0, 0, 0) for red).
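The conversion itself might look like this, assuming the generated image is bound to testImage (the name used later when saving) and the imports from the beginning of the notebook are in scope:

```scala
// Convert the BufferedImage into a collection of Pixel entities,
// one per (x, y) coordinate, with the cluster number left empty.
val pixels: Array[Pixel] =
  (for {
    x <- 0 until testImage.getWidth
    y <- 0 until testImage.getHeight
  } yield {
    val c = new Color(testImage.getRGB(x, y))
    Pixel(x, y, Array(c.getRed.toDouble, c.getGreen.toDouble, c.getBlue.toDouble))
  }).toArray
```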
This completes the data preparation.
Clustering pixels by color
So, we have a collection of pixels of three primary colors, so we will cluster the pixels into three classes.
The documentation recommends setting the runs parameter in the range from 10 to 20.
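The clustering call might look like this in smile-scala 1.5.x; the kmeans helper comes from smile.clustering, the runs parameter name follows that API, and pixels is the collection built above:

```scala
// Feed each pixel's RGB vector to k-means with k = 3
// (one cluster per primary color).
val data: Array[Array[Double]] = pixels.map(_.rgbArray)
val clusters: KMeans = kmeans(data, k = 3, runs = 20)
```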
This code creates an object of type KMeans. The output block will contain information about the clustering results:
K-Means distortion: 0.00000
Clusters of 230400 data points of dimension 3:
  0     50813 (22.1%)
  1     51667 (22.4%)
  2    127920 (55.5%)
One of the clusters does indeed contain more pixels than the rest. Now we need to label our pixel collection with classes from 0 to 2.
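One way to do the labeling, assuming the KMeans result is bound to a value named clusters; in Smile 1.5.x the per-point labels are exposed via getClusterLabel:

```scala
// getClusterLabel returns an Array[Int] parallel to the data array
// that was passed into kmeans, so a simple zip attaches each label.
val labeled: Array[Pixel] =
  pixels.zip(clusters.getClusterLabel).map { case (p, label) =>
    p.copy(clusterNumber = Some(label))
  }
```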
Repaint the image
Only a small matter remains: select the cluster with the largest number of pixels and repaint all pixels belonging to that cluster gray (by changing the rgbArray value).
There is nothing complicated here: we simply group by cluster number (our Option[Int]), count the number of elements in each group, and pick the cluster with the maximum number of elements. Then we change the color to gray only for the pixels that belong to that cluster.
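A sketch of that logic, assuming the labeled collection from the previous step is called labeled:

```scala
// The cluster number (an Option[Int]) of the biggest group.
val biggestCluster: Option[Int] =
  labeled.groupBy(_.clusterNumber).maxBy(_._2.length)._1

// Repaint only the pixels of that cluster gray.
val gray = Array(127.0, 127.0, 127.0)
val repainted: Array[Pixel] = labeled.map { p =>
  if (p.clusterNumber == biggestCluster) p.copy(rgbArray = gray) else p
}
```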
Create a new image and save the results.
We assemble a new image from the pixel collection:
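For example, assuming the repainted pixels are bound to repainted; modifiedImage is the name used below when saving:

```scala
// Write every pixel's color back into a fresh image of the same size.
val modifiedImage =
  new BufferedImage(testImage.getWidth, testImage.getHeight, BufferedImage.TYPE_INT_RGB)
repainted.foreach { p =>
  val Array(r, g, b) = p.rgbArray
  modifiedImage.setRGB(p.x, p.y, new Color(r.toInt, g.toInt, b.toInt).getRGB)
}
```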
Here is what we got in the end.

Save both images.
ImageIO.write(testImage, "png", new File("testImage.png"))
ImageIO.write(modifiedImage, "png", new File("modifiedImage.png"))
Conclusion
Machine learning in Scala exists. To implement the basic algorithms, you do not have to drag in some huge library. The example above shows that during development you do not have to give up familiar tools: the same Jupyter Notebook can be made to work with Scala without much difficulty.
Of course, one article is not enough for a full review of all the possibilities ;), and that was not the plan. I consider the main task, opening the door to the world of machine learning in Scala, accomplished. Whether to use these tools, and especially whether to drag them into production, is up to you!
Links