What does Java need to do to fully support machine learning

Hello colleagues!

Some news on our upcoming ML / DL titles:

Nishant Shukla, "Machine Learning with TensorFlow": the book is in layout and expected in stores in January.
Delip Rao, Brian McMahan, "Natural Language Processing with PyTorch": the contract is signed, and we plan to start the translation in January.

In this context, we want to return once more to a sore subject: the weak coverage of ML / DL in the Java language. Because Java-based solutions and algorithms looked immature, we once decided against publishing Gibson and Patterson's book on DL4J, and the article published today by Humphrey Sheil suggests that we were probably right. We invite you to read the author's thoughts on how Java could finally compete with Python in machine learning.

I recently gave a talk on the present and future of machine and deep learning (ML / DL) in the enterprise. In a large enterprise, more applied topics and questions matter than at a research conference: for example, how a team can start using ML, and how best to integrate ML with the systems already in production. The talk was followed by a panel discussion on Java and machine learning.

The Java language is almost absent from the machine learning segment. There are almost no ML frameworks written in Java. There is DL4J, but personally I don't know anyone who uses it. MXNet has a Scala API but no Java API, and the framework itself is not written in Java. TensorFlow has an incomplete Java API. Yet Java has a huge share of enterprise development. Over the past 20 years, trillions of dollars have been invested in this language in almost every imaginable domain: financial services, electronic transactions, online stores, telecommunications, and the list goes on. In machine learning, however, the "first among equals" is Python, not Java. Personally, I enjoy programming in both Python and Java, but Frank Greco posed an interesting question that got me thinking:

Why should Java have to compete with Python in ML? Why not get to work on Java and give it serious ML support of its own?

Is this important?

Let's justify this topic. Since 1998, Java has been in the big leagues, and no evolutionary or revolutionary change in the enterprise has happened without it. That covers web and mobile technologies, browser versus native solutions, messaging systems, i18n and l10n support, scaling out, and storage for every kind of enterprise data imaginable, from relational databases to Elasticsearch.

This level of unconditional support has fostered a very healthy culture in Java teams: "we can do it", "roll up your sleeves and write the code". There is no magic component or API that a good team of Java developers could not extend or replace.

But this principle does not hold in machine learning. Here, Java teams have two choices:

Retrain or upskill in Python.
Use a third-party API to add machine learning capabilities to an existing enterprise system.

Neither option is truly painless. The first requires a lot of time and up-front investment, plus ongoing support costs. With the second, we risk vendor lock-in or losing the vendor's support, and we have to work with third-party components (paying the price of network transfer). We may also be integrating them into a system with critical security requirements and end up sharing information with people outside the organization. In some situations this is unacceptable.

The most damaging part, in my opinion, is the potential for cultural erosion. Teams cannot change code that they do not understand or cannot maintain, so responsibilities blur and the core work has to be delegated to someone else. Teams consisting only of Java developers risk missing the next big wave to sweep over enterprise computing: the machine learning wave.

It is therefore both important and desirable for the Java language and platform to gain first-class support for machine learning. Otherwise, there is a risk that over the next 5-10 years Java will be displaced by languages with better ML support.

Why is Python so dominant in ML?

To begin, let's discuss why Python has taken the leading position in machine and deep learning.

I suspect it all started with a completely innocent feature: support for list slicing. This support is extensible: any Python class that implements the __getitem__ and __setitem__ methods can be sliced using this syntax. The following listing demonstrates how simple and natural this Python feature is.

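    a = [1, 2, 3, 4, 5, 6, 7, 8]
    print(a[1:4])   # [2, 3, 4]: elements at indices 1 to 3
    print(a[1:-1])  # [2, 3, 4, 5, 6, 7]: from index 1 up to, but not including, the last element
    print(a[4:])    # [5, 6, 7, 8]: from index 4 to the end
    print(a[:4])    # [1, 2, 3, 4]: from the start up to index 3
    print(a[:4:2])  # [1, 3]: the same range with a step of 2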
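The extensibility mentioned above can be illustrated with a small class of our own. This is only a minimal sketch: the `Window` class and its behavior are hypothetical and not taken from any library. It shows that Python hands the object an `int` for `w[2]` and a `slice` object for `w[1:4]`, so a user-defined type can take part in the same syntax as the built-in list.

```python
class Window:
    """Hypothetical container used only to illustrate the slice protocol."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, key):
        # key is an int for w[2] and a slice object for w[1:4] or w[::2]
        if isinstance(key, slice):
            return Window(self._values[key])
        return self._values[key]

    def __setitem__(self, key, value):
        # Delegating to the list handles both w[0] = x and w[1:3] = [x, y]
        self._values[key] = value

    def __repr__(self):
        return f"Window({self._values})"


w = Window([1, 2, 3, 4, 5, 6, 7, 8])
middle = w[1:4]      # a new Window holding the elements at indices 1 to 3
w[0] = 10            # single-element assignment
w[1:3] = [20, 30]    # slice assignment
print(middle, w[:4], w[:4:2])
```

Returning a new `Window` from a slice, rather than a plain list, is a design choice made here so that sliced results keep the same type; NumPy makes a similar choice when slicing an ndarray.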

Of course, that is not all. Python code is much more compact and concise than "old-style" Java code. Exceptions are supported but not checked, and developers can easily write throwaway Python scripts to try out how something works, without drowning in the "everything is a class" worldview. Python is easy to get into.

In my opinion, though, the decisive factor (and I say this while recognizing how much hard work the Python community has put into bridging Python 2.7 and Python 3) is that the community built a much better designed and faster library for numerical operations: NumPy. NumPy is built around ndarray, an object representing an N-dimensional array. To quote the documentation: “NumPy's main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of positive integers.” All work in NumPy consists of putting your data into an ndarray and then operating on it. NumPy supports many forms of indexing, broadcasting, and vectorization for speed, and in general lets developers create and manipulate large numeric arrays with ease.

The following listing shows indexing and broadcasting on an ndarray in practice. These are key operations in ML / DL.

    import numpy as np

    # A simple broadcasting example
    a = np.array([1.0, 2.0, 3.0])
    b = 2.0
    c = a * b
    print(c)  # [2. 4. 6.]: the scalar b is stretched (broadcast) to the shape of a

    # Indexing a 2-d (rank 2) array in NumPy; the same approach works for ranks > 2
    y = np.arange(35).reshape(5,7)
    print(y)
    # the 5x7 array, shown here in repr form:
    # array([[ 0,  1,  2,  3,  4,  5,  6],
    #        [ 7,  8,  9, 10, 11, 12, 13],
    #        [14, 15, 16, 17, 18, 19, 20],
    #        [21, 22, 23, 24, 25, 26, 27],
    #        [28, 29, 30, 31, 32, 33, 34]])
    print(y[0,0])  # a single cell, first row and first column: 0
    print(y[4,])   # row 4: array([28, 29, 30, 31, 32, 33, 34])
    print(y[:,2])  # column 2: array([ 2,  9, 16, 23, 30])

Working with large multidimensional numeric arrays takes us to the very heart of programming for machine learning, and especially deep learning. Deep neural networks are graphs of nodes and edges modeled numerically. The runtime operations involved in training a network, or in running inference with it, require fast matrix multiplication.
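To make the role of matrix multiplication concrete, here is a minimal sketch of one fully connected layer written in plain Python, without NumPy. The helper names `matmul` and `relu`, the layer sizes, and the weight values are all assumptions chosen for illustration. A real framework would hand this work to an optimized native routine rather than nested Python loops.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as nested lists."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    columns = list(zip(*b))  # transpose b so each column can be paired with a row of a
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


def relu(matrix):
    """Element-wise ReLU activation: negative values become 0.0."""
    return [[max(0.0, v) for v in row] for row in matrix]


# A batch of 2 samples with 2 features each (illustrative values)
inputs = [[1.0, 2.0],
          [3.0, 4.0]]

# Weights of a layer mapping 2 inputs to 3 outputs (illustrative values)
weights = [[0.5, -1.0, 0.0],
           [0.25, 0.5, 1.0]]

# The forward pass of the layer is one matrix multiplication plus an activation
hidden = relu(matmul(inputs, weights))
print(hidden)  # [[1.0, 0.0, 2.0], [2.5, 0.0, 4.0]]
```

Every layer of a network repeats this pattern on much larger matrices, which is why the speed of the underlying multiplication dominates training and inference time.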

NumPy made much more possible: SciPy, pandas, and many other libraries are built on it. The leading deep learning libraries (Google's TensorFlow, Facebook's PyTorch) are developed primarily for Python. TensorFlow has other APIs for Go, Java, and JavaScript, but they are incomplete and considered unstable. PyTorch was originally written in Lua and saw a real surge in popularity in 2017, when it moved from that frankly niche language into the mainstream Python ML ecosystem.

Disadvantages of Python

Python is not an ideal language, and neither is its most popular runtime, CPython. It has a global interpreter lock (GIL), so scaling is not easy. Moreover, Python deep learning frameworks such as PyTorch and TensorFlow still delegate key methods to opaque implementations. NVIDIA's cuDNN library, for example, has strongly shaped the scope of the RNN / LSTM implementation in PyTorch. RNNs and LSTMs (recurrent neural networks and long short-term memory networks) are very important DL tools for business applications, in particular because they specialize in classifying and predicting variable-length sequences: for example, web navigation tracking (clickstreams), analysis of text fragments, user events, and so on.

In fairness to Python, it should be noted that this opacity / limitation applies to almost any ML / DL framework except those written in C or C++. Why? Because, to get maximum performance out of core, heavily used operations such as matrix multiplication, developers go as "close to the metal" as possible.

What does Java need to compete in this field?

I suggest that the Java platform needs three major additions. If they were implemented, a healthy and thriving machine learning ecosystem could begin to grow here:

Add native support for indexing / slicing to the core of the language, to match Python's simplicity and expressiveness. This capability could perhaps be built around an existing ordered collection, the List<E> interface. Such support would also require accepting operator overloading, which is needed for point 2.
Create a tensor implementation, probably in the java.math package but also connected to the Collections API. This set of classes and interfaces would serve as the equivalent of ndarray and provide additional indexing support, in particular the three kinds of indexing available in NumPy: field access, basic slicing, and advanced indexing.
Provide broadcasting between scalars and tensors of arbitrary (but compatible) dimensions.
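What "compatible" means in the last point can be stated precisely using NumPy's rule: shapes are compared from the trailing dimension backwards, and two dimensions are compatible when they are equal or when one of them is 1 (a missing dimension counts as 1). The following is a minimal sketch of that rule in plain Python. The function name `broadcast_shape` is hypothetical, and the sketch only computes the resulting shape; it does not perform any arithmetic on data.

```python
from itertools import zip_longest


def broadcast_shape(shape_a, shape_b):
    """Return the shape two operands broadcast to, or raise ValueError if incompatible."""
    result = []
    # Walk both shapes from the trailing dimension; a missing dimension counts as 1.
    for dim_a, dim_b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"shapes {shape_a} and {shape_b} are not compatible")
    return tuple(reversed(result))


print(broadcast_shape((5, 7), ()))      # a scalar against a 5x7 tensor: (5, 7)
print(broadcast_shape((5, 7), (7,)))    # a row vector against a 5x7 tensor: (5, 7)
print(broadcast_shape((5, 1), (1, 7)))  # a column against a row: (5, 7)
```

A call such as `broadcast_shape((5, 7), (5,))` raises `ValueError`, because the trailing dimensions 7 and 5 differ and neither is 1. A Java tensor type would need an equivalent check before applying an element-wise operation.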

If these three things were done in the core of the Java language and runtime, the way would be open to a "NumJava" equivalent of NumPy. Project Panama could also help by providing vectorized, low-level access to fast tensor operations on CPUs, GPUs, TPUs, and other hardware, so that Java ML could be the fastest of its kind.

I am not saying that these additions are trivial, far from it, but their potential benefit to the entire Java platform is enormous.

The following listing shows how our NumPy broadcasting and indexing example might look in NumJava, with a Tensor class, slice syntax supported in the core language, and the current restriction on operator overloading still in place.

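    // Java tensor example with broadcasting,
    // using the var syntax from Java 10.
    // Java does not support operator overloading, so we cannot write "a * b".
    // Perhaps that should change?
    var a = new Tensor([1.0, 2.0, 3.0]);
    var b = 2.0;
    var c = a.mult(b);

    /**
     * And this is how indexing and slicing on a Tensor
     * might look in Java.
     */
    import static java.math.Numeric.arange;

    // arange returns a Tensor, and reshape is a method on Tensor
    var y = arange(35).reshape(5,7);
    System.out.println(y);
    // tensor([[ 0,  1,  2,  3,  4,  5,  6],
    //         [ 7,  8,  9, 10, 11, 12, 13],
    //         [14, 15, 16, 17, 18, 19, 20],
    //         [21, 22, 23, 24, 25, 26, 27],
    //         [28, 29, 30, 31, 32, 33, 34]])
    System.out.println(y[0,0]);  // a single cell, first row and first column: 0
    System.out.println(y[4,]);   // row 4 (the 5th row, since counting starts at 0): tensor([28, 29, 30, 31, 32, 33, 34])
    System.out.println(y[:,2]);  // column 2 (the 3rd column, since counting starts at 0): tensor([ 2,  9, 16, 23, 30])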

Perspective and call to action

We all know that machine learning will transform the business world no less than relational databases, the Internet, and mobile technologies did in their time. There is a lot of hype around it, but there are also some very convincing articles and findings. For example, one article describes a future in which a system learns the optimal configuration of the database server, web server, and application server in the background, using machine learning. You will not even have to deploy ML in your own system yourself: one of your vendors will certainly be able to do it.

Given the pragmatic foundations outlined in this article, there could be as many machine and deep learning frameworks written in Java (and running on the JRE) as there are frameworks for the web, persistence, or XML parsing today. Just imagine! One can picture Java frameworks supporting convolutional neural networks (CNNs) for cutting-edge computer vision, and recurrent networks such as LSTMs for sequential datasets (which are key to business), all with the latest ML capabilities such as automatic differentiation and more. These frameworks would then help build and nurture the next generation of enterprise systems, which could be integrated seamlessly with existing Java systems using the same tools: IDEs, testing frameworks, continuous integration. Most importantly, they would be written and maintained by our own people. If you are a Java fan, doesn't that prospect appeal to you?

Source: https://habr.com/ru/post/429596/
