
Just add water: H2O.ai development

Hi, Habr! Over the past few years, interest in machine learning and artificial intelligence has grown rapidly, and the H2O.ai platform is becoming increasingly popular in this field: it offers fast in-memory machine learning algorithms and recently gained deep learning support. Today we will talk about developing with H2O.



Fast, scalable, and reliable machine learning solutions are increasingly seen as essential tools for business success. The H2O.ai developers aim to provide exactly that: a fast, scalable, open machine learning platform. This article discusses how to efficiently develop and deploy H2O.ai-based machine learning models on Azure.

H2O.ai supports several deployment options: a single node, a multi-node cluster, and Hadoop or Apache Spark clusters. H2O.ai is written in Java and therefore natively exposes a Java API. Since Scala code runs on the Java virtual machine, a Scala API is supported as well. In addition, rich interfaces are available for Python and R: programmers can use H2O.ai's algorithms and capabilities through the h2o R and h2o Python packages. R and Python scripts that use the h2o library communicate with H2O clusters through REST API calls.

With the growing popularity of Apache Spark, the Sparkling Water interface was developed to combine the functionality of H2O and Apache Spark. Sparkling Water launches an H2O service on each Spark executor in a Spark cluster, thereby forming an H2O cluster. A common way to combine these technologies is to prepare the data with Apache Spark and then perform training and scoring with H2O.

Apache Spark natively supports Python through the PySpark interface, and the pysparkling package lets Spark and H2O exchange data so that Sparkling Water applications can be written in Python. The sparklyr package serves as the R interface for Spark, and rsparkling lets Spark and H2O exchange data so that Sparkling Water applications can be written in R.

Table 1 and Figure 1 below provide additional information on running Sparkling Water applications in Spark using R and Python.
| Artifact | Purpose |
| --- | --- |
| H2O JAR file | JAR file containing the library for running H2O services |
| Sparkling Water JAR file | JAR file containing the library for running Sparkling Water applications on a Spark cluster |
| Python package "h2o" | Python interface for H2O |
| Python package "pyspark" | Python API for Spark |
| Python package "h2o_pysparkling_{Spark major version}" | Python interface for Sparkling Water |
| R package "h2o" | R interface for H2O |
| R package "sparklyr" | R interface for Apache Spark |
| R package "rsparkling" | R interface for the Sparkling Water package |

Table 1. Artifacts that allow H2O.ai to run in Spark from R and Python



Fig. 1. How the R and Python libraries interact with the Sparkling Water and H2O JAR files when running Sparkling Water applications on the Spark platform using R and Python

Model development


The Data Science Virtual Machine (DSVM) is a great tool for building machine learning models in single-node environments. The DSVM comes with H2O.ai for Python pre-installed. If you use R (on Ubuntu), the script from our previous blog post will help you set up the environment. When working with large data sets, it may make sense to use a cluster for development. Two recommended options for cluster-based development follow.

Azure HDInsight offers many convenient configurations of fully managed clusters. In particular, it lets users create Spark clusters with H2O.ai on which all the necessary components are pre-installed. Python users can experiment with the Jupyter notebook examples that ship with the cluster.

R programmers can refer to our previous publication, which describes how to set up a development environment with RStudio. After you create and train a model, you can save it for scoring. H2O allows you to save a trained model as a MOJO file. When the model is saved, the h2o-genmodel.jar file is also created; it is used to load the trained model from Java or Scala code. Python and R code can load a trained model directly through the H2O API.

If you need low-cost clusters, you can use the Azure Distributed Data Engineering Toolkit (AZTK) to run a Docker-based Spark cluster on the Azure Batch service with low-priority virtual machines.

During development, a cluster created with AZTK can be accessed via SSH or Jupyter notebooks. Compared to Jupyter notebooks on Azure HDInsight clusters, this Jupyter notebook offers less functionality and has no ready-made configuration for developing H2O.ai models. In addition, users need to save their work to reliable external storage, because an AZTK Spark cluster cannot be recovered after it is shut down.

The characteristics of these three model development environments are summarized in Table 2.

| | Single virtual machine | HDInsight Spark cluster | Azure Batch with Azure Distributed Data Engineering Toolkit |
| --- | --- | --- | --- |
| Data volume | Small | Big | Big |
| Cost | Low | Depends on cluster size and virtual machine type | Pay only for consumed resources |
| Containerized cluster | No | No | Yes, under user control |
| Horizontal scaling | No | Yes | Yes |
| Ready-made toolset | Rich toolkit with sample configurations for running H2O.ai in Jupyter | Rich toolkit with sample configurations for running H2O.ai in Jupyter | Limited (by default, the Spark web UI is forwarded to localhost:8080, the Spark jobs UI to localhost:4040, and Jupyter to localhost:8888) |

Table 2. Model development environments

Batch scoring and model retraining


Batch scoring is also called offline scoring. It is usually applied to large volumes of data and can take considerable time. Retraining restores the performance of a model that no longer correctly captures the patterns in new data. Batch scoring and model retraining are both batch processing operations and can be implemented in a similar way.

The Azure Batch service is great for running many parallel tasks, each of which can be handled by a single virtual machine. The Azure Batch Shipyard tool allows you to create and configure jobs in Azure Batch using Docker containers without writing any code. Apache Spark and H2O.ai can easily be added to your Docker image and used with Azure Batch Shipyard.

In Azure Batch Shipyard, each model retraining or batch scoring run can be configured as a job. Such jobs, consisting of several parallel tasks, are sometimes called "embarrassingly parallel" workloads. They are fundamentally different from distributed computing, where the tasks need to exchange information with each other. More information is available on this wiki page.
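A batch scoring job of this kind might be described in a Batch Shipyard jobs configuration roughly as below. This is an illustrative fragment: the job id, image name, and script paths are invented for the example, and each task scores one data partition independently, with no communication between tasks:

```yaml
# Illustrative Batch Shipyard jobs.yaml fragment (names and image are examples).
job_specifications:
- id: h2o-batch-scoring
  tasks:
  - docker_image: myregistry/h2o-scoring:latest
    command: python /opt/score.py --input part-001.csv --output preds-001.csv
  - docker_image: myregistry/h2o-scoring:latest
    command: python /opt/score.py --input part-002.csv --output preds-002.csv
```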

If the batch processing job requires a cluster for distributed operation (for example, because of large data volumes, or because such a solution is more economical), you can create a Docker-based Spark cluster using AZTK. H2O.ai can easily be added to the Docker image, and the process of creating the cluster, submitting the job, and deleting the cluster can be automated and triggered from an Azure Functions app.

However, with this approach users need to configure the cluster and manage the container images. If you need a fully managed cluster with detailed monitoring, take a look at Azure HDInsight. You can now use the Spark activity in Azure Data Factory to submit batch jobs to a cluster. However, this requires a constantly running HDInsight cluster, so this option is better suited to scenarios where batch processing happens frequently.

Table 3 compares the three methods for batch processing in Spark. H2O.ai can be easily integrated into any of these types of environments.

| | Azure Functions + Azure Batch with Azure Batch Shipyard | Azure Data Factory + HDInsight Spark cluster | Azure Functions + Azure Batch with Azure Distributed Data Engineering Toolkit |
| --- | --- | --- | --- |
| Compute pool type | Allocated on demand | Provided by the user | Allocated on demand |
| Spark job mode | Local; several nodes work independently on the tasks of a job | Cluster; several nodes work together as a cluster on a single task | Cluster; several nodes work together as a cluster on a single task |
| Data volume | Small | Big | Big |
| Cost | Only the Batch pool's running time is paid; discounts for low-priority nodes | Higher compute node cost; idle clusters are also paid for | Only the Batch pool's running time is paid; discounts for low-priority nodes |
| Containerized tasks | Yes | No | Yes |
| Suitability for embarrassingly parallel computing | Ideal | Not ideal | Not ideal |
| Suitability for distributed computing | Not ideal | Ideal for frequent batch processing | Ideal for infrequent batch processing |
| Latency | About 5 minutes (time to start the Batch pool) | Job submission only (the cluster is always running) | About 5 minutes (time to start the cluster) |
| Horizontal scaling | Yes; automatic scaling as the number of tasks grows | Yes, without automatic scaling | Yes, without automatic scaling |

Table 3. Orchestration and compute options for batch processing

Online scoring


Online scoring implies a short response time, so it is also called real-time scoring. It is generally used to make predictions for individual data points or small batches. Where possible, such scoring should rely on precomputed, cached features. Machine learning models, together with the corresponding libraries, can be loaded and scored inside any application. If a microservice architecture is used to separate responsibilities and reduce dependencies, it is recommended to implement online scoring as a web service with a REST API.

Web services that score with H2O machine learning models are usually written in Java, Scala, or Python. As mentioned in the "Model development" section, the H2O model is saved in the MOJO format, and the h2o-genmodel.jar file is generated alongside it. Web services written in Java or Scala can use that JAR file to load the saved model and perform scoring; web services written in Python can load the saved model directly through the Python API.

There are many ways to host web services within Azure.

Azure Web Apps is Azure's PaaS offering for hosting web applications: a fully managed platform that lets users focus on their applications' functionality. Recently, Web App for Containers, based on Azure Web App on Linux, was released for hosting containerized web applications. Azure Container Service with Kubernetes (AKS) is a convenient tool for creating and configuring a cluster of virtual machines to run containerized applications.

Web App for Containers and Azure Container Service give web applications high portability and allow their environment to be configured flexibly. The command-line interface and the Azure Machine Learning (AML) model management API are even simpler tools for managing web services in ACS with Kubernetes. Table 4 compares the three Azure services that can host online scoring systems.

| | Azure Web App (Linux or Windows) | Web App for Containers (Linux only) | Azure Container Service with Kubernetes (AKS) |
| --- | --- | --- | --- |
| Ability to modify the runtime | No | Yes, through containers | Yes, through containers |
| Cost | Depends on the App Service plan | Depends on the App Service plan | Virtual machine node cost depends on user configuration |
| Virtual network or load balancer support | Yes | Yes | Yes |
| Service deployment | Managed by the user | Managed by the user | Managed by the user, or automated through the command-line interface or the AML model management API |
| Service creation time | Short | Short | About 20 minutes via the command-line interface or the AML model management API; additional resources (load balancer, etc.) are also created |
| Staged deployment | Yes, through deployment slots | Yes, through deployment slots | Yes; Kubernetes supports managed rolling updates |
| Multiple application support | No, but multiple applications can share one App Service plan | No, but multiple applications can share one App Service plan | Yes; multiple applications can run on one cluster |
| Horizontal scaling | Automatic scaling on all service plans except Basic | Automatic scaling on all service plans except Basic | Managed by the user |
| Monitoring tool | Application Insights | Application Insights | Log Analytics |
| Continuous integration | Yes | Managed by the user | Managed by the user |
| QPS (throughput) | Depends on the App Service plan | Depends on the App Service plan | Managed by the user |

Table 4. Azure services for hosting online scoring systems

Edge scoring


This method involves scoring on Internet of Things (IoT) devices. With this approach, the device analyzes the data and acts on the results right where the data is collected, without sending it to a central processing location. Scoring on edge devices is very useful when there are strict data privacy constraints, or when predictions are needed as quickly as possible.

With container technology, Azure Machine Learning and Azure IoT Edge make it easy to deploy machine learning models to Azure IoT Edge devices. Using AML containers greatly simplifies running H2O.ai on edge devices. For more information about data analysis on edge devices, see our blog's recent post, Artificial Intelligence and Machine Learning on the Cutting Edge.

Conclusion


In this article, we discussed how to build and deploy H2O.ai-based solutions using Azure services, covering model development and retraining, batch scoring, online scoring, and scoring on edge devices. Although this publication focused mainly on H2O.ai, the findings are not limited to that environment: they apply equally to other solutions on the Spark platform.

Spark keeps adding support for new frameworks, such as TensorFlow and the Microsoft Cognitive Toolkit (CNTK), so we are confident that the value of these findings will only grow. To implement any project successfully, you must choose the right product for your business and technical needs. We hope this article helps you do that.

Source: https://habr.com/ru/post/358434/

