
Just add water: H2O.ai development

Hi, Habr! Over the past few years, interest in machine learning and artificial intelligence has grown rapidly, and the H2O.ai platform is becoming increasingly popular in this field: it offers fast in-memory machine learning algorithms and recently gained deep learning support. Today we will talk about developing with H2O.



Fast, scalable, and reliable machine learning solutions are increasingly seen as essential tools for business success. The H2O.ai developers aim to provide exactly that: a fast, scalable, open machine learning platform. This article discusses how to efficiently develop and deploy H2O.ai-based machine learning models on Azure.

H2O.ai supports several deployment options: a single node, a multi-node cluster, and Hadoop or Apache Spark clusters. H2O.ai is written in Java and therefore natively exposes a Java API. Since Scala code runs on the Java virtual machine, a Scala API is supported as well. In addition, rich interfaces are available for Python and R: programmers can use H2O.ai's algorithms and capabilities through the h2o R and h2o Python packages. R and Python scripts that use the h2o library communicate with H2O clusters through REST API calls.

With the growing popularity of Apache Spark, the Sparkling Water interface was developed to combine the functionality of H2O and Apache Spark. Sparkling Water launches an H2O service on each Spark executor in a Spark cluster, thereby forming an H2O cluster. A common way to combine these technologies is to prepare the data with Apache Spark and then perform training and scoring with H2O.

Apache Spark natively supports Python through the PySpark interface, and the pysparkling package lets Spark and H2O exchange data so that Sparkling Water applications can be written in Python. The sparklyr package serves as the R interface for Spark, and rsparkling lets Spark and H2O exchange data so that Sparkling Water applications can be written in R.

Table 1 and Figure 1 below provide additional information on running Sparkling Water applications in Spark using R and Python.
| Artifact | Purpose |
| --- | --- |
| H2O JAR file | JAR file containing the library for running H2O services |
| Sparkling Water JAR file | JAR file containing the library for running Sparkling Water applications on a Spark cluster |
| Python package "h2o" | Python interface for H2O |
| Python package "pyspark" | Python API for Spark |
| Python package "h2o_pysparkling_{Spark major version}" | Python interface for Sparkling Water |
| R package "h2o" | R interface for H2O |
| R package "sparklyr" | R interface for Apache Spark |
| R package "rsparkling" | R interface for the Sparkling Water package |

Table 1. Artifacts that allow H2O.ai to run in Spark from R and Python



Fig. 1. How the R and Python libraries interact with the Sparkling Water and H2O JAR files when running Sparkling Water applications on the Spark platform using R and Python

Model development


The Data Science Virtual Machine (DSVM) is a great tool for building machine learning models in single-node environments. The DSVM comes with H2O.ai for Python pre-installed. If you use R (on Ubuntu), the script from our previous blog post will help you set up the environment. When working with large data sets, it may make sense to use a cluster for development. Two recommended options for cluster-based development follow.

Azure HDInsight offers many convenient configurations of fully managed clusters. In particular, it lets users create Spark clusters with H2O.ai on which all the necessary components are pre-installed. Python users can experiment with the Jupyter notebook examples that ship with the cluster.

R programmers can refer to our previous publication, which describes how to set up a development environment with RStudio. After you create and train a model, you can save it for scoring. H2O allows you to save a trained model as a MOJO file. When the model is saved, the h2o-genmodel.jar file is also created; it is used to load the trained model from Java or Scala code. Python and R code can load a trained model directly through the H2O API.

If you need low-cost clusters, you can use the Azure Distributed Data Engineering Toolkit (AZTK) to run a Docker-based Spark cluster on the Azure Batch service with low-priority virtual machines.

During development, a cluster created with AZTK can be accessed via SSH or Jupyter notebooks. Compared to Jupyter notebooks on Azure HDInsight clusters, this Jupyter notebook offers less functionality and has no ready-made configuration for developing H2O.ai models. In addition, users need to save their work to reliable external storage, because an AZTK Spark cluster cannot be recovered after it is shut down.

The characteristics of these three model development environments are summarized in Table 2.

| | Single virtual machine | HDInsight Spark cluster | Azure Batch with Azure Distributed Data Engineering Toolkit |
| --- | --- | --- | --- |
| Data volume | Small | Big | Big |
| Cost | Low | Depends on cluster size and virtual machine type | Pay only for consumed resources |
| Containerized cluster | No | No | Yes, under user control |
| Horizontal scaling | No | Yes | Yes |
| Ready-made toolset | Rich toolkit with sample configurations for running H2O.ai in Jupyter | Rich toolkit with sample configurations for running H2O.ai in Jupyter | Limited (by default, the Spark web UI is forwarded to localhost:8080, the Spark jobs UI to localhost:4040, and Jupyter to localhost:8888) |

Table 2. Model development environments

Batch scoring and model retraining


Batch scoring is also called offline scoring. It is usually applied to large volumes of data and can take considerable time. Retraining restores the performance of a model that no longer correctly captures the patterns in new data. Batch scoring and model retraining are both batch processing operations and can be implemented in a similar way.

The Azure Batch service is great for running many parallel tasks, each of which can be handled by a single virtual machine. The Azure Batch Shipyard tool allows you to create and configure jobs in Azure Batch using Docker containers without writing any code. Apache Spark and H2O.ai can easily be added to your Docker image and used with Azure Batch Shipyard.

In Azure Batch Shipyard, each model retraining or batch scoring run can be configured as a job. Such jobs, consisting of several parallel tasks, are sometimes called "embarrassingly parallel" workloads. They are fundamentally different from distributed computing, where the tasks need to exchange information with each other. More information is available on this wiki page.
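A batch scoring job of this kind might be described in a Batch Shipyard jobs configuration roughly as below. This is an illustrative fragment: the job id, image name, and script paths are invented for the example, and each task scores one data partition independently, with no communication between tasks:

```yaml
# Illustrative Batch Shipyard jobs.yaml fragment (names and image are examples).
job_specifications:
- id: h2o-batch-scoring
  tasks:
  - docker_image: myregistry/h2o-scoring:latest
    command: python /opt/score.py --input part-001.csv --output preds-001.csv
  - docker_image: myregistry/h2o-scoring:latest
    command: python /opt/score.py --input part-002.csv --output preds-002.csv
```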

If the batch processing job requires a cluster for distributed operation (for example, because of large data volumes, or because such a solution is more economical), you can create a Docker-based Spark cluster using AZTK. H2O.ai can easily be added to the Docker image, and the process of creating the cluster, submitting the job, and deleting the cluster can be automated and triggered from an Azure Functions app.

However, with this approach users need to configure the cluster and manage the container images. If you need a fully managed cluster with detailed monitoring, take a look at Azure HDInsight. You can now use the Spark activity in Azure Data Factory to submit batch jobs to a cluster. However, this requires a constantly running HDInsight cluster, so this option is better suited to scenarios where batch processing happens frequently.

Table 3 compares the three methods for batch processing in Spark. H2O.ai can be easily integrated into any of these types of environments.

| | Azure Functions + Azure Batch with Azure Batch Shipyard | Azure Data Factory + HDInsight Spark cluster | Azure Functions + Azure Batch with Azure Distributed Data Engineering Toolkit |
| --- | --- | --- | --- |
| Compute pool type | Allocated on demand | Provided by the user | Allocated on demand |
| Spark job mode | Local; several nodes work independently on the tasks of a job | Cluster; several nodes work together as a cluster on a single task | Cluster; several nodes work together as a cluster on a single task |
| Data volume | Small | Big | Big |
| Cost | Only the Batch pool's running time is paid; discounts for low-priority nodes | Higher compute node cost; idle clusters are also paid for | Only the Batch pool's running time is paid; discounts for low-priority nodes |
| Containerized tasks | Yes | No | Yes |
| Suitability for embarrassingly parallel computing | Ideal | Not ideal | Not ideal |
| Suitability for distributed computing | Not ideal | Ideal for frequent batch processing | Ideal for infrequent batch processing |
| Latency | About 5 minutes (time to start the Batch pool) | Job submission only (the cluster is always running) | About 5 minutes (time to start the cluster) |
| Horizontal scaling | Yes; automatic scaling as the number of tasks grows | Yes, without automatic scaling | Yes, without automatic scaling |

Table 3. Orchestration and compute options for batch processing

Online scoring


Online scoring implies a short response time, so it is also called real-time scoring. It is generally used to make predictions for individual data points or small batches. Where possible, such scoring should rely on precomputed, cached features. Machine learning models, together with the corresponding libraries, can be loaded and scored inside any application. If a microservice architecture is used to separate responsibilities and reduce dependencies, it is recommended to implement online scoring as a web service with a REST API.

Web services that score with H2O machine learning models are usually written in Java, Scala, or Python. As mentioned in the "Model development" section, the H2O model is saved in the MOJO format, and the h2o-genmodel.jar file is generated alongside it. Web services written in Java or Scala can use that JAR file to load the saved model and perform scoring; web services written in Python can load the saved model directly through the Python API.

There are many ways to host web services within Azure.

Azure Web Apps is Azure's PaaS offering for hosting web applications: a fully managed platform that lets users focus on their applications' functionality. Recently, Web App for Containers, based on Azure Web App on Linux, was released for hosting containerized web applications. Azure Container Service with Kubernetes (AKS) is a convenient tool for creating and configuring a cluster of virtual machines to run containerized applications.

Web App for Containers and Azure Container Service give web applications high portability and allow their environment to be configured flexibly. The command-line interface and the Azure Machine Learning (AML) model management API are even simpler tools for managing web services in ACS with Kubernetes. Table 4 compares the three Azure services that can host online scoring systems.

| | Azure Web App (Linux or Windows) | Web App for Containers (Linux only) | Azure Container Service with Kubernetes (AKS) |
| --- | --- | --- | --- |
| Ability to modify the runtime | No | Yes, through containers | Yes, through containers |
| Cost | Depends on the App Service plan | Depends on the App Service plan | Virtual machine node cost depends on user configuration |
| Virtual network or load balancer support | Yes | Yes | Yes |
| Service deployment | Managed by the user | Managed by the user | Managed by the user, or automated through the command-line interface or the AML model management API |
| Service creation time | Short | Short | About 20 minutes via the command-line interface or the AML model management API; additional resources (load balancer, etc.) are also created |
| Staged deployment | Yes, through deployment slots | Yes, through deployment slots | Yes; Kubernetes supports managed rolling updates |
| Multiple application support | No, but multiple applications can share one App Service plan | No, but multiple applications can share one App Service plan | Yes; multiple applications can run on one cluster |
| Horizontal scaling | Automatic scaling on all service plans except Basic | Automatic scaling on all service plans except Basic | Managed by the user |
| Monitoring tool | Application Insights | Application Insights | Log Analytics |
| Continuous integration | Yes | Managed by the user | Managed by the user |
| QPS (throughput) | Depends on the App Service plan | Depends on the App Service plan | Managed by the user |

Table 4. Azure services for hosting online scoring systems

Edge scoring


This method involves scoring on Internet of Things (IoT) devices. With this approach, the device analyzes the data and acts on the results right where the data is collected, without sending it to a central processing location. Scoring on edge devices is very useful when there are strict data privacy constraints, or when predictions are needed as quickly as possible.

With container technology, Azure Machine Learning and Azure IoT Edge make it easy to deploy machine learning models to Azure IoT Edge devices. Using AML containers greatly simplifies running H2O.ai on edge devices. For more information about data analysis on edge devices, see our blog's recent post, Artificial Intelligence and Machine Learning on the Cutting Edge.

Conclusion


In this article, we discussed how to build and deploy H2O.ai-based solutions using Azure services, covering model development and retraining, batch scoring, online scoring, and scoring on edge devices. Although this publication focused mainly on H2O.ai, the findings are not limited to that environment: they apply equally to other solutions on the Spark platform.

Spark keeps adding support for new frameworks, such as TensorFlow and the Microsoft Cognitive Toolkit (CNTK), so we are confident that the value of these findings will only grow. To implement any project successfully, you must choose the right product for your business and technical needs. We hope this article helps you do that.

Source: https://habr.com/ru/post/358434/

