For deep neural network (DNN) training with TensorFlow, the Azure Machine Learning service provides a custom TensorFlow class derived from the Estimator class. The TensorFlow estimator in the Azure SDK (not to be confused with the tf.estimator.Estimator class) makes it easy to submit TensorFlow training jobs for single-node and distributed runs on Azure compute resources.
Training with the TensorFlow estimator is similar to using the base Estimator, so first read the how-to article and make sure you understand the concepts it introduces.
To submit a TensorFlow job, create a TensorFlow object. You should already have created a compute_target object for the target compute resource.
```python
from azureml.train.dnn import TensorFlow

script_params = {
    '--batch-size': 50,
    '--learning-rate': 0.01,
}

tf_est = TensorFlow(source_directory='./my-tf-proj',
                    script_params=script_params,
                    compute_target=compute_target,
                    entry_script='train.py',
                    conda_packages=['scikit-learn'],
                    use_gpu=True)
```
Specify the following parameters in the TensorFlow constructor.
Parameter | Description |
---|---|
source_directory | Local directory that contains all the code needed for the training job. This folder is copied from the local computer to the remote compute resource. |
script_params | A dictionary of command-line arguments to pass to the training script entry_script, as <command-line argument, value> pairs. |
compute_target | The remote compute target that the training script will run on, in this case an Azure Machine Learning Compute (AmlCompute) cluster. |
entry_script | The file path (relative to source_directory) of the training script to be run on the remote compute resource. This file, and any additional files it depends on, should be located in this folder. |
conda_packages | A list of Python packages needed by the training script that should be installed via conda. Here, the training script uses sklearn to load the data, so specify this package for installation. The constructor's pip_packages parameter can be used for any required pip packages. |
use_gpu | Set this flag to True to use the GPU for training. The default is False. |
Since you are using the TensorFlow estimator, the container used for training will, by default, include the TensorFlow package and the related dependencies needed for training on CPUs and GPUs.
Then submit the TensorFlow job:
```python
run = exp.submit(tf_est)
```
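The exp object here is an Experiment that is assumed to have been created beforehand. A minimal sketch of how it might be set up and monitored, assuming a workspace config.json is available locally (the experiment name is illustrative):

```python
from azureml.core import Workspace, Experiment

# Load an existing workspace from a local config.json (assumed to exist).
ws = Workspace.from_config()

# Create or retrieve the experiment the estimator is submitted to.
exp = Experiment(workspace=ws, name='my-tf-experiment')

run = exp.submit(tf_est)

# Optionally block until the run finishes, streaming logs to stdout.
run.wait_for_completion(show_output=True)
```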
The TensorFlow estimator also lets you train models at scale across CPU and GPU clusters of Azure virtual machines. Distributed TensorFlow training is delivered through a few API calls, while the Azure Machine Learning service manages, behind the scenes, the infrastructure and orchestration needed to carry out these workloads.
Azure Machine Learning supports two methods of distributed training in TensorFlow.
Horovod is an open-source framework for distributed training developed by Uber, based on the ring-allreduce algorithm.
To run distributed TensorFlow training using the Horovod framework, create the TensorFlow object as follows:
```python
from azureml.train.dnn import TensorFlow

tf_est = TensorFlow(source_directory='./my-tf-proj',
                    script_params={},
                    compute_target=compute_target,
                    entry_script='train.py',
                    node_count=2,
                    process_count_per_node=1,
                    distributed_backend='mpi',
                    use_gpu=True)
```
The above code shows the following new parameters in the TensorFlow constructor.
Parameter | Description | Default value |
---|---|---|
node_count | The number of nodes to use for the training job. | 1 |
process_count_per_node | The number of processes (or workers) launched on each node. | 1 |
distributed_backend | The backend for launching distributed training, which the estimator offers via MPI. To carry out parallel or distributed training (for example, node_count > 1 or process_count_per_node > 1, or both) with MPI (and Horovod), set distributed_backend='mpi'. The MPI implementation used by the Azure Machine Learning service is Open MPI. | None |
In the example above, distributed training will run with two workers, one worker per node.
Horovod and its dependencies will be installed automatically, so you can simply import them in your training script train.py as follows:
```python
import tensorflow as tf
import horovod
```
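For context, here is a minimal sketch of what a Horovod setup might look like inside train.py, using the TF 1.x-era horovod.tensorflow API; the optimizer and learning rate are illustrative and not part of the original article:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod; this must run before any other Horovod call.
hvd.init()

# Pin each process to a single GPU (one process per GPU).
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers, a common Horovod convention.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers via ring-allreduce.
opt = hvd.DistributedOptimizer(opt)
```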
Finally, submit the TensorFlow job:
```python
run = exp.submit(tf_est)
```
You can also run native distributed TensorFlow, which uses the parameter server model. In this method, training runs on a cluster of parameter servers and workers. During training, the workers compute gradients, and the parameter servers aggregate them.
Create a TensorFlow object:
```python
from azureml.train.dnn import TensorFlow

tf_est = TensorFlow(source_directory='./my-tf-proj',
                    script_params={},
                    compute_target=compute_target,
                    entry_script='train.py',
                    node_count=2,
                    worker_count=2,
                    parameter_server_count=1,
                    distributed_backend='ps',
                    use_gpu=True)
```
Note the following parameters in the TensorFlow constructor in the code above.
Parameter | Description | Default value |
---|---|---|
worker_count | The number of workers. | 1 |
parameter_server_count | The number of parameter servers. | 1 |
distributed_backend | The backend to use for distributed training. To carry out distributed training with the parameter server, set distributed_backend='ps'. | None |
A note on TF_CONFIG: you will also need the network addresses and ports of the cluster for tf.train.ClusterSpec, so the Azure Machine Learning service sets the TF_CONFIG environment variable for you.
The TF_CONFIG environment variable is a JSON string. Here is an example of the variable for a parameter server:
```
TF_CONFIG='{
    "cluster": {
        "ps": ["host0:2222", "host1:2222"],
        "worker": ["host2:2222", "host3:2222", "host4:2222"]
    },
    "task": {"type": "ps", "index": 0},
    "environment": "cloud"
}'
```
If you use TensorFlow's high-level tf.estimator API, TensorFlow will parse this TF_CONFIG variable and build the cluster spec for you.
If you use one of the lower-level core APIs for training, you need to parse the TF_CONFIG variable and build the tf.train.ClusterSpec yourself in your training code. In this example, that is done in the training script as follows:
```python
import os, json
import tensorflow as tf

# Read and validate the TF_CONFIG environment variable set by the service.
tf_config = os.environ.get('TF_CONFIG')
if not tf_config or tf_config == "":
    raise ValueError("TF_CONFIG not found.")

# Parse the JSON string and build the cluster specification from it.
tf_config_json = json.loads(tf_config)
cluster = tf_config_json.get('cluster')
cluster_spec = tf.train.ClusterSpec(cluster)
```
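With the parsed values in hand, a training script would typically start a tf.train.Server for its own role. A minimal sketch continuing the snippet above; this continuation is illustrative and not from the original article:

```python
# Read this process's role ("ps" or "worker") and index from TF_CONFIG.
job_name = tf_config_json.get('task', {}).get('type')
task_index = tf_config_json.get('task', {}).get('index')

# Start the TensorFlow server for this role in the cluster.
server = tf.train.Server(cluster_spec, job_name=job_name, task_index=task_index)

if job_name == 'ps':
    # Parameter servers block and serve variables to the workers.
    server.join()
```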
Once you have written the training script and created the TensorFlow object, submit the training job:
```python
run = exp.submit(tf_est)
```
For notebooks on distributed deep learning, see the relevant section of the GitHub repository.
Learn how to run notebooks by following the directions in the article on exploring this service with Jupyter notebooks.
Source: https://habr.com/ru/post/442120/