For deep neural network (DNN) training with TensorFlow, the Azure Machine Learning service provides a custom TensorFlow class derived from the Estimator class. The TensorFlow estimator in the Azure SDK (not to be confused with the tf.estimator.Estimator class) makes it easy to submit TensorFlow training jobs for single-node and distributed runs on Azure compute resources.
Training with the TensorFlow estimator is similar to using the base Estimator, so first read the how-to article and make sure you understand the concepts it introduces.
To submit a TensorFlow job, create a TensorFlow object. You should already have created a compute_target object for the target compute resource.
```python
from azureml.train.dnn import TensorFlow

script_params = {
    '--batch-size': 50,
    '--learning-rate': 0.01,
}

tf_est = TensorFlow(source_directory='./my-tf-proj',
                    script_params=script_params,
                    compute_target=compute_target,
                    entry_script='train.py',
                    conda_packages=['scikit-learn'],
                    use_gpu=True)
```
Specify the following parameters in the TensorFlow constructor.
Parameter | Description |
---|---|
source_directory | Local directory that contains all the code needed for the training job. This folder is copied from the local computer to the remote compute resource. |
script_params | A dictionary of command-line arguments to pass to the training script entry_script, as <command-line argument, value> pairs. |
compute_target | The remote compute target that the training script will run on, in this case an Azure Machine Learning Compute (AmlCompute) cluster. |
entry_script | The file path (relative to source_directory) of the training script to be run on the remote compute resource. This file, and any additional files it depends on, should be located in this folder. |
conda_packages | A list of Python packages needed by the training script that should be installed via conda. Here, the training script uses sklearn to load the data, so specify this package for installation. The constructor's pip_packages parameter can be used for any required pip packages. |
use_gpu | Set this flag to True to use the GPU for training. The default is False. |
Since you are using the TensorFlow estimator, the container used for training will, by default, include the TensorFlow package and the related dependencies needed for training on CPUs and GPUs.
Then submit the TensorFlow job:
```python
run = exp.submit(tf_est)
```
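The exp object here is an Experiment that is assumed to have been created beforehand. A minimal sketch of how it might be set up and monitored, assuming a workspace config.json is available locally (the experiment name is illustrative):

```python
from azureml.core import Workspace, Experiment

# Load an existing workspace from a local config.json (assumed to exist).
ws = Workspace.from_config()

# Create or retrieve the experiment the estimator is submitted to.
exp = Experiment(workspace=ws, name='my-tf-experiment')

run = exp.submit(tf_est)

# Optionally block until the run finishes, streaming logs to stdout.
run.wait_for_completion(show_output=True)
```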
The TensorFlow estimator also lets you train models at scale across CPU and GPU clusters of Azure virtual machines. Distributed TensorFlow training is delivered through a few API calls, while the Azure Machine Learning service manages, behind the scenes, the infrastructure and orchestration needed to carry out these workloads.
Azure Machine Learning supports two methods of distributed training in TensorFlow.
Horovod is an open-source framework for distributed training developed by Uber, based on the ring-allreduce algorithm.
To run distributed TensorFlow training using the Horovod framework, create the TensorFlow object as follows:
```python
from azureml.train.dnn import TensorFlow

tf_est = TensorFlow(source_directory='./my-tf-proj',
                    script_params={},
                    compute_target=compute_target,
                    entry_script='train.py',
                    node_count=2,
                    process_count_per_node=1,
                    distributed_backend='mpi',
                    use_gpu=True)
```
The above code shows the following new parameters in the TensorFlow constructor.
Parameter | Description | Default value |
---|---|---|
node_count | The number of nodes to use for the training job. | 1 |
process_count_per_node | The number of processes (or workers) launched on each node. | 1 |
distributed_backend | The backend for launching distributed training, which the estimator offers via MPI. To carry out parallel or distributed training (for example, node_count > 1 or process_count_per_node > 1, or both) with MPI (and Horovod), set distributed_backend='mpi'. The MPI implementation used by the Azure Machine Learning service is Open MPI. | None |
In the example above, distributed training will run with two workers, one worker per node.
Horovod and its dependencies will be installed automatically, so you can simply import them in your training script train.py as follows:
```python
import tensorflow as tf
import horovod
```
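For context, here is a minimal sketch of what a Horovod setup might look like inside train.py, using the TF 1.x-era horovod.tensorflow API; the optimizer and learning rate are illustrative and not part of the original article:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod; this must run before any other Horovod call.
hvd.init()

# Pin each process to a single GPU (one process per GPU).
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers, a common Horovod convention.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers via ring-allreduce.
opt = hvd.DistributedOptimizer(opt)
```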
Finally, submit the TensorFlow job:
```python
run = exp.submit(tf_est)
```
You can also run native distributed TensorFlow, which uses the parameter server model. In this method, training runs on a cluster of parameter servers and workers. During training, the workers compute gradients, and the parameter servers aggregate them.
Create a TensorFlow object:
```python
from azureml.train.dnn import TensorFlow

tf_est = TensorFlow(source_directory='./my-tf-proj',
                    script_params={},
                    compute_target=compute_target,
                    entry_script='train.py',
                    node_count=2,
                    worker_count=2,
                    parameter_server_count=1,
                    distributed_backend='ps',
                    use_gpu=True)
```
Note the following parameters in the TensorFlow constructor in the code above.
Parameter | Description | Default value |
---|---|---|
worker_count | The number of workers. | 1 |
parameter_server_count | The number of parameter servers. | 1 |
distributed_backend | The backend to use for distributed training. To carry out distributed training with the parameter server, set distributed_backend='ps'. | None |
A note on TF_CONFIG: you will also need the network addresses and ports of the cluster for tf.train.ClusterSpec, so the Azure Machine Learning service sets the TF_CONFIG environment variable for you.
The TF_CONFIG environment variable is a JSON string. Here is an example of the variable for a parameter server:
```
TF_CONFIG='{
    "cluster": {
        "ps": ["host0:2222", "host1:2222"],
        "worker": ["host2:2222", "host3:2222", "host4:2222"]
    },
    "task": {"type": "ps", "index": 0},
    "environment": "cloud"
}'
```
If you use TensorFlow's high-level tf.estimator API, TensorFlow will parse this TF_CONFIG variable and build the cluster spec for you.
If you use one of the lower-level core APIs for training, you need to parse the TF_CONFIG variable and build the tf.train.ClusterSpec yourself in your training code. In this example, that is done in the training script as follows:
```python
import os, json
import tensorflow as tf

# Read and validate the TF_CONFIG environment variable set by the service.
tf_config = os.environ.get('TF_CONFIG')
if not tf_config or tf_config == "":
    raise ValueError("TF_CONFIG not found.")

# Parse the JSON string and build the cluster specification from it.
tf_config_json = json.loads(tf_config)
cluster = tf_config_json.get('cluster')
cluster_spec = tf.train.ClusterSpec(cluster)
```
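With the parsed values in hand, a training script would typically start a tf.train.Server for its own role. A minimal sketch continuing the snippet above; this continuation is illustrative and not from the original article:

```python
# Read this process's role ("ps" or "worker") and index from TF_CONFIG.
job_name = tf_config_json.get('task', {}).get('type')
task_index = tf_config_json.get('task', {}).get('index')

# Start the TensorFlow server for this role in the cluster.
server = tf.train.Server(cluster_spec, job_name=job_name, task_index=task_index)

if job_name == 'ps':
    # Parameter servers block and serve variables to the workers.
    server.join()
```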
Once you have written the training script and created the TensorFlow object, submit the training job:
```python
run = exp.submit(tf_est)
```
For notebooks on distributed deep learning, see the relevant section of the GitHub repository.
Learn how to run notebooks by following the directions in the article on exploring this service with Jupyter notebooks.
Source: https://habr.com/ru/post/442120/