
How we increased Tensorflow Serving performance by 70%

Tensorflow has become the standard machine learning (ML) platform, popular in both industry and research. Many free libraries, tools and frameworks have been created for training and serving ML models. The Tensorflow Serving project helps to serve ML models in a distributed production environment.

Our Mux service uses Tensorflow Serving in several parts of the infrastructure; we have already discussed using Tensorflow Serving for per-title video encoding. Today we will focus on methods that improve latency by optimizing both the prediction server and the client. Model predictions are usually "online" operations (on the critical path of an application request), so the main optimization goals are to handle large volumes of requests with the lowest possible latency.

What is Tensorflow Serving?


Tensorflow Serving provides a flexible server architecture for deploying and maintaining ML models. Once the model is trained and ready to use for prediction, Tensorflow Serving requires exporting it to a compatible (servable) format.

A Servable is the central abstraction that wraps Tensorflow objects. For example, a model can be represented as one or more Servables. Servables are thus the basic objects the client uses to perform computation. Servable size matters: smaller models take up less space, use less memory and load faster. To be loaded and served via the Predict API, models must be in the SavedModel format.


Tensorflow Serving combines the core components for building a gRPC/HTTP server that serves multiple ML models (or multiple versions of one model), and provides monitoring components and an extensible architecture.

Tensorflow Serving with Docker


Let's look at baseline prediction latency with the standard Tensorflow Serving settings (no CPU optimizations).

First, download the latest image from the TensorFlow Docker hub:

docker pull tensorflow/serving:latest 

In this article, all containers run on a host with four cores and 15 GB of RAM, running Ubuntu 16.04.

Exporting the Tensorflow model to the SavedModel format


When a model is trained with Tensorflow, the output can be saved as variable checkpoints (files on disk). Inference can be performed directly by restoring the model checkpoints or from a frozen graph (a binary file in a fixed format).

For Tensorflow Serving this frozen graph needs to be exported to the SavedModel format. The Tensorflow documentation has examples of exporting trained models to the SavedModel format.
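As a rough illustration, a minimal TF 1.x export using tf.saved_model.simple_save might look like the sketch below; the dummy graph and tensor names are placeholders, not the ResNet model's real nodes:

 import tensorflow as tf

 # minimal sketch: export a TF 1.x graph to the SavedModel layout Tensorflow Serving expects
 with tf.Session(graph=tf.Graph()) as sess:
     # stand-ins for a real trained model: build or restore your graph here
     input_tensor = tf.placeholder(tf.float32, shape=[None, 224, 224, 3], name='input')
     weights = tf.Variable(tf.ones([3]), name='weights')  # dummy variable so something gets exported
     output_tensor = tf.reduce_mean(input_tensor * weights, axis=[1, 2, 3], name='output')
     sess.run(tf.global_variables_initializer())

     # writes saved_model.pb plus variables/ under models/1 with a 'serving_default' signature
     tf.saved_model.simple_save(
         sess,
         export_dir='models/1',
         inputs={'input': input_tensor},
         outputs={'output': output_tensor})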

Tensorflow also provides many official and research models as a starting point for experimentation, research or production.

As an example, we will use the deep residual network (ResNet) model to classify the ImageNet dataset of 1000 classes. Download the pretrained ResNet-50 v2 model, specifically the channels_last (NHWC) variant in the SavedModel format: as a rule, it performs better on CPU.

Copy the ResNet model directory into the following structure:

 models/
   1/
     saved_model.pb
     variables/
       variables.data-00000-of-00001
       variables.index

Tensorflow Serving expects a numerically ordered directory structure for versioning. In our case, the 1/ directory corresponds to version 1 of the model and contains the model architecture in saved_model.pb along with a snapshot of the model weights (variables).
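As an optional sanity check, you can load the exported directory back in Python to confirm it exposes the expected serving_default signature; a minimal sketch, assuming the layout above:

 import tensorflow as tf

 # load version 1 of the SavedModel into a session and print its serving signature
 with tf.Session(graph=tf.Graph()) as sess:
     meta_graph = tf.saved_model.loader.load(
         sess, [tf.saved_model.tag_constants.SERVING], 'models/1')
     print(meta_graph.signature_def['serving_default'])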

Loading and serving the SavedModel


The following command starts the Tensorflow Serving model server in a Docker container. To load the SavedModel, mount the model directory into the container's expected directory.

 docker run -d -p 9000:8500 \
   -v $(pwd)/models:/models/resnet -e MODEL_NAME=resnet \
   -t tensorflow/serving:latest

Checking the container logs shows that ModelServer is running and ready to serve inference requests for the resnet model at the gRPC and HTTP endpoints:

 ...
 I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: resnet version: 1}
 I tensorflow_serving/model_servers/server.cc:286] Running gRPC ModelServer at 0.0.0.0:8500 ...
 I tensorflow_serving/model_servers/server.cc:302] Exporting HTTP/REST API at:localhost:8501 ...

Prediction client


Tensorflow Serving defines its API schema using protocol buffers (protobufs). The gRPC client implementations of the prediction API are packaged as the tensorflow_serving.apis Python package. We will also need the tensorflow Python package for its helper functions.

Install the dependencies to create a simple client:

 virtualenv .env && source .env/bin/activate && \
   pip install numpy grpcio opencv-python tensorflow tensorflow-serving-api

The ResNet-50 v2 model expects floating-point tensors as input in the channels_last (NHWC) format. The input image is therefore read with opencv-python and loaded into a numpy array (height × width × channels) as float32. The script below creates a prediction client stub, loads the JPEG data into a numpy array, converts it into a tensor_proto and makes a gRPC prediction request:

 #!/usr/bin/env python
 from __future__ import print_function
 import argparse
 import numpy as np
 import time
 tt = time.time()

 import cv2
 import tensorflow as tf

 from grpc.beta import implementations
 from tensorflow_serving.apis import predict_pb2
 from tensorflow_serving.apis import prediction_service_pb2

 parser = argparse.ArgumentParser(description='inception grpc client flags.')
 parser.add_argument('--host', default='0.0.0.0', help='inception serving host')
 parser.add_argument('--port', default='9000', help='inception serving port')
 parser.add_argument('--image', default='', help='path to JPEG image file')
 FLAGS = parser.parse_args()

 def main():
     # create prediction service client stub
     channel = implementations.insecure_channel(FLAGS.host, int(FLAGS.port))
     stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

     # create request
     request = predict_pb2.PredictRequest()
     request.model_spec.name = 'resnet'
     request.model_spec.signature_name = 'serving_default'

     # read image into numpy array
     img = cv2.imread(FLAGS.image).astype(np.float32)

     # convert to tensor proto and make request
     # shape is in NHWC (num_samples x height x width x channels) format
     tensor = tf.contrib.util.make_tensor_proto(img, shape=[1]+list(img.shape))
     request.inputs['input'].CopyFrom(tensor)
     resp = stub.Predict(request, 30.0)

     print('total time: {}s'.format(time.time() - tt))

 if __name__ == '__main__':
     main()

Given a JPEG as input, a working client produces the following result:

 python tf_serving_client.py --image=images/pupper.jpg
 total time: 2.56152906418s

The resulting tensor contains the prediction as an integer class value together with the class probabilities.

 outputs {
   key: "classes"
   value {
     dtype: DT_INT64
     tensor_shape {
       dim {
         size: 1
       }
     }
     int64_val: 238
   }
 }
 outputs {
   key: "probabilities"
 ...

For a single request, such latency is unacceptable. But it is hardly surprising: the default Tensorflow Serving binary is built for the widest range of hardware and use cases. You probably noticed the following lines in the logs of the standard Tensorflow Serving container:

 I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 

This indicates that the TensorFlow Serving binary is running on a CPU platform it was not optimized for.

Build an optimized binary


The Tensorflow documentation recommends compiling Tensorflow from source with all the optimizations available for the CPU of the host where the binary will run. Special build flags let you enable CPU instruction sets for a specific platform:

Instruction set              Flags
AVX                          --copt=-mavx
AVX2                         --copt=-mavx2
FMA                          --copt=-mfma
SSE 4.1                      --copt=-msse4.1
SSE 4.2                      --copt=-msse4.2
All supported by the CPU     --copt=-march=native

Clone a specific version of Tensorflow Serving. In our case, it is 1.13 (the latest at the time this article was published):

 USER=$1
 TAG=$2
 TF_SERVING_VERSION_GIT_BRANCH="r1.13"
 git clone --branch="$TF_SERVING_VERSION_GIT_BRANCH" https://github.com/tensorflow/serving

The Tensorflow Serving dev image uses Bazel as its build tool. We configure it for specific CPU instruction sets:

 TF_SERVING_BUILD_OPTIONS="--copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-msse4.1 --copt=-msse4.2" 

If memory is limited, cap resource consumption during the build with the --local_resources=2048,.5,1.0 flag. For details on the flags, see the Tensorflow Serving with Docker help and the Bazel documentation.

Create a working image based on the existing one:

 #!/bin/bash

 USER=$1
 TAG=$2
 TF_SERVING_VERSION_GIT_BRANCH="r1.13"
 git clone --branch="${TF_SERVING_VERSION_GIT_BRANCH}" https://github.com/tensorflow/serving

 TF_SERVING_BUILD_OPTIONS="--copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-msse4.1 --copt=-msse4.2"

 cd serving && \
   docker build --pull -t $USER/tensorflow-serving-devel:$TAG \
     --build-arg TF_SERVING_VERSION_GIT_BRANCH="${TF_SERVING_VERSION_GIT_BRANCH}" \
     --build-arg TF_SERVING_BUILD_OPTIONS="${TF_SERVING_BUILD_OPTIONS}" \
     -f tensorflow_serving/tools/docker/Dockerfile.devel .

 # already inside serving/ at this point
 docker build -t $USER/tensorflow-serving:$TAG \
   --build-arg TF_SERVING_BUILD_IMAGE=$USER/tensorflow-serving-devel:$TAG \
   -f tensorflow_serving/tools/docker/Dockerfile .

ModelServer is configured using TensorFlow flags to support concurrency. The following parameters configure two thread pools for parallel operation:

 intra_op_parallelism_threads 


 inter_op_parallelism_threads 


Both parameters default to 0, which means the system picks an appropriate value itself, most often one thread per core. However, the parameters can be set manually for multi-core concurrency.
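For context, these are the same thread-pool settings that plain TensorFlow exposes through tf.ConfigProto; a short sketch of what the two values control, pinned here to the four cores of our host:

 import tensorflow as tf

 # equivalent settings in plain TensorFlow, for reference
 config = tf.ConfigProto(
     intra_op_parallelism_threads=4,   # threads used inside a single op (e.g. a large matmul)
     inter_op_parallelism_threads=4)   # threads used to run independent ops concurrently
 sess = tf.Session(config=config)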

Then start the Serving container as before, but this time with the Docker image built from source and with the Tensorflow optimization flags for the specific processor:

 docker run -d -p 9000:8500 \
   -v $(pwd)/models:/models/resnet -e MODEL_NAME=resnet \
   -t $USER/tensorflow-serving:$TAG \
   --tensorflow_intra_op_parallelism=4 \
   --tensorflow_inter_op_parallelism=4

The container logs should no longer show warnings about unused CPU instructions. Without changing any code, latency for the same prediction request drops by about 35.8%:

 python tf_serving_client.py --image=images/pupper.jpg
 total time: 1.64234706879s

Speeding up the prediction client


Can we go faster? The server side is now optimized for our CPU, but a latency of more than a second still seems too long.

It turns out that loading the tensorflow_serving and tensorflow libraries contributes significantly to the latency. Every unnecessary tf.contrib.util.make_tensor_proto call also adds a fraction of a second.
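A quick, illustrative way to see this on your own machine is to time the heavy imports in isolation:

 import time

 start = time.time()
 import tensorflow as tf                         # pulls in the full TensorFlow runtime
 from tensorflow_serving.apis import predict_pb2
 print('import overhead: {:.2f}s'.format(time.time() - start))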

You may ask: "Don't we need the TensorFlow Python packages to actually make prediction requests to the Tensorflow server?" In fact, there is no real need for the tensorflow_serving and tensorflow packages.

As noted earlier, the Tensorflow prediction APIs are defined as protobufs. Therefore, the two external dependencies can be replaced with the corresponding generated tensorflow and tensorflow_serving protobuf stubs, and then there is no need to pull the entire (heavy) Tensorflow library into the client.

First, get rid of the tensorflow and tensorflow-serving-api dependencies and add the grpcio-tools package:

 pip uninstall tensorflow tensorflow-serving-api && \
   pip install grpcio-tools==1.0.0

Clone the tensorflow/tensorflow and tensorflow/serving repositories and copy the following protobuf files into the client project:

 tensorflow/serving/
   tensorflow_serving/apis/model.proto
   tensorflow_serving/apis/predict.proto
   tensorflow_serving/apis/prediction_service.proto

 tensorflow/tensorflow/
   tensorflow/core/framework/resource_handle.proto
   tensorflow/core/framework/tensor_shape.proto
   tensorflow/core/framework/tensor.proto
   tensorflow/core/framework/types.proto

Copy these protobuf files to the protos/ directory while maintaining the original paths:

 protos/
   tensorflow_serving/
     apis/
       *.proto
   tensorflow/
     core/
       framework/
         *.proto

For simplicity, prediction_service.proto can be trimmed to implement only the Predict RPC, so that the nested dependencies of the other RPCs defined in the service do not have to be pulled in. Here is an example of a simplified prediction_service.proto.
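Roughly, such a trimmed file could look like the sketch below (an approximation, not the linked example; the import path is an assumption and must match your local protos/ layout):

 syntax = "proto3";

 package tensorflow.serving;
 option cc_enable_arenas = true;

 import "tensorflow_serving/apis/predict.proto";

 // Only the Predict RPC is kept; the other RPCs (Classify, Regress, GetModelMetadata, ...)
 // and their .proto dependencies are dropped.
 service PredictionService {
   rpc Predict(PredictRequest) returns (PredictResponse);
 }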

Generate the gRPC Python implementations using grpcio.tools.protoc:

 PROTOC_OUT=protos/
 PROTOS=$(find . | grep "\.proto$")

 for p in $PROTOS; do
   python -m grpc.tools.protoc -I . --python_out=$PROTOC_OUT --grpc_python_out=$PROTOC_OUT $p
 done
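Depending on the Python version and project layout, the generated directories may also need empty __init__.py files so the stubs can be imported as protos.tensorflow_serving.apis; a minimal sketch, assuming the protos/ layout above:

 import os

 # create empty __init__.py files in every generated directory (needed on Python 2)
 for root, _, _ in os.walk('protos'):
     open(os.path.join(root, '__init__.py'), 'a').close()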

Now the entire tensorflow_serving module can be removed:

 from tensorflow_serving.apis import predict_pb2
 from tensorflow_serving.apis import prediction_service_pb2

... and replaced with the generated protobufs from protos/tensorflow_serving/apis:

 from protos.tensorflow_serving.apis import predict_pb2
 from protos.tensorflow_serving.apis import prediction_service_pb2

The Tensorflow library was imported only to use the helper function make_tensor_proto, which wraps a Python/numpy object as a TensorProto object.

Thus, we can replace the following dependency and code snippet:

 import tensorflow as tf
 ...
 tensor = tf.contrib.util.make_tensor_proto(features)
 request.inputs['inputs'].CopyFrom(tensor)

... by importing the protobufs and building the TensorProto object ourselves:

 from protos.tensorflow.core.framework import tensor_pb2
 from protos.tensorflow.core.framework import tensor_shape_pb2
 from protos.tensorflow.core.framework import types_pb2
 ...
 # ensure NHWC shape and build tensor proto
 tensor_shape = [1]+list(img.shape)
 dims = [tensor_shape_pb2.TensorShapeProto.Dim(size=dim) for dim in tensor_shape]
 tensor_shape = tensor_shape_pb2.TensorShapeProto(dim=dims)
 tensor = tensor_pb2.TensorProto(
     dtype=types_pb2.DT_FLOAT,
     tensor_shape=tensor_shape,
     float_val=list(img.reshape(-1)))
 request.inputs['inputs'].CopyFrom(tensor)

The full Python script is here. Run the updated client, which makes a prediction request against the optimized Tensorflow Serving:

 python tf_inception_grpc_client.py --image=images/pupper.jpg
 total time: 0.58314920859s

The following diagram shows prediction execution times for the optimized version of Tensorflow Serving compared to the standard one, over 10 runs:



Average latency decreased by a factor of approximately 3.38.

Throughput optimization


Tensorflow Serving can also be configured to handle large volumes of data. Throughput optimization is typically done for "offline" batch processing, where tight latency bounds are not a strict requirement.

Server-side batch processing


As stated in the documentation, server-side batch processing is natively supported in Tensorflow Serving.

The tradeoff between latency and throughput is governed by the batching parameters, which let you reach the maximum throughput that hardware accelerators are capable of.

To enable batching, set the --enable_batching and --batching_parameters_file flags. Batching parameters are defined according to SessionBundleConfig. For CPU-only systems, set num_batch_threads to the number of available cores. For GPUs, see the appropriate options here.
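For illustration, a batching parameters file is a text-format protobuf along these lines (the values below are placeholders, not tuned recommendations; num_batch_threads matches our four cores):

 max_batch_size { value: 32 }
 batch_timeout_micros { value: 5000 }
 max_enqueued_batches { value: 100 }
 num_batch_threads { value: 4 }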

Once a batch is filled on the server side, the inference requests are combined into one large request (tensor) and sent to the Tensorflow session as a single merged request. This is where CPU/GPU parallelism is really exploited.

Some common uses for Tensorflow batch processing include:


Client-side batch processing


Batch processing on the client side groups several inference requests into one.

Since the ResNet model expects input in the NHWC format (where the first dimension is the number of inputs), we can combine several input images into a single RPC request:

 ...
 batch = []
 for jpeg in os.listdir(FLAGS.images_path):
     path = os.path.join(FLAGS.images_path, jpeg)
     img = cv2.imread(path).astype(np.float32)
     batch.append(img)
 ...
 batch_np = np.array(batch).astype(np.float32)
 dims = [tensor_shape_pb2.TensorShapeProto.Dim(size=dim) for dim in batch_np.shape]
 t_shape = tensor_shape_pb2.TensorShapeProto(dim=dims)
 tensor = tensor_pb2.TensorProto(
     dtype=types_pb2.DT_FLOAT,
     tensor_shape=t_shape,
     float_val=list(batch_np.reshape(-1)))
 request.inputs['inputs'].CopyFrom(tensor)

For a batch of N images, the output tensor in the response will contain prediction results for the same number of inputs. In our case, N = 2:

 outputs {
   key: "classes"
   value {
     dtype: DT_INT64
     tensor_shape {
       dim {
         size: 2
       }
     }
     int64_val: 238
     int64_val: 121
   }
 }
 ...
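To consume the result in the client, the class IDs can be read straight from the response proto; a small sketch based on the output above:

 # one predicted ImageNet class ID per image in the client-side batch
 predicted_classes = list(resp.outputs['classes'].int64_val)
 print(predicted_classes)  # e.g. [238, 121]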

Hardware acceleration


A few words about graphics processors.

Training naturally parallelizes on the GPU, since building deep neural networks requires massive computation to reach an optimal solution.

For inference, however, parallelization is not as straightforward. It is often possible to speed up neural network inference on a GPU, but the hardware has to be carefully selected and tested, and an in-depth technical and economic analysis conducted. Hardware parallelism is most valuable for batch processing of "offline" inference (massive volumes).

Before moving to a GPU, weigh the business requirements against a thorough analysis of the costs (monetary, operational, technical) versus the intended benefit (lower latency, higher throughput).

Source: https://habr.com/ru/post/445928/

