
MobileNet: smaller, faster, more accurate

If five years ago neural networks were considered "heavy" algorithms that require hardware designed specifically for high-load computing, today no one is surprised by deep networks running directly on a mobile phone.

Nowadays, networks recognize your face to unlock the phone, stylize photos in the manner of famous artists, and determine whether there is a hot dog in the frame.

In this article, we’ll talk about MobileNet, an advanced convolutional network architecture that allows you to do all this and much more.

The article consists of three parts. In the first, we look at the structure of the network and at the tricks the authors of the original paper proposed for optimizing the speed of the algorithm. In the second, we talk about MobileNetV2, the next version of the architecture, described in a paper that researchers from Google published just a couple of months ago. At the end, we discuss the practical results achievable with this architecture.

Previously in this series


In the previous post, we looked at the Xception architecture, which significantly reduces the number of parameters of an Inception-like convolutional network by replacing conventional convolutions with so-called depthwise separable convolutions. Here is a brief reminder of what they are.
A normal convolution is a filter of size D_k × D_k × C_in, where D_k is the spatial size of the convolution kernel and C_in is the number of input channels. The total computational complexity of a convolutional layer is D_k · D_k · C_in · D_f · D_f · C_out, where D_f is the height and width of the feature map (we assume that the spatial dimensions of the input and output tensors are the same) and C_out is the number of output channels.

The idea of depthwise separable convolution is to decompose such a layer into a depthwise convolution, which is a per-channel filter, and a 1x1 convolution (also called a pointwise convolution). The total number of operations for applying such a layer is (D_k · D_k + C_out) · C_in · D_f · D_f.
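To get a feel for the savings, here is a small illustrative calculation (the layer sizes are my own example values, not figures from the paper) that plugs concrete numbers into the two formulas above:

```python
# Illustrative comparison of multiply-add counts for a standard convolution
# vs. a depthwise separable one, using the formulas above.
# Example values (assumed for illustration): D_k = 3, D_f = 14, C_in = C_out = 256.
D_k, D_f, C_in, C_out = 3, 14, 256, 256

standard = D_k * D_k * C_in * D_f * D_f * C_out      # ~115.6M mult-adds
separable = (D_k * D_k + C_out) * C_in * D_f * D_f   # ~13.3M mult-adds

print(f"standard:  {standard:,}")
print(f"separable: {separable:,}")
print(f"ratio:     {standard / separable:.1f}x fewer operations")
```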

By and large, that's all you need to know to successfully build MobileNet.

MobileNet structure


On the left is a block of a conventional convolutional network; on the right is the MobileNet base block.

The convolutional part of the network consists of one ordinary 3x3 convolutional layer at the beginning, followed by thirteen of the blocks shown on the right in the figure, with a gradually increasing number of filters and a decreasing spatial dimension of the tensor.

A distinctive feature of this architecture is the absence of max pooling layers: instead, the spatial dimension is reduced using convolutions with a stride of 2.
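To make this concrete, here is a rough Keras sketch of the MobileNetV1 base block described above (my own reconstruction, assuming batch normalization and ReLU6 after each convolution as in the reference implementation, not the authors' code):

```python
import tensorflow as tf
from tensorflow.keras import layers

def mobilenet_v1_block(x, filters, stride=1):
    """MobileNetV1 base block: 3x3 depthwise conv + 1x1 pointwise conv,
    each followed by batch normalization and ReLU6. A stride of 2 is used
    instead of max pooling for spatial downsampling."""
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(max_value=6.0)(x)

    x = layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(max_value=6.0)(x)
    return x
```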

The two hyperparameters of the MobileNet architecture are α (the width multiplier) and ρ (the resolution multiplier, also referred to as the depth multiplier).

The width multiplier controls the number of channels in each layer. For example, α = 1 gives the architecture described in the paper, while α = 0.25 gives an architecture with four times fewer channels at the output of each block.

The resolution multiplier controls the spatial dimensions of the input tensors. For example, ρ = 0.5 means that the height and width of the feature map fed to each layer are halved.

Both parameters let you vary the size of the network: by reducing α and ρ we lose some recognition accuracy but gain speed and reduce memory consumption.
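In practice, the Keras implementation of MobileNet exposes the width multiplier as the alpha argument, while the resolution multiplier is applied implicitly by choosing a smaller input resolution. A minimal sketch (weights are left out here simply to avoid a download):

```python
import tensorflow as tf

# Full-size model: width multiplier alpha = 1 with a 224x224 input (rho = 1).
full = tf.keras.applications.MobileNet(input_shape=(224, 224, 3), alpha=1.0, weights=None)

# Reduced model: alpha = 0.25 with a 128x128 input (roughly rho ≈ 0.57).
small = tf.keras.applications.MobileNet(input_shape=(128, 128, 3), alpha=0.25, weights=None)

# The reduced model has far fewer parameters than the full one.
print(full.count_params(), small.count_params())
```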

MobileNetV2


The appearance of MobileNet had already revolutionized computer vision on mobile platforms, and a few days ago Google released MobileNetV2, the next generation of this family of networks, which achieves approximately the same recognition accuracy at even higher speed.

What does MobileNetV2 look like?


The main building block of this network is broadly similar to that of the previous generation, but it has a number of key differences.

As in MobileNetV1, there are convolutional blocks with stride 1 (on the left in the figure) and with stride 2 (on the right). Stride-2 blocks are used to reduce the spatial dimension of the tensor and, unlike stride-1 blocks, have no residual connection.

The MobileNetV2 block, which the authors call an expansion convolution block (or a bottleneck convolution block with an expansion layer), consists of three layers:

  1. First comes a pointwise (1x1) convolution with a large number of output channels, called the expansion layer.

    This layer takes a tensor of size D_f × D_f × C_in as input and outputs a tensor of size D_f × D_f × (t · C_in), where t is a new hyperparameter called the expansion factor. The authors recommend values between 5 and 10, with smaller values working better for smaller networks and larger values for larger ones (all experiments in the paper use t = 6).

    This layer maps the input tensor into a space of higher dimension. The authors call this mapping the "manifold of interest".
  2. Next comes a depthwise convolution with ReLU6 activation. Together with the previous layer, it essentially forms the MobileNetV1 building block we are already familiar with.

    This layer takes a tensor of size D_f × D_f × (t · C_in) as input and outputs a tensor of size (D_f / s) × (D_f / s) × (t · C_in), where s is the convolution stride; as we remember, depthwise convolution does not change the number of channels.
  3. Finally, there is a 1x1 convolution with a linear activation function that reduces the number of channels. The authors hypothesize that the high-dimensional "manifold of interest" obtained in the previous steps can be embedded into a subspace of lower dimension without losing useful information, which is exactly what this step does (and, judging by the experimental results, the hypothesis is fully justified).

    This layer takes a tensor of size (D_f / s) × (D_f / s) × (t · C_in) as input and outputs a tensor of size (D_f / s) × (D_f / s) × C_out, where C_out is the number of output channels of the block.


In fact, it is the third layer in this block, called the bottleneck layer, that is the main difference between the second-generation MobileNet and the first.
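Putting the three layers together, a rough Keras sketch of the MobileNetV2 block (again my own reconstruction based on the description above, not the authors' reference code) might look like this:

```python
import tensorflow as tf
from tensorflow.keras import layers

def mobilenet_v2_block(x, filters_out, stride=1, t=6):
    """Sketch of the MobileNetV2 block: 1x1 expansion (ReLU6),
    3x3 depthwise convolution (ReLU6), and a linear 1x1 bottleneck.
    A residual connection is added only for stride-1 blocks whose
    input and output channel counts match."""
    c_in = x.shape[-1]
    shortcut = x

    # 1. Expansion layer: 1x1 pointwise conv up to t * C_in channels.
    x = layers.Conv2D(t * c_in, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(max_value=6.0)(x)

    # 2. Depthwise 3x3 convolution; stride 2 downsamples the feature map.
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(max_value=6.0)(x)

    # 3. Linear bottleneck: 1x1 conv back down to filters_out channels,
    #    with no non-linearity after it.
    x = layers.Conv2D(filters_out, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)

    if stride == 1 and c_in == filters_out:
        x = layers.Add()([shortcut, x])
    return x
```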

Now that we know how MobileNet works inside, let's see how well it works.

Practical results


Let's compare several network architectures: Xception, which we covered in the previous post, the deep and venerable VGG16, and several variations of MobileNet.

| Network architecture | Number of parameters | Top-1 accuracy | Top-5 accuracy |
| --- | --- | --- | --- |
| Xception | 22.91M | 0.790 | 0.945 |
| VGG16 | 138.35M | 0.715 | 0.901 |
| MobileNetV1 (α = 1, ρ = 1) | 4.20M | 0.709 | 0.899 |
| MobileNetV1 (α = 0.75, ρ = 0.85) | 2.59M | 0.672 | 0.873 |
| MobileNetV1 (α = 0.25, ρ = 0.57) | 0.47M | 0.415 | 0.663 |
| MobileNetV2 (α = 1.4, ρ = 1) | 6.06M | 0.750 | 0.925 |
| MobileNetV2 (α = 1, ρ = 1) | 3.47M | 0.718 | 0.910 |
| MobileNetV2 (α = 0.35, ρ = 0.43) | 1.66M | 0.455 | 0.704 |

To me, the biggest takeaway from these experiments is that networks capable of running on mobile devices now show higher accuracy than VGG16.

The MobileNetV2 paper also reports very interesting results on other tasks. In particular, the authors demonstrate that the SSDLite object detection architecture with MobileNetV2 as the convolutional backbone surpasses the well-known real-time detector YOLOv2 in accuracy on MS COCO while being about 20 times faster and 10 times smaller (in particular, on a Google Pixel smartphone the MobileNetV2-based network performs object detection at about 5 FPS).


What's next?


With MobileNetV2, mobile developers have gained an almost unlimited toolkit for computer vision: in addition to relatively simple image classification models, we can now run object detection and semantic segmentation algorithms directly on a mobile device.

At the same time, using MobileNet via Keras and TensorFlow is so simple that, in principle, developers can do it without even delving into the internals of the algorithms, just as the well-known comic suggests.
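For example, classifying an image with a pretrained MobileNetV2 in Keras takes only a few lines (a minimal sketch; "cat.jpg" is a placeholder for your own image):

```python
import numpy as np
import tensorflow as tf

# Load MobileNetV2 pretrained on ImageNet (the weights are downloaded on first use).
model = tf.keras.applications.MobileNetV2(weights="imagenet")

# Prepare a single 224x224 image; "cat.jpg" is a placeholder path.
img = tf.keras.preprocessing.image.load_img("cat.jpg", target_size=(224, 224))
x = tf.keras.preprocessing.image.img_to_array(img)[np.newaxis, ...]
x = tf.keras.applications.mobilenet_v2.preprocess_input(x)

# Predict and print the three most likely ImageNet classes.
preds = model.predict(x)
print(tf.keras.applications.mobilenet_v2.decode_predictions(preds, top=3)[0])
```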

Source: https://habr.com/ru/post/352804/

