
Deep learning and Caffe on New Year's holidays

Motivation


In this article you will learn how to apply deep learning in practice: the Caffe framework is used on the SVHN dataset.

Deep learning. This buzzword had been ringing in my ears for a long time, but I had never managed to try it in practice. A chance to fix that finally turned up: over the New Year's holidays, a Kaggle contest on house number recognition was held as part of an image analysis course.

The data was a part of the well-known SVHN dataset, consisting of 73,257 images in the training set and 26,032 images in the (unlabeled) test set. There are only 10 classes, one per digit. Each image is 32x32 in the RGB color space. As the benchmark shows, methods based on deep learning achieve accuracy higher than that of a human: 1.92% vs. 2% error!

I had experience with machine learning algorithms such as SVM and Naive Bayes. Using already familiar methods is boring, so I decided to try something from deep learning, namely a convolutional neural network.

Choosing Caffe


There are many different libraries and frameworks for working with deep neural networks. My criteria were:
  1. availability of tutorials,
  2. ease of learning,
  3. ease of deployment,
  4. an active community.

Caffe fit these criteria perfectly:
  1. Good tutorials are available on the project website. Separately, I recommend the lectures from the Caffe Summer Bootcamp. For a quick start, you can read about the foundations of neural networks and then about Caffe.
  2. To start working with Caffe, you don't even need to know a programming language: Caffe is set up with configuration files and launched from the command line.
  3. For deployment, there are a Chef cookbook and Docker images.
  4. It is under active development on GitHub, and in the Google group you can ask questions about using the framework.

In addition, Caffe is very fast because it uses a GPU (although it can also run on the CPU alone).

Installation


Initially, I installed Caffe on my laptop using Docker and ran it in CPU mode. Training the neural network was very slow, but since there was nothing to compare it with, it seemed normal.

Then I came across a $25 Amazon coupon and decided to try an AWS g2.2xlarge instance with an NVIDIA GPU and CUDA support. There I deployed Caffe with the help of Chef. The result was about 41 times faster: 100 iterations took 290 seconds on the CPU and 7 seconds on the GPU with CUDA!

Neural network architecture


Whereas with classical machine learning algorithms you had to engineer a good feature vector to obtain acceptable quality, with convolutional neural networks this is not necessary. The main thing is to come up with a good network architecture.

We introduce the following notation:
  - input: the input layer (the image pixels);
  - conv: a convolutional layer;
  - pool: a subsampling (pooling) layer;
  - fully-conn: a fully connected layer;
  - output: the output layer (class predictions).

For the image classification problem, the basic NN architecture is the following:
input -> conv -> pool -> conv -> pool -> fully-conn -> fully-conn -> output 
The number of (conv -> pool) blocks may vary, but it is usually at least 2. The number of fully-conn layers is at least 1.

Several architectures were tried for this contest. I got the best accuracy with the following one:
  input -> conv -> pool -> conv -> pool -> conv -> pool -> fully-conn -> fully-conn -> output 


Implementing the architecture in Caffe


Caffe is configured using Protobuf files. The architecture implemented for the contest is here. Let's go over the key points of each layer's configuration.

Input layer (input)


Input Layer Configuration
 name: "WinnyNet-F" layers { name: "svhn-rgb" type: IMAGE_DATA top: "data" top: "label" image_data_param { source: "/home/deploy/opt/SVHN/train-rgb-b.txt" batch_size: 128 shuffle: true } transform_param { mean_file: "/home/deploy/opt/SVHN/svhn/winny_net5/mean.binaryproto" } include: { phase: TRAIN } } layers { name: "svhn-rgb" type: IMAGE_DATA top: "data" top: "label" image_data_param { source: "/home/deploy/opt/SVHN/test-rgb-b.txt" batch_size: 120 } transform_param { mean_file: "/home/deploy/opt/SVHN/svhn/winny_net5/mean.binaryproto" } include: { phase: TEST } } ... 


The first 2 layers (for the training and test phases) are of type: IMAGE_DATA, i.e. the network takes images as input. The images are listed in a text file, where the 1st column is the path to the image and the 2nd column is the class label. The path to this text file is specified in the image_data_param attribute.
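For illustration, such a listing file is just plain text with one image per line (the file names below are hypothetical, not the ones actually used for the contest):

  train/00001.png 5
  train/00002.png 0
  train/00003.png 9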

In addition to images, you can feed the network data from HDF5, LevelDB and LMDB. The last 2 options are especially relevant if input speed is critical. Thus, Caffe can work with any data, not just images. IMAGE_DATA is the easiest to work with, so it was chosen for the contest.
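For comparison, here is a minimal sketch of what an LMDB-backed input layer could look like in the same old-style layer syntax used throughout this article; the database path is hypothetical:

  layers {
    name: "svhn-lmdb"
    type: DATA
    top: "data"
    top: "label"
    data_param {
      source: "/home/deploy/opt/SVHN/svhn_train_lmdb"
      backend: LMDB
      batch_size: 128
    }
    include: { phase: TRAIN }
  }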

Input layers can also include the transform_param attribute, which specifies the transformations applied to the input data. Usually, before images are fed to a neural network, they are normalized, or more elaborate operations such as Local Contrast Normalization are applied. In this case, mean_file was specified, which subtracts the "mean" image from each input.

Caffe trains with mini-batch gradient descent. The input layer contains the batch_size parameter: in one iteration, batch_size samples are fed to the input of the neural network.
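For example, with batch_size: 128 and roughly 73 thousand training images, one pass over the training set (one epoch) takes about 571 iterations; this number comes up again in the training section below.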

Convolution and subsample layers (conv, pool)


Configuration of the convolution and subsampling layers
  ...
  layers {
    bottom: "data"
    top: "conv1/5x5_s1"
    name: "conv1/5x5_s1"
    type: CONVOLUTION
    blobs_lr: 1
    blobs_lr: 2
    convolution_param {
      num_output: 64
      kernel_size: 5
      stride: 1
      pad: 2
      weight_filler { type: "xavier" std: 0.0001 }
    }
  }
  layers {
    bottom: "conv1/5x5_s1"
    top: "conv1/5x5_s1"
    name: "conv1/relu_5x5"
    type: RELU
  }
  layers {
    bottom: "conv1/5x5_s1"
    top: "pool1/3x3_s2"
    name: "pool1/3x3_s2"
    type: POOLING
    pooling_param { pool: MAX kernel_size: 3 stride: 2 }
  }
  layers {
    bottom: "pool1/3x3_s2"
    top: "conv2/5x5_s1"
    name: "conv2/5x5_s1"
    type: CONVOLUTION
    blobs_lr: 1
    blobs_lr: 2
    convolution_param {
      num_output: 64
      kernel_size: 5
      stride: 1
      pad: 2
      weight_filler { type: "xavier" std: 0.01 }
    }
  }
  layers {
    bottom: "conv2/5x5_s1"
    top: "conv2/5x5_s1"
    name: "conv2/relu_5x5"
    type: RELU
  }
  layers {
    bottom: "conv2/5x5_s1"
    top: "pool2/3x3_s2"
    name: "pool2/3x3_s2"
    type: POOLING
    pooling_param { pool: MAX kernel_size: 3 stride: 2 }
  }
  layers {
    bottom: "pool2/3x3_s2"
    top: "conv3/5x5_s1"
    name: "conv3/5x5_s1"
    type: CONVOLUTION
    blobs_lr: 1
    blobs_lr: 2
    convolution_param {
      num_output: 128
      kernel_size: 5
      stride: 1
      pad: 2
      weight_filler { type: "xavier" std: 0.01 }
    }
  }
  layers {
    bottom: "conv3/5x5_s1"
    top: "conv3/5x5_s1"
    name: "conv3/relu_5x5"
    type: RELU
  }
  layers {
    bottom: "conv3/5x5_s1"
    top: "pool3/3x3_s2"
    name: "pool3/3x3_s2"
    type: POOLING
    pooling_param { pool: MAX kernel_size: 3 stride: 2 }
  }
  ...


The 3rd layer is a convolution layer with type: CONVOLUTION. It is followed by the activation function with type: RELU. The 4th layer is a subsampling layer with type: POOLING. The conv and pool layers are then repeated 2 more times, but with different parameters.

The selection of parameters for these layers is empirical.
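As a quick sanity check on the spatial dimensions (assuming Caffe's usual rounding: floor for convolution, ceil for pooling): a 5x5 convolution with stride 1 and pad 2 keeps the 32x32 input size, since (32 + 2*2 - 5)/1 + 1 = 32, while each 3x3 max pooling with stride 2 roughly halves it: 32 -> 16 -> 8 -> 4. The last pooling layer therefore outputs 128 x 4 x 4 = 2048 values, which are fed into the first fully connected layer.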

Fully connected and output layers (fully-conn, output)


Configuration of fully connected and output layers
  ...
  layers {
    bottom: "pool3/3x3_s2"
    top: "ip1/3072"
    name: "ip1/3072"
    type: INNER_PRODUCT
    blobs_lr: 1
    blobs_lr: 2
    inner_product_param {
      num_output: 3072
      weight_filler { type: "gaussian" std: 0.001 }
      bias_filler { type: "constant" }
    }
  }
  layers {
    bottom: "ip1/3072"
    top: "ip1/3072"
    name: "ip1/relu_5x5"
    type: RELU
  }
  layers {
    bottom: "ip1/3072"
    top: "ip2/2048"
    name: "ip2/2048"
    type: INNER_PRODUCT
    blobs_lr: 1
    blobs_lr: 2
    inner_product_param {
      num_output: 2048
      weight_filler { type: "xavier" std: 0.001 }
      bias_filler { type: "constant" }
    }
  }
  layers {
    bottom: "ip2/2048"
    top: "ip2/2048"
    name: "ip2/relu_5x5"
    type: RELU
  }
  layers {
    bottom: "ip2/2048"
    top: "ip3/10"
    name: "ip3/10"
    type: INNER_PRODUCT
    blobs_lr: 1
    blobs_lr: 2
    inner_product_param {
      num_output: 10
      weight_filler { type: "xavier" std: 0.1 }
    }
  }
  layers {
    name: "accuracy"
    type: ACCURACY
    bottom: "ip3/10"
    bottom: "label"
    top: "accuracy"
    include: { phase: TEST }
  }
  layers {
    name: "loss"
    type: SOFTMAX_LOSS
    bottom: "ip3/10"
    bottom: "label"
    top: "loss"
  }


A fully connected layer has type: INNER_PRODUCT. The output layer is connected to a layer with the loss function (type: SOFTMAX_LOSS) and to an accuracy layer (type: ACCURACY). The accuracy layer works only in the test phase and shows the percentage of correctly classified images in the validation sample.

It is important to specify the weight_filler attribute correctly. If the initial weights are too large, the loss function may return NaN in the first iterations. In this case, you need to reduce the std parameter of the weight_filler attribute.
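For example, the first fully connected layer above uses weight_filler { type: "gaussian" std: 0.001 }; if the loss still diverged, the next thing to try would be an even smaller std such as 0.0001 (the exact value is, again, an empirical choice).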

Training parameters


Training parameters configuration
  net: "/home/deploy/opt/SVHN/svhn/winny-f/winny_f_svhn.prototxt" test_iter: 1 test_interval: 700 base_lr: 0.01 momentum: 0.9 weight_decay: 0.004 lr_policy: "inv" gamma: 0.0001 power: 0.75 solver_type: NESTEROV display: 100 max_iter: 77000 snapshot: 700 snapshot_prefix: "/mnt/home/deploy/opt/SVHN/svhn/snapshots/winny_net/winny-F" solver_mode: GPU 


To obtain a well-trained neural network, you need to set the training parameters. In Caffe, the training parameters are set via a protobuf configuration file; the configuration file for this contest is here. There are quite a few parameters; the most important ones are:
  - net: the path to the network architecture described above;
  - base_lr, lr_policy, gamma, power: the initial learning rate and the policy by which it decays over the iterations;
  - momentum and weight_decay: the momentum of the gradient updates and the strength of L2 regularization;
  - solver_type: NESTEROV: Nesterov's accelerated gradient is used instead of plain SGD;
  - test_interval and test_iter: how often the test (validation) phase is run and how many batches it processes;
  - snapshot and snapshot_prefix: how often intermediate weights are saved and where;
  - max_iter: the total number of training iterations;
  - solver_mode: whether training runs on the GPU or the CPU.
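For example, the lr_policy: "inv" used here decays the learning rate as lr = base_lr * (1 + gamma * iter)^(-power). At iteration 77,000 this gives 0.01 * (1 + 0.0001 * 77000)^(-0.75) ≈ 0.00197, which matches the lr = 0.00197406 line in the training log below.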


Training and Testing


To start training the network, you need to run the caffe train command, passing it the configuration file with the training parameters:
 > caffe train --solver=/home/deploy/winny-f/winny_f_svhn_solver.prototxt 

Short training log
  .......................
  I0109 18:12:17.035543 12864 solver.cpp:160] Solving WinnyNet-F
  I0109 18:12:17.035578 12864 solver.cpp:247] Iteration 0, Testing net (#0)
  I0109 18:12:17.077910 12864 solver.cpp:298] Test net output #0: accuracy = 0.0666667
  I0109 18:12:17.077997 12864 solver.cpp:298] Test net output #1: loss = 2.3027 (* 1 = 2.3027 loss)
  I0109 18:12:17.107712 12864 solver.cpp:191] Iteration 0, loss = 2.30359
  I0109 18:12:17.107795 12864 solver.cpp:206] Train net output #0: loss = 2.30359 (* 1 = 2.30359 loss)
  I0109 18:12:17.107817 12864 solver.cpp:516] Iteration 0, lr = 0.01
  .......................
  I0109 18:13:17.960325 12864 solver.cpp:247] Iteration 700, Testing net (#0)
  I0109 18:13:18.045385 12864 solver.cpp:298] Test net output #0: accuracy = 0.841667
  I0109 18:13:18.045462 12864 solver.cpp:298] Test net output #1: loss = 0.675567 (* 1 = 0.675567 loss)
  I0109 18:13:18.072872 12864 solver.cpp:191] Iteration 700, loss = 0.383181
  I0109 18:13:18.072949 12864 solver.cpp:206] Train net output #0: loss = 0.383181 (* 1 = 0.383181 loss)
  .......................
  I0109 20:08:50.567730 26450 solver.cpp:247] Iteration 77000, Testing net (#0)
  I0109 20:08:50.610496 26450 solver.cpp:298] Test net output #0: accuracy = 0.916667
  I0109 20:08:50.610571 26450 solver.cpp:298] Test net output #1: loss = 0.734139 (* 1 = 0.734139 loss)
  I0109 20:08:50.640389 26450 solver.cpp:191] Iteration 77000, loss = 0.0050708
  I0109 20:08:50.640470 26450 solver.cpp:206] Train net output #0: loss = 0.0050708 (* 1 = 0.0050708 loss)
  I0109 20:08:50.640494 26450 solver.cpp:516] Iteration 77000, lr = 0.00197406
  .......................
  I0109 20:52:32.236827 30453 solver.cpp:247] Iteration 103600, Testing net (#0)
  I0109 20:52:32.263108 30453 solver.cpp:298] Test net output #0: accuracy = 0.883333
  I0109 20:52:32.263183 30453 solver.cpp:298] Test net output #1: loss = 0.901031 (* 1 = 0.901031 loss)
  I0109 20:52:32.290550 30453 solver.cpp:191] Iteration 103600, loss = 0.00463345
  I0109 20:52:32.290627 30453 solver.cpp:206] Train net output #0: loss = 0.00463345 (* 1 = 0.00463345 loss)
  I0109 20:52:32.290644 30453 solver.cpp:516] Iteration 103600, lr = 0.00161609


One epoch is (73257 - 120) / 128 ≈ 571 iterations. After slightly more than 1 epoch, at 700 iterations, the network's accuracy on the validation sample is 84%. At epoch 134 the accuracy is already 91%, and at epoch 181 it is 88%. Perhaps, if the network were trained for more epochs, for example 1000, the accuracy would stabilize at a higher level. For this contest, training was stopped at 181 epochs.
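These epoch numbers line up with the iteration counts in the log: 134 epochs correspond to roughly 571 * 134 ≈ 76,500 iterations (the 77,000-iteration snapshot), and 181 epochs to roughly 571 * 181 ≈ 103,400 iterations (the 103,600-iteration entry). The 120 images subtracted above correspond to the validation batch (batch_size: 120, test_iter: 1) held out from the training data.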

In Caffe, you can resume network training from a snapshot by adding the --snapshot option:
 > caffe train --solver=/home/deploy/winny-f/winny_f_svhn_solver.prototxt --snapshot=winny_net/winny-F_snapshot_77000.solverstate 


Testing on unlabeled images


To test the network, you need to create a deploy configuration of the network architecture. Unlike the previous configuration, it has no accuracy layer, and the input layer is simplified.
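As a minimal sketch, the simplified input of such a deploy file in the old Caffe format looks roughly like this (the actual file is linked above; the batch dimension here is an arbitrary choice of one 32x32 RGB image at a time):

  name: "WinnyNet-F"
  input: "data"
  input_dim: 1
  input_dim: 3
  input_dim: 32
  input_dim: 32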

The test sample of 26,032 images comes without labels. Therefore, to get predictions for the contest's test sample, you need to write a bit of code. Caffe has interfaces for Python and Matlab.
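For example, here is a minimal sketch of such a script using the Python interface (pycaffe). The file names, the test image naming scheme and the submission format are hypothetical and would need to be adapted to the contest, the exact Python API may differ slightly between Caffe versions, and the preprocessing must mirror the training data layer (BGR channel order, 0-255 scale, mean image subtraction):

  import numpy as np
  import caffe

  caffe.set_mode_gpu()

  # deploy architecture + weights from a training snapshot (paths are hypothetical)
  net = caffe.Classifier('winny_f_svhn_deploy.prototxt',
                         'winny-F_snapshot_77000.caffemodel',
                         mean=np.load('mean.npy'),   # mean.binaryproto converted to .npy
                         channel_swap=(2, 1, 0),     # RGB -> BGR, as the IMAGE_DATA layer does
                         raw_scale=255,
                         image_dims=(32, 32))

  with open('submission.csv', 'w') as out:
      out.write('Id,Class\n')
      for i in range(1, 26033):                      # 26032 unlabeled test images
          img = caffe.io.load_image('test/%d.png' % i)
          probs = net.predict([img], oversample=False)[0]
          out.write('%d,%d\n' % (i, probs.argmax()))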

To test networks from different epochs, Caffe has snapshots. The network from epoch 134 showed an accuracy (Private Score on Kaggle) of 88.7%, and the network from epoch 181 showed 87.6%.

Ideas to improve accuracy


Judging by the master's thesis, the accuracy of the implemented architecture can reach 96%.

How can we try to improve on the 88.7% accuracy obtained? Based on the observations above:
  - train for more epochs (for example, 1000 instead of 181) and check whether the validation accuracy keeps growing;
  - preprocess the input more thoroughly, for example with Local Contrast Normalization instead of plain mean subtraction;
  - keep experimenting with the architecture and the layer parameters, which were selected empirically.



Conclusion


The implemented convolutional neural network showed an accuracy of 88.9%. This is not the best result, but not bad for a first attempt. There is potential to increase accuracy up to 96%.

Thanks to the Caffe framework, diving into deep learning is not very difficult. It is enough to create a couple of configuration files and start the training process with one command. Of course, basic knowledge of the theory of artificial neural networks is also needed; in this article I tried to provide that (in the form of links to materials) and other information for a quick start.

Source: https://habr.com/ru/post/249089/

