
How HBO did Not Hotdog for the Silicon Valley TV Show



The HBO series "Silicon Valley" released a real AI application that recognizes hot dogs and not-hot-dogs, as featured in the fourth episode of the fourth season (the app is now available for Android as well as iOS!).

To achieve this, we designed a custom neural architecture that runs directly on your phone and trained it with TensorFlow, Keras, and an Nvidia GPU.


Although its practical utility is deliberately ridiculous, the application is an accessible example of both deep learning and edge computing. All the AI work is performed 100% on the user's device, and images are processed without ever leaving the phone. This gives users an instant response (no round trip to the cloud), offline availability, and better privacy. It also lets us run the application at a cost of $0, even with millions of users, which is a significant saving compared to traditional cloud-based approaches to AI.

The developer's laptop with an eGPU attached, used to train the Not Hotdog AI

The application was developed in-house for the show by a single developer, on a single laptop with an attached GPU, using hand-curated data. In that respect it is a demonstration of what can be achieved today, with limited time and resources, by non-technical companies, individual developers, and hobbyists alike. In that spirit, this article attempts to give a detailed overview of the steps others can follow to build their own applications.


The Application



If you have not watched the show or tried the application (you should!): it takes a photo and then gives you its verdict on whether the photo shows a hot dog or not. It is a straightforward use case that pays tribute to recent AI research and applications, in particular ImageNet.

Although we have probably devoted more engineering effort to recognizing hot dogs than anyone else in the world, the application still sometimes fails in terrible and/or subtle ways.


Conversely, it is sometimes able to recognize hot dogs in difficult situations... As Engadget wrote, "This is incredible. In 20 minutes I successfully identified more food with this app than I have tagged and identified songs with Shazam in the past two years."


From prototype to production


Have you ever caught yourself thinking, while reading Hacker News: "They raised $10 million in a Series A for that? I could build it in a weekend!" This app will probably make you feel the same way. After all, the original prototype really was built in a weekend, using the Google Cloud Platform Vision API and React Native. But the final version that eventually shipped to the app stores took months of additional part-time work (a few hours a day) to make meaningful improvements that are hard to appreciate from the outside. We spent weeks optimizing overall accuracy, training time, and inference time; we tried different setups and tools to speed up development; and we spent an entire weekend optimizing the user experience around iOS and Android permissions (don't even get us started on that topic).

All too often, technical blog posts and academic papers skip this part and jump straight to the final version. To help others learn from our mistakes and choices, we will first present a short list of approaches that did not work for us before describing the final architecture we eventually arrived at.

V0: Prototype



An example image and the corresponding API output, from the Google Cloud Vision documentation

For the prototype we chose React Native, because it provides a simple sandbox for experimentation and helps support many devices quickly. The experience turned out well, and we kept React Native for the rest of the project: it did not always simplify things, and the application's design had to be deliberately constrained, but in the end React Native did its job.

We quickly abandoned the other main component of the prototype, the Google Cloud Vision API, for three main reasons:

  1. First, and most importantly, its accuracy at recognizing hot dogs was only so-so. While it is great at recognizing a broad variety of objects, it is not so good at recognizing one specific thing, and there were various fairly common examples where the service performed poorly during our experiments in 2016.
  2. By its nature, a cloud service will always be slower than native execution on the device (network latency hurts!), and it does not work offline. Sending images off the device also has potential legal and privacy implications.
  3. Finally, if the application became popular, running on Google Cloud could get very expensive.

For these reasons, we started experimenting with what is fashionably called edge computing. In our case, that means that after training our neural network on a laptop, we ship it directly inside the mobile application, so that the execution (or inference) phase runs directly on the user's phone.

V1: TensorFlow, Inception and Retraining



Thanks to a serendipitous meeting with Pete Warden of the TensorFlow team, we learned that TensorFlow can be embedded and run directly on an iOS device, and we started experimenting in that direction. After React Native, TensorFlow became the second fixed part of our stack.

It took only a day to integrate TensorFlow's Objective-C++ camera sample into our React Native shell. It took much longer to master the training script that retrains the Inception architecture for more specific machine vision tasks. Inception is the name of a family of neural architectures created by Google for image recognition. Inception is available pre-trained, meaning the training phase has been completed and the weights are set. Image recognition networks are most often trained on ImageNet, the annual competition to find the architecture best at recognizing more than 20,000 different types of objects (hot dogs among them). However, much like the Google Cloud Vision API, the competition rewards recognizing as many objects as possible, and the out-of-the-box accuracy for any single one of those 20,000+ objects is not very high. For this reason, retraining (also called "transfer learning") aims to take a fully trained neural network and retrain it to perform better on the specific task you are working on. This usually involves some degree of "forgetting," either by cutting entire layers out of the stack or by slowly eroding the network's ability to distinguish certain types of objects (for example, chairs) in exchange for greater accuracy on the object you care about (for example, hot dogs).

While the network (in this case Inception) may have been trained on 14 million ImageNet images, we were able to retrain it on just a few thousand hot dog photos to drastically improve its hot dog recognition.

The big advantage of transfer learning is that you get better results much faster and with less data than if you trained a network from scratch. A full training run could take months on multiple GPUs and require millions of images, while retraining can conceivably be done in a few hours on a laptop with a couple of thousand photos.
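To make the retraining recipe concrete, here is a minimal transfer learning sketch in Keras (illustrative only, not our production code; it assumes Keras 2 with the pre-trained InceptionV3 weights available and a train_generator that you supply):

from keras.applications.inception_v3 import InceptionV3
from keras.layers import GlobalAveragePooling2D, Dense
from keras.models import Model

# Load Inception without its 1000-class ImageNet head.
base = InceptionV3(weights='imagenet', include_top=False)

# Freeze the pre-trained layers so only the new head is trained at first.
for layer in base.layers:
    layer.trainable = False

x = GlobalAveragePooling2D()(base.output)
out = Dense(1, activation='sigmoid')(x)   # hot dog vs. not hot dog
model = Model(base.input, out)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit_generator(train_generator, steps_per_epoch=..., epochs=...)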

One of the hardest problems we faced was defining exactly what counts as a hot dog and what does not. Defining a "hot dog" turned out to be surprisingly hard (do sliced sausages count, and if so, which kinds?) and subject to cultural interpretation.

Similarly, the "open world" nature of our problem meant we would have to deal with an essentially infinite variety of inputs. Some computer vision tasks deal with a relatively limited set of inputs (say, X-rays of bolts with or without mechanical defects), but we had to prepare the application to handle selfies, pictures of nature, and a wide variety of dishes.

Suffice it to say that this approach was promising and led to some improvement in the results, but it had to be abandoned for a number of reasons.

First, the nature of our task meant a heavy imbalance in the training data: there are many more examples of things that are not hot dogs than of hot dogs themselves. In practice, this means that if you train your algorithm on three hot dog images and 97 non-hot-dog images, and it recognizes 0% of the former and 100% of the latter, you still get a nominal accuracy of 97%! The TensorFlow retraining tool does not address this out of the box, which essentially forces you to set up a deep learning model from scratch, import the weights, and run training in a more controlled way.

At this point we decided to bite the bullet and start working with Keras, a deep learning library that provides nicer, easier-to-use abstractions on top of TensorFlow, including pretty awesome training tools and a class_weight option that is ideal for dealing with an unbalanced dataset like ours.
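As a hedged sketch of why that option matters (the numbers are illustrative, and model, x_train, and the rest are assumed to exist): with roughly 49 times more non-hot-dogs than hot dogs, weighting each hot dog example 49 times more heavily in the loss keeps "always answer not-hot-dog" from looking like a 97%-accurate model.

# Hypothetical labels: 1 = hot dog, 0 = not hot dog.
class_weight = {0: 1.0, 1: 49.0}

model.fit(x_train, y_train,
          batch_size=128,
          epochs=20,
          class_weight=class_weight,
          validation_data=(x_val, y_val))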

We used this opportunity to try other neural architectures, such as VGG, but one problem persisted: none of them ran comfortably on an iPhone. They consumed too much memory, which led to application crashes, and could take up to 10 seconds to return a result, which is not ideal from a UX standpoint. We tried many things to fix this, but in the end we accepted that these architectures were simply too bulky to run on a mobile device.

V2: Keras and SqueezeNet



SqueezeNet vs. AlexNet, the grandfather of computer vision architectures. Source: the SqueezeNet paper

For context, this point is roughly halfway through the project's timeline. By then the UI was more than 90% done, with very little left to change. In hindsight, though, the neural network was at best 20% done. We had a good grasp of the problems and a good dataset, but zero lines of the final neural architecture had been written, none of our code could run reliably on a phone, and even the accuracy still had to be drastically improved.

The immediate problem was simple: if Inception and VGG are too bulky, is there a simpler, pre-trained network we could retrain? On a tip from the always great Jeremy Howard (where had this guy been all our lives?), we tried Xception, Enet, and SqueezeNet. We quickly settled on SqueezeNet because it is explicitly positioned as a solution for embedded deep learning, and because a pre-trained Keras model was available on GitHub (hooray for open source).

So how big is the difference? An architecture like VGG uses about 138 million parameters (essentially, the count of numbers needed to model the neurons and the connections between them). Inception is significant progress, requiring only 23 million parameters. SqueezeNet, by comparison, gets by with 1.25 million parameters.
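The first two figures are easy to sanity-check in Keras itself (a quick sketch; SqueezeNet does not ship with keras.applications, so it is omitted here):

from keras.applications import VGG16, InceptionV3

# weights=None builds the architectures with random weights, no download needed.
print(VGG16(weights=None).count_params())        # roughly 138 million
print(InceptionV3(weights=None).count_params())  # roughly 23 million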

This gives two advantages:

  1. During training, a smaller network trains much faster. There are fewer parameters to keep in memory, so you can parallelize training a bit more (larger batch sizes), and the network converges faster (that is, approaches the idealized mathematical function sooner).

  2. In production, the model is much smaller and much faster. SqueezeNet needs less than 10 MB of RAM, whereas architectures like Inception require 100 MB or more. That difference is gigantic, and it matters especially on mobile devices, which may give your application less than 100 MB of available memory. Smaller networks also compute their result much faster than bigger ones.

Of course, something had to be sacrificed:

  1. A smaller network has less "memory" available: it will not be as effective at hard tasks (such as recognizing 20,000 different objects) or even at handling complex cases within a narrow task (for example, telling New York-style hot dogs from Chicago-style ones). As a result, smaller networks typically show lower accuracy than larger ones. On the 20,000-object ImageNet task, SqueezeNet reaches only 58% recognition accuracy, versus 72% for VGG.

  2. A small network is harder to retrain. Technically, nothing stops us from taking the same approach as with Inception and VGG, making SqueezeNet "forget" a little and retraining it specifically to distinguish hot dogs from non-hot-dogs. In practice, we had trouble tuning the learning rate, and the results were always less satisfying than training SqueezeNet from scratch. This may also be partly due to the "open world" nature of our task.

  3. In theory, smaller networks should rarely overfit, yet we ran into overfitting with several "small" architectures. Overfitting means that your network specializes too much: instead of learning to recognize hot dogs in general, it learns to recognize exactly and only the specific hot dog photos you trained it on. The human analogy would be memorizing the specific hot dog photos shown to you instead of abstracting that a hot dog usually consists of a sausage in a bun, possibly with condiments, and so on. Shown a completely new hot dog image rather than one you had memorized, you would be inclined to say it is not a hot dog. Because small networks usually have less "memory," it is easy to see why it should be harder for them to over-specialize. Yet in some cases the accuracy of our small networks jumped up to 99%, and they suddenly stopped recognizing images they had not seen during training. The effect usually disappeared once we added data augmentation: semi-randomly stretched and distorted images at the input, so that instead of seeing each of the 1,000 images exactly once per pass, the network trains on thousands of variations of each one, which reduces the chance of it simply memorizing those 1,000 images. Instead, it has to learn the "signs" of a hot dog (bun, sausage, condiments, etc.) while staying flexible and general enough not to get attached to the specific pixel values of particular images in the training set.


Example of augmented data from the Keras blog

At this stage, we started experimenting with tuning the neural network architecture itself. In particular, we began using Batch Normalization and trying different activation functions.
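The block pattern we kept coming back to looks roughly like this (a sketch of the pattern only; the filter count and strides are illustrative):

from keras.layers import Convolution2D, BatchNormalization, Activation

def conv_bn_elu(x, filters, strides=(1, 1)):
    # Convolution, then Batch Normalization, then an ELU activation.
    x = Convolution2D(filters, (3, 3), strides=strides, padding='same')(x)
    x = BatchNormalization()(x)
    return Activation('elu')(x)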


After adding Batch Normalization and ELU to SqueezeNet, we were able to train networks that reached above 90% accuracy when trained from scratch, but they were quite brittle: the same network would overfit in some cases or underfit in others when confronted with real-world testing. Even adding more examples to the dataset and experimenting with data augmentation did not yield a network that performed consistently.

So although this stage was promising and, for the first time, gave us a functioning application that ran entirely on an iPhone and returned a result in under a second, we eventually moved on to our fourth and final architecture.

V3: The DeepDog Architecture


from keras.applications.imagenet_utils import _obtain_input_shape
from keras import backend as K
from keras.layers import Input, Convolution2D, SeparableConvolution2D, \
    GlobalAveragePooling2D, Dense, Activation, BatchNormalization
from keras.models import Model
from keras.engine.topology import get_source_inputs
from keras.utils import get_file
from keras.utils import layer_utils


def DeepDog(input_tensor=None, input_shape=None, alpha=1, classes=1000):

    input_shape = _obtain_input_shape(input_shape,
                                      default_size=224,
                                      min_size=48,
                                      data_format=K.image_data_format(),
                                      include_top=True)

    if input_tensor is None:
        img_input = Input(shape=input_shape)
    else:
        if not K.is_keras_tensor(input_tensor):
            img_input = Input(tensor=input_tensor, shape=input_shape)
        else:
            img_input = input_tensor

    # Stem: a regular convolution, then depthwise-separable blocks,
    # each followed by Batch Normalization and an ELU activation.
    x = Convolution2D(int(32 * alpha), (3, 3), strides=(2, 2), padding='same')(img_input)
    x = BatchNormalization()(x)
    x = Activation('elu')(x)

    x = SeparableConvolution2D(int(32 * alpha), (3, 3), strides=(1, 1), padding='same')(x)
    x = BatchNormalization()(x)
    x = Activation('elu')(x)

    x = SeparableConvolution2D(int(64 * alpha), (3, 3), strides=(2, 2), padding='same')(x)
    x = BatchNormalization()(x)
    x = Activation('elu')(x)

    x = SeparableConvolution2D(int(128 * alpha), (3, 3), strides=(1, 1), padding='same')(x)
    x = BatchNormalization()(x)
    x = Activation('elu')(x)

    x = SeparableConvolution2D(int(128 * alpha), (3, 3), strides=(2, 2), padding='same')(x)
    x = BatchNormalization()(x)
    x = Activation('elu')(x)

    x = SeparableConvolution2D(int(256 * alpha), (3, 3), strides=(1, 1), padding='same')(x)
    x = BatchNormalization()(x)
    x = Activation('elu')(x)

    x = SeparableConvolution2D(int(256 * alpha), (3, 3), strides=(2, 2), padding='same')(x)
    x = BatchNormalization()(x)
    x = Activation('elu')(x)

    for _ in range(5):
        x = SeparableConvolution2D(int(512 * alpha), (3, 3), strides=(1, 1), padding='same')(x)
        x = BatchNormalization()(x)
        x = Activation('elu')(x)

    x = SeparableConvolution2D(int(512 * alpha), (3, 3), strides=(2, 2), padding='same')(x)
    x = BatchNormalization()(x)
    x = Activation('elu')(x)

    x = SeparableConvolution2D(int(1024 * alpha), (3, 3), strides=(1, 1), padding='same')(x)
    x = BatchNormalization()(x)
    x = Activation('elu')(x)

    # Global average pooling and a single sigmoid unit: hot dog vs. not hot dog.
    x = GlobalAveragePooling2D()(x)
    out = Dense(1, activation='sigmoid')(x)

    if input_tensor is not None:
        inputs = get_source_inputs(input_tensor)
    else:
        inputs = img_input

    model = Model(inputs, out, name='deepdog')

    return model

Design


Our final architecture was heavily influenced by Google's MobileNets paper, published on April 17, 2017, which describes a new family of neural architectures with Inception-like accuracy on simple tasks like ours, using only around 4 million parameters. That puts it in a sweet spot between SqueezeNet, which may have been too simplistic for our task, and Inception and VGG, which are too heavyweight for mobile use. The paper describes several knobs for tuning the size and complexity of the network, specifically for trading off memory/CPU consumption against accuracy, which was exactly what we were wrestling with at the time.

With less than a month to go before the deadline, we set out to reproduce the paper's results. It was utterly anticlimactic: within a day of the paper being published, a Keras implementation was already publicly available on GitHub, courtesy of Refik Can Malli, a student at Istanbul Technical University, whose work we had already relied on when we took his Keras SqueezeNet. The size, skill, and openness of the deep learning community, and the presence of talents like Refik, are what make deep learning viable for modern applications, and they also make working in this field more exciting than any other tech field I have been involved in.

In our final architecture, we departed significantly from the original MobileNets architecture, and from conventional wisdom, in particular:


So how does this stack actually work? Deep learning often gets a bad rap as a "black box," and while many of its components can indeed be mysterious, our networks often leak information about how some of their magic works. We can take individual layers from this stack and look at how they activate on specific input images, which gives us a sense of how well each layer can recognize sausages, buns, or other conspicuous hot dog features.
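A hedged sketch of how such an inspection can be done in Keras (the layer name here is hypothetical; pick a real one from model.summary(), and img is assumed to be a preprocessed batch of shape (1, 224, 224, 3)):

from keras.models import Model

layer_name = 'separable_conv2d_5'   # hypothetical; choose from model.summary()
activation_model = Model(inputs=model.input,
                         outputs=model.get_layer(layer_name).output)
feature_maps = activation_model.predict(img)   # shape: (1, h, w, channels)
# Each channel can then be rendered as a heat map over the input image.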



Training


The quality of the training data was paramount. A neural network can only be as good as the data it is fed, and improving the quality of the training set was probably one of the three things we spent the most time on during this project. To improve it, we took the following key steps:



Source: Wikimedia Commons


In its final form, our dataset consisted of 150,000 images, of which only 3,000 were hot dogs. The 49:1 imbalance was handled with a Keras class weight setting of 49:1 in favor of hot dogs. Most of the remaining 147,000 photos were of various foods, with only 3,000 non-food photos, to help the network generalize a bit better and avoid mistaking, say, a person in red clothes for a hot dog.

Our data augmentation rules are as follows:


These parameters were chosen intuitively, based on experimentation and our understanding of how the application would be used in the real world, rather than through careful experimentation.
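For reference, this kind of augmentation can be expressed with Keras's ImageDataGenerator; the ranges and directory below are illustrative placeholders, not our exact production values:

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,        # small random rotations
    width_shift_range=0.1,    # random horizontal shifts
    height_shift_range=0.1,   # random vertical shifts
    shear_range=0.1,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')

train_generator = datagen.flow_from_directory(
    'data/train',             # hypothetical directory layout
    target_size=(224, 224),
    batch_size=128,
    class_mode='binary')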

As the final stage of our data pipeline, we used Patrick Rodriguez's multiprocess image generator for Keras. Although Keras has built-in multithreading and multiprocessing support, Patrick's library was consistently faster in our experiments, for reasons we did not have time to investigate. It cut our training time by a third.

The network was trained on a 2015 MacBook Pro with an attached external GPU (eGPU), specifically an Nvidia GTX 980 Ti (we would probably buy a 1080 Ti if we were starting today). We were able to train on batches of 128 images at a time. The network was trained for a total of 240 epochs, meaning we ran all 150,000 images through it 240 times. That took about 80 hours.
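In Keras terms, that training loop boils down to something like the following sketch (assuming a model and a train_generator like the ones sketched above):

model.fit_generator(
    train_generator,
    steps_per_epoch=150000 // 128,   # one pass over the ~150k images per epoch
    epochs=240,
    class_weight={0: 1.0, 1: 49.0})  # the 49:1 weighting described above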

We trained the neural network in three stages:




Although the learning rates were determined by running the linear experiment recommended by the authors of the CLR paper, they seem intuitively sensible: the maximum rate at each stage is roughly half of the previous minimum, which matches the industry convention of halving the learning rate when accuracy plateaus during training.
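For comparison, that industry convention maps onto a standard Keras callback; this is a hedged sketch of the convention, not the exact CLR schedule we ran:

from keras.callbacks import ReduceLROnPlateau

# Halve the learning rate when validation accuracy stops improving.
reduce_lr = ReduceLROnPlateau(monitor='val_acc',   # 'val_accuracy' on newer Keras
                              factor=0.5,
                              patience=5,
                              min_lr=1e-6)
# model.fit_generator(..., callbacks=[reduce_lr])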

To save time, we ran part of the training on a Paperspace P5000 instance running Ubuntu. In some cases we could double the batch size, and the optimal learning rate at each stage roughly doubled as well.

Running the neural network on mobile phones


Even with a relatively compact neural architecture, designed and trained with the mobile context in mind, there was still a lot of work left to make the application run properly. Run a top-notch neural architecture as-is and it will quickly consume hundreds of megabytes of RAM, which few modern mobile devices can spare. Beyond optimizing the network itself, it turned out that the way images are processed, and even the way TensorFlow itself is loaded, have a huge impact on the network's speed, the amount of RAM consumed, and the number of crashes.

This was probably the most mysterious part of the project. Information on the topic is quite hard to find, possibly because so few deep learning applications run on mobile devices today. However, we are grateful to the TensorFlow team, and especially Pete Warden, Andrew Harp, and Chad Whipkey, for the existing documentation and their kindness in answering our questions.


Instead of using TensorFlow on iOS, we also looked at Apple's built-in deep learning libraries (BNNS, MPSCNN, and later Core ML). We would have designed the network in Keras, trained it with TensorFlow, exported all the weight values, re-implemented the network with BNNS or MPSCNN (or imported it via Core ML), and loaded the parameters into that new implementation. However, the biggest obstacle was that these new Apple libraries are only available on iOS 10+, and we wanted to support older iOS versions. As iOS 10+ adoption grows and these frameworks improve, there may be no need to run TensorFlow on-device at all in the future.
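For completeness, here is a hedged sketch of what the Core ML route could have looked like with the coremltools Keras converter available at the time (we did not ship this path, since it requires iOS 10+):

import coremltools

# Convert the trained Keras model, treating the input tensor as an image.
coreml_model = coremltools.converters.keras.convert(
    model,
    input_names='image',
    image_input_names='image')
coreml_model.save('DeepDog.mlmodel')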

Changing the application's behavior by injecting neural networks on the fly


If you think injecting JavaScript into your app on the fly is cool, try injecting neural networks on the fly! The last production trick we used was to leverage CodePush and Apple's relatively permissive terms of service to push new versions of our neural network live after the app had shipped to the app stores. We did this mainly to improve accuracy after release, but the same approach could, in theory, be used to drastically improve your app's functionality without going through another App Store review.

 #import <CodePush/CodePush.h> 
NSString* FilePathForResourceName(NSString* name, NSString* extension) {
  // Previously: load the model from the main bundle shipped with the app.
  // NSString* file_path = [[NSBundle mainBundle] pathForResource:name ofType:extension];

  // Now: resolve it inside the CodePush bundle so the model can be updated on the fly.
  NSString* file_path = [[[[CodePush.bundleURL.URLByDeletingLastPathComponent
      URLByAppendingPathComponent:@"assets"]
      URLByAppendingPathComponent:name]
      URLByAppendingPathExtension:extension] path];
  if (file_path == NULL) {
    LOG(FATAL) << "Couldn't find '" << [name UTF8String] << "."
               << [extension UTF8String] << "' in bundle.";
  }
  return file_path;
}
 

import React, { Component } from 'react';
import { AppRegistry } from 'react-native';
import CodePush from "react-native-code-push";
import App from './App';

class nothotdog extends Component {
  render() {
    return (
      <App />
    )
  }
}

require('./deepdog.pdf')

const codePushOptions = { checkFrequency: CodePush.CheckFrequency.ON_APP_RESUME };

AppRegistry.registerComponent('nothotdog', () => CodePush(codePushOptions)(nothotdog));

What would we do differently


There are plenty of things that did not work, or that we did not have time to try, and here are some of the ideas we would explore in the future:


UX/DX, bias, and the "uncanny valley" of AI


Finally, it would be unforgivable not to mention the obvious and important influence of user experience (UX), developer experience (DX), and built-in bias on building an AI application. Each of these topics probably deserves its own article (or its own book), but here is the very concrete impact these three factors had on our work.

UX (user experience) is arguably more important at every stage of building an AI application than of a regular one. Right now there are no deep learning algorithms that will give you perfect results, but there are many situations where the right combination of deep learning and UX produces results indistinguishable from perfect. Properly set UX expectations are invaluable when it comes to steering the design of the neural network and handling the inevitable AI failures gracefully. Building AI applications without thinking about UX is like training a neural network without stochastic gradient descent: you will get stuck in the local minimum of the uncanny valley on your way to a perfectly working AI.


Source: New Scientist

DX (developer experience) is extremely important too, because neural network training time is the new waiting-for-the-compiler headache. We believe you should put DX first on your list of priorities (hence choosing Keras), because it is always possible to optimize the runtime later (manual GPU parallelization, multiprocess data augmentation, a TensorFlow pipeline, even reimplementing in Caffe2 or PyTorch).


Even projects with comparatively clunky documentation, like TensorFlow, vastly improve developer experience by providing well-tested, widely used, and superbly supported machinery for training and running neural networks.

For the same reason, it is hard to beat a local GPU of your own for development in terms of cost and convenience. Being able to view and edit images locally, and to edit code in your favorite editor without lag, greatly improves the quality and speed of developing AI projects.

Most AI applications will confront more consequential cultural biases than ours did. But even in our simple case, thinking about cultural differences up front led us to teach the network to recognize French-style hot dogs, Asian hot dogs, and even stranger variants we had no idea existed. It is important to remember that AI does not make "better" decisions than humans: it is infected with the same biases we have, and the infection happens through the training humans provide.

Source: https://habr.com/ru/post/331740/

