
Autocoding "Blade Runner"

Reconstruction of films using artificial neural networks



I present a translation of the author's description of the autoencoder algorithm used to create the reconstruction of the film "Blade Runner", which I have already written about. That post covered the general history of the film and how Warner filed, and then withdrew, a copyright infringement claim. Here you will find a more detailed technical description of the algorithm and even its code.

In this post I will describe the work I have been doing for the past year: reconstructing films with artificial neural networks. A network is first trained to reconstruct individual frames from a film; it then reconstructs every frame in the film, and the frames are reassembled into a sequence.
The type of neural network used is called an autoencoder: a network with a very small hidden layer. It encodes a piece of data into a much shorter representation (in this case, a set of 200 numbers) and then reconstructs the data as well as it can. The reconstruction is not perfect, but the project was for the most part a creative exploration of the possibilities and limitations of this approach.
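
To make the idea concrete, here is a minimal sketch of such an autoencoder in modern tf.keras. This is an illustration only: the original project used an earlier TensorFlow API, and its real architecture is the variational model described in the technical section below; the layer sizes here are my own assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 200  # the "very small hidden layer" described above

# Encoder: compresses a 144x256 RGB frame down to 200 numbers.
encoder = tf.keras.Sequential([
    layers.Input(shape=(144, 256, 3)),
    layers.Conv2D(64, 5, strides=2, padding="same", activation="relu"),
    layers.Conv2D(128, 5, strides=2, padding="same", activation="relu"),
    layers.Flatten(),
    layers.Dense(LATENT_DIM),
])

# Decoder: reconstructs the frame from the 200-number code.
decoder = tf.keras.Sequential([
    layers.Input(shape=(LATENT_DIM,)),
    layers.Dense(36 * 64 * 128, activation="relu"),
    layers.Reshape((36, 64, 128)),
    layers.Conv2DTranspose(64, 5, strides=2, padding="same", activation="relu"),
    layers.Conv2DTranspose(3, 5, strides=2, padding="same", activation="sigmoid"),
])

autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")  # plain pixel loss for this sketch
```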

The work was done as part of a thesis in Creative Computing at Goldsmiths, University of London.


Reconstruction of Roy Batty's eye looking out over Los Angeles in the opening scene


Comparison of the original Blade Runner trailer with its reconstruction


The first 10 minutes of the film's reconstruction

Technical description


Over the last 12 months, interest in, and effort toward, using neural networks to generate text, images and sound has risen sharply. Image generation methods in particular have advanced considerably in recent months.


Bedroom images generated by a DCGAN

In November 2015, Radford and co-authors greatly surprised the community interested in this topic by using neural networks to create realistic images of bedrooms and faces, trained with the adversarial method. One network (the generator) creates random examples, and a second network (the discriminator) tries to distinguish generated images from real ones. Over time, the generator produces more and more realistic images that the discriminator is no longer able to tell apart from the real thing. The adversarial method was first proposed by Goodfellow and co-authors in 2014, but before Radford's work it had not been possible to create realistic images with it. The important breakthrough that made this possible was the use of a convolutional architecture for generating images. Prior to this, it was assumed that convolutional networks could not effectively generate images, since pooling layers lose information between layers. Radford did away with pooling layers and simply used strided transposed convolutions. (For those unfamiliar with the concept of a convolutional autoencoder, I prepared a visualization.)
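
To illustrate the trick, here is a sketch of a DCGAN-style generator in tf.keras: no pooling layers anywhere, only strided transposed convolutions that learn their own upsampling. The layer sizes are illustrative and not the exact network from Radford's paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def make_generator(noise_dim=100):
    """Maps a random noise vector to a 64x64 RGB image."""
    return tf.keras.Sequential([
        layers.Input(shape=(noise_dim,)),
        layers.Dense(4 * 4 * 512),
        layers.Reshape((4, 4, 512)),
        # Each strided transposed convolution doubles the resolution:
        # 4x4 -> 8x8 -> 16x16 -> 32x32 -> 64x64.
        layers.Conv2DTranspose(256, 5, strides=2, padding="same", activation="relu"),
        layers.Conv2DTranspose(128, 5, strides=2, padding="same", activation="relu"),
        layers.Conv2DTranspose(64, 5, strides=2, padding="same", activation="relu"),
        layers.Conv2DTranspose(3, 5, strides=2, padding="same", activation="tanh"),
    ])
```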



I had been studying generative models before Radford's work, but after its publication it became obvious that this was the right approach. However, generative adversarial networks cannot reconstruct images; they only generate samples from random noise. So I began to explore training a variational autoencoder, which can reconstruct images, together with a discriminator network used for the adversarial approach, or even with a network that evaluates the similarity between a reconstructed image and a real example. But before I even got to it, Larsen and co-authors published a paper in 2015 combining both approaches in a very elegant way: comparing the responses to real and reconstructed images in the upper layers of the discriminator network. This produces a learned similarity metric, fundamentally superior to pixel-by-pixel comparison of reconstruction errors (which otherwise leads to blurred reconstructions; see the picture above).
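
For readers unfamiliar with the variational part: the encoder of a variational autoencoder outputs a mean and a log-variance for each latent variable, and the code z is sampled with the reparameterization trick so that gradients can flow through the sampling step. A minimal sketch with illustrative names, not the project's actual code:

```python
import tensorflow as tf

def sample_latent(z_mean, z_log_var):
    # Reparameterization trick: z = mu + sigma * epsilon,
    # with epsilon drawn from a standard normal distribution.
    eps = tf.random.normal(shape=tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

def kl_loss(z_mean, z_log_var):
    # KL divergence pushing the latent distribution toward a unit Gaussian.
    return -0.5 * tf.reduce_mean(
        1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
```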


The general scheme of the variational autoencoder combined with a discriminator network

The Larsen model consists of three separate networks: an encoder, a decoder and a discriminator. The encoder encodes a data sample x into a latent representation z. The decoder tries to recreate the sample from the latent representation. The discriminator is fed both original and reconstructed samples and determines whether they are real or fake. Its responses in the upper layers of the network are compared to determine how close the reconstruction comes to the original.
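
The key ingredient is the learned similarity metric. Here is a sketch of the idea, assuming a hypothetical discriminator_features model that returns the activations of one of the discriminator's upper layers:

```python
import tensorflow as tf

def learned_similarity_loss(x_real, x_recon, discriminator_features):
    # Compare the discriminator's internal responses to the original
    # and to the reconstruction, rather than comparing raw pixels; this
    # is what avoids the blurriness of pixel-wise reconstruction losses.
    h_real = discriminator_features(x_real)
    h_recon = discriminator_features(x_recon)
    return tf.reduce_mean(tf.square(h_real - h_recon))
```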

I implemented this model in TensorFlow, intending to extend it with an LSTM for video prediction, but due to time constraints I never got to that. Building it did, however, lead me to a model for generating large non-square images. Previous models worked with 64x64 images in batches of 64; I scaled the network up to 256x144 images in batches of 12 (my NVIDIA GTX 960 GPU could not handle more). The latent representation had 200 variables, which means the model encoded a 256x144 image with three colour channels (110,592 values) into a representation of just 200 numbers before reconstructing the image. The network was trained on a dataset of all the frames of Blade Runner, cropped and scaled to 256x144. Training for 6 epochs took about 2 weeks.
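
The author's data-preparation code is not shown here, but producing such a dataset might look something like this sketch using OpenCV (my assumption, not the project's actual pipeline):

```python
import cv2
import numpy as np

def frames_to_dataset(video_path, size=(256, 144)):
    """Reads a video and returns its frames scaled to 256x144, in [0, 1]."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # The real pipeline also crops frames to the target aspect ratio.
        frame = cv2.resize(frame, size)  # cv2 takes (width, height)
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(frame.astype(np.float32) / 255.0)
    cap.release()
    return np.stack(frames)  # shape: (n_frames, 144, 256, 3)
```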

Artistic motivation



Reconstruction of the second Voight-Kampff test

Ridley Scott's 1982 film Blade Runner is an adaptation of Philip K. Dick's classic science fiction novel Do Androids Dream of Electric Sheep? (1968). In the film, Rick Deckard (Harrison Ford) is a bounty hunter who earns a living hunting down and killing replicants: androids made so well that they cannot be distinguished from humans. Deckard administers Voight-Kampff tests to tell androids from humans, asking increasingly complex moral questions and studying the subject's pupils, with the intent of provoking an empathic response in humans but not in androids.

One of the story's central questions is the problem of determining what is human and what is not, which becomes ever harder as the technology grows more sophisticated. The new Nexus-6 androids created by the Tyrell Corporation eventually develop emotional responses, and the new prototype, Rachael, has memory implants that make her believe she is human. The method of separating humans from non-humans is clearly borrowed from the methodological skepticism of the great French philosopher René Descartes; even the name Deckard sounds much like Descartes. Throughout the film Deckard tries to determine who is human and who is not, while the film leaves open the question of whether Deckard himself is human.

I will not go deep into the philosophical problems the film raises (there are two good articles on the subject); I will only say the following. The success of deep learning systems comes in part from systems becoming more and more embedded in their environment. A disembodied system that recognizes images but is not embedded in the environment those images represent is a model whose characteristics resemble Cartesian dualism, in which body and mind are separate.

A neural network is a relatively simple mathematical model (compared with the brain), and excessive anthropomorphization of these systems is problematic. Nonetheless, advances in deep learning mean that the embedding of models in their environment, and their relationship to theories about the nature of mind, deserve to be considered for their technical, philosophical and artistic consequences.







Reconstruction of "Blade Runner"




The reconstructed film turned out unexpectedly coherent. It is not a perfect reconstruction, but given that this kind of model was designed to represent a set of images of a single type of object seen from a single perspective, it copes remarkably well, considering how varied the individual frames are.





Static, high-contrast scenes that change little over time are reconstructed very well: in effect, the model has "seen" the same frame many more times than the 6 training epochs would suggest. This could be considered overfitting, but since the goal is deliberately to model this particular dataset, it is nothing to worry about.

The model tends to collapse many similar frames with minimal differences into one (for example, when an actor is speaking in an otherwise static scene). Because the model represents individual images, it has no way of knowing that there are small deviations from frame to frame, or that those deviations matter.

The model also struggles to produce sharp reconstructions of low-contrast scenes, especially those containing faces, and it struggles with faces that change a lot, when a head moves or turns. This is not surprising, given the limitations of the model and how sensitive people are to faces. Recent advances that combine generative models with spatial transformer networks may help with these issues, but describing them is beyond the scope of this article.




Reconstruction of scenes from "Koyaanisqatsi" using a network trained on "Blade Runner"

Reconstruction of other films



An example of the reconstruction of "Koyaanisqatsi" using the network trained on "Blade Runner"

Besides reconstructing the film the network was trained on, you can make it reconstruct any video. I experimented with different films, and it worked best with one of my favourites, Koyaanisqatsi (1982). It consists mostly of slow-motion and time-lapse footage, and its scenes vary enormously, making it an ideal candidate for testing the Blade Runner model.

Unsurprisingly, the model recreates the film it was trained on much better than videos it has never "seen". The results could be improved by training the network on a large amount of diverse video material, say hundreds of hours of random videos, but the model would lose the aesthetic quality that comes from being trained on a single, finished film. And although individual frames are often hard to make out, in motion the images become more coherent, yet remain quite unpredictable.
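
Mechanically, reconstructing any video just means pushing each of its frames through the trained encoder and decoder and writing the outputs back out as video. A sketch, assuming the frames_to_dataset helper and a trained autoencoder model like the ones sketched earlier, with imageio as one possible writer:

```python
import numpy as np
import imageio

def reconstruct_video(frames, autoencoder, out_path, fps=24, batch=12):
    """Writes the model's reconstruction of `frames` to a video file."""
    writer = imageio.get_writer(out_path, fps=fps)
    for i in range(0, len(frames), batch):
        # Each frame is compressed to 200 numbers and decoded back.
        recon = autoencoder.predict(frames[i:i + batch], verbose=0)
        for frame in recon:
            writer.append_data((frame * 255).astype(np.uint8))
    writer.close()
```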


Reconstruction of Apple’s 1984 ad using a network trained on Blade Runner


Reconstruction of John Whitney's "Matrix III" using the network trained on "Blade Runner"

Besides Koyaanisqatsi, I tested the network on reconstructing two more films. The first is the famous 1984 Apple Macintosh ad, directed by the same Ridley Scott; Steve Jobs hired him after seeing Blade Runner in the cinema. The ad has a lot in common with Blade Runner visually, which made it a natural choice.

The second is John Whitney's animation "Matrix III". Whitney was a pioneer of computer animation and IBM's first artist-in-residence, from 1966 to 1969. Matrix III (1972) was one of a series of films demonstrating the principles of harmonic progression. It was chosen to test how the model would cope with reconstructing abstract, unnatural images.





Autocoding "Clouding"




After Blade Runner, I wanted to see how the model would behave when trained on another film, and I chose A Scanner Darkly (2006). It is another Philip K. Dick adaptation, and stylistically it is quite different from Blade Runner. A Scanner Darkly was made using rotoscoping: it was first shot on camera, and then every frame was traced over by animators.

The model captured the film's style quite well (although not as well as artistic style transfer does), but found it even harder to reconstruct faces. Apparently this is because the film's high-contrast outlines make facial features harder to pin down, and because light and shadow shift in unnatural, exaggerated ways from frame to frame.


The Blade Runner trailer, reconstructed using a network trained on "A Scanner Darkly"


Fragment of "Koyaaniskatsi" reconstructed using the network trained on the "Clouding"

Again, films reconstructed by a model trained on a different film are barely recognizable. The results are not as cohesive as with the Blade Runner model, perhaps because A Scanner Darkly has a much wider range of colours, which makes modelling natural images harder for this model. On the other hand, the images are unusual and complex, which produces an unpredictable video sequence.

Conclusion


Honestly, I was very surprised by how the model behaved when I started training it on Blade Runner. The reconstruction of the film came out better than I could have imagined, and I am amazed at what happened when it reconstructed other films. I will definitely experiment with training on more films to see what comes of it, and I would like to adapt the training procedure to take the sequence of frames into account, so that the network can better distinguish long runs of similar frames.

The project code is available on GitHub. Details are also on my site.

Source: https://habr.com/ru/post/369331/

