
T2F: a project for converting text descriptions into face images with deep learning



Project code is available in the repository.

Introduction


When I read descriptions of characters' appearances in books, I have always wondered what they would look like in real life. It is quite possible to imagine a person in general terms, but picturing the most conspicuous details is difficult, and the results vary from person to person. Many times I could not imagine anything but a very blurry face for a character until the very end of the book. Only when a book is turned into a film does the blurry face fill in with details. For example, I could never picture exactly what Rachel's face from the book “The Girl on the Train” looks like. But when the movie came out, I was able to match Emily Blunt's face to Rachel's character. Surely the people involved in casting spend a long time getting the characters from the script portrayed correctly.

This problem inspired and motivated me to look for a solution. I then began studying the deep learning literature in search of something similar. Fortunately, there is quite a bit of research on synthesizing images from text. Here are some of the works I built on:

[The listed projects use generative adversarial networks (GANs). / translator's note]

After reviewing the literature, I chose an architecture that is simplified compared to StackGAN++ and handles my problem quite well. In the following sections, I explain how I approached the problem and share preliminary results. I also describe some of the programming and training details that took a lot of my time.

Data analysis


Undoubtedly, the most important aspect of the work is the data used to train the model. As Andrew Ng says in his deeplearning.ai courses: “In machine learning, success goes not to the one with the best algorithm, but to the one with the best data”. So began my search for a data set of faces with good, rich, and varied text descriptions. I came across various data sets: some were just faces, some were faces with names, some were faces with descriptions of eye color and face shape. But none were what I needed. My last resort was to use an earlier project of mine on generating natural-language descriptions from structured data, but that would have added extra noise to an already fairly noisy data set.

Time passed, and at some point the new Face2Text project appeared: a collection of detailed textual descriptions of faces. My thanks to the project's authors for the data set they provided.

The data set contains textual descriptions of 400 images randomly selected from the LFW (Labeled Faces in the Wild) database. The descriptions have been cleaned to remove ambiguous and irrelevant characteristics. Some descriptions contain not only information about the faces but also conclusions drawn from the images, for example, “the person in the photo is probably a criminal”. All of these factors, along with the small size of the data set, mean that for now my project only serves as a proof of concept for the architecture. Later the model can be scaled to a larger and more diverse data set.

Architecture




The T2F architecture combines two approaches: the conditioning augmentation used in StackGAN for encoding text, and ProGAN (progressive growing of GANs) for synthesizing face images. The original StackGAN++ architecture uses several GANs at different spatial resolutions, which I felt was too heavy a solution for any distribution-matching task. ProGAN, by contrast, uses a single GAN that is trained progressively at higher and higher resolutions. I decided to combine the two approaches.

The data flow is as follows: text descriptions are encoded into a summary vector by an LSTM embedding network (psy_t in the diagram). The embedding is then passed through the Conditioning Augmentation block (a single linear layer) to obtain the textual part of the latent vector (using the VAE reparameterization technique), which serves as input to the GAN. The second part of the latent vector is random Gaussian noise. The resulting latent vector is fed to the GAN's generator, while the embedding is fed to the final layer of the discriminator for conditional matching of the distribution. The GAN is trained exactly as in the ProGAN paper: layer by layer, at increasing spatial resolutions. A new layer is introduced using the fade-in technique so as not to destroy what was learned previously.
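To make the conditioning part of this pipeline concrete, here is a minimal sketch of what such a conditioning augmentation block could look like in PyTorch. The class name, dimensions, and the split into mu / log-variance are illustrative assumptions, not taken from the project code:

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Maps a text embedding to a mean and log-variance, then samples the
    textual part of the latent vector with the VAE reparameterization trick."""
    def __init__(self, embedding_dim, latent_dim):
        super().__init__()
        # a single linear layer produces both mu and log(sigma^2)
        self.projection = nn.Linear(embedding_dim, 2 * latent_dim)
        self.latent_dim = latent_dim

    def forward(self, text_embedding):
        stats = self.projection(text_embedding)
        mu, log_var = stats[:, :self.latent_dim], stats[:, self.latent_dim:]
        eps = torch.randn_like(mu)                 # reparameterization trick
        c = mu + eps * torch.exp(0.5 * log_var)    # conditioned text code
        return c, mu, log_var                      # mu/log_var feed a KL term

# usage: concatenate the text code with Gaussian noise to form the GAN input
embedding = torch.randn(4, 256)                    # e.g. an LSTM sentence embedding
ca = ConditioningAugmentation(embedding_dim=256, latent_dim=128)
c, mu, log_var = ca(embedding)
z = torch.cat([c, torch.randn(4, 384)], dim=1)     # full latent vector for the generator
```

The returned mu and log_var are what the Kullback-Leibler term mentioned in the training details below would be computed from.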

Implementation and other details


The application was written in Python using the PyTorch framework. I had previously worked with TensorFlow and Keras, but this time I wanted to try PyTorch. I enjoyed being able to use the Python debugger while working on the network architecture, thanks to eager execution. TensorFlow has also recently added an eager execution mode. However, I do not want to judge which framework is better; I only want to emphasize that the code for this project was written in PyTorch.

Quite a few parts of the project seem reusable to me, especially the ProGAN part. So I wrote them as a separate extension of the PyTorch Module class that can be used with other data sets: you only need to specify the depth and the feature size of the GAN, and it can then be trained progressively on any data set.
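As a rough illustration of what "parameterized only by depth and feature size" can mean, here is a minimal, self-contained skeleton of a progressive generator. It deliberately omits fade-in blending, equalized learning rates, minibatch standard deviation, and the other ProGAN details; the class and argument names are hypothetical and do not come from the project repository:

```python
import torch
import torch.nn as nn

class ProgressiveGenerator(nn.Module):
    """Skeleton of a reusable progressive generator: only depth and the
    feature (latent) size are specified; the blocks are heavily simplified."""
    def __init__(self, depth, latent_size):
        super().__init__()
        self.depth = depth
        # initial 4x4 block, then one upsampling block per extra depth level
        self.initial = nn.Sequential(
            nn.ConvTranspose2d(latent_size, latent_size, kernel_size=4),
            nn.LeakyReLU(0.2),
        )
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Upsample(scale_factor=2),
                nn.Conv2d(latent_size, latent_size, kernel_size=3, padding=1),
                nn.LeakyReLU(0.2),
            )
            for _ in range(depth - 1)
        ])
        # one "to RGB" layer per resolution so intermediate depths can be trained
        self.to_rgb = nn.ModuleList([
            nn.Conv2d(latent_size, 3, kernel_size=1) for _ in range(depth)
        ])

    def forward(self, z, current_depth):
        x = self.initial(z.view(z.size(0), -1, 1, 1))
        for block in self.blocks[:current_depth]:
            x = block(x)
        return self.to_rgb[current_depth](x)

# usage: with depth=5 and current_depth=4, 64x64 images are produced
gen = ProgressiveGenerator(depth=5, latent_size=512)
images = gen(torch.randn(2, 512), current_depth=4)   # -> (2, 3, 64, 64)
```

Training then simply increases current_depth over time, which is what makes the module reusable for any image data set of a suitable resolution.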

Training details


I trained quite a few versions of the network with different hyperparameters. The training details are as follows:

  1. The discriminator has no batch-norm or layer-norm operations, so the WGAN-GP loss can blow up. I used a drift penalty with a lambda of 0.001 (see the loss sketch after this list).
  2. To control the diversity of the latent obtained from the encoded text, the Kullback-Leibler divergence should be added to the generator's loss.
  3. To make the generated images better match the input textual distribution, it is better to use the Matching-Aware variant of the WGAN discriminator.
  4. The fade-in time for the higher layers must exceed the fade-in time for the lower ones. I used 85% as the fade-in fraction during training.
  5. I found that the higher-resolution examples (32 x 32 and 64 x 64) have more background noise than the lower-resolution ones. I think this is due to a lack of data.
  6. During progressive training, it is better to spend more time at the smaller resolutions and reduce the time spent at the larger ones.
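As a rough illustration of item 1, here is one way the WGAN-GP critic loss with an added drift penalty might be written. The gradient-penalty computation itself is omitted, and the exact form used in the project may differ:

```python
import torch

def wgan_gp_discriminator_loss(real_scores, fake_scores, gradient_penalty,
                               drift_lambda=0.001):
    """WGAN-GP critic loss with a drift penalty.

    The drift term keeps the critic's outputs from drifting far from zero,
    which matters here because the critic has no batch-norm / layer-norm.
    `gradient_penalty` is the usual WGAN-GP term computed on interpolated
    samples; its computation is left out for brevity.
    """
    wasserstein = fake_scores.mean() - real_scores.mean()
    drift = drift_lambda * (real_scores ** 2).mean()
    return wasserstein + gradient_penalty + drift

# toy usage with random critic outputs and a placeholder gradient penalty
real_scores = torch.randn(8)
fake_scores = torch.randn(8)
gp = torch.tensor(0.0)
loss = wgan_gp_discriminator_loss(real_scores, fake_scores, gp)
```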

The video shows a timelapse of the generator's output. It is assembled from images at different spatial resolutions obtained during GAN training.


Conclusion


Judging by the preliminary results, the T2F project works and has interesting applications: for example, it could be used to compose facial composites, or in cases where one needs to spur the imagination. I will continue working on scaling this project to data sets such as Flickr8K, COCO Captions, and so on.

Progressive growing of GANs is a phenomenal technique for faster and more stable GAN training. It can be combined with various modern techniques described in other papers. GANs can be used in many areas of machine learning.

Source: https://habr.com/ru/post/420709/

