
Applying automatic machine learning to neural networks with the Transformer architecture

From the Google AI blog

Since their introduction in 2017, neural networks based on the Transformer architecture have been applied to tasks of many kinds, from generating fantasy-style fiction to writing musical harmonies. Importantly, the strong performance of Transformers has shown that, when applied to sequence tasks such as language modeling and translation, feed-forward neural networks can be just as effective as recurrent ones. Yet while the popularity of the Transformer and other feed-forward models for sequence tasks keeps growing, their architectures are almost always designed by hand, unlike in computer vision, where automated machine learning (AutoML) approaches have already discovered state-of-the-art models that outperform hand-tuned ones. Naturally, we wondered whether applying AutoML to sequence tasks could achieve the same success.

By running an evolutionary neural architecture search (NAS), using translation as a proxy for sequence tasks in general, we found the Evolved Transformer (ET), a new Transformer architecture that shows improvements on various natural language processing (NLP) tasks. ET not only achieves state-of-the-art results in translation, but also demonstrates better efficiency than the original Transformer on language modeling. We are releasing the new model in the Tensor2Tensor library, where it can be used for any sequence task.
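As an illustration only, here is a minimal sketch of loading the released model through Tensor2Tensor's registry. The registered names "evolved_transformer" and "evolved_transformer_base" are assumptions based on the library's usual naming scheme; check the Tensor2Tensor documentation for the exact names in your version.

    # Hedged sketch: look up the Evolved Transformer model class and its
    # hyperparameter set via Tensor2Tensor's registry. The names used here
    # are assumptions and may differ across library versions.
    from tensor2tensor import models          # importing this package registers the bundled models
    from tensor2tensor.utils import registry

    model_cls = registry.model("evolved_transformer")
    hparams = registry.hparams("evolved_transformer_base")
    print(model_cls.__name__, hparams.hidden_size)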

Developing the techniques


To begin the evolutionary neural architecture search, we had to develop new techniques, because the task used to evaluate the "fitness" of each architecture was computationally demanding. This makes these searches more expensive than similar searches in computer vision, which can operate on smaller datasets such as CIFAR-10. The first of these techniques is warm starting: seeding the initial evolutionary population with Transformer-type architectures instead of random models. This concentrates the search in a region of the search space that is known to be strong, allowing us to find the best models more quickly.
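A minimal sketch of the warm-start idea, assuming a simple evolutionary loop over architecture encodings; transformer_encoding, random_encoding, and the encoding format itself are hypothetical placeholders, not the actual search space.

    # Warm start: seed every individual in the initial population with the
    # known Transformer layout instead of a random architecture.
    import random

    def random_encoding():
        """Hypothetical random architecture encoding (placeholder)."""
        return [random.choice(["conv", "attention", "ffn"]) for _ in range(6)]

    def transformer_encoding():
        """Hypothetical encoding of the standard Transformer block layout."""
        return ["attention", "ffn", "attention", "ffn", "attention", "ffn"]

    def initial_population(size, warm_start=True):
        if warm_start:
            return [list(transformer_encoding()) for _ in range(size)]
        return [random_encoding() for _ in range(size)]

    population = initial_population(size=8, warm_start=True)
    print(population[0])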
The second technique is a new method we developed called Progressive Dynamic Hurdles (PDH). This algorithm augments the evolutionary search so that more resources are allocated to the strongest candidates, in contrast to previous NAS work, where every candidate model received the same amount of resources. PDH lets us terminate the evaluation of a model early if it is clearly poor, while rewarding promising architectures with more resources.
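A simplified sketch of the Progressive Dynamic Hurdles idea, under the assumption that fitness is measured after a growing number of training steps and that each hurdle is the mean fitness of the candidates evaluated at that stage. train_and_evaluate is a hypothetical stand-in for actually training a candidate architecture; in the real algorithm training continues from the previous checkpoint rather than restarting.

    import random

    def train_and_evaluate(candidate, steps):
        """Hypothetical: returns a fitness score after `steps` of training."""
        return random.random() * (steps ** 0.5)

    def progressive_dynamic_hurdles(candidates, step_schedule=(1000, 10000, 100000)):
        survivors = list(candidates)
        for steps in step_schedule:
            scores = {c: train_and_evaluate(c, steps) for c in survivors}
            hurdle = sum(scores.values()) / len(scores)      # dynamic hurdle for this stage
            survivors = [c for c, s in scores.items() if s >= hurdle]
            if len(survivors) <= 1:
                break
        return survivors

    print(progressive_dynamic_hurdles([f"arch_{i}" for i in range(16)]))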

The Evolved Transformer


Using these techniques, we ran a large-scale NAS on our translation task and found ET. Like most sequence-to-sequence (seq2seq) neural network architectures, it has an encoder that encodes the input sequence into embeddings and a decoder that uses those embeddings to construct the output sequence. In the case of translation, the input sequence is the sentence to be translated, and the output sequence is the translation.
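Purely as a schematic illustration of the seq2seq interface described above, the sketch below shows an encoder that turns input tokens into embedding vectors and a decoder that emits output tokens while conditioning on them. Both functions are placeholders, not the ET implementation.

    from typing import List
    import numpy as np

    def encode(source_tokens: List[str], dim: int = 16) -> np.ndarray:
        """Placeholder encoder: one embedding vector per input token."""
        rng = np.random.default_rng(0)
        return rng.standard_normal((len(source_tokens), dim))

    def decode(encoder_output: np.ndarray, max_len: int = 5) -> List[str]:
        """Placeholder decoder: emits output tokens one at a time,
        conditioning on the encoder output (here only its mean, for brevity)."""
        context = encoder_output.mean(axis=0)
        return [f"token_{i}_{context[0]:+.2f}" for i in range(max_len)]

    embeddings = encode("Hallo Welt !".split())
    print(decode(embeddings))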

The most interesting feature of ET is the convolutional layers at the bottom of both the encoder and decoder blocks, added in a similar branching pattern in both places (that is, the inputs pass through two different convolutional layers before being added together).


Comparison of the encoder architectures of the original Transformer and ET. Note the branching convolutional structure at the bottom of the block, which emerged independently in both the encoder and the decoder. The decoder is described in detail in our paper.

This is especially interesting because the encoder and decoder do not share architectures during the NAS, so the usefulness of this structure was discovered independently in the encoder and in the decoder, which speaks in favor of the scheme. Whereas the original Transformer relied entirely on self-attention, ET is a hybrid that takes advantage of both self-attention and wide convolution.
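A minimal NumPy sketch of the branching pattern described above: the same input passes through two different 1-D convolutions (with different kernel widths, chosen arbitrarily for illustration) and the two branch outputs are then summed. This illustrates the branching idea only, not the exact ET block.

    import numpy as np

    def conv1d(x, kernel):
        """Depthwise 1-D convolution over the time axis with 'same' padding.
        x: (time, channels), kernel: (width, channels)."""
        width = kernel.shape[0]
        pad = width // 2
        xp = np.pad(x, ((pad, width - 1 - pad), (0, 0)))
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            out[t] = np.sum(xp[t:t + width] * kernel, axis=0)
        return out

    def branched_block(x, kernel_a, kernel_b):
        """Two parallel convolution branches whose outputs are added."""
        return conv1d(x, kernel_a) + conv1d(x, kernel_b)

    x = np.random.randn(10, 8)                     # (sequence length, channels)
    out = branched_block(x, np.random.randn(3, 8), np.random.randn(7, 8))
    print(out.shape)                               # (10, 8)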

Evaluation


To test the effectiveness of this new architecture, we first compared it with the original Transformer on the English-German translation task we had used during the search. We found that ET achieves better BLEU and perplexity at all parameter sizes, with the largest gain at a size comparable to mobile models (~7 million parameters), which indicates efficient use of parameters. At larger sizes, ET reaches state-of-the-art results on WMT'14 En-De with a BLEU score of 29.8 and a SacreBLEU score of 29.2.
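For reference, BLEU scores of the kind reported above can be reproduced with the open-source sacrebleu package; the sentences below are toy examples, not data from the WMT'14 evaluation.

    import sacrebleu

    hypotheses = ["The cat sits on the mat ."]
    references = [["The cat sat on the mat ."]]   # one list per reference stream

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(f"SacreBLEU: {bleu.score:.1f}")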


Comparison of ET and the original Transformer on WMT'14 En-De at different model sizes. The largest advantage is achieved at small sizes, while ET also performs well at larger sizes, outperforming the largest Transformer with 37.6% fewer parameters (the compared models are circled).

To test generalizability, we also compared ET with the Transformer on additional natural language processing problems. First, we checked translation for other language pairs and found that ET performs better, with margins roughly matching those seen on English-German translation; once again, thanks to efficient use of parameters, the largest gap is observed for medium-sized models. We also compared the decoders of both models on language modeling with LM1B and saw a significant improvement in perplexity.
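Perplexity, the language-modeling metric mentioned above, is the exponential of the average per-token negative log-likelihood. A minimal computation from made-up per-token probabilities:

    import math

    token_probs = [0.25, 0.10, 0.50, 0.05]   # model probabilities of the observed tokens
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    perplexity = math.exp(nll)
    print(f"perplexity = {perplexity:.2f}")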



Future plans


These results are a first step in exploring architecture search for feed-forward sequence models. ET is distributed as open source as part of the Tensor2Tensor project, where it can be used for any sequence problem. To improve reproducibility, we are also open-sourcing the search space code we used and a Colab notebook with an implementation of PDH. We look forward to seeing what the research community does with the new model, and we hope that others will be able to build on these new search techniques!

Source: https://habr.com/ru/post/460099/

