
Sequence-to-Sequence Models, Part 2

Hello!

Here is the second part of the translation, the first part of which we posted a couple of weeks ago in preparation for the launch of the second cohort of the Data Scientist course. More interesting material and an open lesson lie ahead.

In the meantime, let's dive deeper into the models.
Neural Translation Model

While the core of the sequence-to-sequence model is built from the functions in tensorflow/tensorflow/python/ops/seq2seq.py, there are still a couple of tricks used in our translation model in models/tutorials/rnn/translate/seq2seq_model.py that are worth mentioning.



Sampled softmax and output projection

As mentioned above, we want to use sampled softmax to handle a large output vocabulary. To decode from it, we have to keep track of the output projection. Both the sampled softmax loss and the output projection are constructed by the following code in seq2seq_model.py.

if num_samples > 0 and num_samples < self.target_vocab_size:
  w_t = tf.get_variable("proj_w", [self.target_vocab_size, size], dtype=dtype)
  w = tf.transpose(w_t)
  b = tf.get_variable("proj_b", [self.target_vocab_size], dtype=dtype)
  output_projection = (w, b)

  def sampled_loss(labels, inputs):
    labels = tf.reshape(labels, [-1, 1])
    # We need to compute the sampled_softmax_loss using 32bit floats to
    # avoid numerical instabilities.
    local_w_t = tf.cast(w_t, tf.float32)
    local_b = tf.cast(b, tf.float32)
    local_inputs = tf.cast(inputs, tf.float32)
    return tf.cast(
        tf.nn.sampled_softmax_loss(
            weights=local_w_t,
            biases=local_b,
            labels=labels,
            inputs=local_inputs,
            num_sampled=num_samples,
            num_classes=self.target_vocab_size),
        dtype)

First, note that we construct a sampled softmax only if the number of samples (512 by default) is smaller than the target vocabulary size. For vocabularies smaller than 512, it is better to use a standard softmax loss.
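For intuition, here is a minimal standalone sketch of that choice. It is not the tutorial's own code; the names num_samples, target_vocab_size, w_t and b are assumed to be defined as in the snippet above (TensorFlow 1.x API).

import tensorflow as tf  # TensorFlow 1.x API assumed

if 0 < num_samples < target_vocab_size:
    # Large vocabulary: approximate the loss over a random sample of classes.
    def softmax_loss_function(labels, inputs):
        labels = tf.reshape(labels, [-1, 1])
        return tf.nn.sampled_softmax_loss(
            weights=w_t, biases=b, labels=labels, inputs=inputs,
            num_sampled=num_samples, num_classes=target_vocab_size)
else:
    # Small vocabulary: passing None makes the seq2seq helpers fall back to
    # the exact (full) softmax cross-entropy loss.
    softmax_loss_function = None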

Then we construct the output projection. It is a pair consisting of a weight matrix and a bias vector. If used, the RNN cell returns vectors of shape batch_size by size, rather than batch_size by target_vocab_size. To recover the logits, we need to multiply by the weight matrix and add the bias, which is what happens in lines 124-126 of seq2seq_model.py.

if output_projection is not None:
  for b in xrange(len(buckets)):
    self.outputs[b] = [tf.matmul(output, output_projection[0]) + output_projection[1]
                       for ...]
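To make the shapes concrete, here is a tiny standalone sketch of the same projection (placeholder sizes, not the tutorial code):

import tensorflow as tf  # TensorFlow 1.x style

batch_size, size, target_vocab_size = 64, 1024, 40000
output = tf.zeros([batch_size, size])              # what the RNN cell returns
w_t = tf.zeros([target_vocab_size, size])          # "proj_w" from the snippet above
b = tf.zeros([target_vocab_size])                  # "proj_b"
logits = tf.matmul(output, tf.transpose(w_t)) + b  # shape [batch_size, target_vocab_size]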

Bucketing and padding

In addition to sampled softmax, our translation model uses bucketing, a method that lets us handle sentences of different lengths efficiently. Let us first explain the problem. When translating from English to French, we have English sentences of varying lengths L1 on the input and French sentences of varying lengths L2 on the output. Since the English sentence is passed in as encoder_inputs and the French sentence comes out as decoder_inputs (prefixed by a GO symbol), we would in principle need a seq2seq model for every pair (L1, L2 + 1) of English and French sentence lengths. This results in a huge graph consisting of many very similar subgraphs. On the other hand, we could pad every sentence with special PAD symbols. Then we would need only one seq2seq model, for the padded lengths. But such a model would be inefficient on short sentences: we would have to encode and decode many useless PAD symbols.

As a compromise between constructing a graph for every pair of lengths and padding everything to a single length, we use a number of buckets and pad each sentence to the length of the bucket above it. In translate.py we use the following buckets by default.

 buckets = [(5, 10), (10, 15), (20, 25), (40, 50)] 


Thus, if an English sentence with 3 tokens arrives at the input and the corresponding French sentence on the output contains 6 tokens, they go into the first bucket and are padded to length 5 for the encoder input and to length 10 for the decoder input. If the English sentence has 8 tokens and the corresponding French one has 18, they do not fit into the (10, 15) bucket and are moved to the (20, 25) bucket, that is, the English sentence is padded to 20 tokens and the French one to 25.
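Here is a minimal standalone sketch of that selection logic (an illustrative helper, not the tutorial's own function; the real code in translate.py also reserves room for the special GO/EOS symbols on the decoder side):

buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]

def pick_bucket(source_len, target_len):
    """Return the index of the smallest bucket that fits the sentence pair."""
    for i, (source_size, target_size) in enumerate(buckets):
        if source_len <= source_size and target_len <= target_size:
            return i
    return None  # too long for every bucket; such pairs are simply skipped

print(pick_bucket(3, 6))   # 0 -> padded to (5, 10)
print(pick_bucket(8, 18))  # 2 -> padded to (20, 25), since 18 > 15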

Remember that when creating the decoder inputs we prepend a special GO symbol to the input data. This is done in the get_batch() function in seq2seq_model.py, which also reverses the English sentence. Reversing the inputs was shown to improve the results of the neural translation model in Sutskever et al., 2014 (pdf). To put it all together, imagine we have the sentence "I go." on the input, tokenized into ["I", "go", "."], and the sentence "Je vais." on the output, tokenized into ["Je", "vais", "."]. This pair will be put into the (5, 10) bucket, with the encoder inputs represented as [PAD PAD "." "go" "I"] and the decoder inputs as [GO "Je" "vais" "." EOS PAD PAD PAD PAD PAD].
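The following sketch reproduces that example by hand; PAD, GO and EOS are stand-ins for the special vocabulary symbols, and the helper is an illustration rather than the actual get_batch() code:

PAD, GO, EOS = "PAD", "GO", "EOS"  # stand-ins for the special symbols

def make_bucket_pair(source_tokens, target_tokens, bucket):
    encoder_size, decoder_size = bucket
    # Encoder input: reverse the source sentence; padding goes in front so
    # the total length matches the bucket's encoder size.
    encoder_input = ([PAD] * (encoder_size - len(source_tokens))
                     + list(reversed(source_tokens)))
    # Decoder input: prepend GO, append EOS, then pad on the right.
    decoder_input = [GO] + list(target_tokens) + [EOS]
    decoder_input += [PAD] * (decoder_size - len(decoder_input))
    return encoder_input, decoder_input

enc, dec = make_bucket_pair(["I", "go", "."], ["Je", "vais", "."], (5, 10))
print(enc)  # ['PAD', 'PAD', '.', 'go', 'I']
print(dec)  # ['GO', 'Je', 'vais', '.', 'EOS', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD']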

Run it

To train the model described above, we need a large English-French corpus. For training we will use the 10^9 French-English corpus from the WMT'15 site, and the news test set from the same site as a development sample. Both datasets will be downloaded to data_dir when the following command is run, and training will start, saving checkpoints in train_dir.

 python translate.py --data_dir [your_data_directory] --train_dir [checkpoints_directory] --en_vocab_size=40000 --fr_vocab_size=40000 

You will need 18GB of disk space and a few hours to prepare the training corpus. The corpus is unpacked, vocabulary files are created in data_dir, and then the corpus is tokenized and converted to integer ids. Pay attention to the parameters that control the vocabulary sizes. In the example above, all words outside the 40,000 most frequent ones will be converted to a UNK token, which stands for an unknown word. So if you change the vocabulary size, the binary will re-map the corpus to token ids again. After the data is prepared, training begins.
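As a toy illustration of that vocabulary step, here is a minimal sketch of how words outside the most frequent ones end up mapped to the UNK id. The real logic lives in data_utils.py; the helper names below are placeholders.

from collections import Counter

_PAD, _GO, _EOS, _UNK = "_PAD", "_GO", "_EOS", "_UNK"

def build_vocab(tokenized_corpus, max_vocab_size):
    """Keep the most frequent words; everything else will map to _UNK."""
    counts = Counter(tok for sentence in tokenized_corpus for tok in sentence)
    words = [_PAD, _GO, _EOS, _UNK] + [w for w, _ in counts.most_common()]
    return {w: i for i, w in enumerate(words[:max_vocab_size])}

def sentence_to_ids(tokens, vocab):
    unk_id = vocab[_UNK]
    return [vocab.get(tok, unk_id) for tok in tokens]

vocab = build_vocab([["I", "go", "."], ["I", "run", "."]], max_vocab_size=6)
print(sentence_to_ids(["I", "swim", "."], vocab))  # "swim" becomes the _UNK id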

The default values of the parameters in translate.py are quite high. Large models trained for a long time show good results, but this can take too much time or too much GPU memory. You can train a smaller model, as in the example below.

 python translate.py --data_dir [your_data_directory] --train_dir [checkpoints_directory] --size=256 --num_layers=2 --steps_per_checkpoint=50 

The command above trains a model with two layers (the default is 3), each with 256 units (the default is 1024), and saves a checkpoint every 50 steps (the default is 200). Experiment with these parameters to find out how large a model fits into the memory of your GPU.

During training, every steps_per_checkpoint steps the binary prints statistics on the recent steps. With the default parameters (3 layers of size 1024), the first messages look like this:

global step 200 learning rate 0.5000 step-time 1.39 perplexity 1720.62
  eval: bucket 0 perplexity 184.97
  eval: bucket 1 perplexity 248.81
  eval: bucket 2 perplexity 341.64
  eval: bucket 3 perplexity 469.04
global step 400 learning rate 0.5000 step-time 1.38 perplexity 379.89
  eval: bucket 0 perplexity 151.32
  eval: bucket 1 perplexity 190.36
  eval: bucket 2 perplexity 227.46
  eval: bucket 3 perplexity 238.66

Note that each step takes just under 1.4 seconds, and that the log shows the perplexity on the training set and the perplexity on the development set for each bucket. After about 30 thousand steps we see the perplexities on short sentences (buckets 0 and 1) go into single digits. The training corpus contains about 22 million sentences, so one epoch (one pass over the training data) takes about 340 thousand steps with a batch size of 64. At this point the model can be used to translate English sentences into French with the --decode option.
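For reference, the perplexity printed in the log is simply the exponent of the average per-token cross-entropy loss; a minimal sketch of that relationship (the overflow cap is an assumption for illustration):

import math

def perplexity(avg_cross_entropy_loss):
    # Cap the exponent to avoid overflow on very large losses.
    return math.exp(avg_cross_entropy_loss) if avg_cross_entropy_loss < 300 else float("inf")

print(perplexity(7.45))  # roughly 1720, on the scale of the "global step 200" line above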

python translate.py --decode --data_dir [your_data_directory] --train_dir [checkpoints_directory]

Reading model parameters from /tmp/translate.ckpt-340000
> Who is the president of the United States?
 Qui est le président des États-Unis ?

What's next?

The example above shows how to build your own end-to-end English-French translator. Run it and see how the model performs. Its quality is reasonable, but an ideal translation model cannot be obtained with the default parameters. Here are a few things you can improve.

First, we use a very primitive tokenization, the basic basic_tokenizer function in data_utils. A better tokenizer can be found on the WMT'15 website. Using that tokenizer, together with a larger vocabulary, should improve your translations.

In addition, the default parameters of the translation model are not perfectly tuned. You can try changing the learning rate, the decay, or the initialization of the model weights. You can also replace the standard GradientDescentOptimizer in seq2seq_model.py with something more advanced, for example AdagradOptimizer. Try it and watch the results improve!
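A sketch of such a swap (TensorFlow 1.x API; the attribute name self.learning_rate is assumed to match the tutorial's model class):

# In seq2seq_model.py the optimizer is constructed roughly like this:
#   opt = tf.train.GradientDescentOptimizer(self.learning_rate)
# A drop-in alternative to experiment with:
opt = tf.train.AdagradOptimizer(self.learning_rate)
# Other optimizers worth trying: tf.train.AdamOptimizer, tf.train.RMSPropOptimizer.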

Finally, the model presented above can be used not only for translation but for any other sequence-to-sequence task. Even if you want to turn a sequence into a tree, for example to generate a parse tree, this model can produce state-of-the-art results, as shown in Vinyals & Kaiser et al., 2014 (pdf). So you can build not only a translator but also a parser, a chatbot, or any other program you like. Experiment!

That's all!

We look forward to your comments and questions here, or you can ask them to the instructor at the open lesson.

Source: https://habr.com/ru/post/432302/

