
The machine intelligence behind Gboard

Most people spend a significant part of every day typing on their mobile device: composing emails, chat messages, social-network posts, and so on. Yet mobile keyboards are still rather inefficient: the average user types about 35% slower on a mobile keyboard than on a physical one. To change this, we recently introduced many improvements to Gboard for Android. We set out to create an intelligent mechanism that lets you enter text quickly while offering suggestions and correcting errors, in any language you choose.

A mobile keyboard translates touch into text in much the same way that a speech recognition system translates voice into text, so we drew on our experience with speech recognition. First, we created robust spatial models that map fuzzy sequences of touch points on the touchscreen to keyboard keys, just as acoustic models map sequences of sounds to phonetic units. Then we built a powerful decoding engine based on finite-state transducers (FSTs) to determine the most likely phrase for a given touch sequence. We knew that, with its mathematical formalism and broad success in speech applications, the FST decoder would provide the flexibility needed to support a wide variety of complex input methods and language features. In this article we describe in detail what went into developing both of these systems.

Neural Spatial Models


Typing on a mobile keyboard is prone to errors, commonly referred to as "fat finger" errors (or tracing spatially similar words in glide typing, as shown below), as well as motor and cognitive errors (which show up as typos, inserted characters, omitted characters, or transposed characters). A smart keyboard must account for these errors and predict the intended word quickly and accurately. To that end, we built a spatial model for Gboard that corrects these character-level errors by mapping touch points on the screen to actual keys.


Average trajectories for two spatially similar words: "Vampire" and "Value"
Until recently, Gboard used 1) a Gaussian model to estimate the probability of touching neighboring keys and 2) a rule-based model to represent motor and cognitive errors. These models were simple and intuitive, but they did not allow us to directly optimize metrics that correlate with better typing quality. Drawing on our experience with Voice Search acoustic models, we replaced both the Gaussian and the rule-based model with a single, highly efficient long short-term memory (LSTM) model trained with a connectionist temporal classification (CTC) criterion.
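To give a feel for the Gaussian model that was replaced, here is a minimal sketch: the probability of each intended key falls off with the touch point's distance from the key center. The key coordinates and spread are made-up illustrative values, not Gboard's actual parameters.

```python
import math

# Hypothetical key centers on a touch grid (units: pixels); illustrative only.
KEY_CENTERS = {"q": (15, 20), "w": (45, 20), "e": (75, 20), "a": (25, 60), "s": (55, 60)}
SIGMA = 18.0  # assumed spread of touch points around a key center

def key_likelihoods(touch_x, touch_y):
    """Return P(key | touch) under an isotropic Gaussian per key, normalized."""
    scores = {
        k: math.exp(-((touch_x - cx) ** 2 + (touch_y - cy) ** 2) / (2 * SIGMA ** 2))
        for k, (cx, cy) in KEY_CENTERS.items()
    }
    total = sum(scores.values())
    return {k: s / total for k, s in scores.items()}

probs = key_likelihoods(50, 22)   # a touch near "w", drifting toward "e"
best = max(probs, key=probs.get)  # most likely intended key
```

A model like this handles "fat finger" offsets gracefully, but it has no notion of inserted, omitted, or transposed characters, which is why a rule-based layer was needed on top of it and why a single LSTM is a cleaner replacement.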

However, training this model turned out to be much harder than we had expected. While acoustic models are trained on audio data with human-prepared transcriptions, it was difficult to prepare transcriptions for millions of touch point sequences and finger trajectories on the keyboard. So we used interaction signals from the users themselves (reverted autocorrections and accepted suggestions) as negative and positive signals for semi-supervised learning. This produced extensive training and test datasets.


Raw data points corresponding to the word "could" (left) and normalized sampled trajectories with per-sample deviations (right)

We tried many techniques from the speech recognition literature to make our neural spatial models (NSMs) compact and fast enough to run on any device. Hundreds of models were trained on the TensorFlow infrastructure, optimizing different signals from the keyboard: completions, suggestions, glide typing, and so on. After more than a year of work, the finished models were about 6 times faster and 10 times smaller than the initial versions. They also reduced bad autocorrections by about 15% and misrecognized gestures on offline datasets by 10%.

Finite-state transducers


While NSMs use spatial information to help determine what was tapped or swiped, there are additional constraints, lexical and grammatical, that can be brought to bear. A lexicon tells us which words occur in a language, and a probabilistic grammar tells us which words are likely to follow others. To encode this information we used finite-state transducers (FSTs), which have long been a key component of Google's speech recognition and synthesis systems. FSTs provide a principled way to represent the various probabilistic models used in natural language processing (lexicons, grammars, normalizers, etc.), together with the mathematical framework needed to manipulate, optimize, combine, and search the models*.

In Gboard, a key-to-word transducer compactly represents the keyboard lexicon, as shown in the figure below. It encodes the mappings from key sequences to words, allowing for alternative key sequences and optional spaces.


The transducer encodes "I", "I've", and "If" along paths from the initial state (the bold circle labeled "1") to final states (the double-circled states labeled "0" and "1"). Each arc is labeled with an input key (before the colon) and the corresponding output word (after the colon), where ε denotes the empty symbol. The apostrophe in "I've" may be omitted. Users sometimes skip the space between words, so the space between words is optional in the transducer. The ε labels and back arcs allow more than one word to be entered.
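A toy lexicon like the one in the figure can be sketched as a small arc table: each state maps an input key to a next state and an output string, with "" playing the role of ε. The state numbering and encoding here are illustrative stand-ins, not Gboard's actual representation.

```python
# Minimal transducer-like lexicon encoding "I", "I've", "If".
# state: {input_key: (next_state, output_string)}; "" acts as epsilon.
ARCS = {
    0: {"i": (1, "")},
    1: {"'": (2, ""), "v": (3, "I've"), "f": (4, "If")},
    2: {"v": (3, "I've")},           # apostrophe typed explicitly
    3: {"e": (4, "")},
}
# Final states may emit a pending output (e.g. bare "i" resolves to "I").
FINAL = {1: "I", 4: ""}

def decode(keys):
    """Follow arcs for a key sequence; return the emitted word or None."""
    state, out = 0, ""
    for k in keys:
        if state not in ARCS or k not in ARCS[state]:
            return None
        state, piece = ARCS[state][k]
        out += piece
    return out + FINAL[state] if state in FINAL else None
```

Note how "ive" and "i've" both decode to "I've", mirroring the optional apostrophe in the figure; a real FST would also weight the arcs and compose this lexicon with the language model.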

A probabilistic n-gram transducer is used to represent the keyboard's language model. A state in the model represents a context of (up to) n−1 previous words. An arc leaving that state is labeled with a following word, together with the probability that it follows the context (estimated from text data). Combined with the spatial model, which gives the probabilities of key-touch sequences (discrete taps in regular typing or continuous gestures in glide typing), this model is used in a beam search algorithm.
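The way the two models combine during search can be illustrated with a toy example: each candidate word gets a spatial score and a bigram language-model score, and the beam keeps the highest-scoring combinations. All probabilities below are made up for illustration.

```python
import math

# Toy scores: P(word | touch sequence) from the spatial model, and
# P(word | previous word) from a bigram language model. Made-up numbers.
SPATIAL = {"could": 0.6, "cloud": 0.4}
BIGRAM = {("i", "could"): 0.05, ("i", "cloud"): 0.001}

def rank(prev_word, candidates, beam=2):
    """Score candidates by combined log-probability; keep the top `beam`."""
    scored = sorted(
        ((w, math.log(SPATIAL[w]) + math.log(BIGRAM[(prev_word, w)]))
         for w in candidates),
        key=lambda x: -x[1],
    )
    return scored[:beam]

best = rank("i", ["could", "cloud"])[0][0]
```

Even though "cloud" is spatially plausible for the gesture, the language model makes "I could" far more likely than "I cloud", so the beam search prefers "could". This is exactly why spatially similar words like those in the earlier figure can still be disambiguated.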

The general principles of FSTs (streaming, support for dynamic models, and so on) allowed us to make significant progress in developing the new keyboard decoder, but several extra features had to be added. When you speak aloud, you don't need the decoder to complete your words or guess the next one to save you a few syllables; but when you type, completions and predictions are very helpful. We also wanted the keyboard to provide seamless multilingual support, as shown below.


Trilingual input in Gboard.

It took a complex engineering effort to make the new decoder work, but the fundamental nature of finite-state transducers has many advantages. For example, supporting transliteration for languages like Hindi is a simple extension of the basic decoder.

Transliteration models


Many languages with complex scripts have romanization systems that map their characters to the Latin alphabet, often according to their phonetic pronunciation. For example, the Pinyin "xiexie" corresponds to the Chinese characters "谢谢" ("thank you"). A Pinyin keyboard lets you conveniently type words on a QWERTY layout and automatically "translates" them into the target script. Likewise, a transliterating Hindi keyboard lets you type "daanth" for the word "दांत" ("tooth"). While Pinyin is an agreed-upon romanization system, Hindi transliteration is less standardized: for example, "daant" is also a valid alternative for "दांत".
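At its simplest, a weighted transliteration mapping can be pictured as a table in which several romanizations lead to the same native-script word, with weights for ranking. This lookup table is a drastically simplified stand-in for the weighted transducers described below; the weights are invented.

```python
# Illustrative weighted transliteration table: multiple Latin spellings can
# map to the same native-script word. Weights are made up for the example.
TRANSLIT = {
    "daant": [("दांत", 0.6)],
    "daanth": [("दांत", 0.5)],
    "xiexie": [("谢谢", 0.9)],
}

def transliterate(latin):
    """Return the highest-weighted native-script candidate, or echo the input."""
    options = TRANSLIT.get(latin, [])
    return max(options, key=lambda x: x[1])[0] if options else latin
```

A real system encodes these correspondences at the character level with a weighted FST, so that unseen romanizations like "daant" vs. "daanth" are handled compositionally rather than by enumeration.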


Transliterated glide typing in Hindi

Given a transducer that converts letter sequences into words (the lexicon) and a weighted language model over word sequences, we developed weighted transducers that convert between Latin character sequences and the native scripts of 22 Indian languages. Some languages have several scripts (for example, Bodo can be written in the Bengali or the Devanagari script), so in just a few months we built 57 input methods spanning transliteration and native-script entry.

The universal nature of the FST decoder let us reuse all the work done earlier to support completions, predictions, glide typing, and many UI features with no extra effort, so our Indian users got a high-quality product from the very beginning.

A smarter keyboard


Overall, our recent work cut decoding latency by 50%, reduced the fraction of words users have to correct manually by about 10%, enabled transliteration support for the 22 official languages of India, and led to many new features you may have noticed.

We hope these latest changes help you type on your mobile device's keyboard. But we recognize that this task is by no means solved. Gboard can still make suggestions that seem strange or unhelpful, and gestures can still be recognized as words a person would never type. However, our shift to powerful machine-intelligence algorithms opens up new possibilities that we are actively exploring in order to build more useful tools and products for our users around the world.

Acknowledgments


This work was done by Cyril Allauzen, Ouais Alsharif, Lars Hellsten, Tom Ouyang, Brian Roark, and David Rybach, with help from the Speech Data Operations team. Special thanks to Johan Schalkwyk and Corinna Cortes for their support.

* A set of relevant algorithms is available in the open-source OpenFst library.

Source: https://habr.com/ru/post/329884/
