
WaveNet: computer-synthesized human-like speech



DeepMind is an autonomous division of Google working in the field of artificial intelligence. The company developed AlphaGo, the system that beat Go world champion Lee Sedol.

But games are not DeepMind's only pursuit. The company's employees are now developing computer speech synthesis systems. As with all other DeepMind projects, a weak (narrow) form of artificial intelligence is involved. According to experts, it could dramatically improve the state of synthesized speech.

Using computers to synthesize speech is not a new idea at all. The simplest solution, concatenative synthesis, uses digitized fragments of a real person's speech: individual sounds that are assembled into more complex phrases, words, and sentences. But this method can hardly be called perfect: anyone immediately notices the problems with pronunciation and intonation.
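
To make the idea concrete, here is a toy sketch of concatenative synthesis in Python. The `units` dictionary of pre-recorded waveforms is a placeholder, not real data; a production system would draw from a large database of diphones cut from studio recordings:

```python
import numpy as np

# Hypothetical library of pre-recorded, digitized speech units
# (in a real system: thousands of diphones from a voice actor).
units = {
    "h-e": np.random.randn(800),   # placeholder waveforms at 16 kHz
    "e-l": np.random.randn(700),
    "l-o": np.random.randn(900),
}

def concatenative_synthesis(unit_sequence):
    """Join pre-recorded unit waveforms with a short crossfade."""
    fade = 80  # samples of overlap to soften the joins
    out = units[unit_sequence[0]].copy()
    for name in unit_sequence[1:]:
        nxt = units[name]
        ramp = np.linspace(0.0, 1.0, fade)
        # Crossfade the seam; audible seams are exactly the
        # pronunciation and intonation artifacts described above.
        out[-fade:] = out[-fade:] * (1 - ramp) + nxt[:fade] * ramp
        out = np.concatenate([out, nxt[fade:]])
    return out

audio = concatenative_synthesis(["h-e", "e-l", "l-o"])
```
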
In other cases, parametric synthesis is used: mathematical models generate the sounds from which words and sentences are then assembled. The problems are roughly the same as in the previous case, and it is immediately obvious that a machine is speaking, not a person.
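A minimal sketch of the parametric idea, assuming a crude source-filter model (a harmonic "buzz" shaped by formant resonances). The constants here are purely illustrative, and the metallic result is exactly the machine-like quality described above:

```python
import numpy as np

SR = 16000  # sample rate, Hz

def synth_vowel(f0=120.0, formants=(700, 1200, 2500), dur=0.3):
    """Very crude parametric vowel: harmonic source + formant peaks."""
    t = np.arange(int(SR * dur)) / SR
    # Source: sum of harmonics of the pitch f0 (the glottal "buzz").
    source = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 40))
    # Filter: emphasize harmonics near the formant frequencies.
    spectrum = np.fft.rfft(source)
    freqs = np.fft.rfftfreq(len(source), 1 / SR)
    gain = sum(np.exp(-((freqs - f) / 120.0) ** 2) for f in formants)
    return np.fft.irfft(spectrum * gain, n=len(source))

audio = synth_vowel()  # vowel-like, but unmistakably machine-made
```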



Both methods are alike in that larger units are assembled from smaller fragments. As a result of this assembly, the computer utters words and whole phrases.

The third method, WaveNet, proposed by DeepMind, combines the advantages of the previous two. The method trains neural networks on fragments of real human voices, along with the rules of linguistics and phonetics that correspond to each individual case. During training, the system is shown a string of text and given the corresponding set of sounds to "listen" to. It then tries to synthesize human speech step by step, learning from each specific fragment, so that each piece of previously "covered material" informs the network's handling of the next task.
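
A minimal sketch of the core mechanism, assuming PyTorch. The real WaveNet adds gated activations, residual and skip connections, and conditioning on linguistic features, so the `TinyWaveNet` below is only a simplified illustration, not DeepMind's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWaveNet(nn.Module):
    """Stack of dilated causal 1-D convolutions predicting the next sample."""
    def __init__(self, channels=32, levels=8, classes=256):
        super().__init__()
        self.inp = nn.Conv1d(classes, channels, kernel_size=1)
        # Doubling dilations (1, 2, 4, ...) grow the receptive field
        # exponentially, so the net sees a long window of past audio.
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(levels)
        )
        self.out = nn.Conv1d(channels, classes, kernel_size=1)

    def forward(self, x):            # x: one-hot audio, (batch, 256, time)
        h = self.inp(x)
        for conv in self.convs:
            # Left-pad so each convolution is causal: a sample may
            # only depend on the past, never on the future.
            pad = conv.dilation[0] * (conv.kernel_size[0] - 1)
            h = h + torch.relu(conv(F.pad(h, (pad, 0))))
        return self.out(h)           # logits over each next sample's value
```

Training pairs each position in the waveform with the sample that follows it; that next-sample prediction, repeated over and over, is the step-by-step learning on "each specific fragment" described above.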

A good analogy for the difference between WaveNet and a conventional speech synthesizer is making a cup. A conventional synthesis system builds the cup out of Lego bricks: the result looks like a cup, but it is an imitation rather than the real thing. WaveNet shapes the cup from clay, by hand and without a potter's wheel, and the result genuinely resembles a cup. So it is with speech: WaveNet synthesizes speech that differs slightly from what we are used to, but not significantly.

The result is impressive: the published audio samples already sound genuinely human. Of course, there are still differences, but they are no longer as significant as with other methods.



The only problem is that this method demands a great deal of machine time and resources: a system capable of generating coherent human speech must be very powerful. To synthesize speech, WaveNet processes 16,000 audio samples for every second of output, and even then the result is of average quality. Nevertheless, in "human or machine" tests the result was about 50%: half of the volunteers who listened to a machine-generated audio sample thought a person was speaking.
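
The cost comes from generation being strictly sequential: every new sample requires a full forward pass conditioned on all the samples before it. A back-of-the-envelope sketch, reusing the hypothetical `TinyWaveNet` from above:

```python
import torch

SR = 16000                                  # audio samples per second
model = TinyWaveNet().eval()                # the sketch defined earlier
window = torch.zeros(1, 256, 1024)          # rolling one-hot history buffer

samples = []
with torch.no_grad():
    for _ in range(SR):                     # one single second of audio...
        logits = model(window)[:, :, -1]    # ...one forward pass per sample
        value = torch.distributions.Categorical(logits=logits).sample()
        one_hot = torch.nn.functional.one_hot(value, 256).float().unsqueeze(-1)
        window = torch.cat([window[:, :, 1:], one_hot], dim=2)
        samples.append(value.item())
# 16,000 sequential forward passes for one second of speech: the reason
# the original WaveNet was far too slow for real-time use.
```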

DeepMind researchers have already loaded more than 44 hours of speech into the system: words, sounds, and phrases belonging to 109 English-speaking participants in the experiment. As it turned out, WaveNet can mimic the speech of almost every participant, reproducing even the breathing and speech defects of the original speaker.
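
Multi-speaker mimicry is achieved by conditioning the network on a speaker identity. A hedged sketch of the idea; the embedding size and the injection point are assumptions for illustration, not the paper's exact layout:

```python
import torch
import torch.nn as nn

speaker_embedding = nn.Embedding(109, 32)    # one learned vector per speaker

def condition(h: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
    """Bias hidden activations (batch, 32, time) toward one voice."""
    e = speaker_embedding(speaker_id)        # (batch, 32)
    return h + e.unsqueeze(-1)               # broadcast over the time axis
```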

Although the system speaks quite well, it is still far from true perfection. Another problem is that weak AI is not yet able to understand language. The greatest success in this direction has been achieved by IBM with its cognitive system IBM Watson, but even there we are so far talking only about recognizing relatively simple spoken and written commands and answering simple questions. Cognitive systems cannot yet hold up a conversation. Nevertheless, the technology is developing, and experts say the situation could change drastically within 5-10 years.

A number of scientists argue that weak AI still lacks specific components of the mind, and that this does not depend on the size of the network itself. "Language is built on other capabilities, probably lying deeper and present in babies even before they begin to speak: visual perception of the world, control of our motor apparatus, an understanding of the physics of the world and of the intentions of other beings," says Tenenbaum.



DeepMind and a team of researchers from Oxford University are currently working on another project: a conditional "red button" for a strong form of AI, which, presumably, could get out of human control once an artificial mind is created.

Source: https://habr.com/ru/post/397327/

