We have already covered the Robot Vera project from a business point of view; it's time to look at the service's internals. Today you will learn about the recognition services, semantics, and technologies used in this project. Below the cut is a transcript of the video.
Dmitry Zavalishin is a Russian programmer, author of the Phantom OS concept, organizer and program committee member of the OS Day conference, and founder of the DZ Systems group of companies. In the 1990s and 2000s he took an active part in creating the Russian segments of the Internet (Relcom) and Fidonet, in particular ensuring transparent interoperation between the two networks. From 2000 to 2004 he was responsible for the design and development of the Yandex company portal and created the Yandex.Guru service (later Yandex.Market). You can read more on the wiki.
Vladimir Sveshnikov came to St. Petersburg in 2006 from the Russian Far East and received a law degree from the University of Finance and Economics. In 2009 he organized the First Street company, which placed unskilled workers from the CIS. He later moved into staff outsourcing, and by 2012 he and a friend had two major customers, the Healthy Baby and Dixie store chains. First Street's annual turnover was 30 million rubles, and 50 million in 2013. But soon Vladimir realized that he did not see himself in outsourcing and wanted to build a technology startup.
Interview
Hello! This is DZ Online Technology, and our guest is Vladimir Sveshnikov, co-founder and CEO of Robot Vera, a company that recruits personnel using artificial intelligence. It is probably one of the first startups that actually brings artificial intelligence to people. We have already met to discuss the business side of this area and found out why it is needed and why it is good. Today Vladimir will tell us a bit about how it all works and what problems came up on the way to the ideal... or at least to the current solution. Vladimir, hello!
Yes, hello! I'm happy to tell you how we started. Just a year ago there were only three of us on the team; now more than 50 people work at the company. But when there were three of us, I was fully responsible for the entire technical side. Initially we did simple things: we just started replicating the recruiter's process. We searched for resumes on his behalf, called candidates, sent emails describing the vacancy. And I have a certain technical background...
Although I am a lawyer by training, I later retrained as a programmer. And I realized that these processes are very routine, very monotonous, and can be automated. The first thing we did... I remember the days we searched for resumes. It took about half a day: half a day searching for resumes and half a day making calls. We worked on the sites SuperJob, Jobs, Zarplata.ru. Then I looked at their APIs and realized that what took us half a day could be done in one minute. My partner and I split the work: he spent the whole day searching for resumes, while I did it all in one minute and went to drink tea. He comes over and says, "Why are you drinking tea?" I say, "I've already met my quota."
Met your quota.
And strictly speaking, that was the first impetus. We realized that technology can be used to automate processes that are not automated at all in HR. Then we began actively working on calls: we hired call center operators and automated the calling; we made a button so they could call right from the browser. In general, we made everything as simple as possible, so the operator could sit down, put on headphones, and essentially the only thing he did himself was synthesize and recognize speech. Then we realized that these technologies are already on the market, they show fairly good quality, and we can use them.
So that person was the one synthesizing and recognizing speech? In that sense, you treated him at that point as a component of this communication machine... Up to that moment everything was simple: pull a list of vacancies via the API, filter them by keyword. Although there are subtleties, by and large. But never mind, we'll probably return to them later. At some point you took on voice: synthesizing and recognizing speech. Synthesis is clear enough: there were scripts, more or less fixed, probably adapted per vacancy. But recognition... After all, you started with very simple questions and answers first?
Yes.
Was the constraint that recognition didn't work well?
Yes, definitely. There are a few aspects. First of all, we spent a very long time looking for an approach: how to make people understand they are talking to a robot, how to build the dialogue. At first people are in shock, they don't understand what to say or how to answer (especially when we call the regions). Moscow and St. Petersburg are more or less fine, but in the regions people are genuinely surprised: what robot? (You can hear all sorts of vocabulary at that moment.)
So we made the robot introduce itself and then set a standard communication format. She says: "I can recognize the answers 'yes' or 'no'. Answer my questions with yes or no." Then people begin to understand how to communicate. Because before that they feel a dissonance: a robot? Robots don't exist yet, do they? What is this, a call from the future? So yes, recognition now works well enough to recognize all kinds of words: we now have scripts where people select vacancies and ask questions, and we recognize all of that well. But the "yes" or "no" was there so that people understood how to talk to the robot. That was the main point.
So you could have done more from the start?
Yes.
Or not? Because that's where semantics begins.
Yes, well, we added semantics later. Just a few months ago we added actual answers to candidates' questions. Simply recognizing what a person said, we could do for a long time. We even had a point in the script: if he says no, the vacancy doesn't interest him, then we ask "Why isn't it interesting?". And he answers why.
But is that just a recording? Do you simply store the answer without trying to analyze it?
We recognize it.
You recognize it?
Yes, and we show it in the client's account simply as the answer.
As text?
Yes.
Were there any problems at this stage? What you describe sounds rather trivial. It seems like just about anyone on the planet could now grab a stack of libraries and slap all this together in two days.
There are problems, of course. The main problem, if we talk about the technological aspect, was that we had a rather complicated product. Calls alone, recognition alone, speech synthesis alone: each is a separate story, very big and complicated. For example, we use external speech recognition: Google, Yandex. First, there is no ready benchmark. You need to look at your own tasks: what your texts are like, how well the systems recognize your audio recordings. So the first thing we did was this analysis: we looked at which worked better. Then we realized that even though one of the companies works better and shows better results, its response can at some moments take longer. So we began sending each recording to several speech recognition systems at once, including Microsoft, Google, Amazon, Yandex. That is, we sent to all four at once and took the fastest acceptable response.
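As a rough illustration of that racing pattern, here is a minimal sketch in Python with asyncio. The per-provider recognize_* helpers are hypothetical stubs standing in for real SDK calls, which the interview does not detail:

```python
import asyncio

async def recognize_google(audio: bytes) -> str:
    ...  # hypothetical stub: call Google Speech-to-Text here

async def recognize_yandex(audio: bytes) -> str:
    ...  # hypothetical stub: call Yandex SpeechKit here

async def fastest_transcript(audio: bytes) -> str:
    """Send the same recording to several recognizers, return the first reply."""
    tasks = [
        asyncio.create_task(recognize_google(audio)),
        asyncio.create_task(recognize_yandex(audio)),
    ]
    # Wait until one backend answers, then cancel the slower ones.
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()
```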
Now we use two or three systems at most, at peak times. And the main difficulty was everything on top of that: first the resume search has to start, then... she's a robot, she does everything herself. She searches on her own, then once she has found candidates she calls them herself, and after that she emails those who haven't yet answered "yes".
You don't supervise the robot?
Well, we do watch it; there's a monitoring system. We went through all of that. At first I did everything alone, not quite properly, quickly, in a hurry. I had a single Docker container with the database inside it; I hadn't broken it all into microservices, as is customary and as we have done now. It was one container, one image, with everything running on one virtual machine, and we made a new virtual machine and a new image for each new client. And it often happened that under load, since there was no monitoring system, everything fell over. One of the stories was when a big customer came to us. We had run a pilot with him for two or three days, and then at some point he decided to run calls against his stop lists and uploaded several thousand candidates. Of course, there was a memory leak and everything crashed, and since it was all one container, nothing was persisted. I spent almost the whole night going through the telephony and restoring everything so they wouldn't lose their calls. So yes, there were problems like that, but here, probably, if we talk about...
Well, these are typical problems, really. Quite banal. They aren't particularly connected with hi-tech and recognition; this would probably have happened in any startup, in which... in any at all, because all startups are probably first built by the efforts of one person who is the technological ideologue. What about the recognition quality itself?
With recognition quality there are, of course, problems. They are solved in different ways. For example, if we are recognizing an address, we tell the recognition system that it is an address, and then it gives better quality. If we are recognizing a question, we mark that it is a question. But in general the quality is now quite good, if the audio recordings are normal, there is no extraneous noise, and the person speaks normally, without defects.
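The interview doesn't say which API mechanism they use for this, but as one example, Google Cloud Speech-to-Text exposes phrase hints that bias recognition toward an expected vocabulary. A minimal sketch under that assumption:

```python
from google.cloud import speech

client = speech.SpeechClient()

def recognize_address(audio_bytes: bytes) -> str:
    """Recognize speech while hinting that an address is expected."""
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="ru-RU",
        # Phrase hints bias the model toward address-like vocabulary.
        speech_contexts=[speech.SpeechContext(phrases=["улица", "дом", "проспект"])],
    )
    audio = speech.RecognitionAudio(content=audio_bytes)
    response = client.recognize(config=config, audio=audio)
    return response.results[0].alternatives[0].transcript if response.results else ""
```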
This, by the way, is an interesting point. You know, I've talked with Mos.ru, the Moscow city services. They too are actively working with similar technologies, and their tasks are also quite massive. There is a perfectly mundane task of collecting water meter readings: you can call and read out the numbers, and the robot recognizes them. And they say the opposite: imperfect speech, speech with strong accents, is actually handled quite well by the algorithm, even better than by live operators. You have the reverse situation, if I hear you correctly?
To be honest, we haven't really had the problem of many people with accents. In our country most people speak fairly standard Russian. It probably also depends on how you pose the question. For example, if we ask the person to choose an answer option, say a vacancy, she pronounces it: choose "sales manager", "loader", or "storekeeper".
So you try to narrow the answers down to a small set?
Yes, definitely. We had a case where we collected questionnaires and resumes, asking a person to talk about himself and his experience. And there, of course, there were all sorts of interesting, very funny stories about how people described themselves and how it all got recognized. There is certainly still a fairly large error rate, and of course it's not the case that she recognizes free speech completely.
How do you measure the error? After all, if a person just spoke freely, you can't build an obvious metric comparing what was recognized against the correct text.
Only manually for now. We have dedicated account managers who listen to some of the recordings and then check the answers.
Spot checks by hand? That is, more or less ordinary sample-based quality verification?
Yes. And when we were choosing which recognition system to take as the main one, we took on the order of a little under 1000 audio recordings that we had, recognized the text from each, and checked the recognized text against the recording. A few of us sat and did this.
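They did this check by hand; the standard automated metric for the same comparison is word error rate (WER), the word-level edit distance between a reference transcript and the recognizer's output. A minimal implementation, for illustration:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)

# e.g. one substituted word out of four gives a WER of 0.25
```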
But that's for choosing a system, right? There are N systems, you have a corpus of texts for which the correct answer is known, you run them through and compare; the mechanism is obvious. And, by the way, which of these four turned out to recognize best?
Well, each has its strengths. The market consensus now is that Google is the best. But for us, for example, Microsoft returned results faster. You can look at it in different ways; it's impossible to single out one system to rely on. We always use two or three now.
Is Yandex the outsider? Does it recognize worse and answer slower?
Yandex recognizes addresses very well, probably best of all. If we have an address, we take Yandex without even thinking, because it's the best option.
But that's probably just because they have a good database of addresses?
Yes, yes: Yandex.Navigator, and of course Yandex.Taxi. They have a lot of voice samples of drivers saying addresses out loud, so they have it very well worked out. We don't even try any other system for addresses... well, of course we tried them as part of the general analysis, but Yandex is much better.
Is there a trivial trick where the recognizer's output is run through a grammar analyzer that checks its validity, and that is used as a kind of recognition metric?
Yes. If we talk about what we are doing now... if we measure automatically, there are certainly benchmarks; we look at international ones. Recently Mozilla open-sourced its own speech recognition, which showed accuracy roughly on par with Google's.
Including Russian?
No, it's only in English for now; Russian still has to be trained. But we are looking at the international market anyway, so for us... We now have our first partner in Dubai, and for us that's what matters.
So it's English in Dubai, after all?
Yes, entirely English. All their job sites are in English. There is an Arabic translation, but the English pages get far more traffic.
Returning to the problems. If I understand correctly (I later looked at your articles on Habr about what is going on there), next came semantics.
Yes, I'll tell you more now. The second task we began to solve... we work by solving the problems the business brings us. The business periodically tells us: we need to make the product better, to continue down the path of replacing the recruiter. Accordingly, we need robot Vera to be able to invite a candidate to an interview and fully agree on the time and place. And we started doing that. In principle it's all quite simple: we added an interview invitation script, handled the answers, offered another date if the proposed one didn't suit, added calendar synchronization. A fairly simple task overall, but before we had even finished it, we realized another very important problem: candidates don't receive feedback.
From the robot?
Yes. Calls follow the employer's script. The employer decided to ask three questions; the candidate heard those three questions and answered them, but couldn't ask his own. It turns out to be a one-sided system. It's a B2B business, but at the same time we have a large B2C part: we have done more than half a million interviews, which means half a million people who spoke with the robot and would potentially like to ask their own questions and hear some answers. So we began solving that problem. And we realized that, for example, a simple question about salary can be phrased in different ways; it can't be programmed as a simple hardcoded list of words. The salary question might sound like "what about the money?" or "what is the financial component?", and we get both.
And so we would fail to answer such a question, because we had only hardcoded "income" and "salary". We started looking for options and stumbled upon machine... I have been studying machine learning for a long time, and we have people on the team actively involved in it. We remembered there is such a thing as the Word2vec library, based on neural networks. Google ranks pages with this kind of model: queries to Google are about the same sort of text as our vacancy questions, and Google solves this very problem, deciding which document is better to show higher. Document ranking. Essentially, how does it work? All words are transformed into vectors, represented in a vector space.
How many dimensions?
I won't say exactly right now. But those are tunable parameters; they can be changed, and the quality of the model depends on them. We took the standard Word2vec model and trained it on a corpus of about 150 GB of text (millions of books). It includes the Wikipedia corpus, all the articles of Russian Wikipedia, converted to plain text, and the model learns on that text. How is it trained? It runs through the text and looks at contexts. For example, there is the sentence "I called on the phone" and the sentence "I called on my mobile". Since "I called" is the same context in both cases, "phone" and "mobile"...
...are assumed to be close.
Yes. It pulls them closer together: it places words randomly at first, then adjusts the distances step by step. And so we get a definite mapping of our words into a vector space.
A metric of semantic proximity between words.
Yes. And then we compute the cosine distance, or the Euclidean distance, between the vectors.
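To make the pipeline concrete, here is a minimal sketch using the gensim library (which the interview doesn't name; gensim 4.x is assumed): train Word2vec on a toy corpus standing in for their books-plus-Wikipedia corpus, then compare two phrasings of the salary question by the cosine similarity of their averaged word vectors.

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus standing in for the real books + Wikipedia training data.
sentences = [
    ["what", "about", "the", "money"],
    ["what", "is", "the", "salary"],
    ["what", "is", "the", "financial", "component"],
    ["the", "salary", "is", "paid", "monthly"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=1)

def sentence_vector(words):
    """Average the word vectors; a crude but common sentence embedding."""
    vecs = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vecs, axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

q1 = sentence_vector(["what", "about", "the", "money"])
q2 = sentence_vector(["what", "is", "the", "financial", "component"])
# With a model trained on a large real corpus, both phrasings land close
# together, so they can be routed to the same "salary" answer.
print(cosine_similarity(q1, q2))
```

Averaging word vectors is only one way to turn a question into a single vector; it loses word order, but it is cheap and works reasonably well for short, single-intent questions like these.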