Lately the phrase "machine learning" (ML) has become incredibly fashionable. As with any technology, the enthusiasm outpaces the number of actual products built on it. One can argue about it, but few algorithmic technologies since Google's great innovations 10-15 years ago have produced products that are truly widespread in popular culture. Not that there have been no breakthroughs in machine learning since then, but none so stunning and so squarely built on computational algorithms. Netflix may use clever recommendations, but Netflix would still exist without them. Yet if Brin and Page had not analyzed the graph structure of the web and its hyperlinks for their own purposes, we would not have Google.
Why is that? After all, people have tried. Plenty of startups wanted to bring natural language processing to the masses, but one after another they sank into oblivion once people actually tried to use them. The difficulty of building a good product on machine learning lies not in understanding the underlying theory, but in understanding the domain and the task, deeply enough to see at an intuitive level what will work and what will not. Interesting problems have no off-the-shelf solutions. Our current level in any applied area, say the same natural language processing, is driven more by insights about that area than by new techniques for solving general machine learning problems. Often the difference between a program people use every day and a semester course project is a particular way of looking at the problem and a good model of the solution.
I am not trying to talk you out of building cool products based on machine learning. I am just trying to explain why it is so hard.
Progress in machine learning
Machine learning has come a long way in the last decade. When I entered graduate school, training large-margin linear classifiers (i.e.,
SVMs,
support vector machines) was done with the
SMO algorithm. The algorithm required access to all of the data at once, and training time grew indecently with the size of the training set. Just implementing it on a computer required an understanding of nonlinear programming; picking out the essential constraints and tuning the parameters required black magic. Now we know how to train classifiers of nearly the same quality in linear time,
online, with a relatively simple algorithm (a rough sketch of such an online update appears below). Similar results have appeared in the theory of (probabilistic) graphical models: the
Markov chain Monte Carlo and
variational methods have simplified inference in complex graphical models (MCMC has been used by statisticians for a long time, but its use in large-scale machine learning is relatively recent). It is amusing to compare the best papers at the
Association for Computational Linguistics (ACL) and see how much more sophisticated the machine learning techniques used recently (2011) are compared to those used in 2003.
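To make that contrast concrete, here is a minimal sketch of what such online large-margin training can look like, in the spirit of stochastic subgradient descent on the hinge loss (a Pegasos-style update). The sparse feature representation, the regularization constant lam, and the data format are assumptions made purely for illustration; this is not the specific algorithm or code the text refers to.

```python
import random

def train_online_svm(data, dim, lam=0.01, epochs=5):
    """Illustrative online large-margin training (Pegasos-style hinge-loss SGD).

    data: list of (x, y) pairs; x is a sparse dict {feature_index: value}, y is -1 or +1.
    """
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)                      # decaying learning rate
            margin = y * sum(w[i] * v for i, v in x.items())
            w = [(1.0 - eta * lam) * wi for wi in w]   # L2 shrinkage step
            if margin < 1.0:                           # hinge loss active: subgradient step
                for i, v in x.items():
                    w[i] += eta * y * v
    return w

# Toy usage with made-up data:
# w = train_online_svm([({0: 1.0, 3: 2.0}, +1), ({1: 1.0}, -1)], dim=5)
```

Each example is touched once per pass, so training time grows linearly with the data, which is exactly the practical difference from the batch SMO setting described above.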
Progress in education has also been colossal. Studying at Stanford in the first half of the 2000s, I took Andrew Ng's course in machine learning and Daphne Koller's course in probabilistic graphical models. Both are among the best courses I have taken at Stanford, and back then each was available to about a hundred people a year. Koller's course is perhaps not just the best one at Stanford; it also taught me a great deal about how to teach. Now these courses
are available to everyone on the web.
As someone who does machine learning in practice (natural language processing in particular), I can say that all these advances have made research much easier. Yet the key decisions I make are not about an abstract algorithm, an objective function, or a loss function, but about the set of features specific to a given task. And that skill comes only with experience. So while it is great that a much wider audience is getting a grasp of what machine learning is, this is still not the hardest part of building smart systems.
Off-the-shelf solutions do not work for interesting problems
The real problems you want to solve are much messier than the abstractions offered by machine learning theory. Take, for example,
machine translation. At first glance it looks like a
statistical classification task: you take a sentence in a foreign language and want to predict the sentence in your own language that corresponds to it. Unfortunately, the number of sentences in any natural language is combinatorially huge, so the solver cannot be a "black box". Any good method rests on decomposing the problem into smaller pieces, and the program is then trained to solve those smaller tasks. I would argue that progress on hard problems like machine translation comes from better partitioning and structuring of the search space, not from cleverer algorithms learning within that space.
The quality of machine translation has improved by leaps and bounds over the past ten years. I think this is mostly due to key insights about translation itself, although general algorithmic improvements also played a role. Statistical machine translation in its current form goes back to the remarkable paper "
The Mathematics of Statistical Machine Translation", which introduced the
noisy-channel architecture on which later translators would be built. Roughly speaking, it works like this: for each word there is a set of possible translations into our language (including the empty word, in case there is no equivalent). Think of it as a probabilistic dictionary. The translated words are then rearranged to produce a sentence that sounds natural in our language. Many details are omitted in this explanation: how to handle candidate sentences and their permutations, how to train reordering models from a given language into the target one, and finally how to score the fluency of the result.
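To make the noisy-channel idea slightly more concrete, here is a toy sketch of its decision rule: choose the target sentence e that maximizes P(e) * P(f | e), with a bigram language model standing in for "how natural the sentence sounds" and a word-for-word lexicon standing in for the probabilistic dictionary. The candidate list, the tiny models, and the no-reordering simplification are all assumptions for illustration; a real decoder searches over reorderings and segmentations and is far more involved.

```python
import math

def lm_logprob(sentence, bigram_lp):
    """Bigram language model: log P(e) for a list of tokens."""
    tokens = ["<s>"] + sentence
    return sum(bigram_lp.get((a, b), math.log(1e-6))
               for a, b in zip(tokens, tokens[1:]))

def channel_logprob(src, tgt, lex_lp):
    """Crude word-for-word channel model: log P(f | e), assuming equal length, no reordering."""
    if len(src) != len(tgt):
        return math.log(1e-9)
    return sum(lex_lp.get((s, t), math.log(1e-6)) for s, t in zip(src, tgt))

def best_translation(src, candidates, bigram_lp, lex_lp):
    """Noisy-channel decision rule: argmax over e of log P(e) + log P(f | e)."""
    return max(candidates,
               key=lambda e: lm_logprob(e, bigram_lp) + channel_logprob(src, e, lex_lp))
```

The point of the sketch is only the shape of the model: a translation dictionary plus a fluency model, combined by Bayes' rule.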
A key breakthrough in machine translation came precisely from changing this model. Instead of translating individual words, the new models considered
translations of whole phrases. For example, the single Russian word for "in the evening" (вечером) roughly corresponds to a three-word English phrase. Before phrase-based translation, a word-by-word model could produce only a single word for it (IBM Model 3 allows each word in the target language to generate several words, but the probability of seeing a good translation is still small). It is hard to get a good English sentence this way. Translating by phrases yields smoother, livelier text, closer to the speech of a native speaker. Of course, adding phrase fragments brings complications: it is unclear how to score a phrase pair when we have never seen that segmentation of the sentence, and no one tells us that "in the evening" is a phrase that should match some phrase in another language. What is remarkable is that the difference in translation quality comes not from a clever machine learning technique but from a model tailored to the task. Many people have of course tried fancier learning algorithms, and the improvements were usually modest.
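A rough sketch of what "translating by phrases" means mechanically: a phrase table maps whole source spans to target phrases, so the single word вечером can come out as the three words "in the evening". The tiny table, the greedy left-to-right segmentation, and the lack of reordering below are illustrative simplifications, not how a production system works.

```python
# Toy phrase table: tuples of source tokens -> lists of target tokens (made-up entries,
# except for the "вечером" -> "in the evening" pair taken from the example in the text).
PHRASE_TABLE = {
    ("вечером",): ["in", "the", "evening"],
    ("я", "видел"): ["i", "saw"],
    ("кошку",): ["a", "cat"],
}

def greedy_phrase_translate(src_tokens, table, max_len=3):
    """Greedy, monotone phrase-based translation: longest matching source span first."""
    out, i = [], 0
    while i < len(src_tokens):
        for k in range(min(max_len, len(src_tokens) - i), 0, -1):
            span = tuple(src_tokens[i:i + k])
            if span in table:
                out.extend(table[span])
                i += k
                break
        else:
            out.append(src_tokens[i])  # unknown word: copy it through
            i += 1
    return out

print(greedy_phrase_translate(["вечером", "я", "видел", "кошку"], PHRASE_TABLE))
# -> ['in', 'the', 'evening', 'i', 'saw', 'a', 'cat']
```

Even this toy version shows why the modeling choice matters: the multi-word correspondence is stored directly in the table instead of being forced through one-word-at-a-time translation.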
Franz Och, one of the authors of the phrase-based approach, joined Google and became a key figure in the Translate group. Although the foundations of the service were laid during Franz's time at the
Information Sciences Institute (and earlier, in graduate school), many of the ideas that pushed beyond phrase-based translation came from the engineering work of scaling those ideas to the web. That work produced tremendous results in large-scale language models and other areas of NLP. It is important to note that Och is not only a first-rate researcher but also, by all accounts, an outstanding hacker (in the best sense of the word). It is this rare combination of skills that carried the work all the way from a research project to what
Google Translate is now.
Task definition
But it seems to me that building a good model is not even the whole problem. In machine translation or speech recognition the task is clear and the quality criteria are easy to grasp. Many of the NLP technologies that will power applications over the next decades are far blurrier. What exactly would ideal research look like for modeling feature articles,
conversations, or
characterizing reviews (the third lab in
nlp-class.org)? How do you build a mass-market product on top of that?
Consider the task of
automatic summarization. We would like a product that reviews content and structures it. But for a number of reasons this formulation has to be narrowed to something you can model, structure, and ultimately evaluate. In the summarization literature, for example, the task is usually formulated as selecting a subset of sentences from a collection of documents and ordering them. Is that the task that needs solving? Is it a good way to summarize a text written in long, complex sentences? And even if the text is summarized well, will these Frankenstein sentences read naturally?
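Here is a minimal sketch of that standard formulation: score sentences, pick a subset, keep document order. The frequency-based scoring function is an arbitrary stand-in chosen for illustration; real systems also handle redundancy, ordering across documents, and length budgets.

```python
from collections import Counter

def extractive_summary(sentences, k=3):
    """Toy extractive summarizer: pick the k sentences whose words are most frequent overall."""
    words = [s.lower().split() for s in sentences]
    freq = Counter(w for ws in words for w in ws)

    def score(ws):
        # average document-level frequency of the sentence's distinct words
        return sum(freq[w] for w in set(ws)) / (len(ws) or 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(words[i]), reverse=True)
    chosen = sorted(ranked[:k])            # keep the original document order
    return [sentences[i] for i in chosen]
```

Whether stitching together sentences selected this way actually reads as a summary is exactly the product question the text is raising.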
Or take review analysis. Do people need a black-and-white "good/bad" verdict? Or a fuller picture (say, "the food is great, the atmosphere sucks")? Are customers interested in the opinion of each individual visitor, or in an accurate analysis of the reviews in aggregate?
Usually questions like these are answered by the higher-ups, who then leave the engineers and researchers to implement the answers. The problem is that machine learning severely constrains the classes of tasks that are solvable algorithmically or technically. In my experience, people who understand the approaches to such problems and also deeply understand the problem domain can propose ideas that simply do not occur to specialists without that understanding. Let me draw a rough analogy with architecture. You cannot build a bridge just any way you like: physics and engineering impose hard constraints on the structure, so it makes no sense to let people without knowledge of them design bridges.
To sum up: if you want to build a really cool product with machine learning, you need a team of strong engineers, designers, and researchers covering everything from basic machine learning theory to systems building, domain knowledge, user interaction, and graphic design, ideally with world-class specialists in one of these areas who are well versed in the others. Small, talented teams with this full set of skills navigate the uncertainty of product design and promotion well. For large companies, where R&D and marketing sit in different buildings, it will not work. Cool products will be built by teams where everyone's eyes are lit up, who see the whole context, and who can still fit into the proverbial garage.