
Artificial intelligence, challenges and risks - through the eyes of an engineer

Good afternoon, colleagues. Today I want to take a sober, engineer's look at artificial intelligence and the deep learning that is so popular right now — to put the facts in order and work out a winning strategy: how do you... take off, fly over, and not land on somebody's head? Because when laboratory models in python/matplotlib/numpy or lua turn into highly loaded production in a client-facing service, where a single error in the source data negates all your efforts, it stops being fun — a kind of medieval numerological ecstasy sets in, and engineers start dancing day and night, hoping to be cured of the newfangled plague ;-)


Dancing engineers vainly hoping to be healed

Modern development


It is useful to start by recalling how modern software development is organized in principle. The foundation, as a rule, is formed by classical algorithms thoroughly studied back in the "last century": searching, sorting, and so on. Everyone's favorite DBMS is a good example of bearded, carefully verified and well-studied classical algorithms (forgive us, Codd and Date, for such a pop treatment).

On top of the algorithms come standards agreed upon by the community of engineers (although quite often agreed upon by nobody at all and pushed through by evil forces): network protocols (DNS, TCP/IP), file formats, operating system services (POSIX), and so on.
And, of course, languages. In recent years, unfortunately, a certain dismal nonsense has started in this area: automatic getters and setters are added and removed, types are automatically inferred and deduced (as if that were terribly important), ideas of functional programming straight out of "Alice in Wonderland" and Haskell are grafted onto Scala, the holes of C++ are partially patched over in Rust, a "C for humanities majors" is created in the form of Go, JavaScript is reinvented from scratch yet again (ECMAScript 6), and people keep believing with all their heart in the actor model and completely different nails with Erlang. All this movement is fueled, not without reason, by the idea of future multi-threaded programming on multi-core processors and GPUs, possibly in clusters, with the help of actors, immutable data, functional high-level languages, and implicit contortions in the form of currying and recursively built binary trees. But no imminent breakthrough is anywhere in sight: logic has collided with physics, and everything has come to rest against the limitations of the human brain, the dullness of compilers at solving more or less interesting problems, and the ability of people — forgive them — to make mistakes and produce tons of bugs without any malicious intent. Edsger Dijkstra's statement rings out with renewed force: serious programming is for smart people, and no miracle technology, however cool, will help here. Although futuristic quantum computers might really help us — one has to believe in something, right?

In general, so as not to get tangled up in languages and not let the code bloat with bugs, there is a well-known remedy: write... "correctly." But to understand what "correctly" means, you need to read a lot of books, shovel through hundreds of thousands of lines of code in the "it must work tomorrow at 10:00 and please the customers" mode, drink many cups of strong coffee, and break more than one keyboard over colleagues' heads. But this rule works. Always.


The world through the eyes of an engineer

As for libraries, nobody, of course, writes everything from scratch nowadays. That would be stupid and expensive. But when using "other people's" libraries there is always the risk that they were written a little "wrong", and it is nearly impossible to influence that at the moment; fuel is added from above: "take the ready-made thing — others took it and it works." So everyone, of course, builds on and uses Linux (an operating system, but in essence the same kind of "library" for accessing the hardware), nginx, apache, mysql, php, and the standard collection libraries (java, C++ STL). Unfortunately, libraries differ greatly from one another — in speed, in the quality of documentation, and in the number of "unfixed bugs" — so success comes to the teams with a certain flair for telling reliable "rat skins" apart from well-promoted but not very useful solutions that are opaque and/or behave inadequately under a slightly non-standard load.

Thus, theoretically — and, with a certain amount of effort, practically — it is possible to create adequate software in a short time with severely limited resources, using mathematically proven algorithms and libraries with a sufficient but not critical degree of bugginess, in a programming language that has proven its "sanity". There are not that many success stories, but they do exist ;-)


There are not many success stories, but they do exist

Machine learning


In this area we are essentially talking about algorithms that learn from data. A business, say, has accumulated a certain pile of big data and wants to monetize it — to extract a useful algorithm that helps clients and/or increases productivity. Solving the problem analytically, head-on, can be extremely difficult or impossible: it takes years of expert work, accounting for many factors, summoning Cthulhu — and it seems you can do it simpler and more "brazenly": drag a deep neural network face-first over the data and keep poking it at them until the error drops below a certain level (remember that two kinds of errors are usually tracked: the training error and the generalization error on a test dataset). It is believed that with a million or more examples a deep net can start working at a level no worse than a human, and with even more examples there is hope of teaching it to surpass a human. With fewer examples the net can still bring some modest benefit, helping — but not replacing — the human.
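
To make the two errors mentioned above concrete, here is a minimal sketch — not from the article — assuming scikit-learn is available and using purely synthetic data: the training error is measured on data the model has seen, the generalization error on a held-out test set.

```python
# A minimal sketch of "training error" vs "generalization error on a test dataset".
# Assumes scikit-learn; the data is synthetic and the model is a tiny illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)                                # 1000 examples, 20 features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)          # toy target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0)
model.fit(X_train, y_train)

train_error = 1.0 - model.score(X_train, y_train)      # error on data the model has seen
test_error = 1.0 - model.score(X_test, y_test)         # generalization error on unseen data
print(f"training error: {train_error:.3f}, generalization error: {test_error:.3f}")
```

A large gap between the two numbers is exactly the overfitting that the rest of the article keeps warning about.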


A useful neural net inside R2-D2

It sounds beautiful and practical: here is the data, the "big data" — grab a neural net, train it, and help humanity. But as you know, the devil is in the details.

The entry threshold: software development vs. machine learning / deep learning


It is no secret that development technologies fall into categories by entry threshold. In the simplest and most accessible technologies there are always far more people, often with an entirely unrelated education, and in such an environment many wrong-headed, short-lived, and warring libraries get created — JavaScript and Node.js fit this niche well. Of course, if you dig deeper, plenty of non-obvious details appear and real Gurus with a capital G emerge — but you can quite well enter this area over a weekend, in shorts and with a butterfly net.


Young front-end JavaScript developers. But to become a mega-guru you will have to study for a long time — one summer will definitely not be enough.

Dynamic programming languages such as PHP, Python, Ruby, and Lua can be assigned to the "medium" entry-threshold category. Here everything is noticeably more complicated: more advanced OOP concepts, partially implemented features of functional programming, sometimes primitive multithreading, more pronounced modularity and tools for building large programs, access to system functions, and partial implementations of standard data types and algorithms. In a week, without straining, you can quite realistically figure things out and start writing useful code even without a specialized education.


Dynamic weakly typed scripting languages have a low entry threshold. Over the summer you can learn to light a fire and catch a fish

The well-established industrial languages such as C++, Java, and C# usually fall into the "high" entry-threshold category, which Scala, bash, and VisualBasic also seem to have begun storming (the last two are just a joke). Here we encounter highly developed tools for managing industrial complexity, high-quality libraries of standard data structures and algorithms, powerful facilities for creating domain-specific dialects, a huge amount of quality documentation, advanced additional libraries for debugging and profiling, and excellent visual development environments. It is easiest to enter and work in this category with a specialized education, or with a great love of programming, several years of intensive experience, and a good knowledge of algorithms and data structures — because the work is often done at a rather low level, and knowledge of the subtleties of the operating system and network protocols often matters here.


Industrial programming languages ​​require exhausting training and often do not forgive mistakes.

Thus, in principle, a novice developer is separated from being a useful engineer by several months of hard, exhausting work — or by years of emotional hanging around software projects and wiping the whiteboard after brainstorms.

But with machine learning things are a bit... different. Analysts are often simply born, not made. The learning process resembles learning a musical instrument: 2-3 years of ear training, 2-3 years of drilling scales until you are blue in the face, 3 years in an orchestra, 5 years of dissecting corpses in the morgue, and tons of sweat. It is simply impossible to teach a person the foundations of mathematics and statistics, mathematical analysis, linear algebra, differential calculus, and probability theory in a matter of months: years are needed, and... not everyone will make it to the next course. Many will drop out along the way to other departments. Being a scientist is not for everyone, no matter how much one might wish it.


Analyst interns

Or maybe it will blow over? Everyone believes this at the start: I'll figure it out over the weekend! But unfortunately, just to understand how the most elementary "logistic regression" in machine learning works — a kind of "hello world" of the field — you need a solid grasp of at least a couple of areas of higher mathematics: probability theory and linear algebra. And to understand the logic of "stochastic gradient descent" you also need differential calculus, at least in its basic form.
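
Here is a minimal from-scratch sketch of that "hello world" — logistic regression trained by stochastic gradient descent. It is an illustration with synthetic data and arbitrary hyperparameters, not code from the article, but it shows where each branch of mathematics sneaks in: linear algebra in the dot products, probability theory in the sigmoid, differential calculus in the gradient of the log-loss.

```python
# Logistic regression trained by SGD, written out by hand for illustration.
import numpy as np

rng = np.random.RandomState(42)
X = rng.randn(500, 3)                                    # linear algebra: a 500x3 feature matrix
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)   # toy labels

w = np.zeros(3)                                          # model parameters
lr = 0.1                                                 # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))                      # probability theory: P(y=1 | x)

for epoch in range(20):
    for i in rng.permutation(len(X)):                    # "stochastic": one random example at a time
        p = sigmoid(X[i] @ w)                            # predicted probability
        grad = (p - y[i]) * X[i]                         # calculus: gradient of the log-loss w.r.t. w
        w -= lr * grad                                   # the gradient descent step

print("learned weights:", w)
```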

Raw semi-laboratory frameworks


High humidity adds to the thrill. The popular deep learning frameworks on the market are so raw that real mold appears on the keyboard by morning.


Popular machine learning frameworks. They are not lost — they are just still very raw.

It is clear why. The parade of "universal" frameworks began only recently, in 2015. Deep learning confidently took off for the third time only in 2006, after decades of uncertainty and stagnation. And GPUs quite suddenly turned up in the right place at the right time only recently.

Unfortunately, TensorFlow is still very slow and strange in production, Torch7 suffers from the lack of proper documentation and from Lua, deeplearning4j is still trying to win the GPU's affection, and with candidates like Theano in python it is unclear how to operate them effectively in production without hard drugs. Yes, there are legends that training a neural network is one thing, running it is quite another, and that completely different people and technologies should handle each — but reality counts money, and that arrangement, you must agree, is terribly inconvenient, expensive, and not very reasonable. The most universal option, aimed at solving concrete business problems in a short time in a "normal" industrial language, so far looks like deeplearning4j alone — but it, too, is in a phase of active growth and maturation, with everything that implies.

How to choose a neural network architecture to solve a business problem?


Scientific publications can only be read if you speak their bird dialects full of blatant hardcore math (and can do so without swearing), so for most engineers the most practical and useful way to explore what the architectures can do is to dig through the source code of the frameworks — most of it, alas, in the "cursed" student python — and study the numerous examples; there are more and more of them in the different frameworks, which cannot fail to please.

So the general recipe is: pick, from the code examples, the architecture that best fits the problem being solved, implement it one-to-one in a framework that is convenient to operate, order a prayer service — and you may get lucky! What does "may get lucky" mean? Very simple. You will run into the following spectrum of engineering risks:


The analyst, the lead developer, and the project manager prepare for the prayer service "On the convergence of the neural net." They say a theorem will soon be proved on the influence of prayers on the behavior of gradient descent under conditions of numerous plateaus, saddle points, and local minima

1) The architecture works well on the researcher's data, but on your data it may behave quite differently — or even the opposite way.
2) Your framework may not have the full set of elementary building blocks: automatic differentiation, an update algorithm with fine-grained settings (updater), extended regularization tools (dropout and friends), the loss function you need, a particular operation on the data (a vector product), and so on. You can substitute analogues, but that brings its own risks.
3) Sometimes, though rarely, you will want to drag Matlab or R into production ;-) Here there is only one piece of advice: go see a doctor immediately.
4) Most likely you will have to adjust the neural network to additional business requirements that appeared later and do not fit into the ideal world of mathematics at all: significantly reduce the level of false positives, increase recall, cut the training time, adapt the model to a much larger dataset, add and account for new information. And here, as a rule, you have to dig very deep into the network's architecture and turn the hidden screws — and turning them at random is a sure path to blown release dates and nightmares. Without a professor you can sit with a screwdriver and a puzzled expression for a month, two, six (a small sketch of one such adjustment follows after this list).
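As promised in item 4, here is a small sketch — an illustration, not the article's code — of one of the mildest such adjustments: trading false positives against recall by moving the decision threshold of an already trained classifier before you start rewiring the architecture itself. The scores here are synthetic; scikit-learn is assumed for the metrics.

```python
# Moving the decision threshold to trade precision (fewer false positives) for recall.
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.RandomState(1)
y_true = rng.randint(0, 2, size=1000)                          # ground-truth labels
scores = np.clip(y_true * 0.3 + rng.rand(1000) * 0.7, 0, 1)    # fake model scores

for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred)    # higher threshold -> fewer false positives
    r = recall_score(y_true, y_pred)       # ...but recall usually drops
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

When moving the threshold is not enough, the screwdriver and the professor come back into the picture.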


What to do, which screw to turn? A C++ developer tries to understand the difference between softmax and softsign

Hello tensors!


For an engineer, a tensor is simply a multidimensional array on which you can perform various operations (and sacrifices). But you do have to get used to working with tensors ;-) For the first few weeks even three-dimensional tensors give you a headache, to say nothing of the much "deeper" tensors used for recurrent and convolutional networks. You can kill a great deal of time on these low-level manipulations, debugging the little numbers inside tensors and hunting for an error in one value out of 40,000. Be sure to account for this risk: tensors only look simple.
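
A small numpy sketch of the shapes an engineer typically ends up juggling. Layout conventions differ between frameworks (NCHW vs NHWC, batch-first vs time-first), so the shapes below are just one common choice, not a universal rule.

```python
# Typical tensor shapes and a typical low-level slip.
import numpy as np

# 4D tensor for a convolutional network: (batch, channels, height, width)
images = np.zeros((32, 3, 224, 224), dtype=np.float32)

# 3D tensor for a recurrent network: (batch, time steps, features)
sequences = np.zeros((32, 100, 300), dtype=np.float32)

# Picking the wrong axes silently gives you a differently shaped tensor
# instead of an error - exactly the "one value out of 40,000" kind of bug.
per_channel_mean = images.mean(axis=(0, 2, 3))   # shape (3,)  - what we wanted
oops = images.mean(axis=(1, 2, 3))               # shape (32,) - per-image mean instead
print(per_channel_mean.shape, oops.shape)
```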


Be careful! Trying to visualize the structure of a tensor with 4 or more dimensions leads to aggressive strabismus

Peculiarities of working with the GPU


It may not be obvious at first, but training the network and getting its answers is usually fastest when all the necessary data (tensors) is loaded into GPU memory. And the memory on these precious devices, so adored by gamers, is limited and usually much smaller than server RAM. So we reach for crutches: intermediate caching of tensors in RAM, partial generation of tensors on the fly while passing over the dataset (it may not all fit), and so on. Take this important engineering risk affecting the effort into account as well. Deadlines can safely be multiplied by 3.
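
A minimal sketch of the "partial generation of tensors while passing over the dataset" crutch — an assumption-laden illustration, not the article's code: instead of loading everything into GPU memory at once, yield small batches and let the framework copy each one to the device. The load_example() helper is hypothetical and stands in for whatever your real data source is.

```python
# Stream the dataset in batch-sized tensors instead of one giant tensor.
import numpy as np

def batch_generator(example_ids, batch_size=64):
    """Yield (inputs, labels) batches small enough to fit into GPU memory."""
    for start in range(0, len(example_ids), batch_size):
        chunk = example_ids[start:start + batch_size]
        xs, ys = zip(*(load_example(i) for i in chunk))   # hypothetical per-example loader
        yield np.stack(xs), np.stack(ys)                  # one batch-sized tensor pair

# for x_batch, y_batch in batch_generator(ids):
#     loss = train_on_batch(x_batch, y_batch)             # framework-specific training call
```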


A video card. It turns out you can do more with it than just play!

Neural network in production


Suppose you got very "lucky": you worked hard, brought the laboratory prototype up to production quality, stood up a web server, loaded the trained network into server memory or straight into GPU memory, and it responds quickly and adequately. But... the data changes, and the model has to change / be retrained along with it. You need to constantly monitor the quality of the network's work, measure its accuracy and a number of other metrics depending on the specific business task, and carefully think through the procedure for updating it and for fine-tuning / retraining. Believe me, there is far more fuss here than with a classical DBMS, which needs to be optimized once every 5 years and have the cobwebs blown off the motherboard once every 10 :-)
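
As a rough sketch of what "constantly monitor the quality" might look like in code — all names here (predict, the accuracy floor) are hypothetical illustrations, not a prescribed API: periodically score the deployed model on fresh labelled samples and raise a flag when accuracy drifts below the level agreed with the business.

```python
# Periodic quality check for a deployed model; triggers the retraining procedure.
import numpy as np

ACCURACY_FLOOR = 0.92   # agreed with the business, not a universal constant

def check_model_quality(model, fresh_examples, fresh_labels):
    """Return True if the deployed model still meets the quality bar."""
    predictions = model.predict(fresh_examples)           # hypothetical model API
    accuracy = float(np.mean(predictions == fresh_labels))
    if accuracy < ACCURACY_FLOOR:
        # In a real service: raise an alert and kick off the retraining pipeline.
        print(f"accuracy dropped to {accuracy:.3f} - schedule retraining")
        return False
    return True
```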

There are also legends that a neural network can simply be... "topped up" with additional training, so that you never need to retrain it on the entire volume of data. In general you cannot count on it — but everyone badly needs it to be true, and sometimes... they get lucky. If the data is small and you need to remember as much of it as possible (without letting the error on the test dataset grow, of course), you cannot simply "top up" the training instead of retraining without the risk of forgetting something important: there is no guarantee that, having memorized something new, stochastic gradient descent (SGD) will not forget the important old stuff ;-) But if there is a lot of data (millions of photos, for example) and there is no requirement to remember each particular example, it will work (though a prayer won't hurt).
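A framework-agnostic sketch of that "top-up training" idea, with hypothetical method names (fit, evaluate) — and of why the prayer is still needed: continue SGD from the current weights on the new data only, then check the old test set to see how much the network has forgotten.

```python
# Fine-tune on new data only, then measure forgetting on the old test set.
def top_up_training(model, new_data, old_test_set, epochs=3):
    """Continue training on new data and report possible forgetting of the old."""
    accuracy_before = model.evaluate(old_test_set)   # hypothetical API
    model.fit(new_data, epochs=epochs)               # warm start: weights are NOT reset
    accuracy_after = model.evaluate(old_test_set)

    forgetting = accuracy_before - accuracy_after
    if forgetting > 0.01:
        # SGD has partially overwritten what the old data taught the network;
        # the safe (and expensive) fallback is retraining on old + new data together.
        print(f"warning: accuracy on old data dropped by {forgetting:.3f}")
    return model
```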

Developer, Tester and Suicide


Not everyone realizes that while in classical programming errors could only arise in your code, in a third-party library, or from a hormonal surge, when training and operating a neural network everything gets more complicated, and here is why:

1) Everything suddenly became bad and inaccurate because the information hidden in the source dataset simply up and changed beyond recognition.
2) There is an error in your neural network architecture, and it only showed itself because the source data changed. Be so kind, put on a helmet and study the gradients and weights of every layer: is the gradient vanishing, is the gradient exploding, are the weights evenly distributed and is regularization needed, are there problems in the output loss function, is the flow of information / gradient being choked off in the saturation regions of sigmoids and other specific activation functions, and so on and so forth — a long and serious headache is guaranteed, and the helmet is just for looks (a small sketch of this kind of layer-by-layer check follows after this list).
3) You got lucky and found a bug in the neural network framework itself... a week before the release.
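
As promised in item 2, here is a sketch of the "helmet-on" layer-by-layer check. The model.layers / layer.gradient / layer.weights attributes are hypothetical stand-ins for whatever your framework exposes after a backward pass, and the numeric thresholds are rough illustrations, not recommendations.

```python
# Layer-by-layer gradient and weight statistics to spot trouble early.
import numpy as np

def inspect_layers(model):
    """Print per-layer gradient and weight statistics after a backward pass."""
    for i, layer in enumerate(model.layers):          # hypothetical framework API
        g = np.asarray(layer.gradient)                # gradients from the last backward pass
        w = np.asarray(layer.weights)
        g_norm = np.linalg.norm(g)
        print(f"layer {i:2d}  grad norm={g_norm:10.4e}  "
              f"weight mean={w.mean():+.4f}  std={w.std():.4f}")

        if g_norm < 1e-7:
            print("  -> suspiciously small gradient: possible vanishing / saturated activations")
        if g_norm > 1e3 or not np.isfinite(g_norm):
            print("  -> exploding gradient: consider clipping or a smaller learning rate")
        if w.std() < 1e-6:
            print("  -> weights barely spread out: check initialization / regularization")
```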

All of the above teaches us that programming client services that use deep machine learning has to be not merely super-reinforced-concrete but thermonuclear: cover all existing (and even not-yet-existing) code with a grid of asserts, tests, and comments, and pour generous amounts of paranoia infused with perfectionism over the whole thing.
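
A small sketch of what that "grid of asserts" can look like in the serving path — cheap sanity checks on what goes into and comes out of the network. The predict() call, the expected shapes, and NUM_CLASSES are illustrative assumptions, not a fixed interface.

```python
# Defensive checks around model inference in a client-facing service.
import numpy as np

NUM_CLASSES = 10   # assumed for the example

def paranoid_predict(model, batch):
    assert batch.ndim == 4, f"expected a 4D image tensor, got shape {batch.shape}"
    assert np.isfinite(batch).all(), "NaN/Inf sneaked into the input tensor"

    probs = model.predict(batch)                          # hypothetical model API
    assert probs.shape == (len(batch), NUM_CLASSES), f"unexpected output shape {probs.shape}"
    assert np.isfinite(probs).all(), "NaN/Inf in the network output"
    assert np.allclose(probs.sum(axis=1), 1.0, atol=1e-3), "softmax output does not sum to 1"
    return probs
```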


Tincture of paranoia, aged 5 years

Conclusions


We have openly and honestly identified the key facts and risks of implementing and operating deep neural networks in high-load services — from an engineer's point of view. Conclusions we have not drawn yet. It is clear that there is a lot of work, that this work is not simple, but it is extremely interesting, and success comes only to professionals who know how to combine knowledge and people from different areas and who have a taste for synergy. Obviously, you need not only to program excellently and feel the system with your fingertips, but also either to understand the mathematics yourself or to actively involve experts in it in such projects, and to create positive, creative conditions for colleagues so that interesting, effective ideas — and concise, quick ways of implementing them — keep coming. Otherwise... you will spend months sitting with a screwdriver in front of a broken-down gravitsapa, turning screws at random, burning through matches, and looking with envy at competitors' solutions glowing in the sky with the power and beauty of their artificial intelligence. I wish you all engineering luck, converging neural networks, confidence, energy, and as few subtle errors as possible! Koo! ;-)

Source: https://habr.com/ru/post/315484/
