
Interview with Professor Igor Baskin, Doctor of Physical and Mathematical Sciences, leading researcher at the Faculty of Physics of Moscow State University.
What is the greatest difficulty for neural networks in learning to establish the relationship between the structure of a substance and its physical and chemical properties?
The greatest difficulty, and the key feature, of using neural networks, like any other machine learning method, to find the relationship between the structure and the properties of chemical compounds is that here they must model real nature, with its extremely complex and sometimes unknown organization, governed by strict laws that are not always applicable in practice.
This is a fundamental difference from the standard problems solved with neural networks, such as image recognition. Indeed, the fact that the digit 8 is drawn as two adjoining circles is not a consequence of any law of nature: it is simply a matter of convention between people. The Romans, after all, decided at one time that it was better to write the same number as VIII. Since such conventions are made for the sake of convenience, their form is chosen so that the natural neural networks in a person's head recognize them easily, at a subconscious level.
Therefore, it seems to me, artificial neural networks, which to some extent imitate certain aspects of information processing in the human head, also handle them easily.
Now for chemical compounds. The fact that aspirin has an anti-inflammatory effect is due to the ability of its molecules (acetylsalicylic acid) to inhibit the cyclooxygenase enzyme, thanks to the complementarity of the spatial shapes of the drug molecule and the enzyme and to a favorable balance of the many forces acting in the system.
The task of predicting the properties of chemical compounds from their structure, unlike, say, the task of image recognition, was never encountered in the course of evolution, and therefore the natural neural networks in our brain cannot solve it with the same subconscious ease.
Indeed, any child (and even some animals) can easily distinguish a cat from a dog in a picture, but even a dozen Nobel laureates, looking at the formula of a chemical compound, are unlikely to immediately guess the full set of its properties.
This is a task of another level of complexity. In solving it, even things that are usually not perceived as complicated at all become a big problem, for example, how to represent the object of analysis to a neural network. In image processing, say, the natural representation is a set of pixel intensities.
But how best to represent the structure of matter for a neural network is a task in its own right, and it has no such simple solutions. The tools used to solve it, molecular descriptors, i.e. special computational procedures that describe the structure of a substance with a set of numbers, have many drawbacks. How best to represent and process information about chemical compounds is the subject of a science that has been developing very intensively in recent years: chemoinformatics.
Without relying on the scientific baggage accumulated in chemoinformatics, any attempt to use neural networks to establish a connection between the structure of a substance and its properties turns into a pure numbers game and does not lead to practically important results. This is probably the main difficulty in using neural networks for this purpose.
What are the 10 main tasks for artificial intelligence in synthetic chemistry?
1. How to synthesize a given chemical compound from the available reagents?
2. How to synthesize a chemical compound with a given activity?
3. How will it look and how to synthesize a combinatorial library of chemical compounds focused on a given type of biological activity?
4. What will be the result of the reaction, if you mix the given chemical compounds under the specified conditions?
5. In what conditions should the given reaction be carried out? How to optimize such conditions (temperature, solvent, catalyst, additives)?
6. What is the likely mechanism for a given reaction?
7. How to increase the yield for a given reaction?
8. How to enumerate the possible chemical reactions?
9. How to evaluate the synthetic accessibility (ease of synthesis) of a given compound?
10. How to predict the kinetic and thermodynamic characteristics of simple reactions and the yields of complex ones?
What kind of task is the enumeration of all possible chemical reactions?
Perhaps it is a combination of two of the tasks listed above: (8) enumerating the possible chemical reactions and (10) predicting their kinetic and thermodynamic characteristics and yields.
What is currently the best way in chemoinformatics to represent the structure of a substance? Some kind of multi-dimensional matrices? How fully do they describe the whole structure? Are there blank spots left to fill?
There is no simple and unambiguous answer to this question. Everything depends on what types of substances are in question, and also under what conditions and in what states of aggregation they are considered. In addition, the choice of a specific representation depends on the purpose for which it is made: for unambiguous identification of the substance, for storage in a database, for building models, or for transferring information between programs.
In chemoinformatics, different representations are used for each of these purposes, as a rule. The simplest case is saturated hydrocarbons: organic compounds consisting only of carbon and hydrogen atoms and containing no multiple bonds. For their representation it is convenient to use graphs in which the vertices correspond to carbon atoms and the edges to the bonds between them. It is interesting to note that it was the task of explaining the existence of different isomers of organic compounds that stimulated the creation and development of the fundamentals of graph theory, and the task of enumerating isomers, that of combinatorial group theory. Both of these branches of discrete mathematics later found very wide application in almost all areas of scientific knowledge.
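To make the graph view concrete, here is a minimal Python sketch (not from the interview; the structures are illustrative). Two isomers of butane, both C4H10, appear as different graphs over the same four carbon vertices, which is exactly why isomer counting became a graph-theory problem:

```python
# Vertices are carbon atoms (hydrogens left implicit), edges are C-C bonds.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}       # unbranched chain C-C-C-C
iso_butane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}     # branched: one central carbon

def degree_sequence(graph):
    """Sorted vertex degrees: a cheap invariant that already separates these two graphs."""
    return sorted(len(neighbors) for neighbors in graph.values())

# Same number of carbons, different connectivity:
assert len(n_butane) == len(iso_butane) == 4
assert degree_sequence(n_butane) == [1, 1, 2, 2]
assert degree_sequence(iso_butane) == [1, 1, 1, 3]
```

In general, deciding whether two such graphs describe the same compound is a graph isomorphism question; the degree sequence is only a quick necessary check.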
The next level of complexity is arbitrary low-molecular-weight organic compounds. These are, for example, the molecules of most drugs, as well as the starting reagents and intermediates for their synthesis. It is also convenient to use graphs to represent them, but this time with labeled vertices and edges. In this case the vertex labels are the symbols of the chemical elements, and the edge labels are the bond orders.
For the internal representation of molecules in a computer's RAM, graph connectivity matrices can be used, but in practice more complex data structures, including tables of atoms and bonds, are often used instead.
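As an illustration of the atom-table and bond-table representation just mentioned, here is a small hypothetical sketch that derives a connectivity matrix for ethanol from such tables (layout and names are illustrative, not any particular toolkit's format):

```python
# Ethanol, CH3-CH2-OH, with hydrogens left implicit as is common.
atoms = ["C", "C", "O"]            # atom table: vertex labels = chemical elements
bonds = [(0, 1, 1), (1, 2, 1)]     # bond table: (atom_i, atom_j, bond order)

# Build the symmetric connectivity matrix; entries hold the bond order.
n = len(atoms)
adjacency = [[0] * n for _ in range(n)]
for i, j, order in bonds:
    adjacency[i][j] = adjacency[j][i] = order

assert adjacency == [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```

The table form scales better than the matrix for real molecules, since the matrix is mostly zeros.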
To organize efficient searching for structures in databases and their comparison with one another, the most popular representations are special bit strings called molecular fingerprints.
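A toy sketch of how fingerprint comparison works. The bit strings here are short and made up (real fingerprints are set by hashing structural fragments into strings of hundreds or thousands of bits), but the Tanimoto coefficient shown, set bits in common divided by set bits in the union, is the similarity measure commonly used with fingerprints:

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    common = bin(fp_a & fp_b).count("1")   # bits set in both
    union = bin(fp_a | fp_b).count("1")    # bits set in either
    return common / union if union else 1.0

fp1 = 0b10110100   # made-up 8-bit fingerprints
fp2 = 0b10100110

# 3 shared bits out of 5 total set bits:
assert abs(tanimoto(fp1, fp2) - 0.6) < 1e-9
```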
To build models linking the structures of compounds with their properties, feature vectors are used as representations; in chemoinformatics they are called molecular descriptors. There is a huge variety (thousands!) of different types of molecular descriptors.
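The descriptor-vector idea can be illustrated with the crudest possible descriptors, simple element and bond-order counts; this sketch and its function name are illustrative, not a real descriptor package:

```python
from collections import Counter

def count_descriptors(atoms, bonds, elements=("C", "N", "O")):
    """Turn an atom/bond table into a fixed-length numeric feature vector:
    counts of C, N, O atoms, then counts of single and double bonds."""
    atom_counts = Counter(atoms)
    single = sum(1 for _, _, order in bonds if order == 1)
    double = sum(1 for _, _, order in bonds if order == 2)
    return [atom_counts[e] for e in elements] + [single, double]

# Acetic acid, CH3-COOH: two carbons, two oxygens, one C=O double bond.
atoms = ["C", "C", "O", "O"]
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]
assert count_descriptors(atoms, bonds) == [2, 0, 2, 2, 1]
```

Real descriptor sets extend this idea to topological indices, fragment counts, physicochemical estimates and so on, but the output is always such a fixed-length vector that a model can consume.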
For exchanging information between programs and for the "external" representation of chemical structures, text strings called SMILES are currently the most popular. The task of representing organic compounds is complicated by such purely chemical phenomena as electrolytic dissociation, mesomerism and tautomerism, as a result of which one organic substance can be described by a whole set of different graphs and can therefore have several representations; for identification purposes a "canonical" representation is usually chosen.
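The notion of a canonical representation can be shown in miniature: encode a small labeled graph under every possible vertex numbering and take the lexicographically smallest encoding as canonical, so that any two numberings of the same molecule map to one string of data. This brute-force sketch is purely illustrative; real canonicalizers (for example, Morgan-style algorithms behind canonical SMILES) avoid the factorial search:

```python
from itertools import permutations

def canonical(atoms, bonds):
    """Lexicographically smallest (atom labels, bond list) encoding
    over all vertex renumberings of a labeled molecular graph."""
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):        # perm[i] = new index of atom i
        new_atoms = [None] * n
        for i in range(n):
            new_atoms[perm[i]] = atoms[i]
        new_bonds = tuple(sorted(
            (min(perm[i], perm[j]), max(perm[i], perm[j]), order)
            for i, j, order in bonds
        ))
        encoding = (tuple(new_atoms), new_bonds)
        if best is None or encoding < best:
            best = encoding
    return best

# Ethanol entered with two different atom numberings yields one canonical form:
a = canonical(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
b = canonical(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
assert a == b
```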
The task is even more complicated when geometric and spatial isomerism (stereoisomerism) must be taken into account, which cannot always be done at the graph level and often requires elements of a hypergraph representation. For modeling purposes it is also necessary to account for the fact that flexible molecules exist in multiple spatial forms, conformers. All of these circumstances must be considered when selecting chemical representations for machine learning.
At the next levels of complexity, for example in the transition to supramolecular complexes, synthetic polymers and solid materials, the task of finding the most adequate representation of the structure of a substance becomes even more difficult, and no satisfactory solution for it has been proposed to date.
The existing approaches in the informatics of polymers and crystals are mainly focused on modeling, and then only for the simplest cases, while attempts to create an informatics of supramolecular chemistry have not yet been made. So here one should speak not of blank spots, but of small studied areas within a large terra incognita.
For those interested in methods of representing chemical compounds on a computer, I would recommend our monograph: T.I. Majidov, I.I. Baskin, I.S. Antipin, A.A. Varnek, "Introduction to Chemoinformatics: Computer Representation of Chemical Structures". Kazan: Kazan Univ., 2013, ISBN 978-5-00019-131-6.
What are the main achievements in synthetic chemistry this year?
Synthetic organic chemistry is almost 200 years old, and the main peak of its development as a fundamental science came in the second half of the last century, when its basic laws were formulated and the real possibility of synthesizing substances of any level of complexity was demonstrated.
Nowadays synthetic chemistry is increasingly spoken of as an already well-established applied discipline whose main task is to find the best ways of obtaining substances with the required properties. As a result, it has long since split into many areas (for example, medicinal chemistry, petrochemistry, catalysis, the chemistry of various types of materials), in each of which there is continuous, steady progress.
Of greatest interest to me is the recent work in the field of robotic chemistry, a new scientific and applied discipline aimed at automating the synthesis of substances with the help of special robots working under computer control.
I would especially like to note the achievements of recent years in creating miniature chemical reactors integrated into computer chips, which make it possible to carry out the synthesis, isolation, analysis and even biological testing of the synthesized substances literally inside a computer, under the control of artificial intelligence.
What are the successes of machine learning in synthetic chemistry? Where do we stand?
I will begin by explaining the historical context. Ever since the term "artificial intelligence" appeared in the fifties of the last century, chemistry (and especially synthetic organic chemistry) has been considered, along with medical diagnostics, one of the main areas of its future application. Most of the other tasks were formulated much later.
At the first stage of its development, the main emphasis was placed on so-called expert systems, based on rules formulated by expert chemists and stored in knowledge bases, together with an inference engine.
The first successful expert system in the field of synthetic chemistry was the LHASA program, developed under the leadership of the Nobel laureate in chemistry Elias Corey by the early 70s of the last century. It can be argued that LHASA was revolutionary for its time, both for synthetic organic chemistry and for artificial intelligence, and that it determined the main directions of development of computer-aided synthetic chemistry for many years to come. It so happened that synthetic chemistry became the field where, already in the 80s, the capabilities of artificial intelligence came very close to, and were almost equal to, those of experienced synthetic chemists. This determined the popularity of synthetic chemistry among artificial intelligence specialists in the 70s and 80s.
Nevertheless, despite the great successes achieved by artificial intelligence in synthetic chemistry, by the 90s the popularity of this area had dropped sharply, practically to zero.
A paradoxical thing happened, which is still discussed among specialists. Although the computer's ability to plan syntheses had come close to that of synthetic chemists, the latter were still needed to carry the syntheses out, and no computer could replace them in that. As a result, such computer programs came to be perceived as expensive "toys" that one can do without and on which money should not be wasted. This coincided with the onset of the "winter" in artificial intelligence, when the principal shortcoming of rule-based expert systems became clear: only a small part of knowledge can be formulated by experts in the form of explicit rules, and therefore the bulk of it, which experts hold only at the level of intuition, remains outside the reach of expert systems.
Much the same thing led to the collapse of the once ambitious Japanese fifth-generation computer project.
The first works on the use of machine learning in synthesis planning appeared in the late 80s and early 90s as attempts to overcome this deficiency of rule-based expert systems by teaching the computer itself, without the help of human experts, to extract knowledge about the reactivity of chemical compounds from the databases of published chemical reactions that were then beginning to take shape.
At first this knowledge took the form of rules intended to supplement the knowledge bases of existing expert systems; later, "fuzzy" rules imitating the intuition of a synthetic chemist began to be extracted, for which neural networks came into use in the early 90s. It must be said that at present the automatic extraction of knowledge about reactivity from databases of published reactions, for subsequent use in expert systems of the new generation, is the central focus of machine learning in synthetic chemistry.
Another important direction now is the use of machine learning to establish the connection between the structure of a substance and its properties, which makes it possible to direct synthesis toward those substances that, according to the constructed models, should possess the desired set of properties.
The first examples of automatic knowledge extraction used databases of dozens of reactions; then came thousands and tens of thousands, and now work is underway with millions and tens of millions of reactions, covering essentially all the reactions carried out worldwide over the 200 years of synthetic chemistry. Quantity turned into quality, and the field entered the era of "big data". Since the early 90s the power of computers has also grown by several orders of magnitude, especially with the advent of GPUs.
In recent years the methodology of "deep learning" has also become available, allowing very complex patterns to be extracted from large amounts of data. All this has led, in the last few years, to an explosion of interest in the use of artificial intelligence in synthetic chemistry. Over the past two years, more important and interesting works have been published than in the previous 20 years combined. Thus the "winter" has ended and, skipping "spring", has been replaced at once by a very hot "summer". Nowadays, given the huge amount of accumulated knowledge, it is becoming very difficult even for a very experienced synthetic chemist to compete with artificial intelligence in planning syntheses.
For those who want to study this question in more detail, I would recommend our just-published monograph: I.I. Baskin, T.I. Majidov, A.A. Varnek, "Introduction to Chemoinformatics. Part 5: Informatics of Chemical Reactions". Kazan: Kazan Univ., 2017, ISBN 978-5-00019-907-7.
How close are we to being able to enumerate the possible chemical reactions? Does science know about 90 million reactions? What is the order of magnitude of the unknown?
One can enumerate only things that are discrete and clearly distinct from one another, for example low-molecular-weight organic compounds, which are described by different graphs. In the case of reactions, the very formulation of the enumeration problem is far from obvious. For example, are the hydrolysis of ethyl acetate and the hydrolysis of methyl acetate different reactions, or two examples of the same reaction?
Is the hydrolysis of ethyl acetate in alkaline and in acidic media two different reactions, or the same reaction carried out under different conditions?
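The equivalence question above can be caricatured in code: two concrete ester hydrolyses collapse to one reaction "template" once the variable alkyl group is abstracted away. Everything here (the reaction strings, the template function) is a made-up illustration of the idea, not a real reaction-encoding scheme:

```python
def to_template(reaction):
    """Replace the concrete alkyl group with the placeholder 'R'.
    Note the replacement order: 'ethyl' is a substring of 'methyl',
    so 'methyl' must be handled first."""
    reactants, products = reaction
    for alkyl in ("methyl", "ethyl", "propyl"):
        reactants = reactants.replace(alkyl, "R")
        products = products.replace(alkyl, "R")
    return (reactants, products)

r1 = ("ethyl acetate + water", "acetic acid + ethyl alcohol")
r2 = ("methyl acetate + water", "acetic acid + methyl alcohol")

# Two instances, one abstract transformation:
assert to_template(r1) == to_template(r2)
```

Real systems do this abstraction over molecular graphs (reaction "templates" or condensed graphs of reaction), but the enumeration problem is exactly the question of which level of abstraction counts as "one reaction".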
[Most of this answer was lost in transcription. The surviving fragments mention the enormous size of chemical space, with an estimate on the order of 10^60 compounds; inverse problems of the form "given a property, find a structure"; lead optimization; and the synthesis-planning program WODCA (W.-D. Ihlenfeldt, J. Gasteiger, Angew. Chem., Int. Ed. Engl., 1995, 40, 993-1007).]
Which scientists and research groups are the leaders in this field? What are they working on?
[The answer was largely lost in transcription; the surviving fragments identify the following groups and programs:]
1. The group of J. Gasteiger (Germany): the programs EROS and WODCA.
2. The group of G. Schneider (Switzerland): the programs DOGS and ALOE.
3. The group of K. Funatsu (Japan): the programs KOSP and SOPHIA.
4. The group of B.A. Grzybowski: the program Chematica.
5. The group of P. Baldi (USA): the programs Reaction Explorer and Reaction Predictor.
6. The groups of W.H. Green and K.F. Jensen (USA).
7. The group of M.P. Waller: a "neurosymbolic" approach.
8. The group of A. Varnek (France).
9. [entry lost in transcription]
[Several questions and answers here were lost in transcription. The surviving fragments mention: the USPEX evolutionary code for crystal structure prediction; unsupervised learning and vector embeddings in the spirit of word2vec and GloVe; LSTM networks applied to SMILES strings; and a brief history of unsupervised learning in chemistry spanning some 30 years, from principal component analysis (PCA) in the 1980s-90s, through Generative Topographic Mapping (GTM) in the 1990s-2000s, to one-class SVM (1-SVM) and Restricted Boltzmann Machines (RBM).]
[The closing questions and answers were lost in transcription; among other things, the fragments mention "electronic notebooks" for recording experimental data.]