Does your code speak Russian?

Yes, it is not yet widespread. Natural language processing is not yet sufficiently developed and is not integrated with the development. Also, there are no convenient ways to integrate the code with a search or a virtual assistant (such as Siri). Voice commands imitate GUI paths (click-open-click). The Semantic Web is trying to introduce applications to the meaning, but still cannot reach a wide audience. Behavior-driven development (BDD) relies on DSL (domain-specific language), which is close to natural, but this is still not enough to teach your code to speak.

However, your code can interact with natural language today, though not using existing approaches. Modern technologies do not understand natural language and it is not clear when this will be possible. Therefore, we need a different approach:

which will be adapted to the natural language
which will be easy to train a wide range of people
which can act as a link between natural language and computer entities
who can create an environment in as many areas of software engineering as possible

Adaptation

Imagine that you are trying to understand what "brrh" means in one of the cosmic languages, based on the following facts: (a) "brrh" is an instance of "eyyuya", (b) "brrh" has attributes of "length" and "vvrrh" , he can "move" and "hrght", (c) "brh" is a noun. You can link these facts into a semantic network, you can create a dictionary and an ontology based on them, you can find "brrh" in the search (based on relevance), you can talk about this with a virtual assistant (who answers that "brrh is the same Eyyuya, you know "), you can even break the" brh vvrhnilsya ghrattsem "on the subject-predicate-addition. But none of these methods is capable of this word. Real understanding comes only when words (and the meaning behind them) are explained in terms that we know, i.e. we understand what they look like. “Brhr is the blue sprout of a vegetable-fruit” is an example of such an explanation (and a “vegetable-fruit” is an example of what does not exist, but what we can approximate). And the real problem is that we do not have technologies that work with such definitions.

Relevance is too uncertain and its results are just statistical guesses. The relevance between "planet", "sphere", "star", "dust cloud" and "physics" does not help much to understand what a planet is, if only because these words can refer not only to "planet." Whereas "a planet is an astronomical object orbiting a star" helps because based on similarity.

The disadvantage of processing natural language is that it relies too much on the rules of natural language. These rules are necessary for the text to be grammatically correct, but cannot help understanding the meaning in all situations. "Jupiter rotates in orbit", "Jupiter's orbit", "Jupiter's rotation in orbit" are very close in meaning. The distinction between the static nature of Jupiter and the dynamic nature of the orbit is very conditional and depends on the circumstances or context. What really matters is (a) what “Jupiter” is, (b) what “orbit” is, (c) what is their combination. But the text is usually no information about this.

Any approach that works with dictionaries and ontologies (or classes / objects, etc.) has another drawback. It expresses the relations of the subject area (including similarity), but is weakly related to natural language. And, accordingly, it does not express some characteristics that are inherent in the external world and in knowledge, which are expressed in natural language. Even the advanced approach of the Semantic Web has limitations: (a) although it claims that it can change classes and belonging to them during execution, but this cannot be completely done, because dictionaries and ontologies defined at the design stage are taken as a basis, (b) URIs are necessary for identification, (c) OWL declarations are necessary for meaning, (d) semantic reasoners are necessary for classification and consistency checking.

Therefore, we need an alternative approach that will:

Use balanced similarity (in the form of ambiguous natural language, whose ambiguity will be eliminated as much as possible)
Have a component structure, using the results of ambiguity resolution (explicit boundaries between identifiers, explicit relationships)
Rely on human-friendly natural language (against machine-friendly identifiers like livesIn, countryID, etc.)
Selectively use dictionaries and ontologies
Have unlimited flexibility (the ability to apply statements not only to the things to which they were originally intended, but also to those similar to them), inherent in natural language
Have unlimited extensibility (the ability to change the model to provide a more detailed or more concise description at any time) inherent in natural language
Involve people for classification, consistency, reasoning, etc.

Total :

Adaptation to a natural language does not imply its literal use, but rather a balance between its ambiguity and the accuracy inherent in computer approaches.

Training

What can facilitate the study of this approach? First, the already mentioned use of natural language identifiers. Secondly, the elimination of the ambiguities of natural language. Third, relaxed rules. Fourth, markup text. Why these items?

Natural language identifiers for a native speaker in most cases do not require additional training. Eliminated ambiguities (borders between identifiers, explicit relationships) simply save time for other readers / consumers of information (who do not have to do the same in the process of use). Relaxation of the rules allows users to work with different levels of knowledge. For example, "Jupiter is an instance of the planet" is not so obvious to many users, whereas "Jupiter is a planet" is obvious to most. Markap text is needed, because it can be used transparently (and even be invisible to users, for example: What is the diameter of the planet ?) and allows you to fix the meaning as soon as it is recognized by the creator of the information.

This is significantly different from what is already in programming and other semantic technologies. Identifiers are not always recognizable abbreviations, natural language is not processed or available as esoteric results, the rules are very strict, there is no markup text. Take the Semantic Web for example . URIs can be read by people to some extent. Triplets (subject-predicate-complement) try to imitate natural language, but the result is difficult for a person to read. Heavy standards are full of rules. An alternative, presumably, are Notation3 and Turtle , which seem to be easily readable by humans. But here again we see "human friendly" URIs and titles like dc: title (which may look readable in this example, but there will be dc_fr12: ttl in another). Microformats offer a slightly different approach that can only be used in HTML, and which, after all, is a kind of domain-specific language (DSL). Although DSL is seen as a promising direction, it has its advantages and disadvantages , where the latter can be explained with one phrase: the need to know a new language. In all cases, we see that learning is a very important factor that we simply cannot ignore.

Total :

The ascending, gradual approach with weakened rules allows starting literally from scratch, as is possible in many programming languages. You can start working with natural language words and relationships, which can be limited to literally two or three of the most important. This minimizes the need for complex, expensive and insufficiently reliable (from the point of view of text understanding) processing of natural language. Recognizing identifiers and basic relationships is a simpler task than recognizing nouns, verbs, adjectives, adverbs or triplets (subject-predicate-words) or class / field / method definitions. This can be done and used without additional analysis by both people and algorithms.

Connecting link

What does the getBallVolume (diameter) function do? The classical interpretation is in the ratio of output and input in the form of a description as "Returns the volume of the ball according to diameter". In terms of natural language, this can be expressed as the question "What is the volume of the ball?" or "What is the volume of a ball with a diameter equal to X?" or in the form of a sentence mentioned above. To associate a function with a natural language, we need to associate the input and output of a function with, respectively, the input and output in a natural language. How to do it? The question can be divided into meaningful identifiers and relations: "What is the volume {what} of the ball?", Where (1) the output of the function corresponds to "what" or unknown, (2) the input of the function corresponds to the "volume {what} of the ball "or" the ball {has} volume ", (3) the relations" {is} "and" {what} "/" {is}} "connect the input and output identifiers. Now we can write a test using the library of meaningful.js :

meaningful.register({ func: getBallVolume, question: ' {_} {}  {} ', input: [ { name: '' } ] }); expect(meaningful.query(' {_} {}  {}  {}  { } 2')). toEqual([ 4.1887902047863905 ]);

What's going on here? (1) the getBallVolume function is registered to answer the question "What is the volume of the ball?" with the diameter parameter, (2) the question is "What is the volume of a ball with a diameter of 2?" (which is roughly equivalent to that mentioned in the code), (3) the expected result is checked. How it works? Inside, the incoming question and the question related to the function are compared and if they are similar (ie, their respective components are similar), then the result can be found: (a) "Which { }" is similar to "Which { }" in the second question, (b) “the volume {of which} of the ball” is present in both questions, (c) in the register () function, “diameter” is not included in the question, but is present as an input parameter, so it can be put in accordance with the “diameter” in the second question, (d) “diameter {matters} 2” is used as an input and getBallVolume (2) is called, (e) the result of the function is returned as an answer to natural language question.

A slightly more complicated example (registration of the getBallVolume function is implied here):

 function getPlanet(planetName) { return data[planetName]; //      } meaningful.register({ func: getPlanet, //    -       question: ' {_} {}  {} ', input: [{ name: '', //       func: function(planetName) { return planetName ? planetName.toLowerCase() : undefined; } }], output: function(result) { return result.diameter; } //     JSON- }); //    meaningful.build([ ' { } ', ' {} ' ]); expect(meaningful.query(' {_} {}  {} ')).toEqual([ 1530597322872155.8 ]);

How it works? (a) "Jupiter {is an instance of the planet}, so we can consider the question" What is the volume of Jupiter? " as "What is the volume of the planet?", (b) "the planet {is} ball", so we can treat this question as "What is the volume of the ball?", (c) "diameter {of what} Jupiter" can be extracted from the attribute of the diameter of an object the planet returned from a call to getPlanet ("Jupiter"), (d) is called getBallVolume () with a Jupiter diameter value.

In Java (as well as in other multi-paradigm programming languages) such examples may look even more elegant:

 @Meaning  class BallLike { @Meaning  int diameter; //        @Meaning  //  /    //  "  {_} {}  {} ?" double getVolume(); }

This approach is simpler than the one offered by the virtual assistants program interfaces: Siri , Google Assistant , or Cortana . Only the fact that here we see several types of software interfaces can discourage more than possible advantages from integration with a talking virtual assistant. Of these programmatic interfaces, one technology is most similar to this approach: a structured data stamp , but it is not readable enough.

What the Semantic Web offers is not concise and is similar to XML processing. Queries in the Semantic Web in the form of SPARQL are limited, like any kind of SQL-like language. Natural language issues are more than just the choice of fields / properties. They also affect space-time, cause-effect and some other important aspects of reality and cognition, which require special treatment. Theoretically, we can use the question "What place" instead of "Where", "What date" instead of "When", or "What reason" instead of "Why", but the constant questions with "What / what / what" will not sound very familiar in natural language.

This approach can also be compared with what search engines do. Which can extract the diameter of the planet , but the result can not be reused directly. the volume of the planet is simply not working. Also, we cannot retrieve other even more specific data. This approach can help to get answers to these and many other questions, and also makes the code itself the subject of a search in natural language.

Total :

An unobtrusive approach does not force your data to be compatible with heavy standard or special data structures (like triplets) and special data stores (like tripletstora ). On the contrary, it can be adapted to your data. It is rather a type of interface that can be applied to both new and old (already almost not supported) data. That is, instead of building a Giant Global Data Graph, to which the Semantic Web seems to be eager, this approach suggests creating a "Question and Answer Web", which will be partially discrete (since separate identifiers and relations are discrete) and partially continuous (since both identifiers and relationships can be approximately similar to others).

Environment

As we can see, markup can be transformed both into a natural language (and used by search engines or virtual assistants), as well as into calls to a software interface (which forms a kind of natural language interface). Markup turns text into a component structure, where each element can be replaced with a similar one. Therefore, in most cases, the conversion to natural language can be quite simple or at least simplified: "What is the diameter of {what} the planet?" can directly correspond to the question "What is the diameter of the planet?" and be similar to "What is the diameter of an astronomical object?". As for the natural language interface (NLI), at first the idea of using "What {is} the volume {of what} of Jupiter" instead of jupiter.getVolume () or getBallVolume (jupiter.diameter) seems redundant. But no one says that NLI calls should apply to all lines of code. This can only be applied to those parts of the code that are relevant to high-level design. Moreover, this interface has certain advantages: (a) specification compatible with natural language, (b) we do not work with specific calls to the program interface (which reduces the need for in-depth study of a specific API), (c) the names of classes / methods / functions / parameters are more explicit, (d) comments can be turned into a more meaningful markup that can be reused, etc. Also, NLI can be the easiest way (even compared to the command line) to develop an interface for small applications and devices (for example, on the Internet of Things) or in an environment with many programming languages.

Due to the dual compatibility with natural language and code, the markup can also be used in other areas of software engineering: requirements, tests, configuration, user interface, documentation, etc. For example, the string from the requirements “The application must provide the diameter of the planets” can be labeled and correlated with function calls, tests, or parts of the user interface.

Markap can open another area of research: cause and effect. Modern applications take tentative steps in this direction and usually try to explain the reasons for errors or unexpected behavior. But this practice does not apply to many other areas: for example, which options affect a certain functionality, etc. For this to work even more efficiently, such explanations must be available for reuse. That is, if some kind of functionality is not available, then the application can reveal a chain of cause-effect-alternatives with direct access to those places of the user interface or to options that the user can immediately change.

Although behavior-driven development moves in a similar direction, there are some very important differences. DSL, similar to natural language, may look good, but in the end, it cannot be reused without tools that recognize this DSL. On the other hand, why should we be limited to only tests or design? We interact with natural language and high-level domain architecture at all stages of software engineering.

Total :

This approach is applicable not only to the code, but also to text documents, user interface, functionality, configuration, services and web pages . This opens up new horizons of interaction. Thus, a web page can be used as a kind of function along with algorithmic functions, and, conversely, a function from a program code can be used as a kind of web page in a search.

Conclusion

The greatest challenge for the Semantic Web is to answer the question why the community needs it, as well as RDF, triplets, tripletstora, automatic reasoning (reasoning), etc. Applications have linked data for many years before the Semantic Web. The architects defined the domain using various tools (including ontologies) long before it. It is completely incomprehensible whether the advantages of using heavy standards similar to XML / UML / SQL of migration and training costs can outweigh the benefits. It is possible that this is why the Semantic Web is not in standard libraries of widely used programming languages and there are no plans for inclusion. The Semantic Web positions itself as the “Web of Data”, which allows intelligent agents to cope with heterogeneous information. As implied, people will receive information from the black box, which will reason for them. The standards of the Semantic Web are far from being able for people to read them (as a natural language) and this does not seem to bother anyone. Nor is there any discussion about the applicability and usability of semantics. What for? Intellectual agents will do everything themselves. Just adapt to the Semantic Web. Inspiring, isn't it?

On the contrary, the proposed approach is lightweight and can be implemented using the built-in features of JavaScript and underscore.js (not to mention multi-paradigm languages). The final prototype contains only about 2 thousand lines of code. Lightness leads to simplified syntax analysis, simple data structures, and not very complex reasoning chains. Is this an oversimplification of the Semantic Web? It is also possible that your local database may be an oversimplification of Big Data, but both options have a different scope of applicability.

Can this approach respond to the challenges of the Semantic Web ? It must be remembered that they are caused, first of all, by the potential immensity (vastness), ambiguity (vagueness), uncertainty (uncertainty), inconsistency (inconsistency) and susceptibility to errors (fallibility) of knowledge. Humanity has been living with these challenges for many years, but they do pose a threat to inflexible, limited, highly specialized, categorical algorithms.

Any information is potentially enormous, ambiguous, uncertain, contradictory, error-prone. Take at least that diameter of Jupiter. Its meaning looks simple and definite only as long as we limit ourselves and present it as a “once and for all frozen” truth. But it is not. It can be very difficult if we want to calculate it very precisely, which can involve many subtasks. It can be ambiguous, because varies in time, can be calculated by various methods and with different accuracy. It may be uncertain if we recall all the assumptions, for example, that Jupiter is the perfect ball. The value in the metric system can be controversial (in the absence of means of translation) with other units of length. All that you need to know about the exposure to errors can be found in the search query "Jupiter's diameter": sources provide a dozen different values.

In the real world, we all understand that. We realize that we simply may not have enough resources to deal with each value given these challenges. But it can be really dangerous if we consider a black box, which we absolutely trust. We can not afford this. We need constant feedback and correction (as an option, in the dialogue between the person and the algorithm). We should consider these challenges rather as features or a routine rather than a problem.

Vastness (vastness) encourages curiosity and planning. Too much of a task can always be divided into subtasks that can be delegated.
Ambiguity (vagueness) leads to flexibility. We use a rather ambiguous natural language, where each word is sufficiently accurate only under certain conditions. The world is changing every moment and everything that we “caught in reality” often becomes useless in the very next moment. But ambiguous characteristics, such as "young" can tell even more about a person than his very accurate age. The definition of qualities is always the result of the process of assessment, generalization, certain reasoning. This saves the time that other people need to spend on the same qualifications (although there is a chance that the result is not acceptable for our requirements).
Uncertainty carries with it alternatives and forces us to manage risk. This is perfectly normal when we get several results at the output. Indeed, sometimes a person who can or must make a decision should be engaged in choosing the desired result.
Inconsistency (inconsistency) motivates versatility and versatility. Indeed, consistency, in most cases, stems from the many ways of organizing the surrounding world. This can be done in very different ways and, to eliminate some of them, we must have a complete picture. But just then we may not have it in many situations.
Exposure to errors (fallibility) is constrained by skepticism. Algorithms are far from skeptical (as yet?). But people, too, because we do not have time to recheck all the facts from various sources.

All these points imply one or another degree of human involvement. After all, it is precisely people who can interpret whether a result is sufficiently huge, ambiguous, uncertain, contradictory or erroneous. This may be valid, but may be revised, it may change the direction of the search, it may require manual changes in the data or code (which is the type of reasoning), etc. People know how to handle this information, so it must be supported by them (the community of users and engineers along with numerous libraries, applications and devices). And such support is possible only if the code can speak in natural language. And this is exactly what this approach can make a reality.

This is a translation of the article . Links lead to English-language resources, because similar in Russian do not contain all the necessary information. Also, examples of the code are given in Russian for illustrative purposes, while the working tests still use English.

Source: https://habr.com/ru/post/317872/

All Articles

Does your code speak Russian?

Adaptation

Training

Connecting link

Environment

Conclusion

More articles: