Modern programs are largely assembled from ready-made bricks: libraries. The amount of unique code and original architectural decisions in any given program is relatively small. The existing libraries are often not of the highest quality, but even the coolest programmer will not rewrite them all.
This fact is reflected in how programming courses are changing. Gerald Sussman, co-author of SICP, the best-known programming course, put it roughly like this:

"Engineering in the mid-90s, and even more so in the 2000s, is very different from the engineering of the 80s. In the 80s good programmers spent a lot of time thinking, and then wrote a little code that worked. The code worked close to the hardware, even in Scheme - everything was transparent at every stage. It was like a resistor: you just look at the color bands to learn the power rating, the tolerance and the resistance, and V = IR is all you need to know. 6.001 was conceived as a course in which engineers learn to build, out of small blocks that they thoroughly understand and using simple techniques, complex constructions that do what they want. But programming is not like that any more. Now you pick through incomprehensible or nonexistent documentation for software that nobody even knows who wrote. You have to prod the libraries to figure out how they work, feed them different inputs and watch how the code reacts. This is a fundamentally different kind of work, and it requires a different course of study."
Building bricks are standardized: a bricklayer usually does not have to pick a brick to suit one particular spot. With libraries it is the opposite: a library meant for PDF processing is of no use for building a distributed computing system. You have to find the right library, then the right function in it, and then figure out how to fit it into your program. Google, like any other search engine oriented toward natural language, is not much help here so far. So let's consider other approaches.
In statically typed languages, what a function does can often be guessed from its type. And vice versa: knowing what a function should do, you can guess its type (signature). For example, a function that extracts an element from a list should have the type
[a] -> a
- here “a” is the (as yet undetermined) type of the element, and “[a]” is the type of a list of elements of type “a”. Type-based search engines exist for some languages and greatly simplify a developer's life: ocamlfind for OCaml, Hoogle and Hayoo for Haskell, Scalex for Scala.
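As a toy illustration of the idea (not a description of how Hoogle or Scalex actually work), a type-based index can be imagined as a map from a normalized signature to the names of matching functions. A sketch in Scala, with made-up signatures and names:

// Toy index from normalized type signatures to function names.
// All entries are illustrative, not taken from any real library.
object TypeSearch {
  val byType: Map[String, List[String]] = Map(
    "[a] -> a"               -> List("head", "last"),
    "[a] -> Int"             -> List("length"),
    "(a -> b) -> [a] -> [b]" -> List("map")
  )

  def search(signature: String): List[String] =
    byType.getOrElse(signature, Nil)
}

// TypeSearch.search("[a] -> a")  // List("head", "last")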
For dynamically typed languages this approach can also work, as long as each function is assigned a type, even one the compiler ignores. Sometimes this can be done automatically - modern systems often know how to infer a type from code (as many compilers of statically typed functional languages do), but for languages with inheritance this is hard to implement (in Scala, for instance, the types of function arguments must be written out explicitly, and they can get very cumbersome). Another option is to collect statistics on how functions are used in real programs - this does not guarantee a correctly described generic type, but it can be useful for search and documentation (not everyone writes documentation before their library starts being used - why not make life easier for yourself :-)).
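A minimal sketch of that statistics idea in Scala, assuming we are free to instrument functions by hand (nothing here comes from a real tool): record the runtime classes of the arguments seen at actual call sites, and report them later for search or documentation.

import scala.collection.mutable

// Toy collector: remembers which argument types each function was called with.
object UsageStats {
  private val observed = mutable.Map.empty[String, mutable.Set[List[String]]]

  def record(name: String, args: Any*): Unit = {
    val signature = args.map(_.getClass.getSimpleName).toList
    observed.getOrElseUpdate(name, mutable.Set.empty) += signature
  }

  def report: Map[String, Set[List[String]]] =
    observed.map { case (name, sigs) => name -> sigs.toSet }.toMap
}

// A hypothetical function with unannotated arguments, instrumented by hand:
def describe(x: Any): String = {
  UsageStats.record("describe", x)
  x.toString
}

// describe(42); describe("hi")
// UsageStats.report  // Map("describe" -> Set(List("Integer"), List("String")))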
Ted Kaehler, a colleague of Alan Kay (one of the authors of the Smalltalk language), offers more radical search techniques. One of them is to write a test for the function you need and run it against every existing function. It sounds scary, but I have personally had to hunt for a method in the Java library by calling everything with a plausible-looking name from the Scala REPL - and whatever can be automated should be automated :-).
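A toy sketch of that search-by-test idea in Scala, with a hand-made candidate list (a real tool would enumerate library functions automatically): run one small test against every candidate and keep the names of those that pass.

import scala.util.Try

// Keep the candidates whose output on `input` equals `expected`;
// Try guards against candidates that throw on this input.
def findByTest[A, B](candidates: Map[String, A => B])(input: A, expected: B): List[String] =
  candidates.collect {
    case (name, f) if Try(f(input)).toOption.contains(expected) => name
  }.toList

val candidates: Map[String, List[Int] => Int] = Map(
  "head" -> (_.head),
  "last" -> (_.last),
  "sum"  -> (_.sum)
)

// Which of them turn List(1, 2, 3) into 3?
// findByTest(candidates)(List(1, 2, 3), 3)  // List("last")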
The second technique is to annotate all library functions in as much detail as possible and search by those annotations. From the annotations a taxonomy of functions and libraries can be built, which would help manage this zoo (the term "taxonomy", very fittingly, was coined by biologists).
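A toy sketch of annotation-based search in Scala, with made-up function names and tags: each function carries a set of keywords, and a query asks for the functions that carry all of the requested ones.

// Made-up annotations: function name -> set of free-form tags.
val annotations: Map[String, Set[String]] = Map(
  "createPDF"   -> Set("pdf", "document", "output"),
  "parseCSV"    -> Set("csv", "parsing", "tables"),
  "renderChart" -> Set("chart", "image", "output")
)

// Functions annotated with every requested tag.
def searchByTags(wanted: Set[String]): Set[String] =
  annotations.collect { case (name, tags) if wanted.subsetOf(tags) => name }.toSet

// searchByTags(Set("output"))  // Set("createPDF", "renderChart")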
R programmers have gone even deeper into the taxonomic/ontological approach (article, presentation). This is not surprising: R is often used to analyze large, complexly structured data - the same settings where ontologies are used.
Everyone knows that an ontology is an explicit specification of a conceptualization. But few know what a specification and a conceptualization actually are. I am not one of those few, so I will just describe how I understand it.
An ontology describes a subject area more or less formally: what objects exist in it and what relations hold between them. There are many languages for ontologies, but the most common is the W3C standard RDF (or, for more complex cases, OWL).
In RDF everything is described by subject-predicate-object "triples": F1.R is_a RFunction - "F1.R" is an R function. To write these triples down, the W3C first tried to impose XML, but came to its senses in time and developed a human-readable syntax.
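For illustration, here is the triple from the example written in that human-readable syntax (Turtle), held in a Scala string; the r: prefix is invented, not a real vocabulary.

// "F1.R is_a RFunction" as a Turtle triple; `a` is Turtle shorthand for rdf:type.
val turtleExample: String =
  """@prefix r: <http://example.org/r#> .
    |<F1.R> a r:RFunction .""".stripMargin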
RDF data is often called a graph, and databases that can handle it are called graph databases. The main language for talking to such databases is SPARQL, a rather successful hybrid of SQL and Prolog. A typical query is a pattern for the fragment of the graph you want to find. In this form it is not hard to formulate quite exotic conditions - for example, "find a function in a library available under GPLv3 whose result can be passed as input to a function named createPDF". This is a bit more involved than the search by type pattern described at the beginning of the article, but much more flexible.
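A hedged sketch of how that example condition might look as a SPARQL pattern; the fn: and lib: vocabulary is invented for illustration (the query is kept in a Scala string, since that is what an engine like the executeSelect method mentioned below would receive).

// "Find a function from a GPLv3 library whose result type matches the
// input type of a function named createPDF" - with a made-up vocabulary.
val query: String =
  """PREFIX fn:  <http://example.org/function#>
    |PREFIX lib: <http://example.org/library#>
    |SELECT ?candidate WHERE {
    |  ?candidate fn:definedIn  ?library ;
    |             fn:returnType ?t .
    |  ?library   lib:license   "GPLv3" .
    |  ?target    fn:name       "createPDF" ;
    |             fn:argType    ?t .
    |}""".stripMargin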
You can also store more detailed information about function arguments than their types can express. For example, that the first argument of the executeSelect method of the org.w3.banana.SparqlEngine class is a string containing a SPARQL SELECT query. Without such an annotation, finding this function among all functions that take a string argument would not be easy. Besides the search engine, this information could be used by a code checker (something like C's lint) and by an IDE for syntax highlighting.
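Such argument-level metadata could itself be stored as triples. A sketch with an invented api: vocabulary - only the class and method names come from the text above:

// Annotating the first argument of SparqlEngine.executeSelect; api: is made up.
val argumentAnnotation: String =
  """@prefix api: <http://example.org/api#> .
    |<java:org.w3.banana.SparqlEngine#executeSelect>
    |    api:arg1Type    "String" ;
    |    api:arg1Meaning "SPARQL SELECT query" .""".stripMargin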
Sometimes such information can even be packed into a type, but that does not help search much. Here is how it is done in OCaml (at the compiler level):
# (fun a -> Printf.sprintf a 1; a) "%d" ;;
- : (int -> string, unit, string) format = <abstr>
Clearly, nobody looking for a function that substitutes an integer into a string template will search for a function of type
(int -> string, unit, string) format -> int -> string
In the best of all possible worlds, a library written in one programming language would be easy to call from any other. In reality this is not so, but one can dream.
Modern languages differ greatly, and the concepts of one may map onto another in non-trivial ways. For example, some languages have the notion of a "type class" - a set of types sharing a common interface (unlike OOP interfaces, these interfaces are not part of the type but live separately from it). For example, the function
max :: Ord a => a -> a -> a
expects two parameters of the same type a, returns a result of that same type, and that type must belong to the class Ord (things that can be compared). That is, it can be called with two integers and return an integer, or with two reals and return a real - but not with one integer and one real.
Type classes are very convenient to use (and useful for search), and in languages that lack them, ways to emulate them have been invented.
They are usually implemented by passing an implicit parameter that carries the interface implementation for the particular type. In languages that allow parameters with complex default values, such as Scala and C++, this is done relatively easily. However, in Haskell the implicit argument is passed first (for optimization reasons), while in Scala and C++ it comes last (because of how the languages are built), and a cross-language search engine would have to take this into account.
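A minimal sketch of that emulation in Scala (the Ord trait here is hand-rolled for illustration; real Scala code would normally use the standard Ordering): the "dictionary" for the concrete type is passed as a trailing implicit parameter.

// A hand-rolled type class: the interface lives outside the types themselves.
trait Ord[A] {
  def lte(x: A, y: A): Boolean
}

object Ord {
  // Instances ("dictionaries") for the types we choose to support.
  implicit val intOrd: Ord[Int]       = (x, y) => x <= y
  implicit val doubleOrd: Ord[Double] = (x, y) => x <= y
}

// The Haskell `max :: Ord a => a -> a -> a`, with the dictionary passed last.
def max[A](x: A, y: A)(implicit ord: Ord[A]): A =
  if (ord.lte(x, y)) y else x

// max(1, 2)      // 2
// max(0.5, 1.5)  // 1.5
// max("a", "b")  // does not compile: no Ord[String] instance in scope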
An API can describe not only the signatures of methods and functions but also the order in which they are called (much like network protocols do). The article mentioned above introduces the predicate couldBeUsedBefore for this. This matters especially when the language allows more or less autonomous entities such as Erlang processes or Scala actors: Erlang supports optional typing of functions and Scala is statically typed, yet neither lets you formally describe the messages sent to processes/actors.
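The calling-order relation could be recorded in the same triple form. A sketch with hypothetical function names - only the couldBeUsedBefore predicate itself comes from the article:

// Hypothetical call-order facts for an imaginary API; api: is a made-up prefix.
val callOrder: String =
  """@prefix api: <http://example.org/api#> .
    |<urn:fn:openConnection> api:couldBeUsedBefore <urn:fn:executeSelect> .
    |<urn:fn:executeSelect>  api:couldBeUsedBefore <urn:fn:closeConnection> .""".stripMargin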
As they say, a DSL sits inside every library and is begging to get out. If we have more or less figured out how to find the sine function in Smalltalk, how to find a loop in Common Lisp is not clear at all.