📜 ⬆️ ⬇️

Inclusion of external languages ​​in Haskell programs

This article provides a brief description of the technique, which allows the use of libraries written in other programming languages ​​in Haskell programs. There is no need to either rewrite these libraries in Haskell, or write countless wrappers in C, or write explicit binding. In the resulting program, you can either directly call someone else's code or call functions from someone else's Haskell code. The function code itself can be written in an extended plug-in language, which allows specialists in plug-in languages ​​to work with it, which, unfortunately, are not yet familiar with Haskell.

This task did not arise from scratch, but was required for one corporation in order to use together the code written in R and the models written in Haskell. At the same time, it was desirable to use it in such a way that specialists in R programming did not have to learn Haskell either.

Thus, the challenge was to enable Haskell to include libraries and code in other languages ​​in an efficient manner. By efficiency, in this case, it is understood that there is no overhead for data transformations during transfer between functions in different languages, as far as possible, and as cheap as possible function calls between languages. At the moment, there are two similar libraries inline-r and later inline-c , with an example, which I will explain certain concepts in this technique.


')
In order to implement such a solution, it was necessary to solve the following questions:
  1. how to connect code in other languages
  2. what should be the syntax of inserts in another language; code entry method
  3. how to convert data so that both libraries can work with it

as well as other issues arising with various RTS (runtime system) execution systems, such as several garbage collectors, the use of dynamic typing

Embedding the execution system


To enable simple languages ​​(C, Rust) without RTS, everything is quite simple; in them, the execution environment is a Haskell program. In this case, here only the interface support for calling external functions of FFI, and C interest is suitable for most cases.
However, in the case of a performance system, as in R, the question arises as to how it is launched and communicated. One of the standard approaches to solving this problem is to launch the “interpreter” (the program of the target language) by a separate process and use any of the RPC tools to exchange messages with it and the source program. This method however has several drawbacks:

If the runtime environment does not have an analogue of the eval method, then you will need to write it yourself. May require specialization for a specific task, which will reduce the generality of the solution. In any case, with this method, there are issues with the safety of execution.

To transfer data to the donor, it is necessary to send it over the network (it is possible to reduce the complexity of using zero-copy), serialization and deserialization. An alternative would be to create shared memory transfer mechanisms, which, however, can be quite a challenge in languages ​​where direct memory access is limited or additional difficulties arise, such as using a copying garbage collector.

One of the advantages of this method is the simplicity of the solution, there is also no need for integration of the event loop, which is necessary, for example, for drawing graphics.

These restrictions are usually incompatible with minimum overhead requirements. Therefore, it was often better to embed the Haskell implementation environment. This solution opens up several additional features:


In this solution, technical difficulties are possible, for example, if the language runtime environment imposes additional restrictions, for example, it uses TLS (Thread Local Storage) or is essentially single-threaded. In the first case, you need to use Control.Concurrent.forkOS to create a branch tied to the OS thread (or use only the main thread). Secondly, it is necessary to serialize requests on the language side, for example, this can be done using a fairly simple “pattern”:

 {-# LANGUAGE ExistentialQuantification #-} import Control.Concurrent data Task = forall a . Task (IO a) (MVar (Either SomeException a)) sendTask :: Chan -> IO a -> IO a sendTask chan f = do result <- newEmptyMVar writeChan chan (Task f result) takeMVar result runTasks :: Chan -> IO () runTasks chan = forever $ do Task f result <- readChan chan mask_ $ do r <- try f putMVar result r 

In this approach there may be some variations, for example, to store the channel in the environment of ReaderT Chan or a global variable, if the uniqueness of this channel should be guaranteed at the program level (you should think three times before choosing this solution, but formally it can be justified). Also, in some cases, it is better to transmit to the channel the action being formed by the caller, where he himself will put the result in MVar for a response. But semantically (with accuracy (and) before working with exceptions and completing the execution of this thread) other solutions will be equivalent to this.

Data transfer


When transferring data, the first question that needs to be solved is how to present data from an external language in Haskell, the simplest types that C-types allow it to do are: CChar, CInt, CString, CStringLen, Ptr, and others. Which are C-types to Haskell mappings and are sufficient for representing data from low-level languages. For type-safe work with them, such data can be wrapped in wrapper types (newtypes), thanks to which, on the one hand, the type system allows, and on the other hand, they don’t add any load at runtime.
Since plug-in languages ​​can be dynamic, as in particular R the question arises how to represent their meanings in a language. At the same time, it would be desirable, if possible, to have statically known information about types where it is possible. In this case, the compiler can give more guarantees for the correctness of the code, and it is possible to generate code by type (see the type classes in Haskell). Here we can recall the statement that languages ​​with a dynamic type system are uni-typed languages, i.e. in statics, the entire universe of types represented there is described by a single type. Using this approach, you can present a value in the language as a pointer tagged with a phantom type indicating the type of expression (instead of a pointer, there may be an index in the table or another identifier that uniquely defines the value, depending on how they are represented in the included language)
 data SEXPTYPE -- generated by hsc2hs newtype SEXP a = SEXP (Ptr SEXPTYPE) 

SEXPTYPE is a type of interface exported to C. In order to indicate that, as is the case in dynamic languages, the expression may be of an unknown type, we use the following approach.
To transfer information about types, you can use DataKinds extensions, which allows you to raise the constructors of simple data types to the type level, i.e. create a type, for example `SomeThing 'False`. Next, we list all possible primitive types in the pluggable language, for example, in the case of R, this
 data SEXPTYPE = NIL | Symbol | List | Closure | Int | Real ... deriving (Eq,Ord,Show) 

Now, if we know that an expression has a specific type (for example, it was created in Haskell), then we can explicitly write its type as `SEXP 'List` or` SEXP' Real '. However, when working with a dynamic language, this is not enough, because all return types are not known there and I would like to be able to pass on a truly dynamic type. To do this, we can use the Rank2Types extension, which allows us to write the following type:
 data SomeSEXP = SomeSEXP (forall a . SEXP a) 

This code means that SomeSEXP for any a expression SomeSEXP a correct, i.e. accurately reflects the fact of dynamic value. Now external language methods can safely return SomeSEXP , for converting the type to the known one, you can create a function
 cast :: SomeSEXP -> Maybe (SEXP a) 

which will check the type of the expression, and either return an expression or an error.
I would also like to be able to have the type sum, since in some functions the input can be supplied to various types of variables.
To create these types, you can use the TypeFamilies extension, and create a new type of `In` that checks whether this type is in the list of allowed types:

 infix 1 :∈ -- | The predicate @a :∈ as@ states that @a@ is a member type of the set @as@. type family (a :: SEXPTYPE) :∈ (as :: [SEXPTYPE]) :: Constraint where 'Any :∈ as = () a :∈ (a ': as) = () a :∈ (b ': as) = a :∈ as type In ab = a :∈ b 

This code works in the same way as regular pattern matching at the type level, where () indicates success, and the inability to verify a statement is an error. Consider the example, In 'Int (Env ': Int ': '[]) . First, we find ourselves in the third condition, since Int does not coincide with 'Env , and the type is obtained equal to In 'Int ('Int ': '[]) . His compiler simplifies further and we get to the second expression, from which we derive success, if we checked 'Real , then we would reach In 'Real '[] for which there is no sample and, accordingly, a type error would occur.
The problem with this approach is that the type is not injective, i.e. if we have the expression foo :: In a ['Int,'Real] => a -> SEXP 'Int , then we will not be able to infer type a , which may not be convenient.

In this view, you can already work with functions using the C API and achieve some convenience of writing code. At first it may seem that this approach will prevent the use of the possibility of modern programming languages, such as comparison with the sample and algebraic data structures, but this is not so.
In order to use such structures we can use the following technique. We create an image for the structure (view) in it, we define only a shallow (shallow) representation of the structure, this is done so that there is no need to create the entire structure display in memory, moreover, in general, the compiler completely removes all intermediate structures, allowing convenient to work with external data structures without overhead:
 data HExp :: * -> SEXPTYPE -> * where -- Primitive types. The field names match those of <RInternals.h>. Nil :: HExp R.Nil -- Fields: pname, value, internal. Symbol :: SEXP R.Char -> SEXP a -> SEXP b -> HExp R.Symbol -- Fields: carval, cdrval, tagval. List :: (R.IsPairList b, c :∈ [R.Symbol, R.Nil]) => SEXP a -> SEXP b -> SEXP c -> HExp R.List ... 

And we introduce the function `hexp` creating such an image:
 hexp :: SEXP sa -> HExp sa 

Further, using the ViewPatterns extension, ViewPatterns can be used:
 foo (hexp -> Symbol s) = ... 


To complete the function
 unhexp :: HExp sa -> ms (SEXP a) 

It is important to note that the expected unhexp . hexp = id rule is not fulfilled unhexp . hexp = id unhexp . hexp = id since unhexp always creates a new structure.

This technique allows you to conveniently work with external complex structures. And can be transferred to any external programming language. The final step here is to have the marshaling mapping of Haskell data structures into external language structures, for this you can enter a type class
 class Literal a ty | a -> ty where -- | Internal function for converting a literal to a 'SEXP' value. You -- probably want to be using 'mkSEXP' instead. mkSEXP :: a -> IO (SEXP V ty) fromSEXP :: SEXP s ty -> a 

`| | a -> ty` is a functional relationship between parameters, which says that the type `a` will uniquely determine the type` ty`. This information greatly assists the compiler in inferring types and does not allow for the creation of erroneous instances. Next, for native types in Haskell, we define translation functions.

Writing code in an external language



For writing code, we can create an internal embedded language (e-dsl) (although in a high-level language, the boundaries between the library and the built-in specialized language are very blurred), or a full-featured dsl. We decided to use “DSL”, i.e. actually a slightly modified plugin language. This method looks more convenient, since in this case the code can be written by those who do not know Haskell, but know the embedded language. How it can be organized and how it looks.

To generate Haskell source code, there is a Template Haskell extension that allows you to build an AST, however there is a series of restrictions, plus type checking is very limited, because all the expressions generated are of the same type Q Exp , respectively, the compiler cannot verify part of the contracts and as a result non-valid code can be generated, which will be rejected by the compiler when generating the code.
It is also possible to create quasi quotes QuasiQuotes [generator|some-code |] here the generator is a function that will parse the code some-code and get the result. The result may be Haskell code generated using TemplateHaskell, creating files (for example, external C files), calling find / -delete or in principle any work.

Quasi-quoting is a great candidate for including code, we can create a new parser. Further, when compiling the project, we call R and give it the source code and get back AST. After we process the given AST and generate the code, which at runtime will build the same AST with regard to variable substitution from Haskell.
For example, a simple code:
 let x = [1,2,3] in [r| x_hs + x_hs|] 

Turns into:
 eval $ unhexpIO =<< Lang (installIO “+”) (unHexp $ List (mkSEXP x) (List (mkSEXP x) Nil) 

In fact, the code is slightly more difficult for perception, since the generated code carefully monitors the protection of resources, for more details see the article about GC.

Regions and memory system mapping


The greatest difficulty with such a combination of languages ​​is the interaction with garbage collectors. Especially the problem becomes complicated in the event that the garbage collector can move objects during assembly. Fortunately, neither C (in which the garbage collector is missing) nor objects can be moved to R, which greatly simplifies working with them.
The simplest safe operation mechanism is the creation of a special object that prevents the protected object from being deleted, and its lifetime can be controlled from another language, for Haskell it is:

Foreign.StablePtr.newStablePtr - creating a “protecting” object
Foreign.StablePtr.freeStablePtr - removal of the “protecting” object
extern void hs_free_stable_ptr (HsStablePtr sp); - removal of the “protecting” object from an external language

In R this is:
preserveObject - mark the object as used
releaseObject - mark an object as unused

Then, if you use an object from R in haskell, you can do
 newtype Protected a = Prected (ForeignPtr SEXPPTR) protect (SEXP p) = do preserveObject p newForeignPtr p releaseObject 

This method is good enough, but not applicable to all languages, since the necessary API may not be, and it does not work for objects on the “stack”, and finally, it may have a very high cost, for example, the price of creating and deleting an object in R O(N) , where N number of objects protected in this way . Therefore, the use of such a method as the main one is not the best idea.
However, this is not the only and not the main way to protect objects in R. In addition to this protection method, there is another one based on the “protection stack”, any object can be placed on the stack and N objects can be removed from the stack. these operations have O (1) complexity. This approach to protecting objects is used both in R itself and in functions written in C. And I would really like to display this method in Haskell. And for this it can be realized with the help of static regions, this technique was described in the article by Oleg Kiselev.

I note that due to the characteristics of the garbage collection system in R, this approach differs from the original solution, although they are equivalent in expressiveness

newtype R sma = R {runR :: ReaderT (IORef Int) ma}
deriving (Functor, Applicative, Monad)


In IORef we store the number of objects allocated on a given “stack” segment, so that when we exit, we can free them all. Next, an auxiliary function is introduced:

 runRegion :: (forall s . R s IO a) -> IO a runRegion f = do ref <- newIORef 0 runReaderT (runR f) ref 


and main function:

 region :: (forall s . R (sm) ma) -> ma region f = …  

Now we have to extend the SEXP type SEXP an additional phantom variable s , denoting the region in which the variable was allocated or protected.

Here s plays the role of a region, in addition we introduce 2 regions:
data V is an empty region, indicates that the variable does not belong to any stack segment and is not protected (can be deleted at the next memory allocation)
data G is a global region, indicating that the variable was protected by a mechanism (preserve) and can be used in all regions.
And with the help of the region function we can build hierarchies of structures forall s1 s2 s3 . s1 (s2 s3 V)) forall s1 s2 s3 . s1 (s2 s3 V)) .

The most interesting thing here is that we can introduce a semilattice (lattice) closed relative to the nesting operation of the regions. In this case, we have that:
V < .. < s2 (s1) < s1 < G
In this case, we know that we can safely use items from a “larger” region in a smaller one and can either enter the operation

 loosen :: (s1 < s2) => SEXP s2 a -> SEXP s1 a <source>     ,       : <source language="haskell"> someOperation :: (s < s1, s < s2) => SEXP s1 Int -> SEXP s2 Int -> R s (SEXP s Int) 


It remains only to create relationships in intermediate regions:

 type Ancestor = (<) class Ancestor s s1 instance Ancestor (R sm) (R sm) instance Ancestor parent region => Ancestor parent (R s region) 


Conclusion


The above describes the basic elements of a technique that can be applied to include other languages ​​in Haskell. The technique can be applied to include code in low-level languages, to increase the efficiency of the code (where appropriate and increase it), such as C or Rust . Or languages ​​with a large number of libraries, which can allow working with a convenient programming language, without modifying the work already done, such as python , or specialized languages ​​like maxima , reduce or proprietary counterparts.

I also understand that here all the directions: interaction with external code, creation of regions, code generation were considered very surface and if I need to consider them in more detail, then I can do it in the following articles.

Source: https://habr.com/ru/post/269939/


All Articles