Here we will spell some program (VM compiler) on Haskell. At the input of this compiler is given a binary file with instructions of a certain processor, where in these instructions certain calculations are described. At the output of our compiler, the program text is obtained, also on Haskell, which performs the same calculations with great speed. Perhaps this is not a compiler, but a decompiler, I do not know. Comparison of the work of the resulting programs in Haskell / Java is given in the
previous post .
Warning: people with formal mathematical thinking are
forbidden to read ! You already know everything. This post is for people who are interested in something beyond the usual C / C ++ / C # / Java, because in the last 5 years they have been tired of writing the same thing in these wonderful languages.
')
About the initial task of the ICFP contest once again, with technical details: we are given a binary file in which the program (calculating formulas) is written in the machine instructions of a certain processor. There are several formulas at once, so when calculating formulas a special pool is used for return values. There is a pool for input values. And there is a pool for storing the values ​​required during the calculation of the formulas, and for the values ​​that live from iteration to iteration (that is, when the program runs many times, it can use data from the last run). The input pool in terms of the task is called a set of input ports, similarly there are output ports, everything else is simply called “memory”.
Calculations are performed as follows: by adding, subtracting, and other arithmetic operations on data from input ports and memory, a value is obtained that is written to one of the output ports. According to the authors, the binary file from the task is a program for the “black box”, which is able to calculate something for us.
Implementing this black box so that it can calculate at all quickly, and is our first goal in ICFPC 2009, and the rest (calculate the control actions on the satellite so that it can fly correctly in the world described by these formulas) is the main goal of the contest We will not consider this post, and maybe in another.
In this binary file, the original data was intermixed with the code as follows:
Offset (dec)
| Length
| Content
| Type of
|
O
| eight
| data (cell 0) | Double
|
eight
| four
| opcode (cell 0) | UInt32
|
12
| four
| opcode (cell 1) | UInt32
|
sixteen
| eight
| data (cell 1) | Double
|
then the same order is repeated for 2.3 cells, then 4.5, and so on. In the specification it is written: for even cells, data first goes, then opcode, vice versa for odd ones. Why is there data here, why do non-zero data exist at all? Very simply, 1) in opcodes it is impossible to clog a constant, a variable is used (a memory cell) that is not modified by anyone 2) some variables are initialized, and they also lie in the data.
In order to compile such a binary into a beautiful text in any language, you need:
1) read binary into buffer
2) separate data from opcodes
3) build an AST from opcodes (AST in local terms is a disassembled opcode decomposed into a form convenient for manipulations, in the original it means Abstract Syntax Tree, an abstract tree because it contains the essence of the program, not its text or binary representation)
4) to transform AST into another type (higher level)
5) output AST in text form in the required language, and also display the definition of the data with its original meaning.
Next comes the program with comments for people who are far from Haskell, so some points in the explanations are intentionally simplified and even slightly distorted (marked separately), for ease of perception.
Connection of some standard modules:
import System.IO<br>
import Foreign.Marshal.Alloc<br>
import Foreign.Ptr<br>
import Data.Word<br>
import Foreign.Storable<br>
import Debug.Trace<br>
import Data.Bits<br>
import Data.List<br>
import Data.Maybe
Description of type synonyms, for details of the following data definitions: it says that the source address in the command is of type Int, as well as the destination address.
type DestAddr = Int<br>
type SrcAddr = Int
Now (below) comes the AST element, that is, the decoded instruction from the specification. Commands usually contain addresses of input data (or several pieces, for example, an addition command) - the type SrcAddr is mentioned, and one address in which they write the result (DstAddr is mentioned).
In our task, different things are done with AST: several commands are replaced with one that is not in the specification, then all commands are replaced with some kind of tree, whose elements are the same commands, but from a different set (more on that later). It is supposed to divide the sets of commands into different types, but in this case we have a dump here, and only comments help to separate the set from the set (see below).
In Haskell there is the so-called algebraic data type, which is generally everywhere, and it is very convenient for use in our case. We create the type Op (from the word opcode) and assign several constructors to it (they are not constructors from the OOP, but if we think in terms of the OOP, then the empty class Op is declared, and many of the classes derived from it are, and each has its own data).
data Op = <br>
-- . , : <br>
Const Double | <br>
ReadExp SrcAddr | -- read main memory <br>
ReadVarExp SrcAddr | -- read temporary variable <br>
ReadPortExp SrcAddr | -- read port <br>
-- : <br>
SqrtExp Op | SubExp Op Op | AddExp Op Op | MulExp Op Op | DivExp Op Op | -- math <br>
IfExp CmdOp Op Op Op | -- branch: compare 1st op with zero, choose from other ops. <br>
-- ( +) <br>
Add DestAddr SrcAddr SrcAddr | <br>
Sub DestAddr SrcAddr SrcAddr | <br>
Mul DestAddr SrcAddr SrcAddr | <br>
Div DestAddr SrcAddr SrcAddr | <br>
Output SrcAddr SrcAddr | -- copy from memory to out port. <br>
If CmdOp SrcAddr DestAddr SrcAddr SrcAddr | -- compare mem with 0,choose,store <br>
Input DestAddr SrcAddr | -- read src port into memory <br>
Sqrt DestAddr SrcAddr | <br>
Copy DestAddr SrcAddr | <br>
-- , <br>
Cmpz CmdOp SrcAddr | -- first part of "If" <br>
Phi DestAddr SrcAddr SrcAddr | -- second part of "If" <br>
Noop DestAddr <br>
deriving Show
We see that our options are listed through the symbol "|", traditionally meaning "or." Also there is a “deriving Show”, which means that if we want to translate some value of this type to a string, we will get a string in Haskel, approximately close to the original. This construction means literally the following: “compiler, make the specified type belonging to the Show class, and the class method Show, which translates from value to a string, generate it yourself as you can.”
Now that we will have about the tree. Suppose an expression is generated in the resulting code:
o32 = (i2 * 37) + (m75 * (m76 * t4))
It means: put the value of the input port multiplied by 37 plus the product of two memory cells and one time variable in the output port. The expression to the right of the brackets is written in terms of our AST as follows:
AddExp (MulExp (ReadPortExp 2 ) (ConstExp 37 )) ... <br>
... (MulExp (ReadExp 75 ) (MulExp (ReadExp 76 ) (ReadVarExp 4 ))) -- lisp? ;)
If you count the brackets, it should converge 8). What do we have in the leaves of the tree? Constants, input ports, memory, and temporary variables. How do we distinguish one from the other? After all, according to the specification, we only have input ports and memory? Where do constants come from, for example?
Very simple. If nobody writes in our cell, and everyone just reads from it, then this is a constant. It makes sense to remove it from the “memory” and score it with a constant in the resulting code. Similarly, temporary variables are separated from constant variables, as well as variables that should not exist at all, that is, they must be converted directly into expressions with brackets (see a little bit below).
We continue. The following section describes an AST for conditional operations (comparison with zero, from the specification). Comparisons of five pieces, according to the results of the comparison, there is a choice from one or from another memory address.
data CmdOp = Ltz | Lez | Eqz | Gez | Gtz deriving Enum
Of interest is the construction of "deriving Enum". It automatically assigns this type to the class Enum, and the types of this class have a method that allows to translate an instance of the type into an integer, and vice versa. Will be used in the next paragraph.
Now we will specify that the type CmdOp belongs to the class Show. This class is used to translate values ​​of any type to string type, as mentioned above in “deriving Show”. In our case, we do not want the compiler itself to generate the default converting methods, but to implement our own method (function).
instance Show CmdOp where <br>
show cmdOp = [ "<0" , "<=0" , "==0" , ">=0" , ">0" ] !! fromEnum cmdOp
Actually the function is show, it takes the cmdOp parameter, returns a string, one of the list. The expression to the right of the equal sign means: transfer from value to integer (fromEnum, remember deriving Enum), then use this integer as an index (the "!!" function) to select from the list (list in square brackets). Short and clear, but O (N), so what. O (N) here means that to get the fifth element of the list, it will take 5 iterations (because a list is used here, not an array), and for N-th, therefore, N iterations, which is long. There is a simple way to speed things up by writing case or using an array, but I saved the program line 8-)
Next comes the function that translates Op into the language into which we compile. We compile into Haskel, and therefore the following introduction is required:
Our black box at the entrance has input ports, and its previous state, and the output has its own state and output ports. A binary file compiled into Haskell will contain one function, which will have the same data as input and output. Return the output ports will be as part of the structure. Thus, the compiled file will contain:
1) data description:
data VMState = VMState { m13, m55, m77, o11 :: Double } -- persistent mem
2) the main function (discussed in detail below):
nextState :: VMState -> (Double,Double, Double... ) -> VMState<br>
getNextState x (i2,i3,i16000 ... ) = ( 1 )<br>
let t1 = i2 + i3 + m55 x;<br>
t2 = if (i - 16000 > O ) then t1 * (i16000 + 1 ) else O <br>
in VMState { m13 = t1 + t2 * 2 , m55 = 2 * i3 ..... o10 = (i2 + 22 ) * 576000.0 }
3) the function that returns the state of the black box at time 0 - that is, the data that we will feed it iteration after iteration:
initState = VMState { m13 = O , m55 = 123.456 , ... o90 = O }
Consider the points of each line. The first declares the “structure” of the data. Its difference from C / Java structures is that its fields cannot be changed; it is only possible to get a new structure with some (either one or all) changed values ​​of the current structure. It is written this way:
let newStruct = oldStruct { field5 = 7 , field10 = O }
Then, if desired, the old data can be thrown out.
To read one field from a structure, use the syntax:
let myFieldValue = field5 newStruct
Why so strange? Because when a structure is declared, all functions are automatically declared to access its fields. In particular, a function appears implicitly (for example):
field5 :: VMState -> Double
This means that the function receives the value of VMState as an input (in our example, we passed the whole newStruct at the top), and returns a numeric value with a floating point.
Along the way, we understood what the method signatures look like (method name :: type). We look at the main function
nextState :: VMState -> (Double,Double, Double... ) -> VMState
There are no three dots in Haskel, I added them to illustrate that we will declare a tuple of as many doubles as we need. The function takes 2 parameters, the first is VMState, the second consists of several values ​​(in our case, these are the values ​​of all ports), and this function returns the new value of VMState.
The main function of an ideal VM has a syntax similar to the following:
fun arg1 arg2 = <br>
let bind1 = expr1; bind2 = expr2<br>
in exprResult<br>
where bind3 a1 a2 a3 = expr3<br>
bind4 a5 a6 = expr4
That is, we can define some reusable expressions in the "let" section, and then use them freely in the resulting expression (which is in the "in"). We can also define the same expressions in the "where" section, only there they are usually described not by variables, but by functions local (apparently) to the ambient function (fun). Functions can also be described in the let section, but this is less convenient. I will add that from where the values ​​from the “let” section are not visible.
Further, I note that the “if” construction in haskel is an expression. There are no “for” constructs in the language itself, it is in a separate module and is a normal function. As well as "while" and the like.
In the light of the foregoing, we will continue to review the binary file compiler.
The following function translates an AST element (of type Op) into a string representation in the result language (in Haskell). The first parameter is a string containing the name of the parameter with the initial state (because we will generate read operations from it), in the formula (1) it will be “x”. In fact, I could coerce this name because it appears only once.
ast2haskell :: String -> Op -> String<br>
ast2haskell x op = s op where <br>
s (Const d) = show d -- . <br>
s (ReadExp r1) = "m" ++ (show r1) ++ " " ++ x<br> -- generate read access to memory <br>
s (ReadVarExp r1) = "t" ++ (show r1)<br> -- to temporary variable <br>
s (ReadPortExp r1) = "i" ++ (show r1)<br> -- to input port <br>
s (SqrtExp op') = "(sqrt " ++ s op' ++ ")" <br>
s (AddExp op1 op2) = "(" ++ s op1 ++ "+" ++ s op2 ++ ")" <br>
s (SubExp op1 op2) = "(" ++ s op1 ++ "-" ++ s op2 ++ ")" <br>
s (MulExp op1 op2) = "(" ++ s op1 ++ "*" ++ s op2 ++ ")" <br>
s (DivExp op1 (Const x)) = "(" ++ s op1 ++ "/" ++ (show x) ++ ")" <br>
s (DivExp op1 op2) = "(if " ++ s op2 ++ " /= 0 then " ++ s op1 ++ "/" ++ s op2 ++ " else 0)" <br>
s (IfExp cond opc op1 op2) = "(if " ++ s opc ++ show cond ++ " then " ++ s op1 ++ " else " ++ s op2 ++ ")"
What do we have here? The familiar construction fun param = expr where ...., but only immediately comes the call of one of the local functions. Why? Now we will understand. First, we only note that the “s” function (from “string”) has several definitions, one definition for each Op type constructor. As a result, of course, only one function will be called, and immediately a specific object of type Op will be broken into its components, which will be given names:
s :: Op -> String<br>
s (AddExp op1 op2) = "(" ++ s op1 ++ "+" ++ s op2 ++ ")"
This is an implementation of the s function for the Op type constructor named AddExp, which has two parameters: two AST branches that are added. And the function s itself has one argument, so there are brackets around AddExp. The result of this function will be a string glued together from pieces (operation ++ joins two lists; strings are lists too). Here the brackets are pasted on the sides, and between them the string representations of both branches, between which a plus is glued. The string representation of each branch, therefore, is calculated recursively, it is common practice in such problems. And finally, the show function converts a number into a string, and also it is used for CmdOp (comparison condition with zero), which also belongs to the class Show (described above).
Pay attention to the following definition:
s (DivExp op1 (Const x)) = "(" ++ s op1 ++ "/" ++ (show x) ++ ")" <br>
s (DivExp op1 op2) = "(if " ++ s op2 ++ " /= 0 then " ++ s op1 ++ "/" ++ s op2 ++ " else 0)"
It is great because both functions work on DivExp data, but to select the first one, it is necessary that the divisor in this node be constant! For this case, a special, simpler code is generated, because here you do not need to check to 0.
From this it follows that the definition of which function is called does not follow the “type of constructor”, but is much more flexible: the input value in turn tries to “climb through” into each implementation of the function with the required name, and the function into which it is called is called . In this case, DivExp will not crawl into the first function, in which the second parameter is initialized with something other than Const. This is called Pattern Matching (not to be confused with string matching).
Now, about what the “s” function was made for. For convenience only. Compare 2 pieces with it:
ast2haskell x op = s op where <br>
s (SubExp op1 op2) = "(" ++ s op1 ++ "-" ++ s op2 ++ ")" <br>
s (MulExp op1 op2) = "(" ++ s op1 ++ "*" ++ s op2 ++ ")"
and without it:
ast2haskell x (SubExp op1 op2) = "("++ast2haskell x op1++"-"++ast2haskell x op2++")"
ast2haskell x (MulExp op1 op2) = "("++ast2haskell x op1++"*"++ast2haskell x op2++")"
Conclusion: the code with the “s” function is shorter, and you do not need to constantly drag this “x”, passing it on.
Let us now consider a new topic - how decoding of an opcode is performed in an operation, which we consider to be of type Op. Each operation initially lies at some address in a binary file. For most operations, the same address is the destination address for the result of the operation (the result of addition and other operations - everything is put into “memory” at this address, for each operation of its own). That is why we pass into the function a pair of “opcode, its address”, and at the output we have an already initialized operation, in which, in addition to the destination address, additional arguments are initialized, for each operation, its own.
In the opcode specification, they are decoded in a cascade, that is, if the most significant part of the opcode is zero, then the next piece looks like, otherwise it is a ready opcode. The most significant part is four high-order bits, that is, 4 bits upwards, starting from the 28th bit (28, 29, 30 , 31).
To extract bits from Word32, we have defined a function (.%.) Here that can be used as an operation (infix, that is, to stand between two operands, like + or *). Haskel's syntax allows you to define a function with the name of any characters, for example, "++++" This is defined as:
( ++++ ) a b = sqrt(a * a + b * b)
or otherwise:
a ++++ b = sqrt(a * a + b * b)
and is used like this:
let q = 3 ++++ 4 <br>
let p = ( ++++ ) 3 4
Thus, it turns out very convenient. Also, you can make any of the “normal” functions of the two arguments infix, see the example below with 'shiftL' (left shift). For these purposes, the function name is taken in reverse quotes and placed between operands. For infix operators, all priorities are valid, but this is not our topic now, but the function itself (.%.) Will be defined below.
disasm :: (Word32, Int) -> Op<br>
disasm (opcode, addr) = disasm1 (opcode .%. ( 28 , 4 )) -- "" <br> -- <br>
where <br>
r1 = recast $ opcode .%. ( 14 , 14 ) -- <br>
r2 = recast $ opcode .%. ( O , 14 ) nbsp; -- <br>
r1' = -- <br>
disasm1 i = [disasm2 $ opcode .%. ( 24 , 4 ) , Add addr r1 r2,Sub addr r1 r2,Mul addr r1 r2, Div addr r1 r2,Output r1 r2,Phi addr r1 r2] !! recast i
What happens in disasm1? The definition of the list (square brackets) is again selected by index (operation "!!"). The recast function is described below, it converts signed to unsigned, because we have contacted Word32, and not just Int - and Haskell doesn’t really like to interfere with them, this is not C. Well, what about the list in question? In the zero index we are sent to decode the second stage (the operation code is encoded with 4 bits starting from 24 bits up). Further operations are added to add, subtract, and further specifications. Phi operation will be reviewed separately.
disasm2 i = [Noop addr, Cmpz (decodeImm (opcode .%. ( 21 , 3 ))) r1',<br>
Sqrt addr r1',Copy addr r1',Input addr r1'] !! recast i<br>
The second cascade, which is the final one, consists of the same meat. In the specification of r1 for these operations lies in the same place as r2 for the first cascade, therefore there is a synonym for r1 ', for beauty. Also there is a selection from the list by index. In the first (starting from zero) there is a complex expression - the constructor Cmpz, for which the argument is calculated by special decoding from a piece of opcode (type of comparison with zero):
decodeImm i = [Ltz,Lez,Eqz,Gez,Gtz] !! recast i -- enum ( ).
And here are the bit biting operations themselves - live broadcast from C.
mask x n = x .&. (( 1 `shiftL` n) - 1 ) -- = n x <br>
( .%. ) w32 (start,len) = (w32 `shiftR` start) `mask` len -- , <br>
And this function (recast) converts from signed to unsigned and vice versa, depending on the context. It is not very effective, but it works normally in our task. Here the types are not explicitly indicated, only their class is indicated (Integral). Here, Haskel remarkably substitutes the necessary types in each case (Int or Word32) and for each of them there are operations fromInteger / toInteger. Note that the types will be displayed in the compilation process, and not in runtime (which I remind just in case).
recast :: (Integral a, Integral b) => a -> b<br>
recast x = (fromInteger . toInteger) x -- , "fromInteger (toInteger x)",
and you can also write like this:
recast x = fromInteger $ toInteger x -- , <br>
-- , : <br>
-- $ <br>
-- ( ) . <br>
and you can also write like this:
recast = fromInteger . toInteger -- , .
-- , ,
--
The end of part 1 (size limit), the second part will follow .