⬆️ ⬇️

How does decompilation in .Net or Java using the example .Net



Today I would like to talk about application decompilation (all applied to the same Java, and to any language with some assumptions and limitations, but since I myself am a .Net developer, the examples will be quite a bit MSILovised :)).



For an introductory, I’ll list the current decompilation tools in the .Net world:



For software decompilation:





And now, I would like to describe how they work (you also wonder how the JetBrains machine works?). To at least understand how difficult it is: write your .Net decompiler assembly back into C # code.

')



To begin with, the full list of articles posted on this cycle in Habré
We make shipped assemblies: we interact between domains without marshalling

Getting a pointer to a .Net object

Manual stream cloning. When Assembler + C # or Java = Love

Changing the code of system assemblies or "leak". Net Framework 5.0

How does decompilation in .Net or Java using the example .Net

Continuing to shred CLR: .Net object pool outside SOH / LOH heaps

Remove objects dump from .Net application memory





To begin with, we will lay down a set of requirements in our decompiler:

  1. Must accept any assembly as input: from CLR 1. * to 4. *
  2. It is obliged to support not only C # output, but also MSIL, VB.NET and in general - that’s enough fantasy and need.
  3. The ability to choose between different versions of the language (for example, C #), while not having duplication in the implementation.




And now that the requirements are defined, let's think about how MSIL works, and how it will help us to quickly decompile the application.



Unlike the language of the processor, which makes for us some difficulties in the process of decompiling (registers, optimizations, the ability to do one action in several ways), in MSIL everything is as simple as possible. If you need to write something to a local variable, then there is only one command for this. Another way to write to a variable value will not work. This property endows the final compiler (JITter) with ease of implementation on the one hand ... And on the other hand gives the implementation simplicity of the decompiling side.



The second property that MSIL possesses is the computation on the stack. There are no registers. And the only memory through which all calculations go is the stack. This does not mean at all that the final processor also calculates everything through the stack. Not. This means that for simplification, this model uses the description of all calculations and calls on MSIL. What does this mean for us? This means that you can add two numbers with only one command, which is one, regardless of the parameters. This command, pulling data for addition from the stack, adds them and saves the result not somewhere, but back to the stack. This is important because for us, as for people writing a decompiler, this will not produce a huge branch of code.



Now we come to the most important thing: how is the process of decompiling.



So that after

Ldc_i4 5 Ldc_i4 4 Add Stloc.1 


Receive

  Sum = 5 + 4; 


The first difficulty that comes to mind: the position of the instructions may be different. That is, for example, for the code to be executed, it is not at all necessary that there will be no other instructions between ldind_i4 and add. For example, the following code is completely valid:

  Ldc_i4 5 Ldc_i4 4 Ldc_i4 10 Stloc.2 // sum2 Add Stloc.1 // sum1 


What should be decompiled, for example, as follows:

  Sum2 = 10 Sum1 = 5 + 4; 


Secondly, the names of variables in the release may be absent. Those. no impurities, the code will be:

  = 10 = 5 + 4; 


Thirdly, what is the most difficult, if-else, while, do-while, switch implementations may differ. This concerns, in particular, lambdas, yields, async / awaits and other language gadgets that are optional and are actually implemented on top of the usual functions of the language. How to take all this into account? In fact, both issues are solved in two ways.



Stack decompiling model



For decompiling, a certain interpreter of the code is written on MSIL, which has its own stack and command interpretation cycle. At each iteration of the loop, the next command not considered is taken:

  1. If this is not a transition instruction, then we look at how many values ​​on the stack are required by the command being examined. Next, we pull out from the stack two computational nodes, which we put there, as the results of the calculations of the previous commands, and create a new node, the branches of which are nodes taken from the stack. For the example above, it will look like this:



    Those. First we have 4 teams in the input. The first two do not take anything at the entrance, but only give - a number. Accordingly, we put them on the stack (ldind_i4 4, ldind_i4 5). Then we take the next command - Add. It takes on input two values ​​from the stack. Therefore, we read two nodes from our stack and, having guarded them as parameters of this command, we save the command itself onto the stack, since the command has a result. And any result is stored on the stack.



    Then the result can be transferred to the method, or to participate in other arithmetic operations, or returned using the ret instruction.



    Accordingly, if the expression would be more complicated:

      Ldc_i4 5 Ldc_i4 4 Ldc_i4 10 Mul Add Stloc.1 // value = 5 + (4 * 10) 




    That process of creating a DOM would look like this:







    Then the final assembly of the tree is carried out:







    Method calls are constructed in the same way. Only in the case of methods, the required number of parameters will be taken from the stack and stored in the class of the method call node. If the method returns a value, the method call node will be folded onto the stack. If not, added to the group of ready-made expressions.



  2. If a transition instruction (conditional or unconditional) is encountered, a grouping node is created that contains the node for calculating the transition condition and the nodes that will be executed if this condition is met and not met. At this stage, we will look at how these templates look “from a bird's eye view”:







Wood assembly



These were all preparatory stages. Further, for modularity, classes are created that recognize any one construct in the tree and translate it into another. For example, if it is an if-else, then the presence of a conditional transition such that the transition is carried forward is matched. Then the node is converted to an if-else node, the code behind the transition is marked as an else (negative if) node, and the code between the condition and the other node is marked as a positive if node. If it matches as a conditional transition with a transition to previous instructions, then it matches as a while loop and the tree is also rebuilt. Accordingly, depending on the purity of the executioners of the matchers, we will get the converted tree for a specific programming language. Further, for each of the programming languages, we ask a lot of matchmen that suit him. For example, cycles and conditions will suit everyone, because they will be present in almost all packages. And here, for example, async / await - it is only for C #. Therefore, there will be only in his package.



For clarity of the picture, how are collected if-else and while / do-while, consider examples:



Assembly IF-ELSE block







Assembly WHILE block







Code generation



The last stage of the match is code generation for the tree. There should not be any difficulties. Ideally, of course, it would be cool to suck in rules from R # or StyleCop. Fortunately, they are in XML. But in the simplest case, we write a generator that accepts a class description tree as input. He is first obliged to check the entire tree: does it contain unsupported node types. If everything is all right, then the whole tree is bypassed and for each node the corresponding method is called using the Visitor design pattern, to which the StringBuilder and the corresponding node are passed. Additionally, it is necessary to count the number of spaces that must be retreated from the beginning of each line. At this stage, everything is quite simple.



Variable name generation





Before generating code, it is necessary to generate the names of all variables that were rubbed during the compilation process. Matching algorithms are also written for this. To generate variable names are:



For the implementation of these and many other algorithms, matchers are used, which are similar to the matcher, rebuilding the tree for a specific programming language.



What's next?



Next, I will analyze specific decompilers. dotPeek dotPeek, Cecil and your own. But it is - a little later =)

Source: https://habr.com/ru/post/244095/



All Articles