\documentclass[12pt,twoside,notitlepage]{report} |
|
|
|
\usepackage{a4} |
|
\usepackage{verbatim} |
|
\usepackage{graphicx} |
|
\usepackage{listings} |
|
|
|
\raggedbottom % try to avoid widows and orphans |
|
\setlength{\topskip}{1\topskip plus 7\baselineskip} |
|
\sloppy |
|
\clubpenalty1000% |
|
\widowpenalty1000% |
|
|
|
\addtolength{\oddsidemargin}{6mm} % adjust margins |
|
\addtolength{\evensidemargin}{-8mm} |
|
|
|
\renewcommand{\baselinestretch}{1.1} % adjust line spacing to make |
|
% more readable |
|
|
|
\setlength{\parskip}{1.3ex plus 0.2ex minus 0.2ex} |
|
\setlength{\parindent}{0pt} |
|
|
|
\setcounter{secnumdepth}{5} |
|
\setcounter{tocdepth}{5} |
|
|
|
\lstset{ |
|
basicstyle=\small, |
|
language={[Sharp]C}, |
|
tabsize=4, |
|
numbers=left, |
|
frame=none, |
|
frameround=tfft |
|
} |
|
|
|
|
|
\begin{document} |
|
|
|
|
|
\bibliographystyle{plain} |
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
|
% Title |
|
|
|
|
|
\pagestyle{empty} |
|
|
|
\hfill{\LARGE \bf David Srbeck\'y} |
|
|
|
\vspace*{60mm} |
|
\begin{center} |
|
\Huge |
|
{\bf .NET Decompiler} \\ |
|
\vspace*{5mm} |
|
Part II of the Computer Science Tripos \\ |
|
\vspace*{5mm} |
|
Jesus College \\ |
|
\vspace*{5mm} |
|
May 16, 2008 |
|
\end{center} |
|
|
|
\cleardoublepage |
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
|
% Proforma, table of contents and list of figures |
|
|
|
\setcounter{page}{1} |
|
\pagenumbering{roman} |
|
\pagestyle{plain} |
|
|
|
\section*{Proforma} |
|
|
|
{\large |
|
\begin{tabular}{ll} |
|
Name: & \bf David Srbeck\'y \\ |
|
College: & \bf Jesus College \\ |
|
Project Title: & \bf .NET Decompiler \\ |
|
Examination: & \bf Part II of the Computer Science Tripos, 2008 \\ |
|
Word Count: & \bf 11949\footnotemark[1] \\ |
|
Project Originator: & David Srbeck\'y \\ |
|
Supervisor: & Alan Mycroft \\ |
|
\end{tabular} |
|
} |
|
\footnotetext[1]{This word count was computed |
|
by {\tt detex diss.tex | tr -cd '0-9A-Za-z $\tt\backslash$n' | wc -w} |
|
} |
|
\stepcounter{footnote} |
|
|
|
\subsection*{Original Aim of the Project} |
|
|
|
The aim of this project was to create a .NET decompiler. |
|
That is, to create a program that translates .NET executables |
|
back to C\# source code. |
|
%This is essentially the opposite of a compiler. |
|
|
|
The concrete requirement was to decompile a |
|
quick-sort algorithm. Given an implementation of a quick-sort |
|
algorithm in executable form, the decompiler was required |
|
to re-create semantically equivalent C\# source code. |
|
Although allowed to be rather cryptic, the generated source code |
|
had to compile and work without any additional modifications. |
|
|
|
|
|
\subsection*{Work Completed} |
|
|
|
The implementation of the quick-sort algorithm decompiles successfully.
|
Many optimizations were implemented and perfected to the |
|
point where the generated source code for the quick-sort |
|
algorithm is practically identical to the original% |
|
\footnote{See pages \pageref{Original quick-sort} and \pageref{Decompiled quick-sort} in the evaluation chapter.}. |
|
In particular, the source code compiles back to identical .NET CLR code. |
|
|
|
Further challenges were sought in more complex programs. |
|
The decompiler deals with several issues which did not |
|
occur in the quick-sort algorithm. It also implements some |
|
optimizations that are only applicable to the more complex programs. |
|
|
|
|
|
\subsection*{Special Difficulties} |
|
|
|
None. |
|
|
|
\newpage |
|
\section*{Declaration} |
|
|
|
I, David Srbeck\'y of Jesus College, being a candidate for Part II of the Computer |
|
Science Tripos, hereby declare |
|
that this dissertation and the work described in it are my own work, |
|
unaided except as may be specified below, and that the dissertation |
|
does not contain material that has already been used to any substantial |
|
extent for a comparable purpose. |
|
|
|
\bigskip |
|
\leftline{Signed} |
|
|
|
\medskip |
|
\leftline{Date} |
|
|
|
\cleardoublepage |
|
|
|
\tableofcontents |
|
|
|
\listoffigures |
|
|
|
|
|
\cleardoublepage % just to make sure before the page numbering |
|
% is changed |
|
|
|
\setcounter{page}{1} |
|
\pagenumbering{arabic} |
|
\pagestyle{headings} |
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
|
% 3 Pages |
|
\chapter{Introduction} |
|
|
|
\section{The goal of the project} |
|
% Decompiler |
|
% .NET |
|
|
|
The goal of this project is to create a .NET decompiler. |
|
|
|
A decompiler is a tool that translates machine code back to source
code. That is, it does the opposite of a compiler -- it takes the
executable file and tries to recreate the original source code.
|
|
|
In general, decompilation can be extremely difficult or even impossible. |
|
Therefore this project focuses on something slightly simpler -- |
|
decompilation of .NET executables. |
|
The advantage of .NET executables is that they consist of
processor-independent bytecode, which is easier to decompile than
traditional machine code because
the instructions are more high level and the code is less optimized.
|
The .NET executables also preserve a lot of metadata about the code like |
|
type information and method names. |
|
|
|
Note that despite its name, the \emph{.NET Framework}
is unrelated to networking or the Internet; it is essentially the
equivalent of the \emph{Java Development Kit}.
|
|
|
To succeed, the produced decompiler is required to successfully decompile |
|
a console implementation of a quick-sort algorithm back to C\# source code. |
|
|
|
|
|
\section{Decompilation} |
|
|
|
Decompilation is the opposite of compilation. It takes the executable |
|
file and attempts to recreate the original source code. |
|
Despite doing the reverse, decompilation is very similar to
compilation. This is because both compilation and decompilation do
in essence the same thing -- they translate a program from one form to
another and then optimize the final form.
Because of this, many analyses and optimizations used in compilation
are also applicable in decompilation.
|
|
|
Decompilation may not always be possible. In particular, |
|
self-modifying code is very problematic. |
|
|
|
\section{Uses of decompilation} |
|
|
|
\begin{itemize} |
|
\item Recovery of lost source code. |
|
\item Reverse engineering of programs that ship with no
source code. This, however, might be a somewhat controversial use
due to legal issues.
|
\item Interoperability. |
|
\item Security analysis of programs. |
|
\item Error correction or minor improvements. |
|
\item Porting of an executable to another platform.
|
\item Translation between languages. |
|
Source code in some obscure language can be compiled into an |
|
executable which can in turn be decompiled into C\#. |
|
\item Debugging of deployed systems. A deployed system
might not ship with source code or it might be difficult
to `link' it to the source code because it is optimized.
Also note that it might not be acceptable to restart
the running system because the issue is not easily reproducible.
|
\end{itemize} |
|
|
|
\section{The .NET Framework} |
|
% Bytecode - stack-based, high-level |
|
% JIT |
|
% Much metadata |
|
% Typesafe |
|
|
|
The \emph{.NET Framework} is a general-purpose software development platform |
|
which is very similar to the \emph{Java Development Kit}. |
|
It includes an extensive class library |
|
and, similarly to Java, it is based on a virtual machine model which |
|
makes it platform independent. |
|
|
|
To be platform independent, the program is stored in the form of bytecode
(also called Common Intermediate Language or \emph{IL}).
The bytecode instructions use the stack to pass data around rather than
registers.
The instructions are aware of the object-oriented programming
model and are more high level than traditional processor instructions.
For example, there are special instructions to allocate an object in memory,
to access a field of an object, to cast an object to a different type
and so on.
|
|
|
The bytecode is translated to machine code of the
target platform at run-time. The code is just-in-time compiled rather than
interpreted. This means that the first execution of a function will be slow
due to the translation overhead, but subsequent executions of the
same function will be very fast.
|
|
|
The .NET executables preserve a lot of information about the |
|
structure of the program. The complete structure of classes is |
|
preserved including all the typing information. The code is still |
|
separated into distinct methods and the complete signatures of |
|
the methods are preserved -- that is, the executable includes |
|
the method name, the parameters and the return type. |
|
|
|
The virtual machine is intended to be type-safe. Before any code |
|
is executed it is verified and code that has been successfully |
|
verified is guaranteed to be type-safe. For example, it is
guaranteed not to access unallocated memory.
|
However, the framework also provides some inherently unsafe |
|
instructions. This allows support for languages like C++, but |
|
it also means that some parts of the program will be unverifiable. |
|
Unverifiable code may or may not be allowed to execute depending |
|
on the local security policy. |
|
|
|
In general, any programming language can be compiled to \emph{.NET} and |
|
there are already dozens, if not hundreds, of compilers in existence.
|
The most commonly used language is \emph{C\#}. |
|
|
|
|
|
\section{Scope of the project} |
|
|
|
The \emph{.NET Framework} is a mature, comprehensive platform and handling
all of its features is outside the scope of the project. Therefore, it is
necessary to state which features should be supported and which
should not.
|
|
|
The decompiler was specified to at least handle all the features required |
|
to decompile the quick-sort algorithm. This includes integer arithmetic, |
|
array operations, conditional statements, loops and system method calls. |
|
|
|
The decompiler will not support so-called unverifiable instructions -- |
|
for example, pointer arithmetic and access to arbitrary memory locations. |
|
These operations are usually used only for |
|
interoperability with native code (code not running under the .NET |
|
virtual machine). As such, these instructions
are relatively rare and do not occur at all in many programs.
|
|
|
Generics will not be supported either. They were introduced in the
second version of the Framework and many programs still do not make any
use of them. Generics also probably do not pose any
theoretical challenge for decompilation; supporting them just involves handling
more bytecodes and more special cases of the existing bytecodes.
|
|
|
\section{Previous work} |
|
|
|
Decompilers have existed nearly as long as compilers.
The very first decompiler was written by Joel Donnelly in 1960
under the supervision of Professor Maurice Halstead.
Initially, decompilers were created to aid in the program
conversion process from second to third generation computers.
Decompilers were actively developed over the following
years and were used for a variety of applications.
|
|
|
Probably the best known and most influential research work
is \emph{Cristina Cifuentes' PhD Thesis ``Reverse Compilation
Techniques'', 1994}. As part of the thesis, Cifuentes
created a prototype, \emph{dcc}, which translates Intel i80286
machine code into C using various data and control flow
analysis techniques.
|
|
|
The introduction of virtual machines
(e.g.\ the \emph{Java Virtual Machine} or the \emph{.NET Framework Runtime})
made decompilation more feasible than ever.
|
|
|
There are already several closed-source .NET decompilers
in existence. The best known one is probably
Lutz Roeder's \emph{Reflector for .NET}.
|
|
|
% Cristina |
|
% Current .NET Decompilers |
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
|
% 12 Pages |
|
\cleardoublepage |
|
\chapter{Preparation} |
|
|
|
This chapter explains the general idea behind the decompilation |
|
process. The implementation chapter then explains how these |
|
methods were implemented and what measures were taken |
|
to ensure the correctness of the produced output. |
|
|
|
\section{Overview of the decompilation process} |
|
% Read metadata |
|
% Starting point |
|
% Translating bytecodes into C# expressions; |
|
% High-level; Stack-Based; Conditional gotos |
|
% Dataflow analysis; inlining |
|
% Control flow analysis (interesting) |
|
% Suggar |
|
|
|
The first step is to read and disassemble the \emph{.NET} executable files. |
|
There is an open-source library called \emph{Cecil} which was created exactly
for this purpose. It loads the whole executable into memory and then
provides a simple object-oriented API to access the content of the executable.
Using this library, access to the executable is straightforward.
|
|
|
It is worth noting the wealth of information we are starting with. |
|
\emph{.NET} executables contain quite a high-level representation
of the program. Complete typing information is still present --
we have access to all the classes including the fields and methods.
The types and names are also preserved.
The bytecode is stored on a per-method basis and even the local variables
used in the method body are still explicitly represented including the type.%
\footnote{However, the names of local variables are not preserved.}
|
|
|
Ultimately, the challenge is to translate the bytecode of a given
method to C\# code.

For example, consider decompilation of the following code:
|
|
|
\begin{verbatim}
// Load "Hello, world!" on top of the stack
ldstr "Hello, world!"
// Print the top of the stack to the console
call System.Console::WriteLine(string)
\end{verbatim}
|
|
|
We start by taking the bytecodes one by one and translating them
to C\# statements. The bytecodes are quite high level so
the translation is usually straightforward.
The two bytecodes can thus be represented as the following
two C\# statements (\verb|output1| and \verb|some_input| are just
dummy variables):
|
|
|
\begin{verbatim}
string output1 = "Hello, world!";
System.Console.WriteLine(some_input);
\end{verbatim}
|
|
|
The first major challenge of decompilation is data-flow |
|
analysis. In this example, we would analyze the stack |
|
behavior and we would find that \verb|output1| is on the top |
|
of the stack when \verb|WriteLine| is called. Thus, we would get: |
|
|
|
\begin{verbatim}
string output1 = "Hello, world!";
System.Console.WriteLine(output1);
\end{verbatim}
|
|
|
At this point the produced source code should already compile |
|
and work as intended. |
|
|
|
The second major challenge of decompilation is control-flow |
|
analysis. The use of control-flow analysis |
|
is to replace \verb|goto| statements with high-level structures |
|
like loops and conditional statements. This makes the produced
source code much easier to read and understand.
|
|
|
Finally, there are several sugarings that can be done to the code to |
|
make it more readable -- for example, replacing \verb|i = i + 1| |
|
with the equivalent \verb|i++| or using rules of logic to |
|
simplify boolean expressions. |
|
|
|
\section{Translating bytecodes into C\# expressions} |
|
% One at a time |
|
% Byte code is high level (methods, classes, bytecode commands) |
|
% Types preserved (contrast with Java) |
|
|
|
The first task is to naively convert the individual bytecodes into
equivalent C\# expressions or statements.
|
|
|
This task is reasonably straightforward because the bytecodes |
|
are quite high level and are aware of the object-oriented |
|
programming model. For example, there are bytecodes |
|
for integer addition, object creation, array access, |
|
field access, object casting and so on; all of these trivially |
|
translate to C\# expressions. |
|
|
|
Any branching commands are also easy to handle because |
|
C\# supports labels and \verb|goto|s. Unconditional branches |
|
are translated to \verb|goto| statements and conditional |
|
branches are translated to \verb|if (condition) goto Label;|. |
|
|
|
C\# uses local variables for data transfer whereas bytecodes |
|
use the stack. This mismatch will be handled during the |
|
data-flow analysis phase. Until then, dummy variables
are used as placeholders. For example, the \verb|add| bytecode
would be translated to \verb|input1 + input2|.
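
The naive translation described above can be sketched as follows.
This is only a minimal illustration under stated assumptions: the class
\verb|NaiveTranslator|, the method \verb|Translate|, the \verb|out_| naming
scheme and the four-opcode subset are hypothetical and are not taken from the
decompiler itself. The placeholders \verb|input1|, \verb|input2| and
\verb|condition| stand for values that the data-flow analysis resolves later.

\begin{verbatim}
using System;

static class NaiveTranslator
{
    // Translate a single bytecode into one C# statement.
    // 'condition', 'input1' and 'input2' are placeholders that the
    // later data-flow analysis replaces with real variables.
    public static string Translate(string label, string op, string arg)
    {
        switch (op)
        {
            case "ldstr":
                return "string out_" + label + " = \"" + arg + "\";";
            case "add":     // result type assumed to be int here
                return "int out_" + label + " = input1 + input2;";
            case "br":
                return "goto " + arg + ";";
            case "brtrue":
                return "if (condition) goto " + arg + ";";
            default:
                throw new NotSupportedException(op);
        }
    }

    static void Main()
    {
        Console.WriteLine(Translate("IL_01", "ldstr", "Hello"));
        Console.WriteLine(Translate("IL_02", "brtrue", "IL_09"));
    }
}
\end{verbatim}

Each bytecode is handled in isolation, which is why the output still
contains placeholders and \verb|goto| statements at this stage.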
|
|
|
|
|
\section{Data-flow analysis} |
|
\subsection{Stack in .NET} |
|
% Stack behavior |
|
% Deterministic stack (tested - failed) |
|
|
|
Bytecodes use a stack to pass data between consecutive |
|
instructions. When a bytecode is executed, it pops its arguments |
|
from the stack, it performs some operation on them and then |
|
it pushes the result, if any, back to the stack. |
|
|
|
The number of popped elements is fixed and known before execution time
-- for example, any binary operation pops exactly two arguments and
a method call pops one element for each of its arguments.
|
|
|
Most bytecodes either push only one element on the stack or none at all.% |
|
\footnote{There is one exception to this -- the `dup' instruction. |
|
More on this later.} |
|
This makes the data-flow analysis slightly simpler. |
|
|
|
The framework imposes some useful restrictions with regard to
control-flow. In theory, we could arrive at a given point in the program
|
and depending on which execution path was taken, we could end up with |
|
two completely different stacks. This is not allowed. Regardless |
|
of the path taken, the stack must have the same number of elements in it. |
|
The typing is also restricted -- if one execution path pushes an integer |
|
on the stack, any other execution path must push an integer as well. |
|
|
|
\subsection{Stack simulation} |
|
|
|
Bytecodes use the stack to move data around, but there is no such stack |
|
concept in C\#. Therefore we need to use something that C\# does |
|
have -- local variables. |
|
|
|
The transformation from stack to local variables is simpler than
it might initially seem. Basically, pushing something on the
stack means that we will need that value later on. So in C\#,
we create a new temporary variable and store the value in it.
On the other hand, popping is a use of the value. So in C\#,
we just reference the temporary variable that
holds the value we want. The only difficulty is to figure out
which temporary variable it is.
|
|
|
This is where the restrictions on the stack come in handy -- we can
|
figure out what the state of the stack will be at any given point. |
|
That is, we do not know what the value will be, but we do know |
|
who pushed the value on the stack and what the type of the value is. |
|
For example, after executing the following code |
|
\begin{verbatim}
IL_01: ldstr "Hello"
IL_02: ldstr "world"
\end{verbatim}
we know that there will be exactly two values on the stack --
the first one pushed by the bytecode labeled \verb|IL_01| and the
second one pushed by the bytecode labeled \verb|IL_02|. Both of
them are of type \verb|String|. Let us say that the state
of the stack is \{\verb|IL_01|, \verb|IL_02|\}.
|
|
|
Now, let us pop a value from the stack by calling
\verb|System.Console.WriteLine|. This method takes one
\verb|String| argument. Just before the method is called
the state of the stack is \{\verb|IL_01|, \verb|IL_02|\}.
Therefore the value pushed by \verb|IL_02| is popped
from the stack and we are left with \{\verb|IL_01|\}.
|
Furthermore, we know which temporary local variable to |
|
use in the method call -- the one that stores the value |
|
pushed by \verb|IL_02|. |
|
|
|
Let us call \verb|System.Console.WriteLine| again to completely |
|
empty the stack and let us see what the produced C\# source code |
|
would look like (the comments show the state of the stack |
|
between the individual commands). |
|
\begin{verbatim}
// {}
String temp_for_IL_01 = "Hello";
// {IL_01}
String temp_for_IL_02 = "world";
// {IL_01, IL_02}
System.Console.WriteLine(temp_for_IL_02);
// {IL_01}
System.Console.WriteLine(temp_for_IL_01);
// {}
\end{verbatim}
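
The bookkeeping behind this listing can be sketched in a few lines.
The sketch below is an illustration only: the class \verb|StackSimulation|,
the tuple-based instruction format and the restriction to \verb|ldstr| and
one-argument \verb|call| are assumptions made to keep it short. The simulated
stack holds the labels of the instructions that pushed each value, not the
values themselves.

\begin{verbatim}
using System;
using System.Collections.Generic;

static class StackSimulation
{
    public static List<string> Simulate(
        (string Label, string Op, string Arg)[] code)
    {
        // Labels of the instructions that pushed each stack slot.
        var stack = new Stack<string>();
        var output = new List<string>();
        foreach (var (label, op, arg) in code)
        {
            if (op == "ldstr")
            {
                // A push becomes a new temporary variable.
                output.Add("String temp_for_" + label
                           + " = \"" + arg + "\";");
                stack.Push(label);
            }
            else if (op == "call")  // assumed: one argument, no result
            {
                // A pop becomes a reference to the temporary.
                string pushedBy = stack.Pop();
                output.Add(arg + "(temp_for_" + pushedBy + ");");
            }
            else throw new NotSupportedException(op);
        }
        return output;
    }

    static void Main()
    {
        var code = new[]
        {
            ("IL_01", "ldstr", "Hello"),
            ("IL_02", "ldstr", "world"),
            ("IL_03", "call", "System.Console.WriteLine"),
            ("IL_04", "call", "System.Console.WriteLine"),
        };
        foreach (string line in Simulate(code))
            Console.WriteLine(line);
    }
}
\end{verbatim}

Run on the four instructions in \verb|Main|, the sketch emits the same four
statements as the listing above, without the stack-state comments.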
|
|
|
\subsection{The `dup' instruction} |
|
|
|
In general, bytecodes push either one value on the stack or none at all. |
|
There is only one exception to this -- the \verb|dup| instruction. |
|
The purpose of this instruction is to duplicate the value on top of the |
|
stack. Fortunately, it does not break our model. The only significance
of the \verb|dup| instruction is that the same value will be used twice --
that is, the temporary local variable will be referenced twice.
|
|
|
The \verb|dup| instruction, however, makes some future optimizations |
|
more difficult. |
|
|
|
The \verb|dup| instruction is used relatively rarely and |
|
it usually just saves a few bytes of disk space. |
|
Note that the \verb|dup| instruction could always be replaced
by a store to a local variable and two loads.

The \verb|dup| instruction can be used in an expression like
\verb|this.a + this.a| -- the value \verb|this.a| is obtained
once, duplicated and then the duplicates are added.
|
|
|
\subsection{Control merge points} |
|
|
|
This is in essence the most complex part of data-flow analysis.
It is caused by the possibility of multiple execution paths.
There are two scenarios that branching can cause.
|
|
|
The nice scenario is when a value is pushed in one location and, |
|
depending on the execution path, can be popped in two or more different |
|
locations. This translates well into the local variable model --
|
it simply means that the temporary local variable will be referenced |
|
several times and there is nothing special we need to do. |
|
|
|
The worse scenario is when a value can be pushed on the stack
in two or more different locations and is then popped in only one location.
An example of this would be
\verb|Console.WriteLine(condition ? "true": "false")| --
depending on the value of the condition either ``true'' or
``false'' is printed. This code snippet compiles into bytecode
with two possible execution paths -- one in which ``true'' is pushed
on the stack and one in which ``false'' is pushed on the stack.
In both cases the value is popped only once when
\verb|Console.WriteLine| is called.
|
|
|
This scenario does not work well with the local variable |
|
model. Two values can be pushed on the stack and so two temporary |
|
local variables are created -- one holding ``true'' and |
|
one holding ``false''. |
|
Depending on the execution path either of these can be on the |
|
top of the stack when \verb|Console.WriteLine| is executed and |
|
therefore we do not know which temporary variable to use as the argument |
|
for \verb|Console.WriteLine|. It is impossible to know. |
|
|
|
This can be resolved by creating yet another temporary variable |
|
which will hold the actual argument for the \verb|Console.WriteLine| |
|
call. Depending on the execution path taken, one of the previous |
|
variables needs to be copied into this one. |
|
So the produced C\# source code would look like this: |
|
\begin{verbatim}
String sigma;
if (condition) {
    String temp1 = "true";
    sigma = temp1;
} else {
    String temp2 = "false";
    sigma = temp2;
}
Console.WriteLine(sigma);
\end{verbatim}
|
|
|
Note that this is essentially the trick that is used to obtain
the static single assignment (SSA) form of programs during compilation.

In practice, this happens extremely rarely and it is almost exclusively
caused by the C\# conditional operator (\verb|?:|).
|
|
|
\subsection{In-lining expressions} |
|
% Generally difficult, but here SSA-SR |
|
% Validity considered - order, branching |
|
|
|
The code produced so far contains many temporary variables. |
|
These variables are often not necessary and can be |
|
optimized away. For example, the code |
|
|
|
\begin{verbatim}
string output1 = "Hello, world!";
System.Console.WriteLine(output1);
\end{verbatim}
|
|
|
can be simplified into a single line: |
|
|
|
\begin{verbatim}
System.Console.WriteLine("Hello, world!");
\end{verbatim}
|
|
|
Doing this optimization makes the produced code much more readable. |
|
|
|
In general, removing a local variable from a method is non-trivial. |
|
However, these generated local variables are special -- they were generated |
|
to remove the stack-based data-flow model. |
|
This means that the local variables are assigned only once and are |
|
read only once% |
|
\footnote{With the exception of a value duplicated by the `dup' instruction, which is read twice
and thus needs to be handled specially.}
|
(corresponding to the push and pop operations respectively). |
|
|
|
There are some safety issues with this optimization. |
|
Even though we are sure that the local variable is assigned and read |
|
only once, we need to make sure that doing the optimization will not |
|
change the semantics of the program. For example consider the |
|
following code: |
|
|
|
\begin{verbatim}
string output1 = f();
string output2 = g();
System.Console.WriteLine(output2 + output1);
\end{verbatim}
|
|
|
The functions \verb|f| and \verb|g| may have side-effects and in this |
|
case \verb|f| is called first and then \verb|g| is called. |
|
Optimizing the code would produce |
|
\begin{verbatim}
System.Console.WriteLine(g() + f());
\end{verbatim}
|
which calls the functions in the opposite order. |
|
|
|
Being conservative, we will allow this optimization only if the |
|
declaration of the variable immediately precedes its use. |
|
In other words, the variable is declared and then immediately |
|
used with no possible side-effects in between these. |
|
Using this rule, the previous example can be optimized to |
|
\begin{verbatim}
string output1 = f();
System.Console.WriteLine(g() + output1);
\end{verbatim}
but it cannot be optimized further because the function
\verb|g| is executed between the declaration of \verb|output1|
and its use.
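
The conservative rule can be sketched as follows. The statement model used
here (\verb|Stmt| with a \verb|Template| and a list of \verb|Operands| in
evaluation order) and the class \verb|Inliner| are hypothetical
simplifications chosen for this illustration, not the decompiler's real data
structures. The sketch also assumes that every temporary is read exactly
once, and it prints \verb|var| instead of the real type. A temporary is
in-lined into the next statement only if everything evaluated before its use
is a plain read of another temporary.

\begin{verbatim}
using System;
using System.Collections.Generic;

class Stmt
{
    public string Temp;            // temporary declared here, or null
    public string Template;        // e.g. "Print({0} + {1})"
    public List<string> Operands;  // in evaluation order

    public string Expr =>
        string.Format(Template, Operands.ToArray());
    public string Render() =>
        (Temp == null ? "" : "var " + Temp + " = ") + Expr + ";";
}

static class Inliner
{
    public static void Inline(List<Stmt> body)
    {
        var temps = new HashSet<string>();
        foreach (Stmt s in body)
            if (s.Temp != null) temps.Add(s.Temp);

        int i = 0;
        while (i + 1 < body.Count)
        {
            Stmt def = body[i], use = body[i + 1];
            int pos = def.Temp == null
                ? -1 : use.Operands.IndexOf(def.Temp);
            // Safe only if every operand evaluated before the use is
            // a plain read of a temporary (no side effects).
            bool safe = pos >= 0;
            for (int k = 0; k < pos; k++)
                safe = safe && temps.Contains(use.Operands[k]);
            if (safe)
            {
                use.Operands[pos] = def.Expr;
                body.RemoveAt(i);
                if (i > 0) i--;  // re-check the now adjacent statement
            }
            else i++;
        }
    }

    static void Main()
    {
        var body = new List<Stmt>
        {
            new Stmt { Temp = "output1", Template = "{0}",
                       Operands = new List<string> { "f()" } },
            new Stmt { Temp = "output2", Template = "{0}",
                       Operands = new List<string> { "g()" } },
            new Stmt { Template = "System.Console.WriteLine({0} + {1})",
                       Operands = new List<string>
                                  { "output2", "output1" } },
        };
        Inline(body);
        foreach (Stmt s in body) Console.WriteLine(s.Render());
    }
}
\end{verbatim}

On the example above, \verb|output2| is in-lined because it is the first
operand evaluated, while \verb|output1| is kept because the call
\verb|g()| would then be evaluated before it.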
|
|
|
|
|
\section{Control-flow analysis} |
|
% Recovering high-level structures |
|
|
|
The main point of control-flow analysis is to recover high-level |
|
structures like loops and conditional blocks of code. |
|
This is not necessary to produce semantically correct code, |
|
but it does make the code much more readable compared to |
|
only using \verb|goto| statements. It is also |
|
the most theoretically interesting part of the whole |
|
decompilation process. |
|
|
|
In general, \verb|goto|s can always be completely removed
by playing some `tricks' like duplicating code or adding
boolean flags. However, as far as readability of the
code is concerned, this does more harm than good.
Therefore, we want to remove \verb|goto|s in the `natural'
way -- by recovering the high-level structures they represent
in the program. Any \verb|goto|s that cannot be removed
in this way are left in the program.
|
|
|
See figures \ref{IfThen}, \ref{IfThenElse} and \ref{Loop} |
|
for examples of control-flow graphs. |
|
|
|
\begin{figure}[tbh] |
|
\centerline{\includegraphics{figs/IfThen.png}} |
|
\caption{\label{IfThen}Conditional statement (if-then)} |
|
\end{figure} |
|
|
|
\begin{figure}[tbh] |
|
\centerline{\includegraphics{figs/IfThenElse.png}} |
|
\caption{\label{IfThenElse}Conditional statement (if-then-else)} |
|
\end{figure} |
|
|
|
\begin{figure}[tbh] |
|
\centerline{\includegraphics{figs/Loop.png}} |
|
\caption{\label{Loop}Loop} |
|
\end{figure} |
|
|
|
\subsection{Finding loops} |
|
% T1-T2 reduction |
|
% Nested loops |
|
% Goto=Continue=Break; Multi-level break/continue |
|
|
|
T1-T2 transformations are used to find loops.
These transformations are usually used to determine whether
a control-flow graph is reducible or not, but
it turns out that they are suitable for finding loops as well.
|
|
|
If a control-flow graph is irreducible, it means that the
program cannot be represented using only high-level structures.
That is, \verb|goto| statements are necessary to represent the
program. This is undesirable since \verb|goto| statements
usually make the decompiled source code less readable.
See figure \ref{Ireducible} for the classical example
of an irreducible control-flow graph.
|
|
|
\begin{figure}[tbh] |
|
\centerline{\includegraphics{figs/Ireducible.png}} |
|
\caption{\label{Ireducible}Irreducible control-flow graph}
|
\end{figure} |
|
|
|
If a control-flow graph is reducible, the program may or may not be |
|
representable only using high-level structures. That is, |
|
\verb|goto| statements still may be necessary on some occasions. |
|
For example, C\# has a \verb|break| statement which exits
the innermost loop, but it does not have any statement which
would exit multiple levels of loops.%
\footnote{This is language dependent.
The Java language, for example, does provide such a statement.}
Therefore, in such cases we have to use the \verb|goto| statement.
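
As a small illustration of such a case, the following hand-written example
(the class \verb|NestedExit| and the method \verb|Find| are invented for
this purpose) leaves two nested loops at once. The control-flow graph is
reducible, yet a \verb|goto| is the direct way to express the exit in C\#.

\begin{verbatim}
using System;

static class NestedExit
{
    // Returns the first pair (i, j) with i * j == target,
    // or (-1, -1) if there is none in the 9 x 9 table.
    public static (int, int) Find(int target)
    {
        int i = 0, j = 0;
        for (i = 1; i <= 9; i++)
        {
            for (j = 1; j <= 9; j++)
            {
                // 'break' would leave only the inner loop.
                if (i * j == target) goto Found;
            }
        }
        return (-1, -1);
    Found:
        return (i, j);
    }

    static void Main()
    {
        Console.WriteLine(Find(12));
    }
}
\end{verbatim}

Avoiding the \verb|goto| here would require an extra boolean flag or moving
the loops into a separate method, which is exactly the kind of `trick' the
decompiler tries not to introduce.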
|
|
|
The algorithm works by taking the initial control-flow graph |
|
and then repeatedly applying simplifying transformations to it. |
|
As we can see from the name, the algorithm consists of two |
|
transformations called T1 and T2. |
|
If the graph is reducible, the algorithm will eventually simplify |
|
the graph to only a single node. If the graph is not reducible, |
|
we will end up with multiple nodes and no more transformations |
|
that can be performed. |
|
|
|
Both of the transformations look for a specific pattern in |
|
the graph and then replace the pattern with something simpler. |
|
|
|
See figure \ref{T1T2} for a diagram showing the T1 and T2 transformations. |
|
|
|
\begin{figure}[tbh] |
|
\centerline{\includegraphics{figs/T1T2.png}} |
|
\caption{\label{T1T2}T1 and T2 transformations} |
|
\end{figure} |
|
|
|
The T1 transformation removes a self-loop. If one of the successors
of a node is the node itself, then the node is a self-loop.
That is, the node `points' to itself. The T1 transformation
replaces the node with a node without such a link.
The number of performed T1 transformations corresponds to the
number of loops in the program.
|
|
|
The T2 transformation merges two consecutive nodes together.
The only condition is that the second node has to have
only one predecessor, which is the first node.
Performing this transformation repeatedly basically reduces
any directed acyclic graph into a single node.
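
The two transformations can be sketched on a graph stored as a map from each
node to the set of its successors. This is a minimal sketch under stated
assumptions: the class \verb|T1T2|, the method \verb|Reduce| and the integer
node numbering are hypothetical, the entry node is never merged into another
node, and the quadratic predecessor search is chosen for brevity rather than
speed.

\begin{verbatim}
using System;
using System.Collections.Generic;
using System.Linq;

static class T1T2
{
    // Reduces the graph in place and returns the number of T1 steps.
    // 'succ' maps every node to the set of its successors.
    public static int Reduce(
        Dictionary<int, HashSet<int>> succ, int entry)
    {
        int loops = 0;
        bool changed = true;
        while (changed)
        {
            changed = false;

            // T1: remove self-loops.
            foreach (int n in succ.Keys)
                if (succ[n].Remove(n)) { loops++; changed = true; }

            // T2: merge a node into its only predecessor.
            foreach (int n in succ.Keys.ToList())
            {
                if (n == entry || !succ.ContainsKey(n)) continue;
                var preds = succ.Keys
                    .Where(p => succ[p].Contains(n)).ToList();
                if (preds.Count != 1 || preds[0] == n) continue;
                succ[preds[0]].Remove(n);
                // Inheriting the successors may add a self-loop.
                succ[preds[0]].UnionWith(succ[n]);
                succ.Remove(n);
                changed = true;
            }
        }
        return loops;
    }

    static void Main()
    {
        // 0 -> 1, 1 -> {2, 3}, 2 -> 1: a loop with header 1.
        var g = new Dictionary<int, HashSet<int>>
        {
            [0] = new HashSet<int> { 1 },
            [1] = new HashSet<int> { 2, 3 },
            [2] = new HashSet<int> { 1 },
            [3] = new HashSet<int>(),
        };
        int loops = Reduce(g, 0);
        Console.WriteLine(loops + " loop(s), reducible: "
                          + (g.Count == 1));
    }
}
\end{verbatim}

After the call, the graph is reducible exactly when a single node remains in
\verb|succ|. The returned count depends on the order in which the
transformations happen to be applied, so it should be read as the number of
apparent loops for this particular order.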
|
|
|
See figures \ref{IfThenElseReduction}, \ref{NestedLoops} and
\ref{NestedLoops2} for examples of how T1 and T2 transformations
can be used to reduce graphs.
|
|
|
\begin{figure}[p] |
|
\centerline{\includegraphics{figs/IfThenElseReduction.png}} |
|
\caption{\label{IfThenElseReduction}Reduction of if-then-else statement} |
|
\end{figure} |
|
|
|
\begin{figure}[p] |
|
\centerline{\includegraphics{figs/NestedLoops.png}} |
|
\caption{\label{NestedLoops}Reduction of two nested loops} |
|
\end{figure} |
|
|
|
\begin{figure}[tbhp] |
|
\centerline{\includegraphics{figs/NestedLoops2.png}} |
|
It is not immediately obvious that these are, in fact,
two nested loops. A fifth conceptual node has been added
to demonstrate why the loops can be considered nested.
|
\caption{\label{NestedLoops2}Reduction of two nested loops 2} |
|
\end{figure} |
|
|
|
The whole algorithm always terminates, and it ends with a single
node if and only if the graph is reducible. However, it may produce different
intermediate states depending on the order of the transformations.
For example, the number of T1 transformations used can vary.
This affects the number of apparent loops in the program.
|
|
|
Note that although reducible graphs are preferred, |
|
irreducible graphs can be easily decompiled as well |
|
using \verb|goto| statements. Programs that were originally |
|
written in C\# are most likely going to be reducible. |
|
|
|
\subsection{Finding conditionals} |
|
% Compound conditions |
|
|
|
Any T1 transformation (self-loop reduction) produces a loop.
Similarly, one or more T2 transformations
produce a directed acyclic sub-graph. This sub-graph can
usually be represented as one or more conditional statements.
|
|
|
Any node in the control-flow graph that has two successors |
|
is potentially a conditional -- that is, \verb|if-then-else| |
|
statement. The node itself determines which branch will |
|
be taken, so the node is the condition. We now need to |
|
determine what is the `true' body and what is the `false' |
|
body. That is, we look for the blocks of code that will |
|
be executed when the condition will or will not hold respectively. |
|
This is done by considering reachability. Code that is
reachable only by following the `true' branch can be
considered the `true' body of the conditional.
|
Similarly, code that is reachable only by following the |
|
`false' branch is the `false' body of the conditional. |
|
Finally, the code that is reachable from both branches is |
|
not part of the conditional -- it is the code that follows |
|
the whole \verb|if-then-else| statement. |
|
See figure \ref{IfThenElseRechablility} for a diagram of this. |
|
|
|
\begin{figure}[tbhp] |
|
\centerline{\includegraphics{figs/IfThenElseRechablility.png}} |
|
	\caption{\label{IfThenElseRechablility}Reachability of nodes for a conditional}
|
\end{figure} |
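
The reachability rule can be sketched in C\# as follows.
This is a minimal sketch rather than the decompiler's code: it assumes
that the part of the graph being examined is acyclic, that the graph is
a dictionary from node numbers to successor arrays, and the names
\verb|ConditionalBodies|, \verb|Reachable| and \verb|Split| are hypothetical.
\begin{verbatim}
using System;
using System.Collections.Generic;

public static class ConditionalBodies
{
    // Returns all nodes reachable from 'start', including 'start'.
    public static HashSet<int> Reachable(
        Dictionary<int, int[]> successors, int start)
    {
        var seen = new HashSet<int>();
        var work = new Stack<int>();
        work.Push(start);
        while (work.Count > 0) {
            int node = work.Pop();
            if (!seen.Add(node)) continue;   // already visited
            foreach (int next in successors[node]) work.Push(next);
        }
        return seen;
    }

    // Classifies the nodes after a condition by reachability.
    public static void Split(
        Dictionary<int, int[]> successors,
        int trueTarget, int falseTarget,
        out HashSet<int> trueBody, out HashSet<int> falseBody,
        out HashSet<int> following)
    {
        HashSet<int> fromTrue = Reachable(successors, trueTarget);
        HashSet<int> fromFalse = Reachable(successors, falseTarget);
        trueBody = new HashSet<int>(fromTrue);
        trueBody.ExceptWith(fromFalse);      // only via 'true'
        falseBody = new HashSet<int>(fromFalse);
        falseBody.ExceptWith(fromTrue);      // only via 'false'
        following = new HashSet<int>(fromTrue);
        following.IntersectWith(fromFalse);  // reachable from both
    }

    static void Main()
    {
        var graph = new Dictionary<int, int[]> {
            { 0, new[] { 1, 2 } },  // the condition
            { 1, new[] { 3 } },     // 'true' branch
            { 2, new[] { 3 } },     // 'false' branch
            { 3, new int[0] }       // code after the conditional
        };
        HashSet<int> trueBody, falseBody, following;
        Split(graph, 1, 2, out trueBody, out falseBody, out following);
        Console.WriteLine(string.Join(",", following)); // prints 3
    }
}
\end{verbatim}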
|
|
|
\subsection{Short-circuit boolean expressions} |
|
|
|
Short-circuit boolean expressions are boolean expressions |
|
which may not evaluate completely. |
|
|
|
The semantics of the normal boolean expression \verb|f() & g()| is
to evaluate both \verb|f()| and \verb|g()| and then return
the logical `and' of these.
|
|
|
The semantics of the short-circuit boolean expression \verb|f() && g()|
is different. The expression \verb|f()| is evaluated as before,
but \verb|g()| is evaluated only if \verb|f()| returned \emph{true}.
If \verb|f()| returned \emph{false} then the whole expression
will return \emph{false} anyway and so there is no point in evaluating
\verb|g()|.
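
The difference can be observed with a small C\# program.
The class \verb|ShortCircuitDemo| and the functions \verb|F| and \verb|G|
are made up for this illustration; they record every call so that the
evaluation order becomes visible.
\begin{verbatim}
using System;
using System.Collections.Generic;

public static class ShortCircuitDemo
{
    static readonly List<string> Calls = new List<string>();

    static bool F() { Calls.Add("f"); return false; }
    static bool G() { Calls.Add("g"); return true; }

    // Evaluates 'F() && G()' or 'F() & G()' and reports which
    // functions were called together with the result.
    public static string Trace(bool shortCircuit)
    {
        Calls.Clear();
        bool result = shortCircuit ? (F() && G()) : (F() & G());
        return string.Join(",", Calls) + " -> " + result;
    }

    static void Main()
    {
        Console.WriteLine(Trace(false)); // prints f,g -> False
        Console.WriteLine(Trace(true));  // prints f -> False
    }
}
\end{verbatim}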
|
|
|
In general, conditionals depending on this short-circuit logic |
|
cannot be expressed as normal conditionals without the use of \verb|goto|s |
|
or code duplication. |
|
Therefore it is desirable to find the short-circuit boolean expressions |
|
in the control-flow graph. |
|
|
|
The short-circuit boolean expressions will manifest themselves as |
|
one of four patterns in the control-flow diagrams. |
|
See figure \ref{shortcircuit}. |
|
|
|
\begin{figure}[tbh] |
|
\centerline{\includegraphics{figs/shortcircuit.png}} |
|
\caption{\label{shortcircuit}Short-circuit control-flow graphs} |
|
\end{figure} |
|
|
|
These patterns can be easily searched for and replaced with a single node. |
|
|
|
Nested expressions like \verb/(f() && g()) || (h() && i())/ |
|
will simplify well by applying the algorithm repeatedly. |
|
|
|
\subsection{Basic blocks} |
|
% Performance optimization only (from compilation) |
|
|
|
A basic block is a block of code which is always executed as
a unit. That is, no other code can jump into the middle of the
block and there are no jumps from the middle of the block.
|
|
|
The implication of this is that as far as control-flow |
|
analysis is concerned, the whole basic block can be |
|
considered as a single instruction. |
|
|
|
The primary reason for implementing this is performance gain.
|
|
|
|
|
\section{Requirements Analysis} |
|
|
|
The decompiler must successfully round-trip a quick-sort algorithm. |
|
That is, given an executable containing an implementation of |
|
the quick-sort algorithm, the decompiler must produce |
|
C\# source code that is both syntactically and semantically correct. |
|
The produced source code must compile and work correctly |
|
without any additional modifications to the source code. |
|
|
|
To achieve this, the decompiler needs to handle at least the following:
|
\begin{itemize} |
|
	\item Integers and integer arithmetic
	\item Creation and use of integer arrays
	\item Branching
	\item Definition of several methods
	\item Method arguments and return values
	\item Recursive method calls
	\item Reading and parsing of integer command line arguments
	\item Output of text to the standard console output
|
\end{itemize} |
|
|
|
|
|
\section{Development model} |
|
|
|
The development is based on the spiral model. |
|
|
|
The decompiler will internally consist of several consecutive code
representations. The decompiled program will be gradually
transformed from one representation to the next until it reaches
the last one and is output as source code.
This matches the spiral model well. Successful implementation
of each code representation can be seen as one iteration in the
spiral model.
|
|
|
Furthermore, once a code representation is finished, it can |
|
be improved by implementing an optimization which transforms it. |
|
This can be seen as another iteration. |
|
|
|
|
|
\section{Reference documentation} |
|
|
|
The .NET executables will be decompiled into C\# source code. |
|
It is therefore crucial to be intimately familiar with the language. |
|
Evaluation semantics, operator precedence, associativity and other |
|
aspects of the language can be found in the \emph{ECMA-334 Standard -- |
|
C\# Language Specification}. |
|
|
|
In order to decompile .NET executables, it is necessary to get
familiar with the .NET Runtime (i.e.\ the .NET Virtual Machine).
In particular, it is necessary to learn how the execution engine
works -- how the stack behaves, what limitations are imposed
on the code, what data types are used and so on.
Most importantly, it is necessary to get familiar with the majority
of the bytecodes.
|
The primary reference for the .NET Runtime is the |
|
\emph{ECMA-335 Standard -- Common Language Infrastructure (CLI)}. |
|
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
|
% 25 Pages |
|
\cleardoublepage |
|
\chapter{Implementation} |
|
|
|
\section{Development process} |
|
% SVN |
|
% GUI -- options; progressive optimizations |
|
% Testing |
|
% Refactoring |
|
|
|
The whole project was developed in C\# on the \emph{Windows} platform. |
|
SharpDevelop\footnote{Open-source integrated development environment |
|
for the .NET framework. I am actively participating in its development, |
|
currently being one of the major developers.} |
|
was used as the integrated development environment for writing the code.
|
|
|
The source code and all files related to the project are stored on an SVN%
\footnote{Subversion (SVN) is an open-source version control system.
It is very similar to CVS and aims to be its successor.}
server. This has several advantages such as back-up of all data,
documented project history (the commit messages) and the ability
to revert to any previous state. The graphical user interface
for SVN also makes it possible to easily see and review all changes
that were made since the previous commit to the server.
|
|
|
The user interface of the decompiler is very simple. |
|
It consists of a single window which is largely covered by a text area |
|
containing the output of the decompilation process. |
|
There are only a few controls in the top part of the window. |
|
Several check boxes can be used to completely turn off or on |
|
optimization stages of the decompilation process. |
|
Numerical text boxes can be used to specify the maximal number of
iterations for several transformations. By gradually increasing
these it is possible to observe changes to the output step-by-step.
|
These options significantly aid in debugging and testing of |
|
the decompiler. |
|
|
|
The source code generated by the decompiler is also stored on |
|
the disk and committed to SVN together with all other files. |
|
This is the primary method of regression testing. The generated source code |
|
is the primary output of the decompiler and thus any change |
|
in it may potentially signify a regression in the decompiler. |
|
Before every commit the changes to this file were reviewed |
|
to confirm that the new features did yield the expected |
|
results and that no regressions were introduced. |
|
|
|
|
|
\section{Overview of representation structures} |
|
% Stack-based bytecode |
|
% Variable-based bytecode |
|
% High-level blocks |
|
% Abstract Syntax Tree |
|
|
|
The decompiler uses the following four intermediate representations |
|
of the program (sequentially in this order): |
|
\begin{itemize} |
|
\item Stack-based bytecode |
|
\item Variable-based bytecode |
|
\item High-level blocks |
|
\item Abstract Syntax Tree |
|
\end{itemize} |
|
The code is translated from one representation to the next and some
transformations/optimizations are done at each of these representations.
|
|
|
The representations are very briefly described in the following |
|
four subsections and then they are covered again in more detail. |
|
|
|
Trivial artificial examples are used to aid the explanation. |
|
Note that the evaluation chapter shows the actual outputs |
|
for the quick-sort implementation. |
|
|
|
\subsection{Stack-based bytecode representation} |
|
The stack-based bytecode is the original bytecode as read from the |
|
assembly. It is never modified. However, it is analyzed and |
|
annotated with the results of the analysis. Most data-flow |
|
analysis is performed at this stage. |
|
This is an example of code in this representation: |
|
\begin{verbatim} |
|
ldstr "Hello, world!" |
|
call System.Console::WriteLine |
|
\end{verbatim} |
|
|
|
\subsection{Variable-based bytecode representation} |
|
The variable-based bytecode is the result of removing the |
|
stack-based data-flow model. Local variables are now used instead. |
|
Even though the stack-based model is now completely removed, |
|
we can still use the bytecodes to represent the program. |
|
We just need to think about them in a slightly different way. |
|
The bytecodes are in essence functions which take arguments |
|
and return values. For example, the \verb|add| bytecode |
|
pops two values and pushes back the result and so it is now |
|
a function which takes two arguments and returns the result. |
|
The \verb|ldstr "Hello, world!"| bytecode used above is slightly |
|
more complicated. It does not pop anything from the stack, |
|
but it still has the implicit argument \verb|"Hello, world!"| |
|
which comes from the immediate operand of the bytecode. |
|
|
|
Continuing with the example, the variable-based form would be:
|
\begin{verbatim} |
|
temp1 = ldstr("Hello, world!"); |
|
call(System.Console::WriteLine, temp1); |
|
\end{verbatim} |
|
where the variable \verb|temp1| was used to remove the stack data-flow model. |
|
|
|
Some transformations are performed at this stage, modifying the |
|
code. For example, the in-lining optimization would transform |
|
the previous example to: |
|
\begin{verbatim} |
|
call(System.Console::WriteLine, ldstr("Hello, world!")); |
|
\end{verbatim} |
|
Note that expressions can be nested as seen in the code above.
|
|
|
\subsection{High-level blocks representation} |
|
The next data representation introduces high-level blocks. |
|
In this representation, related parts of code can be grouped |
|
together into a block of code. |
|
The block signifies some high-level structure and blocks can |
|
be nested (for example, conditional within a loop). |
|
All control-flow analysis is performed at this stage. |
|
|
|
The blocks can be visualized by a pair of curly braces: |
|
\begin{verbatim} |
|
{ // Block of type 'method body' |
|
IL_01: call(System.Console::WriteLine, ldstr("Hello, world")); |
|
{ // Block of type 'loop' |
|
IL_02: call(System.Console::WriteLine, ldstr("!")); |
|
IL_03: br(IL_02) // Branch to IL_02 |
|
} |
|
} |
|
\end{verbatim} |
|
Initially there is only one block containing the whole method |
|
body and as the high-level structures are found, new smaller |
|
blocks are created. At this stage the blocks have no function |
|
other than grouping of code. |
|
|
|
% Validity -- gotos |
|
This data representation also allows code to be reordered
to better fit the original high-level structure.
|
|
|
\subsection{Abstract Syntax Tree representation} |
|
|
|
This is the final representation. The abstract syntax tree (AST) |
|
is a structured way of representing the produced source code. |
|
For example, the expression \verb|(a + 1)| would be represented
as four objects -- parentheses, binary operator, identifier and
constant. The structure is a tree so the identifier and the constant
are children of the binary operator which is in turn a child of the
parentheses.
|
|
|
% The high-level structures are directly translated to nodes in the |
|
% abstract syntax tree. For example, loop will translate to |
|
% a node representing code \verb|for(;;) {}|. |
|
|
|
% The bytecodes are translated to the abstract syntax tree on |
|
% individual basis. For example, the bytecode expression \verb|add(1, 2)| |
|
% is translated to \verb|1 + 2|. The previously seen |
|
% \verb|call(System.Console::WriteLine, ldstr("Hello, world"))| |
|
% is translated to \verb|System.Console.WriteLine("Hello, world")|. |
|
|
|
This initial abstract syntax tree will be quite verbose due to |
|
constructs that ensure correctness of the produced source code. |
|
There will be an especially high quantity of \verb|goto|s and |
|
parentheses. For example, a simple increment is initially |
|
represented as \verb|((i) = (i) + (1));| where the parentheses |
|
are used to ensure safety. |
|
Transformations are done on the abstract syntax tree to simplify the code
on occasions where it is known to be safe.
|
|
|
Several further transformations are done to make the code more readable. |
|
For example, renaming of local variables or use of common |
|
programming idioms. |
|
|
|
\section{Stack-based bytecode representation} |
|
\label{Stack-based bytecode representation} |
|
|
|
This is the first and simplest representation of the code. |
|
It directly corresponds to the bytecode in the .NET executable. |
|
It is the only representation which is not transformed in any way. |
|
|
|
% Cecil |
|
% Offset, Code, Operand |
|
The bytecode is loaded from the .NET executable with the |
|
help of the Cecil library. |
|
Three fundamental pieces of information are loaded for each bytecode -- |
|
its offset from the start of the method, opcode |
|
(name) and the immediate operand. |
|
The immediate operand is an additional argument
present for some bytecodes -- for example, for the \verb|call| bytecode,
the operand specifies the method that should be called.
|
|
|
% Branch target, branches here |
|
For bytecodes that can branch, the operand is the branch target. |
|
Each bytecode stores a reference to the branch target (if any) and |
|
each bytecode also stores a reference to all bytecodes that can branch to it. |
|
|
|
\subsection{Stack analysis} |
|
|
|
Once all bytecodes are loaded, the stack behavior of the program is analyzed. |
|
The precise state of the stack is determined for each bytecode. |
|
The .NET Framework dictates that this must be possible for valid executables. |
|
It is possible to determine the exact number of elements on the
stack and, for each of these elements, which bytecode
pushed it on the stack%
\footnote{In some control merge scenarios, there might be several
bytecodes that can, depending on the execution path, push the value on the stack.}
and what the type of the element is.
|
|
|
% Stack before / after |
|
% StackBehaviour -- difficult for methods |
|
There are two stack states for each bytecode -- the first is the known |
|
state \emph{before} the bytecode is executed and the second is the state |
|
\emph{after} the bytecode is executed. Knowing what the bytecode does,
the latter state can be derived from the former.
|
This is usually easy -- for example, the \verb|add| bytecode pops |
|
two elements from the stack and pushes one back. The pushed element |
|
is obviously pushed by the \verb|add| bytecode |
|
and the element is of the same type as the two popped elements% |
|
\footnote{The 'add' bytecode can only sum numbers of the same type.}. |
|
The \verb|ldstr| bytecode just pushes one element of type \verb|string|. |
|
The behaviour of the \verb|call| bytecode is slightly more complex |
|
because it depends on the method that it is invoking. |
|
Implementing these rules for all bytecodes is |
|
tedious, but usually straightforward. |
|
|
|
The whole stack analysis is performed by an iterative algorithm.
The stack \emph{before} the very first bytecode is empty.
We can use this to find the stack state \emph{after} the first bytecode.
The stack state \emph{after} the first bytecode is the
initial stack state for the bytecodes that can be executed next.
We can apply this principle again and again until all states are known.
|
|
|
% Branching -- 1) fall 2) branch 3) fall+branch veryfy same |
|
Branching dictates the propagation of the states between bytecodes. |
|
For two simple consecutive bytecodes, the stack state \emph{after}
the first bytecode is the same as the state \emph{before} the second.
Therefore we just copy the state.
|
If the bytecode is a branch then we need to copy |
|
the stack state to the target of the branch as well. |
|
|
|
Dead-code can never be executed and thus it does not have |
|
any stack state associated with it. |
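
The propagation of stack states can be sketched in C\# as follows.
This is a simplified sketch and not the decompiler's implementation:
it tracks only the stack depth (not the pushing bytecode or the type of
each element), it assumes a non-empty method, and the class
\verb|StackAnalysis|, the type \verb|Instr| and its fields are hypothetical.
\begin{verbatim}
using System;
using System.Collections.Generic;

public static class StackAnalysis
{
    // Simplified bytecode: number of popped and pushed values, an
    // optional branch target and whether control can fall through.
    public sealed class Instr
    {
        public string Name;
        public int Pops, Pushes;
        public int? Target;
        public bool FallsThrough = true;
    }

    // Returns the stack depth before each bytecode; null marks
    // dead code, which never receives a stack state.
    public static int?[] DepthBefore(IList<Instr> code)
    {
        var before = new int?[code.Count];
        var work = new Queue<int>();
        before[0] = 0;               // the stack is empty on entry
        work.Enqueue(0);
        while (work.Count > 0) {
            int i = work.Dequeue();
            int after = before[i].Value
                - code[i].Pops + code[i].Pushes;
            var next = new List<int>();
            if (code[i].FallsThrough && i + 1 < code.Count)
                next.Add(i + 1);
            if (code[i].Target.HasValue)
                next.Add(code[i].Target.Value);
            foreach (int n in next) {
                if (before[n] == null) {
                    before[n] = after;   // copy the state
                    work.Enqueue(n);
                } else if (before[n].Value != after) {
                    throw new InvalidOperationException(
                        "Inconsistent stack depth at " + n);
                }
            }
        }
        return before;
    }

    static void Main()
    {
        var code = new List<Instr> {
            new Instr { Name = "ldc.i4 1", Pushes = 1 },
            new Instr { Name = "brtrue", Pops = 1, Target = 4 },
            new Instr { Name = "ldstr", Pushes = 1 },
            new Instr { Name = "call WriteLine", Pops = 1 },
            new Instr { Name = "ret", FallsThrough = false },
            new Instr { Name = "nop" }   // dead code
        };
        int?[] depth = DepthBefore(code);
        for (int i = 0; i < code.Count; i++) {
            Console.WriteLine(code[i].Name + ": " + (depth[i].HasValue
                ? depth[i].Value.ToString() : "dead"));
        }
    }
}
\end{verbatim}
For this input the depths before the six bytecodes are 0, 1, 0, 1 and 0,
and the final \verb|nop| is reported as dead code.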
|
|
|
At the bytecode level, the .NET Runtime handles boolean values |
|
as integers. For example, the bytecode \verb|ldc.i4 1| can |
|
be interpreted equivalently as `push \verb|1|' and `push \verb|true|'. |
|
Therefore three integer types are used -- |
|
`zero integer', `non-zero integer' and generic `integer' (of |
|
unknown value). Bytecode \verb|ldc.i4 1| therefore has type |
|
`non-zero integer'. |
|
|
|
\section{Variable-based bytecode representation} |
|
\label{Variable-based bytecode representation} |
|
% Removed stack-based model -- local variables, nesting |
|
|
|
This is the next representation, which differs from the stack-based
one by completely removing the stack-based data-flow model.
Data is passed between instructions using local variables
or using nested expressions.
|
|
|
Consider the following stack-based program which evaluates |
|
$2 * 3 + 4 * 5$. |
|
\begin{verbatim} |
|
ldc.i4 2 |
|
ldc.i4 3 |
|
mul |
|
ldc.i4 4 |
|
ldc.i4 5 |
|
mul |
|
add |
|
\end{verbatim} |
|
|
|
The program is transformed to the variable-based form by
interpreting the bytecodes as functions which return values.
The result of every function call is stored in a new temporary
variable so that the value can be referred to later.
|
|
|
\begin{verbatim} |
|
int temp1 = ldc.i4(2); |
|
int temp2 = ldc.i4(3); |
|
int temp3 = mul(temp1, temp2); |
|
int temp4 = ldc.i4(4); |
|
int temp5 = ldc.i4(5); |
|
int temp6 = mul(temp4, temp5); |
|
int temp7 = add(temp3, temp6); |
|
\end{verbatim} |
|
|
|
The stack analysis performed earlier is used to determine |
|
what local variables should be used as arguments. |
|
For example, the stack analysis tells us that the stack contains |
|
precisely two elements just before the \verb|add| instruction. |
|
It also tells us that these two elements have been pushed on the stack by the |
|
\verb|mul| instructions (the third and sixth instruction respectively). |
|
Therefore to access the results of the \verb|mul| instructions, |
|
the \verb|add| instruction needs to use the temporary variables |
|
\verb|temp3| and \verb|temp6|. |
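
For straight-line code the translation can be sketched in C\# by
simulating the stack with the names of the temporary variables.
This is an illustrative sketch only: it assumes that there is no
branching and that every bytecode pushes exactly one \verb|int| result,
and the names \verb|StackToVariables|, \verb|Instr| and \verb|Translate|
are hypothetical.
\begin{verbatim}
using System;
using System.Collections.Generic;

public static class StackToVariables
{
    public sealed class Instr
    {
        public readonly string Op, Operand; // Operand may be null
        public readonly int Pops;
        public Instr(string op, string operand, int pops)
        {
            Op = op; Operand = operand; Pops = pops;
        }
    }

    // Translates straight-line stack code to variable-based form.
    public static List<string> Translate(IEnumerable<Instr> code)
    {
        var stack = new Stack<string>(); // temporaries on the stack
        var output = new List<string>();
        int counter = 0;
        foreach (Instr instr in code) {
            var args = new List<string>();
            for (int i = 0; i < instr.Pops; i++) {
                args.Insert(0, stack.Pop()); // restore push order
            }
            if (instr.Operand != null) args.Insert(0, instr.Operand);
            counter++;
            string temp = "temp" + counter;
            output.Add("int " + temp + " = " + instr.Op
                + "(" + string.Join(", ", args) + ");");
            stack.Push(temp);
        }
        return output;
    }

    static void Main()
    {
        var code = new[] {
            new Instr("ldc.i4", "2", 0), new Instr("ldc.i4", "3", 0),
            new Instr("mul", null, 2),
            new Instr("ldc.i4", "4", 0), new Instr("ldc.i4", "5", 0),
            new Instr("mul", null, 2),
            new Instr("add", null, 2)
        };
        foreach (string line in Translate(code))
            Console.WriteLine(line);
    }
}
\end{verbatim}
Run on the seven bytecodes of the example, the sketch prints the same
seven assignments as listed above, ending with
\verb|int temp7 = add(temp3, temp6);|.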
|
|
|
\subsection{Representation} |
|
|
|
Internally, each expression (`function call') consists of three |
|
main elements -- the opcode (e.g.\ \verb|ldc.i4|), the immediate operand |
|
(e.g.\ \verb|1|) and the arguments (e.g.\ \verb|temp1|). |
|
An argument can be either a reference to a local variable or
another nested expression. In the case of
|
\begin{verbatim} |
|
call(System.Console::WriteLine, ldstr("Hello, world")) |
|
\end{verbatim} |
|
\verb|call| is the opcode, |
|
\verb|System.Console::WriteLine| is the operand and |
|
\verb|ldstr("Hello, world")| is the argument (nested expression). |
|
|
|
Expressions like \verb|ldstr("Hello, world")| or \verb|ldc.i4|
do not have any arguments other than the immediate operand.
|
|
|
% stloc, ldloc |
|
There are bytecodes for storing and referencing local variables
(\verb|stloc| and \verb|ldloc| respectively).
Therefore no special data structures are needed to declare
and reference the local variables because we can store
a program like
|
\begin{verbatim} |
|
int temp1 = ldc.i4(2); |
|
int temp2 = ldc.i4(3); |
|
int temp3 = mul(temp1, temp2); |
|
\end{verbatim} |
|
only using bytecodes as |
|
\begin{verbatim} |
|
stloc(temp1, ldc.i4(2)); |
|
stloc(temp2, ldc.i4(3)); |
|
stloc(temp3, mul(ldloc(temp1), ldloc(temp2))); |
|
\end{verbatim} |
|
|
|
% Dup |
|
Local variables \verb|temp1|, \verb|temp2| and \verb|temp3| |
|
are tagged as being `single static assignment and single read'. |
|
This is a necessary property for the in-lining optimization. |
|
All generated temporary variables satisfy this property
except for the ones storing the result of a \verb|dup| instruction,
because such a variable is read twice.
|
|
|
\subsection{In-lining of expressions} |
|
\label{In-lining of expressions} |
|
|
|
The variable-based representation passes data between instructions |
|
either using local variables or using expression nesting. |
|
Expression nesting is the preferred, more readable form.
The in-lining transformation simplifies the code by removing
some temporary variables and using expression nesting instead.
|
|
|
Consider the following code. |
|
\begin{verbatim} |
|
stloc(temp1, ldstr("Hello, world!")); |
|
call(System.Console::WriteLine, ldloc(temp1)); |
|
\end{verbatim} |
|
This code can be simplified into a single line by in-lining the
local variable \verb|temp1|
(that is, the expression stored by \verb|stloc| is nested within the
\verb|call| expression):
|
\begin{verbatim} |
|
call(System.Console::WriteLine, ldstr("Hello, world!")); |
|
\end{verbatim} |
|
|
|
The algorithm is iterative. Two consecutive lines are considered |
|
and if the first line assigns to a local variable which is used |
|
in the second line then the variable is in-lined. |
|
This is repeated until no more optimizations can be done. |
|
|
|
There are some safety concerns with this optimization.
|
The optimization would break the program if the local variable |
|
was read later in the program. |
|
Therefore the optimization is done only for variables that have |
|
the property of being `single static assignment and single read'. |
|
|
|
The second concern is change of execution order. |
|
Consider the following pseudo-code: |
|
\begin{verbatim} |
|
temp1 = f(); |
|
add(g(), temp1); |
|
\end{verbatim} |
|
In-lining \verb|temp1| would change the order in which the
functions \verb|f| and \verb|g| are called and thus the
optimization cannot be performed.
|
On the other hand, the following code can be optimized: |
|
\begin{verbatim} |
|
temp1 = f(); |
|
add(1, temp1); |
|
\end{verbatim} |
|
The effect of the optimization will be that the expression
\verb|1| will be evaluated before \verb|f| rather than
after it. However, this does not change the semantics
of the program because evaluation of \verb|1| does not have
any side-effects and it will still evaluate to the same value.
|
|
|
More importantly, the same property holds for \verb|ldloc| (load local variable) |
|
instruction. It does not have any side-effects and it will |
|
evaluate to the same value even if evaluated before the |
|
function \verb|f| (or any other expression being in-lined). |
|
This is true because the function \verb|f| |
|
(or any other expression being in-lined) |
|
can not change the value of the local variable. |
|
% The only way to change a local |
|
% variable would be using the \verb|stloc| instruction. |
|
% However, \verb|stloc| can not be part of the in-lined |
|
% expression because it does not return any value - |
|
% it is a statement and as such can not be nested within an expression. |
|
|
|
% The property holds for \verb|1|, \verb|ldloc| and it would hold |
|
% for some other expressions as well, but due to the |
|
% way the stack-based model was removed it is sufficient |
|
% to consider only \verb|ldloc|. This is by far the most common case. |
|
|
|
To sum up, it is safe to in-line a local variable if it has
the property `single static assignment and single read' and
if the argument referencing it is preceded only
by \verb|ldloc| instructions.
|
|
|
The following code would be optimized in two iterations: |
|
\begin{verbatim} |
|
stloc(temp1, ldc.i4(2)); |
|
stloc(temp2, ldc.i4(3)); |
|
stloc(temp3, mul(ldloc(temp1), ldloc(temp2))); |
|
\end{verbatim} |
|
First iteration (in-line \verb|temp2|): |
|
\begin{verbatim} |
|
stloc(temp1, ldc.i4(2)); |
|
stloc(temp3, mul(ldloc(temp1), ldc.i4(3))); |
|
\end{verbatim} |
|
Second iteration (in-line \verb|temp1|): |
|
\begin{verbatim} |
|
stloc(temp3, mul(ldc.i4(2), ldc.i4(3))); |
|
\end{verbatim} |
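
The in-lining rule and the iteration shown above can be sketched in C\#.
This is a minimal sketch, not the decompiler's code: the classes
\verb|Expr| and \verb|Inliner| are hypothetical, and the set of variables
with the `single static assignment and single read' property is assumed
to be supplied by the caller.
\begin{verbatim}
using System;
using System.Collections.Generic;

public sealed class Expr
{
    public readonly string Op, Operand;   // Operand may be null
    public readonly List<Expr> Args = new List<Expr>();
    public Expr(string op, string operand, params Expr[] args)
    {
        Op = op; Operand = operand; Args.AddRange(args);
    }
    public override string ToString()
    {
        var parts = new List<string>();
        if (Operand != null) parts.Add(Operand);
        foreach (Expr arg in Args) parts.Add(arg.ToString());
        return Op + "(" + string.Join(", ", parts) + ")";
    }
}

public static class Inliner
{
    // Replaces ldloc(name) in 'e' by 'value' if only ldloc
    // expressions are evaluated before it.
    static bool TryInline(Expr e, string name, Expr value)
    {
        for (int i = 0; i < e.Args.Count; i++) {
            Expr arg = e.Args[i];
            if (arg.Op != "ldloc") {
                // 'arg' is evaluated before any later argument, so
                // the variable must be found inside it or not at all.
                return TryInline(arg, name, value);
            }
            if (arg.Operand == name) {
                e.Args[i] = value;
                return true;
            }
        }
        return false;
    }

    // 'inlinable' holds variables assigned once and read once.
    public static void Run(List<Expr> code, ISet<string> inlinable)
    {
        bool changed = true;
        while (changed) {
            changed = false;
            for (int i = 0; i + 1 < code.Count; i++) {
                Expr first = code[i];
                if (first.Op == "stloc"
                    && inlinable.Contains(first.Operand)
                    && TryInline(code[i + 1], first.Operand,
                                 first.Args[0])) {
                    code.RemoveAt(i);
                    changed = true;
                    break;
                }
            }
        }
    }

    static void Main()
    {
        var code = new List<Expr> {
            new Expr("stloc", "temp1", new Expr("ldc.i4", "2")),
            new Expr("stloc", "temp2", new Expr("ldc.i4", "3")),
            new Expr("stloc", "temp3", new Expr("mul", null,
                new Expr("ldloc", "temp1"),
                new Expr("ldloc", "temp2")))
        };
        Run(code, new HashSet<string> { "temp1", "temp2" });
        Console.WriteLine(code[0]);
    }
}
\end{verbatim}
For the three statements of the example the sketch performs the same
two iterations and leaves the single statement
\verb|stloc(temp3, mul(ldc.i4(2), ldc.i4(3)))|.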
|
|
|
|
|
\subsection{In-lining of `dup' instruction} |
|
|
|
The \verb|dup| instruction can not be in-lined because |
|
it does not satisfy the `single static assignment and single read' |
|
property. This is because the data is referenced twice. |
|
For example, in-lining the following code would cause |
|
the function \verb|f| to be called twice. |
|
\begin{verbatim} |
|
stloc(temp1, dup(f())); |
|
add(temp1, temp1); |
|
\end{verbatim} |
|
However, there are circumstances where this would be acceptable. |
|
If the expression within the \verb|dup| instruction is a |
|
constant then it is possible to in-line it without any harm. |
|
The following code can be optimized: |
|
\begin{verbatim} |
|
stloc(temp1, dup(ldc.i4(2))); |
|
add(temp1, temp1); |
|
\end{verbatim} |
|
Even more elaborate expressions like
\verb|mul(ldc.i4(2), ldc.i4(3))| are still constant.
The instruction \verb|ldarg.0| (load \verb|this|) is also
constant within a single invocation of a method.
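
A possible test for such constant expressions is sketched below in C\#.
The classification of opcodes is an assumption made for this sketch
(only \verb|ldc.i4|, \verb|ldarg.0|, \verb|add| and \verb|mul| are
listed), and the names \verb|DupInlining| and \verb|IsConstant| are
hypothetical.
\begin{verbatim}
using System;
using System.Collections.Generic;
using System.Linq;

public static class DupInlining
{
    public sealed class Expr
    {
        public readonly string Op;
        public readonly Expr[] Args;
        public Expr(string op, params Expr[] args)
        {
            Op = op; Args = args;
        }
    }

    // Assumed classification; only a few opcodes are listed.
    static readonly HashSet<string> ConstantLeaves =
        new HashSet<string> { "ldc.i4", "ldarg.0" };
    static readonly HashSet<string> PureOperators =
        new HashSet<string> { "add", "mul" };

    // True if evaluating 'e' twice is as good as evaluating it once.
    public static bool IsConstant(Expr e)
    {
        if (ConstantLeaves.Contains(e.Op)) return true;
        return PureOperators.Contains(e.Op)
            && e.Args.All(IsConstant);
    }

    static void Main()
    {
        var product = new Expr("mul",
            new Expr("ldc.i4"), new Expr("ldc.i4"));
        Console.WriteLine(IsConstant(product));          // True
        Console.WriteLine(IsConstant(new Expr("call"))); // False
    }
}
\end{verbatim}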
|
|
|
|
|
|
|
\section{High-level blocks representation} |
|
\label{High-level blocks representation} |
|
% Links |
|
% Link regeneration |
|
|
|
So far the code is just a sequential list of statements.
The statements are in exactly the same order as found in the
original executable and all control-flow is achieved
only by the \verb|goto| statements.
|
|
|
The `high-level block' representation structure introduces |
|
greater freedom to the code. High-level structures are recovered |
|
and the code can be arbitrarily reordered. |
|
|
|
The code is organized into blocks in this representation.
A block can contain executable code as well as other nested blocks.
Therefore this resembles a tree-like data structure.
The root block of the tree represents the method and contains
all of its code.
A node in the tree represents some high-level structure
(loop, \verb|if| statement). Nodes can also represent `meta'
high-level structures which are used during the optimizations.
For example, a node can encapsulate a block of code that is known
to be acyclic (without loops).
Finally, all leaves in the tree are basic blocks.
|
|
|
\subsection{Safety} |
|
This data representation allows the code to be restructured |
|
and reordered. Therefore care is needed to make sure that |
|
the produced code is semantically correct. |
|
|
|
% Validity -- arbitrary nodes |
|
It would be possible to make sure that every transformation is
valid. However, there is an even more robust solution --
give up at the start and allow arbitrary transformations.
The only constraint is that the code can only be moved. It cannot
be duplicated and it cannot be deleted. Deleting code would
indeed cause problems. Anything else is permitted.
|
|
|
When the code is generated, the correctness is ensured by
explicit \verb|label|s and \verb|goto|s placed in front
of and after every basic block respectively. For example,
consider the following piece of code in which the order of
basic blocks was reversed and two unnecessary high-level
nodes were added:
|
|
|
\begin{verbatim} |
|
goto BasicBlock1; // Method prologue |
|
|
|
for(;;) { |
|
BasicBlock2: |
|
Console.WriteLine("world"); |
|
goto End; |
|
} |
|
|
|
if (true) { |
|
BasicBlock1: |
|
Console.WriteLine("Hello"); |
|
goto BasicBlock2; |
|
} |
|
|
|
End: |
|
\end{verbatim} |
|
|
|
The inclusion of explicit \verb|label|s and \verb|goto|s |
|
makes the high-level structures and the order of basic blocks |
|
irrelevant and thus the produced source code will be correct. |
|
Note that in the code above the lines \verb|for(;;) {| and |
|
\verb|if (true) {| will never be reached during execution. |
|
|
|
In general, the transformations done on the code will be
more sane than in the example above and most of the
\verb|label|s and \verb|goto|s will be redundant.
In the following code all of the \verb|label|s and \verb|goto|s
can be removed:
|
|
|
\begin{verbatim} |
|
goto BasicBlock1; // Method prologue |
|
|
|
BasicBlock1: |
|
Console.WriteLine("Hello"); |
|
goto BasicBlock2; |
|
|
|
BasicBlock2: |
|
Console.WriteLine("world"); |
|
goto End; |
|
|
|
End: |
|
\end{verbatim} |
|
|
|
The removal of \verb|label|s and \verb|goto|s is done once |
|
the Abstract Syntax Tree is created. |
|
|
|
To sum up, finding and correctly identifying all high-level |
|
structures will produce nice results, but failing to |
|
do so will not cause any harm as far as correctness is concerned. |
|
|
|
|
|
\subsection{Tree data structure} |
|
|
|
The data structure to store the tree is very simple -- |
|
the tree consists of nodes where each node contains a link |
|
to its parent and a list of children. A child can be either |
|
another node or a basic block (leaf). |
|
|
|
The API of the tree structure is severely restricted.
Once the tree is created, it is only possible to add new empty
nodes and to move basic blocks from one node to another.
This ensures that basic blocks cannot be duplicated or
deleted, which is the safety requirement discussed in the
previous section.
All transformations are limited by this constraint.
|
|
|
Each node also provides events which are fired whenever |
|
the child collection of the node is changed. |
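
A tree with such a restricted API could look roughly like the following
C\# sketch. The class and member names (\verb|Node|, \verb|BasicBlock|,
\verb|AddEmptyNode|, \verb|MoveHere|, \verb|ChildrenChanged|) are
hypothetical and chosen only for this illustration; the real classes
of the decompiler may differ.
\begin{verbatim}
using System;
using System.Collections.Generic;

public sealed class BasicBlock
{
    public readonly string Name;
    public Node Parent { get; internal set; }
    public BasicBlock(string name) { Name = name; }
}

public sealed class Node
{
    // A child is either a Node or a BasicBlock.
    readonly List<object> children = new List<object>();

    public Node Parent { get; private set; }
    public IReadOnlyList<object> Children
    {
        get { return children; }
    }

    // Fired whenever the child collection of this node changes.
    public event EventHandler ChildrenChanged;

    // Adds a new empty node; existing nodes cannot be added.
    public Node AddEmptyNode()
    {
        var node = new Node { Parent = this };
        children.Add(node);
        OnChildrenChanged();
        return node;
    }

    // Moves a basic block here. It is removed from its old parent
    // first, so it can be neither duplicated nor lost.
    public void MoveHere(BasicBlock block)
    {
        Node oldParent = block.Parent;
        if (oldParent == this) return;
        if (oldParent != null) {
            oldParent.children.Remove(block);
            oldParent.OnChildrenChanged();
        }
        block.Parent = this;
        children.Add(block);
        OnChildrenChanged();
    }

    void OnChildrenChanged()
    {
        EventHandler handler = ChildrenChanged;
        if (handler != null) handler(this, EventArgs.Empty);
    }
}

public static class TreeDemo
{
    static void Main()
    {
        var root = new Node();
        var block = new BasicBlock("BasicBlock1");
        root.MoveHere(block);            // initial placement
        Node loop = root.AddEmptyNode();
        loop.MoveHere(block);            // move into the new node
        Console.WriteLine(root.Children.Count + " "
            + loop.Children.Count);      // prints 1 1
    }
}
\end{verbatim}
Because the child list is exposed only as a read-only view, a
transformation cannot remove or copy a basic block by accident; it can
only call the two methods above.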
|
|
|
\subsection{Creating basic blocks} |
|
|
|
The previous data representation stored the code as a sequence |
|
of statements. Some of these statements will be always |
|
executed sequentially without interference of branching. |
|
In control-flow analysis we are primarily interested in |
|
branching and therefore it is desirable to group such |
|
sequential parts of code together to basic blocks. |
|
|
|
Basic block is a block of code that is always guaranteed to |
|
be executed together without any branching. Basic blocks |
|
can be easily found by determining which statements start |
|
a basic block. The very first statement in a method |
|
naturally starts the first basic block. Branching transfers |
|
control somewhere else so the statement immediately following |
|
a branch is the start of a basic block.
Similarly, the target of a branch is the start of
a basic block.
|
|
|
Using these three simple rules the whole method body |
|
is split into a few basic blocks.
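
The three rules can be sketched as follows. The \verb|Statement|
class and the \verb|FindLeaders| function are hypothetical
simplifications introduced for this example; only the branch
information of each statement is modelled.

\begin{verbatim}
using System;
using System.Collections.Generic;

// Hypothetical, simplified statement: only branch information matters.
class Statement
{
    public bool IsBranch;          // conditional or unconditional branch
    public int BranchTarget = -1;  // index of the target statement, if any
}

static class BasicBlockFinder
{
    // Returns the indices of statements that start a basic block.
    public static List<int> FindLeaders(IList<Statement> body)
    {
        bool[] isLeader = new bool[body.Count];
        if (body.Count > 0) isLeader[0] = true;              // rule 1
        for (int i = 0; i < body.Count; i++) {
            if (!body[i].IsBranch) continue;
            if (i + 1 < body.Count) isLeader[i + 1] = true;  // rule 2
            if (body[i].BranchTarget >= 0)
                isLeader[body[i].BranchTarget] = true;       // rule 3
        }
        List<int> leaders = new List<int>();
        for (int i = 0; i < body.Count; i++)
            if (isLeader[i]) leaders.Add(i);
        return leaders;
    }

    static void Main()
    {
        // Five statements; statement 1 is a branch to statement 4.
        List<Statement> body = new List<Statement>();
        for (int i = 0; i < 5; i++) body.Add(new Statement());
        body[1].IsBranch = true;
        body[1].BranchTarget = 4;
        foreach (int leader in FindLeaders(body))
            Console.WriteLine(leader);  // prints 0, 2 and 4
    }
}
\end{verbatim}

Each basic block then runs from one leader up to, but not including,
the next one.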
|
|
|
It is desirable to store control-flow links between basic blocks. |
|
These links are used for finding high-level structures. |
|
Each basic block has \emph{successors} and \emph{predecessors}. |
|
A \emph{successor} is another basic block that may be executed
|
immediately after this one. |
|
There can be up to two \emph{successors} -- the following basic |
|
block and the basic block being branched to. Both of these |
|
exist if the basic block ends with a conditional branch.
|
% If the condition is false, the control falls though to the |
|
% following block and if the condition is true the branch is taken. |
|
A \emph{predecessor} is just a link in the opposite direction
|
of the \emph{successor} link. Note that the number of |
|
\emph{predecessors} is not limited -- several \verb|goto|s |
|
can branch to the same location. |
|
|
|
|
|
\subsection{Control-flow links between nodes} |
|
|
|
The \emph{predecessor} and \emph{successor} links of basic |
|
blocks reflect the control flow between basic blocks. |
|
However, once basic blocks are `merged' by moving them |
|
into a node, we might be interested in flow properties |
|
of the whole node rather than just the individual basic blocks.
|
Consider two sibling nodes that represent two loops. |
|
In order to put the nodes in the correct order, we need to know
which one is the \emph{successor} of the other.
|
|
|
Therefore the \emph{predecessor} and \emph{successor} links |
|
apply to whole nodes as well. Consider two sibling nodes $A$ and $B$. |
|
Node $B$ is a \emph{successor} of node $A$ if there exists |
|
a basic block within node $A$ that branches to a basic block |
|
within node $B$. |
|
|
|
The algorithm to calculate \emph{successors} of a node is |
|
as follows:
|
\begin{itemize} |
|
\item Get a list of all basic blocks that are children of |
|
node $A$. The node can contain other nested nodes so this part |
|
needs to be performed recursively. |
|
\item Get a list of all succeeding basic blocks of node $A$. |
|
This is the union of all successors for all basic blocks |
|
within the node. |
|
\item However, we do not want succeeding basic blocks, |
|
we want the succeeding siblings of the node. Therefore,
for each succeeding basic block, traverse the data structure
upwards until a sibling of the node is reached.
|
\end{itemize} |
|
|
|
The algorithm to calculate \emph{predecessors} of a node is similar. |
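
The three steps above can be sketched as follows. The \verb|Node|
class with public fields and the \verb|GetSuccessors| function are
assumptions made for this example. Caching is left out, and branches
that stay inside the node are simply ignored here.

\begin{verbatim}
using System;
using System.Collections.Generic;

// Hypothetical minimal tree node used only for this sketch.
class Node
{
    public Node Parent;
    public bool IsBasicBlock;
    public List<Node> Children = new List<Node>();
    // Only meaningful for basic blocks: direct control-flow successors.
    public List<Node> BlockSuccessors = new List<Node>();

    public Node Add(Node child)
    {
        child.Parent = this;
        Children.Add(child);
        return child;
    }
}

static class FlowLinks
{
    // Step 1: collect all basic blocks nested anywhere inside the node.
    static void CollectBasicBlocks(Node node, List<Node> result)
    {
        if (node.IsBasicBlock) { result.Add(node); return; }
        foreach (Node child in node.Children)
            CollectBasicBlocks(child, result);
    }

    public static List<Node> GetSuccessors(Node node)
    {
        List<Node> blocks = new List<Node>();
        CollectBasicBlocks(node, blocks);

        List<Node> successors = new List<Node>();
        foreach (Node block in blocks) {
            // Step 2: consider every succeeding basic block.
            foreach (Node target in block.BlockSuccessors) {
                // Step 3: walk up until a sibling of 'node' is reached.
                Node current = target;
                while (current != null && current.Parent != node.Parent)
                    current = current.Parent;
                if (current != null && current != node
                    && !successors.Contains(current))
                    successors.Add(current);
            }
        }
        return successors;
    }

    static void Main()
    {
        Node root = new Node();
        Node a = root.Add(new Node());
        Node b = root.Add(new Node());
        Node b1 = a.Add(new Node { IsBasicBlock = true });
        Node b2 = a.Add(new Node { IsBasicBlock = true });
        Node b3 = b.Add(new Node { IsBasicBlock = true });
        b1.BlockSuccessors.Add(b2);  // stays inside node a
        b2.BlockSuccessors.Add(b3);  // leaves node a, enters node b

        List<Node> successors = GetSuccessors(a);
        Console.WriteLine(successors.Count == 1 && successors[0] == b); // True
    }
}
\end{verbatim}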
|
|
|
The \emph{predecessor} and \emph{successor} links are used |
|
extensively and it would be computationally expensive to |
|
recalculate them every time they are needed. Therefore the |
|
links and intermediate results of the calculation are cached. |
|
The events of the tree structure are used to invalidate relevant |
|
caches. |
|
|
|
\subsection{Finding loops} |
|
\label{Finding loops} |
|
|
|
The loops are found by using the T1-T2 transformations which were
|
discussed in the preparation chapter. |
|
|
|
There are two node types that directly correspond to the results |
|
of these two transformations. Performing a T1 transformation
|
produces a node of type \emph{Loop} and performing a T2 transformation |
|
produces a node of type \emph{AcyclicGraph}. |
|
|
|
The algorithm works by considering all the nodes and basic blocks |
|
in the tree individually and evaluating the conditions required for the |
|
transformations. |
|
If a transformation can be performed for a given node, it is immediately
performed and the algorithm is restarted on the new transformed graph.
The condition for the T1 transformation (\emph{Loop}) is that the node
must be a self-loop. The condition for the T2 transformation
|
(\emph{AcyclicGraph}) is that the node must have only one predecessor. |
|
|
|
By construction each \emph{AcyclicGraph} node contains exactly |
|
two nodes. For example, five sequential statements would be |
|
represented as four nested \emph{AcyclicGraph} nodes. |
|
This makes the tree more complex and more difficult to analyse. |
|
The acyclic property is only required for loop finding and is not |
|
needed for anything else later on. Therefore once loops are found, |
|
the \emph{AcyclicGraph} nodes are flattened (the children of the node |
|
are moved to its parent and the node is removed). |
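
The reduction loop can be sketched on a plain graph of integers.
This is a simplified illustration rather than the decompiler's
implementation: the graph is a hypothetical dictionary of successor
sets, node 0 is assumed to be the entry point, and instead of building
\emph{Loop} and \emph{AcyclicGraph} nodes the sketch only logs which
transformation was applied.

\begin{verbatim}
using System;
using System.Collections.Generic;

static class T1T2
{
    // succ[n] is the set of successors of node n; node 0 is the entry.
    // After a merge, the surviving number stands for the merged node.
    public static List<string> Reduce(Dictionary<int, HashSet<int>> succ)
    {
        List<string> log = new List<string>();
        bool changed = true;
        while (changed) {
            changed = false;
            foreach (int n in new List<int>(succ.Keys)) {
                // T1: the node is a self-loop.
                if (succ[n].Contains(n)) {
                    succ[n].Remove(n);
                    log.Add("T1: Loop(" + n + ")");
                    changed = true;
                    break;  // restart on the transformed graph
                }
                // T2: the node has exactly one predecessor.
                List<int> preds = new List<int>();
                foreach (KeyValuePair<int, HashSet<int>> pair in succ)
                    if (pair.Value.Contains(n)) preds.Add(pair.Key);
                if (n != 0 && preds.Count == 1) {
                    int p = preds[0];
                    succ[p].Remove(n);
                    succ[p].UnionWith(succ[n]);  // may create a self-loop
                    succ.Remove(n);
                    log.Add("T2: AcyclicGraph(" + p + ", " + n + ")");
                    changed = true;
                    break;  // restart on the transformed graph
                }
            }
        }
        return log;
    }

    static void Main()
    {
        // 0 -> 1, 1 -> 2, 2 -> 1 (back edge), 2 -> 3
        Dictionary<int, HashSet<int>> succ =
            new Dictionary<int, HashSet<int>>();
        succ[0] = new HashSet<int>(new int[] { 1 });
        succ[1] = new HashSet<int>(new int[] { 2 });
        succ[2] = new HashSet<int>(new int[] { 1, 3 });
        succ[3] = new HashSet<int>();

        foreach (string step in Reduce(succ)) Console.WriteLine(step);
        Console.WriteLine(succ.Count);  // 1: the graph is fully reduced
    }
}
\end{verbatim}

Merging node 2 into node 1 turns the back edge into a self-loop on the
merged node, which is exactly the situation the T1 transformation
recognises.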
|
|
|
Note that the created loops are trivial -- the \emph{Loop} node |
|
represents an infinite loop without initializer, condition or |
|
iterator. The specifics of the loop are not represented |
|
in this tree data structure. However, they are eventually |
|
recovered in the abstract syntax tree phase of the decompiler. |
|
|
|
\subsection{Finding conditionals} |
|
\label{Finding conditionals} |
|
% Reachable sets |
|
|
|
In order to recover conditional statements three new node types |
|
are introduced: \emph{Condition}, \emph{Block} and \emph{IfStatement}. |
|
\emph{Condition} represents a piece of code that is guaranteed to |
|
branch into one of two locations. \emph{Block} has no special |
|
semantics and merely groups several nodes into one. |
|
\emph{IfStatement} represents the whole conditional including |
|
the condition and two bodies. |
|
|
|
The \emph{IfStatement} node is still a genuine node in the tree |
|
data structure. However, subtyping is used to restrict its children. |
|
The \emph{IfStatement} node must have precisely three children |
|
of types \emph{Condition}, \emph{Block} and \emph{Block} representing |
|
the condition, `true' body and `false' body respectively. |
|
This defines the semantics of the \emph{IfStatement} node. |
|
|
|
The algorithm starts by encapsulating every two-successor basic block
in a \emph{Condition} node. Two-successor basic blocks are conditional
|
branches and thus they are conditions of \verb|if| statements. |
|
Note that the bytecode \verb|br| branches unconditionally and thus |
|
it is not a condition of an \verb|if| statement (basic blocks ending
|
with \verb|br| have only one successor). |
|
|
|
The iterative part of the algorithm is to look for free-standing |
|
\emph{Condition} nodes and encapsulate them with \emph{IfStatement} nodes. |
|
This involves finding the `true' and `false' bodies for the if statements. |
|
|
|
The `true' and `false' bodies of an \verb|if| statement are found by |
|
considering reachability as discussed in the preparation chapter. |
|
Nodes reachable only by following the `true' branch define the `true' |
|
body of the \verb|if| statement. Similarly for the `false' body. |
|
The remaining nodes form the set reachable from both branches; it is not
|
part of the \verb|if| statement -- it is the code following the |
|
\verb|if| statement. Note that these three sets are disjoint.
|
|
|
The reachable nodes are found by an iterative algorithm in
|
which successors are added to a collection until the collection |
|
does not change anymore. |
|
Nodes that are reachable \emph{only} from the `true' branch are |
|
obtained by taking the set reachable from `true' branch and removing |
|
nodes reachable from the `false' branch. |
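
A minimal sketch of this computation is given below. It assumes a
hypothetical graph in which nodes are integers and \verb|succ| maps
each node to its successors; the function name \verb|ReachableFrom|
is likewise chosen only for this example.

\begin{verbatim}
using System;
using System.Collections.Generic;

static class Reachability
{
    // Adds successors to the set until it stops growing.
    public static HashSet<int> ReachableFrom(
        int start, Dictionary<int, int[]> succ)
    {
        HashSet<int> reached = new HashSet<int>();
        reached.Add(start);
        bool changed = true;
        while (changed) {
            changed = false;
            foreach (int n in new List<int>(reached))
                foreach (int s in succ[n])
                    if (reached.Add(s)) changed = true;
        }
        return reached;
    }

    static void Main()
    {
        // 0: condition, 1: 'true' branch, 2: 'false' branch,
        // 3: code following the if statement
        Dictionary<int, int[]> succ = new Dictionary<int, int[]>();
        succ[0] = new int[] { 1, 2 };
        succ[1] = new int[] { 3 };
        succ[2] = new int[] { 3 };
        succ[3] = new int[0];

        HashSet<int> fromTrue = ReachableFrom(1, succ);   // {1, 3}
        HashSet<int> fromFalse = ReachableFrom(2, succ);  // {2, 3}

        HashSet<int> trueBody = new HashSet<int>(fromTrue);
        trueBody.ExceptWith(fromFalse);                   // {1}
        HashSet<int> falseBody = new HashSet<int>(fromFalse);
        falseBody.ExceptWith(fromTrue);                   // {2}
        HashSet<int> following = new HashSet<int>(fromTrue);
        following.IntersectWith(fromFalse);               // {3}

        Console.WriteLine(trueBody.Contains(1) && falseBody.Contains(2)
                          && following.Contains(3));      // True
    }
}
\end{verbatim}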
|
|
|
Consider the special case where there are no nodes reachable from both |
|
the `true' and `false' branches. In this case, there is no code |
|
following the \verb|if| statement. Such code might look like this: |
|
\begin{verbatim} |
|
if (condition) { |
|
return null; |
|
} else { |
|
return result; |
|
} |
|
\end{verbatim} |
|
|
|
In this special case the `false' body is moved outside the \verb|if| |
|
statement producing more compact code: |
|
\begin{verbatim} |
|
if (condition) {
|
return null; |
|
} |
|
return result; |
|
\end{verbatim} |
|
|
|
|
|
\subsection{Short-circuit conditionals} |
|
% Short-circuit |
|
|
|
Consider a condition like \verb|a & b| (\verb|a| and \verb|b|).
This expression, and even more complex ones, compile into code without
|
any branches. Therefore the expression will be contained within |
|
a single basic block and thus the approach described so far |
|
would work well. |
|
|
|
Conditions like \verb|a && b| are more difficult to handle |
|
(\verb|a| and \verb|b|, but evaluate \verb|b| only if \verb|a| |
|
is true). These short-circuit conditionals introduce additional |
|
branching and thus they are more difficult to decompile. |
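
To illustrate the additional branching, the following sketch writes
\verb|if (a && B())| with explicit conditional jumps. It only
approximates the shape of the code a compiler may emit, and the names
\verb|B| and \verb|WithExplicitBranches| are invented for this example.

\begin{verbatim}
using System;

static class ShortCircuit
{
    static int evaluationsOfB;

    static bool B() { evaluationsOfB++; return true; }

    // Behaves like:
    //   if (a && B()) return "true body"; else return "false body";
    static string WithExplicitBranches(bool a)
    {
        if (!a) goto FalseBody;    // first branch; B() is skipped
        if (!B()) goto FalseBody;  // second branch
        return "true body";
    FalseBody:
        return "false body";
    }

    static void Main()
    {
        Console.WriteLine(WithExplicitBranches(false)); // false body
        Console.WriteLine(evaluationsOfB);              // 0
        Console.WriteLine(WithExplicitBranches(true));  // true body
        Console.WriteLine(evaluationsOfB);              // 1
    }
}
\end{verbatim}

The single source-level condition thus appears as two two-successor
basic blocks in the control-flow graph.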
|
|
|
% The short-circuit conditionals will manifest themselves as |
|
% one of four flow graphs. See figure \ref{impl_shortcircuit} |
|
% (repeated from the preparation chapter). |
|
|
|
% Implementation |
|
% \begin{figure}[tbh] |
|
% \centerline{\includegraphics{figs/shortcircuit.png}} |
|
% \caption{\label{impl_shortcircuit}Short-circuit control-flow graphs} |
|
% \end{figure} |
|
|
|
The expression \verb|a| is always evaluated first. |
|
The flow graph is characterized by two things -- in which case the |
|
expression \verb|b| is evaluated and which path is taken if the |
|
expression \verb|b| is not evaluated. |
|
|
|
The \emph{Condition} node was previously defined as a piece of
|
code that is guaranteed to branch into one of two locations. |
|
This still holds for the short-circuit flow graphs. If two nodes
|
that evaluate \verb|a| and \verb|b| are considered together, |
|
they still define a proper condition for an \verb|if| statement. |
|
We define a new node type \emph{ShortCircuitCondition} |
|
which is a special case of \emph{Condition}. |
|
This node has exactly two children (expressions \verb|a| and |
|
\verb|b|) and a meta-data field which specifies which one of |
|
the four short-circuit patterns it represents. This is for
|
convenience so that the pattern matching does not have to be |
|
done again in the future. |
|
|
|
The whole program is repeatedly searched and if one of the four patterns
|
is found, the two nodes are merged into a \emph{ShortCircuitCondition} node. |
|
This new node represents the combined expression (e.g.\ \verb|a && b|). |
|
Since it is a single node, it is eligible to fit into the pattern
again and thus nested short-circuit expressions will also be
|
found (e.g.\ \verb/(a && b) || (c && d)/). |
|
|
|
Note that this is an extension to the previous algorithm rather |
|
than a new one. It needs to be performed just before
|
\emph{Condition} nodes are encapsulated in \emph{IfStatement} nodes. |
|
|
|
|
|
\section{Abstract Syntax Tree representation} |
|
\label{Abstract Syntax Tree representation} |
|
|
|
The abstract syntax tree (AST) is the last representation of the program.
|
It very closely corresponds to the final textual output. |
|
|
|
\subsection{Data structure} |
|
|
|
An external library is used to hold the abstract syntax tree.
|
|
|
Initially the \emph{CodeDom} library that ships with the |
|
\emph{.NET Framework} was used, but it was soon abandoned due |
|
to lack of support for many crucial C\# features. |
|
|
|
It was replaced with \emph{NRefactory} which is an open-source |
|
library used in SharpDevelop for parsing and representation of |
|
C\# and VB.NET files. |
|
It focuses exclusively on these two languages and |
|
thus provides excellent support for them. |
|
|
|
NRefactory allows perfect control over the generated code. |
|
Parsing an arbitrary C\# file and then generating it again
will almost exactly reproduce the original text. Whitespace
would be reformatted, but everything else is explicitly represented
|
in NRefactory's abstract syntax tree. |
|
For example, \verb|(a + b)| and \verb|a + b| have |
|
different representations and so do \verb|if (a) return;| and |
|
\verb|if (a) { return; }|. |
|
|
|
The decompiler aims to generate not only correct code, but also as |
|
readable code as possible. Therefore this precise control is desirable. |
|
|
|
|
|
\subsection{Generating skeleton} |
|
|
|
The first task is to generate a code skeleton that is
|
equivalent to the one in the executable. |
|
The \emph{Cecil} library helps with reading the executable
|
meta-data and thus it is not necessary to work directly with |
|
the binary format of the executable. |
|
|
|
The following four object types can be found in .NET executables: |
|
classes, structures, interfaces and enumerations. |
|
For the purpose of logical organization these can be nested |
|
within namespaces and, in some cases, nested within each other. |
|
Classes can be inherited and may implement one or more interfaces. |
|
Depending on the type, the objects can contain fields, |
|
properties, events and methods. |
|
A specific set of modifiers can be applied to almost everything.
|
|
|
Handling all these cases is tedious, but usually straightforward. |
|
|
|
\subsection{Generating method bodies} |
|
\label{Generating method bodies} |
|
|
|
Once the skeleton is created, high-level blocks and the bytecode |
|
expressions contained within them are translated into an
abstract syntax tree. For example, the \emph{Loop} node is translated
|
to a \emph{ForStatement} in the AST and the bytecode expression \verb|add| |
|
is translated to \emph{BinaryOperatorExpression} in the AST. |
|
|
|
In terms of lines of code, this is by far the largest part of the |
|
decompiler. The reason for this is that there are many bytecodes |
|
and every bytecode needs to have its own specific translation code. |
|
|
|
Conceptually\footnote{In reality there are more functions |
|
that are loosely coupled.}, |
|
the translation code consists of three main functions: |
|
\emph{TranslateMethodBody}, \emph{TranslateHighLevelNode} and |
|
\emph{TranslateBytecodeExpression}. |
|
All of these return a complete abstract syntax tree for the given input.
|
|
|
\emph{TranslateMethodBody} is the simplest function, which encapsulates
the whole algorithm. It takes a tree of high-level blocks as input and
|
produces the complete abstract syntax tree of the method. |
|
Internally, it calls \emph{TranslateHighLevelNode} for each top-level |
|
node and then merely concatenates the results. |
|
|
|
\emph{TranslateHighLevelNode} translates one particular high-level |
|
node into an abstract syntax tree. There are several types of high-level
nodes and each of them needs to be considered individually.
|
If the high-level node has child nodes then this function is |
|
called recursively. Some comments about particular node types: |
|
The \emph{BasicBlock} node needs to be preceded by an explicit label
and followed by an explicit \verb|goto| in order to ensure safety.
|
The \emph{Loop} node is easy to handle since it is just an infinite loop. |
|
The \emph{IfStatement} node is the most difficult to handle since the
condition for the \verb|if| statement has to be created.
Remember that the condition can include short-circuit booleans and
|
that these can be arbitrarily nested. |
|
|
|
\emph{TranslateBytecodeExpression} is the lowest level function. |
|
It translates the individual bytecode expressions. |
|
Note that a bytecode expression may have other bytecode expressions
|
as arguments and therefore this function is often called recursively. |
|
Getting the abstract syntax trees for all arguments is in fact the
|
first thing that is done. After that these arguments are combined in |
|
a way that is appropriate for the given bytecode and operand. |
|
This logic is completely different for each bytecode. |
|
A simple example is the \verb|add| bytecode --
|
in this case the arguments end up being children of |
|
\emph{BinaryOperatorExpression} (which is an AST node). |
|
At this point we have the abstract syntax tree representing the |
|
bytecode expression. |
|
The expression is further enclosed in parentheses for safety
and the type of the expression is recorded in the AST meta-data.
|
|
|
Unsupported bytecodes are handled gracefully and are represented as |
|
function calls. For example, if the \verb|add| bytecode were
not implemented, it would be output as \verb|IL__add(1, 2)|
rather than \verb|1 + 2|.
|
|
|
Recording the type of the AST is useful so that it can be converted
depending on the context. If an AST of type \verb|IntegerOne| is
used in a context where \verb|bool| is expected, the AST can
|
be simply replaced by \verb|true|. |
|
|
|
Here is a list of bytecodes whose translation into abstract syntax |
|
tree has been implemented |
|
(star is used to represent several bytecodes with similar names): |
|
|
|
Arithmetic: \texttt{add* div* mul* rem* sub* and xor shl shr* neg not} \\ |
|
Arrays: \texttt{newarr ldlen ldelem.* stelem.*} \\ |
|
Branching: \texttt{br brfalse brtrue beq bge* bgt* ble* blt* bne.un} \\ |
|
Comparison: \texttt{ceq cgt* clt*} \\ |
|
Conversions: \texttt{conv.*.*} \\ |
|
Object-oriented: \texttt{newobj castclass call callvirt} \\ |
|
Constants: \texttt{ldnull ldc.* ldstr ldtoken} \\ |
|
Data access: \texttt{ldarg ldfld stfld ldsfld stsfld ldloc stloc} \\ |
|
Miscellaneous: \texttt{nop dup ret} |
|
|
|
The complexity of implementing a translation for a bytecode varies
from a single line (\verb|nop|) to over a page of code (\verb|callvirt|).
|
|
|
|
|
\subsection{Optimizations} |
|
|
|
The produced abstract syntax tree represents a fully functional program. |
|
That is, it can be pretty printed and the generated source code would |
|
compile and work without problems. |
|
However, the produced source code is still quite verbose and |
|
therefore several optimizations are done to simplify it or make it `nicer'. |
|
|
|
The optimizations are implemented by using the visitor pattern |
|
provided by the NRefactory library. For each optimization |
|
a new class is created which inherits from the \emph{AstVisitor} |
|
class. An instance of this class is created and applied to
|
the root of the abstract syntax tree. |
|
As a result of this, NRefactory will traverse the whole abstract |
|
syntax tree and notify the visitor (the optimization) about |
|
every element it encounters. For example, whenever a \verb|goto| |
|
statement is encountered, the method \verb|VisitGotoStatement| |
|
will be invoked on the visitor. If the optimization wants to |
|
act upon this, it has to override the \verb|VisitGotoStatement| method. |
|
The abstract syntax tree can be modified during the traversal. |
|
For example, when the \verb|goto| statement is encountered, |
|
the visitor (the optimization) can modify it, replace it or simply |
|
remove it. |
|
|
|
\subsubsection{Removal of dead labels} |
|
\label{Removal of dead labels} |
|
|
|
This is not the first optimization to be executed, but it is the |
|
simplest one and therefore it is explained first. |
|
|
|
The purpose of this optimization is to remove dead labels. |
|
That is, to remove all labels that are not referenced by one |
|
or more \verb|goto| statements. |
|
|
|
The optimization uses two visitors. The first visitor overrides |
|
the \verb|VisitGotoStatement| method and is thus notified about |
|
all \verb|goto| statements in the method. The body of the overridden |
|
\verb|VisitGotoStatement| method records the name of the label |
|
being targeted and puts it into an `alive' list.
|
|
|
The second visitor overrides the \verb|VisitLabelStatement| |
|
method and is thus notified about all labels. If the label |
|
is not in the `alive' list, it is removed. |
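
The two passes can be sketched without NRefactory as follows.
The statement classes below are hypothetical stand-ins for the real
AST nodes, the body is a flat list rather than a tree, and plain loops
take the place of the two visitors.

\begin{verbatim}
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the AST statement classes.
abstract class Statement { }
class GotoStatement : Statement { public string Label; }
class LabelStatement : Statement { public string Name; }
class ExpressionStatement : Statement { public string Code; }

static class DeadLabels
{
    public static void Remove(List<Statement> body)
    {
        // Pass 1: record every label targeted by a goto.
        HashSet<string> alive = new HashSet<string>();
        foreach (Statement s in body) {
            GotoStatement g = s as GotoStatement;
            if (g != null) alive.Add(g.Label);
        }
        // Pass 2: drop labels that were never targeted.
        body.RemoveAll(delegate(Statement s) {
            LabelStatement l = s as LabelStatement;
            return l != null && !alive.Contains(l.Name);
        });
    }

    static void Main()
    {
        List<Statement> body = new List<Statement>();
        body.Add(new LabelStatement { Name = "BasicBlock1" });
        body.Add(new ExpressionStatement { Code = "Console.WriteLine();" });
        body.Add(new GotoStatement { Label = "End" });
        body.Add(new LabelStatement { Name = "End" });

        Remove(body);
        Console.WriteLine(body.Count);  // 3: only BasicBlock1 was dead
    }
}
\end{verbatim}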
|
|
|
Note that all optimizations are done on a per-method basis
|
and therefore we do not have to worry about name clashes |
|
of labels from different methods. |
|
|
|
|
|
\subsubsection{Removal of negations} |
|
\label{Removal of negations} |
|
|
|
In some cases negations can be removed by using the rules of |
|
logic. Negations are represented as \emph{UnaryOperatorExpression} |
|
in the abstract syntax tree. |
|
|
|
When a negation is encountered, it is matched against the following |
|
patterns and simplified if possible. |
|
|
|
\verb|!((x)) = !(x)| \\ |
|
\verb|!!(x) = (x)| \\ |
|
\verb|!(a > b) = (a <= b)| \\ |
|
\verb|!(a >= b) = (a < b)| \\ |
|
\verb|!(a < b) = (a >= b)| \\ |
|
\verb|!(a <= b) = (a > b)| \\ |
|
\verb|!(a == b) = (a != b)| \\ |
|
\verb|!(a != b) = (a == b)| \\ |
|
\verb:!(a & b) = (!(a) | !(b)): \\ |
|
\verb:!(a && b) = (!(a) || !(b)): \\ |
|
\verb:!(a | b) = (!(a) & !(b)): \\ |
|
\verb:!(a || b) = (!(a) && !(b)): \\ |
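
These rewrite rules can be sketched on a small expression type.
The \verb|Expr| class below is a hypothetical stand-in for
NRefactory's expression nodes, and parentheses are produced only
when printing, so the first rule has no counterpart in the sketch.

\begin{verbatim}
using System;
using System.Collections.Generic;

// Hypothetical expression node; Op is null for a leaf.
class Expr
{
    public string Op, Name;
    public Expr Left, Right;   // unary operators use Left only

    public static Expr Leaf(string name)
    {
        Expr e = new Expr(); e.Name = name; return e;
    }
    public static Expr Unary(string op, Expr x)
    {
        Expr e = new Expr(); e.Op = op; e.Left = x; return e;
    }
    public static Expr Binary(string op, Expr l, Expr r)
    {
        Expr e = new Expr(); e.Op = op; e.Left = l; e.Right = r; return e;
    }
    public override string ToString()
    {
        if (Op == null) return Name;
        if (Right == null) return Op + "(" + Left + ")";
        return "(" + Left + " " + Op + " " + Right + ")";
    }
}

static class Negations
{
    static readonly Dictionary<string, string> Opposite =
        new Dictionary<string, string> {
            { ">", "<=" }, { ">=", "<" }, { "<", ">=" }, { "<=", ">" },
            { "==", "!=" }, { "!=", "==" }
        };
    static readonly Dictionary<string, string> DeMorgan =
        new Dictionary<string, string> {
            { "&", "|" }, { "&&", "||" }, { "|", "&" }, { "||", "&&" }
        };

    // Simplifies a negation; anything else is returned unchanged.
    public static Expr Simplify(Expr e)
    {
        if (e.Op != "!" || e.Left.Op == null) return e;
        Expr x = e.Left;
        string op;
        if (x.Op == "!") return x.Left;              // !!(x) = (x)
        if (Opposite.TryGetValue(x.Op, out op))      // !(a > b) = (a <= b)
            return Expr.Binary(op, x.Left, x.Right);
        if (DeMorgan.TryGetValue(x.Op, out op))      // !(a && b) = ...
            return Expr.Binary(op,
                Simplify(Expr.Unary("!", x.Left)),
                Simplify(Expr.Unary("!", x.Right)));
        return e;
    }

    static void Main()
    {
        Expr cond = Expr.Binary(">=", Expr.Leaf("i"), Expr.Leaf("10"));
        Console.WriteLine(Simplify(Expr.Unary("!", cond)));  // (i < 10)

        Expr both = Expr.Binary("&&", Expr.Leaf("a"),
                                Expr.Unary("!", Expr.Leaf("b")));
        Console.WriteLine(Simplify(Expr.Unary("!", both)));  // (!(a) || b)
    }
}
\end{verbatim}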
|
|
|
\subsubsection{Removal of parenthesis} |
|
\label{Removal of parenthesis} |
|
% C precedence and associativity |
|
|
|
To ensure safety, the abstract syntax tree contains a large
number of parentheses. Even a simple increment would be expressed
|
as \verb|((i) = ((i) + (1)))|. |
|
|
|
This optimization removes parentheses where it is known to be safe.
|
|
|
Parentheses can be removed from primitive values (\verb|(1)|),
|
identifiers (\verb|(i)|) and already parenthesized expressions |
|
(\verb|((...))|). |
|
|
|
Some parentheses can also be removed due to the unambiguous
|
context they are in. This includes cases like |
|
\verb|return (...);|, \verb|array[(...)]|, \verb|if((...))|, |
|
\verb|(...);| and several others. |
|
|
|
The remaining cases are governed by the C\# precedence
|
and associativity rules. There are 15 precedence groups in |
|
C\# and expressions within the same group are left-associative% |
|
\footnote{Only the assignment and conditional (?:) operators
are right-associative, but these are never generated in nested
|
form by the decompiler.}. |
|
|
|
The optimization implements a function \verb|GetPrecedence| that |
|
returns the precedence for the given expression as an integer. |
|
|
|
The majority of expressions are binary expressions. When a binary
|
expression is encountered, precedence is calculated for it |
|
and both of its operands. |
|
For example consider \verb|(a + b) - (c * d)|. |
|
The precedence of the main binary expression is 12 and the |
|
precedences of the operands are 12 and 13 respectively. |
|
Since the right operand has higher precedence than the binary
expression itself, its parentheses can be removed.
The left operand has the same precedence. However, in this special
case (left operand, same precedence), parentheses can be removed
as well due to the left-associativity of C\# operators.
|
|
|
Similar, but simpler, logic applies to unary expressions. |
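
The decision for binary expressions can be sketched as follows.
The precedence table is a hypothetical excerpt covering only a few
operators, with values taken from the example above, and
\verb|NeedsParentheses| is a name chosen for this illustration.

\begin{verbatim}
using System;
using System.Collections.Generic;

static class Parentheses
{
    // Hypothetical excerpt; a higher number binds more tightly.
    static readonly Dictionary<string, int> Precedence =
        new Dictionary<string, int> {
            { "*", 13 }, { "/", 13 }, { "%", 13 },
            { "+", 12 }, { "-", 12 }
        };

    public static int GetPrecedence(string op) { return Precedence[op]; }

    // True if the parentheses around an operand must be kept.
    public static bool NeedsParentheses(
        string parentOp, string operandOp, bool isLeftOperand)
    {
        int parent = GetPrecedence(parentOp);
        int operand = GetPrecedence(operandOp);
        if (operand > parent) return false;     // operand binds tighter
        if (operand == parent && isLeftOperand)
            return false;                       // left-associativity
        return true;
    }

    static void Main()
    {
        // (a + b) - (c * d): both pairs can be removed
        Console.WriteLine(NeedsParentheses("-", "+", true));   // False
        Console.WriteLine(NeedsParentheses("-", "*", false));  // False
        // a - (b + c): same precedence on the right, must be kept
        Console.WriteLine(NeedsParentheses("-", "+", false));  // True
    }
}
\end{verbatim}

The last call shows why the same-precedence rule applies to the left
operand only: \verb|a - (b + c)| and \verb|a - b + c| are different
expressions.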
|
|
|
\subsubsection{Removal of `goto' statements} |
|
\label{Removal of `goto' statements} |
|
% Goto X; labe X: ; |
|
% Replacing with break/continue |
|
|
|
This is the most complex optimization performed on the |
|
abstract syntax tree. Its purpose is to remove or replace
|
\verb|goto| statements. |
|
|
|
In the following case it is obvious that the \verb|goto| can be removed: |
|
\begin{verbatim} |
|
// Code |
|
goto Label; |
|
Label: |
|
// Code |
|
\end{verbatim} |
|
|
|
However, in general, more complex scenarios can occur. |
|
It is not immediately obvious that the \verb|goto| statement |
|
can be removed in the following code: |
|
|
|
\begin{verbatim} |
|
if (condition1) { |
|
if (condition2) { |
|
// Code |
|
goto Label; |
|
} else { |
|
// Code |
|
} |
|
} |
|
for(;;) { |
|
for(;;) { |
|
Label: |
|
// Code |
|
} |
|
} |
|
\end{verbatim} |
|
|
|
This optimization is based on simulation of execution under |
|
various conditions. |
|
There are three important questions that the simulation answers: \\ |
|
What would happen if the \verb|goto| was replaced by no-op (i.e.\ removed)? \\ |
|
What would happen if the \verb|goto| was replaced by \verb|break|? \\ |
|
What would happen if the \verb|goto| was replaced by \verb|continue|? |
|
|
|
In the first case, the \verb|goto| is replaced by |
|
no-op and the simulator is started at that location. |
|
The simulation traverses the AST until
it finds the next executable line of code or a label.
|
If the reached location is the same as the one that would be reached |
|
by following the original \verb|goto| then the original \verb|goto| |
|
can be removed. |
|
|
|
Similar simulation runs are performed by replacing the |
|
\verb|goto| by \verb|break| and \verb|continue|. |
|
|
|
|
|
|
|
After this optimization is done, it is desirable to remove the dead labels. |
|
|
|
\subsubsection{Simplifying loops} |
|
\label{Simplifying loops} |
|
|
|
The \verb|for| loop statement has the following form in C\#: |
|
\begin{verbatim} |
|
for(initializer; condition; iterator) { body }; |
|
\end{verbatim} |
|
|
|
The semantics of the \verb|for| loop statement is: |
|
\begin{verbatim} |
|
initializer; |
|
for(;;) { |
|
if (!condition) break; |
|
body; |
|
iterator; |
|
} |
|
\end{verbatim} |
|
|
|
So far the decompiler has generated only infinite loops of the form
|
\verb|for(;;)|. Now it is time to make the loops more interesting. |
|
This optimization does pattern matching on the code and tries |
|
to identify the initializer, condition or iterator. |
|
|
|
If the loop is preceded by something like \verb|int i = 0;| then it is most |
|
likely the initializer of the loop and it will be pushed into |
|
the \verb|for| statement producing \verb|for(int i = 0;;)|. |
|
If it is not really the initializer, it does not matter --
the semantics are the same.
|
|
|
If the first line of the body is \verb|if (not_condition) break;|, |
|
it is assumed to be the condition for the loop and it is pushed |
|
in to produce \verb|for(;!not_condition;)|. |
|
|
|
If the last line of the body is \verb|i = ...| or \verb|i++|, |
|
it is assumed to be the iterator. |
|
|
|
Consider the following code: |
|
\begin{verbatim} |
|
int i = 0; |
|
for(;;) { |
|
if (i >= 10) break; |
|
// body |
|
i++; |
|
} |
|
\end{verbatim} |
|
|
|
Performing this optimization will simplify the code to: |
|
\begin{verbatim} |
|
for(int i = 0; !(i >= 10); i++) { |
|
// body |
|
} |
|
\end{verbatim} |
|
|
|
This is then further simplified by the `removal of negation' |
|
optimization to: |
|
|
|
\begin{verbatim} |
|
for(int i = 0; i < 10; i++) { |
|
// body |
|
} |
|
\end{verbatim} |
|
|
|
|
|
\subsubsection{Further optimizations} |
|
\label{Further optimizations} |
|
|
|
Several further optimizations are done which are briefly |
|
discussed here. |
|
|
|
% Rename variables -- i, j,k ... |
|
Integer local variables are renamed from the more generic names |
|
like \verb|V_0| to \verb|i|, \verb|j|, \verb|k|, and so on. |
|
Boolean local variables are renamed to \verb|flag1|, \verb|flag2| |
|
and so on. |
|
|
|
% Type names |
|
C\# aliases for types are used. For example, \verb|System.Int32| |
|
is replaced with \verb|int|. C\# \verb|using| (\verb|import| in |
|
Java) is used to simplify \verb|System.Console| to just \verb|Console|.
Types in the same namespace also do not have to be
|
fully qualified. |
|
|
|
% Remove `this' reference |
|
Explicit \verb|this.field| is simplified to just \verb|field|. |
|
|
|
% Idioms -- i++ |
|
Increments of the form \verb|i = i + 1| are simplified to \verb|i++|.
Increments of the form \verb|i = i + 2| are simplified to \verb|i += 2|.
|
Analogous forms hold for decrements. |
|
|
|
% Idioms -- string concat, i++ |
|
String concatenation \verb|String.Concat("Hello", "world")| |
|
can be expressed in C\# just as \verb|"Hello" + "world"|. |
|
|
|
Sometimes the `false' body of an \verb|if| statement can end up being
|
empty and can be harmlessly removed. |
|
|
|
Some control-flow commands are sometimes redundant since the |
|
implicit behavior is equivalent. For example, \verb|return;| |
|
at the end of a method is redundant and \verb|continue;| at
the end of a loop is redundant.
|
|
|
|
|
\subsection{Pretty Printing} |
|
|
|
Pretty printing% |
|
\footnote{Pretty printing is the act of converting abstract |
|
syntax tree into a textual form.} |
|
is provided as part of the NRefactory library and therefore does not |
|
need to be implemented as part of the decompiler. |
|
NRefactory provides several options that control
|
the format of the produced source code (for example whether an opening |
|
curly brace is placed on the same line as \verb|if| statement or on |
|
the following line). |
|
|
|
NRefactory can output the code in both C\# and VB.NET.
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
|
% 10 Pages |
|
\cleardoublepage |
|
\chapter{Evaluation} |
|
|
|
\newcommand{\impref}[1]{See section \emph{\ref{#1} #1} on page \pageref{#1} for reference.} |
|
|
|
\section{Success Criteria} |
|
|
|
The principal criterion was that the quick-sort algorithm must successfully |
|
decompile. This goal was achieved. |
|
The decompiled source code is, in fact, almost identical to the original. |
|
This is demonstrated at the end of the following section. |
|
|
|
\section{Evolution of quick-sort} |
|
|
|
This section demonstrates how the quick-sort algorithm is decompiled. |
|
|
|
Four different representations are used during the decompilation process. |
|
The output of each of these representations is shown. |
|
|
|
Unless otherwise specified the shown pieces of code are |
|
verbatim outputs of the decompiler. |
|
|
|
\subsection{Original source code} |
|
|
|
To save space we will initially consider only the \verb|Main| method of |
|
the quick-sort algorithm. It parses textual arguments, |
|
performs the quick-sort and then prints the result. |
|
|
|
\begin{verbatim} |
|
public static void Main(string[] args) |
|
{ |
|
int[] intArray = new int[args.Length]; |
|
for (int i = 0; i < intArray.Length; i++) { |
|
intArray[i] = int.Parse(args[i]); |
|
} |
|
QuickSort(intArray, 0, intArray.Length - 1); |
|
for (int i = 0; i < intArray.Length; i++) { |
|
System.Console.Write(intArray[i].ToString() + " "); |
|
} |
|
} |
|
\end{verbatim} |
|
|
|
This code is compiled with all optimizations turned on and its
binary form is considered in the following sections.
|
|
|
\subsection{Stack-based bytecode representation} |
|
\impref{Stack-based bytecode representation} |
|
|
|
This is a snippet of the bytecode as loaded from the executable. |
|
This part represents the line \verb|intArray[i] = int.Parse(args[i]);|. |
|
\verb|V_0| is \verb|intArray| and \verb|V_1| is \verb|i|. |
|
|
|
Comments denote the stack behaviour of the bytecodes. |
|
|
|
The (manually) starred bytecodes represent just the part
|
\verb|int.Parse(args[i]);|. |
|
|
|
\begin{verbatim} |
|
IL_0F: ldloc V_0 # Pop0->Push1 |
|
IL_10: ldloc V_1 # Pop0->Push1 |
|
IL_11: ldarg args * # Pop0->Push1 |
|
IL_12: ldloc V_1 * # Pop0->Push1 |
|
IL_13: ldelem.ref * # Popref_popi->Pushref |
|
IL_14: call Parse() * # Varpop->Varpush Flow=Call |
|
IL_19: stelem.i4 # Popref_popi_popi->Push0 |
|
\end{verbatim} |
|
|
|
The used bytecodes are \verb|ldloc| (load local variable), |
|
\verb|ldarg| (load argument), \verb|ldelem| (get array element), |
|
\verb|call| (call method) and \verb|stelem| (set array element). |
|
|
|
Here is the same snippet again after the stack analysis |
|
has been performed. |
|
|
|
\begin{verbatim} |
|
// Stack: {} |
|
IL_0F: ldloc V_0 # Pop0->Push1 |
|
// Stack: {IL_0F} |
|
IL_10: ldloc V_1 # Pop0->Push1 |
|
// Stack: {IL_0F, IL_10} |
|
IL_11: ldarg args # Pop0->Push1 |
|
// Stack: {IL_0F, IL_10, IL_11} |
|
IL_12: ldloc V_1 # Pop0->Push1 |
|
// Stack: {IL_0F, IL_10, IL_11, IL_12} |
|
IL_13: ldelem.ref # Popref_popi->Pushref |
|
// Stack: {IL_0F, IL_10, IL_13} |
|
IL_14: call Parse() # Varpop->Varpush Flow=Call |
|
// Stack: {IL_0F, IL_10, IL_14} |
|
IL_19: stelem.i4 # Popref_popi_popi->Push0 |
|
// Stack: {} |
|
\end{verbatim} |
|
|
|
For example, we can see that bytecode \verb|IL_13: ldelem.ref| pops |
|
the last two elements from the stack -- \verb|{IL_11, IL_12}|. |
|
These will hold the values \verb|args| and \verb|V_1| respectively. |
|
|
|
The bytecode \verb|IL_19: stelem.i4| pops three values. |
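
The stack states shown above can be reproduced with a small
simulation. This is a sketch for straight-line code only: the
\verb|Instruction| class is hypothetical, and the pop and push counts
are entered by hand from the listing rather than read from an
executable.

\begin{verbatim}
using System;
using System.Collections.Generic;

// Hypothetical instruction record: a label plus its stack behaviour.
class Instruction
{
    public string Label;
    public int Pops, Pushes;

    public Instruction(string label, int pops, int pushes)
    {
        Label = label; Pops = pops; Pushes = pushes;
    }
}

static class StackAnalysis
{
    static string Describe(List<string> stack)
    {
        return "// Stack: {" + string.Join(", ", stack.ToArray()) + "}";
    }

    static void Main()
    {
        Instruction[] code = {
            new Instruction("IL_0F", 0, 1),  // ldloc V_0
            new Instruction("IL_10", 0, 1),  // ldloc V_1
            new Instruction("IL_11", 0, 1),  // ldarg args
            new Instruction("IL_12", 0, 1),  // ldloc V_1
            new Instruction("IL_13", 2, 1),  // ldelem.ref
            new Instruction("IL_14", 1, 1),  // call Parse()
            new Instruction("IL_19", 3, 0)   // stelem.i4
        };

        // Each stack slot records which instruction pushed the value.
        List<string> stack = new List<string>();
        foreach (Instruction i in code) {
            Console.WriteLine(Describe(stack));
            Console.WriteLine(i.Label);
            stack.RemoveRange(stack.Count - i.Pops, i.Pops);
            for (int n = 0; n < i.Pushes; n++) stack.Add(i.Label);
        }
        Console.WriteLine(Describe(stack));  // // Stack: {}
    }
}
\end{verbatim}

Code with branches additionally requires the stack states of all
incoming paths to be combined at each branch target, which this sketch
does not attempt.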
|
|
|
\subsection{Variable-based bytecode representation} |
|
\impref{Variable-based bytecode representation} |
|
|
|
This is the same snippet after the stack-based data-flow
has been replaced with a variable-based one
(the code has been manually simplified for clarity):
|
|
|
\begin{verbatim} |
|
expr0D = ldloc(V_0); |
|
expr0E = ldloc(i); |
|
expr0F = ldarg(args); |
|
expr10 = ldloc(i); |
|
expr11 = ldelem.ref(expr0F, expr10); |
|
expr12 = call(Parse(), expr11); |
|
stelem.i4(expr0D, expr0E, expr12); |
|
\end{verbatim} |
|
|
|
Here is the actual unedited output: |
|
|
|
\begin{verbatim} |
|
stloc(expr0D, ldloc(V_0)); |
|
stloc(expr0E, ldloc(i)); |
|
stloc(expr0F, ldarg(args)); |
|
stloc(expr10, ldloc(i)); |
|
stloc(expr11, ldelem.ref(ldloc(expr0F), ldloc(expr10))); |
|
stloc(expr12, call(Parse(), ldloc(expr11))); |
|
stelem.i4(ldloc(expr0D), ldloc(expr0E), ldloc(expr12)); |
|
\end{verbatim} |
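
The replacement of the stack by temporaries can be sketched in a few lines.
This Python fragment is only an illustration under assumed names
(\verb|to_variable_form|, temporaries named after the instruction label);
the decompiler itself numbers its temporaries differently, as the output above shows.

\begin{verbatim}
# Hypothetical sketch: replace stack-based data flow with temporaries.
# Each instruction is (label, opcode, operand, pop_count, push_count).
SNIPPET = [
    ("0F", "ldloc", "V_0", 0, 1),
    ("10", "ldloc", "i", 0, 1),
    ("11", "ldarg", "args", 0, 1),
    ("12", "ldloc", "i", 0, 1),
    ("13", "ldelem.ref", None, 2, 1),
    ("14", "call", "Parse()", 1, 1),
    ("19", "stelem.i4", None, 3, 0),
]

def to_variable_form(instructions):
    stack = []   # names of the temporaries on the simulated stack
    lines = []
    for label, opcode, operand, pops, pushes in instructions:
        # The popped temporaries become the arguments of the expression.
        args = stack[len(stack) - pops:]
        del stack[len(stack) - pops:]
        parts = ([operand] if operand is not None else []) + args
        expr = "%s(%s)" % (opcode, ", ".join(parts))
        if pushes:
            temp = "expr" + label
            stack.append(temp)
            lines.append("%s = %s;" % (temp, expr))
        else:
            lines.append(expr + ";")
    return lines

for line in to_variable_form(SNIPPET):
    print(line)
\end{verbatim}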
|
|
|
\subsubsection{In-lining of expressions} |
|
\impref{In-lining of expressions} |
|
|
|
Here is the same snippet after two in-lining optimizations.
This is the actual output; the GUI of the decompiler allows the
optimizations to be performed step by step.
|
|
|
\begin{verbatim} |
|
stloc(expr0D, ldloc(V_0)); |
|
stloc(expr0E, ldloc(i)); |
|
stloc(expr11, ldelem.ref(ldarg(args), ldloc(i))); |
|
stloc(expr12, call(Parse(), ldloc(expr11))); |
|
stelem.i4(ldloc(expr0D), ldloc(expr0E), ldloc(expr12)); |
|
\end{verbatim} |
|
|
|
After all possible in-lines have been performed:
|
|
|
\begin{verbatim} |
|
stelem.i4(ldloc(V_0), ldloc(i), |
|
call(Parse(), ldelem.ref(ldarg(args), ldloc(i)))); |
|
\end{verbatim} |
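
A minimal Python sketch of such in-lining follows.
It is an illustration under stated assumptions rather than the decompiler's algorithm:
statements are plain strings, a temporary is recognised by the prefix \verb|expr|,
and it is in-lined only into the directly following statement when that is its single use.
Side effects and evaluation order are not checked.

\begin{verbatim}
import re

STATEMENTS = [
    ("expr0D", "ldloc(V_0)"),
    ("expr0E", "ldloc(i)"),
    ("expr0F", "ldarg(args)"),
    ("expr10", "ldloc(i)"),
    ("expr11", "ldelem.ref(expr0F, expr10)"),
    ("expr12", "call(Parse(), expr11)"),
    (None, "stelem.i4(expr0D, expr0E, expr12)"),
]

def inline_temporaries(statements):
    """statements: list of (target, expression); target is None for a
    plain expression statement."""
    stmts = list(statements)
    changed = True
    while changed:
        changed = False
        for k in range(len(stmts) - 1):
            target, expr = stmts[k]
            if target is None or not target.startswith("expr"):
                continue
            pattern = re.compile(r"\b%s\b" % re.escape(target))
            uses = [len(pattern.findall(e)) for _, e in stmts]
            # In-line only if the single use is in the next statement.
            if uses[k + 1] == 1 and sum(uses) == 1:
                next_target, next_expr = stmts[k + 1]
                merged = pattern.sub(lambda m: expr, next_expr)
                stmts[k + 1] = (next_target, merged)
                del stmts[k]
                changed = True
                break
    return stmts

print(inline_temporaries(STATEMENTS))
\end{verbatim}

Applied to the seven statements of the snippet, the sketch ends with the single
\verb|stelem.i4| statement shown above.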
|
|
|
|
|
\subsection{Abstract Syntax Tree representation (part 1)} |
|
\impref{Abstract Syntax Tree representation} |
|
|
|
Let us omit the optimization of high-level structures
for the moment and skip directly to the Abstract Syntax Tree.
|
|
|
After transforming a snippet of code to variable-based bytecode |
|
and then performing the in-lining optimization we have obtained |
|
|
|
\begin{verbatim} |
|
stelem.i4(ldloc(V_0), ldloc(i), |
|
call(Parse(), ldelem.ref(ldarg(args), ldloc(i)))); |
|
\end{verbatim} |
|
|
|
Let us see the same result for a larger part of the method:
|
|
|
\begin{verbatim} |
|
BasicBlock_1: stloc(V_0, newarr(System.Int32, |
|
conv.i4(ldlen(ldarg(args))))); |
|
BasicBlock_2: stloc(i, ldc.i4(0)); |
|
BasicBlock_3: goto BasicBlock_6; |
|
BasicBlock_4: stelem.i4(ldloc(V_0), ldloc(i), call(Parse(), |
|
ldelem.ref(ldarg(args), ldloc(i)))); |
|
BasicBlock_5: stloc(i, @add(ldloc(i), ldc.i4(1))); |
|
BasicBlock_6: if (ldloc(i) < conv.i4(ldlen(ldloc(V_0)))) |
|
goto BasicBlock_4; |
|
BasicBlock_7: call(QuickSort(), ldloc(V_0), ldc.i4(0), |
|
sub(conv.i4(ldlen(ldloc(V_0))), ldc.i4(1))); |
|
\end{verbatim} |
|
|
|
This code includes the first \verb|for| loop of the \verb|Main| |
|
method. Note that \verb|br| and \verb|blt| were already |
|
translated to \verb|goto| and \verb|if (...) goto|. |
|
|
|
\subsubsection{Translation of bytecodes to C\# expressions} |
|
\impref{Generating method bodies} |
|
|
|
Let us translate just the bytecodes \verb|ldloc|, \verb|ldarg| |
|
and \verb|ldc.i4|. |
|
|
|
\begin{verbatim} |
|
BasicBlock_1: stloc(V_0, newarr(System.Int32, |
|
conv.i4(ldlen((args))))); |
|
BasicBlock_2: stloc(i, (0)); |
|
BasicBlock_3: goto BasicBlock_6; |
|
BasicBlock_4: stelem.i4((V_0), (i), call(Parse(), |
|
ldelem.ref((args), (i)))); |
|
BasicBlock_5: stloc(i, @add((i), (1))); |
|
BasicBlock_6: if ((i) < conv.i4(ldlen((V_0)))) goto BasicBlock_4; |
|
BasicBlock_7: call(QuickSort(), (V_0), (0), |
|
sub(conv.i4(ldlen((V_0))), (1))); |
|
\end{verbatim} |
|
|
|
This is a small improvement. Note the safety
parentheses that were inserted during the translation
(e.g.\ \verb|(0)|, \verb|(args)|).
|
|
|
After all translations are performed the code will look like this: |
|
|
|
\begin{verbatim} |
|
BasicBlock_1: System.Int32[] V_0 = |
|
(new int[((int)((args).Length))]); |
|
BasicBlock_2: int i = (0); |
|
BasicBlock_3: goto BasicBlock_6; |
|
BasicBlock_4: ((V_0)[(i)] = (System.Int32.Parse(((args)[(i)])))); |
|
BasicBlock_5: (i = ((i) + (1))); |
|
BasicBlock_6: if ((i) < ((int)((V_0).Length))) goto BasicBlock_4; |
|
BasicBlock_7: (QuickSortProgram.QuickSort((V_0), (0), |
|
(((int)((V_0).Length)) - (1)))); |
|
\end{verbatim} |
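
The translation with safety parentheses can be pictured as a recursive walk over the expression tree.
The following Python sketch covers only the bytecodes of basic block 4;
the tuple encoding and the name \verb|translate| are assumptions made for this illustration
and do not reflect the decompiler's actual data structures.

\begin{verbatim}
# Hypothetical sketch: a node is (opcode, operand or child nodes...).
def translate(node):
    op, args = node[0], node[1:]
    if op in ("ldloc", "ldarg"):
        return "(%s)" % args[0]
    if op == "ldc.i4":
        return "(%d)" % args[0]
    if op == "ldelem.ref":
        return "(%s[%s])" % (translate(args[0]), translate(args[1]))
    if op == "stelem.i4":
        array, index, value = (translate(a) for a in args)
        return "(%s[%s] = %s)" % (array, index, value)
    if op == "call":
        # args[0] is the method name, the rest are argument nodes.
        arguments = ", ".join(translate(a) for a in args[1:])
        return "(%s(%s))" % (args[0], arguments)
    raise NotImplementedError(op)

BLOCK_4 = ("stelem.i4",
           ("ldloc", "V_0"),
           ("ldloc", "i"),
           ("call", "System.Int32.Parse",
            ("ldelem.ref", ("ldarg", "args"), ("ldloc", "i"))))

print(translate(BLOCK_4) + ";")
\end{verbatim}

Every translated node is wrapped in parentheses, so the result is safe regardless of operator precedence;
the redundant ones are removed in a later step.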
|
|
|
\subsubsection{Removal of parenthesis} |
|
\impref{Removal of parenthesis} |
|
|
|
C\# rules of precedence are used to remove unnecessary parentheses:
|
|
|
\begin{verbatim} |
|
BasicBlock_1: System.Int32[] V_0 = new int[(int)args.Length]; |
|
BasicBlock_2: int i = 0; |
|
BasicBlock_3: goto BasicBlock_6; |
|
BasicBlock_4: V_0[i] = System.Int32.Parse(args[i]); |
|
BasicBlock_5: i = i + 1; |
|
BasicBlock_6: if (i < (int)V_0.Length) goto BasicBlock_4; |
|
BasicBlock_7: QuickSortProgram.QuickSort(V_0, 0, (int)V_0.Length - 1); |
|
\end{verbatim} |
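
One way to picture this step is a printer that emits parentheses only where precedence demands them.
The Python sketch below is an illustration under assumptions:
the precedence table covers just a few left-associative C\# binary operators,
and the names \verb|PRECEDENCE| and \verb|show| are hypothetical.

\begin{verbatim}
# Higher number binds tighter (a subset of the C# binary operators).
PRECEDENCE = {"||": 1, "&&": 2, "==": 3, "!=": 3, "<": 4, ">": 4,
              "+": 5, "-": 5, "*": 6, "/": 6}

def show(node, parent_prec=0, right_operand=False):
    """node is either a string (leaf) or (operator, left, right)."""
    if isinstance(node, str):
        return node
    op, left, right = node
    prec = PRECEDENCE[op]
    text = "%s %s %s" % (show(left, prec), op, show(right, prec, True))
    # Left-associative operators: a right operand of equal precedence
    # still needs parentheses, e.g. a - (b - c).
    if prec < parent_prec or (right_operand and prec == parent_prec):
        return "(" + text + ")"
    return text

print(show(("<", "i", ("-", "V_0.Length", "1"))))
print(show(("-", "a", ("-", "b", "c"))))
\end{verbatim}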
|
|
|
\subsubsection{Removal of dead labels} |
|
\impref{Removal of dead labels} |
|
|
|
Unreferenced labels are removed. |
|
|
|
\begin{verbatim} |
|
System.Int32[] V_0 = new int[(int)args.Length]; |
|
int i = 0; |
|
goto BasicBlock_6; |
|
BasicBlock_4: V_0[i] = System.Int32.Parse(args[i]); |
|
i = i + 1; |
|
BasicBlock_6: if (i < (int)V_0.Length) goto BasicBlock_4; |
|
QuickSortProgram.QuickSort(V_0, 0, (int)V_0.Length - 1); |
|
\end{verbatim} |
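
The removal of dead labels amounts to collecting the targets of all \verb|goto|s
and dropping every other label.
A minimal Python sketch, working on source lines as text and using the hypothetical name
\verb|remove_dead_labels| (the decompiler itself operates on the syntax tree):

\begin{verbatim}
import re

def remove_dead_labels(lines):
    """Drop 'Label:' prefixes that no 'goto Label;' refers to."""
    referenced = set()
    for line in lines:
        referenced.update(re.findall(r"goto\s+(\w+)\s*;", line))
    result = []
    for line in lines:
        match = re.match(r"(\w+):\s*(.*)", line)
        if match and match.group(1) not in referenced:
            line = match.group(2)   # keep the statement, drop the label
        result.append(line)
    return result

CODE = [
    "BasicBlock_2: int i = 0;",
    "BasicBlock_3: goto BasicBlock_6;",
    "BasicBlock_4: V_0[i] = System.Int32.Parse(args[i]);",
    "BasicBlock_5: i = i + 1;",
    "BasicBlock_6: if (i < (int)V_0.Length) goto BasicBlock_4;",
]

for line in remove_dead_labels(CODE):
    print(line)
\end{verbatim}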
|
|
|
\subsubsection{Further optimizations} |
|
\impref{Further optimizations} |
|
|
|
The remaining applicable AST optimizations are performed.
|
|
|
\begin{verbatim} |
|
int[] V_0 = new int[args.Length]; |
|
int i = 0; |
|
goto BasicBlock_6; |
|
BasicBlock_4: V_0[i] = Int32.Parse(args[i]); |
|
i++; |
|
BasicBlock_6: if (i < V_0.Length) goto BasicBlock_4; |
|
QuickSortProgram.QuickSort(V_0, 0, V_0.Length - 1); |
|
\end{verbatim} |
|
|
|
\subsection{High-level blocks representation} |
|
\impref{High-level blocks representation} |
|
|
|
\subsubsection{Initial state} |
|
|
|
This representation is based on a tree structure. However,
initially the tree is flat, with all basic blocks being siblings.
|
|
|
These are the first four basic blocks in the \verb|Main| method.
The relations between the nodes are noted in the brackets
(e.g.\ basic block 1 has one successor, which is basic block 2).
See figure \ref{QuickSortMain} for a graphical representation of
these four basic blocks.
|
|
|
\begin{verbatim} |
|
// BasicBlock 1 (Successors:2 Parent:0) |
|
BasicBlock_1: |
|
int[] V_0 = new int[args.Length]; |
|
int i = 0; |
|
goto BasicBlock_2; |
|
|
|
// BasicBlock 2 (Predecessors:1 Successors:4 Parent:0) |
|
BasicBlock_2: |
|
goto BasicBlock_4; |
|
\end{verbatim} |
|
\begin{verbatim} |
|
// BasicBlock 3 (Predecessors:4 Successors:4 Parent:0) |
|
BasicBlock_3: |
|
V_0[i] = Int32.Parse(args[i]); |
|
i++; |
|
goto BasicBlock_4; |
|
|
|
// BasicBlock 4 (Predecessors:3,2 Successors:5,3 Parent:0) |
|
BasicBlock_4: |
|
if (i < V_0.Length) goto BasicBlock_3; |
|
goto BasicBlock_5; |
|
\end{verbatim} |
|
|
|
\begin{figure}[tbh] |
|
\centerline{\includegraphics{figs/QuickSortMain.png}} |
|
\caption{\label{QuickSortMain}Loop in the `Main' method of quick-sort} |
|
\end{figure} |
|
|
|
\subsubsection{Finding loops and conditionals} |
|
\impref{Finding loops} |
|
|
|
Loops are found by repeatedly applying the T1-T2 transformations. |
|
In this case the first transformation to be performed will be |
|
a T2 transformation on basic blocks 1 and 2. |
|
This will result in this `tree': |
|
|
|
\begin{verbatim} |
|
// AcyclicGraph 10 (Successors:4 Parent:0) |
|
AcyclicGraph_10: |
|
{ |
|
// BasicBlock 1 (Successors:2 Parent:10) |
|
BasicBlock_1: |
|
int[] V_0 = new int[args.Length]; |
|
int i = 0; |
|
goto BasicBlock_2; |
|
|
|
// BasicBlock 2 (Predecessors:1 Parent:10) |
|
BasicBlock_2: |
|
goto BasicBlock_4; |
|
|
|
} |
|
|
|
// BasicBlock 3 (Predecessors:4 Successors:4 Parent:0) |
|
BasicBlock_3: |
|
V_0[i] = Int32.Parse(args[i]); |
|
i++; |
|
goto BasicBlock_4; |
|
|
|
// BasicBlock 4 (Predecessors:3,10 Successors:5,3 Parent:0) |
|
BasicBlock_4: |
|
if (i < V_0.Length) goto BasicBlock_3; |
|
goto BasicBlock_5; |
|
\end{verbatim} |
|
|
|
Note how the links changed. Basic blocks 1 and 2 are now considered as a
single node 10 -- the predecessor of basic block 4 is now 10 rather than 2.
Links within the node 10 are limited just to the scope of that node.
|
|
|
Several further transformations are performed until all loops are found. |
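
For reference, the classical T1-T2 reduction can be sketched as follows.
This Python sketch shows the textbook reduction under assumed names, not the derived algorithm
used in the decompiler: T1 removes a self-loop (which marks a loop) and
T2 merges a node into its only predecessor.
The graph is that of the basic blocks above; basic block 5 is assumed to have no successors.

\begin{verbatim}
# Hypothetical sketch of the classical T1-T2 reduction.
GRAPH = {1: {2}, 2: {4}, 3: {4}, 4: {3, 5}, 5: set()}

def reduce_t1_t2(successors, entry):
    """Return the applied steps, ("T1", node) or ("T2", pred, node),
    and the graph that remains."""
    succ = {n: set(s) for n, s in successors.items()}
    steps = []
    changed = True
    while changed:
        changed = False
        # T1: remove a self-loop; the node is the header of a loop.
        for node in sorted(succ):
            if node in succ[node]:
                succ[node].discard(node)
                steps.append(("T1", node))
                changed = True
        # T2: merge a node that has exactly one predecessor into it.
        for node in sorted(succ):
            if node == entry:
                continue
            preds = [p for p in succ if node in succ[p]]
            if len(preds) == 1:
                pred = preds[0]
                succ[pred].discard(node)
                succ[pred] |= succ.pop(node)
                steps.append(("T2", pred, node))
                changed = True
                break
    return steps, succ

steps, rest = reduce_t1_t2(GRAPH, entry=1)
print(steps)
\end{verbatim}

As in the text, the first step is a T2 transformation on basic blocks 1 and 2.
The T1 step is applied once blocks 3 and 4 have been merged, which identifies the loop.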
|
|
|
Conditional statements are identified as well.
This code has one conditional statement, which
does not have any code in its bodies other than the \verb|goto|s.
|
|
|
\newpage |
|
|
|
\begin{verbatim} |
|
// BasicBlock 1 (Successors:2 Parent:0) |
|
BasicBlock_1: |
|
int[] V_0 = new int[args.Length]; |
|
int i = 0; |
|
goto BasicBlock_2; |
|
|
|
// BasicBlock 2 (Predecessors:1 Successors:11 Parent:0) |
|
BasicBlock_2: |
|
goto BasicBlock_4; |
|
|
|
// Loop 11 (Predecessors:2 Successors:5 Parent:0) |
|
Loop_11: |
|
for (;;) { |
|
// ConditionalNode 22 (Predecessors:3 Successors:3 Parent:11) |
|
ConditionalNode_22: |
|
BasicBlock_4: |
|
if (i >= V_0.Length) { |
|
goto BasicBlock_5; |
|
// Block 21 (Parent:22) |
|
Block_21: |
|
|
|
} else { |
|
goto BasicBlock_3; |
|
// Block 20 (Parent:22) |
|
Block_20: |
|
|
|
} |
|
|
|
// BasicBlock 3 (Predecessors:22 Successors:22 Parent:11) |
|
BasicBlock_3: |
|
V_0[i] = Int32.Parse(args[i]); |
|
i++; |
|
goto BasicBlock_4; |
|
|
|
} |
|
\end{verbatim} |
|
|
|
\subsubsection{Final version} |
|
|
|
Let us see the code once again without the comments and empty lines. |
|
|
|
\begin{verbatim} |
|
BasicBlock_1: |
|
int[] V_0 = new int[args.Length]; |
|
int i = 0; |
|
goto BasicBlock_2; |
|
BasicBlock_2: |
|
goto BasicBlock_4; |
|
Loop_11: |
|
for (;;) { |
|
ConditionalNode_22: |
|
BasicBlock_4: |
|
if (i >= V_0.Length) { |
|
goto BasicBlock_5; |
|
Block_21: |
|
} else { |
|
goto BasicBlock_3; |
|
Block_20: |
|
} |
|
BasicBlock_3: |
|
V_0[i] = Int32.Parse(args[i]); |
|
i++; |
|
goto BasicBlock_4; |
|
} |
|
\end{verbatim} |
|
|
|
\newpage |
|
\subsection{Abstract Syntax Tree representation (part 2)} |
|
\subsubsection{Removal of `goto' statements} |
|
\impref{Removal of `goto' statements} |
|
|
|
All of the \verb|goto| statements in the code above can be |
|
removed or replaced. After the removal of \verb|goto| statements |
|
we get the following much simpler code: |
|
|
|
\begin{verbatim} |
|
int[] V_0 = new int[args.Length]; |
|
int i = 0; |
|
for (;;) { |
|
if (i >= V_0.Length) { |
|
break; |
|
} |
|
V_0[i] = Int32.Parse(args[i]); |
|
i++; |
|
} |
|
\end{verbatim} |
|
|
|
Note that the `false' body of the \verb|if| statement was removed |
|
because it was empty. |
|
|
|
\subsubsection{Simplifying loops} |
|
\impref{Simplifying loops} |
|
|
|
The code can be further simplified by pushing the loop initializer, |
|
condition and iterator inside the \verb|for(;;)|: |
|
|
|
\begin{verbatim} |
|
int[] V_0 = new int[args.Length]; |
|
for (int i = 0; i < V_0.Length; i++) { |
|
V_0[i] = Int32.Parse(args[i]); |
|
} |
|
\end{verbatim} |
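
The pattern behind this simplification can be sketched as follows.
The Python fragment is an illustration only: the statement encoding, the names
\verb|negate| and \verb|simplify_loop|, and the rule that the last statement of the body
is taken as the iterator are assumptions, not the decompiler's actual rules.

\begin{verbatim}
NEGATED = {"<": ">=", ">=": "<", ">": "<=", "<=": ">",
           "==": "!=", "!=": "=="}

def negate(condition):
    left, op, right = condition
    return (left, NEGATED[op], right)

def simplify_loop(initializer, body):
    """body: statements of an endless loop. The first one must be
    ("if_break", condition); the last one is taken as the iterator.
    Returns (loop header, remaining body) or None."""
    if len(body) < 2 or body[0][0] != "if_break":
        return None   # the pattern does not apply
    # The loop runs while the break condition does not hold.
    left, op, right = negate(body[0][1])
    iterator = body[-1]
    header = "for (%s; %s %s %s; %s)" % (
        initializer, left, op, right, iterator)
    return header, body[1:-1]

LOOP = [("if_break", ("i", ">=", "V_0.Length")),
        "V_0[i] = Int32.Parse(args[i])",
        "i++"]

print(simplify_loop("int i = 0", LOOP))
\end{verbatim}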
|
|
|
This concludes the decompilation of quick-sort. |
|
|
|
\clearpage |
|
|
|
{\vspace*{60mm} \centering This page is intentionally left blank\\ Please turn over \newpage} |
|
|
|
|
|
\subsection{Original quick-sort (complete source code)} |
|
\label{Original quick-sort} |
|
\lstinputlisting[basicstyle=\footnotesize] |
|
{./figs/QuickSort_original.cs} |
|
|
|
\newpage |
|
|
|
\subsection{Decompiled quick-sort (complete source code)} |
|
\label{Decompiled quick-sort} |
|
\lstinputlisting[basicstyle=\footnotesize] |
|
{./figs/10_Short_type_names_2.cs} |
|
|
|
|
|
\section{Advanced and unsupported features} |
|
|
|
This section demonstrates decompilation of a more advanced application.
|
The examples that follow were produced by decompiling a reversi game% |
|
\footnote{Taken from the CodeProject. Written by Mike Hall.}. |
|
|
|
The examples demonstrate advanced features of the decompiler |
|
and in some cases its limitations (i.e.\ still unimplemented features). |
|
|
|
|
|
|
|
|
|
\subsection{Properties and fields} |
|
|
|
Properties and fields are well-supported. |
|
|
|
\emph{Original:} |
|
\begin{verbatim} |
|
private int[,] squares; |
|
private bool[,] safeDiscs; |
|
|
|
public int BlackCount { |
|
get { return blackCount; } |
|
} |
|
public int WhiteCount { |
|
get { return whiteCount; } |
|
} |
|
\end{verbatim} |
|
|
|
\emph{Decompiled:} |
|
\begin{verbatim} |
|
private System.Int32[,] squares; |
|
private System.Boolean[,] safeDiscs; |
|
|
|
public int BlackCount { |
|
get { return blackCount; } |
|
} |
|
public int WhiteCount { |
|
get { return whiteCount; } |
|
} |
|
\end{verbatim} |
|
|
|
The decompiled code is correct. |
|
|
|
The type \verb|System.Int32[,]| was not simplified to \verb|int[,]|. |
|
|
|
|
|
|
|
|
|
\subsection{Short-circuit boolean expressions} |
|
|
|
\emph{Original:} |
|
\begin{verbatim} |
|
if (r < 0 || r > 7 || c < 0 || c > 7 || |
|
(r - dr == row && c - dc == col) || |
|
this.squares[r, c] != color) |
|
return false; |
|
\end{verbatim} |
|
|
|
\emph{Decompiled:} |
|
\begin{verbatim} |
|
if (i < 0 || i > 7 || j < 0 || j > 7 || |
|
i - dr == row && j - dc == col || |
|
squares[i, j] != color) { |
|
return false; |
|
} |
|
\end{verbatim} |
|
|
|
|
|
The decompiled code is correct. Variable names were lost during compilation.
The precedence rules made it possible to remove the parentheses.
|
|
|
|
|
|
|
|
|
\subsection{Short-circuit boolean expressions 2} |
|
|
|
\emph{Original:} |
|
\begin{verbatim} |
|
if ((hasSpaceSide1 && hasSpaceSide2 ) || |
|
(hasSpaceSide1 && hasUnsafeSide2) || |
|
(hasUnsafeSide1 && hasSpaceSide2 )) |
|
return true; |
|
\end{verbatim} |
|
|
|
\emph{Decompiled:} |
|
\begin{verbatim} |
|
if (flag && flag2 || flag && flag4 || flag3 && flag2) { |
|
return true; |
|
} |
|
\end{verbatim} |
|
|
|
The decompiled code is correct. Variable names were lost during compilation. |
|
|
|
|
|
|
|
\subsection{Complex control nesting} |
|
|
|
\emph{Original:} |
|
\begin{verbatim} |
|
bool statusChanged = true; |
|
while (statusChanged) |
|
{ |
|
statusChanged = false; |
|
for (i = 0; i < 8; i++) |
|
for (j = 0; j < 8; j++) |
|
if (this.squares[i, j] != Board.Empty && |
|
!this.safeDiscs[i, j] && |
|
!this.IsOutflankable(i, j)) { |
|
this.safeDiscs[i, j] = true; |
|
statusChanged = true; |
|
} |
|
} |
|
\end{verbatim} |
|
|
|
\emph{Decompiled:} |
|
\begin{verbatim} |
|
for (bool flag = true; flag;) { |
|
flag = false; |
|
for (int i = 0; i < 8; i++) { |
|
for (int j = 0; j < 8; j++) { |
|
if (squares[i, j] != Empty && |
|
!safeDiscs[i, j] && |
|
!IsOutflankable(i, j)) { |
|
safeDiscs[i, j] = true; |
|
flag = true; |
|
} |
|
} |
|
} |
|
} |
|
\end{verbatim} |
|
|
|
The decompiled code is correct. |
|
|
|
A \verb|for| loop was used instead of a \verb|while| loop.
The decompiler always uses \verb|for| loops.
|
|
|
In the example, four control structures are nested; |
|
even more levels do not cause any problems. |
|
|
|
|
|
|
|
|
|
|
|
\subsection{Multidimensional arrays} |
|
|
|
Multidimensional arrays are supported. |
|
|
|
\emph{Original:} |
|
\begin{verbatim} |
|
this.squares = new int[8, 8]; |
|
this.safeDiscs = new bool[8, 8]; |
|
// Clear the board and map. |
|
int i, j; |
|
for (i = 0; i < 8; i++) |
|
for (j = 0; j < 8; j++) { |
|
this.squares[i, j] = Board.Empty; |
|
this.safeDiscs[i, j] = false; |
|
} |
|
\end{verbatim} |
|
|
|
\emph{Decompiled:} |
|
\begin{verbatim} |
|
squares = new int[8, 8]; |
|
safeDiscs = new bool[8, 8]; |
|
for (int i = 0; i < 8; i++) { |
|
for (int j = 0; j < 8; j++) { |
|
squares[i, j] = Empty; |
|
safeDiscs[i, j] = false; |
|
} |
|
} |
|
\end{verbatim} |
|
|
|
The decompiled code is correct and even slightly simplified. |
|
|
|
|
|
|
|
|
|
|
|
\subsection{Dispose method} |
|
|
|
This is the standard .NET dispose method for a form. |
|
|
|
\emph{Original:} |
|
\begin{verbatim} |
|
if(disposing) { |
|
if (components != null) { |
|
components.Dispose(); |
|
} |
|
} |
|
\end{verbatim} |
|
|
|
\emph{Decompiled:} |
|
\begin{verbatim} |
|
if (disposing && components) { |
|
components.Dispose(); |
|
} |
|
\end{verbatim} |
|
|
|
The decompiler did a good job of identifying that the two
\verb|if|s can be expressed as a short-circuit expression.

On the other hand, it failed to distinguish being `true' from being `non-null'
-- at the bytecode level these are the same thing (together with being `non-zero').
The correct condition would be \verb|disposing && components != null|.
|
|
|
|
|
|
|
|
|
\subsection{Event handlers} |
|
|
|
Event handlers are not supported. |
|
|
|
\emph{Original:} |
|
\begin{verbatim} |
|
this.aboutMenuItem.Click += |
|
new System.EventHandler(this.aboutMenuItem_Click); |
|
\end{verbatim} |
|
|
|
\emph{Decompiled:} |
|
\begin{verbatim} |
|
aboutMenuItem.add_Click( |
|
new System.EventHandler(this, IL__ldftn(aboutMenuItem_Click())) |
|
); |
|
\end{verbatim} |
|
|
|
Note how the unsupported bytecode \verb|ldftn| is presented. |
|
|
|
|
|
|
|
\subsection{Constructors} |
|
|
|
\emph{Original:} |
|
\begin{verbatim} |
|
this.infoPanel.Location = new System.Drawing.Point(296, 32); |
|
\end{verbatim} |
|
|
|
\emph{Decompiled:} |
|
\begin{verbatim} |
|
infoPanel.Location = new System.Drawing.Point(296, 32); |
|
\end{verbatim} |
|
|
|
The decompiled code is correct including the constructor arguments. |
|
|
|
|
|
\subsection{Property access} |
|
|
|
\emph{Original:} |
|
\begin{verbatim} |
|
this.infoPanel.Name = "infoPanel"; |
|
\end{verbatim} |
|
|
|
\emph{Decompiled:} |
|
\begin{verbatim} |
|
infoPanel.Name = "infoPanel"; |
|
\end{verbatim} |
|
|
|
The decompiled code is correct. |
|
|
|
Note that a property assignment is actually a call to a setter method;
here it is \verb|set_Name(string)|.
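
The mapping from accessor methods back to property syntax can be sketched as below.
This Python fragment is a simplified illustration with the hypothetical name \verb|render_call|;
it relies on the \verb|set_|/\verb|get_| naming convention alone, whereas a real implementation
would consult the metadata to confirm that the method is a property accessor.

\begin{verbatim}
def render_call(target, method, args):
    """Render a method call, mapping accessors back to properties."""
    if method.startswith("set_") and len(args) == 1:
        return "%s.%s = %s" % (target, method[4:], args[0])
    if method.startswith("get_") and not args:
        return "%s.%s" % (target, method[4:])
    return "%s.%s(%s)" % (target, method, ", ".join(args))

print(render_call("infoPanel", "set_Name", ['"infoPanel"']) + ";")
\end{verbatim}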
|
|
|
|
|
\subsection{Object casting} |
|
|
|
\emph{Original:} |
|
\begin{verbatim} |
|
this.Icon = ((System.Drawing.Icon)(resources.GetObject("$this.Icon"))); |
|
\end{verbatim} |
|
|
|
\emph{Decompiled:} |
|
\begin{verbatim} |
|
Icon = (System.Drawing.Icon)V_0.GetObject("$this.Icon"); |
|
\end{verbatim} |
|
|
|
The decompiled code is correct. |
|
|
|
|
|
|
|
\subsection{Boolean values} |
|
|
|
On the evaluation stack of the .NET Runtime there is no difference between
integers and booleans. Handling of this is delicate.
|
|
|
\emph{Original:} |
|
\begin{verbatim} |
|
return false; |
|
\end{verbatim} |
|
|
|
\emph{Decompiled:} |
|
\begin{verbatim} |
|
return false; |
|
\end{verbatim} |
|
|
|
The decompiled code is correct. |
|
|
|
\emph{Original:} |
|
\begin{verbatim} |
|
this.infoPanel.ResumeLayout(false); |
|
\end{verbatim} |
|
|
|
\emph{Decompiled:} |
|
\begin{verbatim} |
|
infoPanel.ResumeLayout(0); |
|
\end{verbatim} |
|
|
|
This code is incorrect -- the case where a boolean is
passed as an argument has not yet been considered in the decompiler.
Handling such cases is just laborious case-by-case work.
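
One possible way to handle the argument case is to render an integer constant
according to the type expected at that position.
The Python sketch below is an assumption about how this could be done, not a description of
the decompiler; the name \verb|literal_for| is hypothetical, and the expected type
is taken to be known from the signature of the called method.

\begin{verbatim}
def literal_for(value, expected_type):
    """Render an integer constant for the type expected at its use."""
    if expected_type == "bool":
        return "true" if value != 0 else "false"
    return str(value)

# ResumeLayout takes a bool parameter, so the constant 0 means false.
print("infoPanel.ResumeLayout(%s);" % literal_for(0, "bool"))
\end{verbatim}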
|
|
|
|
|
|
|
\subsection{The `dup' instruction} |
|
|
|
The \verb|dup| instruction is handled well. |
|
|
|
\emph{Decompiled: (`dup' unsupported)} |
|
\begin{verbatim} |
|
expr175 = this; |
|
expr175.whiteCount = expr175.whiteCount + 1; |
|
\end{verbatim} |
|
|
|
\emph{Decompiled: (`dup' supported)} |
|
\begin{verbatim} |
|
whiteCount++; |
|
\end{verbatim} |
|
|
|
|
|
% coverage for a typical binaries (I like that one :-) ) |
|
% Anonymous methods, Enumerators, LINQ |
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
|
% 2 Page |
|
\cleardoublepage |
|
\chapter{Conclusion} |
|
|
|
The project was a success. |
|
The goal of the project was to decompile an implementation |
|
of a quick-sort algorithm and this goal was achieved well. |
|
The C\# source code produced by the decompiler is |
|
virtually identical to the original source code% |
|
\footnote{See pages \pageref{Original quick-sort} and \pageref{Decompiled quick-sort} in the evaluation chapter.}. |
|
In particular, the source code compiles back to identical .NET CLR code. |
|
|
|
More complex programs than the quick-sort implementation |
|
were considered and the decompiler was improved to handle them better. |
|
The improvements include: |
|
\begin{itemize} |
|
\item Better support for several bytecodes related to: |
|
creation and casting of objects, multidimensional arrays, |
|
virtual method invocation, property access and field access. |
|
\item Support for the \verb|dup| bytecode which does not |
|
occur very frequently in programs. In particular, it does not |
|
occur in the quick-sort implementation. |
|
\item Arbitrarily complicated short-circuit boolean expressions
|
are handled. Furthermore, both the short-circuit |
|
and the traditional boolean expressions are simplified |
|
using rules of logic (e.g.\ using De Morgan's laws). |
|
\item Nested high-level control structures are handled |
|
well (e.g.\ nested loops). |
|
\item The performance is improved by using basic blocks. |
|
\item The removal of \verb|goto|s was improved to work |
|
in almost any scenario (by using the `simulator' approach). |
|
\end{itemize} |
|
|
|
|
|
\section{Hindsight} |
|
|
|
I am very satisfied with the current implementation. |
|
Whenever I saw an opportunity for improvement,
I refactored the code right away.
|
As a result of this, I think that the algorithms and data structures |
|
currently used are very well suited for the purpose. |
|
|
|
If I were to reimplement the project, I would merely |
|
save time by making the right design decisions right away. |
|
|
|
Here are some issues that I encountered and resolved:
|
|
|
\begin{itemize} |
|
\item During the preparation phase, I spent a reasonable
amount of time researching algorithms for finding loops.
I found most of the algorithms either unintuitive or
not sufficiently robust for all scenarios.
|
Therefore, I have derived my own algorithm based on the |
|
T1-T2 transformations. |
|
\item I initially used the \emph{CodeDom} library
|
for the abstract syntax tree. |
|
In the end, \emph{NRefactory} proved much more suitable. |
|
% \item I have initially directly used the objects provided by |
|
% \emph{Cecil} library as the first code representation. |
|
% Making an own copy turned out to be better since the objects |
|
% could be more easily manipulated and annotated. |
|
\item I eventually made the stack-based and
|
variable-based code representations quite distinct and |
|
decoupled. This allowed greater freedom for transformations. |
|
\item Initially the links between nodes were manually kept |
|
in synchronization during transformations. |
|
It is more convenient and robust just to throw the links away |
|
and recalculate them. |
|
\item The removal of \verb|goto|s was gradually pushed |
|
all the way to the last stage of decompilation where it is |
|
most powerful and safest. |
|
\end{itemize} |
|
|
|
\section{Future work} |
|
|
|
The theoretical problems have been resolved. |
|
However, a lot of laborious work still needs to be done before |
|
the decompiler will be able to handle all .NET executables. |
|
|
|
The decompiler could be extended to recover C\# compile-time |
|
features like anonymous methods, enumerators or LINQ expressions. |
|
|
|
The decompiler overlaps considerably with a verifier of .NET executables
(the code needs to be valid in order to be decompiled).
The project could be forked to create a .NET verifier.
|
|
|
I am the primary developer of the SharpDevelop debugger. |
|
I would like to integrate the decompiler with the debugger so
that it is possible to debug applications with no source code.
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
|
\cleardoublepage |
|
\chapter{Project Proposal} |
|
\input{proposal} |
|
|
|
\end{document}