.NET Decompiler with support for PDB generation, ReadyToRun, Metadata (&more) - cross-platform!
 
 
 
 

3077 lines
107 KiB

\documentclass[12pt,twoside,notitlepage]{report}
\usepackage{a4}
\usepackage{verbatim}
\usepackage{graphicx}
\usepackage{listings}
\raggedbottom % try to avoid widows and orphans
\setlength{\topskip}{1\topskip plus 7\baselineskip}
\sloppy
\clubpenalty1000%
\widowpenalty1000%
\addtolength{\oddsidemargin}{6mm} % adjust margins
\addtolength{\evensidemargin}{-8mm}
\renewcommand{\baselinestretch}{1.1} % adjust line spacing to make
% more readable
\setlength{\parskip}{1.3ex plus 0.2ex minus 0.2ex}
\setlength{\parindent}{0pt}
\setcounter{secnumdepth}{5}
\setcounter{tocdepth}{5}
\lstset{
basicstyle=\small,
language={[Sharp]C},
tabsize=4,
numbers=left,
frame=none,
frameround=tfft
}
\begin{document}
\bibliographystyle{plain}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Title
\pagestyle{empty}
\hfill{\LARGE \bf David Srbeck\'y}
\vspace*{60mm}
\begin{center}
\Huge
{\bf .NET Decompiler} \\
\vspace*{5mm}
Part II of the Computer Science Tripos \\
\vspace*{5mm}
Jesus College \\
\vspace*{5mm}
May 16, 2008
\end{center}
\cleardoublepage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Proforma, table of contents and list of figures
\setcounter{page}{1}
\pagenumbering{roman}
\pagestyle{plain}
\section*{Proforma}
{\large
\begin{tabular}{ll}
Name: & \bf David Srbeck\'y \\
College: & \bf Jesus College \\
Project Title: & \bf .NET Decompiler \\
Examination: & \bf Part II of the Computer Science Tripos, 2008 \\
Word Count: & \bf 11949\footnotemark[1] \\
Project Originator: & David Srbeck\'y \\
Supervisor: & Alan Mycroft \\
\end{tabular}
}
\footnotetext[1]{This word count was computed
by {\tt detex diss.tex | tr -cd '0-9A-Za-z $\tt\backslash$n' | wc -w}
}
\stepcounter{footnote}
\subsection*{Original Aim of the Project}
The aim of this project was to create a .NET decompiler.
That is, to create a program that translates .NET executables
back to C\# source code.
%This is essentially the opposite of a compiler.
The concrete requirement was to decompile a
quick-sort algorithm. Given an implementation of a quick-sort
algorithm in executable form, the decompiler was required
to re-create semantically equivalent C\# source code.
Although allowed to be rather cryptic, the generated source code
had to compile and work without any additional modifications.
\subsection*{Work Completed}
Implementation of the quick-sort algorithm decompiles successfully.
Many optimizations were implemented and perfected to the
point where the generated source code for the quick-sort
algorithm is practically identical to the original%
\footnote{See pages \pageref{Original quick-sort} and \pageref{Decompiled quick-sort} in the evaluation chapter.}.
In particular, the source code compiles back to identical .NET CLR code.
Further challenges were sought in more complex programs.
The decompiler deals with several issues which did not
occur in the quick-sort algorithm. It also implements some
optimizations that are only applicable to the more complex programs.
\subsection*{Special Difficulties}
None.
\newpage
\section*{Declaration}
I, David Srbeck\'y of Jesus College, being a candidate for Part II of the Computer
Science Tripos, hereby declare
that this dissertation and the work described in it are my own work,
unaided except as may be specified below, and that the dissertation
does not contain material that has already been used to any substantial
extent for a comparable purpose.
\bigskip
\leftline{Signed}
\medskip
\leftline{Date}
\cleardoublepage
\tableofcontents
\listoffigures
\cleardoublepage % just to make sure before the page numbering
% is changed
\setcounter{page}{1}
\pagenumbering{arabic}
\pagestyle{headings}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% 3 Pages
\chapter{Introduction}
\section{The goal of the project}
% Decompiler
% .NET
The goal of this project is to create a .NET decompiler.
Decompiler is a tool that translates machine code back to source
code. That is, it does the opposite of a compiler -- it takes the
executable file and it tries to recreate the original source code.
In general, decompilation can be extremely difficult or even impossible.
Therefore this project focuses on something slightly simpler --
decompilation of .NET executables.
The advantage of .NET executables is that they consist of processor
independent bytecode which is easier to decompile then
the traditional machine code because
the instructions are more high level and the code is less optimized.
The .NET executables also preserve a lot of metadata about the code like
type information and method names.
Note that despite its name, the \emph{.NET Framework}
is unrelated to networking or the Internet and it is basically the
equivalent of \emph{Java Development Kit}.
To succeed, the produced decompiler is required to successfully decompile
a console implementation of a quick-sort algorithm back to C\# source code.
\section{Decompilation}
Decompilation is the opposite of compilation. It takes the executable
file and attempts to recreate the original source code.
Despite of doing the reverse, decompilation is very similar to
compilation. This is because both compilation and decompilation do
in essence that same thing -- translate program from one form to
other form and then optimize the final form.
Because of this, many analysis and optimizations used in compilation
are also applicable in decompilation.
Decompilation may not always be possible. In particular,
self-modifying code is very problematic.
\section{Uses of decompilation}
\begin{itemize}
\item Recovery of lost source code.
\item Reverse engineering of programs that ship with no
source code. This, however, might be a bit controversial use
due to legal issues.
\item Interoperability.
\item Security analysis of programs.
\item Error correction or minor improvements.
\item Porting of executable to other platform.
\item Translation between languages.
Source code in some obscure language can be compiled into an
executable which can in turn be decompiled into C\#.
\item Debugging of deployed systems. Deployed system
might not ship with source code or it might be difficult
to `link' it to the source code because it is optimized.
Also note that it might not be acceptable to restart
the running system because the issue is not easily reproducible.
\end{itemize}
\section{The .NET Framework}
% Bytecode - stack-based, high-level
% JIT
% Much metadata
% Typesafe
The \emph{.NET Framework} is a general-purpose software development platform
which is very similar to the \emph{Java Development Kit}.
It includes an extensive class library
and, similarly to Java, it is based on a virtual machine model which
makes it platform independent.
To be platform independent the program is stored in form of bytecode
(also called Common Intermediate Language or \emph{IL}).
The bytecode instructions use the stack to pass data around rather then
registers.
The instructions are aware of the object-oriented programing
model and are more high level then traditional processor instructions.
For example, there is special instruction to allocate an object in memory,
to access a field of an object, to cast an object to a different type
and so on.
The byte code is translated to machine code of the
target platform at run-time. The code is just-in-time compiled rather then
interpreted. This means that the first execution of a function will be slow
due to the translation overhead, but the subsequent executions of the
same function will be very fast.
The .NET executables preserve a lot of information about the
structure of the program. The complete structure of classes is
preserved including all the typing information. The code is still
separated into distinct methods and the complete signatures of
the methods are preserved -- that is, the executable includes
the method name, the parameters and the return type.
The virtual machine is intended to be type-safe. Before any code
is executed it is verified and code that has been successfully
verified is guaranteed to be type-safe. For example, it is
guaranteed not access unallocated memory.
However, the framework also provides some inherently unsafe
instructions. This allows support for languages like C++, but
it also means that some parts of the program will be unverifiable.
Unverifiable code may or may not be allowed to execute depending
on the local security policy.
In general, any programming language can be compiled to \emph{.NET} and
there are already dozens, if not hundreds, of compliers in existence.
The most commonly used language is \emph{C\#}.
\section{Scope of the project}
The \emph{.NET Framework} is mature comprehensive platform and handling
all of its features is out the scope of the project. Therefore, it is
necessary to state which features should be supported and which
should not.
The decompiler was specified to at least handle all the features required
to decompile the quick-sort algorithm. This includes integer arithmetic,
array operations, conditional statements, loops and system method calls.
The decompiler will not support so-called unverifiable instructions --
for example, pointer arithmetic and access to arbitrary memory locations.
These operations are usually used only for
interoperability with native code (code not running under the .NET
virtual machine). As such, these instructions
are relatively rare and do not even occur in many programs at all.
Generics also will not be supported. They were introduced in the
second version of the Framework and many programs still do not make any
use of them. The generics also probably does not provide any
theoretical challenge for the decompilation. It just involves handing
of more bytecodes and more special cases of the existing bytecodes.
\section{Previous work}
Decompilers are in existence nearly as long as compilers.
The very first decompiler was written by Joel Donnelly in 1960
under the supervision of Professor Maurice Halstead.
Initially, decompilers were created to aid in the program
conversion process from second to third generation computers.
Decompilers were actively developed over the following
years and were used for variety of applications.
Probably the best know and most influential research paper
is \emph{Cristina Cifuentes' PhD Thesis ``Reverse Compilation
Techniques'', 1994}. As part of the paper Cristina
created a prototype \emph{dcc} which translates Intel i80286
machine code into C using various data and control flow
analysis techniques.
The introduction of virtual machines made decompilation
more feasible then ever (e.g.\ \emph{Java Virtual Machine} or
\emph{.NET Framework Runtime}).
There are already several closed-source .NET decompilers
in existence. The most known one is probably
Lutz Roeder's \emph{Reflector for .NET}.
% Cristina
% Current .NET Decompilers
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% 12 Pages
\cleardoublepage
\chapter{Preparation}
This chapter explains the general idea behind the decompilation
process. The implementation chapter then explains how these
methods were implemented and what measures were taken
to ensure the correctness of the produced output.
\section{Overview of the decompilation process}
% Read metadata
% Starting point
% Translating bytecodes into C# expressions;
% High-level; Stack-Based; Conditional gotos
% Dataflow analysis; inlining
% Control flow analysis (interesting)
% Suggar
The first step is to read and disassemble the \emph{.NET} executable files.
There is an open-source library called \emph{Cecil} which was created exactly
for this purpose. It loads the whole executable into memory and then
it provides simple object-oriented API to access the content of the executable.
Using this library, the access to the executable is straightforward.
It is worth noting the wealth of information we are starting with.
The \emph{.NET} executables contain quite high-level representation
of the program. Complete typing information is still present --
we have access to all the classes including the fields and methods.
The types and names are also preserved.
The bytecode is stored on per-method basis and even the local variables
used in the method body are still explicitly represented including the type.%
\footnote{However, the names of local variables are not preserved.}
Ultimately, the challenge is to translate the bytecode of a given
method to a C\# code.
For example, consider decompilation of following code:
\begin{verbatim}
// Load "Hello, world!" on top of the stack
ldstr "Hello, world!"
// Print the top of the stack to the console
call System.Console::WriteLine(string)
\end{verbatim}
We start by taking the bytecodes one by one and translate them
to C\# statements. The bytecodes are quite high level so
the translation is usually straightforward.
The two bytecodes can thus be represented as the following
two C\# statements (\verb|output1| and \verb|some_input| are just
dummy variables):
\begin{verbatim}
string output1 = "Hello, world!";
System.Console.WriteLine(some_input);
\end{verbatim}
The first major challenge of decompilation is data-flow
analysis. In this example, we would analyze the stack
behavior and we would find that \verb|output1| is on the top
of the stack when \verb|WriteLine| is called. Thus, we would get:
\begin{verbatim}
string output1 = "Hello, world!";
System.Console.WriteLine(output1);
\end{verbatim}
At this point the produced source code should already compile
and work as intended.
The second major challenge of decompilation is control-flow
analysis. The use of control-flow analysis
is to replace \verb|goto| statements with high-level structures
like loops and conditional statements. This makes the produced
source code much more easier to read and understand.
Finally, there are several sugarings that can be done to the code to
make it more readable -- for example, replacing \verb|i = i + 1|
with the equivalent \verb|i++| or using rules of logic to
simplify boolean expressions.
\section{Translating bytecodes into C\# expressions}
% One at a time
% Byte code is high level (methods, classes, bytecode commands)
% Types preserved (contrast with Java)
The first task is to convert naively the individual bytecodes into
equivalent C\# expressions or statements.
This task is reasonably straightforward because the bytecodes
are quite high level and are aware of the object-oriented
programming model. For example, there are bytecodes
for integer addition, object creation, array access,
field access, object casting and so on; all of these trivially
translate to C\# expressions.
Any branching commands are also easy to handle because
C\# supports labels and \verb|goto|s. Unconditional branches
are translated to \verb|goto| statements and conditional
branches are translated to \verb|if (condition) goto Label;|.
C\# uses local variables for data transfer whereas bytecodes
use the stack. This mismatch will be handled during the
data-flow analysis phase. Until then, dummy variables
are used as placeholders. For example, \verb|add| bytecode
would be translated to \verb|input1 + input2|.
\section{Data-flow analysis}
\subsection{Stack in .NET}
% Stack behavior
% Deterministic stack (tested - failed)
Bytecodes use a stack to pass data between consecutive
instructions. When a bytecode is executed, it pops its arguments
from the stack, it performs some operation on them and then
it pushes the result, if any, back to the stack.
The number of popped elements is fixed and known before execution time
- for example, any binary operation pops exactly two arguments and
a method call pops one element for each of its arguments.
Most bytecodes either push only one element on the stack or none at all.%
\footnote{There is one exception to this -- the `dup' instruction.
More on this later.}
This makes the data-flow analysis slightly simpler.
The framework imposes some useful restrictions with regards to
control-flow. In theory, we could arrive at a given point in program
and depending on which execution path was taken, we could end up with
two completely different stacks. This is not allowed. Regardless
of the path taken, the stack must have the same number of elements in it.
The typing is also restricted -- if one execution path pushes an integer
on the stack, any other execution path must push an integer as well.
\subsection{Stack simulation}
Bytecodes use the stack to move data around, but there is no such stack
concept in C\#. Therefore we need to use something that C\# does
have -- local variables.
The transformation from stack to local variables is simpler then
it might initially seem. Basically, pushing something on the
stack means that we will need that value later on. So in C\#,
we create a new temporary variable and store the value in it.
On the other hand, popping is a usage of the value. So in C\#,
we just reference the temporary variable that
holds the value we want. The only difficult is to figure out
which temporary variable it is.
This is where the restrictions on the stack come handy -- we can
figure out what the state of the stack will be at any given point.
That is, we do not know what the value will be, but we do know
who pushed the value on the stack and what the type of the value is.
For example, after executing the following code
\begin{verbatim}
IL_01: ldstr "Hello''
IL_02: ldstr "world''
\end{verbatim}
we know that there will be exactly two values on the stack --
the first one pushed by bytecode labeled \verb|IL_01| and the
second one pushed by bytecode labeled \verb|IL_02|. Both of
them of are of type \verb|String|. Let's say that the state
of the stack is \{\verb|IL_01|, \verb|IL_02|\}.
Now, let us pop value from the stack by calling
\verb|System.Console.WriteLine|. This method takes one
\verb|String| argument. Just before the method is called
the state of the stack is \{\verb|IL_01|, \verb|IL_02|\}.
Therefore the value allocated by \verb|IL_02| is popped
from the stack and we are left with \{\verb|IL_01|\}.
Furthermore, we know which temporary local variable to
use in the method call -- the one that stores the value
pushed by \verb|IL_02|.
Let us call \verb|System.Console.WriteLine| again to completely
empty the stack and let us see what the produced C\# source code
would look like (the comments show the state of the stack
between the individual commands).
\begin{verbatim}
// {}
String temp_for_IL_01 = "Hello'';
// {IL_01}
String temp_for_IL_02 = "world'';
// {IL_01, IL_02}
System.Console.WriteLine(temp_for_IL_02);
// {IL_01}
System.Console.WriteLine(temp_for_IL_01);
// {}
\end{verbatim}
\subsection{The `dup' instruction}
In general, bytecodes push either one value on the stack or none at all.
There is only one exception to this -- the \verb|dup| instruction.
The purpose of this instruction is to duplicate the value on top of the
stack. Fortunately, it does not break our model. The only significance
of the \verb|dup| instruction is that the same value will be used twice -
that is, the temporary local variable will be referenced twice.
The \verb|dup| instruction, however, makes some future optimizations
more difficult.
The \verb|dup| instruction is used relatively rarely and
it usually just saves a few bytes of disk space.
Note that the \verb|dup| instruction could always be replaced
by store to a local variable and two loads.
The \verb|dup| instruction can be used in expression like
\verb|this.a + this.a| -- the value \verb|this.a| is obtained
once, duplicated and then the duplicates are added.
\subsection{Control merge points}
This in essence the most complex part of data-flow analysis.
It is caused by the possibility of multiple execution paths.
There are two scenarios that branching can cause.
The nice scenario is when a value is pushed in one location and,
depending on the execution path, can be popped in two or more different
locations. This translates well into the local variable model -
it simply means that the temporary local variable will be referenced
several times and there is nothing special we need to do.
The worse scenario is when a value can be pushed on the sack
in two or more different locations and then it is poped only in one locations.
Example of this would be
\verb|Console.WriteLine(condition ? "true": "false")| --
depending on the value of the condition either ``true'' or
``false'' is printed. This code snippet compiles into bytecode
with two possible execution paths -- one in which ``true'' is pushed
on the stack and one in which ``false'' is pushed on the stack.
In both cases the value is poped only once when
\verb|Console.WriteLine| is called.
This scenario does not work well with the local variable
model. Two values can be pushed on the stack and so two temporary
local variables are created -- one holding ``true'' and
one holding ``false''.
Depending on the execution path either of these can be on the
top of the stack when \verb|Console.WriteLine| is executed and
therefore we do not know which temporary variable to use as the argument
for \verb|Console.WriteLine|. It is impossible to know.
This can be resolved by creating yet another temporary variable
which will hold the actual argument for the \verb|Console.WriteLine|
call. Depending on the execution path taken, one of the previous
variables needs to be copied into this one.
So the produced C\# source code would look like this:
\begin{verbatim}
String sigma;
if (condition) {
String temp1 = "true";
sigma = temp1;
} else {
String temp2 = "false";
sigma = temp2;
}
Console.WriteLine(sigma);
\end{verbatim}
Note that this is essentially the trick that is used to obtain
single static assignment (SSA) form of programs during compilation.
In reality, this happens extremely rarely and it is almost exclusively
caused by the C\# \verb|?| operator.
\subsection{In-lining expressions}
% Generally difficult, but here SSA-SR
% Validity considered - order, branching
The code produced so far contains many temporary variables.
These variables are often not necessary and can be
optimized away. For example, the code
\begin{verbatim}
string output1 = "Hello, world!";
System.Console.WriteLine(output1);
\end{verbatim}
can be simplified into a single line:
\begin{verbatim}
System.Console.WriteLine("Hello, world!");
\end{verbatim}
Doing this optimization makes the produced code much more readable.
In general, removing a local variable from a method is non-trivial.
However, these generated local variables are special -- they were generated
to remove the stack-based data-flow model.
This means that the local variables are assigned only once and are
read only once%
\footnote{With the exception of the `dup' instruction which is read twice
and thus needs to be handled specially.}
(corresponding to the push and pop operations respectively).
There are some safety issues with this optimization.
Even though we are sure that the local variable is assigned and read
only once, we need to make sure that doing the optimization will not
change the semantics of the program. For example consider the
following code:
\begin{verbatim}
string output1 = f();
string output2 = g();
System.Console.WriteLine(output2 + output1);
\end{verbatim}
The functions \verb|f| and \verb|g| may have side-effects and in this
case \verb|f| is called first and then \verb|g| is called.
Optimizing the code would produce
\begin{verbatim}
System.Console.WriteLine(g() + f());
\end{verbatim}
which calls the functions in the opposite order.
Being conservative, we will allow this optimization only if the
declaration of the variable immediately precedes its use.
In other words, the variable is declared and then immediately
used with no possible side-effects in between these.
Using this rule, the previous example can be optimized to
\begin{verbatim}
string output1 = f();
System.Console.WriteLine(g() + output1);
\end{verbatim}
but it can not be optimized further because the function
\verb|g| is executed between the declaration of \verb|output1|
and its use.
\section{Control-flow analysis}
% Recovering high-level structures
The main point of control-flow analysis is to recover high-level
structures like loops and conditional blocks of code.
This is not necessary to produce semantically correct code,
but it does make the code much more readable compared to
only using \verb|goto| statements. It is also
the most theoretically interesting part of the whole
decompilation process.
In general, \verb|goto|s can be always completely removed
by playing some `tricks' like duplicating code or adding
boolean flags. However, as far as readability of the
code is concerned, this does more harm than good.
Therefore, we want to remove \verb|goto|s using the `natural'
way -- by recovering the high-level structures they represent
in the program. Any \verb|goto|s that can not be removed
in such way are left in the program.
See figures \ref{IfThen}, \ref{IfThenElse} and \ref{Loop}
for examples of control-flow graphs.
\begin{figure}[tbh]
\centerline{\includegraphics{figs/IfThen.png}}
\caption{\label{IfThen}Conditional statement (if-then)}
\end{figure}
\begin{figure}[tbh]
\centerline{\includegraphics{figs/IfThenElse.png}}
\caption{\label{IfThenElse}Conditional statement (if-then-else)}
\end{figure}
\begin{figure}[tbh]
\centerline{\includegraphics{figs/Loop.png}}
\caption{\label{Loop}Loop}
\end{figure}
\subsection{Finding loops}
% T1-T2 reduction
% Nested loops
% Goto=Continue=Break; Multi-level break/continue
T1-T2 transformations are used to find loops.
These transformation are usually used to determine whether
a control flow graph is reducible or not, but
it turns out that they are suitable for finding loops as well.
If a control-flow graph is irreducible, it means that the
program can not be represented only using high-level structures.
That is, \verb|goto| statements are necessary to represent the
program. This is undesirable since \verb|goto| statements
usually make the decompiled source code less readable.
See figure \ref{Ireducible} for the classical example
of irreducible control-flow graph.
\begin{figure}[tbh]
\centerline{\includegraphics{figs/Ireducible.png}}
\caption{\label{Ireducible}Ireducible control-flow graph}
\end{figure}
If a control-flow graph is reducible, the program may or may not be
representable only using high-level structures. That is,
\verb|goto| statements still may be necessary on some occasions.
For example, C\# has a \verb|break| statement which exits
the inner most loop, but it does not have any statement which
would exit multiple levels of loops.%
\footnote{This is language dependent.
The Java language, for example, does provide such command.}
Therefore in such occasion we have to use the \verb|goto| statement.
The algorithm works by taking the initial control-flow graph
and then repeatedly applying simplifying transformations to it.
As we can see from the name, the algorithm consists of two
transformations called T1 and T2.
If the graph is reducible, the algorithm will eventually simplify
the graph to only a single node. If the graph is not reducible,
we will end up with multiple nodes and no more transformations
that can be performed.
Both of the transformations look for a specific pattern in
the graph and then replace the pattern with something simpler.
See figure \ref{T1T2} for a diagram showing the T1 and T2 transformations.
\begin{figure}[tbh]
\centerline{\includegraphics{figs/T1T2.png}}
\caption{\label{T1T2}T1 and T2 transformations}
\end{figure}
The T1 transformation removes self-loop. If one of the successors
of a node is the node itself, then the node is a self-loop.
That is, the node `points' to itself. The T1 transformation
replaces the node with a node without such link.
The number of preformed T1 transformations corresponds to the
number of loops in the program.
The T2 transformation merges two consecutive nodes together.
The only condition is the the second node has to have
only one predecessor which is the first node.
Preforming this transformation repeatedly basically reduces
any directed acyclic graph into a single node.
See figures \ref{IfThenElseReduction}, \ref{NestedLoops} and
\ref{NestedLoops2} for examples how T1 and T2 transformations
can be used to reduce graphs.
\begin{figure}[p]
\centerline{\includegraphics{figs/IfThenElseReduction.png}}
\caption{\label{IfThenElseReduction}Reduction of if-then-else statement}
\end{figure}
\begin{figure}[p]
\centerline{\includegraphics{figs/NestedLoops.png}}
\caption{\label{NestedLoops}Reduction of two nested loops}
\end{figure}
\begin{figure}[tbhp]
\centerline{\includegraphics{figs/NestedLoops2.png}}
It is not not immediately obvious that these are, in fact,
two nested loops. Fifth conceptual node has been added
to demonstrate why the loops can be considered nested.
\caption{\label{NestedLoops2}Reduction of two nested loops 2}
\end{figure}
The whole algorithm is guaranteed to terminate if and only
the graph is reducible. However, it may produce different
intermediate states depending on the order of the transformations.
For example, the number of T1 transformations used can vary.
This affects the number of apparent loops in the program.
Note that although reducible graphs are preferred,
irreducible graphs can be easily decompiled as well
using \verb|goto| statements. Programs that were originally
written in C\# are most likely going to be reducible.
\subsection{Finding conditionals}
% Compound conditions
Any T1 transformation (self-loop reduction) produces a loop.
Similarly, T2 transformation or several T2 transformations
produce a directed acyclic sub-graph. This sub-graph can
usually be represented as one or more conditional statements.
Any node in the control-flow graph that has two successors
is potentially a conditional -- that is, \verb|if-then-else|
statement. The node itself determines which branch will
be taken, so the node is the condition. We now need to
determine what is the `true' body and what is the `false'
body. That is, we look for the blocks of code that will
be executed when the condition will or will not hold respectively.
This is done by considering reachability. A code that is
reachable only by following the `true' branch can be
considered as the `true' body of the conditional.
Similarly, code that is reachable only by following the
`false' branch is the `false' body of the conditional.
Finally, the code that is reachable from both branches is
not part of the conditional -- it is the code that follows
the whole \verb|if-then-else| statement.
See figure \ref{IfThenElseRechablility} for a diagram of this.
\begin{figure}[tbhp]
\centerline{\includegraphics{figs/IfThenElseRechablility.png}}
\caption{\label{IfThenElseRechablility}Reachablity of nodes for a conditional}
\end{figure}
\subsection{Short-circuit boolean expressions}
Short-circuit boolean expressions are boolean expressions
which may not evaluate completely.
The semantics of normal boolean expression \verb|f() & g()| is
to evaluate both \verb|f()| and \verb|g()| and then return
the logical `and' of these.
The semantics of short-circuit boolean expression \verb|f() && g()|
is different. The expression \verb|f()| is evaluated as before,
but \verb|g()| is evaluated only if \verb|f()| returned \emph{true}.
If \verb|f()| returned \emph{false} then the whole expression
will return \emph{false} anyway and so there no point in evaluating
\verb|g()|.
In general, conditionals depending on this short-circuit logic
cannot be expressed as normal conditionals without the use of \verb|goto|s
or code duplication.
Therefore it is desirable to find the short-circuit boolean expressions
in the control-flow graph.
The short-circuit boolean expressions will manifest themselves as
one of four patterns in the control-flow diagrams.
See figure \ref{shortcircuit}.
\begin{figure}[tbh]
\centerline{\includegraphics{figs/shortcircuit.png}}
\caption{\label{shortcircuit}Short-circuit control-flow graphs}
\end{figure}
These patterns can be easily searched for and replaced with a single node.
Nested expressions like \verb/(f() && g()) || (h() && i())/
will simplify well by applying the algorithm repeatedly.
\subsection{Basic blocks}
% Performance optimization only (from compilation)
Basic block is a block of code which is always executed as
a unit. That is, no other code can jump in the middle of the
block and there are no jumps from the middle of the block.
The implication of this is that as far as control-flow
analysis is concerned, the whole basic block can be
considered as a single instruction.
This primary reason for implementing this is performance gain.
\section{Requirements Analysis}
The decompiler must successfully round-trip a quick-sort algorithm.
That is, given an executable containing an implementation of
the quick-sort algorithm, the decompiler must produce
C\# source code that is both syntactically and semantically correct.
The produced source code must compile and work correctly
without any additional modifications to the source code.
To achieve this the Decompiler needs to handle at least the following:
\begin{itemize}
\item Integers and integer arithmetic
\item Create and be able to use integer arrays
\item Branching must be successfully decompiled
\item Several methods can be defined
\item Methods can have arguments and return values
\item Methods can be called recursively
\item Integer command line arguments can be read and parsed
\item Text can be outputted to the standard console output
\end{itemize}
\section{Development model}
The development is based on the spiral model.
The decompiler will internally consist of several consecutive code
representations. The decompiled program will be gradually
transformed from one representation to other until it reaches
the last one and is outputted as source code.
This matches the spiral model well. Successful implementation
of each code representation can be seen as one iteration in the
spiral model.
Furthermore, once a code representation is finished, it can
be improved by implementing an optimization which transforms it.
This can be seen as another iteration.
\section{Reference documentation}
The .NET executables will be decompiled into C\# source code.
It is therefore crucial to be intimately familiar with the language.
Evaluation semantics, operator precedence, associativity and other
aspects of the language can be found in the \emph{ECMA-334 Standard --
C\# Language Specification}.
In oder to decompile .NET executables, it is necessary to get
familiar with the .NET Runtime (i.e.\ the .NET Virtual Machine).
In particular, it is necessary to learn how the execution engine
works -- how does the stack behave, what limitations are imposed
on the code, what data types are used and so on.
Most importantly, it is necessary to get familiar with majority
of the bytecodes.
The primary reference for the .NET Runtime is the
\emph{ECMA-335 Standard -- Common Language Infrastructure (CLI)}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% 25 Pages
\cleardoublepage
\chapter{Implementation}
\section{Development process}
% SVN
% GUI -- options; progressive optimizations
% Testing
% Refactoring
The whole project was developed in C\# on the \emph{Windows} platform.
SharpDevelop\footnote{Open-source integrated development environment
for the .NET framework. I am actively participating in its development,
currently being one of the major developers.}
was used as the integrated development environment for writing of the code.
The source code and all files related to the project are stored on SVN%
\footnote{Subversion (SVN) is an open-source version control system.
It is very similar to CVS and aims to be its successor.}
server. This has several advantages such as back-up of all data,
documented project history (the commit messages) and the ability
to revet back to any previous state. The graphical user interface
for SVN also makes it possible to easily see and review all changes
that were made since previous commit to the server.
The user interface of the decompiler is very simple.
It consists of a single window which is largely covered by a text area
containing the output of the decompilation process.
There are only a few controls in the top part of the window.
Several check boxes can be used to completely turn off or on
optimization stages of the decompilation process.
Numerical text boxes can be used to specify the maximal number of
interactions for several transformations. By gradually increasing
these it is possible to observe changes to the output step-by-step.
These options significantly aid in debugging and testing of
the decompiler.
The source code generated by the decompiler is also stored on
the disk and committed to SVN together with all other files.
This is the primary method of regression testing. The generated source code
is the primary output of the decompiler and thus any change
in it may potentially signify a regression in the decompiler.
Before every commit the changes to this file were reviewed
to confirm that the new features did yield the expected
results and that no regressions were introduced.
\section{Overview of representation structures}
% Stack-based bytecode
% Variable-based bytecode
% High-level blocks
% Abstract Syntax Tree
The decompiler uses the following four intermediate representations
of the program (sequentially in this order):
\begin{itemize}
\item Stack-based bytecode
\item Variable-based bytecode
\item High-level blocks
\item Abstract Syntax Tree
\end{itemize}
The code is translated from one representation to other and some
transformations/optimizations are done at each of these representations.
The representations are very briefly described in the following
four subsections and then they are covered again in more detail.
Trivial artificial examples are used to aid the explanation.
Note that the evaluation chapter shows the actual outputs
for the quick-sort implementation.
\subsection{Stack-based bytecode representation}
The stack-based bytecode is the original bytecode as read from the
assembly. It is never modified. However, it is analyzed and
annotated with the results of the analysis. Most data-flow
analysis is performed at this stage.
This is an example of code in this representation:
\begin{verbatim}
ldstr "Hello, world!"
call System.Console::WriteLine
\end{verbatim}
\subsection{Variable-based bytecode representation}
The variable-based bytecode is the result of removing the
stack-based data-flow model. Local variables are now used instead.
Even though the stack-based model is now completely removed,
we can still use the bytecodes to represent the program.
We just need to think about them in a slightly different way.
The bytecodes are in essence functions which take arguments
and return values. For example, the \verb|add| bytecode
pops two values and pushes back the result and so it is now
a function which takes two arguments and returns the result.
The \verb|ldstr "Hello, world!"| bytecode used above is slightly
more complicated. It does not pop anything from the stack,
but it still has the implicit argument \verb|"Hello, world!"|
which comes from the immediate operand of the bytecode.
Continuing with the example, the variable-based from would be:
\begin{verbatim}
temp1 = ldstr("Hello, world!");
call(System.Console::WriteLine, temp1);
\end{verbatim}
where the variable \verb|temp1| was used to remove the stack data-flow model.
Some transformations are performed at this stage, modifying the
code. For example, the in-lining optimization would transform
the previous example to:
\begin{verbatim}
call(System.Console::WriteLine, ldstr("Hello, world!"));
\end{verbatim}
Note that expressions can be be nested as seen in the code above.
\subsection{High-level blocks representation}
The next data representation introduces high-level blocks.
In this representation, related parts of code can be grouped
together into a block of code.
The block signifies some high-level structure and blocks can
be nested (for example, conditional within a loop).
All control-flow analysis is performed at this stage.
The blocks can be visualized by a pair of curly braces:
\begin{verbatim}
{ // Block of type 'method body'
IL_01: call(System.Console::WriteLine, ldstr("Hello, world"));
{ // Block of type 'loop'
IL_02: call(System.Console::WriteLine, ldstr("!"));
IL_03: br(IL_02) // Branch to IL_02
}
}
\end{verbatim}
Initially there is only one block containing the whole method
body and as the high-level structures are found, new smaller
blocks are created. At this stage the blocks have no function
other than grouping of code.
% Validity -- gotos
This data representation also allows code to be reordered
to better fit the ordinal high-level structure.
\subsection{Abstract Syntax Tree representation}
This is the final representation. The abstract syntax tree (AST)
is a structured way of representing the produced source code.
For example, the expression \verb|(a + 1)| would be represented
as four objects -- parenthesis, binary operator, identifier and
constant. The structure is a tree so the identifier and the constant
are children of the binary operator which is in turn child of the
parenthesis.
% The high-level structures are directly translated to nodes in the
% abstract syntax tree. For example, loop will translate to
% a node representing code \verb|for(;;) {}|.
% The bytecodes are translated to the abstract syntax tree on
% individual basis. For example, the bytecode expression \verb|add(1, 2)|
% is translated to \verb|1 + 2|. The previously seen
% \verb|call(System.Console::WriteLine, ldstr("Hello, world"))|
% is translated to \verb|System.Console.WriteLine("Hello, world")|.
This initial abstract syntax tree will be quite verbose due to
constructs that ensure correctness of the produced source code.
There will be an especially high quantity of \verb|goto|s and
parentheses. For example, a simple increment is initially
represented as \verb|((i) = (i) + (1));| where the parentheses
are used to ensure safety.
Transformations are done on the abstract syntax tree to simplify the code
in occasions where it is know to be safe.
Several further transformations are done to make the code more readable.
For example, renaming of local variables or use of common
programming idioms.
\section{Stack-based bytecode representation}
\label{Stack-based bytecode representation}
This is the first and simplest representation of the code.
It directly corresponds to the bytecode in the .NET executable.
It is the only representation which is not transformed in any way.
% Cecil
% Offset, Code, Operand
The bytecode is loaded from the .NET executable with the
help of the Cecil library.
Three fundamental pieces of information are loaded for each bytecode --
its offset from the start of the method, opcode
(name) and the immediate operand.
The immediate operand is additional argument
present for some bytecodes -- for example, for the \verb|call| bytecode,
the operand specifies the method that should be called.
% Branch target, branches here
For bytecodes that can branch, the operand is the branch target.
Each bytecode stores a reference to the branch target (if any) and
each bytecode also stores a reference to all bytecodes that can branch to it.
\subsection{Stack analysis}
Once all bytecodes are loaded, the stack behavior of the program is analyzed.
The precise state of the stack is determined for each bytecode.
The .NET Framework dictates that this must be possible for valid executables.
It is possible to determine the exact number of elements on the
stack and for each of these elements is possible to determine which bytecode
pushed it on the stack%
\footnote{In some control merge scenarios, there might be more
bytecodes that can, depending on the execution path, push the value on the stack.}
and what the type of the element is.
% Stack before / after
% StackBehaviour -- difficult for methods
There are two stack states for each bytecode -- the first is the known
state \emph{before} the bytecode is executed and the second is the state
\emph{after} the bytecode is executed. Knowing what the bytecode does,
the later state can be obviously derived from the former.
This is usually easy -- for example, the \verb|add| bytecode pops
two elements from the stack and pushes one back. The pushed element
is obviously pushed by the \verb|add| bytecode
and the element is of the same type as the two popped elements%
\footnote{The 'add' bytecode can only sum numbers of the same type.}.
The \verb|ldstr| bytecode just pushes one element of type \verb|string|.
The behaviour of the \verb|call| bytecode is slightly more complex
because it depends on the method that it is invoking.
Implementing these rules for all bytecodes is
tedious, but usually straightforward.
The whole stack analysis is performed by iterative algorithm.
The stack \emph{before} the very first bytecode is empty.
We can use this to find the stack state \emph{after} the first instruction.
The stack state \emph{after} the first bytecode is the
initial stack state for other bytecode.
We can apply this principle again and again until all states are known.
% Branching -- 1) fall 2) branch 3) fall+branch veryfy same
Branching dictates the propagation of the states between bytecodes.
For two simple consecutive bytecodes, the stack state \emph{after}
the first bytecode is same as the state \emph{before} the second.
Therefore we just copy the state.
If the bytecode is a branch then we need to copy
the stack state to the target of the branch as well.
Dead-code can never be executed and thus it does not have
any stack state associated with it.
At the bytecode level, the .NET Runtime handles boolean values
as integers. For example, the bytecode \verb|ldc.i4 1| can
be interpreted equivalently as `push \verb|1|' and `push \verb|true|'.
Therefore three integer types are used --
`zero integer', `non-zero integer' and generic `integer' (of
unknown value). Bytecode \verb|ldc.i4 1| therefore has type
`non-zero integer'.
\section{Variable-based bytecode representation}
\label{Variable-based bytecode representation}
% Removed stack-based model -- local variables, nesting
This is a next representation which differs from the stack-based
one by completely removing the stack-based data-flow model.
Data is passed between instructions using local variables
or using nested expressions.
Consider the following stack-based program which evaluates
$2 * 3 + 4 * 5$.
\begin{verbatim}
ldc.i4 2
ldc.i4 3
mul
ldc.i4 4
ldc.i4 5
mul
add
\end{verbatim}
The program is transformed to the variable-based form by
interpreting the bytecodes as functions which return values.
The result of every function call is stored in new temporary
variable so that the value can be referred to latter.
\begin{verbatim}
int temp1 = ldc.i4(2);
int temp2 = ldc.i4(3);
int temp3 = mul(temp1, temp2);
int temp4 = ldc.i4(4);
int temp5 = ldc.i4(5);
int temp6 = mul(temp4, temp5);
int temp7 = add(temp3, temp6);
\end{verbatim}
The stack analysis performed earlier is used to determine
what local variables should be used as arguments.
For example, the stack analysis tells us that the stack contains
precisely two elements just before the \verb|add| instruction.
It also tells us that these two elements have been pushed on the stack by the
\verb|mul| instructions (the third and sixth instruction respectively).
Therefore to access the results of the \verb|mul| instructions,
the \verb|add| instruction needs to use the temporary variables
\verb|temp3| and \verb|temp6|.
\subsection{Representation}
Internally, each expression (`function call') consists of three
main elements -- the opcode (e.g.\ \verb|ldc.i4|), the immediate operand
(e.g.\ \verb|1|) and the arguments (e.g.\ \verb|temp1|).
The argument can be either a reference to a local variable or
other nested expression. In the case of
\begin{verbatim}
call(System.Console::WriteLine, ldstr("Hello, world"))
\end{verbatim}
\verb|call| is the opcode,
\verb|System.Console::WriteLine| is the operand and
\verb|ldstr("Hello, world")| is the argument (nested expression).
Expressions like \verb|ldstr("Hello, world")| or \verb|ldc.i4|
do not have any arguments other then the immediate operand.
% stloc, ldloc
There are bytecodes for storing and referencing a local variables
(\verb|ldloc| and \verb|stloc| respectively).
Therefore no special data structures are needed to declare
and reference the local variables because we we can store
program like
\begin{verbatim}
int temp1 = ldc.i4(2);
int temp2 = ldc.i4(3);
int temp3 = mul(temp1, temp2);
\end{verbatim}
only using bytecodes as
\begin{verbatim}
stloc(temp1, ldc.i4(2));
stloc(temp2, ldc.i4(3));
stloc(temp3, mul(ldloc(temp1), ldloc(temp2)));
\end{verbatim}
% Dup
Local variables \verb|temp1|, \verb|temp2| and \verb|temp3|
are tagged as being `single static assignment and single read'.
This is a necessary property for the in-lining optimization.
All generated temporary variables satisfy this property
except for the ones storing the result of a \verb|dup| instruction
which does not satisfy this property because it is read twice.
\subsection{In-lining of expressions}
\label{In-lining of expressions}
The variable-based representation passes data between instructions
either using local variables or using expression nesting.
Expression nesting is the preferred more readable form.
The in-lining transformations simplifies the code by removing
some temporary variables and using expression nesting instead.
Consider the following code.
\begin{verbatim}
stloc(temp1, ldstr("Hello, world!"));
call(System.Console::WriteLine, ldloc(temp1));
\end{verbatim}
This code can be simplified into a single line by in-lining the
local variable \verb|temp1|:
(that is, the \verb|stloc| expression is nested within the
\verb|call| expression)
\begin{verbatim}
call(System.Console::WriteLine, ldstr("Hello, world!"));
\end{verbatim}
The algorithm is iterative. Two consecutive lines are considered
and if the first line assigns to a local variable which is used
in the second line then the variable is in-lined.
This is repeated until no more optimizations can be done.
There are some safety concerns with this optimizations.
The optimization would break the program if the local variable
was read later in the program.
Therefore the optimization is done only for variables that have
the property of being `single static assignment and single read'.
The second concern is change of execution order.
Consider the following pseudo-code:
\begin{verbatim}
temp1 = f();
add(g(), temp1);
\end{verbatim}
In-ling \verb|temp1| would change the order in which the
functions \verb|f| and \verb|g| are called and thus the
optimization can not be performed.
On the other hand, the following code can be optimized:
\begin{verbatim}
temp1 = f();
add(1, temp1);
\end{verbatim}
The effect of the optimization will be that the expression
\verb|1| will be evaluated before \verb|f| rather then
after it. However, this does not change the semantics
of the program because evaluation of \verb|1| does not have
any side-effects and it will still evaluate to the same value.
More importantly, the same property holds for \verb|ldloc| (load local variable)
instruction. It does not have any side-effects and it will
evaluate to the same value even if evaluated before the
function \verb|f| (or any other expression being in-lined).
This is true because the function \verb|f|
(or any other expression being in-lined)
can not change the value of the local variable.
% The only way to change a local
% variable would be using the \verb|stloc| instruction.
% However, \verb|stloc| can not be part of the in-lined
% expression because it does not return any value -
% it is a statement and as such can not be nested within an expression.
% The property holds for \verb|1|, \verb|ldloc| and it would hold
% for some other expressions as well, but due to the
% way the stack-based model was removed it is sufficient
% to consider only \verb|ldloc|. This is by far the most common case.
To sum up, it is safe to in-line a local variable if it has
property `single static assignment and single read' and
if the argument referencing it is preceded only
by \verb|ldloc| instructions.
The following code would be optimized in two iterations:
\begin{verbatim}
stloc(temp1, ldc.i4(2));
stloc(temp2, ldc.i4(3));
stloc(temp3, mul(ldloc(temp1), ldloc(temp2)));
\end{verbatim}
First iteration (in-line \verb|temp2|):
\begin{verbatim}
stloc(temp1, ldc.i4(2));
stloc(temp3, mul(ldloc(temp1), ldc.i4(3)));
\end{verbatim}
Second iteration (in-line \verb|temp1|):
\begin{verbatim}
stloc(temp3, mul(ldc.i4(2), ldc.i4(3)));
\end{verbatim}
\subsection{In-lining of `dup' instruction}
The \verb|dup| instruction can not be in-lined because
it does not satisfy the `single static assignment and single read'
property. This is because the data is referenced twice.
For example, in-lining the following code would cause
the function \verb|f| to be called twice.
\begin{verbatim}
stloc(temp1, dup(f()));
add(temp1, temp1);
\end{verbatim}
However, there are circumstances where this would be acceptable.
If the expression within the \verb|dup| instruction is a
constant then it is possible to in-line it without any harm.
The following code can be optimized:
\begin{verbatim}
stloc(temp1, dup(ldc.i4(2)));
add(temp1, temp1);
\end{verbatim}
Even more elaborate expressions like
\verb|mul(ldc.i4(2), ldc.i4(3))| are still a constant.
The instruction \verb|ldarg.0| (load \verb|this|) is also
constant relative to the single invocation of method.
\section{High-level blocks representation}
\label{High-level blocks representation}
% Links
% Link regeneration
So far the code is just a sequential list of statements.
The statements are in the exactly same order as found in the
original executable and all control-flow is achieved
only by the \verb|goto| statements.
The `high-level block' representation structure introduces
greater freedom to the code. High-level structures are recovered
and the code can be arbitrarily reordered.
The code is organized into blocks in this representation.
A block can can contain executable code as well as other nested blocks.
Therefore this resembles a tree-like data structure.
The root block of the three represents the method and contains
all of its code.
A node in the three represents some high-level data structure
(loop, \verb|if| statement). Nodes can also represent `meta'
high-level structures which are used during the optimizations.
For example, node can encapsulate a block of code that is known
to be acyclic (without loops).
Finally, all leaves in the tree are basic blocks.
\subsection{Safety}
This data representation allows the code to be restructured
and reordered. Therefore care is needed to make sure that
the produced code is semantically correct.
% Validity -- arbitrary nodes
It would be possible to make sure that every transformation is
valid. However, there is even more robust solution --
give up at the start and allow arbitrary transformations.
The only constraint is that the code can be only moved. It cannot
be duplicated and it cannot be deleted. Deleting of code would
indeed cause problems. Anything else is permitted.
When the code is generated, the correctness is ensured by
explicit \verb|label|s and \verb|goto|s placed in front
of and after every basic block respectively. For example
consider the follow piece of code in which the order of
basic blocks was reversed and two unnecessary high-level
nodes were added:
\begin{verbatim}
goto BasicBlock1; // Method prologue
for(;;) {
BasicBlock2:
Console.WriteLine("world");
goto End;
}
if (true) {
BasicBlock1:
Console.WriteLine("Hello");
goto BasicBlock2;
}
End:
\end{verbatim}
The inclusion of explicit \verb|label|s and \verb|goto|s
makes the high-level structures and the order of basic blocks
irrelevant and thus the produced source code will be correct.
Note that in the code above the lines \verb|for(;;) {| and
\verb|if (true) {| will never be reached during execution.
In general, the transformations done on the code will be
more sane then in the example above and most of the
\verb|label|s and \verb|goto|s will be redundant.
In the following code all of the \verb|label|s and \verb|goto|s
can be removed:
\begin{verbatim}
goto BasicBlock1; // Method prologue
BasicBlock1:
Console.WriteLine("Hello");
goto BasicBlock2;
BasicBlock2:
Console.WriteLine("world");
goto End;
End:
\end{verbatim}
The removal of \verb|label|s and \verb|goto|s is done once
the Abstract Syntax Tree is created.
To sum up, finding and correctly identifying all high-level
structures will produce nice results, but failing to
do so will not cause any harm as far as correctness is concerned.
\subsection{Tree data structure}
The data structure to store the tree is very simple --
the tree consists of nodes where each node contains a link
to its parent and a list of children. A child can be either
another node or a basic block (leaf).
The API of the tree structure is severely restricted.
Once the tree is created, it is only possible to add new empty
nodes and to move basic blocks from one node to other.
This ensures that basic blocks can not be duplicated or
deleted which is the safety requirement as discussed in the
previous section.
All transformations are limited by this constraint.
Each node also provides events which are fired whenever
the child collection of the node is changed.
\subsection{Creating basic blocks}
The previous data representation stored the code as a sequence
of statements. Some of these statements will be always
executed sequentially without interference of branching.
In control-flow analysis we are primarily interested in
branching and therefore it is desirable to group such
sequential parts of code together to basic blocks.
Basic block is a block of code that is always guaranteed to
be executed together without any branching. Basic blocks
can be easily found by determining which statements start
a basic block. The very first statement in a method
naturally starts the first basic block. Branching transfers
control somewhere else so the statement immediately following
a branch is a start of a basic block.
Similarly, the target of a branch command is a start of
basic block.
Using these three simple rules the whole method body
is slit into a few basic blocks.
It is desirable to store control-flow links between basic blocks.
These links are used for finding high-level structures.
Each basic block has \emph{successors} and \emph{predecessors}.
\emph{Successor} is an other basic block that may be executed
immediately after this one.
There can be up to two \emph{successors} -- the following basic
block and the basic block being branched to. Both of these
exist if the basic block ends with conditional branch.
% If the condition is false, the control falls though to the
% following block and if the condition is true the branch is taken.
\emph{Predecessor} is just link in the opposite direction
of the \emph{successor} link. Note that the number of
\emph{predecessors} is not limited -- several \verb|goto|s
can branch to the same location.
\subsection{Control-flow links between nodes}
The \emph{predecessor} and \emph{successor} links of basic
blocks reflect the control flow between basic blocks.
However, once basic blocks are `merged' by moving them
into a node, we might be interested in flow properties
of the whole node rather then just the individual basic blocks.
Consider two sibling nodes that represent two loops.
In other to put the nodes in correct order, we need to know
which one is \emph{successor} of the other.
Therefore the \emph{predecessor} and \emph{successor} links
apply to whole nodes as well. Consider two sibling nodes $A$ and $B$.
Node $B$ is a \emph{successor} of node $A$ if there exists
a basic block within node $A$ that branches to a basic block
within node $B$.
The algorithm to calculate \emph{successors} of a node is
as flows:
\begin{itemize}
\item Get a list of all basic blocks that are children of
node $A$. The node can contain other nested nodes so this part
needs to be performed recursively.
\item Get a list of all succeeding basic blocks of node $A$.
This is the union of all successors for all basic blocks
within the node.
\item However, we do not want succeeding basic blocks,
we want a succeeding siblings of the node. Therefore,
for each succeeding basic block traverse the data structure
up until a sibling node is reached.
\end{itemize}
The algorithm to calculate \emph{predecessors} of a node is similar.
The \emph{predecessor} and \emph{successor} links are used
extensively and it would be computationally expensive to
recalculate them every time they are needed. Therefore the
links and intermediate results of the calculation are cached.
The events of the tree structure are used to invalidate relevant
caches.
\subsection{Finding loops}
\label{Finding loops}
The loops are found by using the T1-T2 transformations which where
discussed in the preparation chapter.
There are two node types that directly correspond to the results
of these two transformations. Preforming a T1 transformation
produces a node of type \emph{Loop} and performing a T2 transformation
produces a node of type \emph{AcyclicGraph}.
The algorithm works by considering all the nodes and basic blocks
in the tree individually and evaluating the conditions required for the
transformations.
If transformation can be performed for a given node, it is immediately
preformed and the algorithm is restarted on the new transformed graph.
The condition for the T1 transformation (\emph{Loop}) is that the node
must a be self-loop. The condition for the T2 transformation
(\emph{AcyclicGraph}) is that the node must have only one predecessor.
By construction each \emph{AcyclicGraph} node contains exactly
two nodes. For example, five sequential statements would be
represented as four nested \emph{AcyclicGraph} nodes.
This makes the tree more complex and more difficult to analyse.
The acyclic property is only required for loop finding and is not
needed for anything else later on. Therefore once loops are found,
the \emph{AcyclicGraph} nodes are flattened (the children of the node
are moved to its parent and the node is removed).
Note that the created loops are trivial -- the \emph{Loop} node
represents an infinite loop without initializer, condition or
iterator. The specifics of the loop are not represented
in this tree data structure. However, they are eventually
recovered in the abstract syntax tree phase of the decompiler.
\subsection{Finding conditionals}
\label{Finding conditionals}
% Reachable sets
In order to recover conditional statements three new node types
are introduced: \emph{Condition}, \emph{Block} and \emph{IfStatement}.
\emph{Condition} represents a piece of code that is guaranteed to
branch into one of two locations. \emph{Block} has no special
semantics and merely groups several nodes into one.
\emph{IfStatement} represents the whole conditional including
the condition and two bodies.
The \emph{IfStatement} node is still a genuine node in the tree
data structure. However, subtyping is used to restrict its children.
The \emph{IfStatement} node must have precisely three children
of types \emph{Condition}, \emph{Block} and \emph{Block} representing
the condition, `true' body and `false' body respectively.
This defines the semantics of the \emph{IfStatement} node.
The algorithm starts by encapsulating all two-successors basic blocks
with \emph{Condition} node. Two-successor basic blocks are conditional
branches and thus they are conditions of \verb|if| statements.
Note that the bytecode \verb|br| branches unconditionally and thus
is it not a condition of an \verb|if| statement (basic blocks ending
with \verb|br| have only one successor).
The iterative part of the algorithm is to look for free-standing
\emph{Condition} nodes and encapsulate them with \emph{IfStatement} nodes.
This involves finding the `true' and `false' bodies for the if statements.
The `true' and `false' bodies of an \verb|if| statement are found by
considering reachability as discussed in the preparation chapter.
Nodes reachable only by following the `true' branch define the `true'
body of the \verb|if| statement. Similarly for the `false' body.
The rest of the nodes is the set reachable from both and it is not
part of the \verb|if| statement -- it is the code following the
\verb|if| statement. Note that these three set are disjoint.
The reachable nodes nodes are found by iterative algorithm in
which successors are added to a collection until the collection
does not change anymore.
Nodes that are reachable \emph{only} from the `true' branch are
obtained by taking the set reachable from `true' branch and removing
nodes reachable from the `false' branch.
Consider the special case where there are no nodes reachable from both
the `true' and `false' branches. In this case, there is no code
following the \verb|if| statement. Such code might look like this:
\begin{verbatim}
if (condition) {
return null;
} else {
return result;
}
\end{verbatim}
In this special case the `false' body is moved outside the \verb|if|
statement producing more compact code:
\begin{verbatim}
if (condtion) {
return null;
}
return result;
\end{verbatim}
\subsection{Short-circuit conditionals}
% Short-circuit
Consider condition like \verb|a & b| (\verb|a| and \verb|b|).
This or even more complex expressions compile into a code without
any branches. Therefore the expression will be contained within
a single basic block and thus the approach described so far
would work well.
Conditions like \verb|a && b| are more difficult to handle
(\verb|a| and \verb|b|, but evaluate \verb|b| only if \verb|a|
is true). These short-circuit conditionals introduce additional
branching and thus they are more difficult to decompile.
% The short-circuit conditionals will manifest themselves as
% one of four flow graphs. See figure \ref{impl_shortcircuit}
% (repeated from the preparation chapter).
% Implementation
% \begin{figure}[tbh]
% \centerline{\includegraphics{figs/shortcircuit.png}}
% \caption{\label{impl_shortcircuit}Short-circuit control-flow graphs}
% \end{figure}
The expression \verb|a| is always evaluated first.
The flow graph is characterized by two things -- in which case the
expression \verb|b| is evaluated and which path is taken if the
expression \verb|b| is not evaluated.
The \emph{Condtition} node was previously defined as a piece of
code that is guaranteed to branch into one of two locations.
This still holds for the flow graphs above. If two nodes
that evaluate \verb|a| and \verb|b| are considered together,
they still define a proper condition for an \verb|if| statement.
We define a new node type \emph{ShortCircuitCondition}
which is a special case of \emph{Condition}.
This node has exactly two children (expressions \verb|a| and
\verb|b|) and a meta-data field which specifies which one of
the four short-circuit patters it represents. This is for
convenience so that the pattern matching does not have to be
done again in the future.
The whole program is repeatedly searched and if one of the four patters
is found, the two nodes are merged into a \emph{ShortCircuitCondition} node.
This new node represents the combined expression (e.g.\ \verb|a && b|).
Since it is a single node is eligible to fit into the pattern
again and thus nested short-circuit expressions will be also
found (e.g.\ \verb/(a && b) || (c && d)/).
Note that this is an extension to the previous algorithm rather
then a new one. It needs to be preformed just before
\emph{Condition} nodes are encapsulated in \emph{IfStatement} nodes.
\section{Abstract Syntax Tree representation}
\label{Abstract Syntax Tree representation}
Abstract syntax tree (AST) is the last representation of the program.
It very closely corresponds to the final textual output.
\subsection{Data structure}
External library is used to hold the abstract syntax tree.
Initially the \emph{CodeDom} library that ships with the
\emph{.NET Framework} was used, but it was soon abandoned due
to lack of support for many crucial C\# features.
It was replaced with \emph{NRefactory} which is an open-source
library used in SharpDevelop for parsing and representation of
C\# and VB.NET files.
It focuses exclusively on these two languages and
thus provides excellent support for them.
NRefactory allows perfect control over the generated code.
Parsing arbitrary C\# file and then generating it again
will almost exactly reproduce the original text. Whitespace
would be reformated but anything else is explicitly represented
in NRefactory's abstract syntax tree.
For example, \verb|(a + b)| and \verb|a + b| have
different representations and so do \verb|if (a) return;| and
\verb|if (a) { return; }|.
The decompiler aims to generate not only correct code, but also as
readable code as possible. Therefore this precise control is desirable.
\subsection{Generating skeleton}
The first task is to generate code skeleton that is
equivalent to the one in the executable.
The \emph{Cecil} library helps with reading of the executable
meta-data and thus it is not necessary to work directly with
the binary format of the executable.
The following four object types can be found in .NET executables:
classes, structures, interfaces and enumerations.
For the purpose of logical organization these can be nested
within namespaces and, in some cases, nested within each other.
Classes can be inherited and may implement one or more interfaces.
Depending on the type, the objects can contain fields,
properties, events and methods.
Specific set of modifies can be applied to almost everything.
Handling all these cases is tedious, but usually straightforward.
\subsection{Generating method bodies}
\label{Generating method bodies}
Once the skeleton is created, high-level blocks and the bytecode
expressions contained within them are translated to
abstract syntax tree. For example, the \emph{Loop} node is translated
to a \emph{ForStatement} in the AST and the bytecode expression \verb|add|
is translated to \emph{BinaryOperatorExpression} in the AST.
In terms of lines of code, this is by far the largest part of the
decompiler. The reason for this is that there are many bytecodes
and every bytecode needs to have its own specific translation code.
Conceptually\footnote{In reality there are more functions
that are loosely coupled.},
the translation code consists of three main functions:
\emph{TranslateMethodBody}, \emph{TranslateHighLevelNode} and
\emph{TranslateBytecodeExpression}.
All of these return complete abstract syntax tree for the given input.
\emph{TranslateMethodBody} is the simples function which encapsulates
the whole algorithm. It takes tree of high-level blocks as input and
produces the complete abstract syntax tree of the method.
Internally, it calls \emph{TranslateHighLevelNode} for each top-level
node and then merely concatenates the results.
\emph{TranslateHighLevelNode} translates one particular high-level
node into abstract syntax tree. There are several types of high-level
nodes and each of them needs to considered individually.
If the high-level node has child nodes then this function is
called recursively. Some comments about particular node types:
The \emph{BasicBlock} node needs to be preceded by explicit label
and succeeded by explicit \verb|goto| in order to ensure safety.
The \emph{Loop} node is easy to handle since it is just an infinite loop.
The \emph{IfStatement} node is most difficult to handle since it
needs to handle creation of the condition for the \verb|if| statement.
Remember, that the condition can include short-circuit booleans and
that these can be arbitrarily nested.
\emph{TranslateBytecodeExpression} is the lowest level function.
It translates the individual bytecode expressions.
Note that bytecode expression may have other bytecode expressions
as arguments and therefore this function is often called recursively.
Getting abstract syntax tree for all arguments is in fact the
first thing that is done. After that these arguments are combined in
a way that is appropriate for the given bytecode and operand.
This logic is completely different for each bytecode.
Simple example is the \verb|add| bytecode --
in this case the arguments end up being children of
\emph{BinaryOperatorExpression} (which is an AST node).
At this point we have the abstract syntax tree representing the
bytecode expression.
The expression is further encapsulated by parenthesis for safety
and the type of the expression is recoded in the AST meta-data.
Unsupported bytecodes are handled gracefully and are represented as
function calls. For example if the \verb|add| bytecode would
not be implemented, it would be outputed as \verb|IL__add(1, 2)|
rather then \verb|1 + 2|.
Recoding the type of AST is useful so that it can be converted
depending on a context. If AST of type \verb|IntegerOne| is
used in context where \verb|bool| is expected, the AST can
be simply replaced by \verb|true|.
Here is a list of bytecodes whose translation into abstract syntax
tree has been implemented
(star is used to represent several bytecodes with similar names):
Arithmetic: \texttt{add* div* mul* rem* sub* and xor shl shr* neg not} \\
Arrays: \texttt{newarr ldlen ldelem.* stelem.*} \\
Branching: \texttt{br brfalse brtrue beq bge* bgt* ble* blt* bne.un} \\
Comparison: \texttt{ceq cgt* clt*} \\
Conversions: \texttt{conv.*.*} \\
Object-oriented: \texttt{newobj castclass call callvirt} \\
Constants: \texttt{ldnull ldc.* ldstr ldtoken} \\
Data access: \texttt{ldarg ldfld stfld ldsfld stsfld ldloc stloc} \\
Miscellaneous: \texttt{nop dup ret}
The complexity of implementing a translation for a bytecode varies
between a single line (\verb|nop|) to over a page of code (\verb|callvirt|).
\subsection{Optimizations}
The produced abstract syntax tree represents a fully functional program.
That is, it can be pretty printed and the generated source code would
compile and work without problems.
However, the produced source code is still quite verbose and
therefore several optimizations are done to simplify it or make it `nicer'.
The optimizations are implemented by using the visitor pattern
provided by the NRefactory library. For each optimization
a new class is created which inherits from the \emph{AstVisitor}
class. Instance of this class is created and it is applied to
the root of the abstract syntax tree.
As a result of this, NRefactory will traverse the whole abstract
syntax tree and notify the visitor (the optimization) about
every element it encounters. For example, whenever a \verb|goto|
statement is encountered, the method \verb|VisitGotoStatement|
will be invoked on the visitor. If the optimization wants to
act upon this, it has to override the \verb|VisitGotoStatement| method.
The abstract syntax tree can be modified during the traversal.
For example, when the \verb|goto| statement is encountered,
the visitor (the optimization) can modify it, replace it or simply
remove it.
\subsubsection{Removal of dead labels}
\label{Removal of dead labels}
This is not the first optimization to be executed, but it is the
simplest one and therefore it is explained first.
The purpose of this optimization is to remove dead labels.
That is, to remove all labels that are not referenced by one
or more \verb|goto| statements.
The optimization uses two visitors. The first visitor overrides
the \verb|VisitGotoStatement| method and is thus notified about
all \verb|goto| statements in the method. The body of the overridden
\verb|VisitGotoStatement| method records the name of the label
being targeted and puts in into an `alive' list.
The second visitor overrides the \verb|VisitLabelStatement|
method and is thus notified about all labels. If the label
is not in the `alive' list, it is removed.
Note that all optimizations are done on per-method basis
and therefore we do not have to worry about name clashes
of labels from different methods.
\subsubsection{Removal of negations}
\label{Removal of negations}
In some cases negations can be removed by using the rules of
logic. Negations are represented as \emph{UnaryOperatorExpression}
in the abstract syntax tree.
When a negation is encountered, it is matched against the following
patterns and simplified if possible.
\verb|!((x)) = !(x)| \\
\verb|!!(x) = (x)| \\
\verb|!(a > b) = (a <= b)| \\
\verb|!(a >= b) = (a < b)| \\
\verb|!(a < b) = (a >= b)| \\
\verb|!(a <= b) = (a > b)| \\
\verb|!(a == b) = (a != b)| \\
\verb|!(a != b) = (a == b)| \\
\verb:!(a & b) = (!(a) | !(b)): \\
\verb:!(a && b) = (!(a) || !(b)): \\
\verb:!(a | b) = (!(a) & !(b)): \\
\verb:!(a || b) = (!(a) && !(b)): \\
\subsubsection{Removal of parenthesis}
\label{Removal of parenthesis}
% C precedence and associativity
To ensures safety, the abstract syntax tree will contain extensive
amount of parenthesis. Even simple increment would be expressed
as \verb|((i) = ((i) + (1)))|.
This optimization removes parenthesis where it is known to be safe.
Parenthesis can be removed from primitive values (\verb|(1)|),
identifiers (\verb|(i)|) and already parenthesized expressions
(\verb|((...))|).
Several parenthesis can also be removed due to the unambiguous
context they are in. This includes cases like
\verb|return (...);|, \verb|array[(...)]|, \verb|if((...))|,
\verb|(...);| and several others.
The rest of the cases is governed by the C\# precedence
and associativity rules. There are 15 precedence groups in
C\# and expressions within the same group are left-associative%
\footnote{Only assignment and conditional operator (?:)
are right associative, but these are never generated in nested
form by the decompiler.}.
The optimization implements a function \verb|GetPrecedence| that
returns the precedence for the given expression as an integer.
Majority of expressions are binary expressions. When a binary
expression is encountered, precedence is calculated for it
and both of its operands.
For example consider \verb|(a + b) - (c * d)|.
The precedence of the main binary expression is 12 and the
precedences of the operands are 12 and 13 respectively.
Since the right operand has higher precedence then the binary
expression itself, its parenthesis can be removed.
The left operand has the same precedence. However, in this special
case (left operand, same precedence), parenthesis can be removed
as well due to left-associativity of C\#.
Similar, but simpler, logic applies to unary expressions.
\subsubsection{Removal of `goto' statements}
\label{Removal of `goto' statements}
% Goto X; labe X: ;
% Replacing with break/continue
This is the most complex optimization performed on the
abstract syntax tree. Its purpose is the remove or replace
\verb|goto| statements.
In the following case it is obvious that the \verb|goto| can be removed:
\begin{verbatim}
// Code
goto Label;
Label:
// Code
\end{verbatim}
However, in general, more complex scenarios can occur.
It is not immediately obvious that the \verb|goto| statement
can be removed in the following code:
\begin{verbatim}
if (condition1) {
if (condition2) {
// Code
goto Label;
} else {
// Code
}
}
for(;;) {
for(;;) {
Label:
// Code
}
}
\end{verbatim}
This optimization is based on simulation of execution under
various conditions.
There are three important questions that the simulation answers: \\
What would happen if the \verb|goto| was replaced by no-op (i.e.\ removed)? \\
What would happen if the \verb|goto| was replaced by \verb|break|? \\
What would happen if the \verb|goto| was replaced by \verb|continue|?
In the first case, the \verb|goto| is replaced by
no-op and the simulator is started at that location.
The simulation will traverse the AST tree structure until
it finds the next executable line of code or an label.
If the reached location is the same as the one that would be reached
by following the original \verb|goto| then the original \verb|goto|
can be removed.
Similar simulation runs are performed by replacing the
\verb|goto| by \verb|break| and \verb|continue|.
After this optimization is done, it is desirable to remove the dead labels.
\subsubsection{Simplifying loops}
\label{Simplifying loops}
The \verb|for| loop statement has the following form in C\#:
\begin{verbatim}
for(initializer; condition; iterator) { body };
\end{verbatim}
The semantics of the \verb|for| loop statement is:
\begin{verbatim}
initializer;
for(;;) {
if (!condition) break;
body;
iterator;
}
\end{verbatim}
So far the decompiler generated only infinite loops in form
\verb|for(;;)|. Now it is time to make the loops more interesting.
This optimization does pattern matching on the code and tries
to identify the initializer, condition or iterator.
If the loop is preceded by something like \verb|int i = 0;| then it is most
likely the initializer of the loop and it will be pushed into
the \verb|for| statement producing \verb|for(int i = 0;;)|.
If it is not really the initializer, it does not matter --
semantics is same.
If the first line of the body is \verb|if (not_condition) break;|,
it is assumed to be the condition for the loop and it is pushed
in to produce \verb|for(;!not_condition;)|.
If the last line of the body is \verb|i = ...| or \verb|i++|,
it is assumed to be the iterator.
Consider the following code:
\begin{verbatim}
int i = 0;
for(;;) {
if (i >= 10) break;
// body
i++;
}
\end{verbatim}
Performing this optimization will simplify the code to:
\begin{verbatim}
for(int i = 0; !(i >= 10); i++) {
// body
}
\end{verbatim}
This is then further simplified by the `removal of negation'
optimization to:
\begin{verbatim}
for(int i = 0; i < 10; i++) {
// body
}
\end{verbatim}
\subsubsection{Further optimizations}
\label{Further optimizations}
Several further optimizations are done which are briefly
discussed here.
% Rename variables -- i, j,k ...
Integer local variables are renamed from the more generic names
like \verb|V_0| to \verb|i|, \verb|j|, \verb|k|, and so on.
Boolean local variables are renamed to \verb|flag1|, \verb|flag2|
and so on.
% Type names
C\# aliases for types are used. For example, \verb|System.Int32|
is replaced with \verb|int|. C\# \verb|using| (\verb|import| in
java) is used to simplify \verb|System.Console| to just \verb|Console|.
Types in scope of same namespace also do not have to be
fully qualified.
% Remove `this' reference
Explicit \verb|this.field| is simplified to just \verb|field|.
% Idioms -- i++
The increments in form \verb|i = i + 1| are simplified to \verb|i++|.
The increments in form \verb|i = i + 2| are simplified to \verb|i += 2|.
Analogous forms hold for decrements.
% Idioms -- string concat, i++
String concatenation \verb|String.Concat("Hello", "world")|
can be expressed in C\# just as \verb|"Hello" + "world"|.
Sometimes `false' body of an \verb|if| statement can end up being
empty and can be harmlessly removed.
Some control-flow commands are sometimes redundant since the
implicit behavior is equivalent. For example, \verb|return;|
at the end of method is redundant and \verb|continue;| at
the end of loop is redundant.
\subsection{Pretty Printing}
Pretty printing%
\footnote{Pretty printing is the act of converting abstract
syntax tree into a textual form.}
is provided as part of the NRefactory library and therefore does not
need to be implemented as part of the decompiler.
The NRefactory provides provides several options that control
the format of the produced source code (for example whether an opening
curly brace is placed on the same line as \verb|if| statement or on
the following line).
NRefactory is capable to output the code both in C\# and in VB.NET.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% 10 Pages
\cleardoublepage
\chapter{Evaluation}
\newcommand{\impref}[1]{See section \emph{\ref{#1} #1} on page \pageref{#1} for reference.}
\section{Success Criteria}
The principal criterion was that the quick-sort algorithm must successfully
decompile. This goal was achieved.
The decompiled source code is, in fact, almost identical to the original.
This is demonstrated at the end of the following section.
\section{Evolution of quick-sort}
This section demonstrates how the quick-sort algorithm is decompiled.
Four different representations are used during the decompilation process.
The output of each of these representations is shown.
Unless otherwise specified the shown pieces of code are
verbatim outputs of the decompiler.
\subsection{Original source code}
To save space we will initially consider only the \verb|Main| method of
the quick-sort algorithm. It parses textual arguments,
performs the quick-sort and then prints the result.
\begin{verbatim}
public static void Main(string[] args)
{
int[] intArray = new int[args.Length];
for (int i = 0; i < intArray.Length; i++) {
intArray[i] = int.Parse(args[i]);
}
QuickSort(intArray, 0, intArray.Length - 1);
for (int i = 0; i < intArray.Length; i++) {
System.Console.Write(intArray[i].ToString() + " ");
}
}
\end{verbatim}
This code is compiled with all optimizations turn on and its
binary form is considered is the following sections.
\subsection{Stack-based bytecode representation}
\impref{Stack-based bytecode representation}
This is a snippet of the bytecode as loaded from the executable.
This part represents the line \verb|intArray[i] = int.Parse(args[i]);|.
\verb|V_0| is \verb|intArray| and \verb|V_1| is \verb|i|.
Comments denote the stack behaviour of the bytecodes.
The (manually) stared bytecodes represent just the part
\verb|int.Parse(args[i]);|.
\begin{verbatim}
IL_0F: ldloc V_0 # Pop0->Push1
IL_10: ldloc V_1 # Pop0->Push1
IL_11: ldarg args * # Pop0->Push1
IL_12: ldloc V_1 * # Pop0->Push1
IL_13: ldelem.ref * # Popref_popi->Pushref
IL_14: call Parse() * # Varpop->Varpush Flow=Call
IL_19: stelem.i4 # Popref_popi_popi->Push0
\end{verbatim}
The used bytecodes are \verb|ldloc| (load local variable),
\verb|ldarg| (load argument), \verb|ldelem| (get array element),
\verb|call| (call method) and \verb|stelem| (set array element).
Here is the same snippet again after the stack analysis
has been performed.
\begin{verbatim}
// Stack: {}
IL_0F: ldloc V_0 # Pop0->Push1
// Stack: {IL_0F}
IL_10: ldloc V_1 # Pop0->Push1
// Stack: {IL_0F, IL_10}
IL_11: ldarg args # Pop0->Push1
// Stack: {IL_0F, IL_10, IL_11}
IL_12: ldloc V_1 # Pop0->Push1
// Stack: {IL_0F, IL_10, IL_11, IL_12}
IL_13: ldelem.ref # Popref_popi->Pushref
// Stack: {IL_0F, IL_10, IL_13}
IL_14: call Parse() # Varpop->Varpush Flow=Call
// Stack: {IL_0F, IL_10, IL_14}
IL_19: stelem.i4 # Popref_popi_popi->Push0
// Stack: {}
\end{verbatim}
For example, we can see that bytecode \verb|IL_13: ldelem.ref| pops
the last two elements from the stack -- \verb|{IL_11, IL_12}|.
These will hold the values \verb|args| and \verb|V_1| respectively.
The bytecode \verb|IL_19: stelem.i4| pops three values.
\subsection{Variable-based bytecode representation}
\impref{Variable-based bytecode representation}
This is the same snippet after the stack-based data-flow
is replaced with variable-based one:
(code manually simplified for clarity)
\begin{verbatim}
expr0D = ldloc(V_0);
expr0E = ldloc(i);
expr0F = ldarg(args);
expr10 = ldloc(i);
expr11 = ldelem.ref(expr0F, expr10);
expr12 = call(Parse(), expr11);
stelem.i4(expr0D, expr0E, expr12);
\end{verbatim}
Here is the actual unedited output:
\begin{verbatim}
stloc(expr0D, ldloc(V_0));
stloc(expr0E, ldloc(i));
stloc(expr0F, ldarg(args));
stloc(expr10, ldloc(i));
stloc(expr11, ldelem.ref(ldloc(expr0F), ldloc(expr10)));
stloc(expr12, call(Parse(), ldloc(expr11)));
stelem.i4(ldloc(expr0D), ldloc(expr0E), ldloc(expr12));
\end{verbatim}
\subsubsection{In-lining of expressions}
\impref{In-lining of expressions}
Here is the same snippet after two in-lining optimizations.
(this is actual output -- the GUI of the decompiler allows performing
of optimizations step-by-step)
\begin{verbatim}
stloc(expr0D, ldloc(V_0));
stloc(expr0E, ldloc(i));
stloc(expr11, ldelem.ref(ldarg(args), ldloc(i)));
stloc(expr12, call(Parse(), ldloc(expr11)));
stelem.i4(ldloc(expr0D), ldloc(expr0E), ldloc(expr12));
\end{verbatim}
All posible in-lines are performed:
\begin{verbatim}
stelem.i4(ldloc(V_0), ldloc(i),
call(Parse(), ldelem.ref(ldarg(args), ldloc(i))));
\end{verbatim}
\subsection{Abstract Syntax Tree representation (part 1)}
\impref{Abstract Syntax Tree representation}
Let us omit the optimization of high-level structures
for this moment and skip directly to the Abstract Syntax Tree.
After transforming a snippet of code to variable-based bytecode
and then performing the in-lining optimization we have obtained
\begin{verbatim}
stelem.i4(ldloc(V_0), ldloc(i),
call(Parse(), ldelem.ref(ldarg(args), ldloc(i))));
\end{verbatim}
Let us see the same result for larger part of the method:
\begin{verbatim}
BasicBlock_1: stloc(V_0, newarr(System.Int32,
conv.i4(ldlen(ldarg(args)))));
BasicBlock_2: stloc(i, ldc.i4(0));
BasicBlock_3: goto BasicBlock_6;
BasicBlock_4: stelem.i4(ldloc(V_0), ldloc(i), call(Parse(),
ldelem.ref(ldarg(args), ldloc(i))));
BasicBlock_5: stloc(i, @add(ldloc(i), ldc.i4(1)));
BasicBlock_6: if (ldloc(i) < conv.i4(ldlen(ldloc(V_0))))
goto BasicBlock_4;
BasicBlock_7: call(QuickSort(), ldloc(V_0), ldc.i4(0),
sub(conv.i4(ldlen(ldloc(V_0))), ldc.i4(1)));
\end{verbatim}
This code includes the first \verb|for| loop of the \verb|Main|
method. Note that \verb|br| and \verb|blt| were already
translated to \verb|goto| and \verb|if (...) goto|.
\subsubsection{Translation of bytecodes to C\# expressions}
\impref{Generating method bodies}
Let us translate just the bytecodes \verb|ldloc|, \verb|ldarg|
and \verb|ldc.i4|.
\begin{verbatim}
BasicBlock_1: stloc(V_0, newarr(System.Int32,
conv.i4(ldlen((args)))));
BasicBlock_2: stloc(i, (0));
BasicBlock_3: goto BasicBlock_6;
BasicBlock_4: stelem.i4((V_0), (i), call(Parse(),
ldelem.ref((args), (i))));
BasicBlock_5: stloc(i, @add((i), (1)));
BasicBlock_6: if ((i) < conv.i4(ldlen((V_0)))) goto BasicBlock_4;
BasicBlock_7: call(QuickSort(), (V_0), (0),
sub(conv.i4(ldlen((V_0))), (1)));
\end{verbatim}
This is a small improvement. Note the safety
parenthesis that were inserted during the translation
(e.g.\ \verb|(0)|, \verb|(args)|).
After all translations are performed the code will look like this:
\begin{verbatim}
BasicBlock_1: System.Int32[] V_0 =
(new int[((int)((args).Length))]);
BasicBlock_2: int i = (0);
BasicBlock_3: goto BasicBlock_6;
BasicBlock_4: ((V_0)[(i)] = (System.Int32.Parse(((args)[(i)]))));
BasicBlock_5: (i = ((i) + (1)));
BasicBlock_6: if ((i) < ((int)((V_0).Length))) goto BasicBlock_4;
BasicBlock_7: (QuickSortProgram.QuickSort((V_0), (0),
(((int)((V_0).Length)) - (1))));
\end{verbatim}
\subsubsection{Removal of parenthesis}
\impref{Removal of parenthesis}
C\# rules of precedence are used to remove unnecessary parenthesis:
\begin{verbatim}
BasicBlock_1: System.Int32[] V_0 = new int[(int)args.Length];
BasicBlock_2: int i = 0;
BasicBlock_3: goto BasicBlock_6;
BasicBlock_4: V_0[i] = System.Int32.Parse(args[i]);
BasicBlock_5: i = i + 1;
BasicBlock_6: if (i < (int)V_0.Length) goto BasicBlock_4;
BasicBlock_7: QuickSortProgram.QuickSort(V_0, 0, (int)V_0.Length - 1);
\end{verbatim}
\subsubsection{Removal of dead labels}
\impref{Removal of dead labels}
Unreferenced labels are removed.
\begin{verbatim}
System.Int32[] V_0 = new int[(int)args.Length];
int i = 0;
goto BasicBlock_6;
BasicBlock_4: V_0[i] = System.Int32.Parse(args[i]);
i = i + 1;
BasicBlock_6: if (i < (int)V_0.Length) goto BasicBlock_4;
QuickSortProgram.QuickSort(V_0, 0, (int)V_0.Length - 1);
\end{verbatim}
\subsubsection{Further optimizations}
\impref{Further optimizations}
The remainder of applicable AST optimizations is performed.
\begin{verbatim}
int[] V_0 = new int[args.Length];
int i = 0;
goto BasicBlock_6;
BasicBlock_4: V_0[i] = Int32.Parse(args[i]);
i++;
BasicBlock_6: if (i < V_0.Length) goto BasicBlock_4;
QuickSortProgram.QuickSort(V_0, 0, V_0.Length - 1);
\end{verbatim}
\subsection{High-level blocks representation}
\impref{High-level blocks representation}
\subsubsection{Initial state}
This representation is based on tree structure. However,
initially the tree is flat with all basic blocks being siblings.
These are the first four basic blocks in the \verb|Main| method.
The relations between the nodes are noted in the brackets
(e.g.\ basic block 1 has one successor which is basic block number 2).
See figure \ref{QuickSortMain} for graphical representation of
these four basic blocks.
\begin{verbatim}
// BasicBlock 1 (Successors:2 Parent:0)
BasicBlock_1:
int[] V_0 = new int[args.Length];
int i = 0;
goto BasicBlock_2;
// BasicBlock 2 (Predecessors:1 Successors:4 Parent:0)
BasicBlock_2:
goto BasicBlock_4;
\end{verbatim}
\begin{verbatim}
// BasicBlock 3 (Predecessors:4 Successors:4 Parent:0)
BasicBlock_3:
V_0[i] = Int32.Parse(args[i]);
i++;
goto BasicBlock_4;
// BasicBlock 4 (Predecessors:3,2 Successors:5,3 Parent:0)
BasicBlock_4:
if (i < V_0.Length) goto BasicBlock_3;
goto BasicBlock_5;
\end{verbatim}
\begin{figure}[tbh]
\centerline{\includegraphics{figs/QuickSortMain.png}}
\caption{\label{QuickSortMain}Loop in the `Main' method of quick-sort}
\end{figure}
\subsubsection{Finding loops and conditionals}
\impref{Finding loops}
Loops are found by repeatedly applying the T1-T2 transformations.
In this case the first transformation to be performed will be
a T2 transformation on basic blocks 1 and 2.
This will result in this `tree':
\begin{verbatim}
// AcyclicGraph 10 (Successors:4 Parent:0)
AcyclicGraph_10:
{
// BasicBlock 1 (Successors:2 Parent:10)
BasicBlock_1:
int[] V_0 = new int[args.Length];
int i = 0;
goto BasicBlock_2;
// BasicBlock 2 (Predecessors:1 Parent:10)
BasicBlock_2:
goto BasicBlock_4;
}
// BasicBlock 3 (Predecessors:4 Successors:4 Parent:0)
BasicBlock_3:
V_0[i] = Int32.Parse(args[i]);
i++;
goto BasicBlock_4;
// BasicBlock 4 (Predecessors:3,10 Successors:5,3 Parent:0)
BasicBlock_4:
if (i < V_0.Length) goto BasicBlock_3;
goto BasicBlock_5;
\end{verbatim}
Note how the links changed. Basic blocks 1 and 2 are now considered as a
single node 10 -- the predecessor of basic block 4 is now 10 rather then 2.
Links within the node 10 are limited just to the scope of that node.
Several further transformations are performed until all loops are found.
Conditional statements are identified as well.
This code has one conditional statement which
does not have any code in the bodies other then the \verb|goto|s.
\newpage
\begin{verbatim}
// BasicBlock 1 (Successors:2 Parent:0)
BasicBlock_1:
int[] V_0 = new int[args.Length];
int i = 0;
goto BasicBlock_2;
// BasicBlock 2 (Predecessors:1 Successors:11 Parent:0)
BasicBlock_2:
goto BasicBlock_4;
// Loop 11 (Predecessors:2 Successors:5 Parent:0)
Loop_11:
for (;;) {
// ConditionalNode 22 (Predecessors:3 Successors:3 Parent:11)
ConditionalNode_22:
BasicBlock_4:
if (i >= V_0.Length) {
goto BasicBlock_5;
// Block 21 (Parent:22)
Block_21:
} else {
goto BasicBlock_3;
// Block 20 (Parent:22)
Block_20:
}
// BasicBlock 3 (Predecessors:22 Successors:22 Parent:11)
BasicBlock_3:
V_0[i] = Int32.Parse(args[i]);
i++;
goto BasicBlock_4;
}
\end{verbatim}
\subsubsection{Final version}
Let us see the code once again without the comments and empty lines.
\begin{verbatim}
BasicBlock_1:
int[] V_0 = new int[args.Length];
int i = 0;
goto BasicBlock_2;
BasicBlock_2:
goto BasicBlock_4;
Loop_11:
for (;;) {
ConditionalNode_22:
BasicBlock_4:
if (i >= V_0.Length) {
goto BasicBlock_5;
Block_21:
} else {
goto BasicBlock_3;
Block_20:
}
BasicBlock_3:
V_0[i] = Int32.Parse(args[i]);
i++;
goto BasicBlock_4;
}
\end{verbatim}
\newpage
\subsection{Abstract Syntax Tree representation (part 2)}
\subsubsection{Removal of `goto' statements}
\impref{Removal of `goto' statements}
All of the \verb|goto| statements in the code above can be
removed or replaced. After the removal of \verb|goto| statements
we get the following much simpler code:
\begin{verbatim}
int[] V_0 = new int[args.Length];
int i = 0;
for (;;) {
if (i >= V_0.Length) {
break;
}
V_0[i] = Int32.Parse(args[i]);
i++;
}
\end{verbatim}
Note that the `false' body of the \verb|if| statement was removed
because it was empty.
\subsubsection{Simplifying loops}
\impref{Simplifying loops}
The code can be further simplified by pushing the loop initializer,
condition and iterator inside the \verb|for(;;)|:
\begin{verbatim}
int[] V_0 = new int[args.Length];
for (int i = 0; i < V_0.Length; i++) {
V_0[i] = Int32.Parse(args[i]);
}
\end{verbatim}
This concludes the decompilation of quick-sort.
\clearpage
{\vspace*{60mm} \centering This page is intentionally left blank\\ Please turn over \newpage}
\subsection{Original quick-sort (complete source code)}
\label{Original quick-sort}
\lstinputlisting[basicstyle=\footnotesize]
{./figs/QuickSort_original.cs}
\newpage
\subsection{Decompiled quick-sort (complete source code)}
\label{Decompiled quick-sort}
\lstinputlisting[basicstyle=\footnotesize]
{./figs/10_Short_type_names_2.cs}
\section{Advanced and unsupported features}
This section demonstrates decompilation of more advanced application.
The examples that follow were produced by decompiling a reversi game%
\footnote{Taken from the CodeProject. Written by Mike Hall.}.
The examples demonstrate advanced features of the decompiler
and in some cases its limitations (i.e.\ still unimplemented features).
\subsection{Properties and fields}
Properties and fields are well-supported.
\emph{Original:}
\begin{verbatim}
private int[,] squares;
private bool[,] safeDiscs;
public int BlackCount {
get { return blackCount; }
}
public int WhiteCount {
get { return whiteCount; }
}
\end{verbatim}
\emph{Decompiled:}
\begin{verbatim}
private System.Int32[,] squares;
private System.Boolean[,] safeDiscs;
public int BlackCount {
get { return blackCount; }
}
public int WhiteCount {
get { return whiteCount; }
}
\end{verbatim}
The decompiled code is correct.
The type \verb|System.Int32[,]| was not simplified to \verb|int[,]|.
\subsection{Short-circuit boolean expressions}
\emph{Original:}
\begin{verbatim}
if (r < 0 || r > 7 || c < 0 || c > 7 ||
(r - dr == row && c - dc == col) ||
this.squares[r, c] != color)
return false;
\end{verbatim}
\emph{Decompiled:}
\begin{verbatim}
if (i < 0 || i > 7 || j < 0 || j > 7 ||
i - dr == row && j - dc == col ||
squares[i, j] != color) {
return false;
}
\end{verbatim}
The decompiled code is correct. Variable names were lost during compilation.
By precedence rules it was possible to remove the parenthesis.
\subsection{Short-circuit boolean expressions 2}
\emph{Original:}
\begin{verbatim}
if ((hasSpaceSide1 && hasSpaceSide2 ) ||
(hasSpaceSide1 && hasUnsafeSide2) ||
(hasUnsafeSide1 && hasSpaceSide2 ))
return true;
\end{verbatim}
\emph{Decompiled:}
\begin{verbatim}
if (flag && flag2 || flag && flag4 || flag3 && flag2) {
return true;
}
\end{verbatim}
The decompiled code is correct. Variable names were lost during compilation.
\subsection{Complex control nesting}
\emph{Original:}
\begin{verbatim}
bool statusChanged = true;
while (statusChanged)
{
statusChanged = false;
for (i = 0; i < 8; i++)
for (j = 0; j < 8; j++)
if (this.squares[i, j] != Board.Empty &&
!this.safeDiscs[i, j] &&
!this.IsOutflankable(i, j)) {
this.safeDiscs[i, j] = true;
statusChanged = true;
}
}
\end{verbatim}
\emph{Decompiled:}
\begin{verbatim}
for (bool flag = true; flag;) {
flag = false;
for (int i = 0; i < 8; i++) {
for (int j = 0; j < 8; j++) {
if (squares[i, j] != Empty &&
!safeDiscs[i, j] &&
!IsOutflankable(i, j)) {
safeDiscs[i, j] = true;
flag = true;
}
}
}
}
\end{verbatim}
The decompiled code is correct.
A \verb|for| loop was used instead of \verb|while| loop.
The decompiler always uses \verb|for| loop.
In the example, four control structures are nested;
even more levels do not cause any problems.
\subsection{Multidimensional arrays}
Multidimensional arrays are supported.
\emph{Original:}
\begin{verbatim}
this.squares = new int[8, 8];
this.safeDiscs = new bool[8, 8];
// Clear the board and map.
int i, j;
for (i = 0; i < 8; i++)
for (j = 0; j < 8; j++) {
this.squares[i, j] = Board.Empty;
this.safeDiscs[i, j] = false;
}
\end{verbatim}
\emph{Decompiled:}
\begin{verbatim}
squares = new int[8, 8];
safeDiscs = new bool[8, 8];
for (int i = 0; i < 8; i++) {
for (int j = 0; j < 8; j++) {
squares[i, j] = Empty;
safeDiscs[i, j] = false;
}
}
\end{verbatim}
The decompiled code is correct and even slightly simplified.
\subsection{Dispose method}
This is the standard .NET dispose method for a form.
\emph{Original:}
\begin{verbatim}
if(disposing) {
if (components != null) {
components.Dispose();
}
}
\end{verbatim}
\emph{Decompiled:}
\begin{verbatim}
if (disposing && components) {
components.Dispose();
}
\end{verbatim}
The decompiler did a good job identify that the two
\verb|if|s can be expressed as short-circuit expression.
On the hand, it failed to distinguish being `true' and being `non-null'
-- on the bytecode level these are the same thing (together with being `non-zero').
\subsection{Event handlers}
Event handlers are not supported.
\emph{Original:}
\begin{verbatim}
this.aboutMenuItem.Click +=
new System.EventHandler(this.aboutMenuItem_Click);
\end{verbatim}
\emph{Decompiled:}
\begin{verbatim}
aboutMenuItem.add_Click(
new System.EventHandler(this, IL__ldftn(aboutMenuItem_Click()))
);
\end{verbatim}
Note how the unsupported bytecode \verb|ldftn| is presented.
\subsection{Constructors}
\emph{Original:}
\begin{verbatim}
this.infoPanel.Location = new System.Drawing.Point(296, 32);
\end{verbatim}
\emph{Decompiled:}
\begin{verbatim}
infoPanel.Location = new System.Drawing.Point(296, 32);
\end{verbatim}
The decompiled code is correct including the constructor arguments.
\subsection{Property access}
\emph{Original:}
\begin{verbatim}
this.infoPanel.Name = "infoPanel";
\end{verbatim}
\emph{Decompiled:}
\begin{verbatim}
infoPanel.Name = "infoPanel";
\end{verbatim}
The decompiled code is correct.
Note that properties are actually set methods. This is,
\verb|set_Name(string)| here.
\subsection{Object casting}
\emph{Original:}
\begin{verbatim}
this.Icon = ((System.Drawing.Icon)(resources.GetObject("$this.Icon")));
\end{verbatim}
\emph{Decompiled:}
\begin{verbatim}
Icon = (System.Drawing.Icon)V_0.GetObject("$this.Icon");
\end{verbatim}
The decompiled code is correct.
\subsection{Boolean values}
In the .NET Runtime there is no difference between
integers and booleans. Handling of this is delicate.
\emph{Original:}
\begin{verbatim}
return false;
\end{verbatim}
\emph{Decompiled:}
\begin{verbatim}
return false;
\end{verbatim}
The decompiled code is correct.
\emph{Original:}
\begin{verbatim}
this.infoPanel.ResumeLayout(false);
\end{verbatim}
\emph{Decompiled:}
\begin{verbatim}
infoPanel.ResumeLayout(0);
\end{verbatim}
This code is incorrect -- the case where boolean is
passed as argument still has no been considered in the decompiler.
This is just case-by-case laborious work.
\subsection{The `dup' instruction}
The \verb|dup| instruction is handled well.
\emph{Decompiled: (`dup' unsupported)}
\begin{verbatim}
expr175 = this;
expr175.whiteCount = expr175.whiteCount + 1;
\end{verbatim}
\emph{Decompiled: (`dup' supported)}
\begin{verbatim}
whiteCount++;
\end{verbatim}
% coverage for a typical binaries (I like that one :-) )
% Anonymous methods, Enumerators, LINQ
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% 2 Page
\cleardoublepage
\chapter{Conclusion}
The project was a success.
The goal of the project was to decompile an implementation
of a quick-sort algorithm and this goal was achieved well.
The C\# source code produced by the decompiler is
virtually identical to the original source code%
\footnote{See pages \pageref{Original quick-sort} and \pageref{Decompiled quick-sort} in the evaluation chapter.}.
In particular, the source code compiles back to identical .NET CLR code.
More complex programs than the quick-sort implementation
were considered and the decompiler was improved to handle them better.
The improvements include:
\begin{itemize}
\item Better support for several bytecodes related to:
creation and casting of objects, multidimensional arrays,
virtual method invocation, property access and field access.
\item Support for the \verb|dup| bytecode which does not
occur very frequently in programs. In particular, it does not
occur in the quick-sort implementation.
\item Arbitrarily complicated short-circuit boolean expression
are handled. Furthermore, both the short-circuit
and the traditional boolean expressions are simplified
using rules of logic (e.g.\ using De Morgan's laws).
\item Nested high-level control structures are handled
well (e.g.\ nested loops).
\item The performance is improved by using basic blocks.
\item The removal of \verb|goto|s was improved to work
in almost any scenario (by using the `simulator' approach).
\end{itemize}
\section{Hindsight}
I am very satisfied with the current implementation.
Whenever I have seen an opportunity for improvement,
I have have refactored the code right away.
As a result of this, I think that the algorithms and data structures
currently used are very well suited for the purpose.
If I were to reimplement the project, I would merely
save time by making the right design decisions right away.
Here are some issues that I have encountered are resolved:
\begin{itemize}
\item During the preparation phase, I have spend reasonable
amount of time researching algorithms for finding of loops.
I found most of the algorithms either unintuitive or not
not sufficiently robust for all scenarios.
Therefore, I have derived my own algorithm based on the
T1-T2 transformations.
\item I have initially used the \emph{CodeDom} library
for the abstract syntax tree.
In the end, \emph{NRefactory} proved much more suitable.
% \item I have initially directly used the objects provided by
% \emph{Cecil} library as the first code representation.
% Making an own copy turned out to be better since the objects
% could be more easily manipulated and annotated.
\item I have eventually made the stack-based and
variable-based code representations quite distinct and
decoupled. This allowed greater freedom for transformations.
\item Initially the links between nodes were manually kept
in synchronization during transformations.
It is more convenient and robust just to throw the links away
and recalculate them.
\item The removal of \verb|goto|s was gradually pushed
all the way to the last stage of decompilation where it is
most powerful and safest.
\end{itemize}
\section{Future work}
The theoretical problems have been resolved.
However, a lot of laborious work still needs to be done before
the decompiler will be able to handle all .NET executables.
The decompiler could be extended to recover C\# compile-time
features like anonymous methods, enumerators or LINQ expressions.
The decompiler greatly overlaps with a verification of .NET executables
(the code needs to be valid in order to be decompiled).
The project could be forked to create .NET verifier.
I am the primary developer of the SharpDevelop debugger.
I would like integrate the decompiler with the debugger so
that is it possible to debug applications with no source code.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\cleardoublepage
\chapter{Project Proposal}
\input{proposal}
\end{document}