%\thispagestyle{empty}

%\rightline{\large\emph{David Srbeck\'y}}
%\medskip
%\rightline{\large\emph{Jesus College}}
%\medskip
%\rightline{\large\emph{ds417}}

\vfil
\vspace{0.4in}
\centerline{\large Part II of the Computer Science Project Proposal}
\vspace{0.4in}
\centerline{\Large\bf .NET Decompiler}
\vspace{0.3in}
\centerline{\large\emph{October~14,~2007}}

\vfil

{\bf Project Originator:} \emph{David Srbeck\'y}

\vspace{0.1in}

{\bf Resources Required:} See attached Project Resource Form

\vspace{0.3in}

{\bf Project Supervisor:} \emph{Alan Mycroft}

\vspace{0.3in}

{\bf Director of Studies:} \emph{Jean Bacon} and \emph{David Ingram}

\vspace{0.3in}

{\bf Overseers:} \emph{Anuj Dawar} and \emph{Andrew Moore}

\vfil
\eject

\section*{Introduction and Description of the Work}
The \emph{.NET Framework} is a general-purpose software development platform
which is very similar to \emph{Java}. It includes an extensive class library
and, like Java, is based on the virtual machine model. The executable
code for a \emph{.NET} program is stored in a file called an \emph{assembly},
which consists of class metadata and a stack-based bytecode called Common
Intermediate Language (\emph{CIL} or \emph{IL}).

In general, any programming language can be compiled to \emph{.NET} and
there are dozens of compilers that compile into \emph{CIL}. The most
common language used for \emph{.NET} development is \emph{C\#}.

The goal of this project is to decompile \emph{.NET} assemblies back into
equivalent \emph{C\#} source code. Compared to decompilation of
conventional assembly code, this task is hugely simplified by the
presence of metadata in the \emph{.NET} assemblies. The metadata contains
complete information about classes, methods and fields. The method bodies
consist of stack-based \emph{IL} code which needs to be decompiled into
higher-level \emph{C\#} statements. Data-flow analysis will need to be
employed to transform the stack-based data model into one that uses
temporary local variables and composition of expressions. Control-flow
analysis will be used to recreate high-level control structures like
\verb|for| loops and conditional branching.

\section*{Resources Required}
\begin{itemize}
  \item{\textbf{My own machine}\\
        (1.6 GHz CPU, 1.5 GB of RAM, 50 GB \& 75 GB Disks,
         Windows XP SP2 OS) \\
        Used for development
       }
  \item{\textbf{Student-Run Computing Facility (SRCF)}\\
        Used for running the \emph{SVN} server
       }
  \item{\textbf{Public Workstation Facility (PWF)}\\
        Used for storage of back-ups
       }
\end{itemize}

\newpage

\section*{Starting Point}
I plan to implement the project in \emph{C\#}. I have been using this
language for over five years, so I will not have to spend any time
learning a new language. It also means that I do not expect problems
with either the syntax of the language or any peculiar
error messages produced by the compiler or by the runtime.

I have written an integrated \emph{.NET} debugger for the
\emph{SharpDevelop} IDE. While doing so I gained some basic knowledge
of metadata and lower-level functionality in \emph{.NET}. I can read
\emph{.NET} bytecode and, with the help of a reference manual, I can write
short programs in it.

The metadata and bytecode need to be read from the assembly files.
I plan to use the \emph{Cecil} library for this. I am not familiar with this
library, but I do not expect to have any difficulties with it.

\section*{Substance and Structure of the Project}
The project consists of the following major work items:
\begin{enumerate}
\newcommand{\milestone}[1]{\item \textbf{#1} \\}

\milestone{Preliminary research}
I will have to research the following topics:
\begin{itemize}
  \item {\emph{Cecil} library}
    - \emph{Cecil} is the library which I will use for reading the
    metadata. I will need to become familiar with its public API.
    Because it is open-source, it might be valuable to gain some basic
    understanding of its source code as well.
  \item {\emph{CIL} bytecode}
    - The runtime of the \emph{.NET Framework} is described in the
    ECMA-335 Standard: \emph{``CLI Specification -- Virtual Machine''}
    (556 pages). I will need to become familiar with this document since
    I will be using it as the main reference. I will be especially
    interested in \emph{Partition III -- CIL Instruction Set}.
  \item {Decompilation theory} - I will need to become familiar with the
    theory behind decompilation of programs. Cristina Cifuentes'
    PhD thesis \emph{``Reverse Compilation Techniques''} might prove an
    especially useful starting point.
\end{itemize}

The research of these topics should not be too extensive. I only intend to
get sufficient background knowledge in these areas and then return to the
finer details when I need them.

\milestone{Create a skeleton of the code}
It will be necessary to read the assembly metadata and create \emph{C\#}
source code that has the same classes, fields and methods. The method
signatures have to match the ones in the assembly. At this point the method
bodies can be left empty.

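
A minimal sketch of this stage is given below. It is an illustration
under stated assumptions, not the planned implementation: it reads the
metadata through \verb|System.Reflection| only to stay self-contained,
whereas the project itself plans to use \emph{Cecil}; the class name
\verb|SkeletonWriter| is hypothetical; and access modifiers, properties,
generics and nested types are ignored. Each body contains a \verb|throw|
statement rather than being literally empty, because a non-void method
with an empty body would not compile.
\begin{verbatim}
using System;
using System.Linq;
using System.Reflection;
using System.Text;

// Hypothetical helper: prints a C#-like skeleton of one type.
public static class SkeletonWriter
{
    const BindingFlags All =
        BindingFlags.Public | BindingFlags.NonPublic |
        BindingFlags.Instance | BindingFlags.Static |
        BindingFlags.DeclaredOnly;

    public static string Write(Type type)
    {
        var sb = new StringBuilder();
        sb.AppendLine("class " + type.Name);
        sb.AppendLine("{");
        foreach (FieldInfo f in type.GetFields(All))
        {
            sb.AppendLine("    " + Static(f.IsStatic) +
                          f.FieldType.FullName + " " + f.Name + ";");
        }
        foreach (MethodInfo m in type.GetMethods(All))
        {
            // Parameters are printed as "Namespace.Type name".
            string args = string.Join(", ", m.GetParameters()
                .Select(p => p.ParameterType.FullName + " " + p.Name)
                .ToArray());
            string ret = m.ReturnType == typeof(void)
                ? "void" : m.ReturnType.FullName;
            sb.AppendLine("    " + Static(m.IsStatic) + ret + " " +
                          m.Name + "(" + args + ")");
            // Placeholder body: compiles for any return type.
            sb.AppendLine("    {");
            sb.AppendLine("        throw new " +
                          "System.NotImplementedException();");
            sb.AppendLine("    }");
        }
        sb.AppendLine("}");
        return sb.ToString();
    }

    static string Static(bool isStatic)
    {
        return isStatic ? "static " : "";
    }
}
\end{verbatim}
For example, \verb|SkeletonWriter.Write(typeof(SkeletonWriter))| produces,
among other lines, the signature
\verb|static System.String Write(System.Type type)|.
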
\milestone{Read and disassemble \emph{.NET} bytecode}
The next step is to read the bytecode for each method, disassemble it and
output it as comments (for example, \verb|// IL_01: ldstr "Hello world"|).
This will help me learn how to use the \emph{Cecil} library to read the
bytecode and how to process it. I also expect that this output will be
extremely helpful for debugging purposes later on.

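
The following sketch illustrates the kind of output this stage is meant
to produce. It rests on several assumptions: it decodes the raw bytecode
with \verb|System.Reflection| and the opcode table in
\verb|System.Reflection.Emit.OpCodes| instead of \emph{Cecil}, purely to
stay self-contained; the class name \verb|IlLister| is hypothetical; and
operands are skipped rather than printed, so only the offset and the
opcode name appear in each comment.
\begin{verbatim}
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Reflection.Emit;

public static class IlLister
{
    // Opcode value -> OpCode, built from the fields of OpCodes.
    static readonly Dictionary<ushort, OpCode> Table = BuildTable();

    static Dictionary<ushort, OpCode> BuildTable()
    {
        var table = new Dictionary<ushort, OpCode>();
        foreach (FieldInfo f in typeof(OpCodes).GetFields(
                     BindingFlags.Public | BindingFlags.Static))
        {
            var op = (OpCode)f.GetValue(null);
            table[unchecked((ushort)op.Value)] = op;
        }
        return table;
    }

    // Size in bytes of the operand that follows the opcode.
    static int OperandSize(OpCode op, byte[] il, int operandPos)
    {
        switch (op.OperandType)
        {
            case OperandType.InlineNone: return 0;
            case OperandType.ShortInlineBrTarget:
            case OperandType.ShortInlineI:
            case OperandType.ShortInlineVar: return 1;
            case OperandType.InlineVar: return 2;
            case OperandType.InlineI8:
            case OperandType.InlineR: return 8;
            case OperandType.InlineSwitch:
                // Target count followed by that many 4-byte targets.
                return 4 + 4 * BitConverter.ToInt32(il, operandPos);
            default: return 4;  // tokens, int32, float32, br targets
        }
    }

    public static List<string> Disassemble(MethodBase method)
    {
        var lines = new List<string>();
        MethodBody body = method.GetMethodBody();
        if (body == null) return lines;  // abstract or extern method
        byte[] il = body.GetILAsByteArray();
        int pos = 0;
        while (pos < il.Length)
        {
            int start = pos;
            ushort value = il[pos++];
            // Two-byte opcodes start with the 0xFE prefix byte.
            if (value == 0xFE) value = (ushort)(0xFE00 | il[pos++]);
            OpCode op = Table[value];
            pos += OperandSize(op, il, pos);
            lines.Add("// IL_" + start.ToString("X4") + ": " + op.Name);
        }
        return lines;
    }
}
\end{verbatim}
Printing the operands as well (strings, method references, branch
targets) requires resolving metadata tokens, which is where a metadata
library such as \emph{Cecil} is expected to help.
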
\milestone{Start creating r-value expressions}
Ignoring the stack of the virtual machine, some bytecodes can be
straightforwardly converted into expressions. For example:
\begin{verbatim}
ldstr "Hello world"  -  string "Hello world"
ldnull               -  'null' reference
ldc.i4.0             -  4 byte integer of value 0
ldc.i4   123         -  4 byte integer of value 123
ldarg.0              -  the first method argument ('this' in instance methods)
ldloc.0              -  the first local variable in the method
\end{verbatim}

The goal of this stage is to create \emph{C\#} expressions for several of
the most important bytecodes.

Function calls and arithmetic operations are also expressions, but at this
stage their inputs are not yet known, so I will have to use dummy values
in their place.

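
The mapping for the simple load instructions could be sketched as
follows. This is only an illustration: the class name \verb|RValues|,
the string-based instruction representation and the placeholder names
\verb|arg0| and \verb|loc0| are all assumptions made for the example.
\begin{verbatim}
public static class RValues
{
    // Returns the C# expression for a simple load instruction, or
    // null if the opcode is not handled by this sketch.
    public static string ToExpression(string opcode, string operand)
    {
        switch (opcode)
        {
            // No escaping of special characters is attempted here.
            case "ldstr":    return "\"" + operand + "\"";
            case "ldnull":   return "null";
            case "ldc.i4.0": return "0";
            case "ldc.i4":   return operand;
            // Placeholder names; the real names come from metadata.
            case "ldarg.0":  return "arg0";
            case "ldloc.0":  return "loc0";
            default:         return null;
        }
    }
}
\end{verbatim}
For instance, \verb|RValues.ToExpression("ldc.i4", "123")| returns the
text \verb|123|.
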
\milestone{Conditional and unconditional branching}
There are several bytecodes that branch to a different location:
\verb|br| does so unconditionally, while \verb|brfalse|, \verb|brtrue|,
\verb|beq|, \verb|bge|, \verb|bgt|, etc.\ examine one or two values on
the top of the stack and branch only if a given condition is met.

The goal of this stage is to use \emph{C\#} labels and \verb|goto|
statements to recreate this flow of control (e.g.\ translate
\verb|brfalse IL_02| to \verb|if (input == false) goto Label_02;|).

As in the previous stage, the inputs (i.e.\ the values at the top of the
stack) are still not known.

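
A possible shape for this translation is sketched below. The class name
\verb|Branches| and the string-based representation are assumptions made
for the example, and the inputs are passed in as ready-made expression
text (dummy values at this stage). Following the example in the text,
\verb|brtrue| and \verb|brfalse| treat their input as a boolean.
\begin{verbatim}
using System;

public static class Branches
{
    // 'inputs' holds the expressions for the values the branch pops:
    // none for br, one for brtrue/brfalse, two for comparisons.
    public static string ToStatement(string opcode, string target,
                                     params string[] inputs)
    {
        string jump = "goto " + target.Replace("IL_", "Label_") + ";";
        switch (opcode)
        {
            case "br":      return jump;
            case "brtrue":  return If(inputs[0], "==", "true", jump);
            case "brfalse": return If(inputs[0], "==", "false", jump);
            case "beq":     return If(inputs[0], "==", inputs[1], jump);
            case "bge":     return If(inputs[0], ">=", inputs[1], jump);
            case "bgt":     return If(inputs[0], ">", inputs[1], jump);
            default:        throw new NotSupportedException(opcode);
        }
    }

    static string If(string left, string op, string right, string jump)
    {
        return "if (" + left + " " + op + " " + right + ") " + jump;
    }
}
\end{verbatim}
With these definitions,
\verb|Branches.ToStatement("brfalse", "IL_02", "input")| returns
\verb|if (input == false) goto Label_02;|.
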
\milestone{Simple data-flow analysis}
This is where things become difficult. Consider the code:
\begin{verbatim}
// Load "Hello, world!" on top of the stack
IL_01: ldstr "Hello, world!"
// Print the top of the stack to the console
IL_02: call void [mscorlib]System.Console::WriteLine(string)
\end{verbatim}
Both of these are already decompiled as expressions; however, the call
has a dummy value as its argument. The goal of this stage is to perform
the simplest possible data-flow analysis. The text ``Hello, world!'' must
find its way to the method call. At this point it will probably be through
one or even two temporary variables. For example:
\begin{verbatim}
String il_01_expression = "Hello, world!";
String il_02_argument_1 = il_01_expression;
System.Console.WriteLine(il_02_argument_1);
\end{verbatim}
The most difficult part will be the handling of control flow. Different
values can be on the stack depending on which branch of the code was
executed. At this stage it will be necessary to create and analyse a
control-flow graph. As a result of this stage, many temporary variables
might be introduced into the code.

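
For straight-line code, the idea can be sketched by simulating the
evaluation stack and giving every pushed value its own temporary
variable. The sketch below is a simplification under stated assumptions:
the names \verb|StackSim| and \verb|Instr| are hypothetical, each
instruction is described by a text template with numbered holes for its
inputs, types are not tracked (hence \verb|var|), and branches are not
handled at all.
\begin{verbatim}
using System.Collections.Generic;

public static class StackSim
{
    // One decoded instruction (hypothetical representation).
    public sealed class Instr
    {
        public readonly string Label;     // e.g. "IL_01"
        public readonly string Template;  // C# text; {0}, {1} = inputs
        public readonly int Pops;         // number of values consumed
        public readonly bool Pushes;      // true if a value is produced

        public Instr(string label, string template,
                     int pops, bool pushes)
        {
            Label = label; Template = template;
            Pops = pops; Pushes = pushes;
        }
    }

    public static List<string> Decompile(IEnumerable<Instr> code)
    {
        var stack = new Stack<string>();  // names of live temporaries
        var output = new List<string>();
        foreach (Instr i in code)
        {
            // Pop inputs; the value pushed last is the last argument.
            var args = new object[i.Pops];
            for (int k = i.Pops - 1; k >= 0; k--) args[k] = stack.Pop();
            string expr = string.Format(i.Template, args);
            if (i.Pushes)
            {
                string temp =
                    i.Label.ToLowerInvariant() + "_expression";
                output.Add("var " + temp + " = " + expr + ";");
                stack.Push(temp);
            }
            else
            {
                output.Add(expr + ";");
            }
        }
        return output;
    }
}
\end{verbatim}
Given an \verb|ldstr| instruction labelled \verb|IL_01| that pushes one
value and a call to \verb|System.Console.WriteLine({0})| that pops one,
the sketch emits a declaration of \verb|il_01_expression| followed by
\verb|System.Console.WriteLine(il_01_expression);|. Handling branches
would additionally require agreeing on the stack contents wherever two
control-flow paths meet.
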
\milestone{Round-trip quick-sort algorithm}
At this point very simple applications should probably decompile and
compile again successfully (round-trip).

The goal of this stage is to fix bugs and to add features so that a simple
algorithm like quick-sort can be successfully round-tripped without the
need to manually change the produced \emph{C\#} source code. At this point
there is no restriction on the aesthetics of the source code. The only
requirement is that it compiles.

There are many features of \emph{.NET} that I do not plan to support at
this point, for example boxing \& unboxing, casting, generics and
exception handling. In general, all non-essential features are excluded.

\milestone{Further data-flow analysis}
Employ more advanced data-flow analysis to simplify the generated \emph{C\#}
code. Many temporary variables can probably be removed, relocated or
renamed according to their use.

\emph{[This task has variable scope and if the project starts falling behind
schedule, simpler algorithms can be employed and vice versa.]}

\milestone{Control-flow analysis}
The goal of this stage is to use control-flow analysis to regenerate
high-level structures like \verb|if| statements and \verb|for| loops.
It will not be possible to eliminate all \verb|goto| statements, but they
should be avoided whenever possible.

\emph{[This task has variable scope and if the project starts falling behind
schedule, simpler algorithms can be employed and vice versa.]}

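
A common first step towards a control-flow graph is to split the
instruction stream into basic blocks. The sketch below shows the
standard rule for finding block leaders: the first instruction, every
branch target, and every instruction that follows a branch. The class
name \verb|BasicBlocks| and the encoding of the method as an array of
branch targets are assumptions made for the example; recognising
\verb|if| and \verb|for| structures on top of the resulting graph is
not shown.
\begin{verbatim}
using System.Collections.Generic;

public static class BasicBlocks
{
    // branchTarget[i] is the index of the instruction that
    // instruction i may jump to, or -1 if i is not a branch.
    // Returns the indices that start a basic block, in order.
    public static List<int> Leaders(int[] branchTarget)
    {
        var leaders = new SortedSet<int>();
        if (branchTarget.Length > 0) leaders.Add(0);
        for (int i = 0; i < branchTarget.Length; i++)
        {
            if (branchTarget[i] < 0) continue;  // not a branch
            leaders.Add(branchTarget[i]);       // the jump target
            if (i + 1 < branchTarget.Length)
                leaders.Add(i + 1);             // the fall-through
        }
        return new List<int>(leaders);
    }
}
\end{verbatim}
For a six-instruction method in which instruction 2 jumps forward to 4
and instruction 5 jumps back to 1, that is
\verb|Leaders(new[] { -1, -1, 4, -1, -1, 1 })|, the leaders are
instructions 0, 1, 3 and 4. The backward jump from 5 to 1 is the kind
of edge that suggests a loop.
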
\milestone{Assembly resources}
\emph{.NET} assemblies can have files embedded in them. These files can then
be accessed at runtime and thus the programs might require them.

The goal is to extract the resources so that they can be included during
the recompilation process.

\emph{[Optional. This is an optional goal which will be done only if the
project development goes much better than originally anticipated.]}

\milestone{Advanced features}
Add commonly used features which were ignored so far, for example
boxing \& unboxing, casting, generics and exception handling.

\emph{[Optional. This is an optional goal which will be done only if the
project development goes much better than originally anticipated.]}

\milestone{Round-trip Mono}
The ultimate goal of this project is to be able to round-trip any
\emph{.NET} assembly. This means that for any given assembly the
Decompiler should produce \emph{C\#} source code which is valid (compiles
again without error). Even more importantly, the program produced
by the compilation of the source code should be semantically the same as
the original one. Since the bytecode will in general differ, this condition
is difficult to verify. One way to check that the Decompiler preserves the
meaning of programs is simply to try it.

\emph{Mono} is an open-source reimplementation of the \emph{.NET Framework}.
A major part of it is the \emph{.NET} class libraries, which can be
used for testing the Decompiler. The project is open-source, so if
any decompilation problems occur, it is possible to investigate the
source code of these libraries. Furthermore, the libraries come with
an extensive unit-testing suite, so it is possible to verify that the
round-tripped libraries are not broken.

The goal of this final stage is to successfully round-trip all \emph{Mono}
libraries and pass the unit tests. This would probably involve an enormous
amount of bugfixing, investigation and handling of corner cases. All
remaining \emph{.NET} features would have to be implemented.

\emph{[Optional. This last stage is huge and cannot be finished
within the time frame of a Part II project. If all goes well, I expect
that it will take at least one more year for the project to mature to
this point.]}

\milestone{Write the dissertation}
The last and most important piece of work is to write the dissertation.
Being a non-native English speaker, I expect this to take a considerable
amount of time. I plan to spend the last seven weeks of project time
on it. This includes the end of Lent Term and the whole Easter vacation.
I plan to have the dissertation finished by the start of Easter term.

\end{enumerate}

\newpage

\section*{Success Criteria}
The Decompiler should successfully round-trip a quick-sort algorithm
(or any algorithm of comparable complexity).
That is, when an assembly containing the algorithm is
decompiled, the produced \emph{C\#} source code should be both
syntactically and semantically correct. The bytecode produced
by compilation of the generated source code is not expected to be
identical to the original one, but it is expected to be equivalent.
That is, the binary may be different but it still needs to be a correct
implementation of the algorithm.

To achieve this the Decompiler will need to have the following features:
\begin{itemize}
  \item Handle integers and integer arithmetic
  \item Create and use integer arrays
  \item Decompile branching successfully
  \item Support the definition of several methods
  \item Support method arguments and return values
  \item Support recursive method calls
  \item Read and parse integer command-line arguments
  \item Write text to the standard console output
\end{itemize}

See the following page for a \emph{C\#} implementation of a quick-sort
algorithm which will be used to demonstrate successful implementation
of these features.

I plan to achieve the success criteria by the progress report deadline
and then spend the rest of the available time improving the quality
of the generated source code (i.e.\ ``Further data-flow analysis'' and
``Control-flow analysis'').

\newpage

{
\linespread{0.90}
\lstinputlisting[
  basicstyle=\footnotesize,
  language={[Sharp]C},
  tabsize=4,
  numbers=left,
  frame=single,
  title=Quick-sort algorithm
]{
  ../../tests/QuickSort/Program.cs
}
}
\newpage
\section*{Timetable and Milestones}
The work shall start on Monday 22 October 2007 and is expected to
take 20 weeks in total.

\vspace{0.1in}
\newcommand{\milestone}[3]{\emph{#1} & \emph{#2} & \textbf{#3} \\}
\begin{tabular}{l l l}
\milestone{22 Oct -- 28 Oct}{(week 1)}{Preliminary research}
\milestone{29 Oct -- 4 Nov}{(week 2)}{Create a skeleton of the code}
\milestone{5 Nov -- 11 Nov}{(week 3)}{Read and disassemble \emph{.NET} bytecode}
\milestone{12 Nov -- 18 Nov}{(week 4)}{Start creating r-value expressions}
\milestone{19 Nov -- 25 Nov}{(week 5)}{Conditional and unconditional branching}
\milestone{26 Nov -- 9 Dec}{(weeks 6 and 7)}{Simple data-flow analysis}
\milestone{10 Dec -- 20 Jan}{}{\textnormal{Christmas vacation}}
\milestone{21 Jan -- 27 Jan}{(week 8)}{Round-trip quick-sort algorithm}
\milestone{26 Jan -- 27 Jan}{}{Write the Progress Report}
\milestone{28 Jan -- 10 Feb}{(weeks 9 and 10)}{Further data-flow analysis}
\milestone{11 Feb -- 2 Mar}{(weeks 11 to 13)}{Control-flow analysis}
\milestone{3 Mar -- 20 Apr}{(weeks 14 to 20)}{Write the dissertation \textnormal{(over Easter vacation)}}
\milestone{21 Apr onwards }{}{\textnormal{Easter term -- Preparation for exams}}
\end{tabular}
\vspace{0.1in}

Unscheduled tasks: \textbf{Assembly resources}; \textbf{Advanced features};
\textbf{Round-trip Mono}