%\thispagestyle{empty}

%\rightline{\large\emph{David Srbeck\'y}}
%\medskip
%\rightline{\large\emph{Jesus College}}
%\medskip
%\rightline{\large\emph{ds417}}

\vfil
\vspace{0.4in}
\centerline{\large Part II of the Computer Science Project Proposal}
\vspace{0.4in}
\centerline{\Large\bf .NET Decompiler}
\vspace{0.3in}
\centerline{\large\emph{October~14,~2007}}

\vfil

{\bf Project Originator:} \emph{David Srbeck\'y}

\vspace{0.1in}

{\bf Resources Required:} See attached Project Resource Form

\vspace{0.3in}

{\bf Project Supervisor:} \emph{Alan Mycroft}

\vspace{0.3in}

{\bf Director of Studies:} \emph{Jean Bacon} and \emph{David Ingram}

\vspace{0.3in}

{\bf Overseers:} \emph{Anuj Dawar} and \emph{Andrew Moore}

\vfil
\eject

\section*{Introduction and Description of the Work}
The \emph{.NET Framework} is a general-purpose software development platform 
which is very similar to \emph{Java}.  It includes an extensive class library 
and, like Java, is based on the virtual machine model.  The executable 
code for a \emph{.NET} program is stored in a file called an \emph{assembly}, 
which consists of class metadata and a stack-based bytecode called Common 
Intermediate Language (\emph{CIL} or \emph{IL}).

In general, any programming language can be compiled to \emph{.NET} and 
there are dozens of compilers that compile into \emph{CIL}.  The most 
common language used for \emph{.NET} development is \emph{C\#}.

The goal of this project is to decompile \emph{.NET} assemblies back into 
equivalent \emph{C\#} source code.  Compared to decompilation of 
conventional assembly code, this task is hugely simplified by the 
presence of metadata in the \emph{.NET} assemblies.  The metadata contains 
complete information about classes, methods and fields.  The method bodies 
consist of stack-based \emph{IL} code which needs to be decompiled into 
higher-level \emph{C\#} statements.  Data-flow analysis will need to be 
employed to transform the stack-based data model into one that uses 
temporary local variables and composition of expressions.  Control-flow 
analysis will be used to recreate high-level control structures like 
\verb|for| loops and conditional branching.

\section*{Resources Required}
\begin{itemize}
	\item{\textbf{My own machine}\\
		(1.6 GHz CPU, 1.5 GB of RAM, 50 GB \& 75 GB Disks, 
		Windows XP SP2 OS) \\
		Used for development
	}
	\item{\textbf{Student-Run Computing Facility (SRCF)}\\
		Used for running the \emph{SVN} server
	}
	\item{\textbf{Public Workstation Facility (PWF)}\\
		Used for storage of back-ups
	}
\end{itemize}

\newpage

\section*{Starting Point}
I plan to implement the project in \emph{C\#}.  I have been using this 
language for over five years, so I do not have to spend any time 
learning a new language.  It also means that I do not expect any 
problems with the syntax of the language or with any peculiar 
error messages produced by the compiler or by the runtime.

I have written an integrated \emph{.NET} debugger for the 
\emph{SharpDevelop} IDE.  In the course of that work I gained some basic 
knowledge of metadata and lower-level functionality in \emph{.NET}.  I can read 
\emph{.NET} bytecode and, with the help of a reference manual, I can write 
short programs in it.

The metadata and bytecode need to be read from the assembly files. 
I plan to use the \emph{Cecil} library for this.  I am not familiar with this 
library, but I do not expect to have any difficulties with it.

\section*{Substance and Structure of the Project}
The project consists of the following major work items:
\begin{enumerate}
\newcommand{\milestone}[1]{\item \textbf{#1} \\}

\milestone{Preliminary research}
I will have to research the following topics:
\begin{itemize}
	\item {\emph{Cecil} library}
		- \emph{Cecil} is the library which I will use for reading the 
		metadata.  I will need to become familiar with its public API.
		Because it is open-source, it might be valuable to gain some basic 
		understanding of its source code as well.
	\item {\emph{CIL} bytecode}
		- The runtime of the \emph{.NET Framework} is described in 
		ECMA-335 Standard: \emph{``CLI Specification -- Virtual Machine''} 
		(556 pages).  I will need to get familiar with this document since 
		I will be using it as the main reference.  I will be especially 
		interested in \emph{Partition III -- CIL Instruction Set}.
	\item {Decompilation theory} - I will need to become familiar with the 
		theory behind decompilation of programs.  Cristina Cifuentes' 
		PhD thesis \emph{``Reverse Compilation Techniques''} might prove an 
		especially useful starting point.
\end{itemize}

The research of these topics should not be too extensive.  I only intend to 
get sufficient background knowledge in these areas and then return to the 
finer details when I need them.

\milestone{Create a skeleton of the code}
It will be necessary to read the assembly metadata and create \emph{C\#} 
source code that has the same classes, fields and methods.  The method 
signatures have to match the ones in the assembly.  At this point the method 
bodies can be left empty.

\milestone{Read and disassemble \emph{.NET} bytecode}
The next step is to read the bytecode for each method, disassemble it and 
output it as comments (for example, \verb|// IL_01: ldstr "Hello world"|).  
This will help me learn how to use the \emph{Cecil} library to read the 
bytecode and how to process it.  I also expect that this output will be 
extremely helpful for debugging purposes later on.
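
A minimal sketch of this output step is given below.  The \verb|Instruction| 
class is a hypothetical stand-in for a decoded instruction and is not the 
type that \emph{Cecil} exposes; only the formatting of the comment line is 
illustrated, and the two-digit offset format is assumed from the example above.
\begin{verbatim}
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a decoded instruction (not the Cecil type).
class Instruction
{
    public int Offset;
    public string OpCode;
    public string Operand;   // null when the opcode takes no operand

    public Instruction(int offset, string opCode, string operand)
    {
        Offset = offset;
        OpCode = opCode;
        Operand = operand;
    }
}

static class Disassembler
{
    // Formats one instruction as a single C# comment line.
    public static string ToComment(Instruction i)
    {
        string text = "// IL_" + i.Offset.ToString("X2") + ": " + i.OpCode;
        if (i.Operand != null)
            text += " " + i.Operand;
        return text;
    }

    static void Main()
    {
        var body = new List<Instruction>
        {
            new Instruction(1, "ldstr", "\"Hello world\""),
            new Instruction(6, "ret", null),
        };
        foreach (Instruction i in body)
            Console.WriteLine(ToComment(i));
    }
}
\end{verbatim}
For the two instructions in \verb|Main| this prints 
\verb|// IL_01: ldstr "Hello world"| followed by \verb|// IL_06: ret|.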

\milestone{Start creating r-value expressions}
Ignoring the stack of the virtual machine, some bytecodes can be 
straightforwardly converted into expressions.  For example:
\begin{verbatim}
ldstr "Hello world"      - string "Hello world"
ldnull                   - 'null' reference
ldc.i4.0                 - 4 byte integer of value 0
ldc.i4 123               - 4 byte integer of value 123
ldarg.0                  - the first method argument ('this' if non-static)
ldloc.0                  - the first local variable in the method
\end{verbatim}

The goal of this stage is to create \emph{C\#} expressions for several of 
the most important bytecodes.

Function calls and arithmetic operations are also expressions, but at this 
stage I do not know their inputs and so I will have to use dummy values as 
their inputs.
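
The mapping above could be sketched roughly as follows.  This is only an 
illustration under stated assumptions: opcodes are passed as plain strings, 
the name tables \verb|argNames| and \verb|localNames| are hypothetical, and 
the \verb|dummy1| and \verb|dummy2| placeholders stand for the inputs that 
are not yet known at this stage.
\begin{verbatim}
using System;

static class ExpressionBuilder
{
    // Returns C# expression text for a handful of opcodes.
    // argNames and localNames are hypothetical name tables.
    public static string ToExpression(string opCode, string operand,
        string[] argNames, string[] localNames)
    {
        switch (opCode)
        {
            // Simplification: special characters are not escaped.
            case "ldstr":    return "\"" + operand + "\"";
            case "ldnull":   return "null";
            case "ldc.i4.0": return "0";
            case "ldc.i4":   return operand;
            case "ldarg.0":  return argNames[0];
            case "ldloc.0":  return localNames[0];
            // Inputs are not known yet, so placeholders are used.
            case "add":      return "(dummy1 + dummy2)";
            default:
                throw new NotSupportedException(opCode);
        }
    }

    static void Main()
    {
        string[] a = { "array" };
        string[] l = { "i" };
        Console.WriteLine(ToExpression("ldstr", "Hello world", a, l));
        Console.WriteLine(ToExpression("ldc.i4", "123", a, l));
        Console.WriteLine(ToExpression("ldloc.0", null, a, l));
        Console.WriteLine(ToExpression("add", null, a, l));
    }
}
\end{verbatim}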

\milestone{Conditional and unconditional branching}
There are several bytecodes that inspect one or two values on the top 
of the stack and then, if a given condition is met, branch to a different 
location (\verb|br|, \verb|brfalse|, \verb|brtrue|, \verb|beq|, 
\verb|bge|, \verb|bgt|, etc.).

The goal of this stage is to use \emph{C\#} labels and \verb|goto|
statements to recreate this flow of control.  (eg translate 
\verb|brfalse IL_02| to \verb|if (input == false) goto Label_02;|)

As in the previous stage, the inputs (ie the values at the top of the stack) 
are still not known.
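
A rough sketch of such a translation is shown below.  The placeholder names 
\verb|input|, \verb|input1| and \verb|input2| and the \verb|Label_| naming 
scheme are assumptions made for illustration, and only a few of the branch 
opcodes are covered.
\begin{verbatim}
using System;

static class BranchTranslator
{
    // Name of the C# label placed at a given IL offset.
    public static string Label(int target)
    {
        return "Label_" + target.ToString("X2");
    }

    // Translates one branch instruction into a C# goto statement.
    // The operands are placeholders because the stack contents
    // are not yet known at this stage.
    public static string ToStatement(string opCode, int target)
    {
        string jump = "goto " + Label(target) + ";";
        switch (opCode)
        {
            case "br":      return jump;
            case "brfalse": return "if (input == false) " + jump;
            case "brtrue":  return "if (input == true) " + jump;
            case "beq":     return "if (input1 == input2) " + jump;
            case "bge":     return "if (input1 >= input2) " + jump;
            case "bgt":     return "if (input1 > input2) " + jump;
            default:
                throw new NotSupportedException(opCode);
        }
    }

    static void Main()
    {
        Console.WriteLine(ToStatement("brfalse", 2));
        Console.WriteLine(ToStatement("br", 10));
    }
}
\end{verbatim}
The first call in \verb|Main| yields 
\verb|if (input == false) goto Label_02;|, matching the example above.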

\milestone{Simple data-flow analysis}
This is where it begins to be difficult.  Consider the code:
\begin{verbatim}
// Load "Hello, world!" on top of the stack
IL_01: ldstr "Hello, world!"
// Print the top of the stack to the console
IL_02: call void [mscorlib]System.Console::WriteLine(string)
\end{verbatim}
Both of these are already decompiled as expressions; however, the call 
has a dummy value as its argument.  The goal of this stage is to perform 
as simple a data-flow analysis as possible.  The text ``Hello, world!'' must 
find its way to the method call.  At this point it will probably be through 
one or even two temporary variables.  For example:
\begin{verbatim}
String il_01_expression = "Hello, world!";
String il_02_argument_1 = il_01_expression;
System.Console.WriteLine(il_02_argument_1);
\end{verbatim}
The most difficult part will be the handling of control flow.  Different 
values can be on the stack depending on which branch of code was executed.  
At this stage it will be necessary to create and analyse a control-flow 
graph.  As a result of this stage, many temporary variables might be 
introduced to the code.
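
For straight-line code, the idea can be sketched by simulating the 
evaluation stack with the names of temporary variables.  The sketch below 
is a simplification under several assumptions: each instruction is given as 
a hypothetical \verb|{label, opcode, operand}| string triple, every 
\verb|call| is taken to be a \verb|void| method with one argument, and 
branches are not handled at all.  It also uses one temporary per value 
rather than two.
\begin{verbatim}
using System;
using System.Collections.Generic;

static class StackSimulator
{
    // Each instruction is { label, opcode, operand }.
    // Straight-line code only; no branches.
    public static List<string> Decompile(string[][] code)
    {
        var stack = new Stack<string>();   // names of temporaries
        var output = new List<string>();
        foreach (string[] ins in code)
        {
            string temp = "il_" + ins[0] + "_expression";
            string op = ins[1];
            string arg = ins[2];
            switch (op)
            {
                case "ldstr":
                    output.Add("String " + temp + " = \"" + arg + "\";");
                    stack.Push(temp);
                    break;
                case "ldc.i4":
                    output.Add("int " + temp + " = " + arg + ";");
                    stack.Push(temp);
                    break;
                case "add":
                    string right = stack.Pop();   // pushed last
                    string left = stack.Pop();
                    output.Add("int " + temp + " = "
                        + left + " + " + right + ";");
                    stack.Push(temp);
                    break;
                case "call":   // assumed: void method, one argument
                    output.Add(arg + "(" + stack.Pop() + ");");
                    break;
                default:
                    throw new NotSupportedException(op);
            }
        }
        return output;
    }

    static void Main()
    {
        string[][] code =
        {
            new[] { "01", "ldstr", "Hello, world!" },
            new[] { "02", "call", "System.Console.WriteLine" },
        };
        foreach (string line in Decompile(code))
            Console.WriteLine(line);
    }
}
\end{verbatim}
Once branches are involved, a single simulated stack is no longer 
sufficient, because the stack contents at a join point depend on the path 
taken; this is where the control-flow graph becomes necessary.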

\milestone{Round-trip quick-sort algorithm}
At this point very simple applications should probably decompile 
and compile again successfully (round-trip).

The goal of this stage is to fix bugs and to add features so that a simple 
algorithm like quick-sort can be successfully round-tripped without the need 
to manually change the produced \emph{C\#} source code.  At this point there 
is no restriction on the aesthetics of the source code.  The only requirement 
is that it does compile. 

There are many features of \emph{.NET} that I do not plan to support at 
this point, for example boxing \& unboxing, casting, generics and 
exception handling.  In general, all non-essential features are excluded.

\milestone{Further data-flow analysis}
Employ more advanced data-flow analysis to simplify the generated \emph{C\#} 
code.  Many temporary variables can probably be removed, relocated or 
renamed according to their use.

\emph{[This task has variable scope and if the project starts falling behind 
schedule, simpler algorithms can be employed and vice versa.]}

\milestone{Control-flow analysis}
The goal of this stage is to use control-flow analysis to regenerate 
high-level structures like \verb|if| statements and \verb|for| loops. 
It will not be possible to eliminate all \verb|goto| statements, but they 
should be avoided whenever possible.
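
As an illustration of the simplest case, the sketch below rewrites a 
conditional \verb|goto| that skips over a run of plain statements into an 
\verb|if| statement with the negated condition.  This is a deliberately 
naive pattern match over a hypothetical \verb|Stmt| type, not the 
graph-based analysis that the stage will actually require; the label is 
kept because other \verb|goto| statements may still target it.
\begin{verbatim}
using System;
using System.Collections.Generic;

// Hypothetical statement: plain text, a label, or a conditional goto.
class Stmt
{
    public string Text;        // plain statement, e.g. "x = 1;"
    public string Label;       // non-null for "Label:"
    public string Condition;   // non-null for "if (c) goto Target;"
    public string Target;
}

static class Structurer
{
    // Rewrites   if (c) goto L;  S1 ... Sn  L:
    // as         if (!(c)) { S1 ... Sn }  L:
    // when S1..Sn are all plain statements.
    public static List<string> Rewrite(List<Stmt> code)
    {
        var result = new List<string>();
        int i = 0;
        while (i < code.Count)
        {
            Stmt s = code[i];
            int end = -1;
            if (s.Condition != null)
                end = FindLabel(code, i + 1, s.Target);
            if (end >= 0)
            {
                result.Add("if (!(" + s.Condition + ")) {");
                for (int j = i + 1; j < end; j++)
                    result.Add("    " + code[j].Text);
                result.Add("}");
                i = end;   // continue at the label, which is kept
            }
            else
            {
                result.Add(Print(s));
                i++;
            }
        }
        return result;
    }

    // Index of the label if only plain statements precede it, else -1.
    static int FindLabel(List<Stmt> code, int start, string name)
    {
        for (int j = start; j < code.Count; j++)
        {
            if (code[j].Label == name) return j;
            if (code[j].Text == null) return -1;
        }
        return -1;
    }

    static string Print(Stmt s)
    {
        if (s.Label != null) return s.Label + ":";
        if (s.Condition != null)
            return "if (" + s.Condition + ") goto " + s.Target + ";";
        return s.Text;
    }

    static void Main()
    {
        var code = new List<Stmt>
        {
            new Stmt { Condition = "input == false", Target = "Label_05" },
            new Stmt { Text = "System.Console.WriteLine(\"yes\");" },
            new Stmt { Label = "Label_05" },
            new Stmt { Text = "return;" },
        };
        foreach (string line in Rewrite(code))
            Console.WriteLine(line);
    }
}
\end{verbatim}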

\emph{[This task has variable scope and if the project starts falling behind 
schedule, simpler algorithms can be employed and vice versa.]}

\milestone{Assembly resources}
\emph{.NET} assemblies can have files embedded in them.  These files can then 
be accessed at runtime and thus the programs might require them.

The goal is to extract the resources so that they can be included during 
the recompilation process.

\emph{[Optional.  This is an optional goal which will be done only if the 
project development goes much better than originally anticipated.]}

\milestone{Advanced features}
Add commonly used features which have been ignored so far, for example 
boxing \& unboxing, casting, generics and exception handling.

\emph{[Optional.  This is an optional goal which will be done only if the 
project development goes much better than originally anticipated.]}

\milestone{Round-trip Mono}
The ultimate goal of this project is to be able to round-trip any 
\emph{.NET} assembly.  This means that for any given assembly the 
Decompiler should produce \emph{C\#} source code which is valid (compiles 
again without error).  Even more importantly, the program produced 
by the compilation of the source code should be semantically the same as the 
original one.  Since the bytecode will in general differ, this condition is 
difficult to verify.  One way to check that the Decompiler preserves the 
meaning of programs is to simply try it.

\emph{Mono} is an open-source reimplementation of the \emph{.NET Framework}.
A major part of it is the \emph{.NET} class libraries, which can be 
used for testing the Decompiler.  The project is open-source and so if 
any decompilation problems occur, it is possible to investigate the 
source code of these libraries.  Furthermore, the libraries come with 
an extensive unit-testing suite so it is possible to verify that the 
round-tripped libraries are not broken.

The goal of this final stage is to successfully round-trip all \emph{Mono} 
libraries and pass the unit tests.  This would probably involve an enormous 
amount of bug fixing, investigation and handling of corner cases.  All 
remaining \emph{.NET} features would have to be implemented.

\emph{[Optional.  This last stage is huge and cannot be finished 
within the time frame of a Part II project.  If all goes well, I expect 
that it will take at least one more year for the project to mature to 
this point.]}

\milestone{Write the dissertation}
The last and most important piece of work is to write the dissertation.
Being a non-native English speaker, I expect this to take a considerable
amount of time.  I plan to spend the last seven weeks of project time
on it.  This includes the end of Lent Term and the whole Easter vacation.
I plan to have the dissertation finished by the start of Easter term.

\end{enumerate}

\newpage

\section*{Success Criteria}
The Decompiler should successfully round-trip a quick-sort algorithm 
(or any algorithm of comparable complexity). 
That is, when an assembly containing the algorithm is 
decompiled, the produced \emph{C\#} source code should be both 
syntactically and semantically correct.  The bytecode produced
by compilation of the generated source code is not expected to be
identical to the original one, but it is expected to be equivalent.
That is, the binary may be different but it still needs to be a correct 
implementation of the algorithm.

To achieve this the Decompiler will need to have the following features:
\begin{itemize}
	\item Handle integers and integer arithmetic
	\item Create and be able to use integer arrays
	\item Branching must be successfully decompiled
	\item Several methods can be defined
	\item Methods can have arguments and return values
	\item Methods can be called recursively
	\item Integer command line arguments can be read and parsed
	\item Text can be written to the standard console output
\end{itemize}

See the following page for a \emph{C\#} implementation of a quick-sort
algorithm which will be used to demonstrate successful implementation
of these features.

I plan to achieve the success criteria by the progress report deadline 
and then spend the rest of the time available on increasing the quality 
of the generated source code (ie ``Further data-flow analysis'' and 
``Control-flow analysis'').


\newpage

{
\linespread{0.90}
\lstinputlisting[
  basicstyle=\footnotesize,
  language={[Sharp]C},
  tabsize=4,
  numbers=left,
  frame=single,
  title=Quick-sort algorithm
]{
  ../../tests/QuickSort/Program.cs
}
}
\newpage
\section*{Timetable and Milestones}
The work shall start on Monday 22 October 2007 and is expected to 
take 20 weeks in total.

\vspace{0.1in}
\newcommand{\milestone}[3]{\emph{#1} & \emph{#2} & \textbf{#3} \\}
\begin{tabular}{l l l}
	\milestone{22 Oct - 28 Oct}{(week  1)}{Preliminary research}
	\milestone{29 Oct -  4 Nov}{(week  2)}{Create a skeleton of the code}
	\milestone{5 Nov  - 11 Nov}{(week  3)}{Read and disassemble \emph{.NET} bytecode}
	\milestone{12 Nov - 18 Nov}{(week  4)}{Start creating r-value expressions}
	\milestone{19 Nov - 25 Nov}{(week  5)}{Conditional and unconditional branching}
	\milestone{26 Nov -  9 Dec}{(weeks 6 and 7)}{Simple data-flow analysis}
	\milestone{10 Dec - 20 Jan}{}{\textnormal{Christmas vacation}}
	\milestone{21 Jan - 27 Jan}{(week  8)}{Round-trip quick-sort algorithm}
	\milestone{26 Jan - 27 Jan}{}{Write the Progress Report}
	\milestone{28 Jan - 10 Feb}{(weeks 9 and 10)}{Further data-flow analysis}
	\milestone{11 Feb -  2 Mar}{(weeks 11 to 13)}{Control-flow analysis}
	\milestone{3 Mar  - 20 Apr}{(weeks 14 to 20)}{Write the dissertation \textnormal{(over Easter vacation)}}
	\milestone{21 Apr onwards }{}{\textnormal{Easter term -- Preparation for exams}}
\end{tabular}
\vspace{0.1in}

Unscheduled tasks: \textbf{Assembly resources}; \textbf{Advanced features};
 \textbf{Round-trip Mono}