
\documentclass[12pt]{article}
\usepackage{a4wide}
\usepackage{listings}

\parindent 0pt
\parskip 6pt

\begin{document}

\thispagestyle{empty}

\rightline{\large\emph{David Srbeck\'y}}
\medskip
\rightline{\large\emph{Jesus College}}
\medskip
\rightline{\large\emph{ds417}}

\vspace{0.675in}

\centerline{\large Progress Report}
\vspace{0.4in}
\centerline{\Large\bf .NET Decompiler}
\vspace{0.3in}
\centerline{\large\emph{January~30,~2008}}

\vspace{0.675in}

{\bf Project Originator:} \emph{David Srbeck\'y}

\vspace{0.1in}

{\bf Project Supervisor:} \emph{Alan Mycroft}

\vspace{0.1in}

{\bf Director of Studies:} \emph{Jean Bacon} and \emph{David Ingram}

\vspace{0.1in}

{\bf Overseers:} \emph{Anuj Dawar} and \emph{Andrew Moore}

\vspace{0.1in}

\vfil
\eject

\newcommand{\CS}{\emph{C\#} }

\lstset{
  basicstyle=\small,
  language={[Sharp]C},
  tabsize=4,
  numbers=left,
  frame=single,
  frameround=tfft
}

\section*{Work completed so far}
\subsection*{Disassemble \emph{.NET} bytecode}

The \emph{.NET} assembly is read using the \emph{Cecil} library
and the class structure is created.  Method bodies contain the
disassembly of the IL bytecode.  The debugging comment on the
right indicates the stack behavior of the given instruction.
This, of course, is not valid \CS code yet.

\lstinputlisting[lastline=32]
{./Evolution/01_Disassemble.cs}

\newpage
\subsection*{Start creating expressions}

The bytecodes are converted to \CS expressions on an individual basis.
Only one bytecode is considered at a time and thus the expressions
are completely independent.  The resulting output is syntactically valid \CS code,
which nevertheless does not compile since the dummy arguments \verb|arg1|,
\verb|arg2|, etc.\ are never defined.  Conditional and unconditional
branches are converted to \verb|goto| statements.

\lstinputlisting[lastline=32]
{./Evolution/02_Peephole_decompilation.cs}

\newpage
\subsection*{Data-flow analysis}

The execution of the bytecode is simulated and the state of the stack is
recorded for each position.  We are interested in the number of
elements on the stack as well as in which instruction pushed each
individual element.  This information can then be used to eliminate
the dummy \verb|arg1| arguments.  The result of each instruction is stored
in a new temporary variable.  When an instruction pops a stack value,
we look up which instruction pushed the value and use the temporary
variable of that instruction.
The resulting code compiles and works correctly.
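
As a rough illustration of the stack simulation (a minimal sketch, not
the project's actual code), the bookkeeping can be written as follows.
Python is used only for brevity.  The instruction triples, the pop and
push counts and the name \verb|simulate_stack| are hypothetical
simplifications, and the sketch handles straight-line code only.

\begin{lstlisting}[language=Python]
def simulate_stack(instructions):
    """Simulate the evaluation stack for straight-line code.

    Each instruction is (name, pops, pushes).  Returns, for every
    instruction, the indices of the instructions whose results it
    pops, in the order in which they were pushed.
    """
    stack = []    # each entry: index of the instruction that pushed it
    inputs = []
    for index, (name, pops, pushes) in enumerate(instructions):
        if pops > len(stack):
            raise ValueError("stack underflow at %d" % index)
        first = len(stack) - pops
        inputs.append(stack[first:])
        del stack[first:]
        stack.extend([index] * pushes)
    return inputs

# Hypothetical bytecode for "a + b * c".
program = [
    ("ldloc a", 0, 1),
    ("ldloc b", 0, 1),
    ("ldloc c", 0, 1),
    ("mul", 2, 1),
    ("add", 2, 1),
]
inputs = simulate_stack(program)
\end{lstlisting}

Here instruction 3 (\verb|mul|) consumes the values pushed by
instructions 1 and 2, so its dummy arguments can be replaced by the
temporary variables of those two instructions.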

\lstinputlisting[firstline=21, lastline=52]
{./Evolution/03_Dataflow_Comments.cs}

\newpage
\subsection*{In-lining expressions}

Many of the temporary variables can be in-lined into the expressions in which
they are used.  This is in general a non-trivial optimization; however, it is
simpler in this case since the temporary variables generated to store the
stack values are guaranteed to be static single assignment variables (each
variable is assigned only once, by the push instruction, and is used
only once, by the pop instruction).
Having said that, we still need to check that the optimization is
safe with regard to expression evaluation order and with regard to branching.

\lstinputlisting[firstline=1, lastline=32]
{./Evolution/04_Inline_expressions.cs}

\newpage
\subsection*{Finding basic blocks}

The first step of reconstructing any high-level structures is the
decomposition of the program into basic blocks.  This is an easy
algorithm to implement.

I chose to use the following constraint for the output:
``Each basic block starts with a label and is exited by an explicit
\verb|goto| statement.''
Therefore, except for the method entry, the order
of the blocks is completely irrelevant.  Swapping the basic
blocks is not going to change the semantics of the program in any way.
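
The decomposition itself can be sketched with the textbook `leaders'
approach; this is an assumption made for illustration and not
necessarily the method used in the project.  Python is used only for
brevity, the instruction pairs and the name \verb|find_basic_blocks|
are hypothetical, and only three branch opcodes are recognised.

\begin{lstlisting}[language=Python]
def find_basic_blocks(instructions):
    """Split a list of instructions into basic blocks.

    Each instruction is (opcode, target), where target is the index
    of the branch target or None.  Returns a list of (start, end)
    index ranges with end exclusive.
    """
    branches = ("br", "brtrue", "brfalse")
    leaders = {0}
    for index, (opcode, target) in enumerate(instructions):
        if opcode in branches:
            # A branch target starts a new block ...
            leaders.add(target)
            # ... and so does the instruction after a branch.
            if index + 1 < len(instructions):
                leaders.add(index + 1)
    starts = sorted(leaders)
    ends = starts[1:] + [len(instructions)]
    return list(zip(starts, ends))

program = [
    ("ldloc", None),    # 0
    ("brfalse", 4),     # 1
    ("ldloc", None),    # 2
    ("br", 5),          # 3
    ("ldloc", None),    # 4
    ("ret", None),      # 5
]
blocks = find_basic_blocks(program)
\end{lstlisting}

For this input the sketch yields four blocks covering instructions
0 to 1, 2 to 3, 4 and 5.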

\lstinputlisting[firstline=1, lastline=30]
{./Evolution/05_Find_basic_blocks.cs}

\newpage
\subsection*{Finding loops}

The algorithm for finding loops is inspired by T1-T2 transformations.
T1-T2 transformations are used to determine whether a graph is
reducible or not.  The core idea is that if a block of code has
only one predecessor then the block of code can be merged
with its predecessor to form a directed acyclic graph.  Using this,
loops will reduce to single self-referencing nodes.
This also works for nested loops.
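
The reduction can be sketched as follows.  This is a simplified
illustration under stated assumptions, not the project's
implementation: Python is used only for brevity, the name
\verb|reduce_graph| and the graph representation are hypothetical,
every successor is assumed to appear as a key of the mapping, and the
group of nodes reported for a loop depends on the order in which nodes
are merged.

\begin{lstlisting}[language=Python]
def reduce_graph(successors, entry):
    """Reduce a control-flow graph using T1-T2 transformations.

    successors maps every node to the set of its successors.
    Returns (remaining, loops): the nodes left when no more
    transformations apply, and the groups of original nodes that
    had collapsed into a self-referencing node.
    """
    succ = {n: set(targets) for n, targets in successors.items()}
    members = {n: {n} for n in succ}
    loops = []
    while True:
        # T1: a self-referencing node is a loop; record it and
        # drop the edge.
        for node in succ:
            if node in succ[node]:
                succ[node].discard(node)
                loops.append(frozenset(members[node]))
        # T2: merge one node into its unique predecessor, then
        # start over.
        for node in list(succ):
            preds = [p for p in succ if node in succ[p]]
            if node != entry and len(preds) == 1:
                pred = preds[0]
                succ[pred].discard(node)
                succ[pred] |= succ[node]
                members[pred] |= members[node]
                del succ[node], members[node]
                break
        else:
            return set(succ), loops

# A while loop: 0 -> 1, 1 -> 2 or 3, 2 -> 1, 3 is the exit.
while_loop = {0: {1}, 1: {2, 3}, 2: {1}, 3: set()}
remaining, loops = reduce_graph(while_loop, 0)
\end{lstlisting}

In this example nodes 1 and 2 collapse into one self-referencing node,
which is recorded as a loop, and the graph then reduces to the entry
node alone.  A graph is reducible exactly when only one node remains.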

Note that merely adding a loop does not change the program in any way --
the loop is completely redundant as far as control flow goes.
The basic blocks still explicitly transfer control using \verb|goto|
statements, so the control flow never reaches the loop.

This is a desirable property.  It ensures that the program will run
correctly.  The order of basic blocks and their nesting within loops
does not have any effect on program correctness.

The only advantage of the loop is readability and that some \verb|goto|
statements can be replaced by \verb|break| and \verb|continue| statements
if they have the same semantics in the given context.

\lstinputlisting[firstline=1, lastline=25]
{./Evolution/06_Find_loops.cs}

\newpage
\subsection*{Finding conditionals}

The current algorithm for finding conditionals works as follows.
First, find a node that has two successors.  All nodes reachable
\emph{only} from the `true' branch form the `true' body of
the conditional.  Similarly, all nodes reachable \emph{only} from the
`false' branch form the `false' body.  The remaining nodes are
not part of the conditional.
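
One way to read this rule is as a difference of two reachability sets.
The sketch below illustrates that reading; it is not the project's
code.  Python is used only for brevity, the names \verb|reachable| and
\verb|split_conditional| are hypothetical, and the region is assumed
to be acyclic (for instance because loops have already been collapsed
into single nodes).

\begin{lstlisting}[language=Python]
def reachable(successors, start):
    """Return all nodes reachable from start, including start."""
    seen = set()
    todo = [start]
    while todo:
        node = todo.pop()
        if node not in seen:
            seen.add(node)
            todo.extend(successors[node])
    return seen

def split_conditional(successors, true_start, false_start):
    """Return (true_body, false_body) for a two-way branch."""
    from_true = reachable(successors, true_start)
    from_false = reachable(successors, false_start)
    return from_true - from_false, from_false - from_true

# Node 0 branches to 1 (true) or 2 (false); both continue at 3.
graph = {0: {1, 2}, 1: {3}, 2: {3}, 3: set()}
true_body, false_body = split_conditional(graph, 1, 2)
\end{lstlisting}

Node 3 is reachable from both branches, so it belongs to neither body
and follows the conditional.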

As with loops, adding a conditional does not have any
effect on program correctness.

\lstinputlisting[firstline=1, lastline=32]
{./Evolution/07_Find_conditionals.cs}

\newpage
\subsection*{Remove dead jumps}

There are many \verb|goto| statements of the form:
\begin{verbatim}
goto BasicBlock_X;
BasicBlock_X:
\end{verbatim}
These \verb|goto| statements can be removed.  As a result of
doing that, several labels will become dead; these can be
removed as well.
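
The two passes can be sketched as follows.  This is a minimal
illustration and not the project's code: Python is used only for
brevity, the statement pairs and the name \verb|remove_dead_jumps| are
hypothetical, and every jump is assumed to be an unconditional
\verb|goto|.

\begin{lstlisting}[language=Python]
def remove_dead_jumps(statements):
    """Drop gotos that target the next statement, then dead labels.

    Each statement is (kind, value) with kind being "goto",
    "label" or "stmt".
    """
    kept = []
    for index, (kind, value) in enumerate(statements):
        following = None
        if index + 1 < len(statements):
            following = statements[index + 1]
        if kind == "goto" and following == ("label", value):
            continue    # jump to the very next statement
        kept.append((kind, value))
    # A label is dead when no remaining goto refers to it.
    used = {value for kind, value in kept if kind == "goto"}
    return [(kind, value) for kind, value in kept
            if kind != "label" or value in used]

code = [
    ("label", "BasicBlock_0"),
    ("stmt", "x = 1"),
    ("goto", "BasicBlock_1"),
    ("label", "BasicBlock_1"),
    ("stmt", "y = 2"),
    ("goto", "BasicBlock_0"),
]
cleaned = remove_dead_jumps(code)
\end{lstlisting}

In this example the jump to \verb|BasicBlock_1| and the label itself
both disappear, while \verb|BasicBlock_0| is kept because the final
\verb|goto| still refers to it.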

\lstinputlisting[firstline=1, lastline=32]
{./Evolution/08_Remove_dead_jumps.cs}

\newpage
\subsection*{Reduce loops}

It is common for a loop to be preceded by the initialization of a
temporary variable, to start by evaluating a condition, and to
end by incrementing a variable.  We can look
for these patterns and, if they are found, move the code
into the header of a \verb|for(;;)| statement.

\lstinputlisting[firstline=1, lastline=32]
{./Evolution/09_Reduce_loops.cs}

\newpage
\subsection*{Clean up}

Finally, some minor cleanups are performed, such as removing empty
statements and simplifying type names.

\lstinputlisting[firstline=1, lastline=42]
{./Evolution/10_Short_type_names.cs}

\newpage
\subsection*{Original source code}
Here is the original source code for reference.

\lstinputlisting[firstline=1, lastline=41]
{./Evolution/QuickSort_original.cs}

\newpage
\subsection*{Unexpected difficulties}

The \emph{CodeDom} library that I had initially intended to use
to output source code in an arbitrary \emph{.NET} language has turned out to be
quite incomplete.  That is, since the library aims to be
able to represent source code for any language, its
feature set is limited to the lowest common denominator.
Therefore, I have switched to the \emph{NRefactory} library, which
is specifically designed with \CS and \emph{VB.NET} in mind.

Using \emph{T1-T2} transformations for loop finding turned out to be
slightly more difficult than expected since the algorithm is, after all,
originally intended to produce a yes or no answer to whether the graph
is reducible.  However, it was not problematic to adapt
the idea to suit a different purpose.

\subsection*{Summary}

The project tasks were performed in the planned order and
the project is progressing according to the schedule.

The quality of decompilation of the Quick-Sort algorithm is
almost `as good as it gets', so I intend to look for a more
complex assembly to tackle.

\end{document}