mirror of https://github.com/icsharpcode/ILSpy.git
You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
243 lines
7.7 KiB
243 lines
7.7 KiB
\documentclass[12pt]{article} |
|
\usepackage{a4wide} |
|
\usepackage{listings} |
|
|
|
\parindent 0pt |
|
\parskip 6pt |
|
|
|
\begin{document} |
|
|
|
\thispagestyle{empty} |
|
|
|
\rightline{\large\emph{David Srbeck\'y}} |
|
\medskip |
|
\rightline{\large\emph{Jesus College}} |
|
\medskip |
|
\rightline{\large\emph{ds417}} |
|
|
|
\vspace{0.675in} |
|
|
|
\centerline{\large Progress Report} |
|
\vspace{0.4in} |
|
\centerline{\Large\bf .NET Decompiler} |
|
\vspace{0.3in} |
|
\centerline{\large\emph{January~30,~2008}} |
|
|
|
\vspace{0.675in} |
|
|
|
{\bf Project Originator:} \emph{David Srbeck\'y} |
|
|
|
\vspace{0.1in} |
|
|
|
{\bf Project Supervisor:} \emph{Alan Mycroft} |
|
|
|
\vspace{0.1in} |
|
|
|
{\bf Director of Studies:} \emph{Jean Bacon} and \emph{David Ingram} |
|
|
|
\vspace{0.1in} |
|
|
|
{\bf Overseers:} \emph{Anuj Dawar} and \emph{Andrew Moore} |
|
|
|
\vspace{0.1in} |
|
|
|
\vfil |
|
\eject |
|
|
|
\newcommand{\CS}{\emph{C\#} } |
|
|
|
\lstset{ |
|
basicstyle=\small, |
|
language={[Sharp]C}, |
|
tabsize=4, |
|
numbers=left, |
|
frame=single, |
|
frameround=tfft |
|
} |
|
|
|
\section*{Work completed so far} |
|
\subsection*{Disassemble \emph{.NET} bytecode} |
|
|
|
The \emph{.NET} assembly is read using the \emph{Cecil} library |
|
and the class structure is created. Method bodies contain the |
|
disassembly of the IL bytecode. The debugging comment on the |
|
right indicates the stack behavior of the given instruction. |
|
This, of course, is not valid \CS code yet. |
|
|
|
\lstinputlisting[lastline=32] |
|
{./Evolution/01_Disassemble.cs} |
|
|
|
\newpage |
|
\subsection*{Start creating expressions} |
|
|
|
The bytecodes are converted to \CS expressions on individual basis. |
|
Only one bytecode is considered at a time and thus the expressions |
|
are completely independent. The resulting output is a valid \CS code |
|
which however does not compile since the dummy arguments \verb|arg1|, |
|
\verb|arg2|, etc\dots{} are never defined. Conditional and unconditional |
|
branches are converted to \verb|goto| goto statements. |
|
|
|
\lstinputlisting[lastline=32] |
|
{./Evolution/02_Peephole_decompilation.cs} |
|
|
|
\newpage |
|
\subsection*{Data-flow analysis} |
|
|
|
The execution of the bytecode is simulated and the state of the stack is |
|
recorded for each position. We are interested in the number of |
|
elements on the stack as well as which instruction has pushed the |
|
individual elements on the stack. This information can then be used to eliminate |
|
the dummy \verb|arg1| arguments. Result of each instruction is stored |
|
in new temporary variable. When an instruction pops a stack value |
|
we look-up which instruction has allocated the value and use the temporary |
|
variable of the allocating instruction. |
|
This code compiles and works correctly. |
|
|
|
\lstinputlisting[firstline=21, lastline=52] |
|
{./Evolution/03_Dataflow_Comments.cs} |
|
|
|
\newpage |
|
\subsection*{In-lineing expressions} |
|
|
|
Many of the temporary variables can be in-lined into the expressions in which |
|
they are used. This is in general non-trivial optimization, however it is |
|
simpler in this case since the temporary variables generated to store the |
|
stack values are guaranteed to be single static assignment variables (the |
|
variable is assigned only once during the push instruction and is used |
|
only once during the pop instruction). |
|
Having said that, we still need to check that doing the optimization is |
|
safe with regards to expression evaluation order and with regrads to branching. |
|
|
|
\lstinputlisting[firstline=1, lastline=32] |
|
{./Evolution/04_Inline_expressions.cs} |
|
|
|
\newpage |
|
\subsection*{Finding basic blocks} |
|
|
|
The first step of reconstructing any high-level structures is the |
|
decomposition of the program into basic blocks. This is an easy |
|
algorithm to implement. |
|
|
|
I chose to use the following constraint for the output: |
|
``Each basic block starts with a label and is exited by an explicit |
|
\verb|goto| statement.'' |
|
Therefore except for the method entry, the order |
|
of the blocks is completely irrelevant. Any swapping of the basic |
|
blocks is not going change the semantics of the program in any way. |
|
|
|
\lstinputlisting[firstline=1, lastline=30] |
|
{./Evolution/05_Find_basic_blocks.cs} |
|
|
|
\newpage |
|
\subsection*{Finding loops} |
|
|
|
The algorithm for finding loops is inspired by T1-T2 transformations. |
|
T1-T2 transformations are used to determine whether a graph is |
|
reducible or not. The core idea is that if a block of code has |
|
only one predecessor then the block of code can be merged |
|
with its predecessor to form a directed acyclic graph. Using this, |
|
loops will reduce to single self-referencing nodes. |
|
This also works for nested loops. |
|
|
|
Note that merely adding a loop does not change the program in any way -- |
|
the loop is completely redundant as far as control flow goes. |
|
The basic blocks still explicitly transfer control using \verb|goto| |
|
statements, so the control flow never reaches the loop. |
|
|
|
This is desirable property. It ensures that the program will run |
|
correctly. The order of basic blocks and their nesting within loops |
|
does not have any effect on program correctness. |
|
|
|
The only advantage of the loop is readability and that some \verb|goto| |
|
statements can be replaced by \verb|break| and \verb|continue| statements |
|
if they have the same semantics in the given context. |
|
|
|
\lstinputlisting[firstline=1, lastline=25] |
|
{./Evolution/06_Find_loops.cs} |
|
|
|
\newpage |
|
\subsection*{Finding conditionals} |
|
|
|
The current algorithm for finding conditionals works as follows: |
|
First find a node that has two successors. Get all nodes accessible |
|
\emph{only} from the `true' branch -- these form the `true' body of |
|
the conditional. Similarly, all nodes accessible \emph{only} from the |
|
`false' branch form the `false' body. The rest of the nodes is |
|
not part of the conditional. |
|
|
|
Similarly as for the loops, adding a conditional does not have any |
|
effect on program correctness. |
|
|
|
\lstinputlisting[firstline=1, lastline=32] |
|
{./Evolution/07_Find_conditionals.cs} |
|
|
|
\newpage |
|
\subsection*{Remove dead jumps} |
|
|
|
There are many \verb|goto| statements in the form: |
|
\begin{verbatim} |
|
goto BasicBlock_X; |
|
BasicBlock_X: |
|
\end{verbatim} |
|
These \verb|goto| statement can be removed. As a result of |
|
doing that, several labels will become dead; these can be |
|
removed as well. |
|
|
|
\lstinputlisting[firstline=1, lastline=32] |
|
{./Evolution/08_Remove_dead_jumps.cs} |
|
|
|
\newpage |
|
\subsection*{Reduce loops} |
|
|
|
It is common for loops to be preceded by a temporary variable |
|
initialization, start by evaluating a condition and finally |
|
end by doing an increment on a variable. We can look |
|
for these patterns and if they are found move the code |
|
to the \verb|for(;;)| part of the statement. |
|
|
|
\lstinputlisting[firstline=1, lastline=32] |
|
{./Evolution/09_Reduce_loops.cs} |
|
|
|
\newpage |
|
\subsection*{Clean up} |
|
|
|
Finally some minor cleanups like removing empty statements and |
|
simplifying type names. |
|
|
|
\lstinputlisting[firstline=1, lastline=42] |
|
{./Evolution/10_Short_type_names.cs} |
|
|
|
\newpage |
|
\subsection*{Original source code} |
|
Here is the original source code for reference. |
|
|
|
\lstinputlisting[firstline=1, lastline=41] |
|
{./Evolution/QuickSort_original.cs} |
|
|
|
\newpage |
|
\subsection*{Unexpected difficulties} |
|
|
|
The \emph{CodeDom} library that I have initially intended to use |
|
to output source code in arbitrary \emph{.NET} language has turned out to be |
|
quite incomplete. That is, since the library aims to be |
|
able to represent source code for any language, it has |
|
feature set limited to the lowest common denominator. |
|
Therefore, I have switched to \emph{NRefactory} library which |
|
is specifically designed with \CS and \emph{VB.NET} in mind. |
|
|
|
Using \emph{T1-T2} transformations for loop finding turned out to be |
|
a slightly more difficult since the algorithm is, after all, |
|
originally intended to produce a yes or no answer to whether the graph |
|
is reducible. However, it was not problematic to refactor |
|
the idea to suit a different purpose. |
|
|
|
\subsection*{Summary} |
|
|
|
The project tasks were performed in the planned order and |
|
the project is progressing according to the schedule. |
|
|
|
The quality of decompilation of the Quick-Sort algorithm is |
|
almost `as good as it gets' so I intend to look for some more |
|
complex assembly to tackle. |
|
|
|
\end{document}
|
|
|