Decompiler Construction: Chapter 7 - Reconstructing Control Flow using Graph Theory, SSA, and Types
Now that the IR is ready to go, we can begin transforming it into readable code.
Pass Methodology
This chapter and the preceding chapters (up to the final one) define transformation passes. Each pass operates on the IR and may modify it.
Passes must be applied iteratively until a fixed point is reached i.e., until no pass produces any changes.
- After any modification, all passes must be re-run.
- Termination is achieved only when a full pass cycle results in no changes.
Each transformation must reduce or simplify the IR toward a more canonical form.
- Emitted statements and expressions should be as minimal as possible.
- Transformations must be convergent (i.e., avoid oscillation or expansion).
Failure to enforce minimal and convergent transformations may result in non-termination.
Graphing IR
Because a CFG is not maintained directly, a helper function must be used to construct one from the IR. A CFG allows us a data structure to easily and simply follow and analyze control flow.
Node Construction
IR statements are linear. CFG nodes (basic blocks) are constructed by grouping contiguous statements into ranges [start, end], where start and end are statement indices.
A new node begins at:
- The entry point of the IR
- Any statement that is the target of a control flow edge (e.g., jump/branch target)
- Any statement immediately following a control flow terminator
A node ends at:
- Any control flow terminator (e.g., jump, conditional branch, return)
- The statement immediately before a new node begins
Each node must:
- Contain only sequential statements with no internal control flow splits
- Have a single entry point (
start) - End with at most one control flow terminator
Edges
This process analyzes control flow constructs and emits edges between nodes:
- Conditional statements (if/else) create branching edges.
- Loop constructs (while, do-while) introduce back-edges and cycles.
- Page calls and any abstract jumps create edges.
The resulting CFG must:
- Accurately represent all possible execution paths.
- Ensure every block is reachable or explicitly marked unreachable.
- Resolve all jump targets into valid graph edges.
CFG construction must be deterministic and repeatable, as it is relied upon by many analysis and transformation passes.
Example Graph
Lets refer back to this simple IR
1
2
3
4
5
int32_t hello = 0;
while (hello < 100) {
hello = hello + (hello * hello);
}
printf("%d, %d\n", hello, hello);
Static Single Assignment
Static Single Assignment (SSA) assigns each variable a unique version (internal register) for every write.
If a variable can have multiple possible definitions at a merge point in a CFG, a φ (phi) node is used to select the correct version based on the incoming path.
Algorithm
The standard approach for constructing SSA form is the Cytron et al. algorithm using dominance frontiers.
It operates by first computing the CFG and dominator tree, then using dominance frontiers to determine where φ (phi) definitions are required.
Phi-nodes are placed at all CFG join points that are in the dominance frontier of a definition, guaranteeing that all possible reaching definitions are correctly merged.
Time complexity:
- V = number of variables
- B = number of basic blocks
- E = number of CFG edges
With Lengauer-Tarjan Imidiate Dominators O(V * E + E (E, B))
Psuedo-code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
INPUT:
CFG with basic blocks
defsites[v] = set of blocks where variable v is assigned
OUTPUT:
CFG in SSA form (with phi nodes + renamed variables)
STEP 1: Compute Dominators
{
computeDominators(CFG)
for each block b:
dom[b] = all blocks initially
dom[start] = {start}
repeat until no change:
for each block b != start:
dom[b] = intersection of dom[p] for all predecessors p of b
union {b}
}
STEP 2: Compute Immediate Dominators
{
idom = buildImmediateDominators(dom)
build dominator tree from idom
}
STEP 3: Compute Dominance Frontiers
{
computeDF(b):
DF[b] = empty set
for each successor s of b:
if idom[s] != b:
add s to DF[b]
for each child c in dominator tree:
for each w in DF[c]:
if idom[w] != b:
add w to DF[b]
}
STEP 4: Place φ-Def Sites
{
for each variable v:
worklist = defsites[v]
hasPhi = empty set
while worklist not empty:
b = pop(worklist)
for each d in DF[b]:
if d not in hasPhi:
insert φ(v) in block d
hasPhi.add(d)
if v not defined in d:
worklist.add(d)
}
STEP 5: Rename Variables (SSA enforcement)
{
rename(block b):
for each instruction in b:
replace uses with current version on stack
replace defs with new version:
v_i = new version counter(v)
push v_i onto stack[v]
for each successor s:
for each φ in s:
set φ argument for edge b -> s
to current stack version of that variable
for each child c in dominator tree:
rename(c)
pop all variables defined in b from stack
}
start:
initialize empty stacks for all variables
rename(start_block)
Example SSA Form
We use R as the prefix here for there SSA version, given:
1
2
3
4
5
int32_t hello = 0;
while (hello < 100) {
hello = hello + (hello * hello);
}
printf("%d, %d\n", hello, hello);
SSA:
1
2
3
4
5
int32_t hello(R1) = 0;
while (hello(PHI[R1, R2]) < 100) {
hello(R2) = hello(PHI[R1, R2]) + (hello(PHI[R1, R2]) * hello(PHI[R1, R2]));
}
printf("%d, %d\n", hello(PHI[R1, R2]), hello(PHI[R1, R2]));
Types and Abstract Types using SSA
Type information originates during Assembly -> IL lifting, where instructions use bitcast to redundantly cast each register use, etc. These types must be preserved in the IR, even if redundant, as they provide the foundation for later refinement.
However, lifted types are not authoritarian; they are often incomplete, overly specific, and overly redundant. The type system must utilize SSA to help refine and simplify them.
Using SSA
Each SSA value is associated with a set of possible types, rather than a single fixed type. Initially, this set may contain only the lifted type, but it is progressively refined through usage. A direct assignment:
1
R1 = ...
assigns R1 an initial type from lifting. A phi-node:
1
R3 = PHI(R1, R2)
represents a merge of possible values and must also merge their types:
1
type(R3) = merge(type(R1), type(R2))
Type Merging
Type merging must be explicitly defined and deterministic:
- Identical types -> preserved
- Compatible types -> promoted (e.g., int + float -> numeric)
- Incompatible/abstract types -> USER HANDLED
Dominance must not be used to arbitrarily select a type. Doing so discards valid paths and produces incorrect results.
Refinement
Types are refined by propagating from SSA uses:
- Arithmetic operations -> numeric types
- Memory access (*R) -> pointer types, obey region size
- Collapse equivalent types
- Eliminate unused SAFE abstractions
- Lower SAFE abstract types into primitive types when possible
Premature simplification must be avoided, as it can destroy information required for correct reconstruction.
Next Chapter: Chapter 8 - Designing Safe Optimization Passes
Prev Chapter: Chapter 6 - IR Semantics Edge Cases, Undefined Behavior, and Special Handling