Decompiler Construction: Chapter 12 - Abstract Modeling of Pages, Regions, Self-Modifying Code, and Indirect Calls
The idea of modeling abstract concepts is to represent them in a form that is easy to analyze and transform, while still preserving all semantics necessary to handle side effects.
Architecture-Specific Instructions
As covered in the earlier chapters on lifting to IL, you will encounter instructions that are highly CPU-specific. One example is PAUSE on x86.
Instructions like this do not meaningfully modify state (registers, memory, or control flow); instead, they provide hints to the processor (e.g., improving spin-wait behavior). Model them explicitly as special operations, typically intrinsics or pseudo-calls, with their side-effect semantics attached.
In this case, PAUSE can be treated as an operation with no side effects on registers or memory and no impact on control flow. This makes it safe to transform during later passes, as long as timing-sensitive behavior is not a concern. Where timing does matter, the affected code is modeled as a critical region.
Do not drop these instructions blindly. Model them precisely, then decide how aggressively you want to optimize them.
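A minimal sketch of this idea in Python, assuming a hypothetical `Effect` flag set and `Intrinsic` record (neither name comes from the earlier chapters). It lifts PAUSE to a named intrinsic and leaves the decision to drop it to a later pass:

```python
from dataclasses import dataclass
from enum import Flag, auto


class Effect(Flag):
    """Hypothetical side-effect flags attached to an IL operation."""
    NONE = 0
    MEMORY = auto()        # reads or writes memory
    CONTROL_FLOW = auto()  # may redirect execution
    TIMING = auto()        # only influences timing (processor hints)


@dataclass(frozen=True)
class Intrinsic:
    name: str
    effects: Effect


def lift_hint(mnemonic: str) -> Intrinsic:
    """Lift a CPU hint instruction to a named intrinsic instead of dropping it."""
    # Assumption: the lifter keeps a small table of known hint instructions.
    known_hints = {"pause": Intrinsic("x86_pause", Effect.TIMING)}
    return known_hints[mnemonic.lower()]


def can_remove(op: Intrinsic, in_critical_region: bool) -> bool:
    """A later pass may drop an op whose only effect is timing,
    unless it sits inside a region marked as timing-critical."""
    if op.effects & (Effect.MEMORY | Effect.CONTROL_FLOW):
        return False
    return not in_critical_region


pause = lift_hint("PAUSE")
print(pause.name, can_remove(pause, in_critical_region=False))
```

Keeping the effect flags on the intrinsic means the lifter never has to decide whether the hint matters; that choice stays with whichever optimization pass runs later.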
Pages / Regions
Executable code is often organized into logical regions (or pages), especially in a binary that relies on dynamic dispatch, trampolines, or runtime-generated code.
Instead of treating every address as unrelated, we model these as higher-level regions. A region represents a contiguous block of executable behavior that can be targeted by calls or jumps.
In the IR, these regions are represented as pages. When a call or branch targets such a region, we rewrite it to target the page rather than an unrelated raw address.
This abstraction becomes crucial when:
- Multiple entry points target the same logical code
- Code is dynamically generated or relocated
- Control flow cannot be resolved statically
For now, the goal is not to optimize pages but to model them correctly. More aggressive transformations come later.
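One way to picture the page abstraction is as a lookup from raw addresses to regions. The sketch below is an illustration only: `Page`, `PageMap`, and `retarget` are hypothetical names, and it assumes pages do not overlap.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical IR page: one contiguous region of executable code."""
    name: str
    start: int
    end: int  # exclusive
    entry_points: set = field(default_factory=set)


class PageMap:
    def __init__(self, pages):
        # Assumption: pages never overlap, so sorting by start is enough.
        self.pages = sorted(pages, key=lambda p: p.start)
        self.starts = [p.start for p in self.pages]

    def lookup(self, addr):
        """Return the page containing addr, or None if it is unmapped."""
        i = bisect_right(self.starts, addr) - 1
        if i >= 0 and addr < self.pages[i].end:
            return self.pages[i]
        return None

    def retarget(self, addr):
        """Rewrite a raw call/jump target into a (page, offset) reference."""
        page = self.lookup(addr)
        if page is None:
            return ("raw", addr)  # unresolved: keep the raw address
        page.entry_points.add(addr)  # several entry points may share a page
        return (page.name, addr - page.start)


pages = PageMap([Page("page_a", 0x1000, 0x1040), Page("page_b", 0x2000, 0x2100)])
print(pages.retarget(0x1000))
print(pages.retarget(0x1010))
print(pages.retarget(0x3000))
```

Recording entry points on the page, rather than creating one node per target address, is what lets multiple callers resolve to the same logical code.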
Self-Modifying Code (SMC)
Using the DBI data from earlier, along with the logical PC we constructed, we can detect regions of self-modifying code (SMC) and group them into contiguous statements.
Instead of modeling each mutation independently, we treat the observed execution as mutually exclusive paths depending on the runtime memory state.
Example pseudo-IR (observed execution):
R1 = 1;
// or
R2 = 8;
R3 = 9;
// (determined to be contiguous by logical PC)
Modeled abstractly:
if (memread[**executes R1**]) {
    R1 = 1;
} else if (memread[**executes R2**]) {
    R2 = 8;
    R3 = 9;
}
Because the region is contiguous (as determined by the logical PC), we only need to guard on the first instruction. The remainder of the block is assumed to follow deterministically once the path is selected.
This abstraction allows us to:
- Preserve correctness under mutation
- Avoid duplicating analysis across variants
- Treat SMC as structured control flow rather than randomness
However, this model is only valid if:
- The region boundaries are accurate
- The execution paths are truly mutually exclusive
- No interleaving mutations violate the assumed structure
All of this is determined by the logical PC.
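The grouping and guarding described in this section can be sketched as follows. This is a simplified illustration under stated assumptions: the trace record layout `(variant_id, logical_pc, stmt)` is hypothetical, as are the function names, and the contiguity check stands in for the validity conditions listed above.

```python
from collections import defaultdict


def group_variants(trace):
    """Group observed statements by variant, ordered by logical PC.

    `trace` holds hypothetical DBI records: (variant_id, logical_pc, stmt).
    """
    variants = defaultdict(list)
    for variant_id, logical_pc, stmt in trace:
        variants[variant_id].append((logical_pc, stmt))
    grouped = {}
    for variant_id, records in variants.items():
        records = sorted(set(records))  # drop repeated observations
        pcs = [pc for pc, _ in records]
        # The single-guard model only holds for contiguous logical PCs.
        if any(b - a != 1 for a, b in zip(pcs, pcs[1:])):
            raise ValueError(f"variant {variant_id!r} is not contiguous")
        grouped[variant_id] = [stmt for _, stmt in records]
    return grouped


def emit_guarded(grouped):
    """Emit one if / else-if chain, guarding only each variant's first statement."""
    lines = []
    for i, stmts in enumerate(grouped.values()):
        keyword = "if" if i == 0 else "} else if"
        lines.append(f'{keyword} (memread[executes "{stmts[0]}"]) {{')
        lines.extend(f"    {s}" for s in stmts)
    if lines:
        lines.append("}")
    return "\n".join(lines)


trace = [
    ("v1", 0, "R1 = 1;"),
    ("v2", 0, "R2 = 8;"),
    ("v2", 1, "R3 = 9;"),
]
print(emit_guarded(group_variants(trace)))
```

Rejecting a non-contiguous variant outright is a deliberately conservative choice here: if the region boundaries are wrong, guarding only the first instruction would no longer be sound.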
Indirect Calls
Using information gathered from DBI, we can resolve possible targets of indirect calls.
We model the indirect read as a virtual variable so it does not introduce unintended side effects on other variables.
Example:
Suppose an indirect call through R2 may target either hi or hello. This can be modeled in pseudo-IR as:
virt_var1 = memread(R2);
if (virt_var1 == hi) {
    hi();
} else if (virt_var1 == hello) {
    hello();
}
Do not model indirect calls as switch statements. An if-else chain is generally safer, as it preserves order and avoids introducing incorrect assumptions about exhaustiveness. It also prevents unnecessary pollution of the CFG with artificial jump tables.
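As a rough sketch, the lowering could be generated like this. The function name `lower_indirect_call` and the `call(...)` fallback are assumptions made for illustration; the fallback branch in particular is not part of the pseudo-IR above and is added only so the chain does not claim to be exhaustive.

```python
def lower_indirect_call(pointer, observed_targets, var_name="virt_var1"):
    """Lower an indirect call into an if / else-if chain over DBI-observed targets.

    `observed_targets` is kept in the order the targets were observed.
    """
    # The indirect read goes through a virtual variable, not a real register.
    lines = [f"{var_name} = memread({pointer});"]
    for i, target in enumerate(observed_targets):
        keyword = "if" if i == 0 else "} else if"
        lines.append(f"{keyword} ({var_name} == {target}) {{")
        lines.append(f"    {target}();")
    if observed_targets:
        # Assumption: targets never seen by DBI fall back to the original
        # indirect call instead of being treated as impossible.
        lines.append("} else {")
        lines.append(f"    call({var_name});")
        lines.append("}")
    else:
        lines.append(f"call({var_name});")
    return "\n".join(lines)


print(lower_indirect_call("R2", ["hi", "hello"]))
```

Because the targets are emitted in observation order and compared one by one, no jump table is introduced and the original order of checks is preserved.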
Next Chapter: Chapter 13 - Safe Page-Level Optimization
Prev Chapter: Chapter 11 - Control Flow Recovery and Branch Simplification