Post

Decompiler Construction: Chapter 4 - Designing an Architecture-Agnostic IR

Decompiler Construction: Chapter 4 - Designing an Architecture-Agnostic IR

Intermediate Representation (IR) is a way of representing programs in a common abstract format.

It differs from an Abstract Syntax Tree (AST). While both represent program structure, an AST reflects the high-level syntax and hierarchy, whereas an IR is designed for analysis and transformation. An AST does not track a control flow graph (CFG); instead, a CFG is typically constructed separately, while many IRs either encode or make it easier to construct a CFG when needed.

Because of this, IRs are generally more optimal for performing many small transformations, while ASTs are suboptimal for repeated low-level modifications.

In this IR, the structure is similar to an AST in that it represents program behavior explicitly, but the CFG is mapped through instructions (e.g., branches) rather than being directly linked into a CFG unless needed.

Statements & Expressions in IR

Expressions (Exprs) should be represented as a tree structure within the IR.

Statements (Stats) should be stored in a linear structure (e.g., a dynamic array) to keep transformation and analysis simple.

Think of the IR as a simple, explicit language.

Key Ideas

The main goal of the IR is to abstract common patterns into a unified representation.
Because this is the stage where most transformations occur, the design should favor representing as many as possible using a small number of kinds while still being able to distinguish them.

Having an explicit flag structure is crucial and will be extended as more passes are introduced.
It is also important to provide helper functions (e.g., operator overloads or builders) to simplify constructing and transforming IR nodes.

IR construction code should be abstracted as much as possible while not polluting your source!

Core IR Model

The IR is structured into a small set of core components:

Blocks

A block is a linear sequence of statements with a single entry point.
Control flow between blocks is handled explicitly through branch instructions.

Statements (Stats)

Statements represent side-effecting operations. These include:

  • Arithmetic and logical operations
  • Memory loads and stores
  • Flag operations
  • Control flow terminators (branches, calls, returns)

Each statement is executed in order unless a terminator is reached.

Expressions (Exprs)

Expressions represent computations or unsafe data.
They form a tree structure and are used inside statements.

Examples include:

  • Binary operations (add, sub, xor)
  • Constants
  • Register reads
  • Computed addresses

Terminators

Control flow is not implicit. It is explicitly represented at the end of a block using terminator statements such as:

  • GOTO
  • IF
  • CALL
  • RETURN

Terminators and control flow transfer define the edges whne a CFG is constructed.

Values

Values represent runtime data in the IR. These include:

  • virtual registers
  • constants
  • expression results

Memory Model

Memory is represented as an explicit address space accessed only through load/store operations.
This avoids ambiguity from pointer aliasing at the IR level.

Flags

CPU flags are represented as explicit state separate from general-purpose registers.
They are produced and consumed by specific kinds rather than being implicit side effects.

Example Structure

Lets do this simple C snippet:

1
2
3
4
5
int32_t hello = 0;
while (hello < 100) {
	hello = hello + (hello * hello);
}
printf("%d, %d\n", hello, hello);

CFG Graph

Download GraphViz Example


Next Chapter: Chapter 5 - Lifting IL to IR

Prev Chapter: Chapter 3 - Lifting Assembly to an Intermediate Language

This post is licensed under CC BY 4.0 by the author.