Decompiler Construction: Chapter 2 - Designing an ISA-like Intermediate Language (IL)
Before you start reading, ask yourself: how do we take machine code from different architectures and transform it into a common representation that preserves program behavior as accurately as possible?
That is the job of an Intermediate Language (IL). It represents code from multiple architectures in one common format that preserves control flow and semantics. Some architecture-specific details may be lost or only approximated, but the IL makes explicit what can be translated fully and what requires further analysis.
Designing
We want to model CPU behavior while remaining architecture-independent. A good starting point is arithmetic and bitwise operations such as add, subtract, multiply, divide, xor, and, or, and not.
To simplify transformation, we use a 3-address instruction format of the form:
dst = src1 op src2
Registers are represented as virtual registers using a prefix such as r.
For example, given x86 assembly:
ADD eax, esi
We first map architecture-specific registers into virtual registers:
- EAX -> r1
- ESI -> r2
Then we translate into IL:
ADD r1, r1, r2
In pseudo code:
r1 = r1 + r2;
This three-address form (commonly called three-address code) is used throughout the IL design.
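The translation steps above can be sketched in Python. The names `REG_MAP`, `Instr`, and `lift_add` are hypothetical and exist only to illustrate the three-address shape; a real lifter would handle many more operand kinds.

```python
from dataclasses import dataclass

# Hypothetical mapping from x86 register names to virtual registers,
# mirroring the EAX -> r1, ESI -> r2 example in the text.
REG_MAP = {"eax": "r1", "esi": "r2"}


@dataclass(frozen=True)
class Instr:
    """One three-address IL instruction: dst = src1 op src2."""
    op: str
    dst: str
    src1: str
    src2: str

    def __str__(self) -> str:
        return f"{self.op} {self.dst}, {self.src1}, {self.src2}"


def lift_add(dst: str, src: str) -> Instr:
    """Lift a two-operand x86 `ADD dst, src` into three-address form."""
    d = REG_MAP[dst.lower()]
    s = REG_MAP[src.lower()]
    # x86 ADD overwrites its first operand, so dst also appears as src1.
    return Instr("ADD", d, d, s)


print(lift_add("EAX", "ESI"))  # ADD r1, r1, r2
```

Making the destination a separate operand is what lets later passes rename or split registers without re-deriving which operand an instruction overwrites.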
Flags
We include virtual flags as explicit state within the IL to represent CPU condition codes such as OF, ZF, CF, and SF. Unlike general-purpose registers, these flags are implicitly produced and used by instructions rather than being directly operated on.
For example, given x86 assembly:
seto al
which sets AL to 1 if the Overflow Flag (OF) is set and to 0 otherwise.
We map AL -> r1 and give the OF flag the identifier 1.
We then translate this into IL as:
LOAD_FLAG r1, 1
In pseudo code:
r1 = get_flag(OF)
This design choice makes flag dependencies explicit in the IL, treating flags as first-class state rather than hidden side effects of instructions. That simplifies later lifting and analysis.
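A minimal sketch of how `LOAD_FLAG` might be evaluated is shown here. The `Flag` numbering is an assumption apart from OF = 1, which comes from the example above, and `load_flag` is a hypothetical helper. The sketch also treats r1 as a standalone register and ignores the fact that AL is only the low byte of EAX.

```python
from enum import IntEnum


class Flag(IntEnum):
    # Hypothetical numbering; only OF = 1 is taken from the text.
    ZF = 0
    OF = 1
    CF = 2
    SF = 3


def load_flag(regs: dict, flags: dict, dst: str, flag: Flag) -> None:
    """LOAD_FLAG dst, flag: copy a flag bit (0 or 1) into a virtual register."""
    regs[dst] = 1 if flags.get(flag, False) else 0


regs = {"r1": 0}
flags = {Flag.OF: True}
load_flag(regs, flags, "r1", Flag.OF)  # LOAD_FLAG r1, 1
print(regs["r1"])  # 1
```

Because the flag is a named operand, a dependency analysis can see that this instruction reads OF without knowing anything about `seto`.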
Internal variables
Internal variables should only represent analysis or transformation metadata, never values that map to the input architecture. Keeping the two apart avoids collisions with virtual registers and ambiguity about where a value came from.
State Model
The IL operates on an explicit program state consisting of:
- Virtual registers (r0, r1, …)
- Flags (OF, ZF, CF, SF, …)
- Memory (byte-addressed space)
Each instruction represents a deterministic transformation of this state.
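The state model can be sketched as a small Python class plus one instruction that transforms it. `State` and `step_add` are hypothetical names, and the 32-bit register width is an assumption chosen to match the x86 examples in this chapter. The flag rules are the usual ones for an x86-style ADD.

```python
from dataclasses import dataclass, field

MASK32 = 0xFFFFFFFF  # assumption: 32-bit virtual registers
SIGN32 = 0x80000000


@dataclass
class State:
    regs: dict = field(default_factory=dict)   # virtual registers, e.g. "r1"
    flags: dict = field(default_factory=dict)  # flag name -> bool
    mem: bytearray = field(default_factory=lambda: bytearray(64))  # byte-addressed


def step_add(state: State, dst: str, src1: str, src2: str) -> None:
    """ADD dst, src1, src2 with every flag update written out explicitly."""
    a, b = state.regs[src1], state.regs[src2]
    full = a + b
    res = full & MASK32
    state.regs[dst] = res
    state.flags["ZF"] = res == 0
    state.flags["SF"] = bool(res & SIGN32)
    state.flags["CF"] = full > MASK32
    # Signed overflow: both operands differ in sign from the result.
    state.flags["OF"] = bool((a ^ res) & (b ^ res) & SIGN32)


s = State(regs={"r1": 0x7FFFFFFF, "r2": 1})
step_add(s, "r1", "r1", "r2")
print(hex(s.regs["r1"]), s.flags["OF"])  # 0x80000000 True
```

The same inputs always produce the same registers and flags, which is the deterministic-transformation property described above.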
Calls
An explicit argument-register model helps reduce lifting collisions and ambiguity. For example, a [base, count] pair can describe an argument range: count registers starting at base.
For example, calling an abstract function print with the argument “hello world”:
LOAD_STRING r1, "hello world"
CALL print, r1, 1
Here, we load the string into r1 and invoke the function with r1 as the argument (count = 1).
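The call above can be sketched as follows, assuming that [base, count] means count consecutive virtual registers starting at base. That reading, along with the helper names `arg_range` and `call`, is an assumption made for illustration.

```python
import re


def arg_range(base: str, count: int) -> list:
    """Expand [base, count] into consecutive virtual register names."""
    m = re.fullmatch(r"r(\d+)", base)
    if m is None:
        raise ValueError(f"not a virtual register: {base!r}")
    start = int(m.group(1))
    return [f"r{start + i}" for i in range(count)]


def call(regs: dict, functions: dict, name: str, base: str, count: int):
    """CALL name, base, count: pass the registers in the range as arguments."""
    args = [regs[r] for r in arg_range(base, count)]
    return functions[name](*args)


regs = {}
regs["r1"] = "hello world"                      # LOAD_STRING r1, "hello world"
call(regs, {"print": print}, "print", "r1", 1)  # CALL print, r1, 1
```

Encoding the range in the instruction means a later pass never has to guess from a calling convention which registers are live arguments.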
Memory and CPU specific functions
As with flags, memory operations need to be represented explicitly. The IL can include instructions that read and write memory at a specified bit size.
For example, memory operations can take the form of load and store instructions that define:
- address
- size
- value
- dest (if applicable)
- offset (64-bit integer)
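A minimal sketch of sized load and store over byte-addressed memory uses the fields listed above. Little-endian byte order and sizes that are multiples of 8 bits are assumptions, and the function names are hypothetical.

```python
def _span(mem: bytearray, address: int, offset: int, size_bits: int):
    """Validate an access and return (start, byte_count)."""
    if size_bits <= 0 or size_bits % 8:
        raise ValueError("size must be a positive multiple of 8 bits")
    n = size_bits // 8
    start = address + offset
    if start < 0 or start + n > len(mem):
        raise IndexError("memory access out of range")
    return start, n


def store(mem: bytearray, address: int, offset: int, size_bits: int, value: int) -> None:
    """STORE: write the low size_bits of value at address + offset."""
    start, n = _span(mem, address, offset, size_bits)
    truncated = value & ((1 << size_bits) - 1)
    mem[start:start + n] = truncated.to_bytes(n, "little")


def load(mem: bytearray, address: int, offset: int, size_bits: int) -> int:
    """LOAD: read size_bits from address + offset; the result goes to dest."""
    start, n = _span(mem, address, offset, size_bits)
    return int.from_bytes(mem[start:start + n], "little")


mem = bytearray(16)
store(mem, 4, 2, 32, 0x11223344)
print(hex(load(mem, 4, 2, 32)))  # 0x11223344
```

Carrying the size on every access keeps partial reads visible: a 16-bit load from the same location returns only the low half of the stored value.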
For CPU-specific functionality, we can include a specialized call instruction that the user handles when the behavior is too abstract for the core IL. These instructions represent architecture-specific behavior that cannot be modeled directly in the core IL and are resolved during specialized analysis passes.
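One way such user-handled calls might be wired up is a registry of handlers keyed by name. Everything in this sketch is hypothetical, including the `intrinsic` decorator, the `cpuid` key, and the placeholder behavior of its handler.

```python
from typing import Callable, Dict

# Registry of user-supplied handlers, keyed by intrinsic name (hypothetical).
HANDLERS: Dict[str, Callable[[dict], None]] = {}


def intrinsic(name: str):
    """Decorator that registers a handler for one CPU-specific operation."""
    def register(fn: Callable[[dict], None]) -> Callable[[dict], None]:
        HANDLERS[name] = fn
        return fn
    return register


@intrinsic("cpuid")
def handle_cpuid(regs: dict) -> None:
    # Placeholder model only: a real pass would supply meaningful values.
    regs["r1"] = 0


def run_intrinsic(name: str, regs: dict) -> bool:
    """Run the handler for name; return False if the operation stays opaque."""
    handler = HANDLERS.get(name)
    if handler is None:
        return False
    handler(regs)
    return True
```

Returning False for an unregistered name lets an analysis pass keep the instruction as an opaque call instead of guessing at its effect.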
Next Chapter: Chapter 3 - Lifting Assembly to an Intermediate Language
Prev Chapter: Chapter 1 - Utilizing Dynamic Binary Instrumentation for Lifting