Chapter 3 - Lifting Assembly to an Intermediate Language
Pipeline
Lifting is the process of translating machine instructions into IL instructions while preserving semantics.
A typical lifting process follows these steps:
- Disassemble the instruction
- Map architecture-specific registers to virtual registers
- Identify implicit side effects (flags, memory, registers)
- Emit one or more IL instructions representing the behavior
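The steps above can be sketched as a small lifter. This is only an illustration under stated assumptions: the instructions are already disassembled into a mnemonic and operand strings (step 1), virtual registers are handed out in order of first use, and the `ADD` and `UPDATE_FLAGS` opcodes are hypothetical names chosen for the example rather than part of the IL described in this series.

```python
from dataclasses import dataclass, field


@dataclass
class Lifter:
    """Toy lifter for pre-decoded instructions (step 1 is assumed done)."""
    vregs: dict = field(default_factory=dict)  # architectural reg -> virtual reg
    il: list = field(default_factory=list)     # emitted IL, one string each

    def vreg(self, name):
        # Step 2: hand out virtual registers on first use (r1, r2, ...).
        if name not in self.vregs:
            self.vregs[name] = f"r{len(self.vregs) + 1}"
        return self.vregs[name]

    def operand(self, op):
        # Immediates pass through unchanged; anything else is a register.
        try:
            int(op, 0)
            return op
        except ValueError:
            return self.vreg(op)

    def lift(self, mnemonic, *ops):
        args = [self.operand(o) for o in ops]
        if mnemonic == "mov":
            self.il.append(f"MOVE {args[0]}, {args[1]}")
        elif mnemonic == "add":
            # Step 3: x86 `add` implicitly updates the status flags.
            # Step 4: so one machine instruction becomes two IL instructions.
            self.il.append(f"ADD {args[0]}, {args[1]}")   # hypothetical opcode
            self.il.append(f"UPDATE_FLAGS {args[0]}")     # hypothetical opcode
        else:
            raise NotImplementedError(mnemonic)
        return self.il


lifter = Lifter()
lifter.lift("mov", "eax", "5")
lifter.lift("add", "eax", "ebx")
print("\n".join(lifter.il))
```

Note that the single `add` produces two IL instructions, which is why the implicit side effects have to be identified before anything is emitted.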
DSL
When lifting assembly to IL, it is often helpful to create a domain-specific language (DSL) to simplify the process.
Use operator overloading and helper classes to break complex instructions down into smaller transformations.
Consult the target ISA's developer manual to understand each instruction's behavior and determine what needs to be implemented.
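As a rough sketch of what such a DSL might look like, the classes below overload arithmetic operators so that ordinary expressions emit IL as a side effect. The `Builder` and `Value` classes and the three-operand `ADD`, `MUL`, and `XOR` forms are assumptions made for this example; only `LOADINT` is taken from the lifted output shown later in this chapter.

```python
import itertools


class Builder:
    """Collects emitted IL and hands out fresh virtual registers."""

    def __init__(self):
        self.il = []
        self._ids = itertools.count(1)

    def reg(self):
        return Value(self, f"r{next(self._ids)}")

    def const(self, n):
        # Materialize an integer constant into a fresh virtual register.
        v = self.reg()
        self.il.append(f"LOADINT {v.name}, {n}")
        return v


class Value:
    """A virtual register whose operators emit IL into the builder."""

    def __init__(self, builder, name):
        self.builder = builder
        self.name = name

    def _binary(self, opcode, other):
        if isinstance(other, int):
            other = self.builder.const(other)
        out = self.builder.reg()
        # Hypothetical three-operand form: OPCODE dst, lhs, rhs
        self.builder.il.append(f"{opcode} {out.name}, {self.name}, {other.name}")
        return out

    def __add__(self, other):
        return self._binary("ADD", other)

    def __mul__(self, other):
        return self._binary("MUL", other)

    def __xor__(self, other):
        return self._binary("XOR", other)


# Address computation for something like `lea rax, [rbx + rcx*4]`
b = Builder()
rbx, rcx = b.reg(), b.reg()
addr = rbx + rcx * 4
print("\n".join(b.il))
```

With this in place, a handler for a complex instruction reads like the pseudocode in the ISA manual, while the builder takes care of temporaries and emission order.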
Branching
Branching is tricky: if it is lifted incorrectly, it causes many unintended side effects.
Because an instruction in the target architecture is not guaranteed to map to exactly one IL instruction, we record the original address alongside the first IL instruction emitted for each machine instruction, so that branch targets given as addresses can be translated into positions in the IL.
For indirect calls, we can reference all of the logged indirect targets, as discussed in Chapter 1.
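One possible way to keep that address mapping is sketched below. It assumes each machine instruction has already been lifted into a list of IL tuples, that branch IL initially carries the original target address as its operand, and that IL positions are 0-based list indices; the function name and the tuple layout are illustrative choices, not the format used by the lifter in this series.

```python
BRANCH_OPCODES = {"JUMP", "JUMPIF_EQUAL"}


def resolve_branches(lifted):
    """Rewrite branch targets from original addresses to IL indices.

    `lifted` is a list of (address, [il_tuple, ...]) in address order.
    """
    addr_to_index = {}
    flat = []
    for address, il_instructions in lifted:
        # Remember where the IL for this machine instruction starts.
        addr_to_index[address] = len(flat)
        flat.extend(il_instructions)

    resolved = []
    for opcode, *operands in flat:
        if opcode in BRANCH_OPCODES:
            # A KeyError here means the target is not the start of a
            # lifted instruction (for example, a jump into the middle
            # of an instruction).
            operands = [addr_to_index[operands[0]]]
        resolved.append((opcode, *operands))
    return resolved, addr_to_index


lifted = [
    (0x0, [("CMP", "r1", 0)]),
    (0x3, [("JUMPIF_EQUAL", 0xA)]),
    (0x5, [("LOADINT", "r2", 1), ("MOVE", "r1", "r2")]),  # one insn, two IL
    (0xA, [("NOP",)]),
]
resolved, addr_to_index = resolve_branches(lifted)
for index, instruction in enumerate(resolved):
    print(index, *instruction)
```

Because the instruction at 0x5 expands to two IL instructions, the target 0xA lands at IL index 4 rather than 3, which is exactly the mismatch the map is there to absorb.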
Example Walkthrough
We will lift the following two instructions in three simple steps:
cmp eax, 0
je target
Step 1: Register mapping (eax -> r1)
Step 2: Emit comparison
CMP r1, 0
Step 3: Emit branch condition
JUMPIF_EQUAL target
CPU Specific
For example, a syscall can be lifted as:
SCALL syscall, r1, 3
SCALL stands for "special call". Here it invokes syscall with 3 arguments (r1, r2, r3). Special calls can be used to flag or model the architecture-specific effects associated with certain instructions.
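To make the operand layout concrete, here is a small sketch that builds an SCALL line and expands it back into its argument registers. It assumes, based on the description above, that the operands are the special call's name, the first argument register, and an argument count, with arguments held in consecutively numbered virtual registers; both helper functions are hypothetical.

```python
import re


def emit_scall(name, first_reg, count):
    """Format a special call: name, first argument register, argument count."""
    return f"SCALL {name}, {first_reg}, {count}"


def scall_arguments(il_line):
    """Expand an SCALL line into (name, [argument registers])."""
    match = re.fullmatch(r"SCALL (\w+), r(\d+), (\d+)", il_line)
    if match is None:
        raise ValueError(f"not an SCALL instruction: {il_line!r}")
    name = match.group(1)
    first, count = int(match.group(2)), int(match.group(3))
    # Assumption: arguments live in consecutive virtual registers.
    return name, [f"r{first + i}" for i in range(count)]


line = emit_scall("syscall", "r1", 3)
print(line)
print(scall_arguments(line))
```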
Example
Given the following x86 assembly:
0x0: cmp eax, 0
0x3: je 0x1e
0x5: cmp eax, 1
0x8: je 0x28
0xa: cmp eax, 2
0xd: je 0x32
0xf: cmp eax, 3
0x12: je 0x3d
0x14: mov rax, qword ptr [rbp - 8]
0x18: xor rax, qword ptr [rbp - 0x10]
0x1c: jmp 0x40
0x1e: mov rax, qword ptr [rbp - 8]
0x22: add rax, qword ptr [rbp - 0x10]
0x26: jmp 0x40
0x28: mov rax, qword ptr [rbp - 8]
0x2c: sub rax, qword ptr [rbp - 0x18]
0x30: jmp 0x40
0x32: mov rax, qword ptr [rbp - 0x10]
0x36: imul rax, qword ptr [rbp - 0x18]
0x3b: jmp 0x40
0x3d: xor rax, rax
0x40: mov rsp, rbp
0x43: pop rbp
0x44: mov qword ptr [rax], rax
0x47: ret
Lifted output (example snippet):
141 BITCAST r306, r306, 32, 0, true
142 FLAGSET 32, r306
143 NOP
144 BITCAST r248, r248, 64, 0, false
145 LOADINT r249, 0
146 MOVE r248, r249
147 LOADINT r248, 30
148 FLAGREAD r251, 8
149 SEPARATE 1
150 CMPN r251, 1
151 SEPARATE 1
152 SETIFEQUAL r250
153 SEPARATE 1
154 CMPS r250
155 SEPARATE 1
156 JUMPIFNOT 158
157 JUMP 583
....
Next Chapter: Chapter 4 - Designing an Architecture-Agnostic IR
Prev Chapter: Chapter 2 - Designing an ISA-like Intermediate Language (IL)