
Welcome to Hexo! This is your very first post. Check the documentation for more info. If you run into any problems when using Hexo, you can find the answer in troubleshooting, or you can ask me on GitHub.

Quick Start

Create a new post

$ hexo new "My New Post"

More info: Writing

Run server

$ hexo server

More info: Server

Generate static files

$ hexo generate

More info: Generating

Deploy to remote sites

$ hexo deploy

More info: Deployment


Introduction

Intel MPX is designed to allow a system to run both Intel MPX-enabled software and legacy software. An MPX-enabled application can link with, call into, or be called from legacy software. MPX associates bounds with pointers in a novel manner and uses the bounds to check that pointer-based accesses are suitably constrained. Furthermore, programmers can selectively use MPX to protect a subset of pointers.

The initial goals of MPX are twofold: 1) to provide a means to defend a system against attacks, and 2) to provide a means to pinpoint accidental logic defects in pointer usage.

Programming Environment

Intel MPX can be enabled for user mode and supervisor mode. It is designed to let software associate bounds with pointers and check memory references against those bounds to prevent out-of-bounds accesses. The bounds registers hold a lower bound and an upper bound; an out-of-bounds (OOB) access causes a #BR (Bound Range Exceeded) exception.

The BNDLDX and BNDSTX instructions each take an operand whose bits are used to traverse a data structure in memory. In 64-bit mode, these instructions operate only on the lower bits of the supplied 64-bit addresses. The number of bits used is 48 plus the MPX address-width adjust (MAWA), which depends on CPL: if CPL < 3 then MAWA = 0, otherwise MAWA = CPUID.(EAX=07H, ECX=0H):ECX.MAWAU[bits 21:17].

Bounds Registers

The Intel MPX architecture defines four bounds registers, BND0-BND3, each of which stores a pair of 64-bit values: a lower bound (LB) and an upper bound (UB). The bounds are inclusive. The upper bound is architecturally represented in 1's complement form; thus LB = 0, UB = 0 permits access to the entire address space.

Configuration and Status Registers

Intel MPX defines two configuration registers, for user mode (BNDCFGU) and supervisor mode (BNDCFGS), respectively. It also defines a status register (BNDSTATUS), used primarily to communicate status information for #BR exceptions.

Intel MPX Instruction Summary

Usage and Examples

BNDMK is typically used after memory is allocated for a buffer, but many other usages are possible, such as when accessing an array member of a structure.

// assume the array A is allocated on the stack at 'offset' from `RBP`
int A[100];

// the instruction to store starting address of array will be
LEA RAX, [RBP+offset]
// the instruction to create the bounds for array A will be
BNDMK BND0, [RAX+399]
// store RAX into BND0.LB and ~(RAX+399) into BND0.UB

BNDMOV is typically used to copy bounds from one bound register to another when a pointer is copied from one GP register to another, or to spill/fill bounds into memory corresponding to a spill/fill of a pointer.

BNDCL/BNDCU/BNDCN are typically used before writing to a buffer but can be used in other instances as well. The pointer used to write to memory is compared against the lower bound. For the upper-bound check, however, software must add (operand size - 1) to the pointer before the comparison, so that the last byte touched is checked.
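
The check semantics above (inclusive bounds, 1's complement upper bound, operand size - 1 added before the upper check) can be modeled with a small Python sketch. The function names are invented for illustration; this is not a real MPX API:

```python
# Illustrative model of MPX bounds semantics (not architectural code):
# a bounds register stores LB directly and UB in 1's complement form.
MASK64 = (1 << 64) - 1

def bndmk(base, size_in_bytes):
    """Model of BNDMK: create inclusive bounds [base, base + size - 1]."""
    lb = base & MASK64
    ub_stored = (~(base + size_in_bytes - 1)) & MASK64  # 1's complement
    return lb, ub_stored

def check_access(bnd, ptr, operand_size):
    """Model of BNDCL + BNDCU: lower check on ptr, upper check on
    ptr + operand_size - 1 (the last byte the access touches)."""
    lb, ub_stored = bnd
    ub = (~ub_stored) & MASK64      # recover the architectural upper bound
    last_byte = ptr + operand_size - 1
    return lb <= ptr and last_byte <= ub

bnd = bndmk(0x1000, 400)                        # e.g. int A[100]
assert check_access(bnd, 0x1000, 4)             # first element
assert check_access(bnd, 0x1000 + 396, 4)       # last element
assert not check_access(bnd, 0x1000 + 397, 4)   # straddles the end
assert check_access((0, 0), 0xDEADBEEF, 8)      # LB = UB = 0 allows all
```

Note how the LB = 0, UB = 0 encoding mentioned earlier falls out naturally: the stored 1's complement of 0 decodes to the all-ones upper bound.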

BNDSTX is used to store the bounds associated with a buffer, together with the "pointer value" of the pointer to that buffer, into a bound table entry found via address translation through a two-level structure.

// assume a buffer with bounds stored in `BND0` and the pointer to the buffer in ESI;
// the following sequence stores the "pointer value" and the bounds into a configured bound table entry.

// store the pointer value in the index register ECX
MOV ECX, DWORD ptr [ESI]

// store the pointer in the base register EAX
MOV EAX, ESI

// perform address translation from the linear address of the base EAX and store bounds and pointer value ECX onto a bound table entry
BNDSTX DWORD ptr [EAX+ECX], BND0

// Similarly, to retrieve a buffer's pointer value and its associated bounds from a bound table entry
MOV EAX, DWORD ptr [EBX]

// perform address translation from the linear address of the base EBX, and loads bounds and pointer value from a bound table entry
BNDLDX BND0, DWORD ptr [EBX+EAX]

Loading and Storing Bounds in Memory

Intel MPX defines two instructions to load and store the linear address of a pointer to a buffer, along with the bounds of the buffer, in a data structure of extended bounds. When storing these extended bounds, the processor parses the address of the pointer (i.e., where the pointer itself is stored) to locate an entry in a bound table in which to store them.

An extended bound is a 4-tuple consisting of a lower bound, an upper bound, a pointer value, and a reserved field. In 64-bit paging mode, a bound table entry is 4*64 bits (32 bytes). The linear address of a bound table is stored in a bound directory entry. The linear address of the bound directory is derived from either BNDCFGU or BNDCFGS.

The bound directory and bound tables reside in application memory and are allocated by the application. In 64-bit mode, the size of a bound table is 4 MBytes (2^17 entries * 32 bytes), and the size of the bound directory is 2^(1+MAWA) GBytes (2^(28+MAWA) entries * 8 bytes).

Bounds in memory are associated with the memory address where the pointer is stored.

BNDLDX and BNDSTX in 64-Bit Mode

The linear address of the bound directory is derived from BNDCFGU/BNDCFGS. In 64-bit mode, each bound-directory entry (BDE) is 8 bytes, and the number of entries in the bound directory is 2^(28+MAWA).

In 64-bit mode, the processor uses this two-level structure to access extended bounds as follows:

  • In the first stage, the corresponding BD entry is loaded. For that, the CPU: (1) extracts the BD entry offset from bits 47+MAWA:20 of the pointer address and shifts it left by 3 bits, (2) loads the base address of the BD from the BNDCFGx register, and (3) sums the base and the offset and loads the BD entry at the resulting address.

  • In the second stage, the CPU: (4) extracts the BT entry offset from bits 19:3 of the pointer address and shifts it left by 5 bits, (5) shifts the loaded BD entry by 3 bits to remove the metadata held in its low 3 bits, yielding the BT base, (6) sums that base and the offset, and (7) finally loads the BT entry from the resulting address.
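
The two-stage walk can be sketched in Python. This is an illustrative model with MAWA = 0; the helper names are invented here, and the bit positions follow the description above:

```python
# Hypothetical sketch of the 64-bit BNDLDX/BNDSTX two-level address
# translation (MAWA = 0); not architectural reference code.
def bde_address(bd_base, ptr):
    # Stage 1: BD entry offset = bits 47:20 of the pointer, << 3 (8-byte BDEs)
    offset = ((ptr >> 20) & ((1 << 28) - 1)) << 3
    return bd_base + offset

def bt_base(bde_value):
    # The low 3 bits of a BD entry hold metadata; strip them to get the BT base
    return bde_value & ~0x7

def bte_address(bt_base_addr, ptr):
    # Stage 2: BT entry offset = bits 19:3 of the pointer, << 5 (32-byte BTEs)
    offset = ((ptr >> 3) & ((1 << 17) - 1)) << 5
    return bt_base_addr + offset

assert bde_address(0x10000, 1 << 20) == 0x10000 + 8   # second BDE
assert bt_base(0b1011) == 0b1000                      # metadata stripped
assert bte_address(0x20000, 1 << 3) == 0x20000 + 32   # second BTE
```

The 2^28 and 2^17 masks correspond directly to the directory and table entry counts given above.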

How to Read a Paper

The three-pass approach

The first pass

  1. Quickly read the title, abstract, and introduction

  2. Read the section and subsection headings

  3. Read the conclusion

  4. Glance over the references, ignoring papers you have already read

Then answer the following questions:

  1. Category: What type of paper is this? A measurement? An analysis of an existing system? A description of a prototype?

  2. Context: Which other papers is it related to? Which theoretical bases were used to analyze the problem?

  3. Correctness: Do the assumptions appear to be valid?

  4. Contribution: What are the paper's main contributions?

  5. Clarity: Is the paper well written?

Answering these questions tells you whether the paper is worth reading further. They are also the questions to keep in mind when writing a paper yourself.

The second pass

  1. Look carefully at the figures, diagrams, and other illustrations. Are the axes properly labeled? Are the results shown with error bars (confidence intervals)?

  2. Mark the references you have not yet read; this is an effective way to learn the background of the paper.

This pass takes about an hour. After it, you should be able to grasp the main points of the paper and summarize its structure and ideas to someone else. If the paper is outside your own research area, this level of understanding is usually enough.

Sometimes you still will not understand a paper at the end of this pass. Perhaps the subject matter is new to you and some terms and acronyms are unfamiliar, or the authors use a proof or experimental technique you do not understand, or the paper is simply badly written. You now have three options: a) set the paper aside, b) return to it after reading some background material, or c) persevere and go on to the third pass.

The third pass

To fully understand a paper, a third pass is needed. The key to this pass is to attempt to virtually re-implement the paper: make the same assumptions as the authors and re-create the work. By comparing this re-creation with the paper's actual implementation, you can identify not only the paper's innovations but also its hidden failings and assumptions.

This pass requires great attention to detail: challenge every assumption in every statement. You should also sketch your own implementation of each idea and compare it with the one in the paper. Doing so lets you internalize the paper's proofs and techniques and add them to your own toolbox, and it can also suggest ideas for future work.

At first this pass may take four or five hours; with experience it drops to about an hour. By the end, you should be able to reconstruct the paper's entire structure from memory and identify its strengths and weaknesses, in particular its implicit assumptions, missing citations, and any problems with its experiments.

Doing a literature survey

Doing a literature survey tests your paper-reading skill, since it requires reading tens of papers. Which papers should you read? The three-pass approach helps here as well.

First, use an academic search engine with well-chosen keywords to find three to five recent papers in the area. Do one pass on each to get a rough sense of the work, then read their related-work sections. If one of them mentions an existing survey, just read that survey directly.

Second, look through their references for shared citations and repeated author names; these cited papers and authors are the key works and researchers in the area. Download the key papers and visit the authors' websites to see where they have published. This also tells you which venues are the top ones in the field.

Third, go to the websites of those top venues and look through their recent proceedings for high-quality related work. Download these papers along with the key works identified earlier, read them with two passes, and build your survey from them. If they all cite a paper you did not find earlier, add it to the survey as well.

Introduction

Pointer analysis promises significant benefits for optimizing and parallelizing compilers. However, the analysis must be efficient without sacrificing the accuracy of its results. Moreover, the pointer analysis must handle real C programs.

Our approach is based on the insight that procedures are typically called with the same aliases among their inputs. Our idea is to generate incomplete transfer functions that cover only the input conditions that exist in the program. These incomplete transfer functions are made up of simple partial transfer functions (PTFs), each applicable only to calling contexts that exhibit certain alias relationships.

Major Concepts

Partial Transfer Functions

void f(int **p, int **q, int **r) {
    *p = *q;
    *q = *r;
}

int x, y, z;
int *x0, *y0, *z0;

int main() {
    x0 = &x;
    y0 = &y;
    z0 = &z;

    if (test1)
        f(&x0, &y0, &z0);
    else if (test2)
        f(&z0, &x0, &y0);
    else
        f(&x0, &y0, &x0);
}

The example function f illustrates two points. First, the aliases among the inputs to a procedure determine its behavior. Second, even for this simple two-statement procedure, a complete summary covering all possible input aliases is fairly complex.

The main idea behind our algorithm is that we need not summarize a procedure for all possible aliases among its inputs. Instead, we only need to find summaries that apply to the specific alias patterns that occur in the program.

Design of PTFs

There is a tradeoff between the complexity of the individual PTFs and their applicability. Whenever the analysis encounters another call to the procedure with the same input values, it can reuse the final values recorded in the PTF; this is known as memoization.

Since it is not the specific input values but their alias patterns that determine the behavior of a procedure, we use symbolic names called extended parameters to abstract the input values. An extended parameter represents the locations reached through an input pointer at the beginning of a procedure.

Extended parameters play three important roles:

  1. Extended parameters make up part of the name space of a procedure.

  2. The initial points-to function for a PTF specifies the input domain.

  3. The final points-to function at the procedure exit summarizes the pointer assignments.

Besides the aliases, the input domain of a PTF is also defined by the values of any function pointers passed in and used in calls within the procedure, because those values determine what code could be executed.

Algorithm Outline

When we begin analyzing a procedure, the initial points-to function is empty and the parameter mapping only records the actual values for the formal parameters. When we need to know the initial value of an input pointer and it is not already recorded, we add an entry to the initial points-to function. To check for aliases, we need to look up the initial values in the calling context. The process continues recursively up the call graph until the pointer values are known.

When the iterative analysis encounters a procedure call, it needs to find the effects of the call on the points-to function. If the call is through a pointer passed as an input to one or more of the PTFs on the call stack, we add the potential values of that pointer to the specifications of the input domains for those PTFs. For each potential callee procedure, we first check whether any of its PTFs apply; this involves building up a parameter mapping and comparing the input aliases to those recorded for the PTF.

Pointer Representations

Instead of using types to identify locations that may contain pointers, we assume that any memory location could potentially hold a pointer. We also refer to locations within a block of memory by their positions, not by their field names. We define a new abstraction, the location set, to specify both a block of memory and a set of positions within that block.

We divide memory into blocks of contiguous storage, whose positions relative to one another are undefined. A block of memory may be one of three kinds: a local variable, a locally allocated heap block, or an extended parameter (global variables).

We distinguish heap blocks based on their allocation context, grouping together all the blocks allocated in each context. The minimal context information is simply the statement that creates the block.

Location Sets

Our goal with regard to aggregates is to distinguish between different fields within a structure but not between different elements of an array. Our pointer analysis can be combined with data dependence tests to distinguish between different array elements.

A location set is a triple (b, f, s) where the base b is the name of a block of memory, f is an offset within that block, and s is the stride. That is, it represents the set of locations {f + is | i ∈ ℤ} within block b.

Expression Location Set
scalar (scalar, 0, 0)
struct.F (struct, f, 0)
array (array, 0, 0)
array[i] (array, 0, s)
array[i].F (array, f, s)
struct.F[i] (struct, f % s, s)
*(&p + X) (p, 0, 1)
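
As a toy illustration of this definition, overlap between two location sets in the same block reduces to a congruence check on their offsets and strides. This is a simplified model (strides assumed non-negative, i and j ranging over all integers as in the definition), not the paper's implementation:

```python
from math import gcd

def may_overlap(ls1, ls2):
    """Two location sets (b, f, s) may overlap iff they name the same block
    and their offsets agree modulo the gcd of the strides: the sets
    {f1 + i*s1} and {f2 + j*s2} intersect iff (f2 - f1) % gcd(s1, s2) == 0."""
    (b1, f1, s1), (b2, f2, s2) = ls1, ls2
    if b1 != b2:
        return False
    g = gcd(s1, s2)
    return f1 == f2 if g == 0 else (f2 - f1) % g == 0

# array[i].F with field offset 4 and stride 8 vs. the whole array (stride 1)
assert may_overlap(("array", 4, 8), ("array", 0, 1))
# two distinct scalar fields of the same struct never collide
assert not may_overlap(("struct", 0, 0), ("struct", 4, 0))
# same stride, offsets congruent modulo the stride
assert may_overlap(("array", 0, 4), ("array", 8, 4))
```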

Extended Parameters

Every object is represented by at most one extended parameter. When adding an entry to the initial points-to function, we first find the values of the pointer in the calling context. If the parameter mapping shows that an existing parameter already includes the same values, we reuse the old parameter instead of creating a new one.

For simplicity and efficiency, each initial points-to entry points to a single extended parameter. In cases where the initial values are aliased with more than one parameter, we create a new extended parameter that subsumes all the aliased parameters, and we replace all references to the subsumed parameters. Likewise, when the initial values are aliased with an existing parameter but also include new values, we subsume the aliased parameter with the new one.

Aliases involving fields of structures can be a bit more complicated. When updating the initial points-to function, if we find a pointer to a field of an existing parameter, we can easily record that fact. However, if we encounter a pointer to a field before a pointer to the enclosing structure, we will create an extended parameter to represent the field. We simply allow the location set offsets to be negative.

Points-to Functions

A points-to function is a mapping from location sets to sets of location sets. It is important for the analysis to know which locations may contain pointers. Since we do not use high-level types, and since the points-to function contains entries only for pointers that have been referenced, we record separately, for each block of memory, the location sets that may hold pointers.

Within the pointer analysis itself, that information can enable strong updates, where assignments overwrite the previous contents of their destinations. We only need that information at the point where a pointer is dereferenced.

Analyzing a Procedure

int EvalProc(proc *pr, PTF *ptf) {
    /* iteratively analyze a procedure */
    do {
        changed = false;
        foreach cfgnode nd in pr.flowgraph {
            if (no predecessors of nd evaluated)
                continue;
            if (nd is meet) EvalMeet(nd, ptf);
            if (nd is assign) EvalAssign(nd, ptf);
            if (nd is call) EvalCall(nd, ptf);
        }
    } while (changed);
}

Strong Updates

Unless we know that an assignment will definitely occur and that it assigns to a unique location, we must conservatively assume that location potentially retains its old values.

Sparse Representation

Since we analyze heap data as well as global and stack variables, many possible memory locations could be included in the points-to functions. Fortunately, the information stored in the points-to function is very sparse.

Because of the sparse points-to function representation, looking up the values of a pointer requires searching back for the most recent assignment to that location. Beginning at the current node, we search back through the dominating flow graph nodes.
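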

Evaluating Dereferences

Because location sets may overlap, more than one location set may refer to the same location. Values assigned to one location set must be observed by references to overlapping location sets. Thus, when a pointer is dereferenced, we iterate through all of the overlapping location sets and look up, in the current points-to function, the values that have been assigned to each one. However, if the location being referenced is a unique location, values assigned to overlapping location sets may have been overwritten by a strong update.

Evaluating Assignments

An assignment node in a flow graph specifies both the source and destination expressions, as well as the size of the value to be assigned. When building the flow graph from the intermediate representation of a program, we automatically convert the assignments to a "points-to" form. That is, since a variable reference on the right-hand side of an assignment refers to the contents of that variable, we add an extra dereference to each expression on the right-hand side.

The process of evaluating an assignment begins by finding the locations identified by the source and destination expressions.

Interprocedural Algorithm

To evaluate the effects of a call on the points-to function, we need to find a PTF that applies in the calling context.

Calls Through Pointers

The first step in evaluating procedure calls is to determine the target procedures. Function pointer values may be expressed in terms of extended parameters. This is one situation where we need to know the specific values represented by extended parameters.

Testing if a PTF Applies

The input domain of a PTF is specified by its initial points-to function and by the value of the parameters used as call targets. To test if a PTF applies, we check if these things are the same in the current context as they were when the PTF was created.

Applying a PTF

Recursive Calls

Overview

Intel PT is a processor architecture extension that uses hardware to record a trace of program execution at very low overhead. Control-flow information is collected in data packets and later decoded by a software decoder. The packets record, among other things, timing, program-flow information (e.g., branch targets and whether branches were taken), and program-induced mode-related information (e.g., Intel TSX state transitions and CR3 changes). Later versions of Intel PT add further trace sources, including software trace instrumentation via PTWRITE, and Power Event tracing.

Features and Capabilities

From the packets generated by PT together with the program binary, the program's execution trace can be reconstructed. The packets record information such as instruction pointers (IPs), indirect branch targets, and the directions of conditional branches within contiguous code regions (basic blocks).

PT can record software-generated packets via PTWRITE, as well as certain processor power events. In addition, Precise Event-Based Sampling (PEBS) can be configured to log its records into the PT trace.

PT offers several control and filtering mechanisms for customizing what gets traced, for example filtering by current privilege level or by CR3 value. These filters are configured by programming MSRs.

Packet Summary

Packets carrying basic execution information include:

  • Packet Stream Boundary (PSB) packets: heartbeat packets generated at a fixed interval (e.g., every 4K trace packet bytes). They serve as packet boundaries the decoder can search for; the first packet a decoder interprets must be a PSB.

  • Paging Information Packet (PIP): records modifications to the CR3 register. By combining this with the per-process CR3 values known to the operating system, a debugger can attribute linear addresses to the owning process (and its source code).

  • Time-Stamp Counter (TSC) packets: record wall-clock time.

  • Core Bus Ratio (CBR) packets: contain the core:bus clock ratio.

  • Overflow (OVF) packets: emitted when the processor's internal buffer overflows and packets may have been dropped; the decoder uses them to resynchronize and decode correctly.

Packets carrying control-flow information include:

  • Taken Not-Taken (TNT) packets: record the directions of direct conditional branches.

  • Target IP (TIP) packets: record the target IP of indirect branches, exceptions, interrupts, and other branches or events. The stored IPs are compressed, and there are several types of TIP packets.

  • Flow Update Packets (FUP): provide the source IP of asynchronous events (interrupts and exceptions), as well as other source IPs that cannot be recovered from the binary alone.

  • MODE packets: provide the decoder with important processor execution state, enabling it to interpret the disassembled binary and the trace log.

Software-generated packets:

  • PTWRITE (PTW) packets: contain the value of the operand passed to the PTWRITE instruction.

Intel Processor Trace Operational Model

Change of Flow Instruction (COFI) Tracing

Instructions that transfer control between basic blocks are called COFI (Change of Flow Instructions) and fall into three classes: 1. direct transfer COFI, 2. indirect transfer COFI, 3. far transfer COFI.

By type:

COFI Type Instructions
Conditional Branch Jcc, J*CXZ, LOOP
Unconditional Direct Branch JMP (E9 xx, EB xx), CALL (E8 xx)
Indirect Branch JMP (FF /4), CALL (FF /2)
Near Ret RET (C3, C2 xx)
Far Transfers INT1, INT3, INT n, INTO, IRET, IRETD, IRETQ, JMP (EA xx, FF /5), CALL (9A xx, FF /3), RET (CB, CA xx), SYSCALL, SYSRET, SYSENTER, SYSEXIT, VMLAUNCH, VMRESUME

Direct Transfer COFI

Direct transfer COFI are relative branches, meaning their target address is the current IP plus an offset. Because disassembling the binary already yields the target IP of a direct jump, there is no need to record the IP in the trace. For conditional branches, only whether the branch was taken needs to be recorded; unconditional direct branches need no record at all.

  • Conditional branches (Jcc, J*CXZ) and LOOP: a single TNT bit is enough to indicate the program's control flow, and the processor compresses multiple TNT bits into a single packet.

  • Unconditional direct jumps: generate no trace output.
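
The TNT compression idea can be sketched as a small buffer that accumulates branch outcomes and emits them as one packet once full. This is a toy model that ignores the actual packet encoding; a capacity of 6 is used since the short TNT form holds up to 6 bits:

```python
# Toy model of TNT bit packing (illustrative only, not the real encoding).
class TntBuffer:
    def __init__(self, capacity=6):
        self.capacity = capacity
        self.bits = []        # pending branch outcomes, oldest first
        self.packets = []     # emitted TNT packets

    def record(self, taken):
        """Record one conditional-branch outcome; emit a packet when full."""
        self.bits.append(1 if taken else 0)
        if len(self.bits) == self.capacity:
            self.packets.append(tuple(self.bits))
            self.bits = []

buf = TntBuffer()
for taken in [True, False, True, True, False, True, False]:
    buf.record(taken)
assert buf.packets == [(1, 0, 1, 1, 0, 1)]   # first six bits packed together
assert buf.bits == [0]                       # seventh outcome still pending
```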

Indirect Transfer COFI

Indirect transfer instructions update the IP from a register or a memory location, so PT must record the target IP for the debugger to determine the COFI's destination. The recorded IP may be a linear address or an effective address.

An indirect transfer generates a Target IP Packet (TIP) containing the target address.

  • Near indirect JMP and near indirect CALL: generate a TIP.

  • Near RET: when a CALL executes, it pushes the address of the instruction following it onto the stack. On return, RET pops this return address and jumps to it. Because the callee may have modified the return address, the actual target cannot be determined from the disassembly alone, so a TIP packet is still generated for RET.

    • RET compression: if the processor can guarantee that RET's target matches the address pushed by the corresponding CALL, it may emit only a single TNT bit indicating the branch was taken. This requires extra bookkeeping in the decoder.
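
The decoder-side bookkeeping for RET compression can be sketched as a shadow stack. This is a hypothetical model of a decoder, not actual reference decoder code:

```python
# Illustrative decoder model for RET compression: the decoder mirrors the
# program's call stack; a compressed RET arrives as a single "taken" TNT bit
# and the decoder supplies the target from its own stack.
class RetDecoder:
    def __init__(self):
        self.shadow = []            # return addresses the decoder has seen pushed

    def on_call(self, next_ip):
        """A CALL was decoded: remember the address of the next instruction."""
        self.shadow.append(next_ip)

    def on_compressed_ret(self):
        """TNT bit = taken: the target is the matching pushed return address."""
        return self.shadow.pop()

    def on_tip_ret(self, target_ip):
        """Uncompressed RET: the target comes from the TIP packet itself."""
        if self.shadow:
            self.shadow.pop()
        return target_ip

d = RetDecoder()
d.on_call(0x401005)
assert d.on_compressed_ret() == 0x401005
```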

Far Transfer COFI

All non-near jumps that change the instruction pointer are "far transfers". They include exceptions, interrupts, traps, TSX aborts, and the instructions that perform far transfers.

All of these generate a TIP packet. Some far transfers cannot be inferred from the binary, such as asynchronous events (exceptions and interrupts); for these, a FUP packet carrying the event's source IP is generated first, followed by a TIP.

Software Trace Instrumentation with PTWRITE

PTWRITE lets software inject data directly into the PT trace. Each execution writes a 64-bit value into a generated PTW packet. The decoder and analysis software can determine the meaning of the data from the IP of the PTWRITE instruction together with the data itself. PTWRITE is enabled via IA32_RTIT_CTL.PTWEn[12]; in addition, setting IA32_RTIT_CTL.FUPonPTW[5] appends a FUP packet carrying the PTWRITE instruction's IP after each PTW packet.

Power Event Tracing

Trace Filtering

Filtering by Current Privilege Level (CPL)

Intel PT can be conditioned on CPL = 0 or CPL > 0, so that the logical processor generates packets only for the selected privilege level. CPL filtering guarantees that only IPs (and other architectural state) matching the condition are traced. For example, if tracing is restricted to CPL > 0 and software executes SYSCALL (entering CPL = 0), the SYSCALL's target IP does not appear in any generated packet.

In real-address mode and virtual-8086 mode, CPL is 0 and 3, respectively.

Filtering by CR3

Intel PT supports a CR3 filter that generates trace output based on the current CR3 value. For workloads with multiple address spaces, the output of different processes can be separated by swapping the RTIT MSR state on context switch.

To trace a particular CR3, write the target value into the IA32_RTIT_CR3_MATCH MSR and set IA32_RTIT_CTL.CR3Filter. If the processor is also configured to trace at CPL = 0, the generated trace contains PIP packets.

Filtering by IP

If CPUID.(EAX=14H, ECX=0):EBX[bit 2] = 1, Intel PT can generate trace packets only while the executing IP lies within certain configured ranges. IP filtering is enabled by programming the ADDRn_CFG fields of the IA32_RTIT_CTL MSR, where n is a zero-based index selecting which address range is being configured. Each ADDRn_CFG assigns a role to a register pair [base, limit] = [IA32_RTIT_ADDRn_A, IA32_RTIT_ADDRn_B]. The number of supported address ranges can be queried via CPUID.

The base and limit are inclusive: both boundaries belong to the range.

TraceStop: an address range can instead be configured so that those addresses are not traced; where such a range overlaps a trace range, the overlapping part is not traced.
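
The inclusive range semantics can be illustrated with a small predicate over configured (base, limit) pairs. This is a sketch of the filtering condition only, with made-up addresses, not a model of the hardware:

```python
def in_trace_range(ip, ranges):
    """ranges: list of inclusive (base, limit) pairs, as configured via
    IA32_RTIT_ADDRn_A/B. True while the IP falls inside any trace range."""
    return any(base <= ip <= limit for base, limit in ranges)

ranges = [(0x1000, 0x1FFF)]
assert in_trace_range(0x1000, ranges)       # base is inclusive
assert in_trace_range(0x1FFF, ranges)       # limit is inclusive
assert not in_trace_range(0x2000, ranges)   # one past the limit is outside
```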

Packet Generation Enable Controls

Intel PT includes several controls that determine whether packets are generated. In general, most packets are generated only while Packet Enable (PacketEn) is set. PacketEn is internal hardware state; software can configure it only indirectly, by modifying MSRs.

Packet Enable (PacketEn)

When PacketEn is set, the processor generates PT trace packets. PacketEn is the combination of several states: PacketEn <- TriggerEn AND ContextEn AND FilterEn AND BranchEn. It is the ultimate decider of whether the processor generates trace packets: while it is set, all control-flow packets are enabled; while it is clear, only certain other packets (timing and bookkeeping packets) are generated. On processors that do not support IP filtering (i.e., CPUID.(EAX=14H, ECX=0):EBX.IPFILT_WRSTPRSV[bit 2] = 0), FilterEn is always set.
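
The combination is a straightforward AND of the four enables, sketched here for clarity, including the special case that FilterEn is always set on parts without IP filtering support:

```python
def packet_en(trigger_en, context_en, filter_en, branch_en,
              ip_filtering_supported=True):
    """Model of PacketEn <- TriggerEn AND ContextEn AND FilterEn AND BranchEn."""
    if not ip_filtering_supported:
        filter_en = True    # FilterEn is always set on such processors
    return trigger_en and context_en and filter_en and branch_en

assert packet_en(True, True, True, True)
assert not packet_en(True, True, False, True)
# without IP filtering support, FilterEn cannot suppress packet generation
assert packet_en(True, True, False, True, ip_filtering_supported=False)
```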

Trigger Enable (TriggerEn)

Trigger Enable (TriggerEn) is the primary indicator that packet generation is enabled. It is set via IA32_RTIT_CTL.TraceEn and cleared by any of the following: (i) software clears IA32_RTIT_CTL.TraceEn, (ii) a TraceStop condition is met and IA32_RTIT_STATUS.Stopped is set, or (iii) IA32_RTIT_STATUS.Error is set. Software can query the current value via IA32_RTIT_STATUS.TriggerEn.

Context Enable (ContextEn)

Context Enable (ContextEn) indicates whether the processor's state and mode match what software configured for tracing. For example, if execution at CPL = 0 is not being traced (IA32_RTIT_CTL.OS = 0), then ContextEn is 0 whenever CPL = 0. Software can query the current value via IA32_RTIT_STATUS.ContextEn:

ContextEn = !( (IA32_RTIT_CTL.OS = 0 AND CPL = 0)
            OR (IA32_RTIT_CTL.USER = 0 AND CPL > 0)
            OR (IS_IN_A_PRODUCTION_ENCLAVE)
            OR (IA32_RTIT_CTL.CR3Filter = 1 AND IA32_RTIT_CR3_MATCH does not match CR3) )

If clearing ContextEn causes PacketEn to be cleared, a Packet Generation Disable (TIP.PGD) packet is generated, carrying no IP payload. If setting ContextEn causes PacketEn to be set, a Packet Generation Enable (TIP.PGE) packet is generated.

While ContextEn is clear, the control-flow packets (TNT, FUP, TIP.*, MODE.*) are not generated and no IPs are exposed; other packets may still be generated.

ContextEn can change only while TriggerEn = 1; while TriggerEn = 0, ContextEn does not change.

Branch Enable (BranchEn)

BranchEn mirrors the value of IA32_RTIT_CTL.BranchEn. If BranchEn is not set, the COFI-related packets (TNT, TIP.*, FUP, MODE.*) are not generated.

Filter Enable (FilterEn)

FilterEn indicates that the current IP lies within the configured IP ranges. Software can query its state via IA32_RTIT_STATUS.FilterEn.

When FilterEn is cleared, a Packet Generation Disable (TIP.PGD) packet is generated; unlike the ContextEn case, it still carries the target IP, so IPs outside the configured ranges can appear in the trace log. While FilterEn is clear, control-flow packets are not generated.

If no IP trace range is configured, or the processor does not support IP filtering, FilterEn is always set. FilterEn only toggles while TraceEn and ContextEn are both 1 and a trace range has been configured.

Trace Output

Intel PT's output mechanism is independent of the trace content and filtering mechanisms. Different processors and platforms support different output options. Using the ToPA and FabricEn fields of IA32_RTIT_CTL, the output can be configured as one of the following:

  • A single, contiguous region of physical address space.

  • A collection of variable-sized regions of physical memory, linked and indexed by a table: the Table of Physical Addresses (ToPA). Trace output is written directly to physical memory, bypassing the caches and TLBs. This output scheme has minimal performance impact, but the writes are not serialized.

  • A platform-specific trace transport subsystem.

Whichever output option is chosen, the output bypasses the caches. This guarantees the trace does not consume precious cache space, but unlike ordinary writes to uncacheable memory, trace output writes are not serialized. Do not map the output physical region as UnCacheable via MTRRs or page-table attributes.

Software cannot control exactly when trace output reaches physical memory. The only guarantee is that clearing TraceEn to stop tracing, followed by a store, fence, or serializing instruction, flushes all trace packets from the processor's internal buffers to physical memory.

Single Range Out

When the IA32_RTIT_CTL.ToPA and IA32_RTIT_CTL.FabricEn bits are clear, trace packets are sent to a single contiguous region of physical memory (or MMIO) determined by IA32_RTIT_OUTPUT_BASE and the mask value in IA32_RTIT_OUTPUT_MASK_PTRS. The current write pointer is also held in IA32_RTIT_OUTPUT_MASK_PTRS; the range is circular, so upon reaching the end of the buffer the write pointer wraps back to the buffer base.

This method is best suited to cases where:

  • a sufficiently large block of contiguous DRAM is available, or

  • output is directed to an MMIO debug port.
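
The single-range write-pointer arithmetic (a power-of-two region sized by the mask, with a wrapping offset) can be illustrated as follows; the addresses are made up:

```python
def next_write_address(base, mask, offset):
    """base: region base address; mask: region size - 1 (the low 7 bits are
    forced to 1, so the region is at least 128 bytes); offset: current write
    offset, wrapped into the region by the mask."""
    return base + (offset & mask)

base, mask = 0x10000, 0xFFFF                               # a 64 KB region
assert next_write_address(base, mask, 0) == 0x10000
assert next_write_address(base, mask, 0x1234) == 0x11234
assert next_write_address(base, mask, 0x10000) == 0x10000  # wraps to base
```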

Table of Physical Addresses

When IA32_RTIT_CTL.ToPA is set and IA32_RTIT_CTL.FabricEn is clear, the ToPA mechanism is used. ToPA is a circular linked list of tables. Each table entry contains the base pointer and size of a region of physical memory (an output region); the final entry may instead hold the physical address of the next table.

The processor treats these physical regions as a single unified buffer, so a packet may span two output regions.

The ToPA mechanism is driven by three values:

  • proc_trace_table_base: the physical address of the current table, taken from the IA32_RTIT_OUTPUT_BASE MSR.

  • proc_trace_table_offset: the entry of the current table currently in use, taken from bits 31:7 of IA32_RTIT_OUTPUT_MASK_PTRS.

  • proc_trace_output_offset: the pointer within the current output region, taken from bits 63:32 of IA32_RTIT_OUTPUT_MASK_PTRS.

As trace packets are written to physical memory, reaching an entry marked END causes table_base to be updated, switching to the next table. Reaching an entry marked STOP causes tracing to stop once the current output region is full. If there is no END or STOP marker, then on reaching the maximum table size (proc_trace_table_offset = 0x0FFFFFF8) both table_offset and output_offset are reset to 0 and output wraps around.
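
A hypothetical decoding of the three control values from the two MSRs, following the bit positions given above (the function and field names here are descriptive, not architectural):

```python
def topa_pointers(output_base_msr, mask_ptrs_msr):
    """Extract the three ToPA control values from IA32_RTIT_OUTPUT_BASE and
    IA32_RTIT_OUTPUT_MASK_PTRS (table offset in bits 31:7, output offset in
    bits 63:32), returning (table_base, current_entry_addr, output_offset)."""
    table_base = output_base_msr
    table_offset = (mask_ptrs_msr >> 7) & ((1 << 25) - 1)
    output_offset = (mask_ptrs_msr >> 32) & ((1 << 32) - 1)
    entry_addr = table_base + (table_offset << 3)   # each ToPA entry is 8 bytes
    return table_base, entry_addr, output_offset

base, entry, out = topa_pointers(0x100000, (5 << 7) | (0x40 << 32))
assert base == 0x100000
assert entry == 0x100000 + (5 << 3)   # sixth entry of the table
assert out == 0x40
```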

Importantly, the processor's updates to the IA32_RTIT_OUTPUT_BASE and IA32_RTIT_OUTPUT_MASK_PTRS MSRs are asynchronous with instruction execution, so a direct read of these MSRs may return stale values. To obtain correct values, first stop trace packet generation (clear IA32_RTIT_CTL.TraceEn), then read them.

Likewise, the processor may cache some table entries internally; to modify those entries, stop tracing first as well.

First-generation processors implementing PT support only ToPA configurations with a single output region.

The entry flags are used as follows:

  • ToPA STOP: once the current output region is full, the IA32_RTIT_STATUS.Stopped bit is set and tracing stops. No TIP.PGD packet is emitted for this kind of stop, which means packets still in the CPU's internal buffers are dropped.

  • ToPA PMI: if the INT flag is set, filling the current output region raises a performance-monitoring interrupt. The interrupt is asynchronous and not precise, and the interrupt handler itself can be traced (filtering can be used to avoid this pollution). A PMI handler that wants to read trace packets must therefore first clear TraceEn to stop tracing.

An algorithm for managing Intel PT ToPA PMIs together with XSAVES/XRSTORS:

IF (IA32_PERF_GLOBAL_STATUS.ToPA)
    Save IA32_RTIT_CTL value;
    IF (IA32_RTIT_CTL.TraceEn)
        Disable Intel PT by clearing TraceEn;
    FI;
    IF (there is space available to grow the current ToPA table)
        Add one or more ToPA entries after the last entry in the ToPA table;
        Point the new ToPA entry address fields to new output region bases;
    ELSE
        Modify an upcoming ToPA entry in the current table to have END = 1;
        IF (output should transition to a new ToPA table)
            Point the address of the "END = 1" entry of the current table to the new table base;
        ELSE
            /* make the table circular */
            Point the address of the "END = 1" entry to the base of the current table;
            Modify the ToPA entry address fields for filled output regions to point to new, unused output regions;
        FI;
    FI;
    Restore the saved IA32_RTIT_CTL value;
FI;

Trace Transport Subsystem

Restricted Memory Access

The destination of trace packet output must not be a restricted memory region. For example, all output regions are subject to the SMRR (System-Management Range Register) restriction: any region overlapping an SMRR range cannot be used as an output region.

To change the SMRRs, first disable trace packet generation, modify the SMRRs, and then re-enable packet generation.

Enabling and Configuration MSRs

General Consideration

Trace packet generation is managed through a set of MSRs:

  • On processors that do not support Intel PT, accessing the IA32_RTIT_* MSRs with RDMSR or WRMSR causes a #GP.

  • Always clear IA32_RTIT_CTL.TraceEn before changing these MSRs; otherwise the write causes a #GP.

  • Each logical processor has its own set of these MSRs, so every configuration applies to a single logical processor.

  • After a cold RESET, all of the MSR configuration is cleared.

    • If CPUID.(EAX=14H, ECX=0):EBX.IPFILT_WRSTPRSV[bit 2] = 1, then after a warm RESET only the TraceEn bit is cleared (which may also affect IA32_RTIT_STATUS); the other MSRs are preserved.
  • Besides explicit writes via WRMSR, these MSRs can also be modified by VM exit or VM entry, and by XSAVES/XRSTORS.

IA32_RTIT_CTL MSR

Position Bit Name At Reset Bit Description
0 TraceEn 0 1: tracing enabled; 0: tracing disabled. On a 1 -> 0 transition, the CPU's internal trace buffers are flushed
1 CYCEn 0 0: disables CYC packets; 1: enables CYC packets
2 OS 0 0: no packets generated at CPL = 0; 1: packets generated at CPL = 0
3 User 0 0: no packets generated at CPL > 0; 1: packets generated at CPL > 0
4 PwrEvtEn 0 Power Event Trace enable
5 FUPonPTW 0 0: PTW packets are not followed by a FUP packet; 1: they are
6 FabricEN 0 0: trace output is written to memory (ToPA or single range); 1: trace output goes to the trace transport subsystem
7 CR3Filter 0 0: disables CR3 filtering; 1: enables it
8 ToPA 0 0: single-range output; 1: ToPA output
9 MTCEn 0 0: disables MTC packets; 1: enables them
10 TSCEn 0 0: disables TSC packets; 1: enables them
11 DisRETC 0 0: enables RET compression; 1: disables it
12 PTWEn 0 0: disables PTWRITE packet generation; 1: enables it
13 BranchEn 0 0: disables COFI-based packets; 1: enables them
17:14 MTCFreq 0 defines the MTC packet frequency, based on the Always Running Timer (ART)
22:19 CycThresh 0 CYC packet threshold
27:24 PSBFreq 0 controls the PSB generation frequency (approximate, not precise)
35:32 ADDR0_CFG 0 0: address range 0 unused; 1: the associated MSRs define the bounds of an IP filter range; 2: they define TraceStop bounds
39:36 ADDR1_CFG 0 same, for address range 1
43:40 ADDR2_CFG 0 same, for address range 2
47:44 ADDR3_CFG 0 same, for address range 3
56 InjectPsbPmiOnEnable 0 1: enables the PendPSB and PendTopaPMI bits of IA32_RTIT_STATUS; 0: disables them

Enabling and Disabling Packet Generation with TraceEn

When TraceEn transitions from 0 to 1, Intel PT starts up and generates a series of packets that help the decoder establish the processor's state. If IA32_RTIT_STATUS.PacketByteCnt = 0, a full PSB+ is generated; otherwise, timing-related packets (TSC, TMA, CBR, and so on) are generated.

In addition, a TIP.PGE packet may be generated. While generating a trace, the CPU caches the relevant ToPA structures for performance, so to modify the ToPA tables, first clear TraceEn, then make the changes.

IA32_RTIT_STATUS MSR

The IA32_RTIT_STATUS MSR can be read and written by software, but some of its bits, including ContextEn and TriggerEn, cannot be modified directly.

Position Bit Name At Reset Bit Description
0 FilterEn 0 indicates that the current IP is within the configured trace range
1 ContextEn 0 indicates that the current context is being traced
2 TriggerEn 0 indicates that tracing is enabled
4 Error 0 indicates an operational error; when set, TriggerEn is cleared and tracing stops
5 Stopped 0 indicates that a ToPA STOP condition was reached
6 PendPSB 0 when set, a PSB+ is inserted into the trace, after which the bit is cleared
7 PendTopaPMI 0 when set, a ToPA PMI is pending; cleared once the PMI is delivered
48:32 PacketByteCnt 0 the number of packet bytes emitted so far; used to trigger the next PSB packet. The processor updates this value continuously while tracing

IA32_RTIT_ADDRn_A and IA32_RTIT_ADDRn_B MSRs

These MSRs are enabled through the ADDRn_CFG fields of IA32_RTIT_CTL.

IA32_RTIT_CR3_MATCH MSR

When IA32_RTIT_CTL.CR3Filter = 1, this MSR is compared against the current CR3. Bits 63:5 hold the CR3 value to match.

IA32_RTIT_OUTPUT_BASE MSR

This MSR configures the base physical address of the trace output destination. Its width is limited by the maximum physical address width (MAXPHYADDR), reported by CPUID.80000008H:EAX[7:0].

When ToPA output mode is used, the processor may update this MSR itself; those updates are asynchronous with instruction execution, so the value in this MSR is unreliable while tracing.

IA32_RTIT_OUTPUT_MASK_PTRS MSR

This MSR holds the position at which the next trace byte will be written. As above, it is updated asynchronously by the processor and is unreliable while tracing.

Position Bit Name At Reset Bit Description
6:0 LowerMask 7FH forced to 1
31:7 MaskOrTableOffset 0 meaning depends on IA32_RTIT_CTL.ToPA. If ToPA = 0, this is the mask value of the single, contiguous physical output region, so the region size must lie between 128 B and 4 GB. If ToPA = 1, it holds bits 27:3 of the offset pointer into the current ToPA table; adding it to IA32_RTIT_OUTPUT_BASE yields a pointer to the current table entry. Each table may be at most 256 MB
63:32 OutputOffset 0 meaning depends on IA32_RTIT_CTL.ToPA. If ToPA = 0, this is the offset pointer within the single, contiguous physical output region; adding it to IA32_RTIT_OUTPUT_BASE gives the address of the next byte to write, and it must not exceed the mask. If ToPA = 1, it is the offset pointer within the current ToPA output region; adding it to the output region's base gives the address of the next byte to write

Interaction of Intel Processor Trace and Other Processor Features

TODO

CONFIGURATION AND PROGRAMMING GUIDELINE

Detection of Intel Processor Trace and Capability Enumeration

See the code.
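The detection step boils down to checking CPUID.(EAX=07H,ECX=0):EBX[25], after which leaf 14H enumerates the PT capabilities. A sketch that only decodes an already-fetched EBX value (fetching it would use the cpuid instruction in real code):

```cpp
#include <cassert>
#include <cstdint>

// CPUID.(EAX=07H,ECX=0):EBX bit 25 enumerates Intel PT support.
bool hasIntelPT(uint32_t leaf7Ebx) {
    return (leaf7Ebx >> 25) & 1;
}
```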

Packet Decoding of RIP versus LIP

FUP, TIP, TIP.PGE, and TIP.PGD packets may contain an IP payload. On some processors this payload is an effective address (RIP); on others it is a linear address (LIP). In the former case the payload is an offset from the code segment base (CS.base); in the latter it is that offset plus CS.base. For software that runs with a CS base address of 0 (64-bit mode), the two are indistinguishable.
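The RIP/LIP relationship above is just an added segment base. A trivial sketch:

```cpp
#include <cassert>
#include <cstdint>

// A RIP payload is an offset from CS.base; a LIP payload already
// includes it. With CS.base = 0 (64-bit mode) the two coincide.
uint64_t lipFromRip(uint64_t csBase, uint64_t rip) {
    return csBase + rip;
}
```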

The IP type configured for IP filtering and TraceStop must also be consistent with this.

Model Specific Capability Restrictions

Some processors restrict the use of MSRs such as LBRs/BTS/BTM/LERs while a trace is in progress.

Enabling and Configuration of Trace Packet Generation

First detect whether Intel PT is supported, then program the relevant MSRs to configure tracing.

The following are the PT-related MSRs on Skylake (6th gen), Kaby Lake (7th gen), Coffee Lake (8th and 9th gen), and Cannon Lake (8th gen i3).

Hex Dec Register Name Scope Description
560H 1376 IA32_RTIT_OUTPUT_BASE Thread Trace Output Base Register (R/W)
561H 1377 IA32_RTIT_OUTPUT_MASK_PTRS Thread Trace Output Mask Pointers Register (R/W)
570H 1392 IA32_RTIT_CTL Thread Trace Control Register (R/W)
571H 1393 IA32_RTIT_STATUS Thread Tracing Status Register (R/W)
572H 1394 IA32_RTIT_CR3_MATCH Thread Trace Filter CR3 Match Register (R/W)
580H 1408 IA32_RTIT_ADDR0_A Thread Region 0 Start Address (R/W)
581H 1409 IA32_RTIT_ADDR0_B Thread Region 0 End Address (R/W)
582H 1410 IA32_RTIT_ADDR1_A Thread Region 1 Start Address (R/W)
583H 1411 IA32_RTIT_ADDR1_B Thread Region 1 End Address (R/W)

Enabling Packet Generation

Set TraceEn in IA32_RTIT_CTL.

Disabling Packet Generation

Clear TraceEn in IA32_RTIT_CTL, then inspect the Error and Stopped bits in IA32_RTIT_STATUS to determine why tracing stopped.

Flushing Trace Output

Packets are buffered internally by the CPU and written to memory asynchronously. A decoder must therefore stop tracing first so that all buffered packets are flushed out.

Warm reset

Across a warm reset, the configuration in these MSRs is preserved, except for TraceEn and certain bits of IA32_RTIT_STATUS.

Manual Trace Configuration Context Switch

The trace configuration can be saved and restored with RDMSR and WRMSR. To save the trace configuration context:

  1. RDMSR IA32_RTIT_CTL and save the value to memory.

  2. WRMSR IA32_RTIT_CTL with the value saved above, making sure TraceEn is cleared before the write.

  3. Read the remaining MSRs with RDMSR and save their values to memory.

To restore the trace configuration context:

  1. Write the values saved in memory back to the MSRs, except IA32_RTIT_CTL.

  2. Write the saved IA32_RTIT_CTL value back to its MSR.

Trace Configuration Context Switch Using XSAVES/XRSTORS

TODO

Cycle-Accurate Mode

CYC packets provide low-level information from the processor core clock domain, tied only to the timing of instruction execution.

TODO

Decoder Synchronization (PSB+)

The decoder uses PSB packets as synchronization points in the trace log. Beyond simple synchronization, the decoder also needs some state information and, potentially, timing information. The state the decoder keeps, such as LastIP, the call stack, and compound packet events, is always bounded by two PSBs.

When a PSB is generated, it is followed by a PSBEND packet. The packets between the two carry the processor state information the decoder needs. This group of packets is called PSB+; the packets are status-only, do not change the state established by the PSB, and are not tied to any particular instruction. A PSB+ may include the current timestamp.

Internal Buffer Overflow

TODO

Trace Packets and Data Types

Packet Relationships and Ordering

This section explains how these packets are bound to the disassembled code. Some packets (TIP, FUP) carry the associated IP in their payload; for others (TNT), the decoder must locate the corresponding instructions itself. The decoder must take the relationships between packets into account and use them to determine how to bind each packet.

Some events, such as interrupts and exceptions, are "compound packet events": a single instruction can produce multiple packets. Packets generated by a compound event usually begin with a FUP giving the source address and end with a TIP or TIP.PGD giving the destination address; in this case the FUP is said to be "coupled with" the TIP. Other packets, such as timing- or state-related ones, may appear between the two.

Event Type Beginning Middle End Comment
Unconditional, uncompressed control-flow transfer FUP, or none Any combination of PIP, VMCS, Mode.Exec, or None TIP or TIP.PGD FUP only for asynchronous events. Order of middle packets may vary. PIP/VMCS/MODE only if the operation modifies the state tracked by these respective packets
TSX Update MODE.TSX, and (FUP or none) None TIP, TIP.PGD, or none FUP TIP/TIP.PGD only for TSX abort case
Overflow OVF PSB, PSBEND, or none FUP or TIP.PGE FUP if overflow resolves while ContextEn = 1, else TIP.PGE

Packet Blocks

Packet blocks are used to dump a set of state values; see the manual for details.

Chapter 5 Introduction

This is Chapter 5 of the tutorial. The first four chapters covered the implementation of Kaleidoscope, including JIT support. In this chapter we add simple conditional branches and loop constructs to the language.

If/Then/Else

Adding if/then/else support to Kaleidoscope is straightforward: add handling for the construct to the lexer, AST, parser, and LLVM IR code generation.

First, let's look at what kind of extension we want to add (the what):

def fib(x) if x < 3 then 1 else fib(x - 1) + fib(x - 2);

Kaleidoscope has only expressions, no statements, so if/then/else must itself evaluate to a value: depending on the condition, the result is either the value of the then expression or the value of the else expression.

In the condition, the value 0.0 is treated as false and every other value as true.
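This truthiness rule mirrors the `fcmp one` ("ordered and not equal") comparison the codegen will emit against 0.0; note that "ordered" means NaN compares false too. A small sketch (function name is mine):

```cpp
#include <cassert>
#include <cmath>

// Kaleidoscope truthiness as produced by `fcmp one %x, 0.0`:
// "ordered and not equal", so NaN is unordered and therefore false.
bool isTruthy(double x) {
    return !std::isnan(x) && x != 0.0;
}
```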

Lexer Extension for If/Then/Else

First we add the keywords:

    // control flow
tok_if = -6,
tok_then = -7,
tok_else = -8,
};

Then recognize them in gettok():

if (IdentifierStr == "if")
return tok_if;
if (IdentifierStr == "then")
return tok_then;
if (IdentifierStr == "else")
return tok_else;

AST Extensions for If/Then/Else

Next, add the new AST node:

// IfExprAST - Expression class for if/then/else.

class IfExprAST : public ExprAST {
std::unique_ptr<ExprAST> Cond, Then, Else;

public:
IfExprAST(std::unique_ptr<ExprAST> Cond, std::unique_ptr<ExprAST> Then, std::unique_ptr<ExprAST> Else) : Cond (std::move(Cond)), Then (std::move(Then)), Else (std::move(Else)) {}

llvm::Value *codegen() override;
};

Parser Extension for If/Then/Else

The new parser function:

// ifexpr ::= 'if' expr 'then' expr 'else' expr;
std::unique_ptr<ExprAST> ParseIfExpr() {
// eat 'if'
getNextToken();
// condition
auto Cond = ParseExpression();
if (!Cond) {
return nullptr;
}
if (CurTok != tok_then) {
return LogError("Expected then");
}

// eat 'then'
getNextToken();
auto Then = ParseExpression();
if (!Then) {
return nullptr;
}
if (CurTok != tok_else) {
return LogError("Expected else");
}

// eat 'else'
getNextToken();
auto Else = ParseExpression();
if (!Else) {
return nullptr;
}

return std::make_unique<IfExprAST>(std::move(Cond), std::move(Then), std::move(Else));
}

Hook this ParseIfExpr() into ParsePrimary():

switch (CurTok)
{
case tok_identifier:
return ParseIdentifierExpr();
case tok_number:
return ParseNumberExpr();
case '(':
return ParseParenExpr();
case tok_if:
return ParseIfExpr();
default:
return LogError("unknown token when expecting an expression");
}

LLVM IR for If/Then/Else

With the AST and parser in place, the next step is to implement the codegen() method; this is where the new material for this chapter comes in.

Let's first look at a piece of LLVM IR:

declare double @foo()

declare double @bar()

define double @baz(double %x) {
entry:
%ifcond = fcmp one double %x, 0.000000e+00
br i1 %ifcond, label %then, label %else

then: ; preds = %entry
%calltmp = call double @foo()
br label %ifcont

else: ; preds = %entry
%calltmp1 = call double @bar()
br label %ifcont

ifcont: ; preds = %else, %then
%iftmp = phi double [ %calltmp, %then ], [ %calltmp1, %else ]
ret double %iftmp
}

Using the llvm-as and opt tools we can draw the CFG of this code (t.ll) by running llvm-as-7 < t.ll | opt-7 -analyze -view-cfg:

Another way to produce the CFG graph is to call F->viewCFG() or F->viewCFGOnly() from the code or from a debugger, where F is a Function*.

Whether from the LLVM IR or from the CFG graph, we can read off the following: the entry block compares the value of x against 0 with an fcmp of kind one. one means "ordered and not equal" (ordered meaning neither operand is NaN), and fcmp is a floating-point comparison. br then jumps to label %then or label %else depending on the value of %ifcond.

The %then and %else basic blocks contain nearly identical instructions: call some function, then jump to %ifcont. So in the final basic block we need to determine which block control flow came from in order to determine the value of the whole if/then/else expression.

This is exactly what the phi instruction does: if control came from %then, its value is %calltmp; if control came from %else, its value is %calltmp1.

Code Generation for If/Then/Else

To generate code for If/Then/Else, we implement Value *IfExprAST::codegen().

llvm::Value *IfExprAST::codegen() {
llvm::Value * CondV = Cond->codegen();
if (!CondV) {
return nullptr;
}

// convert condition to a bool by comparing non-equal to 0.0
CondV = Builder.CreateFCmpONE(CondV, llvm::ConstantFP::get(TheContext, llvm::APFloat(0.0)), "ifcond");

llvm::Function *TheFunction = Builder.GetInsertBlock()->getParent();

// Create blocks for the then and else
// cases. Insert the 'then' block at the
// end of the function
llvm::BasicBlock *ThenBB = llvm::BasicBlock::Create(TheContext, "then", TheFunction);
llvm::BasicBlock *ElseBB = llvm::BasicBlock::Create(TheContext, "else");
llvm::BasicBlock *MergeBB = llvm::BasicBlock::Create(TheContext, "ifcont");
Builder.CreateCondBr(CondV, ThenBB, ElseBB);

// Emit then Value
Builder.SetInsertPoint(ThenBB);
llvm::Value *ThenV = Then->codegen();
if (!ThenV) {
return nullptr;
}

Builder.CreateBr(MergeBB);
ThenBB = Builder.GetInsertBlock();

// Emit the 'else' block
TheFunction->getBasicBlockList().push_back(ElseBB);
Builder.SetInsertPoint(ElseBB);

llvm::Value *ElseV = Else->codegen();
if(!ElseV) {
return nullptr;
}

Builder.CreateBr(MergeBB);
ElseBB = Builder.GetInsertBlock();

// Emit the 'merge' block
TheFunction->getBasicBlockList().push_back(MergeBB);
Builder.SetInsertPoint(MergeBB);
llvm::PHINode *PN = Builder.CreatePHI(llvm::Type::getDoubleTy(TheContext), 2, "iftmp");

PN->addIncoming(ThenV, ThenBB);
PN->addIncoming(ElseV, ElseBB);
return PN;
}

First we generate code for the condition (CondV), then use Builder.CreateFCmpONE() to emit the comparison against 0.0.

Then Builder.GetInsertBlock()->getParent() retrieves a pointer to the enclosing function, which we use to create the function's other basic blocks: the then block, the else block, and the final merge block. Note that only the then block is inserted into the function at this point; the else and merge blocks are created but not yet inserted. Builder.CreateCondBr() then creates the conditional branch that jumps to one of the blocks based on the comparison result.

Next, Builder.SetInsertPoint(ThenBB) moves the insertion point to the end of the then block, so subsequent instructions are appended there. We then recursively call codegen() on Then and, once done, create an unconditional branch (CreateBr) that transfers control to the merge block. (In LLVM, every basic block must be terminated with a control-flow instruction; even fall-throughs need an explicit unconditional branch.)

Most importantly, we then need ThenBB = Builder.GetInsertBlock(). When we set up the phi node we create basic-block/value pairs, so the phi needs the blocks that immediately precede it. However, when generating code for the Then and Else expressions, they may themselves contain nested if/then/else, in which case the original Then and Else blocks no longer directly precede the phi node. So after code generation we re-query the current basic block to obtain the predecessor block that actually flows into the phi.

Next we add the else block to the function, call Builder.SetInsertPoint() to set the code insertion point, recursively generate the code, create the branch, and update the phi's predecessor block as before.

Then comes the merge block, with the same operations as before: add the basic block to the function and set the insertion point. We use Builder.CreatePHI() to create a phi node that handles the merging of the branches, and then add the incoming edge information.

Finally, the value of the entire If/Then/Else expression is the value of the phi node.

for loop expression

We want to add a for loop like this:

extern putchard(char);
def printstar(n)
for i = 1, i < n, 1.0 in
putchard(42); # ascii 42 = '*'
printstar(100);

This expression defines a new variable i that starts iterating from 1. While the condition i < n holds, it executes the loop body putchard(42) and increments i by 1 (the step value). For now we define the value of the loop expression itself to be 0.

Lexer Extension for the for Loop

See the code.

AST Extension for the for Loop

See the code.

Parser Extension for the for loop

See the code.

LLVM IR for the for loop

When we generate code for it, we should get a structure like the following:

declare double @putchard(double)

define double @printstar(double %n) {
entry:
; initial value = 1.0 (inlined into phi)
br label %loop

; preds = %loop, %entry
loop:
%i = phi double [1.000000e+00, %entry], [%nextvar, %loop]
; body
%calltmp = call double @putchard(double 4.200000e+01)
; increment
%nextvar = fadd double %i, 1.000000e+00

; termination test
%cmptmp = fcmp ult double %i, %n
%booltmp = uitofp i1 %cmptmp to double
%loopcond = fcmp one double %booltmp, 0.000000e+00
br i1 %loopcond, label %loop, label %afterloop

; preds = %loop
afterloop:
; loop always returns 0
ret double 0.000000e+00
}
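The control flow above can be mirrored in plain C++. Note that the tutorial's IR tests the end condition at the bottom of the loop against the pre-step value of %i, so the body always executes at least once. A sketch (the function and `bodyCalls` counter are stand-ins of mine for the putchard call):

```cpp
#include <cassert>

// Semantics of the IR above: start value, body, step, then the
// end-condition test on the pre-step value of the loop variable.
double runFor(double start, double end, double step, int &bodyCalls) {
    double i = start;
    double cur;
    do {
        cur = i;         // value of %i in this iteration
        ++bodyCalls;     // loop body (e.g. putchard(42))
        i += step;       // %nextvar
    } while (cur < end); // tested after the body, as in the IR
    return 0.0;          // the for expression always evaluates to 0.0
}
```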

Code Generation for the for Loop

llvm::Value *ForExprAST::codegen () {
// Emit the start code first, without 'variable' in scope
llvm::Value *StartVal = Start->codegen();
if (!StartVal) {
return nullptr;
}

// Make the new basic block for the loop header,
// inserting after current block
llvm::Function *TheFunction = Builder.GetInsertBlock()->getParent();
llvm::BasicBlock *PreheaderBB = Builder.GetInsertBlock();
llvm::BasicBlock *LoopBB = llvm::BasicBlock::Create(TheContext, "loop", TheFunction);

// Insert an explicit fall-through from the current block (PreheaderBB) to LoopBB.
Builder.CreateBr(LoopBB);

// Start insertion to LoopBB
Builder.SetInsertPoint(LoopBB);
// Start the PHI node with one Incoming basic block
llvm::PHINode *Variable = Builder.CreatePHI(llvm::Type::getDoubleTy(TheContext), 2, VarName.c_str());
Variable->addIncoming(StartVal, PreheaderBB);

// Within the loop, the variable is defined equal to
// the PHI node, If it shadows an existing variable,
// we have to restore it, so save it now.
llvm::Value *OldVal = NamedValues[VarName];
NamedValues[VarName] = Variable;

// Emit the body of the loop. This, like any other
// expr, can change the current BB. Note that we
// ignore the value computed by the body, but don't
// allow an error.
if (!Body->codegen()) {
return nullptr;
}

// Emit the step value
llvm::Value *StepVal = nullptr;
if (Step) {
StepVal = Step->codegen();
if (!StepVal) {
return nullptr;
}
} else {
// if not specified, use 1.0
StepVal = llvm::ConstantFP::get(TheContext, llvm::APFloat(1.0));
}

// add the step value
llvm::Value *NextVal = Builder.CreateFAdd(Variable, StepVal, "nextvar");

// Compute the end condition
llvm::Value *EndCond = End->codegen();
if (!EndCond) {
return nullptr;
}

// Convert condition to a bool by comparing non-equal to 0.0
EndCond = Builder.CreateFCmpONE(EndCond, llvm::ConstantFP::get(TheContext, llvm::APFloat(0.0)), "loopcond");

// Create the "after loop" block and insert it
llvm::BasicBlock *LoopEndBB = Builder.GetInsertBlock();
llvm::BasicBlock *AfterBB = llvm::BasicBlock::Create(TheContext, "afterloop", TheFunction);

// Insert the conditional branch into the end of LoopEndBB
Builder.CreateCondBr(EndCond, LoopBB, AfterBB);

// Any new code will be inserted in AfterBB
Builder.SetInsertPoint(AfterBB);

// Add a new entry to the PHI node for the backedge
Variable->addIncoming(NextVal, LoopEndBB);

// Restore the unshadowed variable
if (OldVal) {
NamedValues[VarName] = OldVal;
} else {
NamedValues.erase(VarName);
}

// Always returns 0.0
return llvm::Constant::getNullValue(llvm::Type::getDoubleTy(TheContext));
}

First we emit the start value, which in the parser corresponds to the initial value assigned to the loop variable i, so that is computed first. The next step is to emit the loop body: get the function containing the current basic block with Function *TheFunction = Builder.GetInsertBlock()->getParent(), record the current basic block as BasicBlock *PreheaderBB, create the loop-body basic block (BasicBlock *LoopBB) and append it to the function, then add a fall-through Br instruction at the end of PreheaderBB that branches to LoopBB.

This code is very similar to the if/then/else code generation we saw earlier. Because the loop-body basic block has two incoming control-flow edges (the fall-through and the loop backedge), we use a phi node to track where control came from. At this point only PreheaderBB exists, so we add just that one incoming edge for now: move the insertion point into LoopBB, then use the Builder to create the phi node.

When handling the loop variable, it may clash with a function argument of the same name. So we use a shadow variable to stash the existing entry from the symbol table (NamedValues) first, which gives us a simple notion of scoping. We then emit the loop body by recursively calling Body->codegen(). If a step value (StepVal) was given, we generate code for Step; otherwise the step defaults to 1. We then create NextVar, the value of the loop variable for the next iteration.

Finally we compute the end condition EndCond: first generate its code, then build an FCmpONE from it to serve as the condition of the subsequent CondBr instruction. BasicBlock *LoopEndBB = Builder.GetInsertBlock() re-queries the current basic block, the last block of the loop body, and BasicBlock *AfterBB is created so that all subsequent code generation happens after the loop. Lastly we add the incoming edge from BasicBlock *LoopEndBB to the phi node, restore the original variable from the shadow object, and return 0.

Chapter 4 Introduction

This chapter has two main parts: (1) adding an optimizer to the language, and (2) adding JIT support.

Trivial Constant Folding

We did not add any optimizations in Chapter 2; nevertheless, IRBuilder automatically performs the following optimization for us:

ready> def test(x) 1+2+x;
Read function definition:
define double @test(double %x) {
entry:
%addtmp = fadd double 3.000000e+00, %x
ret double %addtmp
}

Without this optimization, the generated code would add 1 and 2, then add the result to x. This kind of optimization is called constant folding, and many languages implement it on the AST representation. But when implementing a language with LLVM and generating code through IRBuilder, the optimization happens automatically.
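The AST-level variant that other languages implement by hand can be sketched in a few lines: when both operands of an addition are constants, produce the folded constant instead of an add node. A minimal sketch with a hypothetical mini-AST of my own (not the tutorial's ExprAST):

```cpp
#include <cassert>

// Hypothetical mini-AST node: either a constant or a non-constant value.
struct Expr { double val; bool isConst; };

Expr makeConst(double v) { return {v, true}; }
Expr makeVar() { return {0.0, false}; }

// Fold `a + b` at build time when both sides are constants;
// otherwise stand in for emitting an fadd instruction.
Expr addExpr(const Expr &a, const Expr &b) {
    if (a.isConst && b.isConst)
        return makeConst(a.val + b.val);  // constant folding
    return {0.0, false};                  // an fadd would be emitted
}
```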

However, IRBuilder's own optimizations are limited in scope:

ready> def test(x) (1+2+x)*(x+(1+2));
Read the function definition:
define double @test(double %x) {
entry:
%addtmp = fadd double 3.000000e+00, %x
%addtmp1 = fadd double %x, 3.000000e+00
%multmp = fmul double %addtmp, %addtmp1
ret double %multmp
}

In this example the LHS and RHS are clearly the same value, but IRBuilder's local analysis cannot possibly detect and optimize this. That requires two transformations: (1) reassociation of expressions (to make the additions canonical), and (2) common subexpression elimination (CSE, to delete the redundant add instruction).

So we use "passes" to perform these optimizations.

LLVM Optimization Passes

The PassManager has been redesigned. This tutorial is based on llvm::legacy::FunctionPassManager, which can be found in LegacyPassManager.h, but I will try to use the new pass manager instead.

Following Luke's answer, I use the new-style passes that inherit from the PassInfoMixin class, a FunctionAnalysisManager to register the passes, and the PassBuilder helper class.

Link to the answer

LLVM provides many optimization passes, and also lets compiler developers define their own passes and invoke them at the right time.

As a concrete example, LLVM provides passes that operate on a whole Module as well as passes that operate on a single function. In Kaleidoscope we use per-function optimization passes, i.e., we optimize each function as the user defines it.

First add the global variables we need:

static std::unique_ptr<llvm::FunctionPassManager> TheFPM;
static std::unique_ptr<llvm::FunctionAnalysisManager> TheFAM;
static std::unique_ptr<llvm::PassBuilder> ThePB;

Next we set up the FunctionPassManager and use it to add the passes we want to run. Since every Module needs its own FunctionPassManager, we add a function that initializes them; TheFAM and ThePB are initialized only at the end of that function:

void InitializeModuleAndPassManager()
{
// open a new Module
TheModule = std::make_unique<llvm::Module>("my cool jit", TheContext);

// Create a new pass manager attached to it
TheFPM = std::make_unique<llvm::FunctionPassManager>(TheModule.get());

// Combine instructions to form fewer, simple
// instructions. This pass does not modify the CFG. This
// pass is where algebraic simplification happens.
// (https://llvm.org/docs/Passes.html#instcombine-combine-redundant-instructions)
TheFPM->addPass(llvm::InstCombinePass());

// This pass reassociates commutative expressions in an order
// that is designed to promote better constant propagation, GCSE, LICM, PRE, etc.
// (https://llvm.org/docs/Passes.html#reassociate-reassociate-expressions)
TheFPM->addPass(llvm::ReassociatePass());

// This pass performs global value numbering to eliminate fully and partially redundant instructions.
// It also performs redundant load elimination.
// (https://llvm.org/docs/Passes.html#gvn-global-value-numbering)
TheFPM->addPass(llvm::GVN());

// Performs dead code elimination and basic block merging.
// (https://llvm.org/docs/Passes.html#simplifycfg-simplify-the-cfg)
TheFPM->addPass(llvm::SimplifyCFGPass());

// Using PassBuilder and FunctionAnalysis Manager Register analysis passes
ThePB = std::make_unique<llvm::PassBuilder>();
TheFAM = std::make_unique<llvm::FunctionAnalysisManager>();
ThePB->registerFunctionAnalyses(*TheFAM.get());
}

This code first initializes the global TheModule and TheFPM, then adds passes to TheFPM with a series of addPass() calls. Once the passes are added, it uses ThePB and TheFAM to register the analyses.

Then we have to use it, so after generating code in FunctionAST::codegen() we add:

// Finish off the function
Builder.CreateRet(RetVal);
// Validate the generated code, checking for consistency
llvm::verifyFunction(*TheFunction);

// Optimization of the function code
TheFPM->run(*TheFunction, *TheFAM.get());

return TheFunction;

Remember to call InitializeModuleAndPassManager() in main().

Adding a JIT compiler

Once code has been generated as IR, we can process it with many different tools: for example, optimize it (previous section), dump it and compile the IR into an assembly file (.s), or compile it with a JIT. LLVM IR is the common currency between the parts of the compiler.

In this section we add JIT support to the interpreter. The basic requirement is that once a function body has been entered, the value of a top-level expression is computed immediately; for example, typing 1+2; should immediately print 3. Functions that have been defined should also be callable directly from the command line.

First we prepare the native-target environment and declare and initialize the JIT. Note that class KaleidoscopeJIT referenced in the tutorial lives in the LLVM source tree; since I did not install from source, I copied it over before using it.

First define the global variable static std::unique_ptr<KaleidoscopeJIT> TheJIT;; class KaleidoscopeJIT is defined in KaleidoscopeJIT.h.

Then add the following to the main() function:

int main()
{
#ifdef JIT
// Initialize native target
llvm::InitializeNativeTarget();
llvm::InitializeNativeTargetAsmPrinter();
llvm::InitializeNativeTargetAsmParser();
#endif
InitTokPrecedence();
// prime the first token
fprintf(stderr, "ready> ");
getNextToken();
TheJIT = std::make_unique<llvm::orc::KaleidoscopeJIT>();
InitializeModuleAndPassManager();
MainLoop();
#ifdef JIT
// Print out all of the generated code
TheModule->print(llvm::errs(), nullptr);
#endif
return 0;
}

We must also set the data layout in the InitializeModuleAndPassManager() function:

// open a new Module
TheModule = std::make_unique<llvm::Module>("my cool jit", TheContext);
TheModule->setDataLayout(TheJIT->getTargetMachine().createDataLayout());

// Create a new pass manager attached to it
TheFPM = std::make_unique<llvm::FunctionPassManager>(TheModule.get());

class KaleidoscopeJIT is a simple JIT class that we will extend in later chapters. The API it currently provides is very simple: addModule() adds a Module to the JIT, making its functions available for execution; removeModule() removes a Module, freeing the associated memory; findSymbol() lets us look up a pointer to compiled code by its string name.

With these APIs we can add expression evaluation to the top-level handler:

if (auto FnAST = ParseTopLevelExpr())
{
if (FnAST->codegen())
{
// JIT the module containing the anonymous expression, keeping a handle so we free it later
auto H = TheJIT->addModule(std::move(TheModule));
// Optimization of passes
InitializeModuleAndPassManager();

// Search the JIT for the __anon_expr symbol
auto ExprSymbol = TheJIT->findSymbol("__anon_expr");
assert(ExprSymbol && "Function not found");

// Get the symbol's address and cast it to the right type (takes no arguments, returns a double) so we can call it as a native function.
double (*FP)() = (double (*)())static_cast<intptr_t> (ExprSymbol.getAddress().get());
fprintf(stderr, "Evaluated to %f\n ", FP());

// Delete the anonymous expression module from the JIT
TheJIT->removeModule(H);
}
}

If parsing and codegen both succeed, the next step is to add the Module containing the top-level expression to the JIT. We do this with the addModule() method, which triggers code generation for all functions in that Module and returns a VModuleKey object that lets us remove the Module later. Once a Module has been added to the JIT it can no longer be modified, so we call InitializeModuleAndPassManager() to open a new Module to hold subsequently generated code.

Once the Module has been added to the JIT, we need a pointer to the generated code, so we call the JIT's findSymbol() method with the name of the top-level expression's function, "__anon_expr", to obtain its instruction address.

Note that "__anon_expr" is assigned when parsing the top-level expression:

std::unique_ptr<FunctionAST> ParseTopLevelExpr()
{
if (auto E = ParseExpression())
{
// anonymous nullary function
auto Proto = std::make_unique<PrototypeAST>("__anon_expr", std::vector<std::string>());

return std::make_unique<FunctionAST>(std::move(Proto), std::move(E));
}
return nullptr;
}

Then we use getAddress().get() to obtain the function's address in memory. The anonymous function we defined takes no arguments and returns a double. Because the LLVM JIT compiler matches the native platform ABI, we can cast the address directly to a function pointer and call it. This means JIT-compiled code is indistinguishable from native machine code statically linked into the application.

Finally, since we do not support re-evaluating top-level expressions, we remove the Module at the end of code generation to free the associated memory. The Module we opened earlier with InitializeModuleAndPassManager() remains open, and new code can keep being added to it.

Calling a previously defined function again would then fail to resolve its symbol: a Module is the JIT's unit of allocation, and the function lives in the earlier Module. When we remove a Module, we effectively delete the definitions of all functions in it, so a later call cannot find the symbol.

The simplest fix is to put the anonymous expression in a separate Module, away from the other function definitions.

In fact, we want to go a step further and put every function into its own Module. This behaves more like a REPL: a function can be defined multiple times, and every reference to it resolves to the most recent definition.
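The resolution rule this yields can be sketched without LLVM: search the added modules newest-first, so a redefinition shadows earlier ones. A simulation of mine (integers stand in for compiled function addresses):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// A module maps function names to "addresses" of compiled code.
using Module = std::map<std::string, int>;
std::vector<Module> jitModules;  // in the order addModule() was called

// Resolve a symbol the way the JIT does: newest module first.
const int *findSymbol(const std::string &name) {
    for (auto it = jitModules.rbegin(); it != jitModules.rend(); ++it) {
        auto f = it->find(name);
        if (f != it->end()) return &f->second;
    }
    return nullptr;  // the real JIT would fall back to dlsym() here
}
```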

static std::map<std::string, std::unique_ptr<PrototypeAST>> FunctionProtos;

llvm::Function *FunctionAST::codegen()
{
// Transfer ownership of the prototype to the FunctionProtos
// map, but keep a reference to it use below
auto &P = *Proto;
FunctionProtos[Proto->getName()] = std::move(Proto);
// Check for an existing function from a previous `extern` declaration
llvm::Function *TheFunction = getFunction(P.getName());

if (!TheFunction)
{
return nullptr;
}
....
}

llvm::Value *CallExprAST::codegen()
{
// Lookup the name in the global module table
llvm::Function *CalleeF = getFunction(Callee);
...
}

llvm::Function *getFunction(std::string Name) {
// First, check if the function has already been added to the
// current module
if (auto *F = TheModule->getFunction(Name)) {
return F;
}

// If not, check whether we can codegen the declaration
// from some existing prototype
auto FI = FunctionProtos.find(Name);
if (FI != FunctionProtos.end()) {
return FI->second->codegen();
}

// if no existing prototype exists, return null
return nullptr;
}

First we add a global map<string, unique_ptr<PrototypeAST>> FunctionProtos that holds the most recent prototype for each function, and a helper getFunction() that replaces TheModule->getFunction(). Our getFunction() first searches the Module for an existing function declaration and, if there is none, generates code from the saved prototype. This lets us delete the prototype-generation snippet from the old Function *FunctionAST::codegen(). Also, in Function *FunctionAST::codegen() we first update FunctionProtos and then call getFunction(). Through this global FunctionProtos we can reach every previously declared function.

We also need to update HandleDefinition() and HandleExtern():

void HandleDefinition()
{
if (auto FnAST = ParseDefinition())
{
if (auto *FnIR = FnAST->codegen())
{
fprintf(stderr, "Read the function definition: ");
FnIR->print(llvm::errs());
fprintf(stderr, "\n");
TheJIT->addModule(std::move(TheModule));
InitializeModuleAndPassManager();
}
}
else
{
// Skip token for error recovery
getNextToken();
}
}

void HandleExtern()
{
if (auto ProtoAST = ParseExtern())
{
if (auto *FnIR = ProtoAST->codegen())
{
fprintf(stderr, "Read extern: ");
FnIR->print(llvm::errs());
fprintf(stderr, "\n");
FunctionProtos[ProtoAST->getName()] = std::move(ProtoAST);
}
}
else
{
// Skip token for error recovery
getNextToken();
}
}

HandleDefinition() 中, 我们需要将刚刚定义好的函数所在的 Module 添加到 JIT, 然后打开一个新的 Module, 这样就使得每个函数都在不同的 Module 中. 在 HandleExtern() 中, 只需要将这个外部声明添加到 FunctionProtos 中即可.

Also, for extern declarations (e.g., extern sin(x)), the JIT has a straightforward symbol-resolution mechanism: it searches all added Modules in reverse chronological order for the symbol's definition. If none is found, it falls back to calling dlsym("sin"), looking the symbol up in the JIT process's address space and calling it there (e.g., in libm).

Later we will discuss the JIT's symbol-resolution mechanism further and tweak it to implement features such as security, dynamic code generation, and even lazy evaluation.

One immediate benefit of being able to adjust the symbol-resolution rules is that we can extend the language with arbitrary C++ code:

extern "C" DLLEXPORT double putchard(double X) {
fputc((char)X, stderr);
return 0;
}

Add -rdynamic at link time so that shared objects opened with dlopen() can resolve symbols defined in the program itself.

Introduction

We present mid-fat pointers, an approach to support targeted but composable metadata-based software defenses. The key idea is to strike a balance between traditional fat pointers and low-fat pointers by preserving the pointer size but changing its representation to embed data. This is done by piggybacking on SFI to decode the pointers efficiently; since SFI is already in place in such a scenario, its use by mid-fat pointers incurs no additional overhead.

With mid-fat pointers, each pointer can embed the location of the metadata for the pointed-to memory object. Compared to recent generic metadata management schemes, we improve performance, because fewer lookups are needed when the metadata location is cached in the pointer, and security, because an out-of-bounds pointer is still associated with the same metadata.

Threat model

Since mid-fat pointers provide a general defense framework, the threat model depends on the particular defenses deployed. With regard to these defenses, we only assume they require an SFI baseline. As a result, the metadata is inherently protected against both arbitrary memory reads and writes.

Design

Pointer encoding

One core insight behind our framework is that, given the presence of SFI to protect metadata from attackers, we can use the unused bits in pointers without an additional performance impact. In particular, whenever the program allocates memory, we store a pointer to the metadata for that chunk of memory in the high bits of the pointer. This design has three benefits: (1) it requires a full metadata lookup only once at allocation time, caching the more frequent lookups needed by defenses and increasing efficiency; (2) since we know the base address of newly allocated objects, we do not need range queries; and (3) pointers that go out of bounds still point to the metadata of the original object.

By default, we use the 32 lower bits of every pointer to encode where the data is (the data pointer) and the 32 higher bits to encode where the metadata is (the metadata pointer). The downside is that we can use only 32-bit addresses out of the 48 bits implemented on x86-64, which restricts the address space to 4 GB. Note, however, that we could use fewer bits for the metadata pointer by assuming that every object is aligned on 8-byte boundaries.
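The encoding described above is a simple pack/unpack of two 32-bit halves. A sketch (function names are mine; addresses are plain integers here, not real mapped memory):

```cpp
#include <cassert>
#include <cstdint>

// Mid-fat encoding: low 32 bits = data address, high 32 bits =
// metadata address.
uint64_t encodeMidFat(uint32_t dataAddr, uint32_t metaAddr) {
    return (uint64_t)metaAddr << 32 | dataAddr;
}

// Decoding the data pointer is the same masking SFI already performs.
uint32_t dataPtr(uint64_t p) { return (uint32_t)p; }
uint32_t metaPtr(uint64_t p) { return (uint32_t)(p >> 32); }
```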

Allocation instrumentation

We use a compiler pass that searches for calls to memory allocation functions and adds a call to a hook in our static library. The hook looks up the metadata for the allocated object and embeds the metadata pointer in the uppermost bits of the returned pointer.

Load/store instrumentation. Since the program can only dereference an encoded pointer safely after decoding it, our framework includes a compiler pass that identifies load and store operations and adds the data-pointer decoding to each instance. This effectively implements SFI, shielding all memory past the first 4 GB from attackers.

Call instrumentation. While instrumented code can safely dereference encoded pointers thanks to automatic masking, protected applications may well use unprotected libraries, and we must not pass encoded pointers to them, to avoid segmentation faults. We therefore instrument calls to external functions by decoding any pointer-type parameters. Note that we can identify which functions are imported from libraries because our pass is part of LLVM's link-time optimization. For indirect calls, our instrumentation might not know whether the current value of a function pointer targets an external function. Instead, when a function is address-taken we provide a replacement function that calls the original function with masked arguments.

Pointer comparison and cast instrumentation. Besides dereferencing operations, compatibility issues arise when comparing pointers, and in integer operations on pointers. We could use static analysis to determine which pointers could be used in integer arithmetic and exclude them from our tagging scheme and SFI.

Introduction

Intel SGX provides an abstraction of secure enclaves, which can be used to achieve shielded execution for unmodified legacy applications on untrusted infrastructure.

Shielded execution aims to protect confidentiality and integrity of applications when executed in an untrusted environment. The main idea is to isolate the application from the rest of the system, using only a narrow interface to communicate to the outside.

However, shielded execution does not protect the program against memory safety attacks. To validate our claim, we reproduced publicly available memory safety exploits inside the secure enclave. These examples highlight that a single exploit can completely compromise the integrity and confidentiality properties of shielded execution.

To prevent exploitation of these bugs, we experimented with two prominent software- and hardware-based memory protection mechanisms in the context of shielded execution: AddressSanitizer and Intel MPX, respectively.

But both of them incur high performance overhead due to the additional metadata used to track object bounds.

In this paper, we present SGXBOUNDS. The SGXBOUNDS approach is based on a simple combination of tagged pointers and an efficient memory layout to reduce overheads inside enclaves. In particular, we note that SGX enclave routines use only the 32 lower bits to represent the program address space and leave the 32 higher bits of pointers unused. We utilize these high bits to represent the upper bound of the referent object (or, more broadly, the beginning of the object's metadata area); the lower-bound value is stored right after the object. Such a metadata layout requires only 4 additional bytes per object and does not break cache locality, unlike Intel MPX and AddressSanitizer.

Furthermore, we show that our design naturally extends to: (1) "synchronization-free" support for multi-threaded applications, (2) increased availability instead of the usual fail-stop semantics by tolerating out-of-bounds accesses, and (3) generic APIs for object metadata management to support new use cases.

Background

SCONE is a shielded execution framework that enables unmodified legacy applications to take advantage of the isolation offered by SGX. With SCONE, the program is recompiled against a modified standard C library (SCONE libc), which facilitates the execution of system calls.

Clearly, the combination of SCONE and SGX is not a silver bullet: bugs in the enclave code itself can render these mechanisms useless.

AddressSanitizer is an extension to GCC and Clang/LLVM that detects the majority of object bounds violations. It keeps track of all objects and, on each memory access, checks whether the address falls within one of the live objects.

Intel MPX detects all possible spatial memory vulnerabilities, including intra-object ones (when one member of a structure corrupts other members). Its approach differs from AddressSanitizer's: instead of separating objects with unaddressable redzones, MPX keeps bounds metadata for all pointers and checks against these bounds on each memory access.

One major limitation of the current Intel MPX implementation is the small number of bounds registers. If an application contains many distinct pointers, this causes frequent loads and stores of bounds in memory.

SGXBOUNDS

We built SGXBOUNDS based on the following three insights. First, shielded application memory (specifically, its working set) must be kept minimal due to the very limited EPC size in current SGX implementations. Second, applications spend a considerable amount of time iterating through the elements of an array, and a smartly chosen metadata layout can significantly reduce the overhead of bounds checking. Third, we rely on the SCONE infrastructure with its monolithic build process: all application code is statically linked without external dependencies. The first and second insights dictate the use of per-object metadata combined with tagged pointers to keep memory overhead minimal.

Design overview

All modern SGX-enabled CPUs operate in 64-bit mode, meaning that all pointers are 64 bits in size. In SGX enclaves, however, only 36 bits of virtual address space are currently addressable. Thus, SGXBOUNDS relies on the idea of tagged pointers: a 64-bit pointer contains the pointer itself in its lower 32 bits and the referent object's upper bound in the upper 32 bits.

The value stored in the upper 32 bits (UB) serves not only for the upper-bound check but also as a pointer to the object's other metadata (LB). The metadata is stored right after the referent object.
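
This tagged-pointer encoding can be sketched in plain C. This is our own minimal illustration: the helper names are ours, and 32-bit integer "addresses" stand in for real enclave pointers.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the SGXBOUNDS tagged-pointer layout: the lower 32 bits hold
 * the pointer itself; the upper 32 bits hold the upper bound (UB), which
 * doubles as the address where the lower bound (LB) is stored. */
static uint64_t make_tagged(uint32_t ptr, uint32_t ub) {
    return ((uint64_t)ub << 32) | ptr;
}

static uint32_t tagged_ptr(uint64_t tagged) {
    return (uint32_t)tagged;            /* low 32 bits: the pointer  */
}

static uint32_t tagged_ub(uint64_t tagged) {
    return (uint32_t)(tagged >> 32);    /* high 32 bits: upper bound */
}
```

Because both values live in one machine word, passing or copying a tagged pointer automatically carries the bound along with it.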

This metadata layout has important benefits: (1) it minimizes the amount of memory needed for metadata, (2) it requires no additional memory accesses to obtain the upper bound, and (3) it alleviates the multi-threading and memory-layout problems of classic fat pointers.

As an example, consider the instrumented version of a simple loop that copies M items from array s to array d (pseudocode):

```
int *s[N], *d[N]
s = specify_bounds(s, s + N)
d = specify_bounds(d, d + N)
for (i = 0; i < M; i++)
    si = s + i
    di = d + i
    sp, sLB, sUB = extract(si)
    if bounds_violated(sp, sLB, sUB)
        crash(si)
    val = load sp
    dp, dLB, dUB = extract(di)
    if bounds_violated(dp, dLB, dUB)
        crash(di)
    store val, dp
```

Design details

Pointer creation. Whenever an object is created, SGXBOUNDS associates a pointer with the bounds of this object.

For global and stack-allocated variables, we change their memory layout so that they are padded with 4 bytes and initialized at run time. More specifically, we wrap such variables in two-member structures, e.g., int x is transformed into struct xwrap {int x; void* LB}:

```c
void *specify_bounds(void *p, void *UB) {
    LBaddr = UB;
    *LBaddr = p;
    tagged = (UB << 32) | p;
    return tagged;
}
```

For dynamically allocated variables, SGXBOUNDS wraps memory management functions to append 4 bytes to each newly created object, initializes these bytes with the lower-bound value, and tags the returned pointer with the upper bound:

```c
void *malloc(int size) {
    void *p = malloc_real(size + 4);
    return specify_bounds(p, p + size);
}
```

Note that there is no need to instrument free as the 4 bytes of metadata are removed together with the object itself.
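
The wrapper above can be exercised end-to-end in a small simulation. This is our own sketch under simplifying assumptions: a byte array stands in for the 32-bit enclave address space, and offsets into it play the role of pointers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint8_t heap[1 << 16];    /* simulated 32-bit enclave memory */
static uint32_t next_free = 4;   /* simulated allocation cursor     */

/* malloc wrapper per the text: reserve size + 4 bytes, store the
 * lower bound (the object base) in the 4-byte footer, and return a
 * tagged pointer with the footer address (UB) in the high bits. */
static uint64_t tagged_malloc(uint32_t size) {
    uint32_t p  = next_free;     /* lower bound = object base      */
    uint32_t ub = p + size;      /* footer address = upper bound   */
    next_free += size + 4;
    memcpy(&heap[ub], &p, 4);    /* LB stored right after object   */
    return ((uint64_t)ub << 32) | p;
}
```

Note how freeing the object would reclaim the footer together with the object itself, which is why free needs no instrumentation.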

Run-time bounds checks. SGXBOUNDS inserts a run-time bounds check before each memory access: loads, stores, and atomic operations. For this, the original pointer and the upper and lower bounds are first extracted. To extract the original pointer:

```c
void *extract_p(void *tagged) {
    return tagged & 0xFFFFFFFF;
}
```

To extract the upper bound:

```c
void *extract_UB(void *tagged) {
    return tagged >> 32;
}
```

To extract the lower bound, which is stored in the padded region:

```c
void *extract_LB(void *UB) {
    return *UB;
}
```

Finally, the bound check:

```c
bool bounds_violated(void *p, void *LB, void *UB) {
    if (p < LB || p > UB)
        return true;
    return false;
}
```

Pointer arithmetic. SGXBOUNDS instruments pointer arithmetic so that only the low 32 bits are affected:

```
UB = extract_UB(si)
si = s + i
si = (UB << 32) | extract_p(si)
```
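
In C, this instrumentation might look as follows. The helper name is ours; the point is that the addition wraps within the low 32 bits while the UB tag is carried over unchanged.

```c
#include <assert.h>
#include <stdint.h>

/* Pointer arithmetic on a tagged pointer: only the low 32 bits are
 * updated; the upper-bound tag in the high 32 bits is preserved. */
static uint64_t tagged_add(uint64_t tagged, int32_t offset) {
    uint64_t tag = tagged & 0xFFFFFFFF00000000ULL;       /* keep UB tag */
    uint32_t p   = (uint32_t)tagged + (uint32_t)offset;  /* 32-bit add  */
    return tag | p;
}
```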

Type casts. Pointer-to-integer and integer-to-pointer casts are a notorious problem for fat- and tagged-pointer approaches.

Function calls. SGXBOUNDS does not need to instrument function calls or alter calling conventions. The only uninstrumented code is libc, for which we provide wrappers. This implies that any tagged pointer passed as a function argument is treated as a tagged pointer in the callee.

Advanced Features of SGXBOUNDS

Multi-threading support

AddressSanitizer does not require any specific treatment of multi-threading, but it can negatively affect cache locality if a multi-threaded application was specifically designed to be cache-friendly.

All fat-pointer and disjoint-metadata techniques similar to Intel MPX suffer from multi-threading issues: an update of a pointer and its associated metadata must be implemented as one atomic operation.

SGXBOUNDS does not experience this problem. Indeed, the pointer and the upper bound are always updated atomically since they are stored in the same 64-bit tagged pointer.
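
This property can be illustrated with C11 atomics (our own sketch, not code from the paper): a single 64-bit atomic store publishes the pointer and its upper bound together, with no lock or double-width CAS.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t shared_tagged;  /* one word = pointer + UB */

/* Publish a new pointer and its bound in one atomic 64-bit store;
 * a reader can never observe a pointer paired with a stale bound. */
static void publish(uint32_t p, uint32_t ub) {
    atomic_store(&shared_tagged, ((uint64_t)ub << 32) | p);
}

static uint64_t observe(void) {
    return atomic_load(&shared_tagged);
}
```

A disjoint-metadata scheme would need two separate stores here, opening a window in which another thread sees an inconsistent pointer/bound pair.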

Tolerating bugs with boundless memory

To allow applications to survive most bugs and attacks and continue correct execution, SGXBOUNDS reverts to failure-oblivious computing by using the concept of boundless memory blocks. In this case, whenever an out-of-bounds memory access is detected, SGXBOUNDS redirects this access to a separate "overlay" memory area to prevent corruption of adjacent objects, creating the illusion of "boundless" memory allocated for the object.

Consider the example of a classic off-by-one error: SGXBOUNDS redirects the offending load or store to an overlay address instead of letting it corrupt the metadata stored right after the object.
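
The redirection logic can be sketched as follows. This is our own simplification: a single fixed-size overlay buffer stands in for the per-object overlay areas the real design would look up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t overlay[64];   /* overlay area for out-of-bounds slots */

/* In-bounds accesses (per the inclusive bounds check in the text) are
 * left alone; out-of-bounds accesses are mapped onto an overlay slot
 * so neighboring objects and the metadata footer stay intact. */
static uint8_t *redirect(uint32_t p, uint32_t lb, uint32_t ub) {
    if (p >= lb && p <= ub)
        return NULL;                                   /* in bounds */
    return &overlay[(uint32_t)(p - lb) % sizeof overlay];
}
```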

Metadata management support

So far, we discussed only one type of per-object metadata: the lower bound. However, our memory layout allows us to add an arbitrary number of metadata items for each object to implement additional functionality.

All instrumentation in SGXBOUNDS is implemented as calls to auxiliary functions, which we refer to as instrumentation hooks. One can think of these hooks as a metadata management API: (1) on_create() is called at run time whenever a new object is created; in the context of SGXBOUNDS, it corresponds to the specify_bounds() function, which initializes our only metadata (the lower bound). (2) on_access() is called at each memory access, be it a write, a read, or both (for atomic instructions such as compare-and-swap); in SGXBOUNDS, this hook roughly corresponds to the bounds_violated() function. (3) on_delete() is called whenever the object is deallocated; we support this hook only for heap objects.

| function | description |
| --- | --- |
| on_create(base, size, type) | called after object creation (global, heap, or stack) |
| on_access(addr, size, metadata, access_type) | called before memory access |
| on_delete(metadata) | called before object destruction (only for heap) |
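
As an illustration of this hook API, here is our own C rendering: the signatures follow the table above, and the counting bodies are stand-ins for real instrumentation.

```c
#include <assert.h>
#include <stddef.h>

static int creates = 0, accesses = 0, deletes = 0;

/* Stub hooks with the signatures from the table; real implementations
 * would initialize, check, and release per-object metadata. */
static void *on_create(void *base, size_t size, int type) {
    (void)size; (void)type;
    creates++;
    return base;              /* would return the metadata address */
}

static void on_access(void *addr, size_t size, void *metadata, int access_type) {
    (void)addr; (void)size; (void)metadata; (void)access_type;
    accesses++;
}

static void on_delete(void *metadata) {
    (void)metadata;
    deletes++;
}
```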

Implementation

SGXBOUNDS implementation

SGXBOUNDS is a compile-time transformation pass implemented in LLVM 3.8.

Compiler Support. We treat inline assembly as an opaque memory instruction: all pointer arguments to inline assembly are bounds-checked. SGXBOUNDS does not yet completely support C++ exception handling.

Run-time Support. We implement the boundless memory feature entirely in the run-time support library. To prevent data races, all read/update operations on the cache are synchronized via a global lock. For the tagged pointer scheme, SGXBOUNDS requires SGX enclaves (and thus their virtual address space) to start at 0x0. To allow this, we set the Linux security flag vm.mmap_min_addr to zero for our applications. We also modified the original Intel SGX driver to always start the enclave at address 0x0.