
Chapter 3 Introduction

This chapter covers transforming the Abstract Syntax Tree (AST) built in Chapter 2 into LLVM IR. It gives an overview of what LLVM can do and shows how to use it.

Note: the code in this chapter requires LLVM 3.7 or newer.

Code Generation Setup

To generate IR, we first define a virtual code generation method in each AST class.

// ExprAST - Base class for all expression nodes
class ExprAST {
public:
  virtual ~ExprAST() {}
  virtual llvm::Value *codegen() = 0;
};

// NumberExprAST - Expression class for numeric literals like "1.0"
class NumberExprAST : public ExprAST {
  double Val;

public:
  NumberExprAST(double Val) : Val(Val) {}
  virtual llvm::Value *codegen();
};

The codegen() method emits IR for its AST node (each class does this differently) and returns a pointer to a Value object. The Value class represents an SSA (Static Single Assignment) register or SSA value in LLVM. SSA means that each variable is assigned exactly once and must be defined before it is used.

We also add a LogErrorV() helper to report errors found during IR generation:

llvm::Value *LogErrorV(const char *Str) {
  LogError(Str);
  return nullptr;
}

and declare the static variables we may need:

static llvm::LLVMContext TheContext;
static llvm::IRBuilder<> Builder{TheContext};
static std::unique_ptr<llvm::Module> TheModule;
static std::map<std::string, llvm::Value *> NamedValues;

LLVMContext TheContext is an object that owns a lot of core LLVM data structures, such as the type and constant value tables. IRBuilder Builder is a helper object that makes it easy to generate LLVM instructions: it keeps track of the current place to insert instructions and has methods to create new ones. A Module instance stores all of the information related to one LLVM module. In LLVM, a Module represents a single unit of code that is to be processed together, roughly a source file. TheModule owns the memory for all of the IR we generate, which is why codegen() returns a raw Value * rather than a unique_ptr<Value>. map<string, Value *> NamedValues keeps track of which values are defined in the current scope and what their LLVM representation is; in other words, it is a symbol table for the code. In this form of Kaleidoscope, only function parameters can be referenced, so when we generate code for a function body we look the parameters up here.

For now, assume that the IRBuilder Builder has already been set up to insert instructions somewhere.
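As a hedged illustration (not part of the tutorial text), the driver could set TheModule up before entering the main loop roughly like this; the module name "my cool jit" is just a placeholder, and InitTokPrecedence()/MainLoop() are the parser driver pieces from Chapter 2:

// Sketch of driver-side setup, assuming the static globals above.
int main() {
  InitTokPrecedence();                      // install operator precedences
  getNextToken();                           // prime the first token
  TheModule = std::make_unique<llvm::Module>("my cool jit", TheContext);
  MainLoop();
  TheModule->print(llvm::errs(), nullptr);  // dump all generated IR
  return 0;
}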

Expression Code Generation

Generating LLVM code for expression nodes is straightforward: each node simply calls the appropriate IRBuilder function. First, numeric literals:

llvm::Value *NumberExprAST::codegen() {
  return llvm::ConstantFP::get(TheContext, llvm::APFloat(Val));
}

In LLVM IR, numeric literals are represented by the ConstantFP class, which holds the value in an APFloat member (AP stands for Arbitrary Precision). Next is codegen() for variable expressions:

llvm::Value *VariableExprAST::codegen() {
  llvm::Value *V = NamedValues[Name];
  if (!V) {
    LogErrorV("Unknown variable name");
  }
  return V;
}

For now, variables can only be referenced inside function bodies, so we simply check that the referenced name is in map<string, Value*> NamedValues and return its value. Later in the tutorial we will add support for loop induction variables and local variables.

llvm::Value *BinaryExprAST::codegen() {
  llvm::Value *L = LHS->codegen();
  llvm::Value *R = RHS->codegen();
  if (!L || !R) {
    return nullptr;
  }

  switch (Op) {
  case '+':
    return Builder.CreateFAdd(L, R, "addtmp");
  case '-':
    return Builder.CreateFSub(L, R, "subtmp");
  case '*':
    return Builder.CreateFMul(L, R, "multmp");
  case '<':
    L = Builder.CreateFCmpULT(L, R, "cmptmp");
    // Convert bool 0/1 to double 0.0 or 1.0
    return Builder.CreateUIToFP(L, llvm::Type::getDoubleTy(TheContext), "booltmp");
  default:
    return LogErrorV("invalid binary operator");
  }
}

Binary expressions are also generated recursively: we emit code for the left-hand side, then the right-hand side, and then the binary operation itself. In the code above, the IRBuilder member functions such as CreateFAdd() create new instructions from the given operands. The name passed to CreateFAdd() is only a hint; if the surrounding code emits several CreateFAdd(L, R, "addtmp") instructions, LLVM automatically appends a unique suffix to each.

LLVM instructions are strictly typed: for an add instruction, the left and right operands must have the same type, and the result type must match the operand type. Since every value in Kaleidoscope is a double, we do not need to worry about types here.

Also, per the LLVM documentation, the fcmp instruction returns a one-bit integer (i1). Our language has no integer type, so we want 0.0 or 1.0 instead. We therefore use the uitofp instruction, which treats its first operand as an unsigned integer and converts it to the floating point type given as the second operand (double). The sitofp instruction does the same thing but treats its input as a signed integer.
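For reference, here is the kind of IR the '<' case above produces (our own illustration; the value names come from the hints passed to the builder):

// Illustrative IR for "x < y":
//   %cmptmp  = fcmp ult double %x, %y        ; one-bit i1 result
//   %booltmp = uitofp i1 %cmptmp to double   ; converted to 0.0 or 1.0

Next is code generation for function calls: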

llvm::Value *CallExprAST::codegen() {
  // Look the name up in the global module table.
  llvm::Function *CalleeF = TheModule->getFunction(Callee);
  if (!CalleeF) {
    return LogErrorV("Unknown function referenced");
  }

  // Argument checking.
  if (CalleeF->arg_size() != Args.size()) {
    return LogErrorV("Incorrect number of arguments passed");
  }

  std::vector<llvm::Value *> ArgsV;
  for (unsigned i = 0, e = Args.size(); i != e; ++i) {
    ArgsV.push_back(Args[i]->codegen());
    if (!ArgsV.back()) {
      return nullptr;
    }
  }

  return Builder.CreateCall(CalleeF, ArgsV, "calltmp");
}

Code generation for function calls is also straightforward. We first look the function name up in the LLVM Module's symbol table: the Module is the container that holds the functions, and resolving a name in its symbol table gives us the function to call.

Once we have the function, we check that the number of arguments it expects matches the number actually passed, recursively generate code for each argument, and then create an LLVM call instruction.

Function Code Generation

Code generation for prototypes and function definitions involves a number of details, so it is less clean and direct than the expression code above, but it illustrates some important concepts. First, the prototype:

llvm::Function *PrototypeAST::codegen() {
  // Make the function type: double(double, double).
  std::vector<llvm::Type *> Doubles(Args.size(), llvm::Type::getDoubleTy(TheContext));
  llvm::FunctionType *FT = llvm::FunctionType::get(llvm::Type::getDoubleTy(TheContext), Doubles, false);
  llvm::Function *F = llvm::Function::Create(FT, llvm::Function::ExternalLinkage, Name, TheModule.get());

This code packs a lot into a few lines. First, note that Function *PrototypeAST::codegen() returns Function * rather than Value *: a prototype describes the external interface of a function, not a value that can be computed like an expression. The call to FunctionType::get() builds the function type: the first argument is the return type, the second is the list of parameter types, and the third says whether the function is variadic.

The last line is the one that actually creates the IR object. Function::Create() takes the function type, the linkage (the best known kinds are external and internal linkage, meaning the symbol is or is not visible to other translation units (Modules), like the C/C++ keywords extern and static), the function name, and TheModule. In an older version of the LLVM tutorial, TheModule (a unique_ptr) was passed directly instead of TheModule.get(), which causes a compiler error.

  // Set names for all arguments.
  unsigned Idx = 0;
  for (auto &Arg : F->args()) {
    Arg.setName(Args[Idx++]);
  }

  return F;
}

We then name the IR arguments after the parameter names in the prototype. This makes the generated IR more readable, and it lets later code refer to parameters by name directly instead of having to look them up in the prototype's AST node.

Next comes the function definition:

llvm::Function *FunctionAST::codegen() {
  // Check for an existing function from a previous `extern` declaration.
  llvm::Function *TheFunction = TheModule->getFunction(Proto->getName());

  if (!TheFunction) {
    TheFunction = Proto->codegen();
  }

  if (!TheFunction) {
    return nullptr;
  }

  if (!TheFunction->empty()) {
    return (llvm::Function *)LogErrorV("Function cannot be redefined");
  }

For a function definition, we first search TheModule's symbol table for an existing function with this name, which tells us whether the function was already forward-declared with extern. If not, we generate code for the prototype. We also assert that the function body is still empty at this point, i.e., that this definition is not a redefinition.

  // Create a new basic block to start insertion into.
  llvm::BasicBlock *BB = llvm::BasicBlock::Create(TheContext, "entry", TheFunction);
  Builder.SetInsertPoint(BB);

  // Record the function arguments in the NamedValues map.
  NamedValues.clear();
  for (auto &Arg : TheFunction->args()) {
    NamedValues[Arg.getName()] = &Arg;
  }

The first line creates a new basic block and inserts it into TheFunction. The second line tells the Builder that new instructions should be appended to the end of that basic block. Basic blocks define the control flow graph in LLVM; we have no control flow yet, but we will add it in Chapter 5.

Next we add the function arguments to map<string, Value *> NamedValues so that VariableExprAST nodes inside the function body can refer to them.

  if (llvm::Value *RetVal = Body->codegen()) {
    // Finish off the function.
    Builder.CreateRet(RetVal);
    // Validate the generated code, checking for consistency.
    llvm::verifyFunction(*TheFunction);

    return TheFunction;
  }

  TheFunction->eraseFromParent();
  return nullptr;
}

Once the insertion point is set, we call codegen() on the function body. If no error occurs, this emits code for the body expression into the entry basic block defined above and returns the value the expression computes. We then create a ret instruction to complete the function and call verifyFunction() to check the consistency of the generated code. If generating the body fails, we call eraseFromParent() to delete the partially built function.

This code does have a problem, though: if FunctionAST::codegen() finds an existing function declaration, it skips generating the prototype's code, and therefore skips setting the parameter names. So when the parameter names in a forward declaration differ from those in the definition, we get an error:

extern foo(a);  # ok, declare foo
def foo(b);     # Error: Unknown variable name

Driver changes and Closing Thoughts

The changes are mostly in main() and the Handle* functions such as HandleDefinition(); the relevant differences are guarded with preprocessor macros, so the code itself is the best reference.
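As one hedged sketch (the actual driver may differ in details), HandleDefinition() now runs codegen() on the parsed AST and prints the resulting IR:

// Chapter-3 driver change: after parsing, run codegen() and print the IR
// instead of only reporting a successful parse.
void HandleDefinition() {
  if (auto FnAST = ParseDefinition()) {
    if (auto *FnIR = FnAST->codegen()) {
      fprintf(stderr, "Read function definition:\n");
      FnIR->print(llvm::errs());
      fprintf(stderr, "\n");
    }
  } else {
    // Skip token for error recovery.
    getNextToken();
  }
}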

Chapter 2 Introduction

This chapter uses the lexer implemented in Chapter 1 to build a parser for Kaleidoscope. Once we have the parser, we can define and build an Abstract Syntax Tree (AST).

The Abstract Syntax Tree (AST)

The AST is an abstract representation of the syntactic structure of source code: each node in the tree represents one construct of the language, which models the program in a form convenient for the compiler's later stages (code generation). In Kaleidoscope we have expressions, prototypes, and function objects. We start with expressions:

// ExprAST - Base class for all expression nodes
class ExprAST {
public:
  virtual ~ExprAST() {}
};

// NumberExprAST - Expression class for numeric literals like "1.0"
class NumberExprAST : public ExprAST {
  double Val;

public:
  NumberExprAST(double Val) : Val(Val) {}
};

The code above defines the base class ExprAST and one of its subclasses, NumberExprAST. NumberExprAST stores the value of a numeric literal in a member variable so that the compiler can process it in later stages.

Here are the other expression AST node definitions:

// VariableExprAST - Expression class for referencing a variable, like "a"
class VariableExprAST : public ExprAST {
  std::string Name;

public:
  VariableExprAST(const std::string &Name) : Name(Name) {}
};

// BinaryExprAST - Expression class for a binary operator
class BinaryExprAST : public ExprAST {
  char Op;
  std::unique_ptr<ExprAST> LHS, RHS;

public:
  BinaryExprAST(char Op, std::unique_ptr<ExprAST> LHS, std::unique_ptr<ExprAST> RHS)
      : Op(Op), LHS(std::move(LHS)), RHS(std::move(RHS)) {}
};

// CallExprAST - Expression class for function calls
class CallExprAST : public ExprAST {
  std::string Callee;
  std::vector<std::unique_ptr<ExprAST>> Args;

public:
  CallExprAST(const std::string &Callee, std::vector<std::unique_ptr<ExprAST>> Args)
      : Callee(Callee), Args(std::move(Args)) {}
};

These definitions are all straightforward. A variable reference holds the variable's name in a std::string Name; a binary operator holds the operator in char Op and owns the left and right subexpressions via std::unique_ptr<ExprAST> LHS, RHS; a function call stores the callee's name in std::string Callee and owns the argument list in std::vector<std::unique_ptr<ExprAST>> Args.

For our most basic language, there is no conditional control flow yet, so it is not Turing-complete; we will add conditionals later. Next we discuss function interfaces and function definitions:

// PrototypeAST - This class represents the "prototype" for a function,
// which captures its name and its argument names (thus implicitly the
// number of arguments the function takes).
class PrototypeAST {
  std::string Name;
  std::vector<std::string> Args;

public:
  PrototypeAST(const std::string &Name, std::vector<std::string> Args)
      : Name(Name), Args(std::move(Args)) {}

  const std::string &getName() const { return Name; }
};

// FunctionAST - This class represents a function definition itself
class FunctionAST {
  std::unique_ptr<PrototypeAST> Proto;
  std::unique_ptr<ExprAST> Body;

public:
  FunctionAST(std::unique_ptr<PrototypeAST> Proto, std::unique_ptr<ExprAST> Body)
      : Proto(std::move(Proto)), Body(std::move(Body)) {}
};

Note that in Kaleidoscope, prototypes and function definitions are not expressions, so they do not inherit from ExprAST. Also, since the only type in Kaleidoscope is double, the type of a function is determined entirely by its number of arguments. In a more sophisticated language, ExprAST would have a dedicated field for type information.

Parser Basics

We need a parser to build the AST. For example, to parse an expression like x + y into an AST, we would do something like this:

auto LHS = std::make_unique<VariableExprAST>("x");
auto RHS = std::make_unique<VariableExprAST>("y");
auto Result = std::make_unique<BinaryExprAST>('+', std::move(LHS), std::move(RHS));

To make this possible, we first define some basic helpers:

// CurTok - the current token the parser is looking at.
static int CurTok;
// getNextToken - reads another token from the lexer and updates CurTok with the result.
int getNextToken() {
  return CurTok = gettok();
}

This implements a simple one-token buffer on top of the lexer, letting us hold on to the token the lexer just returned. Every function in the parser assumes that static int CurTok is the token currently awaiting processing.

Next, some simple error handling helpers:

// LogError* - These are little helper functions for error handling.
std::unique_ptr<ExprAST> LogError(const char *Str) {
  fprintf(stderr, "LogError: %s\n", Str);
  return nullptr;
}
std::unique_ptr<PrototypeAST> LogErrorP(const char *Str) {
  LogError(Str);
  return nullptr;
}

Basic Expression Parsing

We start with numeric literals. Each kind of expression gets its own parsing function; here is the one for numeric literals:

// numberexpr ::= number
std::unique_ptr<ExprAST> ParseNumberExpr() {
  auto Result = std::make_unique<NumberExprAST>(NumVal);
  // consume the number
  getNextToken();
  return std::move(Result);
}

This function is called when gettok() returns tok_number. At that point static double NumVal already holds the value, so ParseNumberExpr() builds a NumberExprAST node, advances the lexer to the next token, and returns the node.

This shows the recurring pattern of a recursive descent parser: consume all of the tokens that belong to the current expression, build the node, advance the lexer past the expression, and return the node.

Here is the parenthesized expression as another example:

// parenexpr ::= '(' expression ')'
std::unique_ptr<ExprAST> ParseParenExpr() {
  // eat '('
  getNextToken();
  auto V = ParseExpression();
  if (!V) {
    return nullptr;
  }
  if (CurTok != ')') {
    return LogError("Expected ')'");
  }
  // eat ')'
  getNextToken();
  return V;
}

This function illustrates a few things:

  1. It shows how to use the LogError helpers to report errors.

  2. It recursively calls ParseExpression to parse the inner expression; ParseExpression in turn dispatches to the parser function for each kind of expression. This mutual recursion is what lets us handle recursive grammars.

Next come the parser functions for variable references and function calls:

// identifierexpr
//   ::= identifier
//   ::= identifier '(' expression* ')'
std::unique_ptr<ExprAST> ParseIdentifierExpr() {
  std::string IdName = IdentifierStr;
  // eat identifier
  getNextToken();

  if (CurTok != '(') {
    // It's a variable reference.
    return std::make_unique<VariableExprAST>(IdName);
  }

  // It's a function call.

  // eat '('
  getNextToken();
  std::vector<std::unique_ptr<ExprAST>> Args;

  if (CurTok != ')') {
    while (1) {
      if (auto Arg = ParseExpression()) {
        Args.push_back(std::move(Arg));
      } else {
        return nullptr;
      }

      if (CurTok == ')') {
        break;
      }
      if (CurTok != ',') {
        return LogError("Expected ')' or ',' in argument list");
      }

      getNextToken();
    }
  }
  // eat ')'
  getNextToken();

  return std::make_unique<CallExprAST>(IdName, std::move(Args));
}

Like the functions above, this is called when gettok() returns tok_identifier, and it too uses recursion and error handling. It uses lookahead to decide whether the identifier is a plain variable reference or a function call: after consuming the identifier, it checks whether the next token is '('. If so, it builds the argument list and returns a CallExprAST node; otherwise it returns a VariableExprAST node.

Now that we have the individual expression parsers, we can define a helper, the primary parser, that wraps them and dispatches on static int CurTok:

// primary
//   ::= identifierexpr
//   ::= numberexpr
//   ::= parenexpr
std::unique_ptr<ExprAST> ParsePrimary() {
  switch (CurTok) {
  case tok_identifier:
    return ParseIdentifierExpr();
  case tok_number:
    return ParseNumberExpr();
  case '(':
    return ParseParenExpr();
  default:
    return LogError("unknown token when expecting an expression");
  }
}

Binary Expression Parsing

Binary expressions are harder to parse because they are ambiguous. Given the string x + y * z, the parser could interpret it either as (x + y) * z or as x + (y * z); the second is the conventional reading.

We resolve this with operator precedence: Operator-Precedence Parsing uses the precedence of the operators to dictate the order of recursion, i.e., which subexpression to handle first. First we define a table of operator precedences:

// BinOpPrecedence - Holds the precedence for each binary operator that is defined.
static std::map<char, int> BinOpPrecedence;

// GetTokPrecedence - Get the precedence of the pending binary operator token.
int GetTokPrecedence() {
  if (!isascii(CurTok)) {
    return -1;
  }

  int TokPrec = BinOpPrecedence[CurTok];
  // Make sure it's a declared binOp.
  if (TokPrec <= 0) {
    return -1;
  }
  return TokPrec;
}

// InitTokPrecedence - Install the standard binary operators with their precedence.
void InitTokPrecedence() {
  BinOpPrecedence['<'] = 10;
  BinOpPrecedence['+'] = 20;
  BinOpPrecedence['-'] = 20;
  BinOpPrecedence['*'] = 40;  // highest
}

In this basic form of Kaleidoscope we support only four binary operators, but more can be added as needed. GetTokPrecedence() returns the precedence of the current static int CurTok.

With these helpers in place, we can parse binary expressions. The idea of operator-precedence parsing is to break an ambiguous expression into pieces. For the expression a + b + (c + d) * e * f + g, for example, we can view it as a stream of primary expressions separated by binary operators: first the primary expression a, then the pairs [+, b], [+, (c + d)], [*, e], [*, f] and [+, g].

So an expression is a primary expression followed by a possibly empty sequence of [binOp, primary expr] pairs:

// expression
//   ::= primary binOpRHS
std::unique_ptr<ExprAST> ParseExpression() {
  auto LHS = ParsePrimary();
  if (!LHS) {
    return nullptr;
  }
  return ParseBinOpRHS(0, std::move(LHS));
}

ParseBinOpRHS() parses that sequence of [binOp, primary expr] pairs. Its arguments are a precedence (0 in the example above) and the part that has been parsed so far (LHS above). For the expression x, this splits into x plus an empty binOpRHS. For a + b + (c + d) * e * f + g, a is a valid primary expression, so it becomes LHS, the binOpRHS is + b + (c + d) * e * f + g, and the current static int CurTok is +.

The precedence value passed to ParseBinOpRHS() indicates the minimal operator precedence the function is allowed to consume. If the current pair stream is [+, x] and the precedence argument is 40, it will not consume any tokens (because + has precedence 20, which is less than 40). With that:

// binOpRHS
//   ::= (binOp primary)*
std::unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec, std::unique_ptr<ExprAST> LHS) {
  while (1) {
    // If this is a binOp, find its precedence.
    int TokPrec = GetTokPrecedence();

    // If this is a binOp that binds at least as tightly as the current
    // binOp, consume it; otherwise we are done.
    if (TokPrec < ExprPrec) {
      return LHS;
    }

    // We know it's a binOp.
    int BinOp = CurTok;
    // eat binOp
    getNextToken();

    // Parse the primary expression after the binOp.
    auto RHS = ParsePrimary();
    if (!RHS) {
      return nullptr;
    }

    // If binOp binds less tightly with RHS than the operator after RHS,
    // let the pending operator take RHS as its LHS.
    int NextPrec = GetTokPrecedence();
    if (TokPrec < NextPrec) {
      // a + b * c -> a + (b * c)
      RHS = ParseBinOpRHS(TokPrec + 1, std::move(RHS));
      if (!RHS) {
        return nullptr;
      }
    }

    // a + b + c -> (a + b) + c
    LHS = std::make_unique<BinaryExprAST>(BinOp, std::move(LHS), std::move(RHS));

    // Loop until the end of the expression.
  }
}

ParseBinOpRHS() first compares the precedence of the current token against the precedence passed in. If the precedence of the current binOp, TokPrec, is lower than the argument, the primary expression to the left of the binOp belongs to an outer computation, so we simply return LHS. If the binOp's precedence is not lower, we save the operator in int BinOp, call getNextToken(), parse the primary expression to the right of the binOp, and hold it in RHS. At this point static int CurTok is the next binOp, whose precedence is NextPrec. We then have to decide whether RHS binds more tightly to LHS or to the following primary expression. If NextPrec is greater than TokPrec, RHS binds more tightly to its right, so we recursively call ParseBinOpRHS() with TokPrec + 1 as the first argument and RHS as the second; this finishes the whole higher-precedence run and returns it as the new RHS. If NextPrec is not greater than TokPrec, then LHS and RHS bind together: we combine them into a new LHS and loop to process the rest of the expression.

When recursing into ParseBinOpRHS() for a higher-precedence binOp, the argument must be TokPrec + 1 (not TokPrec) so that the recursion exits at the first operator whose precedence is not strictly higher.
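To make the control flow concrete, here is a worked trace (our own) for a + b * c + d, using the precedences installed above:

// Worked trace for "a + b * c + d", starting at ParseBinOpRHS(0, a):
//   '+' has prec 20 >= 0           -> BinOp = '+', RHS = b
//   next op '*' has prec 40 > 20   -> recurse ParseBinOpRHS(21, b)
//     '*' has prec 40 >= 21        -> BinOp = '*', RHS = c
//     next op '+' has prec 20 < 21 -> return (b*c)
//   LHS = (a + (b*c))
//   '+' has prec 20 >= 0           -> BinOp = '+', RHS = d
//   LHS = ((a + (b*c)) + d); no more operators -> return LHS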

Parsing the Rest

Next we handle prototypes and function definitions. A prototype is parsed as follows:

// prototype
//   ::= id '(' id* ')'
std::unique_ptr<PrototypeAST> ParsePrototype() {
  if (CurTok != tok_identifier) {
    return LogErrorP("Expected function name in prototype");
  }

  std::string FnName = IdentifierStr;
  // eat the function name
  getNextToken();

  if (CurTok != '(') {
    return LogErrorP("Expected '(' in prototype");
  }

  // Feed the argument list into the vector.
  std::vector<std::string> ArgNames;
  while (getNextToken() == tok_identifier) {
    ArgNames.push_back(IdentifierStr);
  }

  if (CurTok != ')') {
    return LogErrorP("Expected ')' in prototype");
  }

  // eat ')'
  getNextToken();

  return std::make_unique<PrototypeAST>(FnName, std::move(ArgNames));
}

A function definition is just a prototype plus a body expression:

// function definition
//   ::= 'def' prototype expression
std::unique_ptr<FunctionAST> ParseDefinition() {
  // eat def
  getNextToken();

  auto Proto = ParsePrototype();
  if (!Proto) {
    return nullptr;
  }

  if (auto E = ParseExpression()) {
    return std::make_unique<FunctionAST>(std::move(Proto), std::move(E));
  }
  return nullptr;
}

We also support forward declarations with the extern keyword:

// external forward declaration
//   ::= 'extern' prototype
std::unique_ptr<PrototypeAST> ParseExtern() {
  // eat 'extern'
  getNextToken();
  return ParsePrototype();
}

Finally, we let the user type arbitrary top-level (non-nested) expressions, which we wrap in anonymous nullary functions:

// toplevelexpr
//   ::= expression
std::unique_ptr<FunctionAST> ParseTopLevelExpr() {
  if (auto E = ParseExpression()) {
    // Make an anonymous nullary function.
    auto Proto = std::make_unique<PrototypeAST>("", std::vector<std::string>());
    return std::make_unique<FunctionAST>(std::move(Proto), std::move(E));
  }
  return nullptr;
}

The Driver

The driver is simply a loop that dispatches to the appropriate parser function:

void HandleDefinition() {
  if (ParseDefinition()) {
    fprintf(stderr, "Parsed a function definition.\n");
  } else {
    // Skip token for error recovery.
    getNextToken();
  }
}

void HandleExtern() {
  if (ParseExtern()) {
    fprintf(stderr, "Parsed an extern.\n");
  } else {
    // Skip token for error recovery.
    getNextToken();
  }
}

void HandleTopLevelExpression() {
  if (ParseTopLevelExpr()) {
    fprintf(stderr, "Parsed a top-level expression.\n");
  } else {
    // Skip token for error recovery.
    getNextToken();
  }
}

// top ::= definition | external | expression | ';'
void MainLoop() {
  while (1) {
    fprintf(stderr, "ready> ");
    switch (CurTok) {
    case ';': // ignore top-level semicolons
      getNextToken();
      break;
    case tok_def:
      HandleDefinition();
      break;
    case tok_extern:
      HandleExtern();
      break;
    case tok_eof:
      return;
    default:
      HandleTopLevelExpression();
      break;
    }
  }
}

The Basic Language

Kaleidoscope is a procedural language that lets you define functions, use conditionals, do math, and so on. Over the course of the tutorial we will extend Kaleidoscope with an if/then/else construct, loops, and user-defined operators.

The only datatype in Kaleidoscope is a 64-bit floating point type (double in C). Here is the Fibonacci function implemented in Kaleidoscope:

# Compute the x'th Fibonacci number.
def fib(x)
  if x < 3
  then 1
  else fib(x-1) + fib(x-2)

# This expression will compute the 40th number.
fib(40)

We also let Kaleidoscope call standard library functions (the LLVM JIT makes this particularly easy). You use the extern keyword to declare a function before using it:

extern sin(arg);
extern cos(arg);
extern atan2(arg1 arg2);

atan2(sin(.4), cos(42))

The Lexer

When implementing a language, the first thing needed is the ability to process a source file and figure out what it says. The traditional approach is to use a lexer (scanner) to break the input into tokens. Each token returned by the lexer includes a token code and possibly some metadata (e.g., the numeric value of a number). First we define the relevant data:

// The lexer returns tokens [0-255] if it is an unknown character,
// otherwise one of these known things.
enum Token {
  tok_eof = -1,

  // commands
  tok_def = -2,
  tok_extern = -3,

  // primary
  tok_identifier = -4,
  tok_number = -5,
};

// Filled in if tok_identifier
static std::string IdentifierStr;
// Filled in if tok_number
static double NumVal;

Each token returned by the lexer is either one of the Token enum values or an "unknown" character such as '+', returned as its ASCII value. If the current token is an identifier, IdentifierStr holds its name; if it is a number, NumVal holds its value.

The lexer is implemented by a single function, gettok(), which returns the next token from standard input each time it is called.

int gettok() {
  static int LastChar = ' ';

  // Skip any whitespace.
  while (isspace(LastChar)) {
    LastChar = getchar();
  }

gettok reads characters from standard input with getchar() and skips whitespace between tokens, simply remembering the last character read in LastChar without further processing.

Next, gettok recognizes identifiers and keywords:

  if (isalpha(LastChar)) {
    IdentifierStr = LastChar;
    while (isalnum(LastChar = getchar())) {
      IdentifierStr += LastChar;
    }

    if (IdentifierStr == "def") return tok_def;
    if (IdentifierStr == "extern") return tok_extern;

    return tok_identifier;
  }

If the input is neither an identifier nor a keyword, we check whether it is a number:

  if (isdigit(LastChar) || LastChar == '.') {
    std::string NumStr;
    do {
      NumStr += LastChar;
      LastChar = getchar();
    } while (isdigit(LastChar) || LastChar == '.');

    NumVal = strtod(NumStr.c_str(), nullptr);
    return tok_number;
  }

When the characters form a number, strtod() converts them to a double stored in NumVal. Note that this does no error checking; you can add it if needed.

Next we handle comments:

  if (LastChar == '#') {
    // Comment until end of line.
    do {
      LastChar = getchar();
    } while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');

    // Return the next token.
    if (LastChar != EOF)
      return gettok();
  }

A comment skips everything after '#' up to the end of the line or EOF. Afterwards we need to process the input again and return the next token, so we simply tail-call gettok(). If the input matches neither an identifier/keyword nor a number, it can only be an operator character such as '+', or EOF:

  // Check for end of file. Don't eat the EOF.
  if (LastChar == EOF) {
    return tok_eof;
  }

  // Otherwise, just return the character as its ASCII value.
  int ThisChar = LastChar;
  LastChar = getchar();
  return ThisChar;
}

And that is the entire Kaleidoscope lexer.

Introduction

As a defense, information hiding is more efficient than integrity-based defenses. In particular, randomization is almost "free", whereas even a sophisticated integrity-based defense against code reuse attacks such as Code Pointer Integrity adds a modest 2.9% performance overhead.

Unfortunately, recent research demonstrates that attackers can bypass even the most advanced information-hiding defenses. By repeatedly probing the address space (either directly or by means of side channels), it is possible to break the underlying randomization and reveal the sensitive data.

In this paper, we show that we can transform fast information hiding into strong software integrity if (and only when) attacks start probing to break the randomization.

Derandomization primitives. To break randomization, attackers make use of a number of derandomization primitives. Since one-shot leaks are rare in modern defenses (they move all sensitive information, e.g., code pointers, out of the attacker's reach), state-of-the-art derandomization primitives invariably must probe, repeatedly executing an operation to exhaust the entropy.

Selective hardening. The key idea we present is that, in software protected by a fast baseline defense (information hiding), we keep monitoring the running program for any occurrence of probing attempts. When we encounter such an attempt, we automatically locate its origin and patch only the offending piece of code at runtime with stronger and more expensive integrity-based defenses.

The first stage of ProbeGuard is a form of anomaly detection: we detect the probing attempts that characterize derandomization primitives. For most varieties of probing attacks, the anomaly detection itself is simple and non-intrusive (e.g., detection of repeated exceptions).

The second stage, probe analysis, uncovers the particular code site the attacker abused for probing. Doing so is complicated in the general case, but by leveraging fast control-flow tracing features (e.g., Intel Processor Trace), ProbeGuard conservatively pinpoints the offending code fragment in a secure way.

Finally, ProbeGuard hotpatches the program, selectively replacing the offending code fragment with a hardened variant. In principle, ProbeGuard is agnostic to the hotpatching technique itself. A simple and elegant approach is to create a binary that already contains multiple versions of all code fragments, where each version offers a different level of protection.

Threat model

We consider a determined remote attacker who aims to mount a code reuse attack over the network on a server application hardened by an (ideal) state-of-the-art information-hiding based defense. ProbeGuard's goal is to address the fundamental weakness of practical (information-hiding based) code reuse defenses, making them resistant to attacks that bypass the defense by derandomizing hidden memory regions.

We trust the underlying operating system, and we assume a modern processor that provides efficient control flow tracing, such as Intel Processor Trace.

We assume a determined attacker who has access to derandomization primitives to probe the victim’s address space. Further, we assume that the attacker has unlimited probing attempts as the application recovers automatically upon any crash.

Design

The application employs a state-of-the-art information-hiding based defense; ProbeGuard must ensure that what is hidden remains hidden. We embed anomaly detectors within the application that sense probing attacks, along with a secure code cache consisting of a collection of code fragments hardened by LLVM-based integrity-checking instrumentation. A separate reactive defense server decodes execution traces obtained by Intel PT and performs fast probe analysis. ProbeGuard then reactively activates hardened code fragments by hotpatching when under attack.

Anomaly detection

Arbitrary reads and writes. An attacker may exploit an arbitrary memory read or write vulnerability in the application with the goal of derandomizing the hidden region. Typically, only a very small fraction of the application's virtual address space is actually mapped, so when the attacker uses such a vulnerability on a random address, it is highly likely to hit an unmapped virtual memory address, leading to a segmentation fault (a crash). We detect such probing attacks by simply handling and proxying the signal with a custom SIGSEGV handler.
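A minimal sketch of such a detector, assuming POSIX signal handling (the handler name and the elided recovery logic are ours, not ProbeGuard's):

#include <signal.h>
#include <stdio.h>

// On SIGSEGV, treat a fault at an unmapped address as a probing attempt
// and hand control to probe analysis instead of letting the process die.
static void ProbeSigsegvHandler(int sig, siginfo_t *info, void *ctx) {
  fprintf(stderr, "probe suspected at %p\n", info->si_addr);
  // ... notify the reactive defense server, hotpatch, then recover ...
}

static void InstallProbeDetector() {
  struct sigaction sa = {};
  sa.sa_sigaction = ProbeSigsegvHandler;
  sa.sa_flags = SA_SIGINFO;
  sigaction(SIGSEGV, &sa, nullptr);
}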

Kernel reads and writes. Attackers prefer to probe silently and avoid detection. Hence, to avoid the crashes, they could also attempt to derandomize the victim application's address space by probing memory via the kernel. Certain system calls (e.g., read) accept memory addresses in their argument list and return specific error codes (e.g., EFAULT) if an argument points to an inaccessible or unallocated memory location. Using arbitrary-read/write primitives on such arguments, attackers can mount CROP attacks that probe without crashing the application (and thereby without generating SIGSEGV signals). We can detect such probing attacks by intercepting system calls, either in glibc or directly in the kernel, and inspecting their results.
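A hedged sketch of the glibc-side interception, with ReportProbe() as a hypothetical hook into probe analysis:

#include <errno.h>
#include <unistd.h>

void ReportProbe(const void *addr);  // hypothetical hook into probe analysis

// Wrapper illustrating EFAULT inspection around a syscall such as read().
ssize_t checked_read(int fd, void *buf, size_t count) {
  ssize_t ret = read(fd, buf, count);
  if (ret < 0 && errno == EFAULT)
    ReportProbe(buf);  // a CROP-style probe hit inaccessible memory
  return ret;
}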

Arbitrary jumps. An attacker can use this primitive to scan the address space, looking for valid code pointers and then locating gadgets. An attempt to execute unmapped or non-executable memory results in either a segmentation fault (raising SIGSEGV) or an illegal-instruction exception (raising SIGILL), so we extend our custom signal handler to handle both signals.

Allocation oracles. Such probes exploit memory allocation functions in the target application by attempting to allocate large memory areas. Success or failure of the allocation leaks information about the size of holes in the address space, which in turn helps locate the hidden region. We hook into glibc to intercept the system calls used to allocate memory (e.g., mmap() and brk()). We choose a configurable threshold on allocation size, above which our detector triggers reactive hardening (half of the address space by default).
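A sketch of the threshold check; assuming a 47-bit user address space, half of it is 2^46 bytes (both the constant and guarded_mmap() are illustrative):

#include <sys/mman.h>

void ReportProbe(const void *addr);  // hypothetical hook, as above

static const size_t kAllocThreshold = 1ULL << 46;  // half the address space

// Wrapper illustrating the allocation-size detector around mmap().
void *guarded_mmap(void *addr, size_t len, int prot, int flags,
                   int fd, off_t off) {
  if (len >= kAllocThreshold)
    ReportProbe(addr);  // oversized request: likely an allocation oracle
  return mmap(addr, len, prot, flags, fd, off);
}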

Probe analysis

Upon an anomaly detector flagging a potential attack, ProbeGuard must locate the offending code fragment. To locate the probing primitive, we employ hardware-assisted branch tracing to fetch the control flow leading up to the detected anomaly, and we build a reverse mapping to recover source-level information from the trace.

We obtain the recently executed control flow using Intel PT, which offers low-overhead and secure branch tracing. Control bits in the CPU's model-specific registers (MSRs) allow the operating system kernel to turn this hardware feature on or off. The buffer size is configurable; typical values range from 2 MB to 4 MB or more. The backward trace analysis can limit itself to the relevant recent control-flow history and avoid decoding the trace in its entirety.

We use the perf record interface's snapshot mode and dump the trace when required. Even with the decoded trace in hand, however, it is hard to map it back to the source and determine the offending code fragment.

The probe analyzer must locate the affected spot in the source code. We repurpose a field in LLVM's debug metadata that normally carries the column number of a source location to instead hold basic block identifiers. This simplifies our prototype implementation: LLVM's default code generator passes the metadata through DWARF 4.0 symbols onto the resulting application binary, so we avoid having to define a new metadata stream and write the supporting code.

We choose to mark the entire parent function that includes the probing primitive and use this for hardening.

Hotpatching

To facilitate hotpatching, we first transform the program using our LLVM compiler passes. The goal is to be able to quickly and efficiently replace each vanilla variant of a function with a different (hardened) variant of the same function at runtime. We clone all functions found in the target application's LLVM IR and selectively invoke security-hardening instrumentation passes on specific function clones at compile time.

A global switchboard (which we insert in the application) allows switching between each function variant at runtime. It contains an entry for each function in the program, controlling which of the variants to use during execution.
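An illustrative model of the switchboard (all identifiers here are ours): each call dispatches through a per-function entry, so flipping a single entry hotpatches a single function.

// One entry per function; 0 selects the vanilla clone, 1 the hardened one.
// The table is kept read-only except during hotpatching.
enum { FOO_ID = 0, NUM_FUNCS = 1 };   // illustrative identifiers
static unsigned char Switchboard[NUM_FUNCS];

double foo_vanilla(double x);         // original code
double foo_hardened(double x);        // same body plus integrity checks

// Dispatcher the compiler pass would emit for foo():
double foo(double x) {
  return Switchboard[FOO_ID] ? foo_hardened(x) : foo_vanilla(x);
}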

To deter attacks against ProbeGuard, we mark the switchboard as read-only during normal execution.

Selective security hardening

Arbitrary memory reads and writes. Software Fault Isolation (SFI) mitigates probing attempts that use arbitrary reads and writes. It simply instruments every load or store operation in the application binary, masking the target memory location with a bit mask on the 47th bit of the memory pointer (a sketch follows this list).

Kernel reads and writes. We mask all pointer arguments to library calls.

Arbitrary jumps. We implemented a type-based CFI policy for forward-edge protection, and a per-thread shadow stack for backward-edge protection.

Allocation oracles. A whitelist-based threshold on the size arguments of library functions.
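A minimal sketch of the masking idea, assuming the bit-47 scheme described above (the helper is ours; the real pass instruments machine code directly):

#include <stdint.h>

// Clear bit 47 of the target address so an instrumented load/store can
// never reach the half of the address space where hidden data may live.
static inline void *MaskPointer(void *p) {
  return (void *)((uintptr_t)p & ~(1ULL << 47));
}

// An instrumented store then looks like:
//   *(long *)MaskPointer(addr) = value;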

Implementation

ProbeGuard’s implementation consists of the following:

  1. A static library linked with the application: it houses a signal handler registered at startup, and helps in hotpatching to support switching between function variants at runtime.

  2. glibc modifications to intercept mmap() and syscalls that result in EFAULT.

  3. LLVM compiler passes to generate and propagate function identifying markers onto the binary via DWARF 4.0 symbols, and function cloning to facilitate hotpatching.

  4. A separate reactive defense server that performs probe analysis, fetching Intel PT traces with libipt and mapping them onto the binary by reading the markers with libdwarf.

Introduction

Defenses against memory corruption typically reduce the attack surface by preventing the adversary from corrupting the parts of the application's memory that are essential for a successful attack; examples include W⨁X, CFI, CPI, and DFI.

Some of these defenses can be implemented efficiently using mechanisms that reside entirely outside the underlying application process. For instance, the kernel configures W⨁X and the hardware enforces it. However, using an external mechanism is not always feasible in practice due to high performance overhead. For instance, CFI requires run-time checks and a shadow stack, which is updated every time a function is invoked or returns. CPI requires run-time checks and a safe region, which contains metadata about the program's variables. This data cannot be stored outside of the process, e.g., in kernel memory, because accessing it would impose an impractical performance overhead due to the time needed for a context switch. Hence, to prevent the adversary from accessing the data, some form of in-process memory isolation is needed.

Goals and Contribution. In this paper, we present IMIX, which enables lightweight in-process memory isolation for memory-corruption defenses that target the x86 architecture. IMIX enables isolated pages, which are marked with a special flag. Isolated pages can only be accessed using a single new instruction called smov.

Adversary model

  • Memory corruption. We assume the presence of a memory-corruption vulnerability, which the adversary can repeatedly exploit to read and write data according to memory access permissions.

  • Sandboxed code execution. We assume memory-corruption mitigations cannot be bypassed unless the attacker can corrupt the mitigation’s metadata.

  • Immutable code. The adversary cannot inject new code or modify existing code.

IMIX

As with applications, the correct functioning of defenses relies on the integrity of their code and data, so an attacker may leverage a memory-corruption vulnerability in the application to bypass the defense itself. Traditionally, defense developers enforce the integrity of the defense's code using W⨁X or execute-only memory; for the defense's data, however, they are forced to choose between high performance overhead and compromised security. IMIX allocates data belonging to run-time mitigations in isolated pages, which can only be accessed by smov.

In addition to the smov instruction and the associated access permissions, IMIX includes a kernel extension and compiler support.

Hardware. For IMIX, we extend two of the CPU's main responsibilities: instruction processing and memory management. We add the smov instruction to the instruction set, reusing the logic of regular memory access instructions. The memory access logic is modified to generate a fault if (1) an instruction other than smov is used to access a page protected by IMIX, or (2) an smov instruction is used to access a normal page.

Kernel. We extend the kernel to support an additional access permission, which identifies all pages protected by IMIX. This enables protected memory allocation for code generated at runtime.

Compiler. IMIX provides two high-level primitives: one for allocating protected memory and one for accessing it. Mitigations like CPI, which are implemented as LLVM optimization passes working at the IR level, can use these primitives directly; for application developers, IMIX provides source code annotations.

Implementation

Developers can build programs with IMIX using our extended Clang compiler, whose backend we modified to support smov instructions. Programs protected by IMIX mark isolated pages using the mprotect system call with a special flag, so we extended the kernel's existing page-level memory protection functionality to support this flag and mark isolated pages appropriately. To support IMIX, the CPU must be modified to support the smov instruction and to perform the appropriate checks when accessing memory.

CPU extension

We mapped the IMIX protection flag to an ignored bit in the PTE; specifically, we chose bit 52, as it is the first bit not reserved, and is normally ignored by the MMU. We used a hardware simulator to show the feasibility of our design.

Simulated hardware. We use Wind River Simics, a full-system simulator, to simulate a complete computer that supports IMIX, and the complementary Intel Simulation and Analysis Engine (SAE) add-on to boot the Linux kernel and test our Linux extension. SAE supports emulating an x86 system running a full operating system within its process, while allowing various architectural instrumentations. This is done using extensions called ztools.

To instrument a simulated system, a ztool registers callbacks for specific hooks, either at initialization time or dynamically. First, we make sure that our ztool is initialized by registering a callback for the initialization hook. Then, we register a callback that is executed when an instruction is added to the CPU's instruction cache. If either a mov or an smov instruction that accesses memory is found, we register an instruction replacement callback.

First, we check the protection flag of the memory accessed by the instruction. To identify protected memory, we look up the related PTE by combining the virtual address with the base address of the page table hierarchy linked from the cr3 register.

If a regular instruction attempts to access regular memory, we execute the original instruction to avoid instruction cache changes. For an smov instruction attempting to access an isolated page, we first remove the instruction from the instruction cache and then execute our ztool implementation of the instruction.

Real hardware. Adding IMIX support to a real CPU would require extending the CPU’s instruction decoder to make it aware of our smov instruction. Moreover, we need to modify the MMU to perform necessary checks.

Operating system extension

The isolated pages need to be marked as such in the PTEs, which are located in kernel memory. We add a dedicated PROT_IMIX flag to the mprotect system call. Note that once a page is marked as PROT_IMIX, the only way to remove this flag is to unmap the page first.

Compiler extension

Our modifications mainly concern the IR, to give mitigations like CPI access to the smov instruction, and the x86 backend, to emit the instruction. Further, we introduced an attribute that can be used to protect a single variable by allocating it in an isolated page.

IR Extension. Runtime defenses are usually implemented as LLVM optimization passes that interact with and modify LLVM’s IR. In order to allow those defenses to generate smov instructions, we extended the IR instruction set. We created two IMIX instructions: sload and sstore.

LLVM IR instructions are implemented as C++ classes and therefore support inheritance. We implemented our IR instructions as subclasses of their regular counterparts in order to reuse the existing translation functionality from LLVM IR to machine code, called lowering in LLVM parlance.

To allocate memory in isolated pages, we implemented an LLVM function that can be called from an optimization pass, which allocates memory at page granularity using malloc and sets the IMIX permission using mprotect.
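A hedged sketch of that allocation helper: PROT_IMIX's value is illustrative (the real flag is defined by the modified kernel), and we use aligned_alloc here so the buffer starts on a page boundary:

#include <stdlib.h>
#include <sys/mman.h>

#define PROT_IMIX 0x10  // illustrative value; defined by the modified kernel

// Allocate whole pages and mark them as isolated via mprotect.
void *imix_alloc(size_t size) {
  const size_t kPage = 4096;
  size_t len = (size + kPage - 1) & ~(kPage - 1);  // round up to pages
  void *p = aligned_alloc(kPage, len);             // page-aligned memory
  if (p && mprotect(p, len, PROT_READ | PROT_WRITE | PROT_IMIX) != 0)
    return nullptr;  // flag rejected, e.g., on an unmodified kernel
  return p;
}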

Attribute support. We added an IMIX attribute that can be used to annotate C/C++ variables that should be allocated in isolated pages. All instructions accessing annotated variables use the IMIX IR instructions instead of the regular ones. We implemented this as an LLVM optimization pass that replaces regular variable allocations with indexed slots in an IMIX-protected safe region (one per compilation module).

Modifications to the x86 backend. In the backend, we added the code needed to process sload and sstore instructions. Lowering IR instructions to machine code is a two-stage process. First, the FastEmit mechanism is used: it consists of transformation rules explicitly coded in C++ that are too complex to be processed using regular expressions. This mechanism can either generate machine code directly or assign a rule to be applied in the next stage. In the second stage, LLVM applies rule-based lowering using pattern matching: the IR instructions and their operands are matched against string patterns in LLVM's TableGen definitions. We modified both stages of the lowering process, similarly to how load and store are handled.

Introduction

Traditional memory corruption vulnerabilities, such as buffer overflows, also exist inside SGX programs. This not only nullifies the security guarantees that SGX claims to provide but also, perhaps more critically, lets attackers hide behind SGX's isolation and confidentiality. For example, by exploiting a stack overflow vulnerability in a trusted web server or database running in an enclave, an adversarial client can launch a traditional return-oriented programming (ROP) attack to disclose security-sensitive data in the enclave.

To defeat such attacks, Intel includes a simple ASLR scheme for SGX in its SDKs for Linux and Windows. However, we find that the design of Intel's ASLR has several critical limitations that invalidate the security guarantees of ASLR.

Challenges:

  1. The strong, unique attack model of SGX exposes the enclave memory layout to untrusted system software, leaving SGX programs completely unprotected by ASLR.

  2. SGX provides only limited memory to an enclave: typically just 64 MB or 128 MB in total can be protected. This significantly limits the entropy, and thus the security, of ASLR.

  3. ASLR requires a dynamic relocation scheme that updates relative addresses in the code and data sections, which conflicts with SGX's attestation process; specifically, SGX finalizes the integrity measurement before enclave execution starts, but the relocation for ASLR must be performed afterwards. This inherent design disagreement results in writable code pages, invalidating executable-space protection.

  4. The SGX specification forces the use of a fixed address for security-critical data in an enclave.

To address these issues, this paper proposes SGX-Shield. It introduces the concept of a multistage loader, which pushes all ASLR-related operations back into a secure in-enclave loader, hiding all security-sensitive operations from adversaries. SGX-Shield employs fine-grained randomization by splitting the code into a set of randomization units. It also enforces software data execution prevention to guarantee W⨁X on the enclave's code pages.

Design

Threat model

SGX-Shield assumes the same attack model as SGX, as our ASLR scheme is designed for SGX programs. Our attack model focuses on an attacker who wishes to exploit a vulnerability in the target program running inside the enclave.

Multistage loader

SGX-Shield consists of three phases: preparation, bootstrapping, and secure in-enclave loading.

First, the preparation phase builds the target SGX program that a user wants to deploy. This built executable contains a secure in-enclave loader in its code section and the target SGX program in its data section.

Second, in the bootstrapping phase, SGX-Shield performs the first part of multistage loading. The primary role of this phase is to create an enclave and initialize the secure in-enclave loader with the help of the untrusted kernel. Because the memory layout of the enclave is assumed to be visible to untrusted parties during this phase, it makes as few resource-provisioning decisions as possible and defers all security-sensitive decisions to the secure in-enclave loader. This phase allocates code pages with read, write, and execute permissions, and data pages with read and write permissions. The read/write permissions on the code pages let the secure in-enclave loader write the target SGX program into them (performing relocation as well) and then execute it. Although the code pages remain writable and executable at the SGX level, SGX-Shield removes read and write access to them using software-level enforcement.

Finally, the secure in-enclave loader loads the target SGX program into the memory space from its data pages. It randomly picks the base address using the RDRAND instruction, which relies on a non-deterministic on-processor entropy source. Then it loads each section of the target program, with the address of each section further adjusted independently at random. Before finishing, SGX-Shield resolves all relocation information, which covers global variables, static variables, and the destinations of all branches. As a last step, SGX-Shield wipes the secure in-enclave loader from the memory space and then jumps to the entry point of the target SGX program.
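A sketch of drawing the random base with RDRAND (the function and its parameters are ours; the loader's actual logic also validates section placement). Compile with -mrdrnd:

#include <immintrin.h>
#include <stdint.h>

// Pick a page-aligned random base for a region of `size` bytes inside a
// reserved span of `span` bytes.
uint64_t PickRandomBase(uint64_t span, uint64_t size) {
  unsigned long long r = 0;
  while (!_rdrand64_step(&r)) {
    // RDRAND can transiently fail; retry until entropy is available.
  }
  return (r % (span - size)) & ~0xFFFULL;  // page-align the offset
}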

Fine-grained randomization for enclaves

Preparation. SGX-Shield relocates code at a smaller granularity, called a randomization unit. Our implementation supports 32- and 64-byte units. During compilation, SGX-Shield ensures that the terminating instruction of a randomization unit never falls through, because fall-through assumes that randomization units are placed consecutively, which no longer holds once they are relocated for ASLR. Thus, for each fall-through case, SGX-Shield appends an unconditional branch instruction that points to the entry point of the next randomization unit.

Note that this instrumentation pass cannot be done naively at the intermediate representation (IR) level: even if the IR has no conditional branch instructions with fall-through behavior, the compiler backend may introduce them automatically.

Finally, the size of the randomization unit introduces a trade-off between security and performance.

Stage 1: Bootstrapping. We let the loading scheme in the bootstrapping phase over-estimate the memory space (both code and data pages are 32 MB) required to load the target program, as the size is directly related to the ASLR entropy.

Secure in-enclave loading. Using the target SGX program in data pages, the secure in-enclave loader starts to place each randomization unit into previously allocated memory space. SGX-Shield randomizes all data objects as well, which includes stack, heap, and global variables.

Since SGX-Shield randomizes all code and data objects, every reference to a memory object, whether an absolute address or a PC-relative address, can only be determined after placement. The secure in-enclave loader therefore performs relocation for all memory objects after loading them.

Software DEP in enclaves

SGX-Shield enforces the NRW boundary, a virtual barrier between code and data pages. It guarantees this by (1) shepherding all memory access instructions and (2) ensuring that only aligned control transfers are made.

Shepherding memory access. In general, there are two types of memory access instruction: (1) explicit memory accessing instructions (e.g., mov, inc, add with memory operands in x86) and (2) stack accessing instructions (e.g., push, pop, ret or sub with an explicit stack register operand).

To prevent reads or writes through the first type of instruction, SGX-Shield makes sure that the memory address being accessed is always higher than the NRW boundary. It reserves the register r15 to hold the boundary and transforms each original instruction so that it accesses memory using a positive offset from the NRW boundary. We then enforce that the maximum positive offset is smaller than 2^32, so the instruction can never access memory below the NRW boundary.

; Before enforcing non-writable code
mov [rdx + 0x10], rax

; After enforcing non-writable code
; (r15 is initialized to hold the NRW boundary)
; (enforce rdx >= r15)

lea r13, [rdx + 0x10]  ; r13 = rdx + 0x10
sub r13, r15           ; r13 = r13 - r15
mov r13d, r13d         ; r13 = r13 & 0xffff'ffff
mov [r15 + r13], rax   ; *(r15 + r13) = rax

To enforce non-writable code pages for stack accessing instructions, SGX-Shield makes sure that the stack pointer (i.e., rsp) never points into the code pages. To handle instructions that adjust the stack pointer implicitly, we simply map a guard page (i.e., a page with no permissions granted) at the top and bottom of the stack area.

; Before enforcing non-writable code
sub rsp, 0x40

; After enforcing non-writable code
; (r15 is initialized to hold the NRW boundary)
; (enforce rsp >= r15)
sub rsp, r15          ; rsp = rsp - r15
sub rsp, 0x40         ; rsp = rsp - 0x40
mov esp, esp          ; rsp = rsp & 0xffff'ffff
lea rsp, [rsp + r15]  ; rsp = rsp + r15

Ensuring aligned control transfers. SGX-Shield restricts control transfers to the entry points of randomization units. This guarantees there is only one way to decode the instruction stream, ensuring that only shepherded memory accesses take place. The enforcement applies to all control transfer instructions, including indirect branches as well as return instructions.

; Before enforcing aligned indirect branches
jmp rax

; After enforcing aligned indirect branches
; (enforce rax % [randomization unit size] = 0)
and rax, -0x20  ; rax = rax & 0xffff'ffff'ffff'ffe0
jmp rax         ; jump to the address in rax

Isolating access to security-critical data

SGX places the page for the State Save Area (SSA) at a known location and does not permit its relocation. We use the software DEP mechanism to implement SFI that isolates the SSA.

Implementation

Preparation. The preparation phase comprises an LLVM 4.0 compiler, a static linker, and the signing tool of the Intel SGX SDK for Linux. By modifying LLVM's backend, we insert two kinds of instructions: (1) unconditional jump instructions (instead of fall-throughs) at the end of each randomization unit, and (2) instructions that enforce the software DEP. In addition, LLVM emits each randomization unit as a symbol; this fine-grained symbol information is used by the secure in-enclave loader.

The current version of SGX-Shield supports only static linking. We modify the Intel SDK for Linux to provide the enclave program with sufficient code and data pages for shuffling, and we embed the binary of the enclave program into the binary of the secure in-enclave loader as a section using the objcopy command.

Bootstrapping. The bootstrapping simply creates an enclave and loads the secure in-enclave loader to the enclave.

Secure in-enclave loader. The secure in-enclave loader is a dynamic loader that performs loading and relocation of memory objects at randomization unit granularity. It resolves the relocation information for all memory references, including absolute addresses and PC-relative addresses. We implement all of this (parsing the ELF file, random loading, and relocation) in a single C file.

Introduction

We propose a system, called Shuffler, which provides a deployable defense against JIT-ROP and other code reuse attacks. Other defenses have had significant barriers to deployment: some utilize a custom hypervisor; others involve a modified compiler, runtime, or operating system kernel. In comparison, Shuffler runs in userspace alongside the target program and requires no system modification beyond a minimal patch to the loader.

Shuffler operates by performing continuous code re-randomization at runtime, within the same address space as the program it defends. Additionally, we bootstrap into a self-hosted and self-modifying egalitarian environment: Shuffler always shuffles itself.

We achieve a shuffle period on the order of tens of milliseconds, so fast that it is nearly impossible to form a complete exploit. Shuffler creates new function permutations asynchronously in a separate thread and then atomically migrates program execution from one copy of code to the next. The migration requires a vanishingly small global pause, as program threads continue to execute unhindered 99.7% of the time. Thus, if the host machine has a spare CPU core, shuffling at faster rates does not significantly impact the target's performance.

Our system operates on program binaries, analyzing them and performing binary rewriting.

Threat model

We assume that the protection against code injection (W^X) is in place, and that an x86_64 architecture is in use. Our system does not require (and, in fact, is orthogonal to) other defensive techniques.

Design

Architecture

Shuffler is designed to require minimal system modifications. To avoid kernel changes, it runs entirely in userspace; to avoid requiring source code or a modified compiler, it operates on program binaries. Performing re-randomization soundly requires complete and precise pointer analysis, so we leverage symbol and relocation information from the (unmodified) compiler and linker.

At load time, Shuffler transforms the program's code using binary rewriting. The goal of rewriting is to be able to track and update all code pointers at runtime. We leverage our complete and accurate disassembly to transform all code pointers into unique identifiers: indices into a code pointer table. These indices cannot be altered after load time. We handle return addresses (dynamically generated code pointers) differently, encrypting them on the stack rather than using indices.

Our system performs re-randomization at the level of functions within a specific shuffle period, a randomization deadline specified in milliseconds. Shuffler runs in a separate thread and prepares a new shuffled copy of code within this deadline. The vast majority of the re-randomization process is performed asynchronously: creating new copies of code, fixing up instruction displacements, and updating pointers in the code table. The threads are globally paused only to atomically update return addresses: since existing return addresses reference the old copy of code, we must revisit saved stack frames and update them.

To prevent its own code from being used in a code reuse attack, Shuffler randomizes itself the same way it randomizes all other code. In fact, our scheme uses binary rewriting to transform all code in a userspace application (the program, Shuffler, and all shared libraries) into a single code sandbox, essentially turning it into a statically linked application at runtime.

Challenges

Changing function pointer behavior. In a normal program, the memory layout remains consistent and function pointers have indefinite lifetime. Re-randomization introduces an arbitrary lifetime for each block of code, so the problem becomes an exercise in avoiding dangling code pointers.

Hence, we need to accurately track and update every code pointer during the re-randomization process. We opt to statically transform all code pointers into unique identifiers, namely indices into a hidden code pointer table. Then, wherever a code pointer is copied throughout memory, it continues to refer to the same entry in the table.

Some code pointers are dynamically generated, in particular return addresses on the stack. We could dynamically allocate table indices, but call/ret pairs are highly optimized, and replacing them with the table mechanism would involve a large performance degradation. Instead, we allow ordinary calls to proceed as usual, and at re-randomization time we unwind the stack and update return addresses to new values. Rather than leave return addresses exposed on the stack, we encrypt each one with an XOR cipher.

Augmented binary analysis. We propose augmented binary analysis, which analyzes program binaries that carry additional information included by the compiler.

The common problems with binary analysis are distinguishing code from data and distinguishing pointers from integers. To tackle these problems, we require that (i) the compiler preserve the symbol table, and (ii) the linker preserve relocations. The symbol table indicates all valid call targets and makes disassembly straightforward: we iterate through the symbols and disassemble each one independently. Relocations indicate portions of an object file (or executable) that need to be patched up once its base address is known. Since each base address is initially zero, every absolute code pointer must have a relocation; but as object files are linked together, most code pointers get resolved and their relocations are discarded. We simply ask the linker to preserve these relocations.

Bootstrapping into shuffled code. Shuffler defends its own code the same way it defends all other code. Shuffled code cannot start running until the code pointer table is initialized, which requires some unshuffled startup code. Shuffled and original code are incompatible once they use code pointers: transforming code pointers into indices overwrites data that the original code accesses, after which the original code no longer executes correctly. Hence, we would have to call new functions as they became available, and carefully order the function-pointer rewrite process to avoid invalidating any functions currently on the call stack.

Instead, we opted for a simpler and more general solution. Shuffler is split into two stages, a minimal stage and a runtime stage. The minimal stage is completely self-contained, and it can safely transform all other code, including libc and the second-stage Shuffler. It then jumps to the shuffled second stage, which erases the previous stage (and all other original code). The second stage inherits all the data structures created in the first so that it can easily create new shuffled code copies.

Implementation

Code pointers are directed through the code pointer table, and return addresses are stored on the stack, encrypted with an XOR cipher. In each shuffle period, Shuffler makes a new copy of the code, updates the code pointer table, and sends a signal to all threads (including itself); each thread unwinds and fixes up its own stack. Shuffler waits on a barrier until all threads have finished unwinding, then erases the previous code copy.

Our Shuffler implementation supports many system-level features, including shared libraries, multiple threads, forking, {set/long}jmp, system call re-entry, and signals.

Transformation to support shuffling

Code pointer abstraction. We allocate the code pointer table at load time and set the base address of the GS segment to it. Then, we transform every function pointer at its initialization point from an address value into an index into this table. Jump tables are handled similarly, with indices assigned to each offset within a function that is used as a target.

Every instruction that originally used a function pointer value is rewritten to instead indirect through the %gs table. This adds an extra memory dereference. Since an x86 instruction can contain at most one memory reference, if there is already a memory reference we use the caller-saved register %r11 as scratch space. For (position-dependent) jump tables, there is no register we can safely overwrite, so we use a thread-local variable allocated by Shuffler as scratch space (denoted %fs:0x88).

Return address encryption. We encrypt return addresses on the stack with a per-thread XOR key. We reuse the stack-canary storage location for our key; our scheme operates similarly to stack canaries, but does not affect the layout of the stack frame. We add two instructions, mov %fs:0x28, %r11; xor %r11, (%rsp), at the beginning of every function and before every exit jump; after each call, we insert a mov instruction to erase the now-visible return address on the stack. We again use %r11 as a scratch register, since it is caller-saved according to the x86-64 ABI.
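The effect of the inserted instructions can be modeled in plain C++ (a sketch with our own names; the real key lives in the per-thread slot at %fs:0x28 and the xor operates directly on the stack slot at (%rsp)):

#include <cassert>
#include <cstdint>

// Stand-in for the per-thread XOR key stored in the stack-canary slot.
thread_local std::uintptr_t xor_key = 0x9e3779b97f4a7c15ULL;  // arbitrary example value

// Models the inserted pair: mov %fs:0x28, %r11; xor %r11, (%rsp).
inline void toggle_return_address(std::uintptr_t *slot) { *slot ^= xor_key; }

int main() {
    std::uintptr_t slot = 0x401000;  // pretend this is a saved return address
    toggle_return_address(&slot);    // encrypt on function entry
    assert(slot != 0x401000);        // the plaintext address never sits on the stack
    toggle_return_address(&slot);    // decrypt just before the return jump
    assert(slot == 0x401000);
    return 0;
}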

Displacement reach. A normal call instruction has a 32-bit displacement and must be within ±2GB of its target to “reach” it. Shared libraries use Procedure Linkage Table (PLT) trampolines to jump anywhere in the 64-bit address space.

Completeness of disassembly

While shuffling some libraries and programs, we encountered myriad special cases. The issues boil down to: (a) dealing with inaccurate/missing metadata, especially in the symbol table; (b) handling special types of symbols and relocations; and (c) discovering jump table entries and invocations.

Our major challenge is identifying whether relocations are part of jump tables, and distinguishing indirect tail-recursive jumps from jump-table jumps. If we fail to recognize that a relocation belongs to a jump table, we will compute its target incorrectly and the jump will branch to the wrong location; if we decide that a jump table’s jump is actually tail-recursive, we will insert return-address decryption instructions before it, corrupting %r11 and scrambling the top of the stack.

GCC generates jump tables differently in position-dependent and position-independent code (PIC). Position-dependent jump tables use 8-byte direct pointers, and are nearly always invoked by an instruction of the form jmpq *(%rax, %rbx, 8) at any optimization level. PIC jump tables use 4-byte relative offsets added to the address of the beginning of the table, and the lea that loads the table address may be quite distant from the final indirect jump. To find PIC jump tables, we use outgoing %rip-relative references from functions as bounds and check whether they point at sequences of relocations in the data section.

It is difficult to tell whether a jmpq *%rax instruction is used for indirect tail recursion or for a PIC jump table. We use a linear sweep to record the push instructions in the function’s first basic block, and keep a log of the pop instructions seen since the last jump. If an indirect jump is preceded by pop instructions in the reverse order of the pushes, we assume we have found a function epilogue and that the jump is indirect tail recursive.
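A simplified version of the heuristic, assuming the function has already been disassembled into mnemonic/operand pairs:

#include <cstddef>
#include <string>
#include <vector>

struct Insn { std::string mnemonic, operand; };

// True if the pops seen since the last jump undo, in reverse order, the
// pushes from the function's first basic block: the indirect jump then
// looks like a function epilogue (indirect tail recursion), not a
// jump-table dispatch.
bool looks_like_tail_recursion(const std::vector<Insn> &prologue_pushes,
                               const std::vector<Insn> &pops_since_last_jump) {
    if (prologue_pushes.size() != pops_since_last_jump.size())
        return false;
    const std::size_t n = prologue_pushes.size();
    for (std::size_t i = 0; i < n; ++i)
        if (prologue_pushes[i].operand != pops_since_last_jump[n - 1 - i].operand)
            return false;  // pop order must mirror push order exactly
    return true;
}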

Bootstrapping and requirements

We carefully bootstrap into shuffled code using two libraries (stage 1 and stage 2), so that the system never overwrites code pointers for the module that is currently executing. The constructor of stage 1 is called before any other via the linker mechanism -z initfirst. Then, stage 1 makes sure all other constructors run in shuffled code. The last constructor to be called is stage 2’s own; stage 2 then creates a dedicated Shuffler thread.

  1. Compiler flags. We require the program binary and all dependent libraries to be compiled with -Wl,-q, a link flag that preserves relocations, and with -gdwarf-2 when compiling C++. Since we require symbols and DWARF unwind information, the user must avoid -s and -fno-asynchronous-unwind-tables.

  2. System modifications. The -z initfirst loader feature currently supports only one shared library, and libpthread already uses it. Since shuffled functions must be within ±2GB of each other, we simplify Shuffler’s task and map all ELF PT_LOAD sections into the lower 32 bits of the address space. Finally, we disabled a manually constructed jump table in glibc’s vfprintf.

Implementation Optimizations

Generating new code. The Shuffler thread maintains a large code sandbox that stores shuffled functions. In each shuffle period, every function within the sandbox is duplicated and the old copies are erased. The sandbox is split in half, so that one half can be erased with a single mprotect system call. We maintain several buckets, and each function is placed in a random bucket; when a bucket fills up, it is committed with an mprotect call and a fresh bucket is allocated.

We use a Binary Indexed Tree for function allocations. The tree keeps track of all valid addresses for new buckets, storing disjoint intervals; it also tracks the sum of interval lengths, which allows a new bucket position to be sampled uniformly at random.
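The allocation idea can be sketched as follows; a linear scan is shown for clarity, whereas the actual Binary Indexed Tree answers the same prefix-sum queries in logarithmic time:

#include <cstdint>
#include <random>
#include <vector>

// Disjoint [start, start+len) intervals of addresses still valid for a new bucket.
struct Interval { std::uint64_t start, len; };

// Pick an address uniformly at random across all valid intervals
// (assumes the total length is nonzero).
std::uint64_t pick_bucket_address(const std::vector<Interval> &intervals,
                                  std::mt19937_64 &rng) {
    std::uint64_t total = 0;
    for (const auto &iv : intervals) total += iv.len;  // sum of interval lengths
    std::uint64_t r =
        std::uniform_int_distribution<std::uint64_t>(0, total - 1)(rng);
    for (const auto &iv : intervals) {
        if (r < iv.len) return iv.start + r;           // the r-th valid address
        r -= iv.len;
    }
    return 0;  // unreachable when total > 0
}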

Stack unwinding. We wrote a custom unwind library with a straightforward DWARF state machine.

Binary rewriting. Shuffler’s load-time transformations are all implemented through binary rewriting. We disassemble each function with diStorm and produce intermediate data structures which we call rewrite blocks. Rewrite blocks are similar to basic blocks but may be split at arbitrary points to accommodate newly inserted instructions.

Security analysis

Analysis of traditional attacks

Normal ROP. Shuffler defeats classic ROP outright: hard-coded gadget addresses go stale within one shuffle period.

Indirect JIT-ROP. Indirect JIT-ROP relies on leaked code pointers and computes gadget locations from them. Because code pointers are replaced with table indices, the attacker cannot gather code pointers from data structures; nor can the attacker infer code pointers from data pointers, since the relative offset between code and data sections changes continuously.

Direct JIT-ROP. In direct JIT-ROP, the attacker is assumed to know one valid code address, and employs a memory disclosure recursively, harvesting code pages and finding enough gadgets for a ROP attack. The entire attack must be completed within the shuffle period of r milliseconds.

Blind ROP. BlindROP tries to infer the layout of a server process by probing its worker processes, which are forked from the parent and share its layout. The attack uses a timing channel to infer information about the parent based on whether the child crashed. Shuffler easily thwarts this attack because it randomizes child and parent processes independently.

Introduction

More recently, researchers have demonstrated the feasibility of enforcing Control-Flow Integrity (CFI) in kernel space. However, in addition to the problems already discovered in CFI, a more fundamental problem is that because OS kernels are mainly data-driven, CFI can easily be bypassed by non-control-data attacks.

The objective of this work is to provide a defense system that is both principled and practical. We achieve the first goal by utilizing Data-Flow Integrity (DFI). Similar to CFI, DFI guarantees that runtime data-flow cannot deviate from the data-flow graph generated by static analysis. For example, data from a string buffer should never flow to the return address on the stack (control-data) or to the uid field (non-control-data). In this work, we focus on enforcing invariants related to kernel access control mechanisms (a.k.a. reference monitors) so as to defeat privilege escalation attacks.

While this approach sounds intuitive at a high level, enforcing it with practical runtime overhead is very challenging. First, access control checks are scattered throughout the kernel, and the related data are mixed with other data; the protection must therefore be deployed kernel-wide. Moreover, without hardware support, a software-based DFI implementation can be very expensive.

We propose a system called KENALI, which consists of two key techniques. Our first technique, INFERDISTS, is based on the observation that although the protection has to be kernel-wide, only a small portion of data is essential for enforcing the two security invariants. For ease of discussion, we refer to this set of data as distinguishing regions. INFERDISTS overcomes the challenges above with a new program-analysis-based approach: by leveraging implicit program semantics, it infers security checks without any manual annotation. Then, by considering both data- and control-dependencies of data that can affect each security check, as well as sensitive pointers, INFERDISTS generates a complete set of distinguishing regions.

Our second technique, PROTECTDISTS, is a new technique to enforce DFI over the distinguishing regions. PROTECTDISTS uses a two-layer protection scheme. The first layer provides coarse-grained but low-overhead data-flow isolation that prevents illegal data-flow from non-distinguishing regions to distinguishing regions. After this separation, the second layer enforces fine-grained DFI over the distinguishing regions. Because the access pattern to most data in the distinguishing regions is very asymmetric (read accesses are usually orders of magnitude more frequent than write accesses), PROTECTDISTS employs the Write Integrity Test (WIT).

Problem scope

Threat model and assumptions

We only consider attacks that originate from unprivileged code. The attacker can read and write word-sized values at arbitrary memory addresses.

Motivation

  1. Simple rooting attacks. CVE-2013-6282 allows attackers to read and write arbitrary kernel memory, which matches our adversary model. The attack works by (i) retrieving the addresses of prepare_kernel_cred() and commit_creds(); depending on the target system, they can be at fixed addresses or obtainable from the kernel symbol table (kallsyms_addresses); then (ii) invoking prepare_kernel_cred() and passing the result to commit_creds(), whereupon the kernel replaces the credential of the current thread with one of root privilege.

  2. Bypassing CFI with non-control-data attacks. The above attack can be prevented by kernel-wide CFI, but CFI is easily bypassed by locating the cred structure and overwriting the euid field. The cred structure can be located in many ways: (i) if kallsyms is available and contains the address of init_task, we can traverse the process list to locate the task_struct of the current process, then follow task_struct->cred; (ii) if a vulnerability leaks a stack address, attackers can directly obtain the address of the thread_info structure, then follow its links to locate the task_struct.

  3. Diversity of non-control-data attacks. In the target kernel we evaluated, we found that 2419 data structures contain critical data.

Technical approach

Inferring distinguishing regions

The challenge is that, for security, our solution must be sound (i.e., it must not miss any data that belongs in the distinguishing regions), while for performance, we want the inference result to be as small as possible.

  1. Control-Data

  2. Non-Control-Data: Access controls are implemented as security checks, and while a kernel may have many security checks scattered across different components, they all follow one consistent semantic: if a security check fails, it returns a security-related error code. Leveraging this observation, we can collect security checks without manual annotation. Distinguishing regions can then be constructed via standard dependency analysis over the conditional variables of each security check, as the acl_permission_check() example below illustrates.

int acl_permission_check(struct inode *inode, int mask)
{
    unsigned int mode = inode->i_mode;

    if (current_cred->fsuid == inode->i_uid)
        mode >>= 6;
    else if (in_group_p(inode->i_gid))
        mode >>= 3;

    /* The security check: failure returns a security-related error code. */
    if ((mask & ~mode & (MAY_READ | MAY_WRITE | MAY_EXEC)) == 0)
        return 0;

    return -EACCES;
}

We must also consider branches that dominate a security check. However, naively including all dominators would introduce many false positives. To reduce them, INFERDISTS conservatively excludes two cases: (i) branches that can lead to a non-security-related error return, and (ii) branch instructions that are post-dominated by either a security check or a check from case (i).

  3. Sensitive pointers: Pointers to sensitive regions must be protected as well; otherwise, attackers could indirectly control the data in distinguishing regions by manipulating the pointers.

Protecting distinguishing regions

The challenge for this step is how to minimize the performance overhead on commodity processors that lack support for fine-grained data-flow tracking. Our key observation is that, after separating memory into distinguishing and non-distinguishing regions, there can be three types of data-flow: (i) within the non-distinguishing region, (ii) between the two regions, and (iii) within the distinguishing region.

We design a two-layer scheme accordingly: lightweight data-flow isolation prevents illegal data-flow of the second type, and more expensive DFI enforcement prevents illegal data-flow of the third type. Data-flow of the first type never touches the distinguishing regions, so it requires no protection.

  1. Data-flow isolation: We explored the feasibility of hardware-based data-flow isolation for AArch64. We developed a novel, virtual address space-based isolation mechanism that tags TLB entries with an identifier (the ASID on AArch64, analogous to the Process Context ID on x86).

  2. Write Integrity Test: In addition to preventing illegal data-flow from non-distinguishing regions to distinguishing regions, we use DFI to prevent illegal data-flow within distinguishing regions. We leveraged the write integrity test.

  3. Shadow Objects: Since a hardware protection unit (e.g., a page) may contain both distinguishing and non-distinguishing regions, write-protecting such a page imposes additional overhead on accesses to the non-distinguishing regions it contains. Our solution is shadow objects: a normal copy for non-distinguishing regions and a shadow copy for distinguishing regions.

  4. Safe Stack: Stacks are protected using a shadow stack.

A prototype for Android

Data-flow isolation

  1. AArch64 VMSA (Virtual Memory System Architecture): AArch64 supports a maximum of 48-bit virtual addresses (VA), split into two parts: the bottom part is for user space, and the top part is for kernel space.

  2. Shadow Address Space: Our data-flow isolation implementation leverages the ASID (Address Space IDentifier) tagging feature and is based on shadow address space. Under this isolation scheme, the same physical page is mapped into two different address spaces with different permissions.

  3. Atomic Primitive Operations: To preserve the security of the shadow address space, we make every operation under the shadow address space atomic, such as writing a single word, memcpy, and memset.

MMU integrity

Three additional security invariants are required: (i) MMU isolation: only MMU management code can modify MMU-related data; (ii) Code integrity: kernel code must remain unmodified; (iii) Dedicated entry and exit: MMU management code is always invoked through dedicated entries and exits.

Shadow objects

  1. SLUB Allocator: In our target kernel, most distinguishing regions are allocated from the SLUB allocator. When SLUB allocates pages for a new slab, we allocate a shadow slab of the same size and map it read-only at a fixed offset (4GB) from the original slab; see the sketch after this list.

  2. Global Objects: While most distinguishing regions are dynamically allocated, some are statically allocated as global objects. We allocate shadow memory for the entire .data section, copy its contents to populate the shadow memory, and then map it.

  3. Analysis and instrumentation: We first identify all allocation sites for objects in distinguishing regions and modify their allocation flags. Next, we identify all pointer arithmetic for accessing distinguishing regions and modify it to access the shadow objects. Finally, we identify all writes to distinguishing regions and modify them to invoke our atomic operations instead. Inline assembly must be handled manually.
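A minimal sketch of the shadow-object address computation (the 4GB offset is the prototype’s choice; writes to the shadow copy must still go through the atomic primitives described earlier, since the shadow mapping is read-only in the normal context):

#include <cstdint>

// Fixed distance between a slab and its read-only shadow copy.
constexpr std::uintptr_t kShadowOffset = 1ULL << 32;  // 4GB

// Rewritten pointer arithmetic: accesses to distinguishing fields are
// redirected to the shadow object at the fixed offset.
template <typename T>
const T *shadow_of(const T *obj) {
    return reinterpret_cast<const T *>(
        reinterpret_cast<std::uintptr_t>(obj) + kShadowOffset);
}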

Kernel stack randomization

We map kernel stacks to unused virtual addresses above the logical map (the top 256GB). Because kernel stacks are small (16KB), we have around 24 bits of entropy available (2^38 bytes / 2^14 bytes per stack = 2^24 possible slots).

We contain the risk of information leaks as follows. First, we mark functions like copy_to_user as unsafe. Then, we store the real stack pointer in a redirection table and replace the original pointer with an index into the table; the table is mapped as inaccessible in the normal context, as are the page tables of the shadow stack.

Evaluation

Distinguishing regions discovery

For control data, our analysis identified 6192 code pointers: 991 are function arguments and 11 are return addresses, both protected by the safe stack; 1490 are global variables; and 3699 are fields over 783 data structures.

For non-control data, the error codes we used were EPERM, EACCES, and EROFS. The analysis flagged 526 functions as capable of returning permission-related errors, and marked 1077 function arguments, 279 global variables, and 1731 data fields over 855 data structures as security-critical.

For sensitive pointers, we combined both control and non-control data as input to the inference, yielding a total of 4096 data fields over 1316 data structures. Further, 4002 fields over 1103 structures were marked as distinguishing regions. For the target kernel, we should protect about 27.30% of all data structures.

Introduction

Code pointers can be divided into two categories: backward-edge pointers, i.e., return addresses, and forward-edge pointers, such as function pointers or virtual table pointers. Control-Flow Integrity protects forward edges and assumes that backward edges are protected. However, stack canaries and safe stacks are the strongest backward-edge protections available in mainline compilers, and both are easily bypassed by information leaks.

Shadow stacks enforce stack integrity, protecting against stack-pivot attacks and return-address overwrites. A shadow stack stores the return address in a separate, isolated region of memory that is not accessible to the attacker. Upon returning, the program’s return address is checked against the protected copy on the shadow stack. Two shadow stack designs have been proposed: compact shadow stacks, which rely on a separate shadow stack pointer, and parallel shadow stacks, which place the shadow stack at a constant offset from the original stack. Existing designs suffer from a combination of poor performance, e.g., greater than 5% overhead, and difficulty supporting C and C++ programming paradigms, e.g., multi-threading and exception handling.

Shadow stack design space

For any shadow stack mechanism to be adopted in practice, it must be highly performant, compatible with existing code, and provide meaningful security.

Shadow stack mechanisms are defined by how they map from the program stack to the shadow stack. This includes the type of mapping, as well as how the mapping is encoded in the protected binary.

Shadow stack mechanisms

Direct mapping schemes for parallel shadow stacks use the location of the return address on the program stack to directly find the corresponding entry on the shadow stack. The parallel shadow stack is as large as the program stack, and a simple offset maps from the program stack to the shadow stack.

Indirect mapping schemes for compact shadow stacks maintain a shadow stack pointer, equivalent to the stack pointer used for the program stack. The shadow stack pointer points to the last entry on the shadow stack, exactly as the stack pointer does for the program stack. Maintaining a shadow stack pointer allows a compact shadow stack to allocate significantly less memory, as only room for the return address is required, instead of duplicating the program stack.

Parallel and compact shadow stacks have different compatibility implications. If calls and returns were always perfectly matched, there would be no difference. However, the setjmp/longjmp functionality of C, which allows jumping multiple stack frames back up the stack, and the equivalent stack unwinding used by C++ exception handling, both break the assumption of perfectly matched calls and returns. Consequently, indirect mapping schemes incur additional overhead to resynchronize, while direct mapping schemes are unaffected.

Shadow stack mechanisms based on registers.

All 64-bit architectures have at least 16 general-purpose registers, which makes it possible to dedicate one general-purpose register to the shadow stack mechanism.

Parallel shadow stack mechanisms: The existing mechanism places shadow stack entries at a constant offset from the program stack, which is very efficient but has higher memory overhead and lower security. Compatibility concerns also arise from requiring a single constant offset (limited to 32 bits for an immediate operand on x86) from the program stack to the shadow stack across all threads, severely constraining the address-space layout for programs with many threads, such as browsers.

To mitigate these compatibility and security concerns, we propose a new parallel shadow stack mechanism that encodes the offset in a dedicated register, allowing the offset to the shadow stack to be determined at runtime. Further, the offset may vary from thread to thread, as registers are thread-local, and it can be set when the thread is created.

Compact Shadow Stack Mechanisms: For compact shadow stack mechanisms, the key question is where to store the shadow stack pointer. The shadow stack pointer will be dereferenced twice in every function: once in the prologue to push the correct return address, and once in the epilogue to pop the shadow return address.

The earliest shadow stack scheme with a dedicated register targeted x86_32. We rejuvenate this idea for 64-bit architectures, as a general-purpose register provides the fastest possible storage for the shadow stack pointer.

Shadow stack implementations

Each of the shadow stack mechanisms we evaluate is implemented as a backend compiler pass in LLVM 7.0.0. In particular, each shadow stack mechanism must instrument calls and returns to update its shadow stack and validate the return address before using it to transfer control. We show that the best way to accomplish this is to instrument function prologues and epilogues.

Instrumented locations

The instrumentation is responsible for pushing the return address to the shadow stack, and updating the shadow stack pointer for compact shadow stacks. Returns must be instrumented to pop from the shadow stack and validate the program return address in the function epilogue before the control-flow transfer to mitigate control-flow hijacking attacks.

The elegant solution for instrumenting calls is to place the protection in the function prologue. This way the function itself is protected, not particular call sites. Instrumenting function prologues and epilogues naturally maintains the symmetry of calls and returns, as each is executed exactly once per function call.
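In model form, the inserted prologue/epilogue logic for a compact, register-based shadow stack looks roughly like this (plain C++ standing in for the compiler pass; shadow_sp models the dedicated register, e.g. %r15, and must be initialized by the runtime before use):

#include <cstdint>
#include <cstdlib>

thread_local std::uintptr_t *shadow_sp;  // models the reserved register

// Prologue: push the just-saved return address onto the shadow stack.
inline void prologue(std::uintptr_t return_addr) {
    *++shadow_sp = return_addr;
}

// Epilogue: pop the shadow copy and abort before `ret` if the program
// stack's return address no longer matches it.
inline void epilogue(std::uintptr_t return_addr_on_stack) {
    if (*shadow_sp-- != return_addr_on_stack)
        std::abort();  // control-flow hijack detected
}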

On x86, instrumenting the function prologues results in a one-instruction wide Time Of Check To Time Of Use (TOCTTOU) opportunity due to architecture limitations. The call instruction pushes the return address to the stack where it may be modified by an attacker before it is picked up by the prologue in the called function. Architectures, such as ARM, where the address of the called function is stored in a register, do not have this limitation.

Our proposed mechanism halves the attack window, as we jump to the verified address, but it does not fully close the TOCTTOU window. Intel Control-flow Enforcement Technology (CET) introduces a shadow stack based on hardware and compiler support; this extension will mitigate the TOCTTOU window on x86 and simplify the required instrumentation.

Stack unwinding mechanisms such as longjmp and C++ exceptions require additional instrumentation for compact shadow stacks: we must be able to unwind to the correct point on the shadow stack as well. Simply matching return addresses does not suffice, as the same address can appear multiple times in the call stack due to, e.g., recursive calls. To deal with this, our compact shadow stack implementations also push the stack pointer, i.e., %rsp. The stack pointer and return address together uniquely identify the stack frame to unwind to, allowing our mechanisms to support stack unwinding.

For shadow stack mechanisms that use a register to encode the shadow stack mapping, ensuring compatibility with unprotected code constrains the choice of register. Our implementations use %r15 in practice.

Runtime support

Our runtime library is responsible for allocating the shadow stack and hooking setjmp and longjmp. We add a new function to the pre_init array that initializes the shadow stack for the main program thread; it also initializes the shadow stack pointer for compact shadow stack mappings.

For compact shadow stack mappings to support multi-threading and libunwind, we preload a small support library. It intercepts calls to pthread_create and pthread_exit to set up and tear down shadow stacks for additional threads.

Shadow stack epilogue optimizations

Traditionally, shadow stacks have relied on compare instructions to validate that the shadow return address and the program return address are equivalent, which can lead to pipeline stalls even with branch prediction. Consequently, as an optimization, we explore two different methods to optimize this validation.

To replace the compare instruction, we propose an xor of the program return address and the shadow return address. This yields 0 bits wherever the two are identical, and 1s elsewhere. x86 has an instruction, popcnt, that returns the number of bits set to 1. Consequently, if the popcnt of the xor of the two return addresses is 0, they are equivalent.

We leverage the memory management unit (MMU) to compare the popcnt result to zero as a side effect, by arranging for a protection fault whenever the result is nonzero. Both schemes below are sketched in code after this list.

  1. By shifting this value left by 48 bits and or-ing it into the return address, we create a non-canonical address, and thus a general protection fault, whenever the value is not zero.

  2. The Last Byte in Page (LBP) scheme creates two adjacent pages in memory, the first mapped read-write, the second with no permissions. We then attempt to read from the first page at the address of its last valid byte plus the popcnt value. If the popcnt value is zero, we read the last byte of the valid page; otherwise we read from the guard page, causing the MMU to raise a fault.
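Both schemes hinge on the same popcnt-of-xor value; in sketch form (our names, with __builtin_popcountll standing in for the popcnt instruction):

#include <cstdint>

// Zero iff the program and shadow return addresses agree bit for bit.
inline std::uint64_t mismatch_bits(std::uint64_t ra, std::uint64_t shadow_ra) {
    return static_cast<std::uint64_t>(__builtin_popcountll(ra ^ shadow_ra));
}

// Scheme 1: fold the mismatch count into bits 48+ of the return address.
// Any nonzero count yields a non-canonical address, so the subsequent
// indirect jump faults instead of transferring control.
inline std::uint64_t poison_if_mismatched(std::uint64_t ra, std::uint64_t shadow_ra) {
    return ra | (mismatch_bits(ra, shadow_ra) << 48);
}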

Hardware integrity mechanisms

Once a shadow stack design has been chosen, the mechanism must guarantee the integrity of the shadow stack itself. Integrity guarantees are best provided by hardware, which offers greater security and performance than software solutions while remaining just as generic.

Existing hardware mechanisms take two different approaches to encoding access privileges: (i) MPK encodes access privileges in each thread’s register file, providing per-thread integrity; (ii) MPX encodes access privileges in individual instructions, so that privileges are the same across all threads and depend only on the executing instruction.

Thread centric solutions

Thread-centric solutions operate by changing the permissions on the pages of the protected memory region. Adding write permission elevates the thread’s privileges, creating a privileged region of execution that is able to modify the protected memory region, i.e., the shadow stack.

Memory Protection Keys (MPK) aims to address this by providing a single, unprivileged instruction that can change page access permissions on a per thread basis.

# Read Write (disable all MPKs)
mov $0, %eax
xor %ecx, %ecx
xor %edx, %edx
wrpkru

# protection is off, write to the shadow stack
....
# Read Only (enable write disable bit for shadow stack)
mov $8, %eax
xor %ecx, %ecx
xor %edx, %edx
wrpkru

Note that the wrpkru instruction requires %ecx and %edx to be set to 0. Consequently, for functions that take more than two arguments, it is necessary to preserve the original values of these registers.

Code centric solutions

The Intel ISA extension Memory Protection Extensions (MPX) provides a hardware mechanism that can be used to implement segmentation in a flexible manner. MPX-based segmentation schemes divide writes into two categories: those privileged to write into the protected region, and all others. This approach is surprisingly performant.

Evaluation

Based on our initial experiments with SPEC CPU 2006, we believe that Shadesmar, a compact, register-based shadow stack that uses the shadow return address directly and relies on information hiding to protect the shadow stack, is the best candidate for adoption by mainline compilers.

Introduction

Most software is written in unsafe languages like C and C++, which are vulnerable to attacks. Almost all these attacks subvert the intended data-flow in the program. Control-data attacks exploit buffer overflows or other vulnerabilities to overwrite a return address or a function pointer. Non-control-data attacks exploit similar vulnerabilities to overwrite security-critical data without subverting the intended control-flow in the program.

This paper presents a technique that prevents both attacks by enforcing data-flow integrity. The technique computes a data-flow graph for a vulnerable program using static analysis, and instruments the program to ensure that the flow of data at runtime is allowed by this graph.

Our implementation of data-flow integrity enforcement uses reaching definitions analysis to compute a static data-flow graph: for each value read by an instruction, it computes the set of instructions that may have written that value.

To enforce data-flow integrity at runtime, our implementation instruments the program to compute the definition that actually reaches each use. It maintains a table with the identifier of the last instruction to write to each memory location; the program is instrumented to update this table before every write, and attackers are prevented from tampering with the table. We also instrument reads to check that the identifier of the instruction that wrote the value being read is an element of the set computed by the static analysis.

Data flow integrity enforcement

Overview

The first phase uses static analysis to compute a data-flow graph for the vulnerable program. The second instruments the program to ensure that the data-flow at runtime is allowed by the graph. The last runs the instrumented program and raises an exception if data-flow integrity is violated. Consider the following example, where an overflow of packet can illegally overwrite authenticated:

int authenticated = 0;
char packet[1000];

while (!authenticated) {
    PacketRead(packet);
    if (Authenticate(packet)) {
        authenticated = 1;
    }
}

if (authenticated)
    ProcessPacket(packet);

We use reaching definitions analysis to compute the static data-flow graph. An instruction that writes to a memory position defines the value in the memory position, and an instruction that reads the value is said to use the value. The analysis computes the set of reaching definitions for each use and assigns an identifier to each definition. It returns a map from instructions to definition identifiers and a set of reaching definition identifiers for each use, which we call the static data-flow graph.

The analysis can be imprecise, but it is important that it be conservative: it must include in the set all definitions that may reach a use at runtime, though it may include additional definitions.

The second phase instruments the program to enforce a simple safety property that we call data flow integrity, i.e., whenever a value is read, the definition identifier of the instruction that wrote the value is in the set of reaching definitions for the read.

The program is instrumented to compute the definition that reaches each read at runtime and to check if this definition is in the set of reaching definition identifiers that was computed statically. We maintain the runtime definitions table (RDT) that records the identifier of the last instruction to write to each memory position. Every write is instrumented to update the RDT. The instrumentation before reads uses the address of the value being read to retrieve the identifier from the RDT. Then, it checks if the identifier is in the statically-computed set.
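A toy model of the RDT and the two instrumentation operations (our names, and a deliberately small table; the real layout and lowering are described below):

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <set>

// One 2-byte definition identifier per 32-bit word (table size shrunk for the sketch).
constexpr std::size_t kEntries = 1 << 20;
static std::uint16_t rdt[kEntries];

inline std::size_t rdt_index(const void *addr) {
    return (reinterpret_cast<std::uintptr_t>(addr) >> 2) % kEntries;
}

// SETDEF: record the identifier of the instruction writing to addr.
inline void setdef(const void *addr, std::uint16_t id) {
    rdt[rdt_index(addr)] = id;
}

// CHECKDEF: the last writer must belong to the statically computed
// reaching-definitions set, otherwise data-flow integrity was violated.
inline void checkdef(const void *addr, const std::set<std::uint16_t> &reaching) {
    if (reaching.count(rdt[rdt_index(addr)]) == 0)
        std::abort();
}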

The attacker must be prevented from tampering with the RDT (any attempt to write to the RDT generates an exception), tampering with the code (read-only protection for code pages), or bypassing the instrumentation (instrumentation of control-data, e.g., CFI).

Static analysis

We compute reaching definitions using a combination of two analyses: a flow-sensitive intra-procedural analysis and a flow-insensitive and context-insensitive inter-procedural analysis.

The intra-procedural analysis takes control-flow into account. We use it to compute reaching definitions for uses of local variables that have no definitions external to the function in which they are declared. The inter-procedural analysis computes reaching definitions for all other uses.

The inter-procedural analysis is less precise so that it scales to large programs: it ignores control-flow, and it does not take calling context into account when analyzing functions. We implemented a points-to analysis to compute the set of objects that each pointer can point to, and we use these points-to sets to compute reaching definitions.

The points-to analysis makes a global pass over all source files to collect subset constraints. Each assignment x = y results in a subset constraint x ⊇ y, meaning that the set of possible values of x contains the set of possible values of y. The analysis compiles each source file to IR, writes all subset constraints in the IR to a file, and after this global pass computes the points-to sets by iterating over all the constraints until it reaches a fixed point.
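For example, the global pass would record the following Andersen-style constraints for this fragment (shown as comments):

void example() {
    int a = 0, b = 0;
    int *x = &a;  // constraint: pts(x) ⊇ {a}
    int *y = &b;  // constraint: pts(y) ⊇ {b}
    x = y;        // constraint: pts(x) ⊇ pts(y)
    // Fixed point: pts(x) = {a, b}, pts(y) = {b}, so a write through x
    // is a potential definition of both a and b.
    *x = 1;
}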

During the global pass, we also collect the target locations and identifiers of instructions that write to locations that may be read in other functions.

We compute inter-procedural reaching definitions using the points-to sets and the information about write instructions collected during the global pass. For a use of a variable, the set of reaching definitions is the union of the set of identifiers of all writes to that variable with the sets of identifiers of all writes through pointers that may point to it. For a pointer dereference, the set of reaching definitions is the union of the set of identifiers of all writes through that pointer with the sets of reaching definitions of all the variables the pointer may point to.

Both the intra-procedural and the inter-procedural analyses assume that correct programs do not use pointer arithmetic to navigate between independent objects in memory. However, it is precisely this assumption that is violated by most attacks.

Since we control code generation, we can ensure that temporaries are placed in registers beyond the reach of an attacker. The attacker cannot violate data-flow integrity by overwriting these registers, because our instrumentation prevents subversion of the control-flow. We only compute reaching definitions and instrument accesses for temporaries that are spilled to memory.

Instrumentation

We add instrumentation by inserting new instructions into the IR of the program:

  • SETDEF opnd id
  • CHECKDEF opnd setName

The first instruction sets the RDT entry for opnd to id. The second retrieves the runtime definition identifier for opnd from the RDT and checks whether it is in the reaching-definitions set named setName. The compiler maintains a map from set names to set values that is used when lowering CHECKDEF instructions to the assembly of the target machine.

We do not instrument temporaries that we can ensure are allocated to registers, nor uses of local variables whose addresses are computed by adding a constant to the frame pointer.

To enable efficient accesses, the RDT is implemented as an array with a definition identifier for each 32-bit memory word in the instrumented program. Each definition identifier is two bytes long, which seems sufficient even for large programs (2^16 identifiers).

Since programs can access memory at byte granularity, it would seem necessary to record a 2-byte definition identifier for every byte of memory. This would result in a space overhead of approximately 200%, which is not practical. We are able to record a single identifier for each 32-bit word because we generate code in which no two variables with distinct reaching-definitions sets share the same aligned 32-bit memory word: we simply changed the compiler to use a minimum alignment of 32 bits when laying out local variables in a stack frame and globals in the data segment.

In the current implementation, we allocate the lowest 1 GB of virtual address space to the program being instrumented and 512 MB to the RDT, with a guard page between them; that is, the guard page is at address 4000 0000h and the base of the RDT is at 4000 1000h. This layout also enables efficient bounds checking of target addresses: we raise an exception if the bitwise AND of the target address with c000 0000h is non-zero.

The high-level instrumentation is lowered to x86 assembly as illustrated by the following examples:

# SETDEF authenticated 1

lea ecx, [_authenticated]
test ecx, C0000000h                  # bounds check on the target address
je L
int 3                                # out of bounds: raise an exception
L: shr ecx, 2                        # one RDT entry per 32-bit word
mov word ptr [ecx*2 + 40001000h], 1  # record definition identifier 1

# CHECKDEF authenticated 100

lea ecx, [_authenticated]
shr ecx, 2
mov cx, word ptr [ecx*2 + 40001000h] # fetch the last writer's identifier
cmp cx, 1                            # set 100 = {1, 8}
je L
cmp cx, 8
je L
int 3                                # identifier not in the set: raise an exception
L:

Optimizations

A naive implementation of data-flow integrity enforcement performs poorly: each definition introduces a write to the RDT, and each use check introduces a read from the RDT followed by comparisons against each identifier in the use’s reaching-definitions set.

Renaming equivalent definitions

The first optimization partitions definitions into equivalence classes such that we can safely assign the same identifier to all definitions in a class. Two definitions are equivalent if they have exactly the same set of uses.

Removing bounds check on writes

We can optimize SETDEFs by removing the bounds checks from all safe writes. In the current implementation, a write is safe if the target address is obtained by adding a small constant offset (possibly zero) to the stack pointer, the frame pointer, or the address of a global or static variable.

Removing SETDEFs and CHECKDEFs

To identify redundant instrumentation, we use symbolic execution of the native code augmented with SETDEF and CHECKDEF operations. Two addresses are equal if they are syntactically equal. They are different if they are computed by adding different offsets to the same symbolic register state. Otherwise, they may refer to aliased memory locations. A write to memory invalidates the symbolic state of a register if that state refers to the contents of a memory position that may be aliased with the write’s target; it also removes mappings for any memory that may be aliased with the write’s target from the symbolic RDT state. We apply these rules to eliminate redundant instrumentation after each SETDEF and CHECKDEF by examining the symbolic RDT state.

Optimizing membership checks

Another optimization renames definitions to reduce the cost of membership checks in CHECKDEFs. Membership checks can be implemented more efficiently when sets contain ranges of consecutive identifiers: a check against {0 .. n} can be implemented as a single unsigned integer comparison against n.
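For instance, once the reaching definitions of a use have been renamed to the range {0 .. n}, the lowered membership check reduces to a single comparison (sketch):

#include <cstdint>

// Membership in {0, ..., n} via one unsigned comparison instead of one
// comparison per set element.
inline bool in_renamed_set(std::uint16_t id, std::uint16_t n) {
    return id <= n;
}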

Removing SETDEFs for safe definitions

The last optimization identifies local variables that have no definitions outside the function and are written only by safe writes. It replaces all SETDEFs for such a variable with a single SETDEF with identifier 0, placed at function entry.

Evaluation

Overhead

We used several programs from the SPEC CPU 2000 benchmark suite to measure the overhead added by our instrumentation. These benchmarks are CPU-intensive, and they spend most of their time executing instrumented code at user level.

The first set of experiments measured two variants of data-flow integrity enforcement: intraproc DFI instruments only uses of control-data and uses of local variables without definitions outside their function, while interproc DFI is the full variant described earlier in the paper.

The average execution-time overhead is 43% for intraproc DFI and 104% for interproc DFI.