Kaleidoscope: Implementing a Parser and AST

Posted on 2019-05-09 Edited on 2021-05-22 In note

Chapter 2 Introduction

这一章将会使用 Chapter 1 中实现的 lexer 来构造 Kaleidoscope 的 parser. 一旦我们完成 parser 的实现, 我们就能定义并且构造一个抽象语法树 (Abstract Syntax Tree, AST).

The Abstract Syntax Tree (AST)

AST 是源代码语法结构的抽象表示, 它以树状的形式来表示源代码中的语法结构, 树上的每个节点都表示源代码中的一种结构, 从而能建模这个语言, 方便编译器后续的解释 (code generation). 在 Kaleidoscope 中, 我们有表达式 (expressions), 函数原型 (prototype), 函数对象 (function object). 我们先从表达式开始:

// ExprAST - Base class for all expression nodes
class ExprAST
{
public:
    virtual ~ExprAST() {};
};

// NumberExprAST - Expression class for numeric literals like "1.0"
class NumberExprAST : public ExprAST
{
    double Val;

public:
    NumberExprAST(double Val) : Val(Val) {}
};

上述的代码表示了基类 ExprAST 和它的一个子类 NumberExprAST 的定义. NumberExprASR 将数字常量的值存储到实例中的成员变量, 使得编译器能够在后面的步骤中对它进行处理.

然后是其他 AST 节点的定义：

// VariableExprAST - Expression class for referencing a variable, like "a"
class VariableExprAST : public ExprAST
{
    std::string Name;

public:
    VariableExprAST(const std::string &Name) : Name(Name) {}
};

// BinaryExprAST - Expression class for a binary operator
class BinaryExprAST : public ExprAST
{
    char Op;
    std::unique_ptr<ExprAST> LHS, RHS;

public:
    BinaryExprAST(char Op, std::unique_ptr<ExprAST> LHS, std::unique_ptr<ExprAST> RHS) : Op (Op), LHS (std::move(LHS)), RHS(std::move(RHS)) {}
};

// CallExprAST - Expression class for function calls
class CallExprAST : public ExprAST {
    std::string Callee;
    std::vector<std::unique_ptr<ExprAST>> Args;

public:
    CallExprAST(const std::string &Callee, std::vector<std::unique_ptr<ExprAST>> args) : Callee(Callee), Args(std::move(Args)) {}
};

以上的定义都比较直接, 使用一个 std::string Name 变量来获取变量的名字; 二元操作符使用 char Op 来获取操作符, 然后使用 std::unique_ptr<ExprAST> LHS, RHS 分别持有左边和右边的表达式; 函数调用使用 std::string Callee 来存储被调用的函数名, 使用 std::vector<std::unique_ptr<ExprAST>> args 来持有函数的参数列表.

对于我们最基本的语言, 目前没有条件跳转的控制流, 所以还不是图灵完全的. 我们会在之后添加条件跳转. 接下来我们讨论函数接口和函数定义:

// PrototypeAST - This class represents the "prototype"
// for a function, which captures its name, and its
// argument names (thus implicitly the number of
// arguments the function takes)
class PrototypeAST {
    std::string Name;
    std::vector<std::string> Args;

public:
    PrototypeAST(const std::string &Name, std::vector<std::string> Args) : Name(Name), Args(std::move(Args)) {}

    const std::string &getName() const {return Name};
};

// FunctionAST - This class represents a function definition itself
class FunctionAST {
    std::unique_ptr<PrototypeAST> Proto;
    std::unique_ptr<ExprAST> Body;

public:
    FunctionAST(std::unique_ptr<PrototypeAST> Proto, std::unique_ptr<ExprAST> Body) : Proto (std::move(Proto)), Body (std::move(Body)) {}
};

注意, 在 Kaleidoscope 中,函数原型和函数定义并不是表达式, 所以没有继承 class ExprAST. 由于在 Kaleidoscope 中只有 double 类型, 所以函数的类型也仅仅和参数数量有关. 在更为复杂的语言中, class ExprAST 会有一个 field 专门存放类型信息.

Parser Basics

我们需要使用一个 parser 来构造 AST. 例如, 我们想解析 x + y 这类的表达式, 并生成一个 AST 时，我们可以进行如下的操作:

1
2
3

auto LHS = std::make_unique<VariableExprAST>("x");
auto RHS = std::make_unique<VariableExprAST>("y");
auto Result = std::make_unique<BinaryExprAST>('+', std::move(LHS), std::move(RHS));

为了实现上述操作，我们先定义一些基本的辅助过程:

// CurTok - CurTok is the current token the parser is looking at.
static int CurTok;
// getNextToken - getNextToken reads another token from the lexer and update CurTok with its values.
int getNextToken() {
    return CurTok = gettok();
}

这在 lexer 的基础上实现了一个简单的 token buffer. 使得我们能够保存 lexer 返回的 token. Parser 中的函数都假定 static int CurTok 是当前需要被处理的 token.

然后是一些简单的错误处理函数:

// LogError* - These are little helper functions for error handling
std::unique_ptr<ExprAST> LogError(const char *Str) {
    fprintf(stderr, "LogError: %s\n", Str);
    return nullptr;
}
std::unique_ptr<PrototypeAST> LogErrorP(const char *Str) {
    LogError(Str);
    return nullptr;
}

Basic Expression Parsing

我们首先处理数字常量. 对每种不同的表达式，我们都定义单独的函数进行处理. 下面是对数字常量进行处理的函数:

// numberexpr ::= number
std::unique_ptr<ExprAST> ParseNumberExpr() {
    auto Result = std::make_unique<NumberExprAST> (NumVal);
    // consume the number
    getNextToken();
    return std::move(Result);
}

当 gettok() 返回 tok_number 时它会被调用. 此时 static int NumVal 已被设置为对应的值, 因此 ParseNumberExpr() 构造一个 NumberExprAST 的节点, 并使 lexer 获取下一个 token, 然后返回构造出的节点.

所以, 递归下降解析器的特点就是: 将当前表达式的 token 全部获取, 构造出表达式节点. 然后使 lexer 获取下一个 token, 再返回构造出的表达式节点.

下面是一个带括号的表达式的例子:

// parenexpr ::= '(' expression ')'
std::unique_ptr<ExprAST> ParseParenExpr() {
    // eat '('
    getNextToken();
    auto V = ParseExpression();
    if (!V) {
        return nullptr;
    }
    if (CurTok != ')') {
        return LogError("Expected ')'");
    }
    // eat ')'
    getNextToken();
    return V;
}

这个函数说明了以下几个方法:

告诉我们怎么使用 LogError 函数来打印错误.
在处理过程中, 递归调用 ParseExpression 来进行表达式的处理. ParseExpression 调用不同的 parser 函数来对不同的表达式进行处理. 这种递归调用使得我们可以实现递归语法.

接下来是变量引用和函数调用的 parser 函数:

// identifierexpr
//       :: = identifier
//       :: = identifier '(' expression* ')'
std::unique_ptr<ExprAST> ParseIdentifierExpr() {
    std::string IdName = IdentifierStr;
    // eat identifier
    getNextToken();

    if (CurTok != '(') {
        // It's a variable ref
        return std::make_unique<VariableExprAST>(IdName);
    }

    // It's a functional call

    // eat '('
    getNextToken();
    std::vector<std::unique_ptr<ExprAST>> Args;

    if (CurTok != ')') {
        while (1)
        {
            if (auto Arg = ParseExpression()) {
                Args.push_back(std::move(Arg));
            } else {
                return nullptr;
            }

            if (CurTok == ')') {
                break;
            }
            if (CurTok != ',') {
                return LogError("Expected ')' or ',' in argument list");
            }

            getNextToken();
        }
    }
    // eat '')
    getNextToken();

    return std::make_unique<CallExprAST>(IdName, std::move(Args));
}

这个函数和上面的函数类似, 在 gettok() 返回 tok-identifier 时被调用. 它也有递归处理和错误处理. 它使用 lookahead 来判断这是一个简单的变量引用或者是一个函数调用. 在消耗掉第一个 identifier 之后, 它会看看后面是不是一个 (, 如果是的话, 那就构造一个参数列表, 并返回一个 class CallExprAst 节点; 如果不是, 那就直接返回一个 class VariableExprAST 节点.

现在我们有了一个简单的 expression-parser 逻辑. 我们可以定义一个辅助的 primary parser 函数来 wrap 这些 parser 函数. 这个 primary 函数能够根据 static int CurTok 来调用对应的 parser.

// primary
//      :: = identifierexpr
//      :: = numberexpr
//      :: = parenexpr
std::unique_ptr<ExprAST> ParsePrimary() {
    switch (CurTok)
    {
    case tok_identifier:
        return ParseIdentifierExpr();
    case tok_number:
        return ParseNumberExpr();
    case ')':
        return ParseParenExpr();
    default:
        return LogError("unknown token when expecting an expression")
    }
}

Binary Expression Parsing

由于二义性，二元表达式更加难以处理. 比如说当给定一个字符串 x + y * z, parser 可以选择解释为 (x + y) * z 也可以解释为 x + (y * z). 不过一般来讲, 符合常规的是第二种做法.

所以我们需要使用运算符优先级来处理这个问题, Operator-Precedence Parsing 使用运算符的优先级来告诉我们递归的顺序, 即首先处理哪个表达式. 首先我们定义一个运算符优先级表:

// GetTokPrecedence - Get the precedence of the pending binary operator token.
int GetTokPrecedence() {
    if (!isascii(CurTok)) {
        return -1;
    }

    int TokPrec = BinOpPrecedence[CurTok];
    // Make sure it's a declared binOp
    if (TokPrec <= 0) {
        return -1;
    }
    return TokPrec;
}

// InitTokPrecedence - Install standard binary operators with their precedence
void initTokPrecedence() {
    // Install standard binary operators
    BinOpPrecedence['<'] = 10;
    BinOpPrecedence['+'] = 20;
    BinOpPrecedence['-'] = 20;
    BinOpPrecedence['*'] = 40;
}

在最基本的 Kaleidoscope 中, 我们只支持 4 个二元操作符. 不过这个可以按需添加. GetTokPrecedence() 会返回当前 static int CurTok 的优先级.

当定义完上述的辅助函数之后, 我们就可以开始解析二元表达式. 操作符优先级解析器的思路就是, 将具有二义性的表达式拆分成不同的部分. 例如, 针对表达式 a + b + (c + d) * e * f + g, 我们可以将它看作被二元操作符分割的 primary 表达式流 (a stream of primary expressions separated by binary operators). 就这样, 首先处理第一个表达式 a, 然后就是 [+, b], [+, (c + d)], [*, e], [*, f] and [+, g].

所以, 一个表达式就是一个 primary 表达式，加上后面可能存在的 [binOp, primary expr] 序列.

// expression
//      :: = primary binOpRHS
std::unique_ptr<ExprAST> ParseExpression() {
    auto LHS = ParsePrimary();
    if (!LHS)
    {
        return nullptr;
    }
    return ParseBinOpRHS(0, std::move(LHS));
}

ParseBinOpRHS() 能够解析 [binOp, primary expr] 序列流. 它的参数是优先级 (上述例子为 0), 和目前已经被解析的部分 (上述例子为 LHS). 以表达式 x 为例, 它被划分为 x 和一个空的 binOpRHS. 以 a + b + (c + d) * e * f + g 为例, a 是一个合法的 primary 表达式，即 LHS, 同时 binOpRHS 为 + b + (c + d) * e * f + g, 当前的 static int CurTok 是 +.

传给 ParseBinOpRHS() 的第一个参数是优先级值, 它表明了这个函数能处理的最小操作符优先级. 如果当前的 pair stream 是 [+, x] 并且第一个传入的参数为 40, 那它不会再消耗 tokens (因为 + 的优先级为 20 小于 40), 所以:

// binOpRHS
//      :: = ( binOp primary)*
std::unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec, std::unique_ptr<ExprAST> LHS)
{
    // if this is a binOp, find its precedence
    while (1)
    {
        int TokPrec = GetTokPrecedence();

        // if this is a binOp that binds at least 
        // as tightly as the current binOP, 
        // consume it, otherwise we are done.
        if (TokPrec < ExprPrec) {
            return LHS;
        }

        // we know it's a binOP
        int BinOp = CurTok;
        // eat binOp
        getNextToken();

        // parse the primary expression after the binOP.
        auto RHS = ParsePrimary();
        if (!RHS) {
            return nullptr;
        }

        // If binOP binds less tightly with RHS than the
        // operator after RHS, let the pending operator take
        // RHS and its LHS
        int NextPrec = GetTokPrecedence();
        if (TokPrec < NextPrec) {
            // a + b * c -> a + (b * c)
            RHS = ParseBinOpRHS(TokPrec + 1, std::move(RHS));
            if (!RHS) {
                return nullptr;
            }
        }

        // a + b + c -> (a + b) + c
        LHS = std::make_unique<BinaryExprAST>(BinOp, std::move(LHS), std::move(RHS));

        // loop until the ends of expr.
    }
}

ParseBinOpRHS() 首先比较当前 token 的优先级和传入参数的优先级. 如果当前 binOp 优先级, TokPrec 低于传入参数的优先级, 那么就说明 binOp 左边的 primary 表达式要优先计算, 直接返回 LHS 即可. 若 binOP 的优先级不低于传入参数, 那么就使用 int BinOp 持有当前的这个 binOp, 并调用 getNextToken() 处理好当前 binOp 右边的这个 primary 表达式, 并使用 RHS 持有引用，所以现在 static int CurTok 表示的是下一个 binOp, 它的优先级是 NextPrec. 然后, 需要判断上面的这个 RHS 是和 LHS 结合的紧还是和下一个 primary 表达式结合的紧. 如果 NextPrec 优先级大于 TokPrec, 那么 RHS 和它右边的 primary 表达式结合的紧, 所以递归调用 ParseBinOpRHS(), 第一个参数为 TokPrec + 1, 第二个为 RHS, 可以将高优先级的序列先处理完毕, 并作为 RHS 返回. 若 NextPrec 不大于 TokPrec 那么, LHS 和 RHS 结合得更紧, 那么先算出 LHS 和 RHS, 之后再将它们的结果作为 LHS 循环算出后面的值.

在遇到优先级高的 binOP, 递归调用 ParseBinOpRHS() 时, 传的参数一定是 TokPrec + 1 才能在第一个条件判断中退出.

Parsing the Rest

然后我们要处理函数原型, 和函数定义. 函数原型如下:

// prototype
//      :: = id '(' id* ')'
std::unique_ptr<PrototypeAST> ParsePrototype() {
    if (CurTok != tok_identifier)
    {
        return LogErrorP("Expected function name in prototype");
    }

    std::string FnName = IdentifierStr;
    // eat the function name
    getNextToken();

    if (CurTok != '(') {
        return LogErrorP("Expected '(' in prototype");
    }

    // feed the arg list into vector
    std::vector<std::string> ArgNames;
    while (getNextToken() == tok_identifier)
    {
        ArgNames.push_back(IdentifierStr);
    }

    if (CurTok != ')')
        return LogErrorP("Expected ')' in prototype");

    // eat ')'
    getNextToken();

    return std::make_unique<PrototypeAST>(FnName, std::move(ArgNames));
}

函数定义就是函数原型加上函数体:

// function definition
//          :: = 'def' prototype expression
std::unique_ptr<FunctionAST> ParseDefinition() {
    // eat def
    getNextToken();

    auto Proto = ParsePrototype();
    if (!Proto) {
        return nullptr;
    }

    if (auto E = ParseExpression()) {
        return std::make_unique<FunctionAST>(std::move(Proto), std::move(E));
    }
    return nullptr;
}

并且我们支持使用 extern 关键字进行前向声明:

// external forward declaration
//          :: = 'extern' prototype
std::unique_ptr<PrototypeAST> ParseExtern() {
    // eat 'extern'
    getNextToken();
    return ParsePrototype();
}

最后我们允许用户使用任意的顶级 (非嵌套) 表达式 (top-level expressions), 我们使用匿名零元函数 (anonymous nullary function) 来处理:

// toplevelexpr
//      :: = expression
std::unique_ptr<FunctionAST> ParseTopLevelExpr() {
    if (auto E = ParseExpression()) {
        // anonymous nullary function
        auto Proto = std::make_unique<PrototypeAST>("", std::vector<std::string>());

        return std::make_unique<FunctionAST>(std::move(Proto), std::move(E));
    }
    return nullptr;
}

The Driver

Driver 就是使用循环来调用不同的 parser 函数 (dispatch loop):

void HandleDefinition() {
    if (ParseDefinition()) {
        fprintf(stderr, "Parsed a function definition.\n");
    } else {
        // Skip token for error recovery
        getNextToken();
    }
}

void HandleExtern() {
    if (ParseExtern()) {
        fprintf(stderr, "Parsed an extern.\n");
    } else {
        // Skip token for error recovery
        getNextToken();
    }
}

void HandleTopLevelExpression() {
    if (ParseTopLevelExpr()) {
        fprintf(stderr, "Parsed a top-level expression.\n");
    } else {
        // Skip token for error recovery
        getNextToken();
    }
}

// top ::= definition | external | expression | ';'
void MainLoop() {
    while (1) {
        fprintf(stderr, "ready> ");
        switch (CurTok) {
        case ';' : // ignore top-level semicolons 
            getNextToken();
            break;
        case tok_def :
            HandleDefinition();
            break;
        case tok_extern :
            HandleExtern();
            break;
        case tok_eof :
            return;
        default:
            HandleTopLevelExpression();
            break;
        }
    }
}