1 ===========================================
2 Kaleidoscope: Implementing a Parser and AST
3 ===========================================
11 Welcome to Chapter 2 of the "`Implementing a language with
12 LLVM <index.html>`_" tutorial. This chapter shows you how to use the
13 lexer, built in `Chapter 1 <LangImpl1.html>`_, to build a full
14 `parser <http://en.wikipedia.org/wiki/Parsing>`_ for our Kaleidoscope
15 language. Once we have a parser, we'll define and build an `Abstract
16 Syntax Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).
18 The parser we will build uses a combination of `Recursive Descent
19 Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
21 Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
22 parse the Kaleidoscope language (the latter for binary expressions and
23 the former for everything else). Before we get to parsing though, lets
24 talk about the output of the parser: the Abstract Syntax Tree.
26 The Abstract Syntax Tree (AST)
27 ==============================
29 The AST for a program captures its behavior in such a way that it is
30 easy for later stages of the compiler (e.g. code generation) to
31 interpret. We basically want one object for each construct in the
32 language, and the AST should closely model the language. In
33 Kaleidoscope, we have expressions, a prototype, and a function object.
34 We'll start with expressions first:
38 /// ExprAST - Base class for all expression nodes.
44 /// NumberExprAST - Expression class for numeric literals like "1.0".
45 class NumberExprAST : public ExprAST {
48 NumberExprAST(double Val) : Val(Val) {}
51 The code above shows the definition of the base ExprAST class and one
52 subclass which we use for numeric literals. The important thing to note
53 about this code is that the NumberExprAST class captures the numeric
54 value of the literal as an instance variable. This allows later phases
55 of the compiler to know what the stored numeric value is.
57 Right now we only create the AST, so there are no useful accessor
58 methods on them. It would be very easy to add a virtual method to pretty
59 print the code, for example. Here are the other expression AST node
60 definitions that we'll use in the basic form of the Kaleidoscope
65 /// VariableExprAST - Expression class for referencing a variable, like "a".
66 class VariableExprAST : public ExprAST {
69 VariableExprAST(const std::string &Name) : Name(Name) {}
72 /// BinaryExprAST - Expression class for a binary operator.
73 class BinaryExprAST : public ExprAST {
75 std::unique_ptr<ExprAST> LHS, RHS;
77 BinaryExprAST(char op, std::unique_ptr<ExprAST> LHS,
78 std::unique_ptr<ExprAST> RHS)
79 : Op(op), LHS(std::move(LHS)), RHS(std::move(RHS)) {}
82 /// CallExprAST - Expression class for function calls.
83 class CallExprAST : public ExprAST {
85 std::vector<ExprAST*> Args;
87 CallExprAST(const std::string &Callee,
88 std::vector<std::unique_ptr<ExprAST>> Args)
89 : Callee(Callee), Args(std::move(Args)) {}
92 This is all (intentionally) rather straight-forward: variables capture
93 the variable name, binary operators capture their opcode (e.g. '+'), and
94 calls capture a function name as well as a list of any argument
95 expressions. One thing that is nice about our AST is that it captures
96 the language features without talking about the syntax of the language.
97 Note that there is no discussion about precedence of binary operators,
98 lexical structure, etc.
100 For our basic language, these are all of the expression nodes we'll
101 define. Because it doesn't have conditional control flow, it isn't
102 Turing-complete; we'll fix that in a later installment. The two things
103 we need next are a way to talk about the interface to a function, and a
104 way to talk about functions themselves:
108 /// PrototypeAST - This class represents the "prototype" for a function,
109 /// which captures its name, and its argument names (thus implicitly the number
110 /// of arguments the function takes).
113 std::vector<std::string> Args;
115 PrototypeAST(const std::string &name, std::vector<std::string> Args)
116 : Name(name), Args(std::move(Args)) {}
119 /// FunctionAST - This class represents a function definition itself.
121 std::unique_ptr<PrototypeAST> Proto;
122 std::unique_ptr<ExprAST> Body;
124 FunctionAST(std::unique_ptr<PrototypeAST> Proto,
125 std::unique_ptr<ExprAST> Body)
126 : Proto(std::move(Proto)), Body(std::move(Body)) {}
129 In Kaleidoscope, functions are typed with just a count of their
130 arguments. Since all values are double precision floating point, the
131 type of each argument doesn't need to be stored anywhere. In a more
132 aggressive and realistic language, the "ExprAST" class would probably
135 With this scaffolding, we can now talk about parsing expressions and
136 function bodies in Kaleidoscope.
141 Now that we have an AST to build, we need to define the parser code to
142 build it. The idea here is that we want to parse something like "x+y"
143 (which is returned as three tokens by the lexer) into an AST that could
144 be generated with calls like this:
148 auto LHS = llvm::make_unique<VariableExprAST>("x");
149 auto RHS = llvm::make_unique<VariableExprAST>("y");
150 auto Result = std::make_unique<BinaryExprAST>('+', std::move(LHS),
153 In order to do this, we'll start by defining some basic helper routines:
157 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
158 /// token the parser is looking at. getNextToken reads another token from the
159 /// lexer and updates CurTok with its results.
161 static int getNextToken() {
162 return CurTok = gettok();
165 This implements a simple token buffer around the lexer. This allows us
166 to look one token ahead at what the lexer is returning. Every function
167 in our parser will assume that CurTok is the current token that needs to
173 /// Error* - These are little helper functions for error handling.
174 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
175 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
176 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
178 The ``Error`` routines are simple helper routines that our parser will
179 use to handle errors. The error recovery in our parser will not be the
180 best and is not particular user-friendly, but it will be enough for our
181 tutorial. These routines make it easier to handle errors in routines
182 that have various return types: they always return null.
184 With these basic helper functions, we can implement the first piece of
185 our grammar: numeric literals.
187 Basic Expression Parsing
188 ========================
190 We start with numeric literals, because they are the simplest to
191 process. For each production in our grammar, we'll define a function
192 which parses that production. For numeric literals, we have:
196 /// numberexpr ::= number
197 static std::unique_ptr<ExprAST> ParseNumberExpr() {
198 auto Result = llvm::make_unique<NumberExprAST>(NumVal);
199 getNextToken(); // consume the number
200 return std::move(Result);
203 This routine is very simple: it expects to be called when the current
204 token is a ``tok_number`` token. It takes the current number value,
205 creates a ``NumberExprAST`` node, advances the lexer to the next token,
208 There are some interesting aspects to this. The most important one is
209 that this routine eats all of the tokens that correspond to the
210 production and returns the lexer buffer with the next token (which is
211 not part of the grammar production) ready to go. This is a fairly
212 standard way to go for recursive descent parsers. For a better example,
213 the parenthesis operator is defined like this:
217 /// parenexpr ::= '(' expression ')'
218 static std::unique_ptr<ExprAST> ParseParenExpr() {
219 getNextToken(); // eat (.
220 auto V = ParseExpression();
221 if (!V) return nullptr;
224 return Error("expected ')'");
225 getNextToken(); // eat ).
229 This function illustrates a number of interesting things about the
232 1) It shows how we use the Error routines. When called, this function
233 expects that the current token is a '(' token, but after parsing the
234 subexpression, it is possible that there is no ')' waiting. For example,
235 if the user types in "(4 x" instead of "(4)", the parser should emit an
236 error. Because errors can occur, the parser needs a way to indicate that
237 they happened: in our parser, we return null on an error.
239 2) Another interesting aspect of this function is that it uses recursion
240 by calling ``ParseExpression`` (we will soon see that
241 ``ParseExpression`` can call ``ParseParenExpr``). This is powerful
242 because it allows us to handle recursive grammars, and keeps each
243 production very simple. Note that parentheses do not cause construction
244 of AST nodes themselves. While we could do it this way, the most
245 important role of parentheses are to guide the parser and provide
246 grouping. Once the parser constructs the AST, parentheses are not
249 The next simple production is for handling variable references and
256 /// ::= identifier '(' expression* ')'
257 static std::unique_ptr<ExprAST> ParseIdentifierExpr() {
258 std::string IdName = IdentifierStr;
260 getNextToken(); // eat identifier.
262 if (CurTok != '(') // Simple variable ref.
263 return llvm::make_unique<VariableExprAST>(IdName);
266 getNextToken(); // eat (
267 std::vector<std::unique_ptr<ExprAST>> Args;
270 auto Arg = ParseExpression();
271 if (!Arg) return nullptr;
272 Args.push_back(std::move(Arg));
274 if (CurTok == ')') break;
277 return Error("Expected ')' or ',' in argument list");
285 return llvm::make_unique<CallExprAST>(IdName, std::move(Args));
288 This routine follows the same style as the other routines. (It expects
289 to be called if the current token is a ``tok_identifier`` token). It
290 also has recursion and error handling. One interesting aspect of this is
291 that it uses *look-ahead* to determine if the current identifier is a
292 stand alone variable reference or if it is a function call expression.
293 It handles this by checking to see if the token after the identifier is
294 a '(' token, constructing either a ``VariableExprAST`` or
295 ``CallExprAST`` node as appropriate.
297 Now that we have all of our simple expression-parsing logic in place, we
298 can define a helper function to wrap it together into one entry point.
299 We call this class of expressions "primary" expressions, for reasons
300 that will become more clear `later in the
301 tutorial <LangImpl6.html#unary>`_. In order to parse an arbitrary
302 primary expression, we need to determine what sort of expression it is:
307 /// ::= identifierexpr
310 static std::unique_ptr<ExprAST> ParsePrimary() {
312 default: return Error("unknown token when expecting an expression");
313 case tok_identifier: return ParseIdentifierExpr();
314 case tok_number: return ParseNumberExpr();
315 case '(': return ParseParenExpr();
319 Now that you see the definition of this function, it is more obvious why
320 we can assume the state of CurTok in the various functions. This uses
321 look-ahead to determine which sort of expression is being inspected, and
322 then parses it with a function call.
324 Now that basic expressions are handled, we need to handle binary
325 expressions. They are a bit more complex.
327 Binary Expression Parsing
328 =========================
330 Binary expressions are significantly harder to parse because they are
331 often ambiguous. For example, when given the string "x+y\*z", the parser
332 can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common
333 definitions from mathematics, we expect the later parse, because "\*"
334 (multiplication) has higher *precedence* than "+" (addition).
336 There are many ways to handle this, but an elegant and efficient way is
337 to use `Operator-Precedence
338 Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
339 This parsing technique uses the precedence of binary operators to guide
340 recursion. To start with, we need a table of precedences:
344 /// BinopPrecedence - This holds the precedence for each binary operator that is
346 static std::map<char, int> BinopPrecedence;
348 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
349 static int GetTokPrecedence() {
350 if (!isascii(CurTok))
353 // Make sure it's a declared binop.
354 int TokPrec = BinopPrecedence[CurTok];
355 if (TokPrec <= 0) return -1;
360 // Install standard binary operators.
361 // 1 is lowest precedence.
362 BinopPrecedence['<'] = 10;
363 BinopPrecedence['+'] = 20;
364 BinopPrecedence['-'] = 20;
365 BinopPrecedence['*'] = 40; // highest.
369 For the basic form of Kaleidoscope, we will only support 4 binary
370 operators (this can obviously be extended by you, our brave and intrepid
371 reader). The ``GetTokPrecedence`` function returns the precedence for
372 the current token, or -1 if the token is not a binary operator. Having a
373 map makes it easy to add new operators and makes it clear that the
374 algorithm doesn't depend on the specific operators involved, but it
375 would be easy enough to eliminate the map and do the comparisons in the
376 ``GetTokPrecedence`` function. (Or just use a fixed-size array).
378 With the helper above defined, we can now start parsing binary
379 expressions. The basic idea of operator precedence parsing is to break
380 down an expression with potentially ambiguous binary operators into
381 pieces. Consider ,for example, the expression "a+b+(c+d)\*e\*f+g".
382 Operator precedence parsing considers this as a stream of primary
383 expressions separated by binary operators. As such, it will first parse
384 the leading primary expression "a", then it will see the pairs [+, b]
385 [+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are
386 primary expressions, the binary expression parser doesn't need to worry
387 about nested subexpressions like (c+d) at all.
389 To start, an expression is a primary expression potentially followed by
390 a sequence of [binop,primaryexpr] pairs:
395 /// ::= primary binoprhs
397 static std::unique_ptr<ExprAST> ParseExpression() {
398 auto LHS = ParsePrimary();
399 if (!LHS) return nullptr;
401 return ParseBinOpRHS(0, std::move(LHS));
404 ``ParseBinOpRHS`` is the function that parses the sequence of pairs for
405 us. It takes a precedence and a pointer to an expression for the part
406 that has been parsed so far. Note that "x" is a perfectly valid
407 expression: As such, "binoprhs" is allowed to be empty, in which case it
408 returns the expression that is passed into it. In our example above, the
409 code passes the expression for "a" into ``ParseBinOpRHS`` and the
410 current token is "+".
412 The precedence value passed into ``ParseBinOpRHS`` indicates the
413 *minimal operator precedence* that the function is allowed to eat. For
414 example, if the current pair stream is [+, x] and ``ParseBinOpRHS`` is
415 passed in a precedence of 40, it will not consume any tokens (because
416 the precedence of '+' is only 20). With this in mind, ``ParseBinOpRHS``
422 /// ::= ('+' primary)*
423 static std::unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec,
424 std::unique_ptr<ExprAST> LHS) {
425 // If this is a binop, find its precedence.
427 int TokPrec = GetTokPrecedence();
429 // If this is a binop that binds at least as tightly as the current binop,
430 // consume it, otherwise we are done.
431 if (TokPrec < ExprPrec)
434 This code gets the precedence of the current token and checks to see if
435 if is too low. Because we defined invalid tokens to have a precedence of
436 -1, this check implicitly knows that the pair-stream ends when the token
437 stream runs out of binary operators. If this check succeeds, we know
438 that the token is a binary operator and that it will be included in this
443 // Okay, we know this is a binop.
445 getNextToken(); // eat binop
447 // Parse the primary expression after the binary operator.
448 auto RHS = ParsePrimary();
449 if (!RHS) return nullptr;
451 As such, this code eats (and remembers) the binary operator and then
452 parses the primary expression that follows. This builds up the whole
453 pair, the first of which is [+, b] for the running example.
455 Now that we parsed the left-hand side of an expression and one pair of
456 the RHS sequence, we have to decide which way the expression associates.
457 In particular, we could have "(a+b) binop unparsed" or "a + (b binop
458 unparsed)". To determine this, we look ahead at "binop" to determine its
459 precedence and compare it to BinOp's precedence (which is '+' in this
464 // If BinOp binds less tightly with RHS than the operator after RHS, let
465 // the pending operator take RHS as its LHS.
466 int NextPrec = GetTokPrecedence();
467 if (TokPrec < NextPrec) {
469 If the precedence of the binop to the right of "RHS" is lower or equal
470 to the precedence of our current operator, then we know that the
471 parentheses associate as "(a+b) binop ...". In our example, the current
472 operator is "+" and the next operator is "+", we know that they have the
473 same precedence. In this case we'll create the AST node for "a+b", and
474 then continue parsing:
478 ... if body omitted ...
482 LHS = llvm::make_unique<BinaryExprAST>(BinOp, std::move(LHS),
484 } // loop around to the top of the while loop.
487 In our example above, this will turn "a+b+" into "(a+b)" and execute the
488 next iteration of the loop, with "+" as the current token. The code
489 above will eat, remember, and parse "(c+d)" as the primary expression,
490 which makes the current pair equal to [+, (c+d)]. It will then evaluate
491 the 'if' conditional above with "\*" as the binop to the right of the
492 primary. In this case, the precedence of "\*" is higher than the
493 precedence of "+" so the if condition will be entered.
495 The critical question left here is "how can the if condition parse the
496 right hand side in full"? In particular, to build the AST correctly for
497 our example, it needs to get all of "(c+d)\*e\*f" as the RHS expression
498 variable. The code to do this is surprisingly simple (code from the
499 above two blocks duplicated for context):
503 // If BinOp binds less tightly with RHS than the operator after RHS, let
504 // the pending operator take RHS as its LHS.
505 int NextPrec = GetTokPrecedence();
506 if (TokPrec < NextPrec) {
507 RHS = ParseBinOpRHS(TokPrec+1, std::move(RHS));
508 if (RHS == 0) return 0;
511 LHS = llvm::make_unique<BinaryExprAST>(BinOp, std::move(LHS),
513 } // loop around to the top of the while loop.
516 At this point, we know that the binary operator to the RHS of our
517 primary has higher precedence than the binop we are currently parsing.
518 As such, we know that any sequence of pairs whose operators are all
519 higher precedence than "+" should be parsed together and returned as
520 "RHS". To do this, we recursively invoke the ``ParseBinOpRHS`` function
521 specifying "TokPrec+1" as the minimum precedence required for it to
522 continue. In our example above, this will cause it to return the AST
523 node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of the '+'
526 Finally, on the next iteration of the while loop, the "+g" piece is
527 parsed and added to the AST. With this little bit of code (14
528 non-trivial lines), we correctly handle fully general binary expression
529 parsing in a very elegant way. This was a whirlwind tour of this code,
530 and it is somewhat subtle. I recommend running through it with a few
531 tough examples to see how it works.
533 This wraps up handling of expressions. At this point, we can point the
534 parser at an arbitrary token stream and build an expression from it,
535 stopping at the first token that is not part of the expression. Next up
536 we need to handle function definitions, etc.
541 The next thing missing is handling of function prototypes. In
542 Kaleidoscope, these are used both for 'extern' function declarations as
543 well as function body definitions. The code to do this is
544 straight-forward and not very interesting (once you've survived
550 /// ::= id '(' id* ')'
551 static std::unique_ptr<PrototypeAST> ParsePrototype() {
552 if (CurTok != tok_identifier)
553 return ErrorP("Expected function name in prototype");
555 std::string FnName = IdentifierStr;
559 return ErrorP("Expected '(' in prototype");
561 // Read the list of argument names.
562 std::vector<std::string> ArgNames;
563 while (getNextToken() == tok_identifier)
564 ArgNames.push_back(IdentifierStr);
566 return ErrorP("Expected ')' in prototype");
569 getNextToken(); // eat ')'.
571 return llvm::make_unique<PrototypeAST>(FnName, std::move(ArgNames));
574 Given this, a function definition is very simple, just a prototype plus
575 an expression to implement the body:
579 /// definition ::= 'def' prototype expression
580 static std::unique_ptr<FunctionAST> ParseDefinition() {
581 getNextToken(); // eat def.
582 auto Proto = ParsePrototype();
583 if (!Proto) return nullptr;
585 if (auto E = ParseExpression())
586 return llvm::make_unique<FunctionAST>(std::move(Proto), std::move(E));
590 In addition, we support 'extern' to declare functions like 'sin' and
591 'cos' as well as to support forward declaration of user functions. These
592 'extern's are just prototypes with no body:
596 /// external ::= 'extern' prototype
597 static std::unique_ptr<PrototypeAST> ParseExtern() {
598 getNextToken(); // eat extern.
599 return ParsePrototype();
602 Finally, we'll also let the user type in arbitrary top-level expressions
603 and evaluate them on the fly. We will handle this by defining anonymous
604 nullary (zero argument) functions for them:
608 /// toplevelexpr ::= expression
609 static std::unique_ptr<FunctionAST> ParseTopLevelExpr() {
610 if (auto E = ParseExpression()) {
611 // Make an anonymous proto.
612 auto Proto = llvm::make_unique<PrototypeAST>("", std::vector<std::string>());
613 return llvm::make_unique<FunctionAST>(std::move(Proto), std::move(E));
618 Now that we have all the pieces, let's build a little driver that will
619 let us actually *execute* this code we've built!
624 The driver for this simply invokes all of the parsing pieces with a
625 top-level dispatch loop. There isn't much interesting here, so I'll just
626 include the top-level loop. See `below <#code>`_ for full code in the
627 "Top-Level Parsing" section.
631 /// top ::= definition | external | expression | ';'
632 static void MainLoop() {
634 fprintf(stderr, "ready> ");
636 case tok_eof: return;
637 case ';': getNextToken(); break; // ignore top-level semicolons.
638 case tok_def: HandleDefinition(); break;
639 case tok_extern: HandleExtern(); break;
640 default: HandleTopLevelExpression(); break;
645 The most interesting part of this is that we ignore top-level
646 semicolons. Why is this, you ask? The basic reason is that if you type
647 "4 + 5" at the command line, the parser doesn't know whether that is the
648 end of what you will type or not. For example, on the next line you
649 could type "def foo..." in which case 4+5 is the end of a top-level
650 expression. Alternatively you could type "\* 6", which would continue
651 the expression. Having top-level semicolons allows you to type "4+5;",
652 and the parser will know you are done.
657 With just under 400 lines of commented code (240 lines of non-comment,
658 non-blank code), we fully defined our minimal language, including a
659 lexer, parser, and AST builder. With this done, the executable will
660 validate Kaleidoscope code and tell us if it is grammatically invalid.
661 For example, here is a sample interaction:
666 ready> def foo(x y) x+foo(y, 4.0);
667 Parsed a function definition.
668 ready> def foo(x y) x+y y;
669 Parsed a function definition.
670 Parsed a top-level expr
671 ready> def foo(x y) x+y );
672 Parsed a function definition.
673 Error: unknown token when expecting an expression
674 ready> extern sin(a);
675 ready> Parsed an extern
679 There is a lot of room for extension here. You can define new AST nodes,
680 extend the language in many ways, etc. In the `next
681 installment <LangImpl3.html>`_, we will describe how to generate LLVM
682 Intermediate Representation (IR) from the AST.
687 Here is the complete code listing for this and the previous chapter.
688 Note that it is fully self-contained: you don't need LLVM or any
689 external libraries at all for this. (Besides the C and C++ standard
690 libraries, of course.) To build this, just compile with:
695 clang++ -g -O3 toy.cpp
701 .. literalinclude:: ../../examples/Kaleidoscope/Chapter2/toy.cpp
704 `Next: Implementing Code Generation to LLVM IR <LangImpl3.html>`_