docs/tutorial/LangImpl2.rst

   1 ===========================================
   2 Kaleidoscope: Implementing a Parser and AST
   3 ===========================================
   4
   5 .. contents::
   6    :local:
   7
   8 Written by `Chris Lattner <mailto:sabre@nondot.org>`_
   9
  10 Chapter 2 Introduction
  11 ======================
  12
  13 Welcome to Chapter 2 of the "`Implementing a language with
  14 LLVM <index.html>`_" tutorial. This chapter shows you how to use the
  15 lexer, built in `Chapter 1 <LangImpl1.html>`_, to build a full
  16 `parser <http://en.wikipedia.org/wiki/Parsing>`_ for our Kaleidoscope
  17 language. Once we have a parser, we'll define and build an `Abstract
  18 Syntax Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).
  19
  20 The parser we will build uses a combination of `Recursive Descent
  21 Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
  22 `Operator-Precedence
  23 Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
  24 parse the Kaleidoscope language (the latter for binary expressions and
  25 the former for everything else). Before we get to parsing though, lets
  26 talk about the output of the parser: the Abstract Syntax Tree.
  27
  28 The Abstract Syntax Tree (AST)
  29 ==============================
  30
  31 The AST for a program captures its behavior in such a way that it is
  32 easy for later stages of the compiler (e.g. code generation) to
  33 interpret. We basically want one object for each construct in the
  34 language, and the AST should closely model the language. In
  35 Kaleidoscope, we have expressions, a prototype, and a function object.
  36 We'll start with expressions first:
  37
  38 .. code-block:: c++
  39
  40     /// ExprAST - Base class for all expression nodes.
  41     class ExprAST {
  42     public:
  43       virtual ~ExprAST() {}
  44     };
  45
  46     /// NumberExprAST - Expression class for numeric literals like "1.0".
  47     class NumberExprAST : public ExprAST {
  48       double Val;
  49     public:
  50       NumberExprAST(double val) : Val(val) {}
  51     };
  52
  53 The code above shows the definition of the base ExprAST class and one
  54 subclass which we use for numeric literals. The important thing to note
  55 about this code is that the NumberExprAST class captures the numeric
  56 value of the literal as an instance variable. This allows later phases
  57 of the compiler to know what the stored numeric value is.
  58
  59 Right now we only create the AST, so there are no useful accessor
  60 methods on them. It would be very easy to add a virtual method to pretty
  61 print the code, for example. Here are the other expression AST node
  62 definitions that we'll use in the basic form of the Kaleidoscope
  63 language:
  64
  65 .. code-block:: c++
  66
  67     /// VariableExprAST - Expression class for referencing a variable, like "a".
  68     class VariableExprAST : public ExprAST {
  69       std::string Name;
  70     public:
  71       VariableExprAST(const std::string &name) : Name(name) {}
  72     };
  73
  74     /// BinaryExprAST - Expression class for a binary operator.
  75     class BinaryExprAST : public ExprAST {
  76       char Op;
  77       ExprAST *LHS, *RHS;
  78     public:
  79       BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
  80         : Op(op), LHS(lhs), RHS(rhs) {}
  81     };
  82
  83     /// CallExprAST - Expression class for function calls.
  84     class CallExprAST : public ExprAST {
  85       std::string Callee;
  86       std::vector<ExprAST*> Args;
  87     public:
  88       CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
  89         : Callee(callee), Args(args) {}
  90     };
  91
  92 This is all (intentionally) rather straight-forward: variables capture
  93 the variable name, binary operators capture their opcode (e.g. '+'), and
  94 calls capture a function name as well as a list of any argument
  95 expressions. One thing that is nice about our AST is that it captures
  96 the language features without talking about the syntax of the language.
  97 Note that there is no discussion about precedence of binary operators,
  98 lexical structure, etc.
  99
 100 For our basic language, these are all of the expression nodes we'll
 101 define. Because it doesn't have conditional control flow, it isn't
 102 Turing-complete; we'll fix that in a later installment. The two things
 103 we need next are a way to talk about the interface to a function, and a
 104 way to talk about functions themselves:
 105
 106 .. code-block:: c++
 107
 108     /// PrototypeAST - This class represents the "prototype" for a function,
 109     /// which captures its name, and its argument names (thus implicitly the number
 110     /// of arguments the function takes).
 111     class PrototypeAST {
 112       std::string Name;
 113       std::vector<std::string> Args;
 114     public:
 115       PrototypeAST(const std::string &name, const std::vector<std::string> &args)
 116         : Name(name), Args(args) {}
 117     };
 118
 119     /// FunctionAST - This class represents a function definition itself.
 120     class FunctionAST {
 121       PrototypeAST *Proto;
 122       ExprAST *Body;
 123     public:
 124       FunctionAST(PrototypeAST *proto, ExprAST *body)
 125         : Proto(proto), Body(body) {}
 126     };
 127
 128 In Kaleidoscope, functions are typed with just a count of their
 129 arguments. Since all values are double precision floating point, the
 130 type of each argument doesn't need to be stored anywhere. In a more
 131 aggressive and realistic language, the "ExprAST" class would probably
 132 have a type field.
 133
 134 With this scaffolding, we can now talk about parsing expressions and
 135 function bodies in Kaleidoscope.
 136
 137 Parser Basics
 138 =============
 139
 140 Now that we have an AST to build, we need to define the parser code to
 141 build it. The idea here is that we want to parse something like "x+y"
 142 (which is returned as three tokens by the lexer) into an AST that could
 143 be generated with calls like this:
 144
 145 .. code-block:: c++
 146
 147       ExprAST *X = new VariableExprAST("x");
 148       ExprAST *Y = new VariableExprAST("y");
 149       ExprAST *Result = new BinaryExprAST('+', X, Y);
 150
 151 In order to do this, we'll start by defining some basic helper routines:
 152
 153 .. code-block:: c++
 154
 155     /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 156     /// token the parser is looking at.  getNextToken reads another token from the
 157     /// lexer and updates CurTok with its results.
 158     static int CurTok;
 159     static int getNextToken() {
 160       return CurTok = gettok();
 161     }
 162
 163 This implements a simple token buffer around the lexer. This allows us
 164 to look one token ahead at what the lexer is returning. Every function
 165 in our parser will assume that CurTok is the current token that needs to
 166 be parsed.
 167
 168 .. code-block:: c++
 169
 170
 171     /// Error* - These are little helper functions for error handling.
 172     ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 173     PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 174     FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 175
 176 The ``Error`` routines are simple helper routines that our parser will
 177 use to handle errors. The error recovery in our parser will not be the
 178 best and is not particular user-friendly, but it will be enough for our
 179 tutorial. These routines make it easier to handle errors in routines
 180 that have various return types: they always return null.
 181
 182 With these basic helper functions, we can implement the first piece of
 183 our grammar: numeric literals.
 184
 185 Basic Expression Parsing
 186 ========================
 187
 188 We start with numeric literals, because they are the simplest to
 189 process. For each production in our grammar, we'll define a function
 190 which parses that production. For numeric literals, we have:
 191
 192 .. code-block:: c++
 193
 194     /// numberexpr ::= number
 195     static ExprAST *ParseNumberExpr() {
 196       ExprAST *Result = new NumberExprAST(NumVal);
 197       getNextToken(); // consume the number
 198       return Result;
 199     }
 200
 201 This routine is very simple: it expects to be called when the current
 202 token is a ``tok_number`` token. It takes the current number value,
 203 creates a ``NumberExprAST`` node, advances the lexer to the next token,
 204 and finally returns.
 205
 206 There are some interesting aspects to this. The most important one is
 207 that this routine eats all of the tokens that correspond to the
 208 production and returns the lexer buffer with the next token (which is
 209 not part of the grammar production) ready to go. This is a fairly
 210 standard way to go for recursive descent parsers. For a better example,
 211 the parenthesis operator is defined like this:
 212
 213 .. code-block:: c++
 214
 215     /// parenexpr ::= '(' expression ')'
 216     static ExprAST *ParseParenExpr() {
 217       getNextToken();  // eat (.
 218       ExprAST *V = ParseExpression();
 219       if (!V) return 0;
 220
 221       if (CurTok != ')')
 222         return Error("expected ')'");
 223       getNextToken();  // eat ).
 224       return V;
 225     }
 226
 227 This function illustrates a number of interesting things about the
 228 parser:
 229
 230 1) It shows how we use the Error routines. When called, this function
 231 expects that the current token is a '(' token, but after parsing the
 232 subexpression, it is possible that there is no ')' waiting. For example,
 233 if the user types in "(4 x" instead of "(4)", the parser should emit an
 234 error. Because errors can occur, the parser needs a way to indicate that
 235 they happened: in our parser, we return null on an error.
 236
 237 2) Another interesting aspect of this function is that it uses recursion
 238 by calling ``ParseExpression`` (we will soon see that
 239 ``ParseExpression`` can call ``ParseParenExpr``). This is powerful
 240 because it allows us to handle recursive grammars, and keeps each
 241 production very simple. Note that parentheses do not cause construction
 242 of AST nodes themselves. While we could do it this way, the most
 243 important role of parentheses are to guide the parser and provide
 244 grouping. Once the parser constructs the AST, parentheses are not
 245 needed.
 246
 247 The next simple production is for handling variable references and
 248 function calls:
 249
 250 .. code-block:: c++
 251
 252     /// identifierexpr
 253     ///   ::= identifier
 254     ///   ::= identifier '(' expression* ')'
 255     static ExprAST *ParseIdentifierExpr() {
 256       std::string IdName = IdentifierStr;
 257
 258       getNextToken();  // eat identifier.
 259
 260       if (CurTok != '(') // Simple variable ref.
 261         return new VariableExprAST(IdName);
 262
 263       // Call.
 264       getNextToken();  // eat (
 265       std::vector<ExprAST*> Args;
 266       if (CurTok != ')') {
 267         while (1) {
 268           ExprAST *Arg = ParseExpression();
 269           if (!Arg) return 0;
 270           Args.push_back(Arg);
 271
 272           if (CurTok == ')') break;
 273
 274           if (CurTok != ',')
 275             return Error("Expected ')' or ',' in argument list");
 276           getNextToken();
 277         }
 278       }
 279
 280       // Eat the ')'.
 281       getNextToken();
 282
 283       return new CallExprAST(IdName, Args);
 284     }
 285
 286 This routine follows the same style as the other routines. (It expects
 287 to be called if the current token is a ``tok_identifier`` token). It
 288 also has recursion and error handling. One interesting aspect of this is
 289 that it uses *look-ahead* to determine if the current identifier is a
 290 stand alone variable reference or if it is a function call expression.
 291 It handles this by checking to see if the token after the identifier is
 292 a '(' token, constructing either a ``VariableExprAST`` or
 293 ``CallExprAST`` node as appropriate.
 294
 295 Now that we have all of our simple expression-parsing logic in place, we
 296 can define a helper function to wrap it together into one entry point.
 297 We call this class of expressions "primary" expressions, for reasons
 298 that will become more clear `later in the
 299 tutorial <LangImpl6.html#unary>`_. In order to parse an arbitrary
 300 primary expression, we need to determine what sort of expression it is:
 301
 302 .. code-block:: c++
 303
 304     /// primary
 305     ///   ::= identifierexpr
 306     ///   ::= numberexpr
 307     ///   ::= parenexpr
 308     static ExprAST *ParsePrimary() {
 309       switch (CurTok) {
 310       default: return Error("unknown token when expecting an expression");
 311       case tok_identifier: return ParseIdentifierExpr();
 312       case tok_number:     return ParseNumberExpr();
 313       case '(':            return ParseParenExpr();
 314       }
 315     }
 316
 317 Now that you see the definition of this function, it is more obvious why
 318 we can assume the state of CurTok in the various functions. This uses
 319 look-ahead to determine which sort of expression is being inspected, and
 320 then parses it with a function call.
 321
 322 Now that basic expressions are handled, we need to handle binary
 323 expressions. They are a bit more complex.
 324
 325 Binary Expression Parsing
 326 =========================
 327
 328 Binary expressions are significantly harder to parse because they are
 329 often ambiguous. For example, when given the string "x+y\*z", the parser
 330 can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common
 331 definitions from mathematics, we expect the later parse, because "\*"
 332 (multiplication) has higher *precedence* than "+" (addition).
 333
 334 There are many ways to handle this, but an elegant and efficient way is
 335 to use `Operator-Precedence
 336 Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
 337 This parsing technique uses the precedence of binary operators to guide
 338 recursion. To start with, we need a table of precedences:
 339
 340 .. code-block:: c++
 341
 342     /// BinopPrecedence - This holds the precedence for each binary operator that is
 343     /// defined.
 344     static std::map<char, int> BinopPrecedence;
 345
 346     /// GetTokPrecedence - Get the precedence of the pending binary operator token.
 347     static int GetTokPrecedence() {
 348       if (!isascii(CurTok))
 349         return -1;
 350
 351       // Make sure it's a declared binop.
 352       int TokPrec = BinopPrecedence[CurTok];
 353       if (TokPrec <= 0) return -1;
 354       return TokPrec;
 355     }
 356
 357     int main() {
 358       // Install standard binary operators.
 359       // 1 is lowest precedence.
 360       BinopPrecedence['<'] = 10;
 361       BinopPrecedence['+'] = 20;
 362       BinopPrecedence['-'] = 20;
 363       BinopPrecedence['*'] = 40;  // highest.
 364       ...
 365     }
 366
 367 For the basic form of Kaleidoscope, we will only support 4 binary
 368 operators (this can obviously be extended by you, our brave and intrepid
 369 reader). The ``GetTokPrecedence`` function returns the precedence for
 370 the current token, or -1 if the token is not a binary operator. Having a
 371 map makes it easy to add new operators and makes it clear that the
 372 algorithm doesn't depend on the specific operators involved, but it
 373 would be easy enough to eliminate the map and do the comparisons in the
 374 ``GetTokPrecedence`` function. (Or just use a fixed-size array).
 375
 376 With the helper above defined, we can now start parsing binary
 377 expressions. The basic idea of operator precedence parsing is to break
 378 down an expression with potentially ambiguous binary operators into
 379 pieces. Consider ,for example, the expression "a+b+(c+d)\*e\*f+g".
 380 Operator precedence parsing considers this as a stream of primary
 381 expressions separated by binary operators. As such, it will first parse
 382 the leading primary expression "a", then it will see the pairs [+, b]
 383 [+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are
 384 primary expressions, the binary expression parser doesn't need to worry
 385 about nested subexpressions like (c+d) at all.
 386
 387 To start, an expression is a primary expression potentially followed by
 388 a sequence of [binop,primaryexpr] pairs:
 389
 390 .. code-block:: c++
 391
 392     /// expression
 393     ///   ::= primary binoprhs
 394     ///
 395     static ExprAST *ParseExpression() {
 396       ExprAST *LHS = ParsePrimary();
 397       if (!LHS) return 0;
 398
 399       return ParseBinOpRHS(0, LHS);
 400     }
 401
 402 ``ParseBinOpRHS`` is the function that parses the sequence of pairs for
 403 us. It takes a precedence and a pointer to an expression for the part
 404 that has been parsed so far. Note that "x" is a perfectly valid
 405 expression: As such, "binoprhs" is allowed to be empty, in which case it
 406 returns the expression that is passed into it. In our example above, the
 407 code passes the expression for "a" into ``ParseBinOpRHS`` and the
 408 current token is "+".
 409
 410 The precedence value passed into ``ParseBinOpRHS`` indicates the
 411 *minimal operator precedence* that the function is allowed to eat. For
 412 example, if the current pair stream is [+, x] and ``ParseBinOpRHS`` is
 413 passed in a precedence of 40, it will not consume any tokens (because
 414 the precedence of '+' is only 20). With this in mind, ``ParseBinOpRHS``
 415 starts with:
 416
 417 .. code-block:: c++
 418
 419     /// binoprhs
 420     ///   ::= ('+' primary)*
 421     static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
 422       // If this is a binop, find its precedence.
 423       while (1) {
 424         int TokPrec = GetTokPrecedence();
 425
 426         // If this is a binop that binds at least as tightly as the current binop,
 427         // consume it, otherwise we are done.
 428         if (TokPrec < ExprPrec)
 429           return LHS;
 430
 431 This code gets the precedence of the current token and checks to see if
 432 if is too low. Because we defined invalid tokens to have a precedence of
 433 -1, this check implicitly knows that the pair-stream ends when the token
 434 stream runs out of binary operators. If this check succeeds, we know
 435 that the token is a binary operator and that it will be included in this
 436 expression:
 437
 438 .. code-block:: c++
 439
 440         // Okay, we know this is a binop.
 441         int BinOp = CurTok;
 442         getNextToken();  // eat binop
 443
 444         // Parse the primary expression after the binary operator.
 445         ExprAST *RHS = ParsePrimary();
 446         if (!RHS) return 0;
 447
 448 As such, this code eats (and remembers) the binary operator and then
 449 parses the primary expression that follows. This builds up the whole
 450 pair, the first of which is [+, b] for the running example.
 451
 452 Now that we parsed the left-hand side of an expression and one pair of
 453 the RHS sequence, we have to decide which way the expression associates.
 454 In particular, we could have "(a+b) binop unparsed" or "a + (b binop
 455 unparsed)". To determine this, we look ahead at "binop" to determine its
 456 precedence and compare it to BinOp's precedence (which is '+' in this
 457 case):
 458
 459 .. code-block:: c++
 460
 461         // If BinOp binds less tightly with RHS than the operator after RHS, let
 462         // the pending operator take RHS as its LHS.
 463         int NextPrec = GetTokPrecedence();
 464         if (TokPrec < NextPrec) {
 465
 466 If the precedence of the binop to the right of "RHS" is lower or equal
 467 to the precedence of our current operator, then we know that the
 468 parentheses associate as "(a+b) binop ...". In our example, the current
 469 operator is "+" and the next operator is "+", we know that they have the
 470 same precedence. In this case we'll create the AST node for "a+b", and
 471 then continue parsing:
 472
 473 .. code-block:: c++
 474
 475           ... if body omitted ...
 476         }
 477
 478         // Merge LHS/RHS.
 479         LHS = new BinaryExprAST(BinOp, LHS, RHS);
 480       }  // loop around to the top of the while loop.
 481     }
 482
 483 In our example above, this will turn "a+b+" into "(a+b)" and execute the
 484 next iteration of the loop, with "+" as the current token. The code
 485 above will eat, remember, and parse "(c+d)" as the primary expression,
 486 which makes the current pair equal to [+, (c+d)]. It will then evaluate
 487 the 'if' conditional above with "\*" as the binop to the right of the
 488 primary. In this case, the precedence of "\*" is higher than the
 489 precedence of "+" so the if condition will be entered.
 490
 491 The critical question left here is "how can the if condition parse the
 492 right hand side in full"? In particular, to build the AST correctly for
 493 our example, it needs to get all of "(c+d)\*e\*f" as the RHS expression
 494 variable. The code to do this is surprisingly simple (code from the
 495 above two blocks duplicated for context):
 496
 497 .. code-block:: c++
 498
 499         // If BinOp binds less tightly with RHS than the operator after RHS, let
 500         // the pending operator take RHS as its LHS.
 501         int NextPrec = GetTokPrecedence();
 502         if (TokPrec < NextPrec) {
 503           RHS = ParseBinOpRHS(TokPrec+1, RHS);
 504           if (RHS == 0) return 0;
 505         }
 506         // Merge LHS/RHS.
 507         LHS = new BinaryExprAST(BinOp, LHS, RHS);
 508       }  // loop around to the top of the while loop.
 509     }
 510
 511 At this point, we know that the binary operator to the RHS of our
 512 primary has higher precedence than the binop we are currently parsing.
 513 As such, we know that any sequence of pairs whose operators are all
 514 higher precedence than "+" should be parsed together and returned as
 515 "RHS". To do this, we recursively invoke the ``ParseBinOpRHS`` function
 516 specifying "TokPrec+1" as the minimum precedence required for it to
 517 continue. In our example above, this will cause it to return the AST
 518 node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of the '+'
 519 expression.
 520
 521 Finally, on the next iteration of the while loop, the "+g" piece is
 522 parsed and added to the AST. With this little bit of code (14
 523 non-trivial lines), we correctly handle fully general binary expression
 524 parsing in a very elegant way. This was a whirlwind tour of this code,
 525 and it is somewhat subtle. I recommend running through it with a few
 526 tough examples to see how it works.
 527
 528 This wraps up handling of expressions. At this point, we can point the
 529 parser at an arbitrary token stream and build an expression from it,
 530 stopping at the first token that is not part of the expression. Next up
 531 we need to handle function definitions, etc.
 532
 533 Parsing the Rest
 534 ================
 535
 536 The next thing missing is handling of function prototypes. In
 537 Kaleidoscope, these are used both for 'extern' function declarations as
 538 well as function body definitions. The code to do this is
 539 straight-forward and not very interesting (once you've survived
 540 expressions):
 541
 542 .. code-block:: c++
 543
 544     /// prototype
 545     ///   ::= id '(' id* ')'
 546     static PrototypeAST *ParsePrototype() {
 547       if (CurTok != tok_identifier)
 548         return ErrorP("Expected function name in prototype");
 549
 550       std::string FnName = IdentifierStr;
 551       getNextToken();
 552
 553       if (CurTok != '(')
 554         return ErrorP("Expected '(' in prototype");
 555
 556       // Read the list of argument names.
 557       std::vector<std::string> ArgNames;
 558       while (getNextToken() == tok_identifier)
 559         ArgNames.push_back(IdentifierStr);
 560       if (CurTok != ')')
 561         return ErrorP("Expected ')' in prototype");
 562
 563       // success.
 564       getNextToken();  // eat ')'.
 565
 566       return new PrototypeAST(FnName, ArgNames);
 567     }
 568
 569 Given this, a function definition is very simple, just a prototype plus
 570 an expression to implement the body:
 571
 572 .. code-block:: c++
 573
 574     /// definition ::= 'def' prototype expression
 575     static FunctionAST *ParseDefinition() {
 576       getNextToken();  // eat def.
 577       PrototypeAST *Proto = ParsePrototype();
 578       if (Proto == 0) return 0;
 579
 580       if (ExprAST *E = ParseExpression())
 581         return new FunctionAST(Proto, E);
 582       return 0;
 583     }
 584
 585 In addition, we support 'extern' to declare functions like 'sin' and
 586 'cos' as well as to support forward declaration of user functions. These
 587 'extern's are just prototypes with no body:
 588
 589 .. code-block:: c++
 590
 591     /// external ::= 'extern' prototype
 592     static PrototypeAST *ParseExtern() {
 593       getNextToken();  // eat extern.
 594       return ParsePrototype();
 595     }
 596
 597 Finally, we'll also let the user type in arbitrary top-level expressions
 598 and evaluate them on the fly. We will handle this by defining anonymous
 599 nullary (zero argument) functions for them:
 600
 601 .. code-block:: c++
 602
 603     /// toplevelexpr ::= expression
 604     static FunctionAST *ParseTopLevelExpr() {
 605       if (ExprAST *E = ParseExpression()) {
 606         // Make an anonymous proto.
 607         PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
 608         return new FunctionAST(Proto, E);
 609       }
 610       return 0;
 611     }
 612
 613 Now that we have all the pieces, let's build a little driver that will
 614 let us actually *execute* this code we've built!
 615
 616 The Driver
 617 ==========
 618
 619 The driver for this simply invokes all of the parsing pieces with a
 620 top-level dispatch loop. There isn't much interesting here, so I'll just
 621 include the top-level loop. See `below <#code>`_ for full code in the
 622 "Top-Level Parsing" section.
 623
 624 .. code-block:: c++
 625
 626     /// top ::= definition | external | expression | ';'
 627     static void MainLoop() {
 628       while (1) {
 629         fprintf(stderr, "ready> ");
 630         switch (CurTok) {
 631         case tok_eof:    return;
 632         case ';':        getNextToken(); break;  // ignore top-level semicolons.
 633         case tok_def:    HandleDefinition(); break;
 634         case tok_extern: HandleExtern(); break;
 635         default:         HandleTopLevelExpression(); break;
 636         }
 637       }
 638     }
 639
 640 The most interesting part of this is that we ignore top-level
 641 semicolons. Why is this, you ask? The basic reason is that if you type
 642 "4 + 5" at the command line, the parser doesn't know whether that is the
 643 end of what you will type or not. For example, on the next line you
 644 could type "def foo..." in which case 4+5 is the end of a top-level
 645 expression. Alternatively you could type "\* 6", which would continue
 646 the expression. Having top-level semicolons allows you to type "4+5;",
 647 and the parser will know you are done.
 648
 649 Conclusions
 650 ===========
 651
 652 With just under 400 lines of commented code (240 lines of non-comment,
 653 non-blank code), we fully defined our minimal language, including a
 654 lexer, parser, and AST builder. With this done, the executable will
 655 validate Kaleidoscope code and tell us if it is grammatically invalid.
 656 For example, here is a sample interaction:
 657
 658 .. code-block:: bash
 659
 660     $ ./a.out
 661     ready> def foo(x y) x+foo(y, 4.0);
 662     Parsed a function definition.
 663     ready> def foo(x y) x+y y;
 664     Parsed a function definition.
 665     Parsed a top-level expr
 666     ready> def foo(x y) x+y );
 667     Parsed a function definition.
 668     Error: unknown token when expecting an expression
 669     ready> extern sin(a);
 670     ready> Parsed an extern
 671     ready> ^D
 672     $
 673
 674 There is a lot of room for extension here. You can define new AST nodes,
 675 extend the language in many ways, etc. In the `next
 676 installment <LangImpl3.html>`_, we will describe how to generate LLVM
 677 Intermediate Representation (IR) from the AST.
 678
 679 Full Code Listing
 680 =================
 681
 682 Here is the complete code listing for this and the previous chapter.
 683 Note that it is fully self-contained: you don't need LLVM or any
 684 external libraries at all for this. (Besides the C and C++ standard
 685 libraries, of course.) To build this, just compile with:
 686
 687 .. code-block:: bash
 688
 689     # Compile
 690     clang++ -g -O3 toy.cpp
 691     # Run
 692     ./a.out
 693
 694 Here is the code:
 695
 696 .. code-block:: c++
 697
 698     #include <cstdio>
 699     #include <cstdlib>
 700     #include <string>
 701     #include <map>
 702     #include <vector>
 703
 704     //===----------------------------------------------------------------------===//
 705     // Lexer
 706     //===----------------------------------------------------------------------===//
 707
 708     // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
 709     // of these for known things.
 710     enum Token {
 711       tok_eof = -1,
 712
 713       // commands
 714       tok_def = -2, tok_extern = -3,
 715
 716       // primary
 717       tok_identifier = -4, tok_number = -5
 718     };
 719
 720     static std::string IdentifierStr;  // Filled in if tok_identifier
 721     static double NumVal;              // Filled in if tok_number
 722
 723     /// gettok - Return the next token from standard input.
 724     static int gettok() {
 725       static int LastChar = ' ';
 726
 727       // Skip any whitespace.
 728       while (isspace(LastChar))
 729         LastChar = getchar();
 730
 731       if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
 732         IdentifierStr = LastChar;
 733         while (isalnum((LastChar = getchar())))
 734           IdentifierStr += LastChar;
 735
 736         if (IdentifierStr == "def") return tok_def;
 737         if (IdentifierStr == "extern") return tok_extern;
 738         return tok_identifier;
 739       }
 740
 741       if (isdigit(LastChar) || LastChar == '.') {   // Number: [0-9.]+
 742         std::string NumStr;
 743         do {
 744           NumStr += LastChar;
 745           LastChar = getchar();
 746         } while (isdigit(LastChar) || LastChar == '.');
 747
 748         NumVal = strtod(NumStr.c_str(), 0);
 749         return tok_number;
 750       }
 751
 752       if (LastChar == '#') {
 753         // Comment until end of line.
 754         do LastChar = getchar();
 755         while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
 756
 757         if (LastChar != EOF)
 758           return gettok();
 759       }
 760
 761       // Check for end of file.  Don't eat the EOF.
 762       if (LastChar == EOF)
 763         return tok_eof;
 764
 765       // Otherwise, just return the character as its ascii value.
 766       int ThisChar = LastChar;
 767       LastChar = getchar();
 768       return ThisChar;
 769     }
 770
 771     //===----------------------------------------------------------------------===//
 772     // Abstract Syntax Tree (aka Parse Tree)
 773     //===----------------------------------------------------------------------===//
 774
 775     /// ExprAST - Base class for all expression nodes.
 776     class ExprAST {
 777     public:
 778       virtual ~ExprAST() {}
 779     };
 780
 781     /// NumberExprAST - Expression class for numeric literals like "1.0".
 782     class NumberExprAST : public ExprAST {
 783       double Val;
 784     public:
 785       NumberExprAST(double val) : Val(val) {}
 786     };
 787
 788     /// VariableExprAST - Expression class for referencing a variable, like "a".
 789     class VariableExprAST : public ExprAST {
 790       std::string Name;
 791     public:
 792       VariableExprAST(const std::string &name) : Name(name) {}
 793     };
 794
 795     /// BinaryExprAST - Expression class for a binary operator.
 796     class BinaryExprAST : public ExprAST {
 797       char Op;
 798       ExprAST *LHS, *RHS;
 799     public:
 800       BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
 801         : Op(op), LHS(lhs), RHS(rhs) {}
 802     };
 803
 804     /// CallExprAST - Expression class for function calls.
 805     class CallExprAST : public ExprAST {
 806       std::string Callee;
 807       std::vector<ExprAST*> Args;
 808     public:
 809       CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
 810         : Callee(callee), Args(args) {}
 811     };
 812
 813     /// PrototypeAST - This class represents the "prototype" for a function,
 814     /// which captures its name, and its argument names (thus implicitly the number
 815     /// of arguments the function takes).
 816     class PrototypeAST {
 817       std::string Name;
 818       std::vector<std::string> Args;
 819     public:
 820       PrototypeAST(const std::string &name, const std::vector<std::string> &args)
 821         : Name(name), Args(args) {}
 822
 823     };
 824
 825     /// FunctionAST - This class represents a function definition itself.
 826     class FunctionAST {
 827       PrototypeAST *Proto;
 828       ExprAST *Body;
 829     public:
 830       FunctionAST(PrototypeAST *proto, ExprAST *body)
 831         : Proto(proto), Body(body) {}
 832
 833     };
 834
 835     //===----------------------------------------------------------------------===//
 836     // Parser
 837     //===----------------------------------------------------------------------===//
 838
 839     /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 840     /// token the parser is looking at.  getNextToken reads another token from the
 841     /// lexer and updates CurTok with its results.
 842     static int CurTok;
 843     static int getNextToken() {
 844       return CurTok = gettok();
 845     }
 846
 847     /// BinopPrecedence - This holds the precedence for each binary operator that is
 848     /// defined.
 849     static std::map<char, int> BinopPrecedence;
 850
 851     /// GetTokPrecedence - Get the precedence of the pending binary operator token.
 852     static int GetTokPrecedence() {
 853       if (!isascii(CurTok))
 854         return -1;
 855
 856       // Make sure it's a declared binop.
 857       int TokPrec = BinopPrecedence[CurTok];
 858       if (TokPrec <= 0) return -1;
 859       return TokPrec;
 860     }
 861
 862     /// Error* - These are little helper functions for error handling.
 863     ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 864     PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 865     FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 866
 867     static ExprAST *ParseExpression();
 868
 869     /// identifierexpr
 870     ///   ::= identifier
 871     ///   ::= identifier '(' expression* ')'
 872     static ExprAST *ParseIdentifierExpr() {
 873       std::string IdName = IdentifierStr;
 874
 875       getNextToken();  // eat identifier.
 876
 877       if (CurTok != '(') // Simple variable ref.
 878         return new VariableExprAST(IdName);
 879
 880       // Call.
 881       getNextToken();  // eat (
 882       std::vector<ExprAST*> Args;
 883       if (CurTok != ')') {
 884         while (1) {
 885           ExprAST *Arg = ParseExpression();
 886           if (!Arg) return 0;
 887           Args.push_back(Arg);
 888
 889           if (CurTok == ')') break;
 890
 891           if (CurTok != ',')
 892             return Error("Expected ')' or ',' in argument list");
 893           getNextToken();
 894         }
 895       }
 896
 897       // Eat the ')'.
 898       getNextToken();
 899
 900       return new CallExprAST(IdName, Args);
 901     }
 902
 903     /// numberexpr ::= number
 904     static ExprAST *ParseNumberExpr() {
 905       ExprAST *Result = new NumberExprAST(NumVal);
 906       getNextToken(); // consume the number
 907       return Result;
 908     }
 909
 910     /// parenexpr ::= '(' expression ')'
 911     static ExprAST *ParseParenExpr() {
 912       getNextToken();  // eat (.
 913       ExprAST *V = ParseExpression();
 914       if (!V) return 0;
 915
 916       if (CurTok != ')')
 917         return Error("expected ')'");
 918       getNextToken();  // eat ).
 919       return V;
 920     }
 921
 922     /// primary
 923     ///   ::= identifierexpr
 924     ///   ::= numberexpr
 925     ///   ::= parenexpr
 926     static ExprAST *ParsePrimary() {
 927       switch (CurTok) {
 928       default: return Error("unknown token when expecting an expression");
 929       case tok_identifier: return ParseIdentifierExpr();
 930       case tok_number:     return ParseNumberExpr();
 931       case '(':            return ParseParenExpr();
 932       }
 933     }
 934
 935     /// binoprhs
 936     ///   ::= ('+' primary)*
 937     static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
 938       // If this is a binop, find its precedence.
 939       while (1) {
 940         int TokPrec = GetTokPrecedence();
 941
 942         // If this is a binop that binds at least as tightly as the current binop,
 943         // consume it, otherwise we are done.
 944         if (TokPrec < ExprPrec)
 945           return LHS;
 946
 947         // Okay, we know this is a binop.
 948         int BinOp = CurTok;
 949         getNextToken();  // eat binop
 950
 951         // Parse the primary expression after the binary operator.
 952         ExprAST *RHS = ParsePrimary();
 953         if (!RHS) return 0;
 954
 955         // If BinOp binds less tightly with RHS than the operator after RHS, let
 956         // the pending operator take RHS as its LHS.
 957         int NextPrec = GetTokPrecedence();
 958         if (TokPrec < NextPrec) {
 959           RHS = ParseBinOpRHS(TokPrec+1, RHS);
 960           if (RHS == 0) return 0;
 961         }
 962
 963         // Merge LHS/RHS.
 964         LHS = new BinaryExprAST(BinOp, LHS, RHS);
 965       }
 966     }
 967
 968     /// expression
 969     ///   ::= primary binoprhs
 970     ///
 971     static ExprAST *ParseExpression() {
 972       ExprAST *LHS = ParsePrimary();
 973       if (!LHS) return 0;
 974
 975       return ParseBinOpRHS(0, LHS);
 976     }
 977
 978     /// prototype
 979     ///   ::= id '(' id* ')'
 980     static PrototypeAST *ParsePrototype() {
 981       if (CurTok != tok_identifier)
 982         return ErrorP("Expected function name in prototype");
 983
 984       std::string FnName = IdentifierStr;
 985       getNextToken();
 986
 987       if (CurTok != '(')
 988         return ErrorP("Expected '(' in prototype");
 989
 990       std::vector<std::string> ArgNames;
 991       while (getNextToken() == tok_identifier)
 992         ArgNames.push_back(IdentifierStr);
 993       if (CurTok != ')')
 994         return ErrorP("Expected ')' in prototype");
 995
 996       // success.
 997       getNextToken();  // eat ')'.
 998
 999       return new PrototypeAST(FnName, ArgNames);
1000     }
1001
1002     /// definition ::= 'def' prototype expression
1003     static FunctionAST *ParseDefinition() {
1004       getNextToken();  // eat def.
1005       PrototypeAST *Proto = ParsePrototype();
1006       if (Proto == 0) return 0;
1007
1008       if (ExprAST *E = ParseExpression())
1009         return new FunctionAST(Proto, E);
1010       return 0;
1011     }
1012
1013     /// toplevelexpr ::= expression
1014     static FunctionAST *ParseTopLevelExpr() {
1015       if (ExprAST *E = ParseExpression()) {
1016         // Make an anonymous proto.
1017         PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
1018         return new FunctionAST(Proto, E);
1019       }
1020       return 0;
1021     }
1022
1023     /// external ::= 'extern' prototype
1024     static PrototypeAST *ParseExtern() {
1025       getNextToken();  // eat extern.
1026       return ParsePrototype();
1027     }
1028
1029     //===----------------------------------------------------------------------===//
1030     // Top-Level parsing
1031     //===----------------------------------------------------------------------===//
1032
1033     static void HandleDefinition() {
1034       if (ParseDefinition()) {
1035         fprintf(stderr, "Parsed a function definition.\n");
1036       } else {
1037         // Skip token for error recovery.
1038         getNextToken();
1039       }
1040     }
1041
1042     static void HandleExtern() {
1043       if (ParseExtern()) {
1044         fprintf(stderr, "Parsed an extern\n");
1045       } else {
1046         // Skip token for error recovery.
1047         getNextToken();
1048       }
1049     }
1050
1051     static void HandleTopLevelExpression() {
1052       // Evaluate a top-level expression into an anonymous function.
1053       if (ParseTopLevelExpr()) {
1054         fprintf(stderr, "Parsed a top-level expr\n");
1055       } else {
1056         // Skip token for error recovery.
1057         getNextToken();
1058       }
1059     }
1060
1061     /// top ::= definition | external | expression | ';'
1062     static void MainLoop() {
1063       while (1) {
1064         fprintf(stderr, "ready> ");
1065         switch (CurTok) {
1066         case tok_eof:    return;
1067         case ';':        getNextToken(); break;  // ignore top-level semicolons.
1068         case tok_def:    HandleDefinition(); break;
1069         case tok_extern: HandleExtern(); break;
1070         default:         HandleTopLevelExpression(); break;
1071         }
1072       }
1073     }
1074
1075     //===----------------------------------------------------------------------===//
1076     // Main driver code.
1077     //===----------------------------------------------------------------------===//
1078
1079     int main() {
1080       // Install standard binary operators.
1081       // 1 is lowest precedence.
1082       BinopPrecedence['<'] = 10;
1083       BinopPrecedence['+'] = 20;
1084       BinopPrecedence['-'] = 20;
1085       BinopPrecedence['*'] = 40;  // highest.
1086
1087       // Prime the first token.
1088       fprintf(stderr, "ready> ");
1089       getNextToken();
1090
1091       // Run the main "interpreter loop" now.
1092       MainLoop();
1093
1094       return 0;
1095     }
1096
1097 `Next: Implementing Code Generation to LLVM IR <LangImpl3.html>`_
1098