docs/tutorial/LangImpl2.rst

   1 ===========================================
   2 Kaleidoscope: Implementing a Parser and AST
   3 ===========================================
   4
   5 .. contents::
   6    :local:
   7
   8 Chapter 2 Introduction
   9 ======================
  10
  11 Welcome to Chapter 2 of the "`Implementing a language with
  12 LLVM <index.html>`_" tutorial. This chapter shows you how to use the
  13 lexer, built in `Chapter 1 <LangImpl1.html>`_, to build a full
  14 `parser <http://en.wikipedia.org/wiki/Parsing>`_ for our Kaleidoscope
  15 language. Once we have a parser, we'll define and build an `Abstract
  16 Syntax Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).
  17
  18 The parser we will build uses a combination of `Recursive Descent
  19 Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
  20 `Operator-Precedence
  21 Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
  22 parse the Kaleidoscope language (the latter for binary expressions and
  23 the former for everything else). Before we get to parsing though, lets
  24 talk about the output of the parser: the Abstract Syntax Tree.
  25
  26 The Abstract Syntax Tree (AST)
  27 ==============================
  28
  29 The AST for a program captures its behavior in such a way that it is
  30 easy for later stages of the compiler (e.g. code generation) to
  31 interpret. We basically want one object for each construct in the
  32 language, and the AST should closely model the language. In
  33 Kaleidoscope, we have expressions, a prototype, and a function object.
  34 We'll start with expressions first:
  35
  36 .. code-block:: c++
  37
  38     /// ExprAST - Base class for all expression nodes.
  39     class ExprAST {
  40     public:
  41       virtual ~ExprAST() {}
  42     };
  43
  44     /// NumberExprAST - Expression class for numeric literals like "1.0".
  45     class NumberExprAST : public ExprAST {
  46       double Val;
  47     public:
  48       NumberExprAST(double val) : Val(val) {}
  49     };
  50
  51 The code above shows the definition of the base ExprAST class and one
  52 subclass which we use for numeric literals. The important thing to note
  53 about this code is that the NumberExprAST class captures the numeric
  54 value of the literal as an instance variable. This allows later phases
  55 of the compiler to know what the stored numeric value is.
  56
  57 Right now we only create the AST, so there are no useful accessor
  58 methods on them. It would be very easy to add a virtual method to pretty
  59 print the code, for example. Here are the other expression AST node
  60 definitions that we'll use in the basic form of the Kaleidoscope
  61 language:
  62
  63 .. code-block:: c++
  64
  65     /// VariableExprAST - Expression class for referencing a variable, like "a".
  66     class VariableExprAST : public ExprAST {
  67       std::string Name;
  68     public:
  69       VariableExprAST(const std::string &name) : Name(name) {}
  70     };
  71
  72     /// BinaryExprAST - Expression class for a binary operator.
  73     class BinaryExprAST : public ExprAST {
  74       char Op;
  75       ExprAST *LHS, *RHS;
  76     public:
  77       BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
  78         : Op(op), LHS(lhs), RHS(rhs) {}
  79     };
  80
  81     /// CallExprAST - Expression class for function calls.
  82     class CallExprAST : public ExprAST {
  83       std::string Callee;
  84       std::vector<ExprAST*> Args;
  85     public:
  86       CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
  87         : Callee(callee), Args(args) {}
  88     };
  89
  90 This is all (intentionally) rather straight-forward: variables capture
  91 the variable name, binary operators capture their opcode (e.g. '+'), and
  92 calls capture a function name as well as a list of any argument
  93 expressions. One thing that is nice about our AST is that it captures
  94 the language features without talking about the syntax of the language.
  95 Note that there is no discussion about precedence of binary operators,
  96 lexical structure, etc.
  97
  98 For our basic language, these are all of the expression nodes we'll
  99 define. Because it doesn't have conditional control flow, it isn't
 100 Turing-complete; we'll fix that in a later installment. The two things
 101 we need next are a way to talk about the interface to a function, and a
 102 way to talk about functions themselves:
 103
 104 .. code-block:: c++
 105
 106     /// PrototypeAST - This class represents the "prototype" for a function,
 107     /// which captures its name, and its argument names (thus implicitly the number
 108     /// of arguments the function takes).
 109     class PrototypeAST {
 110       std::string Name;
 111       std::vector<std::string> Args;
 112     public:
 113       PrototypeAST(const std::string &name, const std::vector<std::string> &args)
 114         : Name(name), Args(args) {}
 115     };
 116
 117     /// FunctionAST - This class represents a function definition itself.
 118     class FunctionAST {
 119       PrototypeAST *Proto;
 120       ExprAST *Body;
 121     public:
 122       FunctionAST(PrototypeAST *proto, ExprAST *body)
 123         : Proto(proto), Body(body) {}
 124     };
 125
 126 In Kaleidoscope, functions are typed with just a count of their
 127 arguments. Since all values are double precision floating point, the
 128 type of each argument doesn't need to be stored anywhere. In a more
 129 aggressive and realistic language, the "ExprAST" class would probably
 130 have a type field.
 131
 132 With this scaffolding, we can now talk about parsing expressions and
 133 function bodies in Kaleidoscope.
 134
 135 Parser Basics
 136 =============
 137
 138 Now that we have an AST to build, we need to define the parser code to
 139 build it. The idea here is that we want to parse something like "x+y"
 140 (which is returned as three tokens by the lexer) into an AST that could
 141 be generated with calls like this:
 142
 143 .. code-block:: c++
 144
 145       ExprAST *X = new VariableExprAST("x");
 146       ExprAST *Y = new VariableExprAST("y");
 147       ExprAST *Result = new BinaryExprAST('+', X, Y);
 148
 149 In order to do this, we'll start by defining some basic helper routines:
 150
 151 .. code-block:: c++
 152
 153     /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 154     /// token the parser is looking at.  getNextToken reads another token from the
 155     /// lexer and updates CurTok with its results.
 156     static int CurTok;
 157     static int getNextToken() {
 158       return CurTok = gettok();
 159     }
 160
 161 This implements a simple token buffer around the lexer. This allows us
 162 to look one token ahead at what the lexer is returning. Every function
 163 in our parser will assume that CurTok is the current token that needs to
 164 be parsed.
 165
 166 .. code-block:: c++
 167
 168
 169     /// Error* - These are little helper functions for error handling.
 170     ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 171     PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 172     FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 173
 174 The ``Error`` routines are simple helper routines that our parser will
 175 use to handle errors. The error recovery in our parser will not be the
 176 best and is not particular user-friendly, but it will be enough for our
 177 tutorial. These routines make it easier to handle errors in routines
 178 that have various return types: they always return null.
 179
 180 With these basic helper functions, we can implement the first piece of
 181 our grammar: numeric literals.
 182
 183 Basic Expression Parsing
 184 ========================
 185
 186 We start with numeric literals, because they are the simplest to
 187 process. For each production in our grammar, we'll define a function
 188 which parses that production. For numeric literals, we have:
 189
 190 .. code-block:: c++
 191
 192     /// numberexpr ::= number
 193     static ExprAST *ParseNumberExpr() {
 194       ExprAST *Result = new NumberExprAST(NumVal);
 195       getNextToken(); // consume the number
 196       return Result;
 197     }
 198
 199 This routine is very simple: it expects to be called when the current
 200 token is a ``tok_number`` token. It takes the current number value,
 201 creates a ``NumberExprAST`` node, advances the lexer to the next token,
 202 and finally returns.
 203
 204 There are some interesting aspects to this. The most important one is
 205 that this routine eats all of the tokens that correspond to the
 206 production and returns the lexer buffer with the next token (which is
 207 not part of the grammar production) ready to go. This is a fairly
 208 standard way to go for recursive descent parsers. For a better example,
 209 the parenthesis operator is defined like this:
 210
 211 .. code-block:: c++
 212
 213     /// parenexpr ::= '(' expression ')'
 214     static ExprAST *ParseParenExpr() {
 215       getNextToken();  // eat (.
 216       ExprAST *V = ParseExpression();
 217       if (!V) return 0;
 218
 219       if (CurTok != ')')
 220         return Error("expected ')'");
 221       getNextToken();  // eat ).
 222       return V;
 223     }
 224
 225 This function illustrates a number of interesting things about the
 226 parser:
 227
 228 1) It shows how we use the Error routines. When called, this function
 229 expects that the current token is a '(' token, but after parsing the
 230 subexpression, it is possible that there is no ')' waiting. For example,
 231 if the user types in "(4 x" instead of "(4)", the parser should emit an
 232 error. Because errors can occur, the parser needs a way to indicate that
 233 they happened: in our parser, we return null on an error.
 234
 235 2) Another interesting aspect of this function is that it uses recursion
 236 by calling ``ParseExpression`` (we will soon see that
 237 ``ParseExpression`` can call ``ParseParenExpr``). This is powerful
 238 because it allows us to handle recursive grammars, and keeps each
 239 production very simple. Note that parentheses do not cause construction
 240 of AST nodes themselves. While we could do it this way, the most
 241 important role of parentheses are to guide the parser and provide
 242 grouping. Once the parser constructs the AST, parentheses are not
 243 needed.
 244
 245 The next simple production is for handling variable references and
 246 function calls:
 247
 248 .. code-block:: c++
 249
 250     /// identifierexpr
 251     ///   ::= identifier
 252     ///   ::= identifier '(' expression* ')'
 253     static ExprAST *ParseIdentifierExpr() {
 254       std::string IdName = IdentifierStr;
 255
 256       getNextToken();  // eat identifier.
 257
 258       if (CurTok != '(') // Simple variable ref.
 259         return new VariableExprAST(IdName);
 260
 261       // Call.
 262       getNextToken();  // eat (
 263       std::vector<ExprAST*> Args;
 264       if (CurTok != ')') {
 265         while (1) {
 266           ExprAST *Arg = ParseExpression();
 267           if (!Arg) return 0;
 268           Args.push_back(Arg);
 269
 270           if (CurTok == ')') break;
 271
 272           if (CurTok != ',')
 273             return Error("Expected ')' or ',' in argument list");
 274           getNextToken();
 275         }
 276       }
 277
 278       // Eat the ')'.
 279       getNextToken();
 280
 281       return new CallExprAST(IdName, Args);
 282     }
 283
 284 This routine follows the same style as the other routines. (It expects
 285 to be called if the current token is a ``tok_identifier`` token). It
 286 also has recursion and error handling. One interesting aspect of this is
 287 that it uses *look-ahead* to determine if the current identifier is a
 288 stand alone variable reference or if it is a function call expression.
 289 It handles this by checking to see if the token after the identifier is
 290 a '(' token, constructing either a ``VariableExprAST`` or
 291 ``CallExprAST`` node as appropriate.
 292
 293 Now that we have all of our simple expression-parsing logic in place, we
 294 can define a helper function to wrap it together into one entry point.
 295 We call this class of expressions "primary" expressions, for reasons
 296 that will become more clear `later in the
 297 tutorial <LangImpl6.html#unary>`_. In order to parse an arbitrary
 298 primary expression, we need to determine what sort of expression it is:
 299
 300 .. code-block:: c++
 301
 302     /// primary
 303     ///   ::= identifierexpr
 304     ///   ::= numberexpr
 305     ///   ::= parenexpr
 306     static ExprAST *ParsePrimary() {
 307       switch (CurTok) {
 308       default: return Error("unknown token when expecting an expression");
 309       case tok_identifier: return ParseIdentifierExpr();
 310       case tok_number:     return ParseNumberExpr();
 311       case '(':            return ParseParenExpr();
 312       }
 313     }
 314
 315 Now that you see the definition of this function, it is more obvious why
 316 we can assume the state of CurTok in the various functions. This uses
 317 look-ahead to determine which sort of expression is being inspected, and
 318 then parses it with a function call.
 319
 320 Now that basic expressions are handled, we need to handle binary
 321 expressions. They are a bit more complex.
 322
 323 Binary Expression Parsing
 324 =========================
 325
 326 Binary expressions are significantly harder to parse because they are
 327 often ambiguous. For example, when given the string "x+y\*z", the parser
 328 can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common
 329 definitions from mathematics, we expect the later parse, because "\*"
 330 (multiplication) has higher *precedence* than "+" (addition).
 331
 332 There are many ways to handle this, but an elegant and efficient way is
 333 to use `Operator-Precedence
 334 Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
 335 This parsing technique uses the precedence of binary operators to guide
 336 recursion. To start with, we need a table of precedences:
 337
 338 .. code-block:: c++
 339
 340     /// BinopPrecedence - This holds the precedence for each binary operator that is
 341     /// defined.
 342     static std::map<char, int> BinopPrecedence;
 343
 344     /// GetTokPrecedence - Get the precedence of the pending binary operator token.
 345     static int GetTokPrecedence() {
 346       if (!isascii(CurTok))
 347         return -1;
 348
 349       // Make sure it's a declared binop.
 350       int TokPrec = BinopPrecedence[CurTok];
 351       if (TokPrec <= 0) return -1;
 352       return TokPrec;
 353     }
 354
 355     int main() {
 356       // Install standard binary operators.
 357       // 1 is lowest precedence.
 358       BinopPrecedence['<'] = 10;
 359       BinopPrecedence['+'] = 20;
 360       BinopPrecedence['-'] = 20;
 361       BinopPrecedence['*'] = 40;  // highest.
 362       ...
 363     }
 364
 365 For the basic form of Kaleidoscope, we will only support 4 binary
 366 operators (this can obviously be extended by you, our brave and intrepid
 367 reader). The ``GetTokPrecedence`` function returns the precedence for
 368 the current token, or -1 if the token is not a binary operator. Having a
 369 map makes it easy to add new operators and makes it clear that the
 370 algorithm doesn't depend on the specific operators involved, but it
 371 would be easy enough to eliminate the map and do the comparisons in the
 372 ``GetTokPrecedence`` function. (Or just use a fixed-size array).
 373
 374 With the helper above defined, we can now start parsing binary
 375 expressions. The basic idea of operator precedence parsing is to break
 376 down an expression with potentially ambiguous binary operators into
 377 pieces. Consider ,for example, the expression "a+b+(c+d)\*e\*f+g".
 378 Operator precedence parsing considers this as a stream of primary
 379 expressions separated by binary operators. As such, it will first parse
 380 the leading primary expression "a", then it will see the pairs [+, b]
 381 [+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are
 382 primary expressions, the binary expression parser doesn't need to worry
 383 about nested subexpressions like (c+d) at all.
 384
 385 To start, an expression is a primary expression potentially followed by
 386 a sequence of [binop,primaryexpr] pairs:
 387
 388 .. code-block:: c++
 389
 390     /// expression
 391     ///   ::= primary binoprhs
 392     ///
 393     static ExprAST *ParseExpression() {
 394       ExprAST *LHS = ParsePrimary();
 395       if (!LHS) return 0;
 396
 397       return ParseBinOpRHS(0, LHS);
 398     }
 399
 400 ``ParseBinOpRHS`` is the function that parses the sequence of pairs for
 401 us. It takes a precedence and a pointer to an expression for the part
 402 that has been parsed so far. Note that "x" is a perfectly valid
 403 expression: As such, "binoprhs" is allowed to be empty, in which case it
 404 returns the expression that is passed into it. In our example above, the
 405 code passes the expression for "a" into ``ParseBinOpRHS`` and the
 406 current token is "+".
 407
 408 The precedence value passed into ``ParseBinOpRHS`` indicates the
 409 *minimal operator precedence* that the function is allowed to eat. For
 410 example, if the current pair stream is [+, x] and ``ParseBinOpRHS`` is
 411 passed in a precedence of 40, it will not consume any tokens (because
 412 the precedence of '+' is only 20). With this in mind, ``ParseBinOpRHS``
 413 starts with:
 414
 415 .. code-block:: c++
 416
 417     /// binoprhs
 418     ///   ::= ('+' primary)*
 419     static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
 420       // If this is a binop, find its precedence.
 421       while (1) {
 422         int TokPrec = GetTokPrecedence();
 423
 424         // If this is a binop that binds at least as tightly as the current binop,
 425         // consume it, otherwise we are done.
 426         if (TokPrec < ExprPrec)
 427           return LHS;
 428
 429 This code gets the precedence of the current token and checks to see if
 430 if is too low. Because we defined invalid tokens to have a precedence of
 431 -1, this check implicitly knows that the pair-stream ends when the token
 432 stream runs out of binary operators. If this check succeeds, we know
 433 that the token is a binary operator and that it will be included in this
 434 expression:
 435
 436 .. code-block:: c++
 437
 438         // Okay, we know this is a binop.
 439         int BinOp = CurTok;
 440         getNextToken();  // eat binop
 441
 442         // Parse the primary expression after the binary operator.
 443         ExprAST *RHS = ParsePrimary();
 444         if (!RHS) return 0;
 445
 446 As such, this code eats (and remembers) the binary operator and then
 447 parses the primary expression that follows. This builds up the whole
 448 pair, the first of which is [+, b] for the running example.
 449
 450 Now that we parsed the left-hand side of an expression and one pair of
 451 the RHS sequence, we have to decide which way the expression associates.
 452 In particular, we could have "(a+b) binop unparsed" or "a + (b binop
 453 unparsed)". To determine this, we look ahead at "binop" to determine its
 454 precedence and compare it to BinOp's precedence (which is '+' in this
 455 case):
 456
 457 .. code-block:: c++
 458
 459         // If BinOp binds less tightly with RHS than the operator after RHS, let
 460         // the pending operator take RHS as its LHS.
 461         int NextPrec = GetTokPrecedence();
 462         if (TokPrec < NextPrec) {
 463
 464 If the precedence of the binop to the right of "RHS" is lower or equal
 465 to the precedence of our current operator, then we know that the
 466 parentheses associate as "(a+b) binop ...". In our example, the current
 467 operator is "+" and the next operator is "+", we know that they have the
 468 same precedence. In this case we'll create the AST node for "a+b", and
 469 then continue parsing:
 470
 471 .. code-block:: c++
 472
 473           ... if body omitted ...
 474         }
 475
 476         // Merge LHS/RHS.
 477         LHS = new BinaryExprAST(BinOp, LHS, RHS);
 478       }  // loop around to the top of the while loop.
 479     }
 480
 481 In our example above, this will turn "a+b+" into "(a+b)" and execute the
 482 next iteration of the loop, with "+" as the current token. The code
 483 above will eat, remember, and parse "(c+d)" as the primary expression,
 484 which makes the current pair equal to [+, (c+d)]. It will then evaluate
 485 the 'if' conditional above with "\*" as the binop to the right of the
 486 primary. In this case, the precedence of "\*" is higher than the
 487 precedence of "+" so the if condition will be entered.
 488
 489 The critical question left here is "how can the if condition parse the
 490 right hand side in full"? In particular, to build the AST correctly for
 491 our example, it needs to get all of "(c+d)\*e\*f" as the RHS expression
 492 variable. The code to do this is surprisingly simple (code from the
 493 above two blocks duplicated for context):
 494
 495 .. code-block:: c++
 496
 497         // If BinOp binds less tightly with RHS than the operator after RHS, let
 498         // the pending operator take RHS as its LHS.
 499         int NextPrec = GetTokPrecedence();
 500         if (TokPrec < NextPrec) {
 501           RHS = ParseBinOpRHS(TokPrec+1, RHS);
 502           if (RHS == 0) return 0;
 503         }
 504         // Merge LHS/RHS.
 505         LHS = new BinaryExprAST(BinOp, LHS, RHS);
 506       }  // loop around to the top of the while loop.
 507     }
 508
 509 At this point, we know that the binary operator to the RHS of our
 510 primary has higher precedence than the binop we are currently parsing.
 511 As such, we know that any sequence of pairs whose operators are all
 512 higher precedence than "+" should be parsed together and returned as
 513 "RHS". To do this, we recursively invoke the ``ParseBinOpRHS`` function
 514 specifying "TokPrec+1" as the minimum precedence required for it to
 515 continue. In our example above, this will cause it to return the AST
 516 node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of the '+'
 517 expression.
 518
 519 Finally, on the next iteration of the while loop, the "+g" piece is
 520 parsed and added to the AST. With this little bit of code (14
 521 non-trivial lines), we correctly handle fully general binary expression
 522 parsing in a very elegant way. This was a whirlwind tour of this code,
 523 and it is somewhat subtle. I recommend running through it with a few
 524 tough examples to see how it works.
 525
 526 This wraps up handling of expressions. At this point, we can point the
 527 parser at an arbitrary token stream and build an expression from it,
 528 stopping at the first token that is not part of the expression. Next up
 529 we need to handle function definitions, etc.
 530
 531 Parsing the Rest
 532 ================
 533
 534 The next thing missing is handling of function prototypes. In
 535 Kaleidoscope, these are used both for 'extern' function declarations as
 536 well as function body definitions. The code to do this is
 537 straight-forward and not very interesting (once you've survived
 538 expressions):
 539
 540 .. code-block:: c++
 541
 542     /// prototype
 543     ///   ::= id '(' id* ')'
 544     static PrototypeAST *ParsePrototype() {
 545       if (CurTok != tok_identifier)
 546         return ErrorP("Expected function name in prototype");
 547
 548       std::string FnName = IdentifierStr;
 549       getNextToken();
 550
 551       if (CurTok != '(')
 552         return ErrorP("Expected '(' in prototype");
 553
 554       // Read the list of argument names.
 555       std::vector<std::string> ArgNames;
 556       while (getNextToken() == tok_identifier)
 557         ArgNames.push_back(IdentifierStr);
 558       if (CurTok != ')')
 559         return ErrorP("Expected ')' in prototype");
 560
 561       // success.
 562       getNextToken();  // eat ')'.
 563
 564       return new PrototypeAST(FnName, ArgNames);
 565     }
 566
 567 Given this, a function definition is very simple, just a prototype plus
 568 an expression to implement the body:
 569
 570 .. code-block:: c++
 571
 572     /// definition ::= 'def' prototype expression
 573     static FunctionAST *ParseDefinition() {
 574       getNextToken();  // eat def.
 575       PrototypeAST *Proto = ParsePrototype();
 576       if (Proto == 0) return 0;
 577
 578       if (ExprAST *E = ParseExpression())
 579         return new FunctionAST(Proto, E);
 580       return 0;
 581     }
 582
 583 In addition, we support 'extern' to declare functions like 'sin' and
 584 'cos' as well as to support forward declaration of user functions. These
 585 'extern's are just prototypes with no body:
 586
 587 .. code-block:: c++
 588
 589     /// external ::= 'extern' prototype
 590     static PrototypeAST *ParseExtern() {
 591       getNextToken();  // eat extern.
 592       return ParsePrototype();
 593     }
 594
 595 Finally, we'll also let the user type in arbitrary top-level expressions
 596 and evaluate them on the fly. We will handle this by defining anonymous
 597 nullary (zero argument) functions for them:
 598
 599 .. code-block:: c++
 600
 601     /// toplevelexpr ::= expression
 602     static FunctionAST *ParseTopLevelExpr() {
 603       if (ExprAST *E = ParseExpression()) {
 604         // Make an anonymous proto.
 605         PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
 606         return new FunctionAST(Proto, E);
 607       }
 608       return 0;
 609     }
 610
 611 Now that we have all the pieces, let's build a little driver that will
 612 let us actually *execute* this code we've built!
 613
 614 The Driver
 615 ==========
 616
 617 The driver for this simply invokes all of the parsing pieces with a
 618 top-level dispatch loop. There isn't much interesting here, so I'll just
 619 include the top-level loop. See `below <#code>`_ for full code in the
 620 "Top-Level Parsing" section.
 621
 622 .. code-block:: c++
 623
 624     /// top ::= definition | external | expression | ';'
 625     static void MainLoop() {
 626       while (1) {
 627         fprintf(stderr, "ready> ");
 628         switch (CurTok) {
 629         case tok_eof:    return;
 630         case ';':        getNextToken(); break;  // ignore top-level semicolons.
 631         case tok_def:    HandleDefinition(); break;
 632         case tok_extern: HandleExtern(); break;
 633         default:         HandleTopLevelExpression(); break;
 634         }
 635       }
 636     }
 637
 638 The most interesting part of this is that we ignore top-level
 639 semicolons. Why is this, you ask? The basic reason is that if you type
 640 "4 + 5" at the command line, the parser doesn't know whether that is the
 641 end of what you will type or not. For example, on the next line you
 642 could type "def foo..." in which case 4+5 is the end of a top-level
 643 expression. Alternatively you could type "\* 6", which would continue
 644 the expression. Having top-level semicolons allows you to type "4+5;",
 645 and the parser will know you are done.
 646
 647 Conclusions
 648 ===========
 649
 650 With just under 400 lines of commented code (240 lines of non-comment,
 651 non-blank code), we fully defined our minimal language, including a
 652 lexer, parser, and AST builder. With this done, the executable will
 653 validate Kaleidoscope code and tell us if it is grammatically invalid.
 654 For example, here is a sample interaction:
 655
 656 .. code-block:: bash
 657
 658     $ ./a.out
 659     ready> def foo(x y) x+foo(y, 4.0);
 660     Parsed a function definition.
 661     ready> def foo(x y) x+y y;
 662     Parsed a function definition.
 663     Parsed a top-level expr
 664     ready> def foo(x y) x+y );
 665     Parsed a function definition.
 666     Error: unknown token when expecting an expression
 667     ready> extern sin(a);
 668     ready> Parsed an extern
 669     ready> ^D
 670     $
 671
 672 There is a lot of room for extension here. You can define new AST nodes,
 673 extend the language in many ways, etc. In the `next
 674 installment <LangImpl3.html>`_, we will describe how to generate LLVM
 675 Intermediate Representation (IR) from the AST.
 676
 677 Full Code Listing
 678 =================
 679
 680 Here is the complete code listing for this and the previous chapter.
 681 Note that it is fully self-contained: you don't need LLVM or any
 682 external libraries at all for this. (Besides the C and C++ standard
 683 libraries, of course.) To build this, just compile with:
 684
 685 .. code-block:: bash
 686
 687     # Compile
 688     clang++ -g -O3 toy.cpp
 689     # Run
 690     ./a.out
 691
 692 Here is the code:
 693
 694 .. code-block:: c++
 695
 696     #include <cstdio>
 697     #include <cstdlib>
 698     #include <string>
 699     #include <map>
 700     #include <vector>
 701
 702     //===----------------------------------------------------------------------===//
 703     // Lexer
 704     //===----------------------------------------------------------------------===//
 705
 706     // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
 707     // of these for known things.
 708     enum Token {
 709       tok_eof = -1,
 710
 711       // commands
 712       tok_def = -2, tok_extern = -3,
 713
 714       // primary
 715       tok_identifier = -4, tok_number = -5
 716     };
 717
 718     static std::string IdentifierStr;  // Filled in if tok_identifier
 719     static double NumVal;              // Filled in if tok_number
 720
 721     /// gettok - Return the next token from standard input.
 722     static int gettok() {
 723       static int LastChar = ' ';
 724
 725       // Skip any whitespace.
 726       while (isspace(LastChar))
 727         LastChar = getchar();
 728
 729       if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
 730         IdentifierStr = LastChar;
 731         while (isalnum((LastChar = getchar())))
 732           IdentifierStr += LastChar;
 733
 734         if (IdentifierStr == "def") return tok_def;
 735         if (IdentifierStr == "extern") return tok_extern;
 736         return tok_identifier;
 737       }
 738
 739       if (isdigit(LastChar) || LastChar == '.') {   // Number: [0-9.]+
 740         std::string NumStr;
 741         do {
 742           NumStr += LastChar;
 743           LastChar = getchar();
 744         } while (isdigit(LastChar) || LastChar == '.');
 745
 746         NumVal = strtod(NumStr.c_str(), 0);
 747         return tok_number;
 748       }
 749
 750       if (LastChar == '#') {
 751         // Comment until end of line.
 752         do LastChar = getchar();
 753         while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
 754
 755         if (LastChar != EOF)
 756           return gettok();
 757       }
 758
 759       // Check for end of file.  Don't eat the EOF.
 760       if (LastChar == EOF)
 761         return tok_eof;
 762
 763       // Otherwise, just return the character as its ascii value.
 764       int ThisChar = LastChar;
 765       LastChar = getchar();
 766       return ThisChar;
 767     }
 768
 769     //===----------------------------------------------------------------------===//
 770     // Abstract Syntax Tree (aka Parse Tree)
 771     //===----------------------------------------------------------------------===//
 772
 773     /// ExprAST - Base class for all expression nodes.
 774     class ExprAST {
 775     public:
 776       virtual ~ExprAST() {}
 777     };
 778
 779     /// NumberExprAST - Expression class for numeric literals like "1.0".
 780     class NumberExprAST : public ExprAST {
 781       double Val;
 782     public:
 783       NumberExprAST(double val) : Val(val) {}
 784     };
 785
 786     /// VariableExprAST - Expression class for referencing a variable, like "a".
 787     class VariableExprAST : public ExprAST {
 788       std::string Name;
 789     public:
 790       VariableExprAST(const std::string &name) : Name(name) {}
 791     };
 792
 793     /// BinaryExprAST - Expression class for a binary operator.
 794     class BinaryExprAST : public ExprAST {
 795       char Op;
 796       ExprAST *LHS, *RHS;
 797     public:
 798       BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
 799         : Op(op), LHS(lhs), RHS(rhs) {}
 800     };
 801
 802     /// CallExprAST - Expression class for function calls.
 803     class CallExprAST : public ExprAST {
 804       std::string Callee;
 805       std::vector<ExprAST*> Args;
 806     public:
 807       CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
 808         : Callee(callee), Args(args) {}
 809     };
 810
 811     /// PrototypeAST - This class represents the "prototype" for a function,
 812     /// which captures its name, and its argument names (thus implicitly the number
 813     /// of arguments the function takes).
 814     class PrototypeAST {
 815       std::string Name;
 816       std::vector<std::string> Args;
 817     public:
 818       PrototypeAST(const std::string &name, const std::vector<std::string> &args)
 819         : Name(name), Args(args) {}
 820
 821     };
 822
 823     /// FunctionAST - This class represents a function definition itself.
 824     class FunctionAST {
 825       PrototypeAST *Proto;
 826       ExprAST *Body;
 827     public:
 828       FunctionAST(PrototypeAST *proto, ExprAST *body)
 829         : Proto(proto), Body(body) {}
 830
 831     };
 832
 833     //===----------------------------------------------------------------------===//
 834     // Parser
 835     //===----------------------------------------------------------------------===//
 836
 837     /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 838     /// token the parser is looking at.  getNextToken reads another token from the
 839     /// lexer and updates CurTok with its results.
 840     static int CurTok;
 841     static int getNextToken() {
 842       return CurTok = gettok();
 843     }
 844
 845     /// BinopPrecedence - This holds the precedence for each binary operator that is
 846     /// defined.
 847     static std::map<char, int> BinopPrecedence;
 848
 849     /// GetTokPrecedence - Get the precedence of the pending binary operator token.
 850     static int GetTokPrecedence() {
 851       if (!isascii(CurTok))
 852         return -1;
 853
 854       // Make sure it's a declared binop.
 855       int TokPrec = BinopPrecedence[CurTok];
 856       if (TokPrec <= 0) return -1;
 857       return TokPrec;
 858     }
 859
 860     /// Error* - These are little helper functions for error handling.
 861     ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 862     PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 863     FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 864
 865     static ExprAST *ParseExpression();
 866
 867     /// identifierexpr
 868     ///   ::= identifier
 869     ///   ::= identifier '(' expression* ')'
 870     static ExprAST *ParseIdentifierExpr() {
 871       std::string IdName = IdentifierStr;
 872
 873       getNextToken();  // eat identifier.
 874
 875       if (CurTok != '(') // Simple variable ref.
 876         return new VariableExprAST(IdName);
 877
 878       // Call.
 879       getNextToken();  // eat (
 880       std::vector<ExprAST*> Args;
 881       if (CurTok != ')') {
 882         while (1) {
 883           ExprAST *Arg = ParseExpression();
 884           if (!Arg) return 0;
 885           Args.push_back(Arg);
 886
 887           if (CurTok == ')') break;
 888
 889           if (CurTok != ',')
 890             return Error("Expected ')' or ',' in argument list");
 891           getNextToken();
 892         }
 893       }
 894
 895       // Eat the ')'.
 896       getNextToken();
 897
 898       return new CallExprAST(IdName, Args);
 899     }
 900
 901     /// numberexpr ::= number
 902     static ExprAST *ParseNumberExpr() {
 903       ExprAST *Result = new NumberExprAST(NumVal);
 904       getNextToken(); // consume the number
 905       return Result;
 906     }
 907
 908     /// parenexpr ::= '(' expression ')'
 909     static ExprAST *ParseParenExpr() {
 910       getNextToken();  // eat (.
 911       ExprAST *V = ParseExpression();
 912       if (!V) return 0;
 913
 914       if (CurTok != ')')
 915         return Error("expected ')'");
 916       getNextToken();  // eat ).
 917       return V;
 918     }
 919
 920     /// primary
 921     ///   ::= identifierexpr
 922     ///   ::= numberexpr
 923     ///   ::= parenexpr
 924     static ExprAST *ParsePrimary() {
 925       switch (CurTok) {
 926       default: return Error("unknown token when expecting an expression");
 927       case tok_identifier: return ParseIdentifierExpr();
 928       case tok_number:     return ParseNumberExpr();
 929       case '(':            return ParseParenExpr();
 930       }
 931     }
 932
 933     /// binoprhs
 934     ///   ::= ('+' primary)*
 935     static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
 936       // If this is a binop, find its precedence.
 937       while (1) {
 938         int TokPrec = GetTokPrecedence();
 939
 940         // If this is a binop that binds at least as tightly as the current binop,
 941         // consume it, otherwise we are done.
 942         if (TokPrec < ExprPrec)
 943           return LHS;
 944
 945         // Okay, we know this is a binop.
 946         int BinOp = CurTok;
 947         getNextToken();  // eat binop
 948
 949         // Parse the primary expression after the binary operator.
 950         ExprAST *RHS = ParsePrimary();
 951         if (!RHS) return 0;
 952
 953         // If BinOp binds less tightly with RHS than the operator after RHS, let
 954         // the pending operator take RHS as its LHS.
 955         int NextPrec = GetTokPrecedence();
 956         if (TokPrec < NextPrec) {
 957           RHS = ParseBinOpRHS(TokPrec+1, RHS);
 958           if (RHS == 0) return 0;
 959         }
 960
 961         // Merge LHS/RHS.
 962         LHS = new BinaryExprAST(BinOp, LHS, RHS);
 963       }
 964     }
 965
 966     /// expression
 967     ///   ::= primary binoprhs
 968     ///
 969     static ExprAST *ParseExpression() {
 970       ExprAST *LHS = ParsePrimary();
 971       if (!LHS) return 0;
 972
 973       return ParseBinOpRHS(0, LHS);
 974     }
 975
 976     /// prototype
 977     ///   ::= id '(' id* ')'
 978     static PrototypeAST *ParsePrototype() {
 979       if (CurTok != tok_identifier)
 980         return ErrorP("Expected function name in prototype");
 981
 982       std::string FnName = IdentifierStr;
 983       getNextToken();
 984
 985       if (CurTok != '(')
 986         return ErrorP("Expected '(' in prototype");
 987
 988       std::vector<std::string> ArgNames;
 989       while (getNextToken() == tok_identifier)
 990         ArgNames.push_back(IdentifierStr);
 991       if (CurTok != ')')
 992         return ErrorP("Expected ')' in prototype");
 993
 994       // success.
 995       getNextToken();  // eat ')'.
 996
 997       return new PrototypeAST(FnName, ArgNames);
 998     }
 999
1000     /// definition ::= 'def' prototype expression
1001     static FunctionAST *ParseDefinition() {
1002       getNextToken();  // eat def.
1003       PrototypeAST *Proto = ParsePrototype();
1004       if (Proto == 0) return 0;
1005
1006       if (ExprAST *E = ParseExpression())
1007         return new FunctionAST(Proto, E);
1008       return 0;
1009     }
1010
1011     /// toplevelexpr ::= expression
1012     static FunctionAST *ParseTopLevelExpr() {
1013       if (ExprAST *E = ParseExpression()) {
1014         // Make an anonymous proto.
1015         PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
1016         return new FunctionAST(Proto, E);
1017       }
1018       return 0;
1019     }
1020
1021     /// external ::= 'extern' prototype
1022     static PrototypeAST *ParseExtern() {
1023       getNextToken();  // eat extern.
1024       return ParsePrototype();
1025     }
1026
1027     //===----------------------------------------------------------------------===//
1028     // Top-Level parsing
1029     //===----------------------------------------------------------------------===//
1030
1031     static void HandleDefinition() {
1032       if (ParseDefinition()) {
1033         fprintf(stderr, "Parsed a function definition.\n");
1034       } else {
1035         // Skip token for error recovery.
1036         getNextToken();
1037       }
1038     }
1039
1040     static void HandleExtern() {
1041       if (ParseExtern()) {
1042         fprintf(stderr, "Parsed an extern\n");
1043       } else {
1044         // Skip token for error recovery.
1045         getNextToken();
1046       }
1047     }
1048
1049     static void HandleTopLevelExpression() {
1050       // Evaluate a top-level expression into an anonymous function.
1051       if (ParseTopLevelExpr()) {
1052         fprintf(stderr, "Parsed a top-level expr\n");
1053       } else {
1054         // Skip token for error recovery.
1055         getNextToken();
1056       }
1057     }
1058
1059     /// top ::= definition | external | expression | ';'
1060     static void MainLoop() {
1061       while (1) {
1062         fprintf(stderr, "ready> ");
1063         switch (CurTok) {
1064         case tok_eof:    return;
1065         case ';':        getNextToken(); break;  // ignore top-level semicolons.
1066         case tok_def:    HandleDefinition(); break;
1067         case tok_extern: HandleExtern(); break;
1068         default:         HandleTopLevelExpression(); break;
1069         }
1070       }
1071     }
1072
1073     //===----------------------------------------------------------------------===//
1074     // Main driver code.
1075     //===----------------------------------------------------------------------===//
1076
1077     int main() {
1078       // Install standard binary operators.
1079       // 1 is lowest precedence.
1080       BinopPrecedence['<'] = 10;
1081       BinopPrecedence['+'] = 20;
1082       BinopPrecedence['-'] = 20;
1083       BinopPrecedence['*'] = 40;  // highest.
1084
1085       // Prime the first token.
1086       fprintf(stderr, "ready> ");
1087       getNextToken();
1088
1089       // Run the main "interpreter loop" now.
1090       MainLoop();
1091
1092       return 0;
1093     }
1094
1095 `Next: Implementing Code Generation to LLVM IR <LangImpl3.html>`_
1096