docs/tutorial/LangImpl2.html

   1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   2                       "http://www.w3.org/TR/html4/strict.dtd">
   3
   4 <html>
   5 <head>
   6   <title>Kaleidoscope: Implementing a Parser and AST</title>
   7   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   8   <meta name="author" content="Chris Lattner">
   9   <link rel="stylesheet" href="../llvm.css" type="text/css">
  10 </head>
  11
  12 <body>
  13
  14 <div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
  15
  16 <div class="doc_author">
  17   <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
  18 </div>
  19
  20 <!-- *********************************************************************** -->
  21 <div class="doc_section"><a name="intro">Part 2 Introduction</a></div>
  22 <!-- *********************************************************************** -->
  23
  24 <div class="doc_text">
  25
  26 <p>Welcome to part 2 of the "<a href="index.html">Implementing a language with
  27 LLVM</a>" tutorial.  This chapter shows you how to use the <a
  28 href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
  29 href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
  30 our Kaleidoscope language and build an <a
  31 href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
  32 Tree</a> (AST).</p>
  33
  34 <p>The parser we will build uses a combination of <a
  35 href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
  36 Parsing</a> and <a href=
  37 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
  38 Parsing</a> to parse the Kaleidoscope language (the later for binary expression
  39 and the former for everything else).  Before we get to parsing though, lets talk
  40 about the output of the parser: the Abstract Syntax Tree.</p>
  41
  42 </div>
  43
  44 <!-- *********************************************************************** -->
  45 <div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
  46 <!-- *********************************************************************** -->
  47
  48 <div class="doc_text">
  49
  50 <p>The AST for a program captures its behavior in a way that it is easy for
  51 later stages of the compiler (e.g. code generation) to interpret.  We basically
  52 want one object for each construct in the language, and the AST should closely
  53 model the language.  In Kaleidoscope, we have expressions, a prototype, and a
  54 function object.  We'll start with expressions first:</p>
  55
  56 <div class="doc_code">
  57 <pre>
  58 /// ExprAST - Base class for all expression nodes.
  59 class ExprAST {
  60 public:
  61   virtual ~ExprAST() {}
  62 };
  63
  64 /// NumberExprAST - Expression class for numeric literals like "1.0".
  65 class NumberExprAST : public ExprAST {
  66   double Val;
  67 public:
  68   explicit NumberExprAST(double val) : Val(val) {}
  69 };
  70 </pre>
  71 </div>
  72
  73 <p>The code above shows the definition of the base ExprAST class and one
  74 subclass which we use for numeric literals.  The important thing about this is
  75 that the NumberExprAST class captures the numeric value of the literal in the
  76 class, so that later phases of the compiler can know what it is.</p>
  77
  78 <p>Right now we only create the AST,  so there are no useful accessor methods on
  79 them.  It would be very easy to add a virtual method to pretty print the code,
  80 for example.  Here are the other expression AST node definitions that we'll use
  81 in the basic form of the Kaleidoscope language.
  82 </p>
  83
  84 <div class="doc_code">
  85 <pre>
  86 /// VariableExprAST - Expression class for referencing a variable, like "a".
  87 class VariableExprAST : public ExprAST {
  88   std::string Name;
  89 public:
  90   explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
  91 };
  92
  93 /// BinaryExprAST - Expression class for a binary operator.
  94 class BinaryExprAST : public ExprAST {
  95   char Op;
  96   ExprAST *LHS, *RHS;
  97 public:
  98   BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
  99     : Op(op), LHS(lhs), RHS(rhs) {}
 100 };
 101
 102 /// CallExprAST - Expression class for function calls.
 103 class CallExprAST : public ExprAST {
 104   std::string Callee;
 105   std::vector&lt;ExprAST*&gt; Args;
 106 public:
 107   CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
 108     : Callee(callee), Args(args) {}
 109 };
 110 </pre>
 111 </div>
 112
 113 <p>This is all (intentially) rather straight-forward: variables capture the
 114 variable name, binary operators capture their opcode (e.g. '+'), and calls
 115 capture a function name and list of argument expressions.  One thing that is
 116 nice about our AST is that it captures the language features without talking
 117 about the syntax of the language.  Note that there is no discussion about
 118 precedence of binary operators, lexical structure etc.</p>
 119
 120 <p>For our basic language, these are all of the expression nodes we'll define.
 121 Because it doesn't have conditional control flow, it isn't Turing-complete;
 122 we'll fix that in a later installment.  The two things we need next are a way
 123 to talk about the interface to a function, and a way to talk about functions
 124 themselves:</p>
 125
 126 <div class="doc_code">
 127 <pre>
 128 /// PrototypeAST - This class represents the "prototype" for a function,
 129 /// which captures its argument names as well as if it is an operator.
 130 class PrototypeAST {
 131   std::string Name;
 132   std::vector&lt;std::string&gt; Args;
 133 public:
 134   PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
 135     : Name(name), Args(args) {}
 136 };
 137
 138 /// FunctionAST - This class represents a function definition itself.
 139 class FunctionAST {
 140   PrototypeAST *Proto;
 141   ExprAST *Body;
 142 public:
 143   FunctionAST(PrototypeAST *proto, ExprAST *body)
 144     : Proto(proto), Body(body) {}
 145 };
 146 </pre>
 147 </div>
 148
 149 <p>In Kaleidoscope, functions are typed with just a count of their arguments.
 150 Since all values are double precision floating point, this fact doesn't need to
 151 be captured anywhere.  In a more aggressive and realistic language, the
 152 "ExprAST" class would probably have a type field.</p>
 153
 154 <p>With this scaffolding, we can now talk about parsing expressions and function
 155 bodies in Kaleidoscope.</p>
 156
 157 </div>
 158
 159 <!-- *********************************************************************** -->
 160 <div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
 161 <!-- *********************************************************************** -->
 162
 163 <div class="doc_text">
 164
 165 <p>Now that we have an AST to build, we need to define the parser code to build
 166 it.  The idea here is that we want to parse something like "x+y" (which is
 167 returned as three tokens by the lexer) into an AST that could be generated with
 168 calls like this:</p>
 169
 170 <div class="doc_code">
 171 <pre>
 172   ExprAST *X = new VariableExprAST("x");
 173   ExprAST *Y = new VariableExprAST("y");
 174   ExprAST *Result = new BinaryExprAST('+', X, Y);
 175 </pre>
 176 </div>
 177
 178 <p>In order to do this, we'll start by defining some basic helper routines:</p>
 179
 180 <div class="doc_code">
 181 <pre>
 182 /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 183 /// token the parser it looking at.  getNextToken reads another token from the
 184 /// lexer and updates CurTok with its results.
 185 static int CurTok;
 186 static int getNextToken() {
 187   return CurTok = gettok();
 188 }
 189 </pre>
 190 </div>
 191
 192 <p>
 193 This implements a simple token buffer around the lexer.  This allows
 194 us to look one token ahead at what the lexer is returning.  Every function in
 195 our lexer will assume that CurTok is the current token that needs to be
 196 parsed.</p>
 197
 198 <p>Again, we define these with global variables; it would be better design to
 199 wrap the entire parser in a class and use instance variables for these.
 200 </p>
 201
 202 <div class="doc_code">
 203 <pre>
 204
 205 /// Error* - These are little helper functions for error handling.
 206 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 207 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 208 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 209 </pre>
 210 </div>
 211
 212 <p>
 213 The <tt>Error</tt> routines are simple helper routines that our parser will use
 214 to handle errors.  The error recovery in our parser will not be the best and
 215 are not particular user-friendly, but it will be enough for our tutorial.  These
 216 routines make it easier to handle errors in routines that have various return
 217 types: they always return null.</p>
 218
 219 <p>With these basic helper functions implemented, we can implement the first
 220 piece of our grammar: we'll start with numeric literals.</p>
 221
 222 </div>
 223
 224 <!-- *********************************************************************** -->
 225 <div class="doc_section"><a name="parserprimexprs">Basic Expression
 226  Parsing</a></div>
 227 <!-- *********************************************************************** -->
 228
 229 <div class="doc_text">
 230
 231 <p>We start with numeric literals, because they are the simplest to process.
 232 For each production in our grammar, we'll define a function which parses that
 233 production.  For numeric literals, we have:
 234 </p>
 235
 236 <div class="doc_code">
 237 <pre>
 238 /// numberexpr ::= number
 239 static ExprAST *ParseNumberExpr() {
 240   ExprAST *Result = new NumberExprAST(NumVal);
 241   getNextToken(); // consume the number
 242   return Result;
 243 }
 244 </pre>
 245 </div>
 246
 247 <p>This routine is very simple: it expects to be called when the current token
 248 is a <tt>tok_number</tt> token.  It takes the current number value, creates
 249 a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
 250 returns.</p>
 251
 252 <p>There are some interesting aspects of this.  The most important one is that
 253 this routine eats all of the tokens that correspond to the production, and
 254 returns the lexer buffer with the next token (which is not part of the grammar
 255 production) ready to go.  This is a fairly standard way to go for recursive
 256 descent parsers.  For a better example, the parenthesis operator is defined like
 257 this:</p>
 258
 259 <div class="doc_code">
 260 <pre>
 261 /// parenexpr ::= '(' expression ')'
 262 static ExprAST *ParseParenExpr() {
 263   getNextToken();  // eat (.
 264   ExprAST *V = ParseExpression();
 265   if (!V) return 0;
 266
 267   if (CurTok != ')')
 268     return Error("expected ')'");
 269   getNextToken();  // eat ).
 270   return V;
 271 }
 272 </pre>
 273 </div>
 274
 275 <p>This function illustrates a number of interesting things about the parser:
 276 1) it shows how we use the Error routines.  When called, this function expects
 277 that the current token is a '(' token, but after parsing the subexpression, it
 278 is possible that there is not a ')' waiting.  For example, if the user types in
 279 "(4 x" instead of "(4)", the parser should emit an error.  Because errors can
 280 occur, the parser needs a way to indicate that they happened: in our parser, we
 281 return null on an error.</p>
 282
 283 <p>Another interesting aspect of this function is that it uses recursion by
 284 calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
 285 <tt>ParseParenExpr</tt>).  This is powerful because it allows us to handle
 286 recursive grammars, and keeps each production very simple.  Note that
 287 parenthesis do not cause construction of AST nodes themselves.  While we could
 288 do this, the most important role of parens are to guide the parser and provide
 289 grouping.  Once the parser constructs the AST, parens are not needed.</p>
 290
 291 <p>The next simple production is for handling variable references and function
 292 calls:</p>
 293
 294 <div class="doc_code">
 295 <pre>
 296 /// identifierexpr
 297 ///   ::= identifer
 298 ///   ::= identifer '(' expression* ')'
 299 static ExprAST *ParseIdentifierExpr() {
 300   std::string IdName = IdentifierStr;
 301
 302   getNextToken();  // eat identifer.
 303
 304   if (CurTok != '(') // Simple variable ref.
 305     return new VariableExprAST(IdName);
 306
 307   // Call.
 308   getNextToken();  // eat (
 309   std::vector&lt;ExprAST*&gt; Args;
 310   while (1) {
 311     ExprAST *Arg = ParseExpression();
 312     if (!Arg) return 0;
 313     Args.push_back(Arg);
 314
 315     if (CurTok == ')') break;
 316
 317     if (CurTok != ',')
 318       return Error("Expected ')'");
 319     getNextToken();
 320   }
 321
 322   // Eat the ')'.
 323   getNextToken();
 324
 325   return new CallExprAST(IdName, Args);
 326 }
 327 </pre>
 328 </div>
 329
 330 <p>This routine follows the same style as the other routines (it expects to be
 331 called if the current token is a <tt>tok_identifier</tt> token).  It also has
 332 recursion and error handling.  One interesting aspect of this is that it uses
 333 <em>look-ahead</em> to determine if the current identifier is a stand alone
 334 variable reference or if it is a function call expression.  It handles this by
 335 checking to see if the token after the identifier is a '(' token, and constructs
 336 either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
 337 </p>
 338
 339 <p>Now that we have all of our simple expression parsing logic in place, we can
 340 define a helper function to wrap them up in a class.  We call this class of
 341 expressions "primary" expressions, for reasons that will become more clear
 342 later.  In order to parse a primary expression, we need to determine what sort
 343 of expression it is:</p>
 344
 345 <div class="doc_code">
 346 <pre>
 347 /// primary
 348 ///   ::= identifierexpr
 349 ///   ::= numberexpr
 350 ///   ::= parenexpr
 351 static ExprAST *ParsePrimary() {
 352   switch (CurTok) {
 353   default: return Error("unknown token when expecting an expression");
 354   case tok_identifier: return ParseIdentifierExpr();
 355   case tok_number:     return ParseNumberExpr();
 356   case '(':            return ParseParenExpr();
 357   }
 358 }
 359 </pre>
 360 </div>
 361
 362 <p>Now that you see the definition of this function, it makes it more obvious
 363 why we can assume the state of CurTok in the various functions.  This uses
 364 look-ahead to determine which sort of expression is being inspected, and parses
 365 it with a function call.</p>
 366
 367 <p>Now that basic expressions are handled, we need to handle binary expressions,
 368 which are a bit more complex.</p>
 369
 370 </div>
 371
 372 <!-- *********************************************************************** -->
 373 <div class="doc_section"><a name="parserbinops">Binary Expression
 374  Parsing</a></div>
 375 <!-- *********************************************************************** -->
 376
 377 <div class="doc_text">
 378
 379 <p>Binary expressions are significantly harder to parse because they are often
 380 ambiguous.  For example, when given the string "x+y*z", the parser can choose
 381 to parse it as either "(x+y)*z" or "x+(y*z)".  With common definitions from
 382 mathematics, we expect the later parse, because "*" (multiplication) has
 383 higher <em>precedence</em> than "+" (addition).</p>
 384
 385 <p>There are many ways to handle this, but an elegant and efficient way is to
 386 use <a href=
 387 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
 388 Parsing</a>.  This parsing technique uses the precedence of binary operators to
 389 guide recursion.  To start with, we need a table of precedences:</p>
 390
 391 <div class="doc_code">
 392 <pre>
 393 /// BinopPrecedence - This holds the precedence for each binary operator that is
 394 /// defined.
 395 static std::map&lt;char, int&gt; BinopPrecedence;
 396
 397 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
 398 static int GetTokPrecedence() {
 399   if (!isascii(CurTok))
 400     return -1;
 401
 402   // Make sure it's a declared binop.
 403   int TokPrec = BinopPrecedence[CurTok];
 404   if (TokPrec &lt;= 0) return -1;
 405   return TokPrec;
 406 }
 407
 408 int main() {
 409   // Install standard binary operators.
 410   // 1 is lowest precedence.
 411   BinopPrecedence['&lt;'] = 10;
 412   BinopPrecedence['+'] = 20;
 413   BinopPrecedence['-'] = 20;
 414   BinopPrecedence['*'] = 40;  // highest.
 415   ...
 416 }
 417 </pre>
 418 </div>
 419
 420 <p>For the basic form of Kaleidoscope, we will only support 4 binary operators
 421 (this can obviously be extended by you, the reader).  The
 422 <tt>GetTokPrecedence</tt> function returns the precedence for the current token,
 423 or -1 if the token is not a binary operator.  Having a map makes it easy to add
 424 new operators and makes it clear that the algorithm doesn't depend on the
 425 specific operators involved, but it would be easy enough to eliminate the map
 426 and do the comparisons in the <tt>GetTokPrecedence</tt> function.</p>
 427
 428 <p>With the helper above defined, we can now start parsing binary expressions.
 429 The basic idea of operator precedence parsing is to break down an expression
 430 with potentially ambiguous binary operators into pieces.  Consider for example
 431 the expression "a+b+(c+d)*e*f+g".  Operator precedence parsing considers this
 432 as a stream of primary expressions separated by binary operators.  As such,
 433 it will first parse the leading primary expression "a", then it will see the
 434 pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g].  Note that because parentheses
 435 are primary expressions that the binary expression parser doesn't need to worry
 436 about nested subexpressions like (c+d) at all.
 437 </p>
 438
 439 <p>
 440 To start, an expression is a primary expression potentially followed by a
 441 sequence of [binop,primaryexpr] pairs:</p>
 442
 443 <div class="doc_code">
 444 <pre>
 445 /// expression
 446 ///   ::= primary binoprhs
 447 ///
 448 static ExprAST *ParseExpression() {
 449   ExprAST *LHS = ParsePrimary();
 450   if (!LHS) return 0;
 451
 452   return ParseBinOpRHS(0, LHS);
 453 }
 454 </pre>
 455 </div>
 456
 457 <p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
 458 us.  It takes a precedence and a pointer to an expression for the part parsed
 459 so far.   Note that "x" is a perfectly valid expression: As such, "binoprhs" is
 460 allowed to be empty, in which case it returns the expression that is passed into
 461 it. In our example above, the code passes the expression for "a" into
 462 <tt>ParseBinOpRHS</tt> and the current token is "+".</p>
 463
 464 <p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
 465 minimal operator precedence</em> that the function is allowed to eat.  For
 466 example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
 467 passed in a precedence of 40, it will not consume any tokens (because the
 468 precedence of '+' is only 20).  With this in mind, <tt>ParseBinOpRHS</tt> starts
 469 with:</p>
 470
 471 <div class="doc_code">
 472 <pre>
 473 /// binoprhs
 474 ///   ::= ('+' primary)*
 475 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
 476   // If this is a binop, find its precedence.
 477   while (1) {
 478     int TokPrec = GetTokPrecedence();
 479
 480     // If this is a binop that binds at least as tightly as the current binop,
 481     // consume it, otherwise we are done.
 482     if (TokPrec &lt; ExprPrec)
 483       return LHS;
 484 </pre>
 485 </div>
 486
 487 <p>This code gets the precedence of the current token and checks to see if if is
 488 too low.  Because we defined invalid tokens to have a precedence of -1, this
 489 check implicitly knows that the pair-stream ends when the token stream runs out
 490 of binary operators.  If this check succeeds, we know that the token is a binary
 491 operator and that it will be included in this expression:</p>
 492
 493 <div class="doc_code">
 494 <pre>
 495     // Okay, we know this is a binop.
 496     int BinOp = CurTok;
 497     getNextToken();  // eat binop
 498
 499     // Parse the primary expression after the binary operator.
 500     ExprAST *RHS = ParsePrimary();
 501     if (!RHS) return 0;
 502 </pre>
 503 </div>
 504
 505 <p>As such, this code eats (and remembers) the binary operator and then parses
 506 the following primary expression.  This builds up the whole pair, the first of
 507 which is [+, b] for the running example.</p>
 508
 509 <p>Now that we parsed the left-hand side of an expression and one pair of the
 510 RHS sequence, we have to decide which way the expression associates.  In
 511 particular, we could have "(a+b) binop unparsed"  or "a + (b binop unparsed)".
 512 To determine this, we look ahead at "binop" to determine its precedence and
 513 compare it to BinOp's precedence (which is '+' in this case):</p>
 514
 515 <div class="doc_code">
 516 <pre>
 517     // If BinOp binds less tightly with RHS than the operator after RHS, let
 518     // the pending operator take RHS as its LHS.
 519     int NextPrec = GetTokPrecedence();
 520     if (TokPrec &lt; NextPrec) {
 521 </pre>
 522 </div>
 523
 524 <p>If the precedence of the binop to the right of "RHS" is lower or equal to the
 525 precedence of our current operator, then we know that the parentheses associate
 526 as "(a+b) binop ...".  In our example, since the next operator is "+" and so is
 527 our current one, we know that they have the same precedence.  In this case we'll
 528 create the AST node for "a+b", and then continue parsing:</p>
 529
 530 <div class="doc_code">
 531 <pre>
 532       ... if body omitted ...
 533     }
 534
 535     // Merge LHS/RHS.
 536     LHS = new BinaryExprAST(BinOp, LHS, RHS);
 537   }  // loop around to the top of the while loop.
 538 }
 539 </pre>
 540 </div>
 541
 542 <p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
 543 iteration of the loop, with "+" as the current token.  The code above will eat
 544 and remember it and parse "(c+d)" as the primary expression, which makes the
 545 current pair be [+, (c+d)].  It will then enter the 'if' above with "*" as the
 546 binop to the right of the primary.  In this case, the precedence of "*" is
 547 higher than the precedence of "+" so the if condition will be entered.</p>
 548
 549 <p>The critical question left here is "how can the if condition parse the right
 550 hand side in full"?  In particular, to build the AST correctly for our example,
 551 it needs to get all of "(c+d)*e*f" as the RHS expression variable.  The code to
 552 do this is surprisingly simple (code from the above two blocks duplicated for
 553 context):</p>
 554
 555 <div class="doc_code">
 556 <pre>
 557     // If BinOp binds less tightly with RHS than the operator after RHS, let
 558     // the pending operator take RHS as its LHS.
 559     int NextPrec = GetTokPrecedence();
 560     if (TokPrec &lt; NextPrec) {
 561       RHS = ParseBinOpRHS(TokPrec+1, RHS);
 562       if (RHS == 0) return 0;
 563     }
 564     // Merge LHS/RHS.
 565     LHS = new BinaryExprAST(BinOp, LHS, RHS);
 566   }  // loop around to the top of the while loop.
 567 }
 568 </pre>
 569 </div>
 570
 571 <p>At this point, we know that the binary operator to the RHS of our primary
 572 has higher precedence than the binop we are currently parsing.  As such, we know
 573 that any sequence of pairs whose operators are all higher precedence than "+"
 574 should be parsed together and returned as "RHS".  To do this, we recursively
 575 invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
 576 precedence required for it to continue.  In our example above, this will cause
 577 it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
 578 of the '+' expression.</p>
 579
 580 <p>Finally, on the next iteration of the while loop, the "+g" piece is parsed.
 581 and added to the AST.  With this little bit of code (14 non-trivial lines), we
 582 correctly handle fully general binary expression parsing in a very elegant way.
 583 </p>
 584
 585 <p>This wraps up handling of expressions.  At this point, we can point the
 586 parser at an arbitrary token stream and build an expression from them, stopping
 587 at the first token that is not part of the expression.  Next up we need to
 588 handle function definitions etc.</p>
 589
 590 </div>
 591
 592 <!-- *********************************************************************** -->
 593 <div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
 594 <!-- *********************************************************************** -->
 595
 596 <div class="doc_text">
 597
 598 <p>
 599 The first basic thing missing is that of function prototypes.  In Kaleidoscope,
 600 these are used both for 'extern' function declarations as well as function body
 601 definitions.  The code to do this is straight-forward and not very interesting
 602 (once you've survived expressions):
 603 </p>
 604
 605 <div class="doc_code">
 606 <pre>
 607 /// prototype
 608 ///   ::= id '(' id* ')'
 609 static PrototypeAST *ParsePrototype() {
 610   if (CurTok != tok_identifier)
 611     return ErrorP("Expected function name in prototype");
 612
 613   std::string FnName = IdentifierStr;
 614   getNextToken();
 615
 616   if (CurTok != '(')
 617     return ErrorP("Expected '(' in prototype");
 618
 619   std::vector&lt;std::string&gt; ArgNames;
 620   while (getNextToken() == tok_identifier)
 621     ArgNames.push_back(IdentifierStr);
 622   if (CurTok != ')')
 623     return ErrorP("Expected ')' in prototype");
 624
 625   // success.
 626   getNextToken();  // eat ')'.
 627
 628   return new PrototypeAST(FnName, ArgNames);
 629 }
 630 </pre>
 631 </div>
 632
 633 <p>Given this, a function definition is very simple, just a prototype plus
 634 and expression to implement the body:</p>
 635
 636 <div class="doc_code">
 637 <pre>
 638 /// definition ::= 'def' prototype expression
 639 static FunctionAST *ParseDefinition() {
 640   getNextToken();  // eat def.
 641   PrototypeAST *Proto = ParsePrototype();
 642   if (Proto == 0) return 0;
 643
 644   if (ExprAST *E = ParseExpression())
 645     return new FunctionAST(Proto, E);
 646   return 0;
 647 }
 648 </pre>
 649 </div>
 650
 651 <p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
 652 well as to support forward declaration of user functions.  'externs' are just
 653 prototypes with no body:</p>
 654
 655 <div class="doc_code">
 656 <pre>
 657 /// external ::= 'extern' prototype
 658 static PrototypeAST *ParseExtern() {
 659   getNextToken();  // eat extern.
 660   return ParsePrototype();
 661 }
 662 </pre>
 663 </div>
 664
 665 <p>Finally, we'll also let the user type in arbitrary top-level expressions and
 666 evaluate them on the fly.  We will handle this by defining anonymous nullary
 667 (zero argument) functions for them:</p>
 668
 669 <div class="doc_code">
 670 <pre>
 671 /// toplevelexpr ::= expression
 672 static FunctionAST *ParseTopLevelExpr() {
 673   if (ExprAST *E = ParseExpression()) {
 674     // Make an anonymous proto.
 675     PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
 676     return new FunctionAST(Proto, E);
 677   }
 678   return 0;
 679 }
 680 </pre>
 681 </div>
 682
 683 <p>Now that we have all the pieces, lets build a little driver that will let us
 684 actually <em>execute</em> this code we've built!</p>
 685
 686 </div>
 687
 688 <!-- *********************************************************************** -->
 689 <div class="doc_section"><a name="driver">The Driver</a></div>
 690 <!-- *********************************************************************** -->
 691
 692 <div class="doc_text">
 693
 694 <p>The driver for this simply invokes all of the parsing pieces with a top-level
 695 dispatch loop.  There isn't much interesting here, so I'll just include the
 696 top-level loop.  See <a href="#code">below</a> for full code in the "Top-Level
 697 Parsing" section.</p>
 698
 699 <div class="doc_code">
 700 <pre>
 701 /// top ::= definition | external | expression | ';'
 702 static void MainLoop() {
 703   while (1) {
 704     fprintf(stderr, "ready&gt; ");
 705     switch (CurTok) {
 706     case tok_eof:    return;
 707     case ';':        getNextToken(); break;  // ignore top level semicolons.
 708     case tok_def:    HandleDefinition(); break;
 709     case tok_extern: HandleExtern(); break;
 710     default:         HandleTopLevelExpression(); break;
 711     }
 712   }
 713 }
 714 </pre>
 715 </div>
 716
 717 <p>The most interesting part of this is that we ignore top-level semi colons.
 718 Why is this, you ask?  The basic reason is that if you type "4 + 5" at the
 719 command line, the parser doesn't know that that is the end of what you will
 720 type.  For example, on the next line you could type "def foo..." in which case
 721 4+5 is the end of a top-level expression.  Alternatively you could type "* 6",
 722 which would continue the expression.  Having top-level semicolons allows you to
 723 type "4+5;" and the parser will know you are done.</p>
 724
 725 </div>
 726
 727 <!-- *********************************************************************** -->
 728 <div class="doc_section"><a name="code">Conclusions and the Full Code</a></div>
 729 <!-- *********************************************************************** -->
 730
 731 <div class="doc_text">
 732
 733 <p>With just under 400 lines of commented code, we fully defined our minimal
 734 language, including a lexer, parser and AST builder.  With this done, the
 735 executable will validate code and tell us if it is gramatically invalid.  For
 736 example, here is a sample interaction:</p>
 737
 738 <div class="doc_code">
 739 <pre>
 740 $ ./a.out
 741 ready> def foo(x y) x+foo(y, 4.0);
 742 ready> Parsed an function definition.
 743 ready> def foo(x y) x+y y;
 744 ready> Parsed an function definition.
 745 ready> Parsed a top-level expr
 746 ready> def foo(x y) x+y );
 747 ready> Parsed an function definition.
 748 ready> Error: unknown token when expecting an expression
 749 ready> extern sin(a);
 750 ready> Parsed an extern
 751 ready> ^D
 752 $
 753 </pre>
 754 </div>
 755
 756 <p>There is a lot of room for extension here.  You can define new AST nodes,
 757 extend the language in many ways, etc.  In the <a href="LangImpl3.html">next
 758 installment</a>, we will describe how to generate LLVM IR from the AST.</p>
 759
 760 </div>
 761
 762 <!-- *********************************************************************** -->
 763 <div class="doc_section"><a name="code">Full Code Listing</a></div>
 764 <!-- *********************************************************************** -->
 765
 766 <div class="doc_text">
 767
 768 <p>
 769 Here is the complete code listing for this and the previous chapter.
 770 Note that it is fully self-contained: you don't need LLVM or any external
 771 libraries at all for this (other than the C and C++ standard libraries of
 772 course).  To build this, just compile with:</p>
 773
 774 <div class="doc_code">
 775 <pre>
 776    # Compile
 777    g++ -g toy.cpp
 778    # Run
 779    ./a.out
 780 </pre>
 781 </div>
 782
 783 <p>Here is the code:</p>
 784
 785 <div class="doc_code">
 786 <pre>
 787 #include &lt;cstdio&gt;
 788 #include &lt;string&gt;
 789 #include &lt;map&gt;
 790 #include &lt;vector&gt;
 791
 792 //===----------------------------------------------------------------------===//
 793 // Lexer
 794 //===----------------------------------------------------------------------===//
 795
 796 // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
 797 // of these for known things.
 798 enum Token {
 799   tok_eof = -1,
 800
 801   // commands
 802   tok_def = -2, tok_extern = -3,
 803
 804   // primary
 805   tok_identifier = -4, tok_number = -5,
 806 };
 807
 808 static std::string IdentifierStr;  // Filled in if tok_identifier
 809 static double NumVal;              // Filled in if tok_number
 810
 811 /// gettok - Return the next token from standard input.
 812 static int gettok() {
 813   static int LastChar = ' ';
 814
 815   // Skip any whitespace.
 816   while (isspace(LastChar))
 817     LastChar = getchar();
 818
 819   if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
 820     IdentifierStr = LastChar;
 821     while (isalnum((LastChar = getchar())))
 822       IdentifierStr += LastChar;
 823
 824     if (IdentifierStr == "def") return tok_def;
 825     if (IdentifierStr == "extern") return tok_extern;
 826     return tok_identifier;
 827   }
 828
 829   if (isdigit(LastChar) || LastChar == '.') {   // Number: [0-9.]+
 830     std::string NumStr;
 831     do {
 832       NumStr += LastChar;
 833       LastChar = getchar();
 834     } while (isdigit(LastChar) || LastChar == '.');
 835
 836     NumVal = strtod(NumStr.c_str(), 0);
 837     return tok_number;
 838   }
 839
 840   if (LastChar == '#') {
 841     // Comment until end of line.
 842     do LastChar = getchar();
 843     while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp; LastChar != '\r');
 844
 845     if (LastChar != EOF)
 846       return gettok();
 847   }
 848
 849   // Check for end of file.  Don't eat the EOF.
 850   if (LastChar == EOF)
 851     return tok_eof;
 852
 853   // Otherwise, just return the character as its ascii value.
 854   int ThisChar = LastChar;
 855   LastChar = getchar();
 856   return ThisChar;
 857 }
 858
 859 //===----------------------------------------------------------------------===//
 860 // Abstract Syntax Tree (aka Parse Tree)
 861 //===----------------------------------------------------------------------===//
 862
 863 /// ExprAST - Base class for all expression nodes.
 864 class ExprAST {
 865 public:
 866   virtual ~ExprAST() {}
 867 };
 868
 869 /// NumberExprAST - Expression class for numeric literals like "1.0".
 870 class NumberExprAST : public ExprAST {
 871   double Val;
 872 public:
 873   explicit NumberExprAST(double val) : Val(val) {}
 874 };
 875
 876 /// VariableExprAST - Expression class for referencing a variable, like "a".
 877 class VariableExprAST : public ExprAST {
 878   std::string Name;
 879 public:
 880   explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
 881 };
 882
 883 /// BinaryExprAST - Expression class for a binary operator.
 884 class BinaryExprAST : public ExprAST {
 885   char Op;
 886   ExprAST *LHS, *RHS;
 887 public:
 888   BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
 889     : Op(op), LHS(lhs), RHS(rhs) {}
 890 };
 891
 892 /// CallExprAST - Expression class for function calls.
 893 class CallExprAST : public ExprAST {
 894   std::string Callee;
 895   std::vector&lt;ExprAST*&gt; Args;
 896 public:
 897   CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
 898     : Callee(callee), Args(args) {}
 899 };
 900
 901 /// PrototypeAST - This class represents the "prototype" for a function,
 902 /// which captures its argument names as well as if it is an operator.
 903 class PrototypeAST {
 904   std::string Name;
 905   std::vector&lt; Args;
 906 public:
 907   PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
 908     : Name(name), Args(args) {}
 909
 910 };
 911
 912 /// FunctionAST - This class represents a function definition itself.
 913 class FunctionAST {
 914   PrototypeAST *Proto;
 915   ExprAST *Body;
 916 public:
 917   FunctionAST(PrototypeAST *proto, ExprAST *body)
 918     : Proto(proto), Body(body) {}
 919
 920 };
 921
 922 //===----------------------------------------------------------------------===//
 923 // Parser
 924 //===----------------------------------------------------------------------===//
 925
 926 /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 927 /// token the parser it looking at.  getNextToken reads another token from the
 928 /// lexer and updates CurTok with its results.
 929 static int CurTok;
 930 static int getNextToken() {
 931   return CurTok = gettok();
 932 }
 933
 934 /// BinopPrecedence - This holds the precedence for each binary operator that is
 935 /// defined.
 936 static std::map&lt;char, int&gt; BinopPrecedence;
 937
 938 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
 939 static int GetTokPrecedence() {
 940   if (!isascii(CurTok))
 941     return -1;
 942
 943   // Make sure it's a declared binop.
 944   int TokPrec = BinopPrecedence[CurTok];
 945   if (TokPrec &lt;= 0) return -1;
 946   return TokPrec;
 947 }
 948
 949 /// Error* - These are little helper functions for error handling.
 950 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 951 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 952 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 953
 954 static ExprAST *ParseExpression();
 955
 956 /// identifierexpr
 957 ///   ::= identifer
 958 ///   ::= identifer '(' expression* ')'
 959 static ExprAST *ParseIdentifierExpr() {
 960   std::string IdName = IdentifierStr;
 961
 962   getNextToken();  // eat identifer.
 963
 964   if (CurTok != '(') // Simple variable ref.
 965     return new VariableExprAST(IdName);
 966
 967   // Call.
 968   getNextToken();  // eat (
 969   std::vector&lt;ExprAST*&gt; Args;
 970   while (1) {
 971     ExprAST *Arg = ParseExpression();
 972     if (!Arg) return 0;
 973     Args.push_back(Arg);
 974
 975     if (CurTok == ')') break;
 976
 977     if (CurTok != ',')
 978       return Error("Expected ')'");
 979     getNextToken();
 980   }
 981
 982   // Eat the ')'.
 983   getNextToken();
 984
 985   return new CallExprAST(IdName, Args);
 986 }
 987
 988 /// numberexpr ::= number
 989 static ExprAST *ParseNumberExpr() {
 990   ExprAST *Result = new NumberExprAST(NumVal);
 991   getNextToken(); // consume the number
 992   return Result;
 993 }
 994
 995 /// parenexpr ::= '(' expression ')'
 996 static ExprAST *ParseParenExpr() {
 997   getNextToken();  // eat (.
 998   ExprAST *V = ParseExpression();
 999   if (!V) return 0;
1000
1001   if (CurTok != ')')
1002     return Error("expected ')'");
1003   getNextToken();  // eat ).
1004   return V;
1005 }
1006
1007 /// primary
1008 ///   ::= identifierexpr
1009 ///   ::= numberexpr
1010 ///   ::= parenexpr
1011 static ExprAST *ParsePrimary() {
1012   switch (CurTok) {
1013   default: return Error("unknown token when expecting an expression");
1014   case tok_identifier: return ParseIdentifierExpr();
1015   case tok_number:     return ParseNumberExpr();
1016   case '(':            return ParseParenExpr();
1017   }
1018 }
1019
1020 /// binoprhs
1021 ///   ::= ('+' primary)*
1022 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1023   // If this is a binop, find its precedence.
1024   while (1) {
1025     int TokPrec = GetTokPrecedence();
1026
1027     // If this is a binop that binds at least as tightly as the current binop,
1028     // consume it, otherwise we are done.
1029     if (TokPrec &lt; ExprPrec)
1030       return LHS;
1031
1032     // Okay, we know this is a binop.
1033     int BinOp = CurTok;
1034     getNextToken();  // eat binop
1035
1036     // Parse the primary expression after the binary operator.
1037     ExprAST *RHS = ParsePrimary();
1038     if (!RHS) return 0;
1039
1040     // If BinOp binds less tightly with RHS than the operator after RHS, let
1041     // the pending operator take RHS as its LHS.
1042     int NextPrec = GetTokPrecedence();
1043     if (TokPrec &lt; NextPrec) {
1044       RHS = ParseBinOpRHS(TokPrec+1, RHS);
1045       if (RHS == 0) return 0;
1046     }
1047
1048     // Merge LHS/RHS.
1049     LHS = new BinaryExprAST(BinOp, LHS, RHS);
1050   }
1051 }
1052
1053 /// expression
1054 ///   ::= primary binoprhs
1055 ///
1056 static ExprAST *ParseExpression() {
1057   ExprAST *LHS = ParsePrimary();
1058   if (!LHS) return 0;
1059
1060   return ParseBinOpRHS(0, LHS);
1061 }
1062
1063 /// prototype
1064 ///   ::= id '(' id* ')'
1065 static PrototypeAST *ParsePrototype() {
1066   if (CurTok != tok_identifier)
1067     return ErrorP("Expected function name in prototype");
1068
1069   std::string FnName = IdentifierStr;
1070   getNextToken();
1071
1072   if (CurTok != '(')
1073     return ErrorP("Expected '(' in prototype");
1074
1075   std::vector&lt;std::string&gt; ArgNames;
1076   while (getNextToken() == tok_identifier)
1077     ArgNames.push_back(IdentifierStr);
1078   if (CurTok != ')')
1079     return ErrorP("Expected ')' in prototype");
1080
1081   // success.
1082   getNextToken();  // eat ')'.
1083
1084   return new PrototypeAST(FnName, ArgNames);
1085 }
1086
1087 /// definition ::= 'def' prototype expression
1088 static FunctionAST *ParseDefinition() {
1089   getNextToken();  // eat def.
1090   PrototypeAST *Proto = ParsePrototype();
1091   if (Proto == 0) return 0;
1092
1093   if (ExprAST *E = ParseExpression())
1094     return new FunctionAST(Proto, E);
1095   return 0;
1096 }
1097
1098 /// toplevelexpr ::= expression
1099 static FunctionAST *ParseTopLevelExpr() {
1100   if (ExprAST *E = ParseExpression()) {
1101     // Make an anonymous proto.
1102     PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;());
1103     return new FunctionAST(Proto, E);
1104   }
1105   return 0;
1106 }
1107
1108 /// external ::= 'extern' prototype
1109 static PrototypeAST *ParseExtern() {
1110   getNextToken();  // eat extern.
1111   return ParsePrototype();
1112 }
1113
1114 //===----------------------------------------------------------------------===//
1115 // Top-Level parsing
1116 //===----------------------------------------------------------------------===//
1117
1118 static void HandleDefinition() {
1119   if (FunctionAST *F = ParseDefinition()) {
1120     fprintf(stderr, "Parsed a function definition.\n");
1121   } else {
1122     // Skip token for error recovery.
1123     getNextToken();
1124   }
1125 }
1126
1127 static void HandleExtern() {
1128   if (PrototypeAST *P = ParseExtern()) {
1129     fprintf(stderr, "Parsed an extern\n");
1130   } else {
1131     // Skip token for error recovery.
1132     getNextToken();
1133   }
1134 }
1135
1136 static void HandleTopLevelExpression() {
1137   // Evaluate a top level expression into an anonymous function.
1138   if (FunctionAST *F = ParseTopLevelExpr()) {
1139     fprintf(stderr, "Parsed a top-level expr\n");
1140   } else {
1141     // Skip token for error recovery.
1142     getNextToken();
1143   }
1144 }
1145
1146 /// top ::= definition | external | expression | ';'
1147 static void MainLoop() {
1148   while (1) {
1149     fprintf(stderr, "ready&gt; ");
1150     switch (CurTok) {
1151     case tok_eof:    return;
1152     case ';':        getNextToken(); break;  // ignore top level semicolons.
1153     case tok_def:    HandleDefinition(); break;
1154     case tok_extern: HandleExtern(); break;
1155     default:         HandleTopLevelExpression(); break;
1156     }
1157   }
1158 }
1159
1160 //===----------------------------------------------------------------------===//
1161 // Main driver code.
1162 //===----------------------------------------------------------------------===//
1163
1164 int main() {
1165   // Install standard binary operators.
1166   // 1 is lowest precedence.
1167   BinopPrecedence['&lt;'] = 10;
1168   BinopPrecedence['+'] = 20;
1169   BinopPrecedence['-'] = 20;
1170   BinopPrecedence['*'] = 40;  // highest.
1171
1172   // Prime the first token.
1173   fprintf(stderr, "ready&gt; ");
1174   getNextToken();
1175
1176   MainLoop();
1177   return 0;
1178 }
1179 </pre>
1180 </div>
1181 </div>
1182
1183 <!-- *********************************************************************** -->
1184 <hr>
1185 <address>
1186   <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1187   src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1188   <a href="http://validator.w3.org/check/referer"><img
1189   src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!" /></a>
1190
1191   <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1192   <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
1193   Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
1194 </address>
1195 </body>
1196 </html>