docs/tutorial/LangImpl2.html

   1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   2                       "http://www.w3.org/TR/html4/strict.dtd">
   3
   4 <html>
   5 <head>
   6   <title>Kaleidoscope: Implementing a Parser and AST</title>
   7   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   8   <meta name="author" content="Chris Lattner">
   9   <link rel="stylesheet" href="../llvm.css" type="text/css">
  10 </head>
  11
  12 <body>
  13
  14 <div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
  15
  16 <ul>
  17 <li><a href="index.html">Up to Tutorial Index</a></li>
  18 <li>Chapter 2
  19   <ol>
  20     <li><a href="#intro">Chapter 2 Introduction</a></li>
  21     <li><a href="#ast">The Abstract Syntax Tree (AST)</a></li>
  22     <li><a href="#parserbasics">Parser Basics</a></li>
  23     <li><a href="#parserprimexprs">Basic Expression Parsing</a></li>
  24     <li><a href="#parserbinops">Binary Expression Parsing</a></li>
  25     <li><a href="#parsertop">Parsing the Rest</a></li>
  26     <li><a href="#driver">The Driver</a></li>
  27     <li><a href="#conclusions">Conclusions</a></li>
  28     <li><a href="#code">Full Code Listing</a></li>
  29   </ol>
  30 </li>
  31 <li><a href="LangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li>
  32 </ul>
  33
  34 <div class="doc_author">
  35   <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
  36 </div>
  37
  38 <!-- *********************************************************************** -->
  39 <div class="doc_section"><a name="intro">Chapter 2 Introduction</a></div>
  40 <!-- *********************************************************************** -->
  41
  42 <div class="doc_text">
  43
  44 <p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language
  45 with LLVM</a>" tutorial.  This chapter shows you how to use the lexer, built in
  46 <a href="LangImpl1.html">Chapter 1</a>, to build a full <a
  47 href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
  48 our Kaleidoscope language.  Once we have a parser, we'll define and build an <a
  49 href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
  50 Tree</a> (AST).</p>
  51
  52 <p>The parser we will build uses a combination of <a
  53 href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
  54 Parsing</a> and <a href=
  55 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
  56 Parsing</a> to parse the Kaleidoscope language (the latter for
  57 binary expressions and the former for everything else).  Before we get to
  58 parsing though, lets talk about the output of the parser: the Abstract Syntax
  59 Tree.</p>
  60
  61 </div>
  62
  63 <!-- *********************************************************************** -->
  64 <div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
  65 <!-- *********************************************************************** -->
  66
  67 <div class="doc_text">
  68
  69 <p>The AST for a program captures its behavior in such a way that it is easy for
  70 later stages of the compiler (e.g. code generation) to interpret.  We basically
  71 want one object for each construct in the language, and the AST should closely
  72 model the language.  In Kaleidoscope, we have expressions, a prototype, and a
  73 function object.  We'll start with expressions first:</p>
  74
  75 <div class="doc_code">
  76 <pre>
  77 /// ExprAST - Base class for all expression nodes.
  78 class ExprAST {
  79 public:
  80   virtual ~ExprAST() {}
  81 };
  82
  83 /// NumberExprAST - Expression class for numeric literals like "1.0".
  84 class NumberExprAST : public ExprAST {
  85   double Val;
  86 public:
  87   explicit NumberExprAST(double val) : Val(val) {}
  88 };
  89 </pre>
  90 </div>
  91
  92 <p>The code above shows the definition of the base ExprAST class and one
  93 subclass which we use for numeric literals.  The important thing to note about
  94 this code is that the NumberExprAST class captures the numeric value of the
  95 literal as an instance variable. This allows later phases of the compiler to
  96 know what the stored numeric value is.</p>
  97
  98 <p>Right now we only create the AST,  so there are no useful accessor methods on
  99 them.  It would be very easy to add a virtual method to pretty print the code,
 100 for example.  Here are the other expression AST node definitions that we'll use
 101 in the basic form of the Kaleidoscope language:
 102 </p>
 103
 104 <div class="doc_code">
 105 <pre>
 106 /// VariableExprAST - Expression class for referencing a variable, like "a".
 107 class VariableExprAST : public ExprAST {
 108   std::string Name;
 109 public:
 110   explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
 111 };
 112
 113 /// BinaryExprAST - Expression class for a binary operator.
 114 class BinaryExprAST : public ExprAST {
 115   char Op;
 116   ExprAST *LHS, *RHS;
 117 public:
 118   BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
 119     : Op(op), LHS(lhs), RHS(rhs) {}
 120 };
 121
 122 /// CallExprAST - Expression class for function calls.
 123 class CallExprAST : public ExprAST {
 124   std::string Callee;
 125   std::vector&lt;ExprAST*&gt; Args;
 126 public:
 127   CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
 128     : Callee(callee), Args(args) {}
 129 };
 130 </pre>
 131 </div>
 132
 133 <p>This is all (intentionally) rather straight-forward: variables capture the
 134 variable name, binary operators capture their opcode (e.g. '+'), and calls
 135 capture a function name as well as a list of any argument expressions.  One thing
 136 that is nice about our AST is that it captures the language features without
 137 talking about the syntax of the language.  Note that there is no discussion about
 138 precedence of binary operators, lexical structure, etc.</p>
 139
 140 <p>For our basic language, these are all of the expression nodes we'll define.
 141 Because it doesn't have conditional control flow, it isn't Turing-complete;
 142 we'll fix that in a later installment.  The two things we need next are a way
 143 to talk about the interface to a function, and a way to talk about functions
 144 themselves:</p>
 145
 146 <div class="doc_code">
 147 <pre>
 148 /// PrototypeAST - This class represents the "prototype" for a function,
 149 /// which captures its name, and its argument names (thus implicitly the number
 150 /// of arguments the function takes).
 151 class PrototypeAST {
 152   std::string Name;
 153   std::vector&lt;std::string&gt; Args;
 154 public:
 155   PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
 156     : Name(name), Args(args) {}
 157 };
 158
 159 /// FunctionAST - This class represents a function definition itself.
 160 class FunctionAST {
 161   PrototypeAST *Proto;
 162   ExprAST *Body;
 163 public:
 164   FunctionAST(PrototypeAST *proto, ExprAST *body)
 165     : Proto(proto), Body(body) {}
 166 };
 167 </pre>
 168 </div>
 169
 170 <p>In Kaleidoscope, functions are typed with just a count of their arguments.
 171 Since all values are double precision floating point, the type of each argument
 172 doesn't need to be stored anywhere.  In a more aggressive and realistic
 173 language, the "ExprAST" class would probably have a type field.</p>
 174
 175 <p>With this scaffolding, we can now talk about parsing expressions and function
 176 bodies in Kaleidoscope.</p>
 177
 178 </div>
 179
 180 <!-- *********************************************************************** -->
 181 <div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
 182 <!-- *********************************************************************** -->
 183
 184 <div class="doc_text">
 185
 186 <p>Now that we have an AST to build, we need to define the parser code to build
 187 it.  The idea here is that we want to parse something like "x+y" (which is
 188 returned as three tokens by the lexer) into an AST that could be generated with
 189 calls like this:</p>
 190
 191 <div class="doc_code">
 192 <pre>
 193   ExprAST *X = new VariableExprAST("x");
 194   ExprAST *Y = new VariableExprAST("y");
 195   ExprAST *Result = new BinaryExprAST('+', X, Y);
 196 </pre>
 197 </div>
 198
 199 <p>In order to do this, we'll start by defining some basic helper routines:</p>
 200
 201 <div class="doc_code">
 202 <pre>
 203 /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 204 /// token the parser is looking at.  getNextToken reads another token from the
 205 /// lexer and updates CurTok with its results.
 206 static int CurTok;
 207 static int getNextToken() {
 208   return CurTok = gettok();
 209 }
 210 </pre>
 211 </div>
 212
 213 <p>
 214 This implements a simple token buffer around the lexer.  This allows
 215 us to look one token ahead at what the lexer is returning.  Every function in
 216 our parser will assume that CurTok is the current token that needs to be
 217 parsed.</p>
 218
 219 <div class="doc_code">
 220 <pre>
 221
 222 /// Error* - These are little helper functions for error handling.
 223 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 224 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 225 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 226 </pre>
 227 </div>
 228
 229 <p>
 230 The <tt>Error</tt> routines are simple helper routines that our parser will use
 231 to handle errors.  The error recovery in our parser will not be the best and
 232 is not particular user-friendly, but it will be enough for our tutorial.  These
 233 routines make it easier to handle errors in routines that have various return
 234 types: they always return null.</p>
 235
 236 <p>With these basic helper functions, we can implement the first
 237 piece of our grammar: numeric literals.</p>
 238
 239 </div>
 240
 241 <!-- *********************************************************************** -->
 242 <div class="doc_section"><a name="parserprimexprs">Basic Expression
 243  Parsing</a></div>
 244 <!-- *********************************************************************** -->
 245
 246 <div class="doc_text">
 247
 248 <p>We start with numeric literals, because they are the simplest to process.
 249 For each production in our grammar, we'll define a function which parses that
 250 production.  For numeric literals, we have:
 251 </p>
 252
 253 <div class="doc_code">
 254 <pre>
 255 /// numberexpr ::= number
 256 static ExprAST *ParseNumberExpr() {
 257   ExprAST *Result = new NumberExprAST(NumVal);
 258   getNextToken(); // consume the number
 259   return Result;
 260 }
 261 </pre>
 262 </div>
 263
 264 <p>This routine is very simple: it expects to be called when the current token
 265 is a <tt>tok_number</tt> token.  It takes the current number value, creates
 266 a <tt>NumberExprAST</tt> node, advances the lexer to the next token, and finally
 267 returns.</p>
 268
 269 <p>There are some interesting aspects to this.  The most important one is that
 270 this routine eats all of the tokens that correspond to the production and
 271 returns the lexer buffer with the next token (which is not part of the grammar
 272 production) ready to go.  This is a fairly standard way to go for recursive
 273 descent parsers.  For a better example, the parenthesis operator is defined like
 274 this:</p>
 275
 276 <div class="doc_code">
 277 <pre>
 278 /// parenexpr ::= '(' expression ')'
 279 static ExprAST *ParseParenExpr() {
 280   getNextToken();  // eat (.
 281   ExprAST *V = ParseExpression();
 282   if (!V) return 0;
 283
 284   if (CurTok != ')')
 285     return Error("expected ')'");
 286   getNextToken();  // eat ).
 287   return V;
 288 }
 289 </pre>
 290 </div>
 291
 292 <p>This function illustrates a number of interesting things about the
 293 parser:</p>
 294
 295 <p>
 296 1) It shows how we use the Error routines.  When called, this function expects
 297 that the current token is a '(' token, but after parsing the subexpression, it
 298 is possible that there is no ')' waiting.  For example, if the user types in
 299 "(4 x" instead of "(4)", the parser should emit an error.  Because errors can
 300 occur, the parser needs a way to indicate that they happened: in our parser, we
 301 return null on an error.</p>
 302
 303 <p>2) Another interesting aspect of this function is that it uses recursion by
 304 calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
 305 <tt>ParseParenExpr</tt>).  This is powerful because it allows us to handle
 306 recursive grammars, and keeps each production very simple.  Note that
 307 parentheses do not cause construction of AST nodes themselves.  While we could
 308 do it this way, the most important role of parentheses are to guide the parser
 309 and provide grouping.  Once the parser constructs the AST, parentheses are not
 310 needed.</p>
 311
 312 <p>The next simple production is for handling variable references and function
 313 calls:</p>
 314
 315 <div class="doc_code">
 316 <pre>
 317 /// identifierexpr
 318 ///   ::= identifier
 319 ///   ::= identifier '(' expression* ')'
 320 static ExprAST *ParseIdentifierExpr() {
 321   std::string IdName = IdentifierStr;
 322
 323   getNextToken();  // eat identifier.
 324
 325   if (CurTok != '(') // Simple variable ref.
 326     return new VariableExprAST(IdName);
 327
 328   // Call.
 329   getNextToken();  // eat (
 330   std::vector&lt;ExprAST*&gt; Args;
 331   if (CurTok != ')') {
 332     while (1) {
 333       ExprAST *Arg = ParseExpression();
 334       if (!Arg) return 0;
 335       Args.push_back(Arg);
 336
 337       if (CurTok == ')') break;
 338
 339       if (CurTok != ',')
 340         return Error("Expected ')' or ',' in argument list");
 341       getNextToken();
 342     }
 343   }
 344
 345   // Eat the ')'.
 346   getNextToken();
 347
 348   return new CallExprAST(IdName, Args);
 349 }
 350 </pre>
 351 </div>
 352
 353 <p>This routine follows the same style as the other routines.  (It expects to be
 354 called if the current token is a <tt>tok_identifier</tt> token).  It also has
 355 recursion and error handling.  One interesting aspect of this is that it uses
 356 <em>look-ahead</em> to determine if the current identifier is a stand alone
 357 variable reference or if it is a function call expression.  It handles this by
 358 checking to see if the token after the identifier is a '(' token, constructing
 359 either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
 360 </p>
 361
 362 <p>Now that we have all of our simple expression-parsing logic in place, we can
 363 define a helper function to wrap it together into one entry point.  We call this
 364 class of expressions "primary" expressions, for reasons that will become more
 365 clear <a href="LangImpl6.html#unary">later in the tutorial</a>.  In order to
 366 parse an arbitrary primary expression, we need to determine what sort of
 367 expression it is:</p>
 368
 369 <div class="doc_code">
 370 <pre>
 371 /// primary
 372 ///   ::= identifierexpr
 373 ///   ::= numberexpr
 374 ///   ::= parenexpr
 375 static ExprAST *ParsePrimary() {
 376   switch (CurTok) {
 377   default: return Error("unknown token when expecting an expression");
 378   case tok_identifier: return ParseIdentifierExpr();
 379   case tok_number:     return ParseNumberExpr();
 380   case '(':            return ParseParenExpr();
 381   }
 382 }
 383 </pre>
 384 </div>
 385
 386 <p>Now that you see the definition of this function, it is more obvious why we
 387 can assume the state of CurTok in the various functions.  This uses look-ahead
 388 to determine which sort of expression is being inspected, and then parses it
 389 with a function call.</p>
 390
 391 <p>Now that basic expressions are handled, we need to handle binary expressions.
 392 They are a bit more complex.</p>
 393
 394 </div>
 395
 396 <!-- *********************************************************************** -->
 397 <div class="doc_section"><a name="parserbinops">Binary Expression
 398  Parsing</a></div>
 399 <!-- *********************************************************************** -->
 400
 401 <div class="doc_text">
 402
 403 <p>Binary expressions are significantly harder to parse because they are often
 404 ambiguous.  For example, when given the string "x+y*z", the parser can choose
 405 to parse it as either "(x+y)*z" or "x+(y*z)".  With common definitions from
 406 mathematics, we expect the later parse, because "*" (multiplication) has
 407 higher <em>precedence</em> than "+" (addition).</p>
 408
 409 <p>There are many ways to handle this, but an elegant and efficient way is to
 410 use <a href=
 411 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
 412 Parsing</a>.  This parsing technique uses the precedence of binary operators to
 413 guide recursion.  To start with, we need a table of precedences:</p>
 414
 415 <div class="doc_code">
 416 <pre>
 417 /// BinopPrecedence - This holds the precedence for each binary operator that is
 418 /// defined.
 419 static std::map&lt;char, int&gt; BinopPrecedence;
 420
 421 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
 422 static int GetTokPrecedence() {
 423   if (!isascii(CurTok))
 424     return -1;
 425
 426   // Make sure it's a declared binop.
 427   int TokPrec = BinopPrecedence[CurTok];
 428   if (TokPrec &lt;= 0) return -1;
 429   return TokPrec;
 430 }
 431
 432 int main() {
 433   // Install standard binary operators.
 434   // 1 is lowest precedence.
 435   BinopPrecedence['&lt;'] = 10;
 436   BinopPrecedence['+'] = 20;
 437   BinopPrecedence['-'] = 20;
 438   BinopPrecedence['*'] = 40;  // highest.
 439   ...
 440 }
 441 </pre>
 442 </div>
 443
 444 <p>For the basic form of Kaleidoscope, we will only support 4 binary operators
 445 (this can obviously be extended by you, our brave and intrepid reader).  The
 446 <tt>GetTokPrecedence</tt> function returns the precedence for the current token,
 447 or -1 if the token is not a binary operator.  Having a map makes it easy to add
 448 new operators and makes it clear that the algorithm doesn't depend on the
 449 specific operators involved, but it would be easy enough to eliminate the map
 450 and do the comparisons in the <tt>GetTokPrecedence</tt> function.  (Or just use
 451 a fixed-size array).</p>
 452
 453 <p>With the helper above defined, we can now start parsing binary expressions.
 454 The basic idea of operator precedence parsing is to break down an expression
 455 with potentially ambiguous binary operators into pieces.  Consider ,for example,
 456 the expression "a+b+(c+d)*e*f+g".  Operator precedence parsing considers this
 457 as a stream of primary expressions separated by binary operators.  As such,
 458 it will first parse the leading primary expression "a", then it will see the
 459 pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g].  Note that because parentheses
 460 are primary expressions, the binary expression parser doesn't need to worry
 461 about nested subexpressions like (c+d) at all.
 462 </p>
 463
 464 <p>
 465 To start, an expression is a primary expression potentially followed by a
 466 sequence of [binop,primaryexpr] pairs:</p>
 467
 468 <div class="doc_code">
 469 <pre>
 470 /// expression
 471 ///   ::= primary binoprhs
 472 ///
 473 static ExprAST *ParseExpression() {
 474   ExprAST *LHS = ParsePrimary();
 475   if (!LHS) return 0;
 476
 477   return ParseBinOpRHS(0, LHS);
 478 }
 479 </pre>
 480 </div>
 481
 482 <p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
 483 us.  It takes a precedence and a pointer to an expression for the part that has been
 484 parsed so far.   Note that "x" is a perfectly valid expression: As such, "binoprhs" is
 485 allowed to be empty, in which case it returns the expression that is passed into
 486 it. In our example above, the code passes the expression for "a" into
 487 <tt>ParseBinOpRHS</tt> and the current token is "+".</p>
 488
 489 <p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
 490 minimal operator precedence</em> that the function is allowed to eat.  For
 491 example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
 492 passed in a precedence of 40, it will not consume any tokens (because the
 493 precedence of '+' is only 20).  With this in mind, <tt>ParseBinOpRHS</tt> starts
 494 with:</p>
 495
 496 <div class="doc_code">
 497 <pre>
 498 /// binoprhs
 499 ///   ::= ('+' primary)*
 500 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
 501   // If this is a binop, find its precedence.
 502   while (1) {
 503     int TokPrec = GetTokPrecedence();
 504
 505     // If this is a binop that binds at least as tightly as the current binop,
 506     // consume it, otherwise we are done.
 507     if (TokPrec &lt; ExprPrec)
 508       return LHS;
 509 </pre>
 510 </div>
 511
 512 <p>This code gets the precedence of the current token and checks to see if if is
 513 too low.  Because we defined invalid tokens to have a precedence of -1, this
 514 check implicitly knows that the pair-stream ends when the token stream runs out
 515 of binary operators.  If this check succeeds, we know that the token is a binary
 516 operator and that it will be included in this expression:</p>
 517
 518 <div class="doc_code">
 519 <pre>
 520     // Okay, we know this is a binop.
 521     int BinOp = CurTok;
 522     getNextToken();  // eat binop
 523
 524     // Parse the primary expression after the binary operator.
 525     ExprAST *RHS = ParsePrimary();
 526     if (!RHS) return 0;
 527 </pre>
 528 </div>
 529
 530 <p>As such, this code eats (and remembers) the binary operator and then parses
 531 the primary expression that follows.  This builds up the whole pair, the first of
 532 which is [+, b] for the running example.</p>
 533
 534 <p>Now that we parsed the left-hand side of an expression and one pair of the
 535 RHS sequence, we have to decide which way the expression associates.  In
 536 particular, we could have "(a+b) binop unparsed"  or "a + (b binop unparsed)".
 537 To determine this, we look ahead at "binop" to determine its precedence and
 538 compare it to BinOp's precedence (which is '+' in this case):</p>
 539
 540 <div class="doc_code">
 541 <pre>
 542     // If BinOp binds less tightly with RHS than the operator after RHS, let
 543     // the pending operator take RHS as its LHS.
 544     int NextPrec = GetTokPrecedence();
 545     if (TokPrec &lt; NextPrec) {
 546 </pre>
 547 </div>
 548
 549 <p>If the precedence of the binop to the right of "RHS" is lower or equal to the
 550 precedence of our current operator, then we know that the parentheses associate
 551 as "(a+b) binop ...".  In our example, the current operator is "+" and the next
 552 operator is "+", we know that they have the same precedence.  In this case we'll
 553 create the AST node for "a+b", and then continue parsing:</p>
 554
 555 <div class="doc_code">
 556 <pre>
 557       ... if body omitted ...
 558     }
 559
 560     // Merge LHS/RHS.
 561     LHS = new BinaryExprAST(BinOp, LHS, RHS);
 562   }  // loop around to the top of the while loop.
 563 }
 564 </pre>
 565 </div>
 566
 567 <p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
 568 iteration of the loop, with "+" as the current token.  The code above will eat,
 569 remember, and parse "(c+d)" as the primary expression, which makes the
 570 current pair equal to [+, (c+d)].  It will then evaluate the 'if' conditional above with
 571 "*" as the binop to the right of the primary.  In this case, the precedence of "*" is
 572 higher than the precedence of "+" so the if condition will be entered.</p>
 573
 574 <p>The critical question left here is "how can the if condition parse the right
 575 hand side in full"?  In particular, to build the AST correctly for our example,
 576 it needs to get all of "(c+d)*e*f" as the RHS expression variable.  The code to
 577 do this is surprisingly simple (code from the above two blocks duplicated for
 578 context):</p>
 579
 580 <div class="doc_code">
 581 <pre>
 582     // If BinOp binds less tightly with RHS than the operator after RHS, let
 583     // the pending operator take RHS as its LHS.
 584     int NextPrec = GetTokPrecedence();
 585     if (TokPrec &lt; NextPrec) {
 586       <b>RHS = ParseBinOpRHS(TokPrec+1, RHS);
 587       if (RHS == 0) return 0;</b>
 588     }
 589     // Merge LHS/RHS.
 590     LHS = new BinaryExprAST(BinOp, LHS, RHS);
 591   }  // loop around to the top of the while loop.
 592 }
 593 </pre>
 594 </div>
 595
 596 <p>At this point, we know that the binary operator to the RHS of our primary
 597 has higher precedence than the binop we are currently parsing.  As such, we know
 598 that any sequence of pairs whose operators are all higher precedence than "+"
 599 should be parsed together and returned as "RHS".  To do this, we recursively
 600 invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
 601 precedence required for it to continue.  In our example above, this will cause
 602 it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
 603 of the '+' expression.</p>
 604
 605 <p>Finally, on the next iteration of the while loop, the "+g" piece is parsed
 606 and added to the AST.  With this little bit of code (14 non-trivial lines), we
 607 correctly handle fully general binary expression parsing in a very elegant way.
 608 This was a whirlwind tour of this code, and it is somewhat subtle.  I recommend
 609 running through it with a few tough examples to see how it works.
 610 </p>
 611
 612 <p>This wraps up handling of expressions.  At this point, we can point the
 613 parser at an arbitrary token stream and build an expression from it, stopping
 614 at the first token that is not part of the expression.  Next up we need to
 615 handle function definitions, etc.</p>
 616
 617 </div>
 618
 619 <!-- *********************************************************************** -->
 620 <div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
 621 <!-- *********************************************************************** -->
 622
 623 <div class="doc_text">
 624
 625 <p>
 626 The next thing missing is handling of function prototypes.  In Kaleidoscope,
 627 these are used both for 'extern' function declarations as well as function body
 628 definitions.  The code to do this is straight-forward and not very interesting
 629 (once you've survived expressions):
 630 </p>
 631
 632 <div class="doc_code">
 633 <pre>
 634 /// prototype
 635 ///   ::= id '(' id* ')'
 636 static PrototypeAST *ParsePrototype() {
 637   if (CurTok != tok_identifier)
 638     return ErrorP("Expected function name in prototype");
 639
 640   std::string FnName = IdentifierStr;
 641   getNextToken();
 642
 643   if (CurTok != '(')
 644     return ErrorP("Expected '(' in prototype");
 645
 646   // Read the list of argument names.
 647   std::vector&lt;std::string&gt; ArgNames;
 648   while (getNextToken() == tok_identifier)
 649     ArgNames.push_back(IdentifierStr);
 650   if (CurTok != ')')
 651     return ErrorP("Expected ')' in prototype");
 652
 653   // success.
 654   getNextToken();  // eat ')'.
 655
 656   return new PrototypeAST(FnName, ArgNames);
 657 }
 658 </pre>
 659 </div>
 660
 661 <p>Given this, a function definition is very simple, just a prototype plus
 662 an expression to implement the body:</p>
 663
 664 <div class="doc_code">
 665 <pre>
 666 /// definition ::= 'def' prototype expression
 667 static FunctionAST *ParseDefinition() {
 668   getNextToken();  // eat def.
 669   PrototypeAST *Proto = ParsePrototype();
 670   if (Proto == 0) return 0;
 671
 672   if (ExprAST *E = ParseExpression())
 673     return new FunctionAST(Proto, E);
 674   return 0;
 675 }
 676 </pre>
 677 </div>
 678
 679 <p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
 680 well as to support forward declaration of user functions.  These 'extern's are just
 681 prototypes with no body:</p>
 682
 683 <div class="doc_code">
 684 <pre>
 685 /// external ::= 'extern' prototype
 686 static PrototypeAST *ParseExtern() {
 687   getNextToken();  // eat extern.
 688   return ParsePrototype();
 689 }
 690 </pre>
 691 </div>
 692
 693 <p>Finally, we'll also let the user type in arbitrary top-level expressions and
 694 evaluate them on the fly.  We will handle this by defining anonymous nullary
 695 (zero argument) functions for them:</p>
 696
 697 <div class="doc_code">
 698 <pre>
 699 /// toplevelexpr ::= expression
 700 static FunctionAST *ParseTopLevelExpr() {
 701   if (ExprAST *E = ParseExpression()) {
 702     // Make an anonymous proto.
 703     PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
 704     return new FunctionAST(Proto, E);
 705   }
 706   return 0;
 707 }
 708 </pre>
 709 </div>
 710
 711 <p>Now that we have all the pieces, let's build a little driver that will let us
 712 actually <em>execute</em> this code we've built!</p>
 713
 714 </div>
 715
 716 <!-- *********************************************************************** -->
 717 <div class="doc_section"><a name="driver">The Driver</a></div>
 718 <!-- *********************************************************************** -->
 719
 720 <div class="doc_text">
 721
 722 <p>The driver for this simply invokes all of the parsing pieces with a top-level
 723 dispatch loop.  There isn't much interesting here, so I'll just include the
 724 top-level loop.  See <a href="#code">below</a> for full code in the "Top-Level
 725 Parsing" section.</p>
 726
 727 <div class="doc_code">
 728 <pre>
 729 /// top ::= definition | external | expression | ';'
 730 static void MainLoop() {
 731   while (1) {
 732     fprintf(stderr, "ready&gt; ");
 733     switch (CurTok) {
 734     case tok_eof:    return;
 735     case ';':        getNextToken(); break;  // ignore top-level semicolons.
 736     case tok_def:    HandleDefinition(); break;
 737     case tok_extern: HandleExtern(); break;
 738     default:         HandleTopLevelExpression(); break;
 739     }
 740   }
 741 }
 742 </pre>
 743 </div>
 744
 745 <p>The most interesting part of this is that we ignore top-level semicolons.
 746 Why is this, you ask?  The basic reason is that if you type "4 + 5" at the
 747 command line, the parser doesn't know whether that is the end of what you will type
 748 or not.  For example, on the next line you could type "def foo..." in which case
 749 4+5 is the end of a top-level expression.  Alternatively you could type "* 6",
 750 which would continue the expression.  Having top-level semicolons allows you to
 751 type "4+5;", and the parser will know you are done.</p>
 752
 753 </div>
 754
 755 <!-- *********************************************************************** -->
 756 <div class="doc_section"><a name="conclusions">Conclusions</a></div>
 757 <!-- *********************************************************************** -->
 758
 759 <div class="doc_text">
 760
 761 <p>With just under 400 lines of commented code (240 lines of non-comment,
 762 non-blank code), we fully defined our minimal language, including a lexer,
 763 parser, and AST builder.  With this done, the executable will validate
 764 Kaleidoscope code and tell us if it is grammatically invalid.  For
 765 example, here is a sample interaction:</p>
 766
 767 <div class="doc_code">
 768 <pre>
 769 $ <b>./a.out</b>
 770 ready&gt; <b>def foo(x y) x+foo(y, 4.0);</b>
 771 Parsed a function definition.
 772 ready&gt; <b>def foo(x y) x+y y;</b>
 773 Parsed a function definition.
 774 Parsed a top-level expr
 775 ready&gt; <b>def foo(x y) x+y );</b>
 776 Parsed a function definition.
 777 Error: unknown token when expecting an expression
 778 ready&gt; <b>extern sin(a);</b>
 779 ready&gt; Parsed an extern
 780 ready&gt; <b>^D</b>
 781 $
 782 </pre>
 783 </div>
 784
 785 <p>There is a lot of room for extension here.  You can define new AST nodes,
 786 extend the language in many ways, etc.  In the <a href="LangImpl3.html">next
 787 installment</a>, we will describe how to generate LLVM Intermediate
 788 Representation (IR) from the AST.</p>
 789
 790 </div>
 791
 792 <!-- *********************************************************************** -->
 793 <div class="doc_section"><a name="code">Full Code Listing</a></div>
 794 <!-- *********************************************************************** -->
 795
 796 <div class="doc_text">
 797
 798 <p>
 799 Here is the complete code listing for this and the previous chapter.
 800 Note that it is fully self-contained: you don't need LLVM or any external
 801 libraries at all for this.  (Besides the C and C++ standard libraries, of
 802 course.)  To build this, just compile with:</p>
 803
 804 <div class="doc_code">
 805 <pre>
 806    # Compile
 807    g++ -g -O3 toy.cpp
 808    # Run
 809    ./a.out
 810 </pre>
 811 </div>
 812
 813 <p>Here is the code:</p>
 814
 815 <div class="doc_code">
 816 <pre>
 817 #include &lt;cstdio&gt;
 818 #include &lt;cstdlib&gt;
 819 #include &lt;string&gt;
 820 #include &lt;map&gt;
 821 #include &lt;vector&gt;
 822
 823 //===----------------------------------------------------------------------===//
 824 // Lexer
 825 //===----------------------------------------------------------------------===//
 826
 827 // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
 828 // of these for known things.
 829 enum Token {
 830   tok_eof = -1,
 831
 832   // commands
 833   tok_def = -2, tok_extern = -3,
 834
 835   // primary
 836   tok_identifier = -4, tok_number = -5,
 837 };
 838
 839 static std::string IdentifierStr;  // Filled in if tok_identifier
 840 static double NumVal;              // Filled in if tok_number
 841
 842 /// gettok - Return the next token from standard input.
 843 static int gettok() {
 844   static int LastChar = ' ';
 845
 846   // Skip any whitespace.
 847   while (isspace(LastChar))
 848     LastChar = getchar();
 849
 850   if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
 851     IdentifierStr = LastChar;
 852     while (isalnum((LastChar = getchar())))
 853       IdentifierStr += LastChar;
 854
 855     if (IdentifierStr == "def") return tok_def;
 856     if (IdentifierStr == "extern") return tok_extern;
 857     return tok_identifier;
 858   }
 859
 860   if (isdigit(LastChar) || LastChar == '.') {   // Number: [0-9.]+
 861     std::string NumStr;
 862     do {
 863       NumStr += LastChar;
 864       LastChar = getchar();
 865     } while (isdigit(LastChar) || LastChar == '.');
 866
 867     NumVal = strtod(NumStr.c_str(), 0);
 868     return tok_number;
 869   }
 870
 871   if (LastChar == '#') {
 872     // Comment until end of line.
 873     do LastChar = getchar();
 874     while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp;&amp; LastChar != '\r');
 875
 876     if (LastChar != EOF)
 877       return gettok();
 878   }
 879
 880   // Check for end of file.  Don't eat the EOF.
 881   if (LastChar == EOF)
 882     return tok_eof;
 883
 884   // Otherwise, just return the character as its ascii value.
 885   int ThisChar = LastChar;
 886   LastChar = getchar();
 887   return ThisChar;
 888 }
 889
 890 //===----------------------------------------------------------------------===//
 891 // Abstract Syntax Tree (aka Parse Tree)
 892 //===----------------------------------------------------------------------===//
 893
 894 /// ExprAST - Base class for all expression nodes.
 895 class ExprAST {
 896 public:
 897   virtual ~ExprAST() {}
 898 };
 899
 900 /// NumberExprAST - Expression class for numeric literals like "1.0".
 901 class NumberExprAST : public ExprAST {
 902   double Val;
 903 public:
 904   explicit NumberExprAST(double val) : Val(val) {}
 905 };
 906
 907 /// VariableExprAST - Expression class for referencing a variable, like "a".
 908 class VariableExprAST : public ExprAST {
 909   std::string Name;
 910 public:
 911   explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
 912 };
 913
 914 /// BinaryExprAST - Expression class for a binary operator.
 915 class BinaryExprAST : public ExprAST {
 916   char Op;
 917   ExprAST *LHS, *RHS;
 918 public:
 919   BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
 920     : Op(op), LHS(lhs), RHS(rhs) {}
 921 };
 922
 923 /// CallExprAST - Expression class for function calls.
 924 class CallExprAST : public ExprAST {
 925   std::string Callee;
 926   std::vector&lt;ExprAST*&gt; Args;
 927 public:
 928   CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
 929     : Callee(callee), Args(args) {}
 930 };
 931
 932 /// PrototypeAST - This class represents the "prototype" for a function,
 933 /// which captures its name, and its argument names (thus implicitly the number
 934 /// of arguments the function takes).
 935 class PrototypeAST {
 936   std::string Name;
 937   std::vector&lt;std::string&gt; Args;
 938 public:
 939   PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
 940     : Name(name), Args(args) {}
 941
 942 };
 943
 944 /// FunctionAST - This class represents a function definition itself.
 945 class FunctionAST {
 946   PrototypeAST *Proto;
 947   ExprAST *Body;
 948 public:
 949   FunctionAST(PrototypeAST *proto, ExprAST *body)
 950     : Proto(proto), Body(body) {}
 951
 952 };
 953
 954 //===----------------------------------------------------------------------===//
 955 // Parser
 956 //===----------------------------------------------------------------------===//
 957
 958 /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 959 /// token the parser is looking at.  getNextToken reads another token from the
 960 /// lexer and updates CurTok with its results.
 961 static int CurTok;
 962 static int getNextToken() {
 963   return CurTok = gettok();
 964 }
 965
 966 /// BinopPrecedence - This holds the precedence for each binary operator that is
 967 /// defined.
 968 static std::map&lt;char, int&gt; BinopPrecedence;
 969
 970 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
 971 static int GetTokPrecedence() {
 972   if (!isascii(CurTok))
 973     return -1;
 974
 975   // Make sure it's a declared binop.
 976   int TokPrec = BinopPrecedence[CurTok];
 977   if (TokPrec &lt;= 0) return -1;
 978   return TokPrec;
 979 }
 980
 981 /// Error* - These are little helper functions for error handling.
 982 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 983 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 984 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 985
 986 static ExprAST *ParseExpression();
 987
 988 /// identifierexpr
 989 ///   ::= identifier
 990 ///   ::= identifier '(' expression* ')'
 991 static ExprAST *ParseIdentifierExpr() {
 992   std::string IdName = IdentifierStr;
 993
 994   getNextToken();  // eat identifier.
 995
 996   if (CurTok != '(') // Simple variable ref.
 997     return new VariableExprAST(IdName);
 998
 999   // Call.
1000   getNextToken();  // eat (
1001   std::vector&lt;ExprAST*&gt; Args;
1002   if (CurTok != ')') {
1003     while (1) {
1004       ExprAST *Arg = ParseExpression();
1005       if (!Arg) return 0;
1006       Args.push_back(Arg);
1007
1008       if (CurTok == ')') break;
1009
1010       if (CurTok != ',')
1011         return Error("Expected ')' or ',' in argument list");
1012       getNextToken();
1013     }
1014   }
1015
1016   // Eat the ')'.
1017   getNextToken();
1018
1019   return new CallExprAST(IdName, Args);
1020 }
1021
1022 /// numberexpr ::= number
1023 static ExprAST *ParseNumberExpr() {
1024   ExprAST *Result = new NumberExprAST(NumVal);
1025   getNextToken(); // consume the number
1026   return Result;
1027 }
1028
1029 /// parenexpr ::= '(' expression ')'
1030 static ExprAST *ParseParenExpr() {
1031   getNextToken();  // eat (.
1032   ExprAST *V = ParseExpression();
1033   if (!V) return 0;
1034
1035   if (CurTok != ')')
1036     return Error("expected ')'");
1037   getNextToken();  // eat ).
1038   return V;
1039 }
1040
1041 /// primary
1042 ///   ::= identifierexpr
1043 ///   ::= numberexpr
1044 ///   ::= parenexpr
1045 static ExprAST *ParsePrimary() {
1046   switch (CurTok) {
1047   default: return Error("unknown token when expecting an expression");
1048   case tok_identifier: return ParseIdentifierExpr();
1049   case tok_number:     return ParseNumberExpr();
1050   case '(':            return ParseParenExpr();
1051   }
1052 }
1053
1054 /// binoprhs
1055 ///   ::= ('+' primary)*
1056 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1057   // If this is a binop, find its precedence.
1058   while (1) {
1059     int TokPrec = GetTokPrecedence();
1060
1061     // If this is a binop that binds at least as tightly as the current binop,
1062     // consume it, otherwise we are done.
1063     if (TokPrec &lt; ExprPrec)
1064       return LHS;
1065
1066     // Okay, we know this is a binop.
1067     int BinOp = CurTok;
1068     getNextToken();  // eat binop
1069
1070     // Parse the primary expression after the binary operator.
1071     ExprAST *RHS = ParsePrimary();
1072     if (!RHS) return 0;
1073
1074     // If BinOp binds less tightly with RHS than the operator after RHS, let
1075     // the pending operator take RHS as its LHS.
1076     int NextPrec = GetTokPrecedence();
1077     if (TokPrec &lt; NextPrec) {
1078       RHS = ParseBinOpRHS(TokPrec+1, RHS);
1079       if (RHS == 0) return 0;
1080     }
1081
1082     // Merge LHS/RHS.
1083     LHS = new BinaryExprAST(BinOp, LHS, RHS);
1084   }
1085 }
1086
1087 /// expression
1088 ///   ::= primary binoprhs
1089 ///
1090 static ExprAST *ParseExpression() {
1091   ExprAST *LHS = ParsePrimary();
1092   if (!LHS) return 0;
1093
1094   return ParseBinOpRHS(0, LHS);
1095 }
1096
1097 /// prototype
1098 ///   ::= id '(' id* ')'
1099 static PrototypeAST *ParsePrototype() {
1100   if (CurTok != tok_identifier)
1101     return ErrorP("Expected function name in prototype");
1102
1103   std::string FnName = IdentifierStr;
1104   getNextToken();
1105
1106   if (CurTok != '(')
1107     return ErrorP("Expected '(' in prototype");
1108
1109   std::vector&lt;std::string&gt; ArgNames;
1110   while (getNextToken() == tok_identifier)
1111     ArgNames.push_back(IdentifierStr);
1112   if (CurTok != ')')
1113     return ErrorP("Expected ')' in prototype");
1114
1115   // success.
1116   getNextToken();  // eat ')'.
1117
1118   return new PrototypeAST(FnName, ArgNames);
1119 }
1120
1121 /// definition ::= 'def' prototype expression
1122 static FunctionAST *ParseDefinition() {
1123   getNextToken();  // eat def.
1124   PrototypeAST *Proto = ParsePrototype();
1125   if (Proto == 0) return 0;
1126
1127   if (ExprAST *E = ParseExpression())
1128     return new FunctionAST(Proto, E);
1129   return 0;
1130 }
1131
1132 /// toplevelexpr ::= expression
1133 static FunctionAST *ParseTopLevelExpr() {
1134   if (ExprAST *E = ParseExpression()) {
1135     // Make an anonymous proto.
1136     PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
1137     return new FunctionAST(Proto, E);
1138   }
1139   return 0;
1140 }
1141
1142 /// external ::= 'extern' prototype
1143 static PrototypeAST *ParseExtern() {
1144   getNextToken();  // eat extern.
1145   return ParsePrototype();
1146 }
1147
1148 //===----------------------------------------------------------------------===//
1149 // Top-Level parsing
1150 //===----------------------------------------------------------------------===//
1151
1152 static void HandleDefinition() {
1153   if (FunctionAST *F = ParseDefinition()) {
1154     fprintf(stderr, "Parsed a function definition.\n");
1155   } else {
1156     // Skip token for error recovery.
1157     getNextToken();
1158   }
1159 }
1160
1161 static void HandleExtern() {
1162   if (PrototypeAST *P = ParseExtern()) {
1163     fprintf(stderr, "Parsed an extern\n");
1164   } else {
1165     // Skip token for error recovery.
1166     getNextToken();
1167   }
1168 }
1169
1170 static void HandleTopLevelExpression() {
1171   // Evaluate a top-level expression into an anonymous function.
1172   if (FunctionAST *F = ParseTopLevelExpr()) {
1173     fprintf(stderr, "Parsed a top-level expr\n");
1174   } else {
1175     // Skip token for error recovery.
1176     getNextToken();
1177   }
1178 }
1179
1180 /// top ::= definition | external | expression | ';'
1181 static void MainLoop() {
1182   while (1) {
1183     fprintf(stderr, "ready&gt; ");
1184     switch (CurTok) {
1185     case tok_eof:    return;
1186     case ';':        getNextToken(); break;  // ignore top-level semicolons.
1187     case tok_def:    HandleDefinition(); break;
1188     case tok_extern: HandleExtern(); break;
1189     default:         HandleTopLevelExpression(); break;
1190     }
1191   }
1192 }
1193
1194 //===----------------------------------------------------------------------===//
1195 // Main driver code.
1196 //===----------------------------------------------------------------------===//
1197
1198 int main() {
1199   // Install standard binary operators.
1200   // 1 is lowest precedence.
1201   BinopPrecedence['&lt;'] = 10;
1202   BinopPrecedence['+'] = 20;
1203   BinopPrecedence['-'] = 20;
1204   BinopPrecedence['*'] = 40;  // highest.
1205
1206   // Prime the first token.
1207   fprintf(stderr, "ready&gt; ");
1208   getNextToken();
1209
1210   MainLoop();
1211   return 0;
1212 }
1213 </pre>
1214 </div>
1215 <a href="LangImpl3.html">Next: Implementing Code Generation to LLVM IR</a>
1216 </div>
1217
1218 <!-- *********************************************************************** -->
1219 <hr>
1220 <address>
1221   <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1222   src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1223   <a href="http://validator.w3.org/check/referer"><img
1224   src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
1225
1226   <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1227   <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
1228   Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
1229 </address>
1230 </body>
1231 </html>