docs/tutorial/OCamlLangImpl2.html

   1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   2                       "http://www.w3.org/TR/html4/strict.dtd">
   3
   4 <html>
   5 <head>
   6   <title>Kaleidoscope: Implementing a Parser and AST</title>
   7   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   8   <meta name="author" content="Chris Lattner">
   9   <meta name="author" content="Erick Tryzelaar">
  10   <link rel="stylesheet" href="../llvm.css" type="text/css">
  11 </head>
  12
  13 <body>
  14
  15 <div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
  16
  17 <ul>
  18 <li><a href="index.html">Up to Tutorial Index</a></li>
  19 <li>Chapter 2
  20   <ol>
  21     <li><a href="#intro">Chapter 2 Introduction</a></li>
  22     <li><a href="#ast">The Abstract Syntax Tree (AST)</a></li>
  23     <li><a href="#parserbasics">Parser Basics</a></li>
  24     <li><a href="#parserprimexprs">Basic Expression Parsing</a></li>
  25     <li><a href="#parserbinops">Binary Expression Parsing</a></li>
  26     <li><a href="#parsertop">Parsing the Rest</a></li>
  27     <li><a href="#driver">The Driver</a></li>
  28     <li><a href="#conclusions">Conclusions</a></li>
  29     <li><a href="#code">Full Code Listing</a></li>
  30   </ol>
  31 </li>
  32 <li><a href="OCamlLangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li>
  33 </ul>
  34
  35 <div class="doc_author">
  36         <p>
  37                 Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a>
  38                 and <a href="mailto:idadesub@users.sourceforge.net">Erick Tryzelaar</a>
  39         </p>
  40 </div>
  41
  42 <!-- *********************************************************************** -->
  43 <div class="doc_section"><a name="intro">Chapter 2 Introduction</a></div>
  44 <!-- *********************************************************************** -->
  45
  46 <div class="doc_text">
  47
  48 <p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language
  49 with LLVM in Objective Caml</a>" tutorial.  This chapter shows you how to use
  50 the lexer, built in <a href="OCamlLangImpl1.html">Chapter 1</a>, to build a
  51 full <a href="http://en.wikipedia.org/wiki/Parsing">parser</a> for our
  52 Kaleidoscope language.  Once we have a parser, we'll define and build an <a
  53 href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
  54 Tree</a> (AST).</p>
  55
  56 <p>The parser we will build uses a combination of <a
  57 href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
  58 Parsing</a> and <a href=
  59 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
  60 Parsing</a> to parse the Kaleidoscope language (the latter for
  61 binary expressions and the former for everything else).  Before we get to
  62 parsing though, lets talk about the output of the parser: the Abstract Syntax
  63 Tree.</p>
  64
  65 </div>
  66
  67 <!-- *********************************************************************** -->
  68 <div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
  69 <!-- *********************************************************************** -->
  70
  71 <div class="doc_text">
  72
  73 <p>The AST for a program captures its behavior in such a way that it is easy for
  74 later stages of the compiler (e.g. code generation) to interpret.  We basically
  75 want one object for each construct in the language, and the AST should closely
  76 model the language.  In Kaleidoscope, we have expressions, a prototype, and a
  77 function object.  We'll start with expressions first:</p>
  78
  79 <div class="doc_code">
  80 <pre>
  81 (* expr - Base type for all expression nodes. *)
  82 type expr =
  83   (* variant for numeric literals like "1.0". *)
  84   | Number of float
  85 </pre>
  86 </div>
  87
  88 <p>The code above shows the definition of the base ExprAST class and one
  89 subclass which we use for numeric literals.  The important thing to note about
  90 this code is that the Number variant captures the numeric value of the
  91 literal as an instance variable. This allows later phases of the compiler to
  92 know what the stored numeric value is.</p>
  93
  94 <p>Right now we only create the AST,  so there are no useful functions on
  95 them.  It would be very easy to add a function to pretty print the code,
  96 for example.  Here are the other expression AST node definitions that we'll use
  97 in the basic form of the Kaleidoscope language:
  98 </p>
  99
 100 <div class="doc_code">
 101 <pre>
 102   (* variant for referencing a variable, like "a". *)
 103   | Variable of string
 104
 105   (* variant for a binary operator. *)
 106   | Binary of char * expr * expr
 107
 108   (* variant for function calls. *)
 109   | Call of string * expr array
 110 </pre>
 111 </div>
 112
 113 <p>This is all (intentionally) rather straight-forward: variables capture the
 114 variable name, binary operators capture their opcode (e.g. '+'), and calls
 115 capture a function name as well as a list of any argument expressions.  One thing
 116 that is nice about our AST is that it captures the language features without
 117 talking about the syntax of the language.  Note that there is no discussion about
 118 precedence of binary operators, lexical structure, etc.</p>
 119
 120 <p>For our basic language, these are all of the expression nodes we'll define.
 121 Because it doesn't have conditional control flow, it isn't Turing-complete;
 122 we'll fix that in a later installment.  The two things we need next are a way
 123 to talk about the interface to a function, and a way to talk about functions
 124 themselves:</p>
 125
 126 <div class="doc_code">
 127 <pre>
 128 (* proto - This type represents the "prototype" for a function, which captures
 129  * its name, and its argument names (thus implicitly the number of arguments the
 130  * function takes). *)
 131 type proto = Prototype of string * string array
 132
 133 (* func - This type represents a function definition itself. *)
 134 type func = Function of proto * expr
 135 </pre>
 136 </div>
 137
 138 <p>In Kaleidoscope, functions are typed with just a count of their arguments.
 139 Since all values are double precision floating point, the type of each argument
 140 doesn't need to be stored anywhere.  In a more aggressive and realistic
 141 language, the "expr" variants would probably have a type field.</p>
 142
 143 <p>With this scaffolding, we can now talk about parsing expressions and function
 144 bodies in Kaleidoscope.</p>
 145
 146 </div>
 147
 148 <!-- *********************************************************************** -->
 149 <div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
 150 <!-- *********************************************************************** -->
 151
 152 <div class="doc_text">
 153
 154 <p>Now that we have an AST to build, we need to define the parser code to build
 155 it.  The idea here is that we want to parse something like "x+y" (which is
 156 returned as three tokens by the lexer) into an AST that could be generated with
 157 calls like this:</p>
 158
 159 <div class="doc_code">
 160 <pre>
 161   let x = Variable "x" in
 162   let y = Variable "y" in
 163   let result = Binary ('+', x, y) in
 164   ...
 165 </pre>
 166 </div>
 167
 168 <p>
 169 The error handling routines make use of the builtin <tt>Stream.Failure</tt> and
 170 <tt>Stream.Error</tt>s.  <tt>Stream.Failure</tt> is raised when the parser is
 171 unable to find any matching token in the first position of a pattern.
 172 <tt>Stream.Error</tt> is raised when the first token matches, but the rest do
 173 not.  The error recovery in our parser will not be the best and is not
 174 particular user-friendly, but it will be enough for our tutorial.  These
 175 exceptions make it easier to handle errors in routines that have various return
 176 types.</p>
 177
 178 <p>With these basic types and exceptions, we can implement the first
 179 piece of our grammar: numeric literals.</p>
 180
 181 </div>
 182
 183 <!-- *********************************************************************** -->
 184 <div class="doc_section"><a name="parserprimexprs">Basic Expression
 185  Parsing</a></div>
 186 <!-- *********************************************************************** -->
 187
 188 <div class="doc_text">
 189
 190 <p>We start with numeric literals, because they are the simplest to process.
 191 For each production in our grammar, we'll define a function which parses that
 192 production.  We call this class of expressions "primary" expressions, for
 193 reasons that will become more clear <a href="OCamlLangImpl6.html#unary">
 194 later in the tutorial</a>.  In order to parse an arbitrary primary expression,
 195 we need to determine what sort of expression it is.  For numeric literals, we
 196 have:</p>
 197
 198 <div class="doc_code">
 199 <pre>
 200 (* primary
 201  *   ::= identifier
 202  *   ::= numberexpr
 203  *   ::= parenexpr *)
 204 parse_primary = parser
 205   (* numberexpr ::= number *)
 206   | [&lt; 'Token.Number n &gt;] -&gt; Ast.Number n
 207 </pre>
 208 </div>
 209
 210 <p>This routine is very simple: it expects to be called when the current token
 211 is a <tt>Token.Number</tt> token.  It takes the current number value, creates
 212 a <tt>Ast.Number</tt> node, advances the lexer to the next token, and finally
 213 returns.</p>
 214
 215 <p>There are some interesting aspects to this.  The most important one is that
 216 this routine eats all of the tokens that correspond to the production and
 217 returns the lexer buffer with the next token (which is not part of the grammar
 218 production) ready to go.  This is a fairly standard way to go for recursive
 219 descent parsers.  For a better example, the parenthesis operator is defined like
 220 this:</p>
 221
 222 <div class="doc_code">
 223 <pre>
 224   (* parenexpr ::= '(' expression ')' *)
 225   | [&lt; 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" &gt;] -&gt; e
 226 </pre>
 227 </div>
 228
 229 <p>This function illustrates a number of interesting things about the
 230 parser:</p>
 231
 232 <p>
 233 1) It shows how we use the <tt>Stream.Error</tt> exception.  When called, this
 234 function expects that the current token is a '(' token, but after parsing the
 235 subexpression, it is possible that there is no ')' waiting.  For example, if
 236 the user types in "(4 x" instead of "(4)", the parser should emit an error.
 237 Because errors can occur, the parser needs a way to indicate that they
 238 happened. In our parser, we use the camlp4 shortcut syntax <tt>token ?? "parse
 239 error"</tt>, where if the token before the <tt>??</tt> does not match, then
 240 <tt>Stream.Error "parse error"</tt> will be raised.</p>
 241
 242 <p>2) Another interesting aspect of this function is that it uses recursion by
 243 calling <tt>Parser.parse_primary</tt> (we will soon see that
 244 <tt>Parser.parse_primary</tt> can call <tt>Parser.parse_primary</tt>).  This is
 245 powerful because it allows us to handle recursive grammars, and keeps each
 246 production very simple.  Note that parentheses do not cause construction of AST
 247 nodes themselves.  While we could do it this way, the most important role of
 248 parentheses are to guide the parser and provide grouping.  Once the parser
 249 constructs the AST, parentheses are not needed.</p>
 250
 251 <p>The next simple production is for handling variable references and function
 252 calls:</p>
 253
 254 <div class="doc_code">
 255 <pre>
 256   (* identifierexpr
 257    *   ::= identifier
 258    *   ::= identifier '(' argumentexpr ')' *)
 259   | [&lt; 'Token.Ident id; stream &gt;] -&gt;
 260       let rec parse_args accumulator = parser
 261         | [&lt; e=parse_expr; stream &gt;] -&gt;
 262             begin parser
 263               | [&lt; 'Token.Kwd ','; e=parse_args (e :: accumulator) &gt;] -&gt; e
 264               | [&lt; &gt;] -&gt; e :: accumulator
 265             end stream
 266         | [&lt; &gt;] -&gt; accumulator
 267       in
 268       let rec parse_ident id = parser
 269         (* Call. *)
 270         | [&lt; 'Token.Kwd '(';
 271              args=parse_args [];
 272              'Token.Kwd ')' ?? "expected ')'"&gt;] -&gt;
 273             Ast.Call (id, Array.of_list (List.rev args))
 274
 275         (* Simple variable ref. *)
 276         | [&lt; &gt;] -&gt; Ast.Variable id
 277       in
 278       parse_ident id stream
 279 </pre>
 280 </div>
 281
 282 <p>This routine follows the same style as the other routines.  (It expects to be
 283 called if the current token is a <tt>Token.Ident</tt> token).  It also has
 284 recursion and error handling.  One interesting aspect of this is that it uses
 285 <em>look-ahead</em> to determine if the current identifier is a stand alone
 286 variable reference or if it is a function call expression.  It handles this by
 287 checking to see if the token after the identifier is a '(' token, constructing
 288 either a <tt>Ast.Variable</tt> or <tt>Ast.Call</tt> node as appropriate.
 289 </p>
 290
 291 <p>We finish up by raising an exception if we received a token we didn't
 292 expect:</p>
 293
 294 <div class="doc_code">
 295 <pre>
 296   | [&lt; &gt;] -&gt; raise (Stream.Error "unknown token when expecting an expression.")
 297 </pre>
 298 </div>
 299
 300 <p>Now that basic expressions are handled, we need to handle binary expressions.
 301 They are a bit more complex.</p>
 302
 303 </div>
 304
 305 <!-- *********************************************************************** -->
 306 <div class="doc_section"><a name="parserbinops">Binary Expression
 307  Parsing</a></div>
 308 <!-- *********************************************************************** -->
 309
 310 <div class="doc_text">
 311
 312 <p>Binary expressions are significantly harder to parse because they are often
 313 ambiguous.  For example, when given the string "x+y*z", the parser can choose
 314 to parse it as either "(x+y)*z" or "x+(y*z)".  With common definitions from
 315 mathematics, we expect the later parse, because "*" (multiplication) has
 316 higher <em>precedence</em> than "+" (addition).</p>
 317
 318 <p>There are many ways to handle this, but an elegant and efficient way is to
 319 use <a href=
 320 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
 321 Parsing</a>.  This parsing technique uses the precedence of binary operators to
 322 guide recursion.  To start with, we need a table of precedences:</p>
 323
 324 <div class="doc_code">
 325 <pre>
 326 (* binop_precedence - This holds the precedence for each binary operator that is
 327  * defined *)
 328 let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10
 329
 330 (* precedence - Get the precedence of the pending binary operator token. *)
 331 let precedence c = try Hashtbl.find binop_precedence c with Not_found -&gt; -1
 332
 333 ...
 334
 335 let main () =
 336   (* Install standard binary operators.
 337    * 1 is the lowest precedence. *)
 338   Hashtbl.add Parser.binop_precedence '&lt;' 10;
 339   Hashtbl.add Parser.binop_precedence '+' 20;
 340   Hashtbl.add Parser.binop_precedence '-' 20;
 341   Hashtbl.add Parser.binop_precedence '*' 40;    (* highest. *)
 342   ...
 343 </pre>
 344 </div>
 345
 346 <p>For the basic form of Kaleidoscope, we will only support 4 binary operators
 347 (this can obviously be extended by you, our brave and intrepid reader).  The
 348 <tt>Parser.precedence</tt> function returns the precedence for the current
 349 token, or -1 if the token is not a binary operator.  Having a <tt>Hashtbl.t</tt>
 350 makes it easy to add new operators and makes it clear that the algorithm doesn't
 351 depend on the specific operators involved, but it would be easy enough to
 352 eliminate the <tt>Hashtbl.t</tt> and do the comparisons in the
 353 <tt>Parser.precedence</tt> function.  (Or just use a fixed-size array).</p>
 354
 355 <p>With the helper above defined, we can now start parsing binary expressions.
 356 The basic idea of operator precedence parsing is to break down an expression
 357 with potentially ambiguous binary operators into pieces.  Consider ,for example,
 358 the expression "a+b+(c+d)*e*f+g".  Operator precedence parsing considers this
 359 as a stream of primary expressions separated by binary operators.  As such,
 360 it will first parse the leading primary expression "a", then it will see the
 361 pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g].  Note that because parentheses
 362 are primary expressions, the binary expression parser doesn't need to worry
 363 about nested subexpressions like (c+d) at all.
 364 </p>
 365
 366 <p>
 367 To start, an expression is a primary expression potentially followed by a
 368 sequence of [binop,primaryexpr] pairs:</p>
 369
 370 <div class="doc_code">
 371 <pre>
 372 (* expression
 373  *   ::= primary binoprhs *)
 374 and parse_expr = parser
 375   | [&lt; lhs=parse_primary; stream &gt;] -&gt; parse_bin_rhs 0 lhs stream
 376 </pre>
 377 </div>
 378
 379 <p><tt>Parser.parse_bin_rhs</tt> is the function that parses the sequence of
 380 pairs for us.  It takes a precedence and a pointer to an expression for the part
 381 that has been parsed so far.   Note that "x" is a perfectly valid expression: As
 382 such, "binoprhs" is allowed to be empty, in which case it returns the expression
 383 that is passed into it. In our example above, the code passes the expression for
 384 "a" into <tt>Parser.parse_bin_rhs</tt> and the current token is "+".</p>
 385
 386 <p>The precedence value passed into <tt>Parser.parse_bin_rhs</tt> indicates the
 387 <em>minimal operator precedence</em> that the function is allowed to eat.  For
 388 example, if the current pair stream is [+, x] and <tt>Parser.parse_bin_rhs</tt>
 389 is passed in a precedence of 40, it will not consume any tokens (because the
 390 precedence of '+' is only 20).  With this in mind, <tt>Parser.parse_bin_rhs</tt>
 391 starts with:</p>
 392
 393 <div class="doc_code">
 394 <pre>
 395 (* binoprhs
 396  *   ::= ('+' primary)* *)
 397 and parse_bin_rhs expr_prec lhs stream =
 398   match Stream.peek stream with
 399   (* If this is a binop, find its precedence. *)
 400   | Some (Token.Kwd c) when Hashtbl.mem binop_precedence c -&gt;
 401       let token_prec = precedence c in
 402
 403       (* If this is a binop that binds at least as tightly as the current binop,
 404        * consume it, otherwise we are done. *)
 405       if token_prec &lt; expr_prec then lhs else begin
 406 </pre>
 407 </div>
 408
 409 <p>This code gets the precedence of the current token and checks to see if if is
 410 too low.  Because we defined invalid tokens to have a precedence of -1, this
 411 check implicitly knows that the pair-stream ends when the token stream runs out
 412 of binary operators.  If this check succeeds, we know that the token is a binary
 413 operator and that it will be included in this expression:</p>
 414
 415 <div class="doc_code">
 416 <pre>
 417         (* Eat the binop. *)
 418         Stream.junk stream;
 419
 420         (* Okay, we know this is a binop. *)
 421         let rhs =
 422           match Stream.peek stream with
 423           | Some (Token.Kwd c2) -&gt;
 424 </pre>
 425 </div>
 426
 427 <p>As such, this code eats (and remembers) the binary operator and then parses
 428 the primary expression that follows.  This builds up the whole pair, the first of
 429 which is [+, b] for the running example.</p>
 430
 431 <p>Now that we parsed the left-hand side of an expression and one pair of the
 432 RHS sequence, we have to decide which way the expression associates.  In
 433 particular, we could have "(a+b) binop unparsed"  or "a + (b binop unparsed)".
 434 To determine this, we look ahead at "binop" to determine its precedence and
 435 compare it to BinOp's precedence (which is '+' in this case):</p>
 436
 437 <div class="doc_code">
 438 <pre>
 439               (* If BinOp binds less tightly with rhs than the operator after
 440                * rhs, let the pending operator take rhs as its lhs. *)
 441               let next_prec = precedence c2 in
 442               if token_prec &lt; next_prec
 443 </pre>
 444 </div>
 445
 446 <p>If the precedence of the binop to the right of "RHS" is lower or equal to the
 447 precedence of our current operator, then we know that the parentheses associate
 448 as "(a+b) binop ...".  In our example, the current operator is "+" and the next
 449 operator is "+", we know that they have the same precedence.  In this case we'll
 450 create the AST node for "a+b", and then continue parsing:</p>
 451
 452 <div class="doc_code">
 453 <pre>
 454           ... if body omitted ...
 455         in
 456
 457         (* Merge lhs/rhs. *)
 458         let lhs = Ast.Binary (c, lhs, rhs) in
 459         parse_bin_rhs expr_prec lhs stream
 460       end
 461 </pre>
 462 </div>
 463
 464 <p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
 465 iteration of the loop, with "+" as the current token.  The code above will eat,
 466 remember, and parse "(c+d)" as the primary expression, which makes the
 467 current pair equal to [+, (c+d)].  It will then evaluate the 'if' conditional above with
 468 "*" as the binop to the right of the primary.  In this case, the precedence of "*" is
 469 higher than the precedence of "+" so the if condition will be entered.</p>
 470
 471 <p>The critical question left here is "how can the if condition parse the right
 472 hand side in full"?  In particular, to build the AST correctly for our example,
 473 it needs to get all of "(c+d)*e*f" as the RHS expression variable.  The code to
 474 do this is surprisingly simple (code from the above two blocks duplicated for
 475 context):</p>
 476
 477 <div class="doc_code">
 478 <pre>
 479           match Stream.peek stream with
 480           | Some (Token.Kwd c2) -&gt;
 481               (* If BinOp binds less tightly with rhs than the operator after
 482                * rhs, let the pending operator take rhs as its lhs. *)
 483               if token_prec &lt; precedence c2
 484               then <b>parse_bin_rhs (token_prec + 1) rhs stream</b>
 485               else rhs
 486           | _ -&gt; rhs
 487         in
 488
 489         (* Merge lhs/rhs. *)
 490         let lhs = Ast.Binary (c, lhs, rhs) in
 491         parse_bin_rhs expr_prec lhs stream
 492       end
 493 </pre>
 494 </div>
 495
 496 <p>At this point, we know that the binary operator to the RHS of our primary
 497 has higher precedence than the binop we are currently parsing.  As such, we know
 498 that any sequence of pairs whose operators are all higher precedence than "+"
 499 should be parsed together and returned as "RHS".  To do this, we recursively
 500 invoke the <tt>Parser.parse_bin_rhs</tt> function specifying "token_prec+1" as
 501 the minimum precedence required for it to continue.  In our example above, this
 502 will cause it to return the AST node for "(c+d)*e*f" as RHS, which is then set
 503 as the RHS of the '+' expression.</p>
 504
 505 <p>Finally, on the next iteration of the while loop, the "+g" piece is parsed
 506 and added to the AST.  With this little bit of code (14 non-trivial lines), we
 507 correctly handle fully general binary expression parsing in a very elegant way.
 508 This was a whirlwind tour of this code, and it is somewhat subtle.  I recommend
 509 running through it with a few tough examples to see how it works.
 510 </p>
 511
 512 <p>This wraps up handling of expressions.  At this point, we can point the
 513 parser at an arbitrary token stream and build an expression from it, stopping
 514 at the first token that is not part of the expression.  Next up we need to
 515 handle function definitions, etc.</p>
 516
 517 </div>
 518
 519 <!-- *********************************************************************** -->
 520 <div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
 521 <!-- *********************************************************************** -->
 522
 523 <div class="doc_text">
 524
 525 <p>
 526 The next thing missing is handling of function prototypes.  In Kaleidoscope,
 527 these are used both for 'extern' function declarations as well as function body
 528 definitions.  The code to do this is straight-forward and not very interesting
 529 (once you've survived expressions):
 530 </p>
 531
 532 <div class="doc_code">
 533 <pre>
 534 (* prototype
 535  *   ::= id '(' id* ')' *)
 536 let parse_prototype =
 537   let rec parse_args accumulator = parser
 538     | [&lt; 'Token.Ident id; e=parse_args (id::accumulator) &gt;] -&gt; e
 539     | [&lt; &gt;] -&gt; accumulator
 540   in
 541
 542   parser
 543   | [&lt; 'Token.Ident id;
 544        'Token.Kwd '(' ?? "expected '(' in prototype";
 545        args=parse_args [];
 546        'Token.Kwd ')' ?? "expected ')' in prototype" &gt;] -&gt;
 547       (* success. *)
 548       Ast.Prototype (id, Array.of_list (List.rev args))
 549
 550   | [&lt; &gt;] -&gt;
 551       raise (Stream.Error "expected function name in prototype")
 552 </pre>
 553 </div>
 554
 555 <p>Given this, a function definition is very simple, just a prototype plus
 556 an expression to implement the body:</p>
 557
 558 <div class="doc_code">
 559 <pre>
 560 (* definition ::= 'def' prototype expression *)
 561 let parse_definition = parser
 562   | [&lt; 'Token.Def; p=parse_prototype; e=parse_expr &gt;] -&gt;
 563       Ast.Function (p, e)
 564 </pre>
 565 </div>
 566
 567 <p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
 568 well as to support forward declaration of user functions.  These 'extern's are just
 569 prototypes with no body:</p>
 570
 571 <div class="doc_code">
 572 <pre>
 573 (*  external ::= 'extern' prototype *)
 574 let parse_extern = parser
 575   | [&lt; 'Token.Extern; e=parse_prototype &gt;] -&gt; e
 576 </pre>
 577 </div>
 578
 579 <p>Finally, we'll also let the user type in arbitrary top-level expressions and
 580 evaluate them on the fly.  We will handle this by defining anonymous nullary
 581 (zero argument) functions for them:</p>
 582
 583 <div class="doc_code">
 584 <pre>
 585 (* toplevelexpr ::= expression *)
 586 let parse_toplevel = parser
 587   | [&lt; e=parse_expr &gt;] -&gt;
 588       (* Make an anonymous proto. *)
 589       Ast.Function (Ast.Prototype ("", [||]), e)
 590 </pre>
 591 </div>
 592
 593 <p>Now that we have all the pieces, let's build a little driver that will let us
 594 actually <em>execute</em> this code we've built!</p>
 595
 596 </div>
 597
 598 <!-- *********************************************************************** -->
 599 <div class="doc_section"><a name="driver">The Driver</a></div>
 600 <!-- *********************************************************************** -->
 601
 602 <div class="doc_text">
 603
 604 <p>The driver for this simply invokes all of the parsing pieces with a top-level
 605 dispatch loop.  There isn't much interesting here, so I'll just include the
 606 top-level loop.  See <a href="#code">below</a> for full code in the "Top-Level
 607 Parsing" section.</p>
 608
 609 <div class="doc_code">
 610 <pre>
 611 (* top ::= definition | external | expression | ';' *)
 612 let rec main_loop stream =
 613   match Stream.peek stream with
 614   | None -&gt; ()
 615
 616   (* ignore top-level semicolons. *)
 617   | Some (Token.Kwd ';') -&gt;
 618       Stream.junk stream;
 619       main_loop stream
 620
 621   | Some token -&gt;
 622       begin
 623         try match token with
 624         | Token.Def -&gt;
 625             ignore(Parser.parse_definition stream);
 626             print_endline "parsed a function definition.";
 627         | Token.Extern -&gt;
 628             ignore(Parser.parse_extern stream);
 629             print_endline "parsed an extern.";
 630         | _ -&gt;
 631             (* Evaluate a top-level expression into an anonymous function. *)
 632             ignore(Parser.parse_toplevel stream);
 633             print_endline "parsed a top-level expr";
 634         with Stream.Error s -&gt;
 635           (* Skip token for error recovery. *)
 636           Stream.junk stream;
 637           print_endline s;
 638       end;
 639       print_string "ready&gt; "; flush stdout;
 640       main_loop stream
 641 </pre>
 642 </div>
 643
 644 <p>The most interesting part of this is that we ignore top-level semicolons.
 645 Why is this, you ask?  The basic reason is that if you type "4 + 5" at the
 646 command line, the parser doesn't know whether that is the end of what you will type
 647 or not.  For example, on the next line you could type "def foo..." in which case
 648 4+5 is the end of a top-level expression.  Alternatively you could type "* 6",
 649 which would continue the expression.  Having top-level semicolons allows you to
 650 type "4+5;", and the parser will know you are done.</p>
 651
 652 </div>
 653
 654 <!-- *********************************************************************** -->
 655 <div class="doc_section"><a name="conclusions">Conclusions</a></div>
 656 <!-- *********************************************************************** -->
 657
 658 <div class="doc_text">
 659
 660 <p>With just under 300 lines of commented code (240 lines of non-comment,
 661 non-blank code), we fully defined our minimal language, including a lexer,
 662 parser, and AST builder.  With this done, the executable will validate
 663 Kaleidoscope code and tell us if it is grammatically invalid.  For
 664 example, here is a sample interaction:</p>
 665
 666 <div class="doc_code">
 667 <pre>
 668 $ <b>./toy.byte</b>
 669 ready&gt; <b>def foo(x y) x+foo(y, 4.0);</b>
 670 Parsed a function definition.
 671 ready&gt; <b>def foo(x y) x+y y;</b>
 672 Parsed a function definition.
 673 Parsed a top-level expr
 674 ready&gt; <b>def foo(x y) x+y );</b>
 675 Parsed a function definition.
 676 Error: unknown token when expecting an expression
 677 ready&gt; <b>extern sin(a);</b>
 678 ready&gt; Parsed an extern
 679 ready&gt; <b>^D</b>
 680 $
 681 </pre>
 682 </div>
 683
 684 <p>There is a lot of room for extension here.  You can define new AST nodes,
 685 extend the language in many ways, etc.  In the <a href="OCamlLangImpl3.html">
 686 next installment</a>, we will describe how to generate LLVM Intermediate
 687 Representation (IR) from the AST.</p>
 688
 689 </div>
 690
 691 <!-- *********************************************************************** -->
 692 <div class="doc_section"><a name="code">Full Code Listing</a></div>
 693 <!-- *********************************************************************** -->
 694
 695 <div class="doc_text">
 696
 697 <p>
 698 Here is the complete code listing for this and the previous chapter.
 699 Note that it is fully self-contained: you don't need LLVM or any external
 700 libraries at all for this.  (Besides the ocaml standard libraries, of
 701 course.)  To build this, just compile with:</p>
 702
 703 <div class="doc_code">
 704 <pre>
 705 # Compile
 706 ocamlbuild toy.byte
 707 # Run
 708 ./toy.byte
 709 </pre>
 710 </div>
 711
 712 <p>Here is the code:</p>
 713
 714 <dl>
 715 <dt>_tags:</dt>
 716 <dd class="doc_code">
 717 <pre>
 718 &lt;{lexer,parser}.ml&gt;: use_camlp4, pp(camlp4of)
 719 </pre>
 720 </dd>
 721
 722 <dt>token.ml:</dt>
 723 <dd class="doc_code">
 724 <pre>
 725 (*===----------------------------------------------------------------------===
 726  * Lexer Tokens
 727  *===----------------------------------------------------------------------===*)
 728
 729 (* The lexer returns these 'Kwd' if it is an unknown character, otherwise one of
 730  * these others for known things. *)
 731 type token =
 732   (* commands *)
 733   | Def | Extern
 734
 735   (* primary *)
 736   | Ident of string | Number of float
 737
 738   (* unknown *)
 739   | Kwd of char
 740 </pre>
 741 </dd>
 742
 743 <dt>lexer.ml:</dt>
 744 <dd class="doc_code">
 745 <pre>
 746 (*===----------------------------------------------------------------------===
 747  * Lexer
 748  *===----------------------------------------------------------------------===*)
 749
 750 let rec lex = parser
 751   (* Skip any whitespace. *)
 752   | [&lt; ' (' ' | '\n' | '\r' | '\t'); stream &gt;] -&gt; lex stream
 753
 754   (* identifier: [a-zA-Z][a-zA-Z0-9] *)
 755   | [&lt; ' ('A' .. 'Z' | 'a' .. 'z' as c); stream &gt;] -&gt;
 756       let buffer = Buffer.create 1 in
 757       Buffer.add_char buffer c;
 758       lex_ident buffer stream
 759
 760   (* number: [0-9.]+ *)
 761   | [&lt; ' ('0' .. '9' as c); stream &gt;] -&gt;
 762       let buffer = Buffer.create 1 in
 763       Buffer.add_char buffer c;
 764       lex_number buffer stream
 765
 766   (* Comment until end of line. *)
 767   | [&lt; ' ('#'); stream &gt;] -&gt;
 768       lex_comment stream
 769
 770   (* Otherwise, just return the character as its ascii value. *)
 771   | [&lt; 'c; stream &gt;] -&gt;
 772       [&lt; 'Token.Kwd c; lex stream &gt;]
 773
 774   (* end of stream. *)
 775   | [&lt; &gt;] -&gt; [&lt; &gt;]
 776
 777 and lex_number buffer = parser
 778   | [&lt; ' ('0' .. '9' | '.' as c); stream &gt;] -&gt;
 779       Buffer.add_char buffer c;
 780       lex_number buffer stream
 781   | [&lt; stream=lex &gt;] -&gt;
 782       [&lt; 'Token.Number (float_of_string (Buffer.contents buffer)); stream &gt;]
 783
 784 and lex_ident buffer = parser
 785   | [&lt; ' ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' as c); stream &gt;] -&gt;
 786       Buffer.add_char buffer c;
 787       lex_ident buffer stream
 788   | [&lt; stream=lex &gt;] -&gt;
 789       match Buffer.contents buffer with
 790       | "def" -&gt; [&lt; 'Token.Def; stream &gt;]
 791       | "extern" -&gt; [&lt; 'Token.Extern; stream &gt;]
 792       | id -&gt; [&lt; 'Token.Ident id; stream &gt;]
 793
 794 and lex_comment = parser
 795   | [&lt; ' ('\n'); stream=lex &gt;] -&gt; stream
 796   | [&lt; 'c; e=lex_comment &gt;] -&gt; e
 797   | [&lt; &gt;] -&gt; [&lt; &gt;]
 798 </pre>
 799 </dd>
 800
 801 <dt>ast.ml:</dt>
 802 <dd class="doc_code">
 803 <pre>
 804 (*===----------------------------------------------------------------------===
 805  * Abstract Syntax Tree (aka Parse Tree)
 806  *===----------------------------------------------------------------------===*)
 807
 808 (* expr - Base type for all expression nodes. *)
 809 type expr =
 810   (* variant for numeric literals like "1.0". *)
 811   | Number of float
 812
 813   (* variant for referencing a variable, like "a". *)
 814   | Variable of string
 815
 816   (* variant for a binary operator. *)
 817   | Binary of char * expr * expr
 818
 819   (* variant for function calls. *)
 820   | Call of string * expr array
 821
 822 (* proto - This type represents the "prototype" for a function, which captures
 823  * its name, and its argument names (thus implicitly the number of arguments the
 824  * function takes). *)
 825 type proto = Prototype of string * string array
 826
 827 (* func - This type represents a function definition itself. *)
 828 type func = Function of proto * expr
 829 </pre>
 830 </dd>
 831
 832 <dt>parser.ml:</dt>
 833 <dd class="doc_code">
 834 <pre>
 835 (*===---------------------------------------------------------------------===
 836  * Parser
 837  *===---------------------------------------------------------------------===*)
 838
 839 (* binop_precedence - This holds the precedence for each binary operator that is
 840  * defined *)
 841 let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10
 842
 843 (* precedence - Get the precedence of the pending binary operator token. *)
 844 let precedence c = try Hashtbl.find binop_precedence c with Not_found -&gt; -1
 845
 846 (* primary
 847  *   ::= identifier
 848  *   ::= numberexpr
 849  *   ::= parenexpr *)
 850 let rec parse_primary = parser
 851   (* numberexpr ::= number *)
 852   | [&lt; 'Token.Number n &gt;] -&gt; Ast.Number n
 853
 854   (* parenexpr ::= '(' expression ')' *)
 855   | [&lt; 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" &gt;] -&gt; e
 856
 857   (* identifierexpr
 858    *   ::= identifier
 859    *   ::= identifier '(' argumentexpr ')' *)
 860   | [&lt; 'Token.Ident id; stream &gt;] -&gt;
 861       let rec parse_args accumulator = parser
 862         | [&lt; e=parse_expr; stream &gt;] -&gt;
 863             begin parser
 864               | [&lt; 'Token.Kwd ','; e=parse_args (e :: accumulator) &gt;] -&gt; e
 865               | [&lt; &gt;] -&gt; e :: accumulator
 866             end stream
 867         | [&lt; &gt;] -&gt; accumulator
 868       in
 869       let rec parse_ident id = parser
 870         (* Call. *)
 871         | [&lt; 'Token.Kwd '(';
 872              args=parse_args [];
 873              'Token.Kwd ')' ?? "expected ')'"&gt;] -&gt;
 874             Ast.Call (id, Array.of_list (List.rev args))
 875
 876         (* Simple variable ref. *)
 877         | [&lt; &gt;] -&gt; Ast.Variable id
 878       in
 879       parse_ident id stream
 880
 881   | [&lt; &gt;] -&gt; raise (Stream.Error "unknown token when expecting an expression.")
 882
 883 (* binoprhs
 884  *   ::= ('+' primary)* *)
 885 and parse_bin_rhs expr_prec lhs stream =
 886   match Stream.peek stream with
 887   (* If this is a binop, find its precedence. *)
 888   | Some (Token.Kwd c) when Hashtbl.mem binop_precedence c -&gt;
 889       let token_prec = precedence c in
 890
 891       (* If this is a binop that binds at least as tightly as the current binop,
 892        * consume it, otherwise we are done. *)
 893       if token_prec &lt; expr_prec then lhs else begin
 894         (* Eat the binop. *)
 895         Stream.junk stream;
 896
 897         (* Parse the primary expression after the binary operator. *)
 898         let rhs = parse_primary stream in
 899
 900         (* Okay, we know this is a binop. *)
 901         let rhs =
 902           match Stream.peek stream with
 903           | Some (Token.Kwd c2) -&gt;
 904               (* If BinOp binds less tightly with rhs than the operator after
 905                * rhs, let the pending operator take rhs as its lhs. *)
 906               let next_prec = precedence c2 in
 907               if token_prec &lt; next_prec
 908               then parse_bin_rhs (token_prec + 1) rhs stream
 909               else rhs
 910           | _ -&gt; rhs
 911         in
 912
 913         (* Merge lhs/rhs. *)
 914         let lhs = Ast.Binary (c, lhs, rhs) in
 915         parse_bin_rhs expr_prec lhs stream
 916       end
 917   | _ -&gt; lhs
 918
 919 (* expression
 920  *   ::= primary binoprhs *)
 921 and parse_expr = parser
 922   | [&lt; lhs=parse_primary; stream &gt;] -&gt; parse_bin_rhs 0 lhs stream
 923
 924 (* prototype
 925  *   ::= id '(' id* ')' *)
 926 let parse_prototype =
 927   let rec parse_args accumulator = parser
 928     | [&lt; 'Token.Ident id; e=parse_args (id::accumulator) &gt;] -&gt; e
 929     | [&lt; &gt;] -&gt; accumulator
 930   in
 931
 932   parser
 933   | [&lt; 'Token.Ident id;
 934        'Token.Kwd '(' ?? "expected '(' in prototype";
 935        args=parse_args [];
 936        'Token.Kwd ')' ?? "expected ')' in prototype" &gt;] -&gt;
 937       (* success. *)
 938       Ast.Prototype (id, Array.of_list (List.rev args))
 939
 940   | [&lt; &gt;] -&gt;
 941       raise (Stream.Error "expected function name in prototype")
 942
 943 (* definition ::= 'def' prototype expression *)
 944 let parse_definition = parser
 945   | [&lt; 'Token.Def; p=parse_prototype; e=parse_expr &gt;] -&gt;
 946       Ast.Function (p, e)
 947
 948 (* toplevelexpr ::= expression *)
 949 let parse_toplevel = parser
 950   | [&lt; e=parse_expr &gt;] -&gt;
 951       (* Make an anonymous proto. *)
 952       Ast.Function (Ast.Prototype ("", [||]), e)
 953
 954 (*  external ::= 'extern' prototype *)
 955 let parse_extern = parser
 956   | [&lt; 'Token.Extern; e=parse_prototype &gt;] -&gt; e
 957 </pre>
 958 </dd>
 959
 960 <dt>toplevel.ml:</dt>
 961 <dd class="doc_code">
 962 <pre>
 963 (*===----------------------------------------------------------------------===
 964  * Top-Level parsing and JIT Driver
 965  *===----------------------------------------------------------------------===*)
 966
 967 (* top ::= definition | external | expression | ';' *)
 968 let rec main_loop stream =
 969   match Stream.peek stream with
 970   | None -&gt; ()
 971
 972   (* ignore top-level semicolons. *)
 973   | Some (Token.Kwd ';') -&gt;
 974       Stream.junk stream;
 975       main_loop stream
 976
 977   | Some token -&gt;
 978       begin
 979         try match token with
 980         | Token.Def -&gt;
 981             ignore(Parser.parse_definition stream);
 982             print_endline "parsed a function definition.";
 983         | Token.Extern -&gt;
 984             ignore(Parser.parse_extern stream);
 985             print_endline "parsed an extern.";
 986         | _ -&gt;
 987             (* Evaluate a top-level expression into an anonymous function. *)
 988             ignore(Parser.parse_toplevel stream);
 989             print_endline "parsed a top-level expr";
 990         with Stream.Error s -&gt;
 991           (* Skip token for error recovery. *)
 992           Stream.junk stream;
 993           print_endline s;
 994       end;
 995       print_string "ready&gt; "; flush stdout;
 996       main_loop stream
 997 </pre>
 998 </dd>
 999
1000 <dt>toy.ml:</dt>
1001 <dd class="doc_code">
1002 <pre>
1003 (*===----------------------------------------------------------------------===
1004  * Main driver code.
1005  *===----------------------------------------------------------------------===*)
1006
1007 let main () =
1008   (* Install standard binary operators.
1009    * 1 is the lowest precedence. *)
1010   Hashtbl.add Parser.binop_precedence '&lt;' 10;
1011   Hashtbl.add Parser.binop_precedence '+' 20;
1012   Hashtbl.add Parser.binop_precedence '-' 20;
1013   Hashtbl.add Parser.binop_precedence '*' 40;    (* highest. *)
1014
1015   (* Prime the first token. *)
1016   print_string "ready&gt; "; flush stdout;
1017   let stream = Lexer.lex (Stream.of_channel stdin) in
1018
1019   (* Run the main "interpreter loop" now. *)
1020   Toplevel.main_loop stream;
1021 ;;
1022
1023 main ()
1024 </pre>
1025 </dd>
1026 </dl>
1027
1028 <a href="OCamlLangImpl3.html">Next: Implementing Code Generation to LLVM IR</a>
1029 </div>
1030
1031 <!-- *********************************************************************** -->
1032 <hr>
1033 <address>
1034   <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1035   src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1036   <a href="http://validator.w3.org/check/referer"><img
1037   src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
1038
1039   <a href="mailto:sabre@nondot.org">Chris Lattner</a>
1040   <a href="mailto:erickt@users.sourceforge.net">Erick Tryzelaar</a><br>
1041   <a href="http://llvm.org/">The LLVM Compiler Infrastructure</a><br>
1042   Last modified: $Date$
1043 </address>
1044 </body>
1045 </html>