docs/tutorial/OCamlLangImpl2.rst

   1 ===========================================
   2 Kaleidoscope: Implementing a Parser and AST
   3 ===========================================
   4
   5 .. contents::
   6    :local:
   7
   8 Written by `Chris Lattner <mailto:sabre@nondot.org>`_ and `Erick
   9 Tryzelaar <mailto:idadesub@users.sourceforge.net>`_
  10
  11 Chapter 2 Introduction
  12 ======================
  13
  14 Welcome to Chapter 2 of the "`Implementing a language with LLVM in
  15 Objective Caml <index.html>`_" tutorial. This chapter shows you how to
  16 use the lexer, built in `Chapter 1 <OCamlLangImpl1.html>`_, to build a
  17 full `parser <http://en.wikipedia.org/wiki/Parsing>`_ for our
  18 Kaleidoscope language. Once we have a parser, we'll define and build an
  19 `Abstract Syntax
  20 Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).
  21
  22 The parser we will build uses a combination of `Recursive Descent
  23 Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
  24 `Operator-Precedence
  25 Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
  26 parse the Kaleidoscope language (the latter for binary expressions and
  27 the former for everything else). Before we get to parsing though, lets
  28 talk about the output of the parser: the Abstract Syntax Tree.
  29
  30 The Abstract Syntax Tree (AST)
  31 ==============================
  32
  33 The AST for a program captures its behavior in such a way that it is
  34 easy for later stages of the compiler (e.g. code generation) to
  35 interpret. We basically want one object for each construct in the
  36 language, and the AST should closely model the language. In
  37 Kaleidoscope, we have expressions, a prototype, and a function object.
  38 We'll start with expressions first:
  39
  40 .. code-block:: ocaml
  41
  42     (* expr - Base type for all expression nodes. *)
  43     type expr =
  44       (* variant for numeric literals like "1.0". *)
  45       | Number of float
  46
  47 The code above shows the definition of the base ExprAST class and one
  48 subclass which we use for numeric literals. The important thing to note
  49 about this code is that the Number variant captures the numeric value of
  50 the literal as an instance variable. This allows later phases of the
  51 compiler to know what the stored numeric value is.
  52
  53 Right now we only create the AST, so there are no useful functions on
  54 them. It would be very easy to add a function to pretty print the code,
  55 for example. Here are the other expression AST node definitions that
  56 we'll use in the basic form of the Kaleidoscope language:
  57
  58 .. code-block:: ocaml
  59
  60       (* variant for referencing a variable, like "a". *)
  61       | Variable of string
  62
  63       (* variant for a binary operator. *)
  64       | Binary of char * expr * expr
  65
  66       (* variant for function calls. *)
  67       | Call of string * expr array
  68
  69 This is all (intentionally) rather straight-forward: variables capture
  70 the variable name, binary operators capture their opcode (e.g. '+'), and
  71 calls capture a function name as well as a list of any argument
  72 expressions. One thing that is nice about our AST is that it captures
  73 the language features without talking about the syntax of the language.
  74 Note that there is no discussion about precedence of binary operators,
  75 lexical structure, etc.
  76
  77 For our basic language, these are all of the expression nodes we'll
  78 define. Because it doesn't have conditional control flow, it isn't
  79 Turing-complete; we'll fix that in a later installment. The two things
  80 we need next are a way to talk about the interface to a function, and a
  81 way to talk about functions themselves:
  82
  83 .. code-block:: ocaml
  84
  85     (* proto - This type represents the "prototype" for a function, which captures
  86      * its name, and its argument names (thus implicitly the number of arguments the
  87      * function takes). *)
  88     type proto = Prototype of string * string array
  89
  90     (* func - This type represents a function definition itself. *)
  91     type func = Function of proto * expr
  92
  93 In Kaleidoscope, functions are typed with just a count of their
  94 arguments. Since all values are double precision floating point, the
  95 type of each argument doesn't need to be stored anywhere. In a more
  96 aggressive and realistic language, the "expr" variants would probably
  97 have a type field.
  98
  99 With this scaffolding, we can now talk about parsing expressions and
 100 function bodies in Kaleidoscope.
 101
 102 Parser Basics
 103 =============
 104
 105 Now that we have an AST to build, we need to define the parser code to
 106 build it. The idea here is that we want to parse something like "x+y"
 107 (which is returned as three tokens by the lexer) into an AST that could
 108 be generated with calls like this:
 109
 110 .. code-block:: ocaml
 111
 112       let x = Variable "x" in
 113       let y = Variable "y" in
 114       let result = Binary ('+', x, y) in
 115       ...
 116
 117 The error handling routines make use of the builtin ``Stream.Failure``
 118 and ``Stream.Error``s. ``Stream.Failure`` is raised when the parser is
 119 unable to find any matching token in the first position of a pattern.
 120 ``Stream.Error`` is raised when the first token matches, but the rest do
 121 not. The error recovery in our parser will not be the best and is not
 122 particular user-friendly, but it will be enough for our tutorial. These
 123 exceptions make it easier to handle errors in routines that have various
 124 return types.
 125
 126 With these basic types and exceptions, we can implement the first piece
 127 of our grammar: numeric literals.
 128
 129 Basic Expression Parsing
 130 ========================
 131
 132 We start with numeric literals, because they are the simplest to
 133 process. For each production in our grammar, we'll define a function
 134 which parses that production. We call this class of expressions
 135 "primary" expressions, for reasons that will become more clear `later in
 136 the tutorial <OCamlLangImpl6.html#unary>`_. In order to parse an
 137 arbitrary primary expression, we need to determine what sort of
 138 expression it is. For numeric literals, we have:
 139
 140 .. code-block:: ocaml
 141
 142     (* primary
 143      *   ::= identifier
 144      *   ::= numberexpr
 145      *   ::= parenexpr *)
 146     parse_primary = parser
 147       (* numberexpr ::= number *)
 148       | [< 'Token.Number n >] -> Ast.Number n
 149
 150 This routine is very simple: it expects to be called when the current
 151 token is a ``Token.Number`` token. It takes the current number value,
 152 creates a ``Ast.Number`` node, advances the lexer to the next token, and
 153 finally returns.
 154
 155 There are some interesting aspects to this. The most important one is
 156 that this routine eats all of the tokens that correspond to the
 157 production and returns the lexer buffer with the next token (which is
 158 not part of the grammar production) ready to go. This is a fairly
 159 standard way to go for recursive descent parsers. For a better example,
 160 the parenthesis operator is defined like this:
 161
 162 .. code-block:: ocaml
 163
 164       (* parenexpr ::= '(' expression ')' *)
 165       | [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e
 166
 167 This function illustrates a number of interesting things about the
 168 parser:
 169
 170 1) It shows how we use the ``Stream.Error`` exception. When called, this
 171 function expects that the current token is a '(' token, but after
 172 parsing the subexpression, it is possible that there is no ')' waiting.
 173 For example, if the user types in "(4 x" instead of "(4)", the parser
 174 should emit an error. Because errors can occur, the parser needs a way
 175 to indicate that they happened. In our parser, we use the camlp4
 176 shortcut syntax ``token ?? "parse error"``, where if the token before
 177 the ``??`` does not match, then ``Stream.Error "parse error"`` will be
 178 raised.
 179
 180 2) Another interesting aspect of this function is that it uses recursion
 181 by calling ``Parser.parse_primary`` (we will soon see that
 182 ``Parser.parse_primary`` can call ``Parser.parse_primary``). This is
 183 powerful because it allows us to handle recursive grammars, and keeps
 184 each production very simple. Note that parentheses do not cause
 185 construction of AST nodes themselves. While we could do it this way, the
 186 most important role of parentheses are to guide the parser and provide
 187 grouping. Once the parser constructs the AST, parentheses are not
 188 needed.
 189
 190 The next simple production is for handling variable references and
 191 function calls:
 192
 193 .. code-block:: ocaml
 194
 195       (* identifierexpr
 196        *   ::= identifier
 197        *   ::= identifier '(' argumentexpr ')' *)
 198       | [< 'Token.Ident id; stream >] ->
 199           let rec parse_args accumulator = parser
 200             | [< e=parse_expr; stream >] ->
 201                 begin parser
 202                   | [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e
 203                   | [< >] -> e :: accumulator
 204                 end stream
 205             | [< >] -> accumulator
 206           in
 207           let rec parse_ident id = parser
 208             (* Call. *)
 209             | [< 'Token.Kwd '(';
 210                  args=parse_args [];
 211                  'Token.Kwd ')' ?? "expected ')'">] ->
 212                 Ast.Call (id, Array.of_list (List.rev args))
 213
 214             (* Simple variable ref. *)
 215             | [< >] -> Ast.Variable id
 216           in
 217           parse_ident id stream
 218
 219 This routine follows the same style as the other routines. (It expects
 220 to be called if the current token is a ``Token.Ident`` token). It also
 221 has recursion and error handling. One interesting aspect of this is that
 222 it uses *look-ahead* to determine if the current identifier is a stand
 223 alone variable reference or if it is a function call expression. It
 224 handles this by checking to see if the token after the identifier is a
 225 '(' token, constructing either a ``Ast.Variable`` or ``Ast.Call`` node
 226 as appropriate.
 227
 228 We finish up by raising an exception if we received a token we didn't
 229 expect:
 230
 231 .. code-block:: ocaml
 232
 233       | [< >] -> raise (Stream.Error "unknown token when expecting an expression.")
 234
 235 Now that basic expressions are handled, we need to handle binary
 236 expressions. They are a bit more complex.
 237
 238 Binary Expression Parsing
 239 =========================
 240
 241 Binary expressions are significantly harder to parse because they are
 242 often ambiguous. For example, when given the string "x+y\*z", the parser
 243 can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common
 244 definitions from mathematics, we expect the later parse, because "\*"
 245 (multiplication) has higher *precedence* than "+" (addition).
 246
 247 There are many ways to handle this, but an elegant and efficient way is
 248 to use `Operator-Precedence
 249 Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
 250 This parsing technique uses the precedence of binary operators to guide
 251 recursion. To start with, we need a table of precedences:
 252
 253 .. code-block:: ocaml
 254
 255     (* binop_precedence - This holds the precedence for each binary operator that is
 256      * defined *)
 257     let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10
 258
 259     (* precedence - Get the precedence of the pending binary operator token. *)
 260     let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1
 261
 262     ...
 263
 264     let main () =
 265       (* Install standard binary operators.
 266        * 1 is the lowest precedence. *)
 267       Hashtbl.add Parser.binop_precedence '<' 10;
 268       Hashtbl.add Parser.binop_precedence '+' 20;
 269       Hashtbl.add Parser.binop_precedence '-' 20;
 270       Hashtbl.add Parser.binop_precedence '*' 40;    (* highest. *)
 271       ...
 272
 273 For the basic form of Kaleidoscope, we will only support 4 binary
 274 operators (this can obviously be extended by you, our brave and intrepid
 275 reader). The ``Parser.precedence`` function returns the precedence for
 276 the current token, or -1 if the token is not a binary operator. Having a
 277 ``Hashtbl.t`` makes it easy to add new operators and makes it clear that
 278 the algorithm doesn't depend on the specific operators involved, but it
 279 would be easy enough to eliminate the ``Hashtbl.t`` and do the
 280 comparisons in the ``Parser.precedence`` function. (Or just use a
 281 fixed-size array).
 282
 283 With the helper above defined, we can now start parsing binary
 284 expressions. The basic idea of operator precedence parsing is to break
 285 down an expression with potentially ambiguous binary operators into
 286 pieces. Consider ,for example, the expression "a+b+(c+d)\*e\*f+g".
 287 Operator precedence parsing considers this as a stream of primary
 288 expressions separated by binary operators. As such, it will first parse
 289 the leading primary expression "a", then it will see the pairs [+, b]
 290 [+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are
 291 primary expressions, the binary expression parser doesn't need to worry
 292 about nested subexpressions like (c+d) at all.
 293
 294 To start, an expression is a primary expression potentially followed by
 295 a sequence of [binop,primaryexpr] pairs:
 296
 297 .. code-block:: ocaml
 298
 299     (* expression
 300      *   ::= primary binoprhs *)
 301     and parse_expr = parser
 302       | [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream
 303
 304 ``Parser.parse_bin_rhs`` is the function that parses the sequence of
 305 pairs for us. It takes a precedence and a pointer to an expression for
 306 the part that has been parsed so far. Note that "x" is a perfectly valid
 307 expression: As such, "binoprhs" is allowed to be empty, in which case it
 308 returns the expression that is passed into it. In our example above, the
 309 code passes the expression for "a" into ``Parser.parse_bin_rhs`` and the
 310 current token is "+".
 311
 312 The precedence value passed into ``Parser.parse_bin_rhs`` indicates the
 313 *minimal operator precedence* that the function is allowed to eat. For
 314 example, if the current pair stream is [+, x] and
 315 ``Parser.parse_bin_rhs`` is passed in a precedence of 40, it will not
 316 consume any tokens (because the precedence of '+' is only 20). With this
 317 in mind, ``Parser.parse_bin_rhs`` starts with:
 318
 319 .. code-block:: ocaml
 320
 321     (* binoprhs
 322      *   ::= ('+' primary)* *)
 323     and parse_bin_rhs expr_prec lhs stream =
 324       match Stream.peek stream with
 325       (* If this is a binop, find its precedence. *)
 326       | Some (Token.Kwd c) when Hashtbl.mem binop_precedence c ->
 327           let token_prec = precedence c in
 328
 329           (* If this is a binop that binds at least as tightly as the current binop,
 330            * consume it, otherwise we are done. *)
 331           if token_prec < expr_prec then lhs else begin
 332
 333 This code gets the precedence of the current token and checks to see if
 334 if is too low. Because we defined invalid tokens to have a precedence of
 335 -1, this check implicitly knows that the pair-stream ends when the token
 336 stream runs out of binary operators. If this check succeeds, we know
 337 that the token is a binary operator and that it will be included in this
 338 expression:
 339
 340 .. code-block:: ocaml
 341
 342             (* Eat the binop. *)
 343             Stream.junk stream;
 344
 345             (* Okay, we know this is a binop. *)
 346             let rhs =
 347               match Stream.peek stream with
 348               | Some (Token.Kwd c2) ->
 349
 350 As such, this code eats (and remembers) the binary operator and then
 351 parses the primary expression that follows. This builds up the whole
 352 pair, the first of which is [+, b] for the running example.
 353
 354 Now that we parsed the left-hand side of an expression and one pair of
 355 the RHS sequence, we have to decide which way the expression associates.
 356 In particular, we could have "(a+b) binop unparsed" or "a + (b binop
 357 unparsed)". To determine this, we look ahead at "binop" to determine its
 358 precedence and compare it to BinOp's precedence (which is '+' in this
 359 case):
 360
 361 .. code-block:: ocaml
 362
 363                   (* If BinOp binds less tightly with rhs than the operator after
 364                    * rhs, let the pending operator take rhs as its lhs. *)
 365                   let next_prec = precedence c2 in
 366                   if token_prec < next_prec
 367
 368 If the precedence of the binop to the right of "RHS" is lower or equal
 369 to the precedence of our current operator, then we know that the
 370 parentheses associate as "(a+b) binop ...". In our example, the current
 371 operator is "+" and the next operator is "+", we know that they have the
 372 same precedence. In this case we'll create the AST node for "a+b", and
 373 then continue parsing:
 374
 375 .. code-block:: ocaml
 376
 377               ... if body omitted ...
 378             in
 379
 380             (* Merge lhs/rhs. *)
 381             let lhs = Ast.Binary (c, lhs, rhs) in
 382             parse_bin_rhs expr_prec lhs stream
 383           end
 384
 385 In our example above, this will turn "a+b+" into "(a+b)" and execute the
 386 next iteration of the loop, with "+" as the current token. The code
 387 above will eat, remember, and parse "(c+d)" as the primary expression,
 388 which makes the current pair equal to [+, (c+d)]. It will then evaluate
 389 the 'if' conditional above with "\*" as the binop to the right of the
 390 primary. In this case, the precedence of "\*" is higher than the
 391 precedence of "+" so the if condition will be entered.
 392
 393 The critical question left here is "how can the if condition parse the
 394 right hand side in full"? In particular, to build the AST correctly for
 395 our example, it needs to get all of "(c+d)\*e\*f" as the RHS expression
 396 variable. The code to do this is surprisingly simple (code from the
 397 above two blocks duplicated for context):
 398
 399 .. code-block:: ocaml
 400
 401               match Stream.peek stream with
 402               | Some (Token.Kwd c2) ->
 403                   (* If BinOp binds less tightly with rhs than the operator after
 404                    * rhs, let the pending operator take rhs as its lhs. *)
 405                   if token_prec < precedence c2
 406                   then parse_bin_rhs (token_prec + 1) rhs stream
 407                   else rhs
 408               | _ -> rhs
 409             in
 410
 411             (* Merge lhs/rhs. *)
 412             let lhs = Ast.Binary (c, lhs, rhs) in
 413             parse_bin_rhs expr_prec lhs stream
 414           end
 415
 416 At this point, we know that the binary operator to the RHS of our
 417 primary has higher precedence than the binop we are currently parsing.
 418 As such, we know that any sequence of pairs whose operators are all
 419 higher precedence than "+" should be parsed together and returned as
 420 "RHS". To do this, we recursively invoke the ``Parser.parse_bin_rhs``
 421 function specifying "token\_prec+1" as the minimum precedence required
 422 for it to continue. In our example above, this will cause it to return
 423 the AST node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of
 424 the '+' expression.
 425
 426 Finally, on the next iteration of the while loop, the "+g" piece is
 427 parsed and added to the AST. With this little bit of code (14
 428 non-trivial lines), we correctly handle fully general binary expression
 429 parsing in a very elegant way. This was a whirlwind tour of this code,
 430 and it is somewhat subtle. I recommend running through it with a few
 431 tough examples to see how it works.
 432
 433 This wraps up handling of expressions. At this point, we can point the
 434 parser at an arbitrary token stream and build an expression from it,
 435 stopping at the first token that is not part of the expression. Next up
 436 we need to handle function definitions, etc.
 437
 438 Parsing the Rest
 439 ================
 440
 441 The next thing missing is handling of function prototypes. In
 442 Kaleidoscope, these are used both for 'extern' function declarations as
 443 well as function body definitions. The code to do this is
 444 straight-forward and not very interesting (once you've survived
 445 expressions):
 446
 447 .. code-block:: ocaml
 448
 449     (* prototype
 450      *   ::= id '(' id* ')' *)
 451     let parse_prototype =
 452       let rec parse_args accumulator = parser
 453         | [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e
 454         | [< >] -> accumulator
 455       in
 456
 457       parser
 458       | [< 'Token.Ident id;
 459            'Token.Kwd '(' ?? "expected '(' in prototype";
 460            args=parse_args [];
 461            'Token.Kwd ')' ?? "expected ')' in prototype" >] ->
 462           (* success. *)
 463           Ast.Prototype (id, Array.of_list (List.rev args))
 464
 465       | [< >] ->
 466           raise (Stream.Error "expected function name in prototype")
 467
 468 Given this, a function definition is very simple, just a prototype plus
 469 an expression to implement the body:
 470
 471 .. code-block:: ocaml
 472
 473     (* definition ::= 'def' prototype expression *)
 474     let parse_definition = parser
 475       | [< 'Token.Def; p=parse_prototype; e=parse_expr >] ->
 476           Ast.Function (p, e)
 477
 478 In addition, we support 'extern' to declare functions like 'sin' and
 479 'cos' as well as to support forward declaration of user functions. These
 480 'extern's are just prototypes with no body:
 481
 482 .. code-block:: ocaml
 483
 484     (*  external ::= 'extern' prototype *)
 485     let parse_extern = parser
 486       | [< 'Token.Extern; e=parse_prototype >] -> e
 487
 488 Finally, we'll also let the user type in arbitrary top-level expressions
 489 and evaluate them on the fly. We will handle this by defining anonymous
 490 nullary (zero argument) functions for them:
 491
 492 .. code-block:: ocaml
 493
 494     (* toplevelexpr ::= expression *)
 495     let parse_toplevel = parser
 496       | [< e=parse_expr >] ->
 497           (* Make an anonymous proto. *)
 498           Ast.Function (Ast.Prototype ("", [||]), e)
 499
 500 Now that we have all the pieces, let's build a little driver that will
 501 let us actually *execute* this code we've built!
 502
 503 The Driver
 504 ==========
 505
 506 The driver for this simply invokes all of the parsing pieces with a
 507 top-level dispatch loop. There isn't much interesting here, so I'll just
 508 include the top-level loop. See `below <#code>`_ for full code in the
 509 "Top-Level Parsing" section.
 510
 511 .. code-block:: ocaml
 512
 513     (* top ::= definition | external | expression | ';' *)
 514     let rec main_loop stream =
 515       match Stream.peek stream with
 516       | None -> ()
 517
 518       (* ignore top-level semicolons. *)
 519       | Some (Token.Kwd ';') ->
 520           Stream.junk stream;
 521           main_loop stream
 522
 523       | Some token ->
 524           begin
 525             try match token with
 526             | Token.Def ->
 527                 ignore(Parser.parse_definition stream);
 528                 print_endline "parsed a function definition.";
 529             | Token.Extern ->
 530                 ignore(Parser.parse_extern stream);
 531                 print_endline "parsed an extern.";
 532             | _ ->
 533                 (* Evaluate a top-level expression into an anonymous function. *)
 534                 ignore(Parser.parse_toplevel stream);
 535                 print_endline "parsed a top-level expr";
 536             with Stream.Error s ->
 537               (* Skip token for error recovery. *)
 538               Stream.junk stream;
 539               print_endline s;
 540           end;
 541           print_string "ready> "; flush stdout;
 542           main_loop stream
 543
 544 The most interesting part of this is that we ignore top-level
 545 semicolons. Why is this, you ask? The basic reason is that if you type
 546 "4 + 5" at the command line, the parser doesn't know whether that is the
 547 end of what you will type or not. For example, on the next line you
 548 could type "def foo..." in which case 4+5 is the end of a top-level
 549 expression. Alternatively you could type "\* 6", which would continue
 550 the expression. Having top-level semicolons allows you to type "4+5;",
 551 and the parser will know you are done.
 552
 553 Conclusions
 554 ===========
 555
 556 With just under 300 lines of commented code (240 lines of non-comment,
 557 non-blank code), we fully defined our minimal language, including a
 558 lexer, parser, and AST builder. With this done, the executable will
 559 validate Kaleidoscope code and tell us if it is grammatically invalid.
 560 For example, here is a sample interaction:
 561
 562 .. code-block:: bash
 563
 564     $ ./toy.byte
 565     ready> def foo(x y) x+foo(y, 4.0);
 566     Parsed a function definition.
 567     ready> def foo(x y) x+y y;
 568     Parsed a function definition.
 569     Parsed a top-level expr
 570     ready> def foo(x y) x+y );
 571     Parsed a function definition.
 572     Error: unknown token when expecting an expression
 573     ready> extern sin(a);
 574     ready> Parsed an extern
 575     ready> ^D
 576     $
 577
 578 There is a lot of room for extension here. You can define new AST nodes,
 579 extend the language in many ways, etc. In the `next
 580 installment <OCamlLangImpl3.html>`_, we will describe how to generate
 581 LLVM Intermediate Representation (IR) from the AST.
 582
 583 Full Code Listing
 584 =================
 585
 586 Here is the complete code listing for this and the previous chapter.
 587 Note that it is fully self-contained: you don't need LLVM or any
 588 external libraries at all for this. (Besides the ocaml standard
 589 libraries, of course.) To build this, just compile with:
 590
 591 .. code-block:: bash
 592
 593     # Compile
 594     ocamlbuild toy.byte
 595     # Run
 596     ./toy.byte
 597
 598 Here is the code:
 599
 600 \_tags:
 601     ::
 602
 603         <{lexer,parser}.ml>: use_camlp4, pp(camlp4of)
 604
 605 token.ml:
 606     .. code-block:: ocaml
 607
 608         (*===----------------------------------------------------------------------===
 609          * Lexer Tokens
 610          *===----------------------------------------------------------------------===*)
 611
 612         (* The lexer returns these 'Kwd' if it is an unknown character, otherwise one of
 613          * these others for known things. *)
 614         type token =
 615           (* commands *)
 616           | Def | Extern
 617
 618           (* primary *)
 619           | Ident of string | Number of float
 620
 621           (* unknown *)
 622           | Kwd of char
 623
 624 lexer.ml:
 625     .. code-block:: ocaml
 626
 627         (*===----------------------------------------------------------------------===
 628          * Lexer
 629          *===----------------------------------------------------------------------===*)
 630
 631         let rec lex = parser
 632           (* Skip any whitespace. *)
 633           | [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream
 634
 635           (* identifier: [a-zA-Z][a-zA-Z0-9] *)
 636           | [< ' ('A' .. 'Z' | 'a' .. 'z' as c); stream >] ->
 637               let buffer = Buffer.create 1 in
 638               Buffer.add_char buffer c;
 639               lex_ident buffer stream
 640
 641           (* number: [0-9.]+ *)
 642           | [< ' ('0' .. '9' as c); stream >] ->
 643               let buffer = Buffer.create 1 in
 644               Buffer.add_char buffer c;
 645               lex_number buffer stream
 646
 647           (* Comment until end of line. *)
 648           | [< ' ('#'); stream >] ->
 649               lex_comment stream
 650
 651           (* Otherwise, just return the character as its ascii value. *)
 652           | [< 'c; stream >] ->
 653               [< 'Token.Kwd c; lex stream >]
 654
 655           (* end of stream. *)
 656           | [< >] -> [< >]
 657
 658         and lex_number buffer = parser
 659           | [< ' ('0' .. '9' | '.' as c); stream >] ->
 660               Buffer.add_char buffer c;
 661               lex_number buffer stream
 662           | [< stream=lex >] ->
 663               [< 'Token.Number (float_of_string (Buffer.contents buffer)); stream >]
 664
 665         and lex_ident buffer = parser
 666           | [< ' ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' as c); stream >] ->
 667               Buffer.add_char buffer c;
 668               lex_ident buffer stream
 669           | [< stream=lex >] ->
 670               match Buffer.contents buffer with
 671               | "def" -> [< 'Token.Def; stream >]
 672               | "extern" -> [< 'Token.Extern; stream >]
 673               | id -> [< 'Token.Ident id; stream >]
 674
 675         and lex_comment = parser
 676           | [< ' ('\n'); stream=lex >] -> stream
 677           | [< 'c; e=lex_comment >] -> e
 678           | [< >] -> [< >]
 679
 680 ast.ml:
 681     .. code-block:: ocaml
 682
 683         (*===----------------------------------------------------------------------===
 684          * Abstract Syntax Tree (aka Parse Tree)
 685          *===----------------------------------------------------------------------===*)
 686
 687         (* expr - Base type for all expression nodes. *)
 688         type expr =
 689           (* variant for numeric literals like "1.0". *)
 690           | Number of float
 691
 692           (* variant for referencing a variable, like "a". *)
 693           | Variable of string
 694
 695           (* variant for a binary operator. *)
 696           | Binary of char * expr * expr
 697
 698           (* variant for function calls. *)
 699           | Call of string * expr array
 700
 701         (* proto - This type represents the "prototype" for a function, which captures
 702          * its name, and its argument names (thus implicitly the number of arguments the
 703          * function takes). *)
 704         type proto = Prototype of string * string array
 705
 706         (* func - This type represents a function definition itself. *)
 707         type func = Function of proto * expr
 708
 709 parser.ml:
 710     .. code-block:: ocaml
 711
 712         (*===---------------------------------------------------------------------===
 713          * Parser
 714          *===---------------------------------------------------------------------===*)
 715
 716         (* binop_precedence - This holds the precedence for each binary operator that is
 717          * defined *)
 718         let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10
 719
 720         (* precedence - Get the precedence of the pending binary operator token. *)
 721         let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1
 722
 723         (* primary
 724          *   ::= identifier
 725          *   ::= numberexpr
 726          *   ::= parenexpr *)
 727         let rec parse_primary = parser
 728           (* numberexpr ::= number *)
 729           | [< 'Token.Number n >] -> Ast.Number n
 730
 731           (* parenexpr ::= '(' expression ')' *)
 732           | [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e
 733
 734           (* identifierexpr
 735            *   ::= identifier
 736            *   ::= identifier '(' argumentexpr ')' *)
 737           | [< 'Token.Ident id; stream >] ->
 738               let rec parse_args accumulator = parser
 739                 | [< e=parse_expr; stream >] ->
 740                     begin parser
 741                       | [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e
 742                       | [< >] -> e :: accumulator
 743                     end stream
 744                 | [< >] -> accumulator
 745               in
 746               let rec parse_ident id = parser
 747                 (* Call. *)
 748                 | [< 'Token.Kwd '(';
 749                      args=parse_args [];
 750                      'Token.Kwd ')' ?? "expected ')'">] ->
 751                     Ast.Call (id, Array.of_list (List.rev args))
 752
 753                 (* Simple variable ref. *)
 754                 | [< >] -> Ast.Variable id
 755               in
 756               parse_ident id stream
 757
 758           | [< >] -> raise (Stream.Error "unknown token when expecting an expression.")
 759
 760         (* binoprhs
 761          *   ::= ('+' primary)* *)
 762         and parse_bin_rhs expr_prec lhs stream =
 763           match Stream.peek stream with
 764           (* If this is a binop, find its precedence. *)
 765           | Some (Token.Kwd c) when Hashtbl.mem binop_precedence c ->
 766               let token_prec = precedence c in
 767
 768               (* If this is a binop that binds at least as tightly as the current binop,
 769                * consume it, otherwise we are done. *)
 770               if token_prec < expr_prec then lhs else begin
 771                 (* Eat the binop. *)
 772                 Stream.junk stream;
 773
 774                 (* Parse the primary expression after the binary operator. *)
 775                 let rhs = parse_primary stream in
 776
 777                 (* Okay, we know this is a binop. *)
 778                 let rhs =
 779                   match Stream.peek stream with
 780                   | Some (Token.Kwd c2) ->
 781                       (* If BinOp binds less tightly with rhs than the operator after
 782                        * rhs, let the pending operator take rhs as its lhs. *)
 783                       let next_prec = precedence c2 in
 784                       if token_prec < next_prec
 785                       then parse_bin_rhs (token_prec + 1) rhs stream
 786                       else rhs
 787                   | _ -> rhs
 788                 in
 789
 790                 (* Merge lhs/rhs. *)
 791                 let lhs = Ast.Binary (c, lhs, rhs) in
 792                 parse_bin_rhs expr_prec lhs stream
 793               end
 794           | _ -> lhs
 795
 796         (* expression
 797          *   ::= primary binoprhs *)
 798         and parse_expr = parser
 799           | [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream
 800
 801         (* prototype
 802          *   ::= id '(' id* ')' *)
 803         let parse_prototype =
 804           let rec parse_args accumulator = parser
 805             | [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e
 806             | [< >] -> accumulator
 807           in
 808
 809           parser
 810           | [< 'Token.Ident id;
 811                'Token.Kwd '(' ?? "expected '(' in prototype";
 812                args=parse_args [];
 813                'Token.Kwd ')' ?? "expected ')' in prototype" >] ->
 814               (* success. *)
 815               Ast.Prototype (id, Array.of_list (List.rev args))
 816
 817           | [< >] ->
 818               raise (Stream.Error "expected function name in prototype")
 819
 820         (* definition ::= 'def' prototype expression *)
 821         let parse_definition = parser
 822           | [< 'Token.Def; p=parse_prototype; e=parse_expr >] ->
 823               Ast.Function (p, e)
 824
 825         (* toplevelexpr ::= expression *)
 826         let parse_toplevel = parser
 827           | [< e=parse_expr >] ->
 828               (* Make an anonymous proto. *)
 829               Ast.Function (Ast.Prototype ("", [||]), e)
 830
 831         (*  external ::= 'extern' prototype *)
 832         let parse_extern = parser
 833           | [< 'Token.Extern; e=parse_prototype >] -> e
 834
 835 toplevel.ml:
 836     .. code-block:: ocaml
 837
 838         (*===----------------------------------------------------------------------===
 839          * Top-Level parsing and JIT Driver
 840          *===----------------------------------------------------------------------===*)
 841
 842         (* top ::= definition | external | expression | ';' *)
 843         let rec main_loop stream =
 844           match Stream.peek stream with
 845           | None -> ()
 846
 847           (* ignore top-level semicolons. *)
 848           | Some (Token.Kwd ';') ->
 849               Stream.junk stream;
 850               main_loop stream
 851
 852           | Some token ->
 853               begin
 854                 try match token with
 855                 | Token.Def ->
 856                     ignore(Parser.parse_definition stream);
 857                     print_endline "parsed a function definition.";
 858                 | Token.Extern ->
 859                     ignore(Parser.parse_extern stream);
 860                     print_endline "parsed an extern.";
 861                 | _ ->
 862                     (* Evaluate a top-level expression into an anonymous function. *)
 863                     ignore(Parser.parse_toplevel stream);
 864                     print_endline "parsed a top-level expr";
 865                 with Stream.Error s ->
 866                   (* Skip token for error recovery. *)
 867                   Stream.junk stream;
 868                   print_endline s;
 869               end;
 870               print_string "ready> "; flush stdout;
 871               main_loop stream
 872
 873 toy.ml:
 874     .. code-block:: ocaml
 875
 876         (*===----------------------------------------------------------------------===
 877          * Main driver code.
 878          *===----------------------------------------------------------------------===*)
 879
 880         let main () =
 881           (* Install standard binary operators.
 882            * 1 is the lowest precedence. *)
 883           Hashtbl.add Parser.binop_precedence '<' 10;
 884           Hashtbl.add Parser.binop_precedence '+' 20;
 885           Hashtbl.add Parser.binop_precedence '-' 20;
 886           Hashtbl.add Parser.binop_precedence '*' 40;    (* highest. *)
 887
 888           (* Prime the first token. *)
 889           print_string "ready> "; flush stdout;
 890           let stream = Lexer.lex (Stream.of_channel stdin) in
 891
 892           (* Run the main "interpreter loop" now. *)
 893           Toplevel.main_loop stream;
 894         ;;
 895
 896         main ()
 897
 898 `Next: Implementing Code Generation to LLVM IR <OCamlLangImpl3.html>`_
 899