===========================================
Kaleidoscope: Implementing a Parser and AST
===========================================

.. contents::
   :local:

Chapter 2 Introduction
======================

Welcome to Chapter 2 of the "`Implementing a language with LLVM in
Objective Caml <index.html>`_" tutorial. This chapter shows you how to
use the lexer, built in `Chapter 1 <OCamlLangImpl1.html>`_, to build a
full `parser <http://en.wikipedia.org/wiki/Parsing>`_ for our
Kaleidoscope language. Once we have a parser, we'll define and build an
`Abstract Syntax
Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).

The parser we will build uses a combination of `Recursive Descent
Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
`Operator-Precedence
Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
parse the Kaleidoscope language (the latter for binary expressions and
the former for everything else). Before we get to parsing, though, let's
talk about the output of the parser: the Abstract Syntax Tree.

The Abstract Syntax Tree (AST)
==============================

The AST for a program captures its behavior in such a way that it is
easy for later stages of the compiler (e.g. code generation) to
interpret. We basically want one type or variant for each construct in
the language, and the AST should closely model the language. In
Kaleidoscope, we have expressions, a prototype, and a function object.
We'll start with expressions:

.. code-block:: ocaml

    (* expr - Base type for all expression nodes. *)
    type expr =
      (* variant for numeric literals like "1.0". *)
      | Number of float

The code above shows the definition of the base ``expr`` type and one
variant, which we use for numeric literals. The important thing to note
about this code is that the ``Number`` variant carries the numeric value
of the literal as its argument. This allows later phases of the
compiler to know what the stored numeric value is.

Right now we only create the AST, so there are no useful functions that
operate on it. It would be very easy to add a function to pretty print
the code, for example. Here are the other expression AST node
definitions that we'll use in the basic form of the Kaleidoscope
language:

.. code-block:: ocaml

      (* variant for referencing a variable, like "a". *)
      | Variable of string

      (* variant for a binary operator. *)
      | Binary of char * expr * expr

      (* variant for function calls. *)
      | Call of string * expr array

This is all (intentionally) rather straightforward: variables capture
the variable name, binary operators capture their opcode (e.g. '+'), and
calls capture a function name as well as an array of argument
expressions. One thing that is nice about our AST is that it captures
the language features without talking about the syntax of the language.
Note that there is no discussion about precedence of binary operators,
lexical structure, etc.
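
The pretty printer mentioned above can be sketched in a few lines. This
is only an illustration: ``to_string`` is a hypothetical helper that is
not part of the tutorial's code, and the ``expr`` type is repeated so
the block stands alone.

```ocaml
(* Same shape as the tutorial's Ast.expr, repeated so the sketch stands alone. *)
type expr =
  | Number of float
  | Variable of string
  | Binary of char * expr * expr
  | Call of string * expr array

(* to_string is a hypothetical helper, not part of the tutorial.
 * It parenthesizes every binary expression, so it needs no precedence logic. *)
let rec to_string = function
  | Number n -> Printf.sprintf "%g" n
  | Variable name -> name
  | Binary (op, lhs, rhs) ->
      Printf.sprintf "(%s %c %s)" (to_string lhs) op (to_string rhs)
  | Call (callee, args) ->
      let args = Array.to_list (Array.map to_string args) in
      Printf.sprintf "%s(%s)" callee (String.concat ", " args)

(* AST for x+foo(y, 4.0), built by hand. *)
let example =
  Binary ('+', Variable "x", Call ("foo", [| Variable "y"; Number 4.0 |]))

let () = print_endline (to_string example)
```

Because the printer walks the tree rather than the token stream, the
grouping it prints comes entirely from the AST shape.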

For our basic language, these are all of the expression nodes we'll
define. Because it doesn't have conditional control flow, it isn't
Turing-complete; we'll fix that in a later installment. The two things
we need next are a way to talk about the interface to a function, and a
way to talk about functions themselves:

.. code-block:: ocaml

    (* proto - This type represents the "prototype" for a function, which captures
     * its name, and its argument names (thus implicitly the number of arguments the
     * function takes). *)
    type proto = Prototype of string * string array

    (* func - This type represents a function definition itself. *)
    type func = Function of proto * expr

In Kaleidoscope, functions are typed with just a count of their
arguments. Since all values are double-precision floating point, the
type of each argument doesn't need to be stored anywhere. In a more
aggressive and realistic language, the "expr" variants would probably
have a type field.
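
To make the "typed with just a count" point concrete, here is a small
sketch that builds the AST for ``def foo(x y) x+foo(y, 4.0)`` by hand.
The helpers ``arity`` and ``call_matches`` are hypothetical and are not
part of the tutorial's code; the types are repeated so the block stands
alone.

```ocaml
type expr =
  | Number of float
  | Variable of string
  | Binary of char * expr * expr
  | Call of string * expr array

type proto = Prototype of string * string array
type func = Function of proto * expr

(* AST for: def foo(x y) x+foo(y, 4.0) *)
let foo =
  Function
    ( Prototype ("foo", [| "x"; "y" |]),
      Binary ('+', Variable "x", Call ("foo", [| Variable "y"; Number 4.0 |])) )

(* Hypothetical helper: the argument count is the only signature
 * information a prototype carries. *)
let arity (Prototype (_, params)) = Array.length params

(* Hypothetical helper: does this call name the prototype and pass the
 * right number of arguments? *)
let call_matches proto = function
  | Call (callee, args) ->
      let (Prototype (name, _)) = proto in
      callee = name && Array.length args = arity proto
  | _ -> false
```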

With this scaffolding, we can now talk about parsing expressions and
function bodies in Kaleidoscope.

Parser Basics
=============

Now that we have an AST to build, we need to define the parser code to
build it. The idea here is that we want to parse something like "x+y"
(which is returned as three tokens by the lexer) into an AST that could
be built with expressions like this:

.. code-block:: ocaml

      let x = Variable "x" in
      let y = Variable "y" in
      let result = Binary ('+', x, y) in
      ...

The error handling routines make use of the built-in ``Stream.Failure``
and ``Stream.Error`` exceptions. ``Stream.Failure`` is raised when the
parser is unable to find any matching token in the first position of a
pattern. ``Stream.Error`` is raised when the first token matches, but
the rest do not. The error recovery in our parser will not be the best
and is not particularly user-friendly, but it will be enough for our
tutorial. These exceptions make it easier to handle errors in routines
that have various return types.
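
The difference between the two exceptions can be illustrated without
camlp4. The sketch below is an assumption-laden stand-in: it parses a
plain token list, and ``No_match`` and ``Partial_match`` are locally
declared exceptions that only play the roles of ``Stream.Failure`` and
``Stream.Error``. The rule ``pair`` is made up for the example.

```ocaml
(* Local stand-ins, so the sketch needs neither camlp4 nor Stream. *)
exception No_match                  (* plays the role of Stream.Failure *)
exception Partial_match of string   (* plays the role of Stream.Error *)

type token = Ident of string | Kwd of char

(* Hypothetical rule: pair ::= '(' ident ')'
 * Returns the identifier and the tokens that were not consumed. *)
let parse_pair = function
  | Kwd '(' :: rest ->
      (* The first token matched, so any later mismatch is a hard error. *)
      (match rest with
       | Ident id :: Kwd ')' :: rest' -> (id, rest')
       | Ident _ :: _ -> raise (Partial_match "expected ')'")
       | _ -> raise (Partial_match "expected identifier"))
  | _ ->
      (* Nothing was consumed, so a caller is free to try another rule. *)
      raise No_match

let describe tokens =
  match parse_pair tokens with
  | (id, _) -> "ok: " ^ id
  | exception No_match -> "failure"
  | exception Partial_match msg -> "error: " ^ msg
```

A caller that has several alternatives can treat ``No_match`` as "try
the next rule" and let ``Partial_match`` propagate as a real syntax
error.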

With these basic types and exceptions, we can implement the first piece
of our grammar: numeric literals.

Basic Expression Parsing
========================

We start with numeric literals, because they are the simplest to
process. For each production in our grammar, we'll define a function
which parses that production. We call this class of expressions
"primary" expressions, for reasons that will become more clear `later in
the tutorial <OCamlLangImpl6.html#unary>`_. In order to parse an
arbitrary primary expression, we need to determine what sort of
expression it is. For numeric literals, we have:

.. code-block:: ocaml

    (* primary
     *   ::= identifier
     *   ::= numberexpr
     *   ::= parenexpr *)
    let rec parse_primary = parser
      (* numberexpr ::= number *)
      | [< 'Token.Number n >] -> Ast.Number n

This routine is very simple: it expects to be called when the current
token is a ``Token.Number`` token. It takes the current number value,
creates an ``Ast.Number`` node, advances the lexer to the next token,
and finally returns.

There are some interesting aspects to this. The most important one is
that this routine eats all of the tokens that correspond to the
production and returns the lexer buffer with the next token (which is
not part of the grammar production) ready to go. This is a fairly
standard approach for recursive descent parsers. For a better example,
the parenthesis operator is defined like this:

.. code-block:: ocaml

      (* parenexpr ::= '(' expression ')' *)
      | [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e

This function illustrates a number of interesting things about the
parser:

1) It shows how we use the ``Stream.Error`` exception. When called, this
function expects that the current token is a '(' token, but after
parsing the subexpression, it is possible that there is no ')' waiting.
For example, if the user types in "(4 x" instead of "(4)", the parser
should emit an error. Because errors can occur, the parser needs a way
to indicate that they happened. In our parser, we use the camlp4
shortcut syntax ``token ?? "parse error"``, where if the token before
the ``??`` does not match, then ``Stream.Error "parse error"`` will be
raised.

2) Another interesting aspect of this function is that it uses recursion
by calling ``Parser.parse_expr`` (we will soon see that
``Parser.parse_expr`` can call ``Parser.parse_primary``). This is
powerful because it allows us to handle recursive grammars, and keeps
each production very simple. Note that parentheses do not cause
construction of AST nodes themselves. While we could do it this way, the
most important role of parentheses is to guide the parser and provide
grouping. Once the parser constructs the AST, parentheses are not
needed.
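
Both points can be seen in a stripped-down sketch. It makes several
assumptions: it parses a plain token list instead of a camlp4 stream,
the grammar is cut down to ``primary ::= number | '(' primary ')'``,
and ``Parse_error`` is a local exception standing in for
``Stream.Error``.

```ocaml
type token = Ident of string | Number of float | Kwd of char
type expr = Num of float

(* Local stand-in for Stream.Error. *)
exception Parse_error of string

(* primary ::= number | '(' primary ')'
 * Each call returns the parsed node plus the tokens it did not consume. *)
let rec parse_primary = function
  | Number n :: rest -> (Num n, rest)
  | Kwd '(' :: rest ->
      (* Recurse for the contents, then insist on the closing paren. *)
      let e, rest = parse_primary rest in
      (match rest with
       | Kwd ')' :: rest -> (e, rest)   (* parentheses add no AST node *)
       | _ -> raise (Parse_error "expected ')'"))
  | _ -> raise (Parse_error "unknown token when expecting an expression.")
```

Parsing the tokens for "((4))" yields the same ``Num 4.`` node as
parsing "4", while the tokens for "(4 x" raise the "expected ')'"
error.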

The next simple production is for handling variable references and
function calls:

.. code-block:: ocaml

      (* identifierexpr
       *   ::= identifier
       *   ::= identifier '(' argumentexpr ')' *)
      | [< 'Token.Ident id; stream >] ->
          let rec parse_args accumulator = parser
            | [< e=parse_expr; stream >] ->
                begin parser
                  | [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e
                  | [< >] -> e :: accumulator
                end stream
            | [< >] -> accumulator
          in
          let rec parse_ident id = parser
            (* Call. *)
            | [< 'Token.Kwd '(';
                 args=parse_args [];
                 'Token.Kwd ')' ?? "expected ')'">] ->
                Ast.Call (id, Array.of_list (List.rev args))

            (* Simple variable ref. *)
            | [< >] -> Ast.Variable id
          in
          parse_ident id stream

This routine follows the same style as the other routines (it expects
to be called if the current token is a ``Token.Ident`` token). It also
has recursion and error handling. One interesting aspect of this is that
it uses *look-ahead* to determine if the current identifier is a
standalone variable reference or if it is a function call expression. It
handles this by checking to see if the token after the identifier is a
'(' token, constructing either an ``Ast.Variable`` or ``Ast.Call`` node
as appropriate.
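
The look-ahead decision can be sketched on a plain token list, where it
becomes a two-token pattern. This is an illustration under stated
assumptions rather than the tutorial's camlp4 code: arguments are
restricted to primaries to keep it short, and ``Parse_error`` is a local
stand-in for ``Stream.Error``.

```ocaml
type token = Ident of string | Number of float | Kwd of char

type expr =
  | Num of float
  | Variable of string
  | Call of string * expr array

(* Local stand-in for Stream.Error. *)
exception Parse_error of string

let rec parse_primary = function
  | Number n :: rest -> (Num n, rest)
  | Ident id :: Kwd '(' :: rest ->
      (* Look-ahead saw '(' right after the identifier: this is a call. *)
      let args, rest = parse_args [] rest in
      (match rest with
       | Kwd ')' :: rest -> (Call (id, Array.of_list (List.rev args)), rest)
       | _ -> raise (Parse_error "expected ')'"))
  | Ident id :: rest -> (Variable id, rest)   (* no '(' follows: a variable *)
  | _ -> raise (Parse_error "unknown token when expecting an expression.")

(* Collects comma-separated arguments in reverse order; the caller
 * reverses the list once at the end. *)
and parse_args acc tokens =
  match tokens with
  | Kwd ')' :: _ -> (acc, tokens)             (* empty or finished list *)
  | _ ->
      let e, rest = parse_primary tokens in
      (match rest with
       | Kwd ',' :: rest -> parse_args (e :: acc) rest
       | _ -> (e :: acc, rest))
```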

We finish up by raising an exception if we received a token we didn't
expect:

.. code-block:: ocaml

      | [< >] -> raise (Stream.Error "unknown token when expecting an expression.")

Now that basic expressions are handled, we need to handle binary
expressions. They are a bit more complex.

Binary Expression Parsing
=========================

Binary expressions are significantly harder to parse because they are
often ambiguous. For example, when given the string "x+y\*z", the parser
can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common
definitions from mathematics, we expect the latter parse, because "\*"
(multiplication) has higher *precedence* than "+" (addition).

There are many ways to handle this, but an elegant and efficient way is
to use `Operator-Precedence
Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
This parsing technique uses the precedence of binary operators to guide
recursion. To start with, we need a table of precedences:

.. code-block:: ocaml

    (* binop_precedence - This holds the precedence for each binary operator that is
     * defined *)
    let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10

    (* precedence - Get the precedence of the pending binary operator token. *)
    let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1

    ...

    let main () =
      (* Install standard binary operators.
       * 1 is the lowest precedence. *)
      Hashtbl.add Parser.binop_precedence '<' 10;
      Hashtbl.add Parser.binop_precedence '+' 20;
      Hashtbl.add Parser.binop_precedence '-' 20;
      Hashtbl.add Parser.binop_precedence '*' 40;    (* highest. *)
      ...

For the basic form of Kaleidoscope, we will only support four binary
operators (this can obviously be extended by you, our brave and intrepid
reader). The ``Parser.precedence`` function returns the precedence for
the current token, or -1 if the token is not a binary operator. Having a
``Hashtbl.t`` makes it easy to add new operators and makes it clear that
the algorithm doesn't depend on the specific operators involved, but it
would be easy enough to eliminate the ``Hashtbl.t`` and do the
comparisons in the ``Parser.precedence`` function (or just use a
fixed-size array).
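
As a sketch of that alternative, the block below sets the table-driven
lookup next to a version that does the comparisons directly in a
``match``. The name ``precedence_fixed`` is hypothetical; the precedence
values are the ones installed in ``main`` above.

```ocaml
(* Table-driven version, as in the tutorial. *)
let binop_precedence : (char, int) Hashtbl.t = Hashtbl.create 10

let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1

let () =
  Hashtbl.add binop_precedence '<' 10;
  Hashtbl.add binop_precedence '+' 20;
  Hashtbl.add binop_precedence '-' 20;
  Hashtbl.add binop_precedence '*' 40

(* Hypothetical fixed version: same values, no table. Adding an operator
 * now means editing this function instead of calling Hashtbl.add. *)
let precedence_fixed = function
  | '<' -> 10
  | '+' | '-' -> 20
  | '*' -> 40
  | _ -> -1
```

The fixed version trades run-time extensibility for simplicity. Later
chapters of the tutorial add user-defined operators, which is where the
table becomes useful.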

With the helper above defined, we can now start parsing binary
expressions. The basic idea of operator precedence parsing is to break
down an expression with potentially ambiguous binary operators into
pieces. Consider, for example, the expression "a+b+(c+d)\*e\*f+g".
Operator precedence parsing considers this as a stream of primary
expressions separated by binary operators. As such, it will first parse
the leading primary expression "a", then it will see the pairs [+, b]
[+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are
primary expressions, the binary expression parser doesn't need to worry
about nested subexpressions like (c+d) at all.
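
The decomposition into a leading primary and a list of pairs can be
reproduced mechanically. The helper below is purely illustrative and
not part of the parser: ``split_pairs`` is a hypothetical function that
works on a whitespace-free string with single-character operators, and
it keeps a parenthesized group together by tracking nesting depth.

```ocaml
(* Hypothetical helper: split "a+b+(c+d)*e" into its leading primary and
 * its [operator, primary] pairs. Operators inside parentheses are ignored,
 * so a parenthesized group stays one primary. *)
let split_pairs (s : string) : string * (char * string) list =
  let is_op c = c = '+' || c = '-' || c = '*' || c = '<' in
  let buf = Buffer.create 16 in
  let depth = ref 0 in
  let primaries = ref [] in
  let ops = ref [] in
  String.iter
    (fun c ->
      if c = '(' then incr depth;
      if c = ')' then decr depth;
      if is_op c && !depth = 0 then begin
        (* A top-level operator ends the current primary. *)
        primaries := Buffer.contents buf :: !primaries;
        Buffer.clear buf;
        ops := c :: !ops
      end
      else Buffer.add_char buf c)
    s;
  primaries := Buffer.contents buf :: !primaries;
  match List.rev !primaries with
  | [] -> ("", [])
  | first :: rest -> (first, List.combine (List.rev !ops) rest)
```

For the running example, ``split_pairs "a+b+(c+d)*e*f+g"`` returns
``"a"`` together with the five pairs listed above.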

To start, an expression is a primary expression potentially followed by
a sequence of [binop,primaryexpr] pairs:

.. code-block:: ocaml

    (* expression
     *   ::= primary binoprhs *)
    and parse_expr = parser
      | [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream

``Parser.parse_bin_rhs`` is the function that parses the sequence of
pairs for us. It takes a precedence and the expression for the part
that has been parsed so far. Note that "x" is a perfectly valid
expression: as such, "binoprhs" is allowed to be empty, in which case it
returns the expression that is passed into it. In our example above, the
code passes the expression for "a" into ``Parser.parse_bin_rhs`` and the
current token is "+".

The precedence value passed into ``Parser.parse_bin_rhs`` indicates the
*minimal operator precedence* that the function is allowed to eat. For
example, if the current pair stream is [+, x] and
``Parser.parse_bin_rhs`` is passed in a precedence of 40, it will not
consume any tokens (because the precedence of '+' is only 20). With this
in mind, ``Parser.parse_bin_rhs`` starts with:

.. code-block:: ocaml

    (* binoprhs
     *   ::= ('+' primary)* *)
    and parse_bin_rhs expr_prec lhs stream =
      match Stream.peek stream with
      (* If this is a binop, find its precedence. *)
      | Some (Token.Kwd c) when Hashtbl.mem binop_precedence c ->
          let token_prec = precedence c in

          (* If this is a binop that binds at least as tightly as the current binop,
           * consume it, otherwise we are done. *)
          if token_prec < expr_prec then lhs else begin

This code gets the precedence of the current token and checks to see if
it is too low. Because we defined invalid tokens to have a precedence of
-1, this check implicitly knows that the pair-stream ends when the token
stream runs out of binary operators. If this check succeeds, we know
that the token is a binary operator and that it will be included in this
expression:

.. code-block:: ocaml

            (* Eat the binop. *)
            Stream.junk stream;

            (* Parse the primary expression after the binary operator. *)
            let rhs = parse_primary stream in

            (* Okay, we know this is a binop. *)
            let rhs =
              match Stream.peek stream with
              | Some (Token.Kwd c2) ->

As such, this code eats (and remembers) the binary operator and then
parses the primary expression that follows. This builds up the whole
pair, the first of which is [+, b] for the running example.

Now that we have parsed the left-hand side of an expression and one pair
of the RHS sequence, we have to decide which way the expression
associates. In particular, we could have "(a+b) binop unparsed" or
"a + (b binop unparsed)". To determine this, we look ahead at "binop" to
determine its precedence and compare it to BinOp's precedence (which is
'+' in this case):

.. code-block:: ocaml

                  (* If BinOp binds less tightly with rhs than the operator after
                   * rhs, let the pending operator take rhs as its lhs. *)
                  let next_prec = precedence c2 in
                  if token_prec < next_prec

If the precedence of the binop to the right of "RHS" is lower than or
equal to the precedence of our current operator, then we know that the
parentheses associate as "(a+b) binop ...". In our example, the current
operator is "+" and the next operator is "+", so we know that they have
the same precedence. In this case we'll create the AST node for "a+b",
and then continue parsing:

.. code-block:: ocaml

              ... if body omitted ...
            in

            (* Merge lhs/rhs. *)
            let lhs = Ast.Binary (c, lhs, rhs) in
            parse_bin_rhs expr_prec lhs stream
          end

In our example above, this will turn "a+b+" into "(a+b)" and make the
next recursive call to ``Parser.parse_bin_rhs``, with "+" as the current
token. The code above will eat, remember, and parse "(c+d)" as the
primary expression, which makes the current pair equal to [+, (c+d)]. It
will then evaluate the 'if' conditional above with "\*" as the binop to
the right of the primary. In this case, the precedence of "\*" is higher
than the precedence of "+", so the body of the 'if' will be entered.

The critical question left here is "how can the body of the 'if' parse
the right-hand side in full?" In particular, to build the AST correctly
for our example, it needs to get all of "(c+d)\*e\*f" as the RHS
expression variable. The code to do this is surprisingly simple (code
from the above two blocks duplicated for context):

.. code-block:: ocaml

              match Stream.peek stream with
              | Some (Token.Kwd c2) ->
                  (* If BinOp binds less tightly with rhs than the operator after
                   * rhs, let the pending operator take rhs as its lhs. *)
                  if token_prec < precedence c2
                  then parse_bin_rhs (token_prec + 1) rhs stream
                  else rhs
              | _ -> rhs
            in

            (* Merge lhs/rhs. *)
            let lhs = Ast.Binary (c, lhs, rhs) in
            parse_bin_rhs expr_prec lhs stream
          end

At this point, we know that the binary operator to the RHS of our
primary has higher precedence than the binop we are currently parsing.
As such, we know that any sequence of pairs whose operators are all
higher precedence than "+" should be parsed together and returned as
"RHS". To do this, we recursively invoke the ``Parser.parse_bin_rhs``
function specifying "token\_prec+1" as the minimum precedence required
for it to continue. In our example above, this will cause it to return
the AST node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of
the '+' expression.

Finally, on the next recursive call, the "+g" piece is parsed and added
to the AST. With this little bit of code (14 non-trivial lines), we
correctly handle fully general binary expression parsing in a very
elegant way. This was a whirlwind tour of this code, and it is somewhat
subtle. I recommend running through it with a few tough examples to see
how it works.
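
For readers who want to step through such examples without camlp4, the
same algorithm can be sketched over a plain token list. This is an
adaptation rather than the tutorial's code. Each function returns the
expression together with the unconsumed tokens instead of mutating a
stream. The precedence table is written as a ``match``, and
``Parse_error`` is a local stand-in for ``Stream.Error``.

```ocaml
type token = Ident of string | Number of float | Kwd of char

type expr =
  | Num of float
  | Variable of string
  | Binary of char * expr * expr

exception Parse_error of string

let precedence = function
  | '<' -> 10
  | '+' | '-' -> 20
  | '*' -> 40
  | _ -> -1

let rec parse_primary = function
  | Number n :: rest -> (Num n, rest)
  | Ident id :: rest -> (Variable id, rest)
  | Kwd '(' :: rest ->
      let e, rest = parse_expr rest in
      (match rest with
       | Kwd ')' :: rest -> (e, rest)
       | _ -> raise (Parse_error "expected ')'"))
  | _ -> raise (Parse_error "unknown token when expecting an expression.")

(* expression ::= primary binoprhs *)
and parse_expr tokens =
  let lhs, rest = parse_primary tokens in
  parse_bin_rhs 0 lhs rest

(* expr_prec is the minimal precedence this call is allowed to consume. *)
and parse_bin_rhs expr_prec lhs tokens =
  match tokens with
  | Kwd c :: rest when precedence c >= 0 ->
      let token_prec = precedence c in
      if token_prec < expr_prec then (lhs, tokens)
      else begin
        (* Parse the primary after the operator. *)
        let rhs, rest = parse_primary rest in
        (* If the next operator binds tighter, it takes rhs as its lhs. *)
        let rhs, rest =
          match rest with
          | Kwd c2 :: _ when token_prec < precedence c2 ->
              parse_bin_rhs (token_prec + 1) rhs rest
          | _ -> (rhs, rest)
        in
        parse_bin_rhs expr_prec (Binary (c, lhs, rhs)) rest
      end
  | _ -> (lhs, tokens)
```

With this sketch, the tokens for "a+b\*c" produce
``Binary ('+', a, Binary ('*', b, c))``, and the tokens for "a-b-c"
produce the left-associated ``Binary ('-', Binary ('-', a, b), c)``.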

This wraps up handling of expressions. At this point, we can point the
parser at an arbitrary token stream and build an expression from it,
stopping at the first token that is not part of the expression. Next up
we need to handle function definitions, etc.

Parsing the Rest
================

The next thing missing is handling of function prototypes. In
Kaleidoscope, these are used both for 'extern' function declarations as
well as function body definitions. The code to do this is
straightforward and not very interesting (once you've survived
expressions):

.. code-block:: ocaml

    (* prototype
     *   ::= id '(' id* ')' *)
    let parse_prototype =
      let rec parse_args accumulator = parser
        | [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e
        | [< >] -> accumulator
      in

      parser
      | [< 'Token.Ident id;
           'Token.Kwd '(' ?? "expected '(' in prototype";
           args=parse_args [];
           'Token.Kwd ')' ?? "expected ')' in prototype" >] ->
          (* success. *)
          Ast.Prototype (id, Array.of_list (List.rev args))

      | [< >] ->
          raise (Stream.Error "expected function name in prototype")
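
The same production can be sketched over a plain token list, which may
be easier to experiment with than the camlp4 version. The assumptions
are the usual ones for these sketches: a list replaces the stream, the
function returns the unconsumed tokens, and ``Parse_error`` is a local
stand-in for ``Stream.Error``.

```ocaml
type token = Ident of string | Kwd of char
type proto = Prototype of string * string array

(* Local stand-in for Stream.Error. *)
exception Parse_error of string

(* prototype ::= id '(' id* ')' *)
let parse_prototype = function
  | Ident name :: Kwd '(' :: rest ->
      (* Argument names are separated by whitespace only, so commas never
       * reach this loop as identifiers. *)
      let rec parse_args acc = function
        | Ident id :: rest -> parse_args (id :: acc) rest
        | rest -> (List.rev acc, rest)
      in
      let args, rest = parse_args [] rest in
      (match rest with
       | Kwd ')' :: rest -> (Prototype (name, Array.of_list args), rest)
       | _ -> raise (Parse_error "expected ')' in prototype"))
  | Ident _ :: _ -> raise (Parse_error "expected '(' in prototype")
  | _ -> raise (Parse_error "expected function name in prototype")
```

Note that a comma between argument names, as in "foo(x, y)", stops the
argument loop and triggers the "expected ')' in prototype" error.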

Given this, a function definition is very simple, just a prototype plus
an expression to implement the body:

.. code-block:: ocaml

    (* definition ::= 'def' prototype expression *)
    let parse_definition = parser
      | [< 'Token.Def; p=parse_prototype; e=parse_expr >] ->
          Ast.Function (p, e)

In addition, we support 'extern' to declare functions like 'sin' and
'cos' as well as to support forward declaration of user functions. These
'extern's are just prototypes with no body:

.. code-block:: ocaml

    (* external ::= 'extern' prototype *)
    let parse_extern = parser
      | [< 'Token.Extern; e=parse_prototype >] -> e

Finally, we'll also let the user type in arbitrary top-level expressions
and evaluate them on the fly. We will handle this by defining anonymous
nullary (zero argument) functions for them:

.. code-block:: ocaml

    (* toplevelexpr ::= expression *)
    let parse_toplevel = parser
      | [< e=parse_expr >] ->
          (* Make an anonymous proto. *)
          Ast.Function (Ast.Prototype ("", [||]), e)

Now that we have all the pieces, let's build a little driver that will
let us actually *execute* this code we've built!

The Driver
==========

The driver for this simply invokes all of the parsing pieces with a
top-level dispatch loop. There isn't much interesting here, so I'll just
include the top-level loop. See `below <#code>`_ for full code in the
"Top-Level Parsing" section.

.. code-block:: ocaml

    (* top ::= definition | external | expression | ';' *)
    let rec main_loop stream =
      match Stream.peek stream with
      | None -> ()

      (* ignore top-level semicolons. *)
      | Some (Token.Kwd ';') ->
          Stream.junk stream;
          main_loop stream

      | Some token ->
          begin
            try match token with
            | Token.Def ->
                ignore(Parser.parse_definition stream);
                print_endline "parsed a function definition.";
            | Token.Extern ->
                ignore(Parser.parse_extern stream);
                print_endline "parsed an extern.";
            | _ ->
                (* Evaluate a top-level expression into an anonymous function. *)
                ignore(Parser.parse_toplevel stream);
                print_endline "parsed a top-level expr";
            with Stream.Error s ->
              (* Skip token for error recovery. *)
              Stream.junk stream;
              print_endline s;
          end;
          print_string "ready> "; flush stdout;
          main_loop stream

The most interesting part of this is that we ignore top-level
semicolons. Why is this, you ask? The basic reason is that if you type
"4 + 5" at the command line, the parser doesn't know whether that is the
end of what you will type or not. For example, on the next line you
could type "def foo..." in which case 4+5 is the end of a top-level
expression. Alternatively you could type "\* 6", which would continue
the expression. Having top-level semicolons allows you to type "4+5;",
and the parser will know you are done.

Conclusions
===========

With just under 300 lines of commented code (240 lines of non-comment,
non-blank code), we fully defined our minimal language, including a
lexer, parser, and AST builder. With this done, the executable will
validate Kaleidoscope code and tell us if it is grammatically invalid.
For example, here is a sample interaction:

.. code-block:: bash

    $ ./toy.byte
    ready> def foo(x y) x+foo(y, 4.0);
    Parsed a function definition.
    ready> def foo(x y) x+y y;
    Parsed a function definition.
    Parsed a top-level expr
    ready> def foo(x y) x+y );
    Parsed a function definition.
    Error: unknown token when expecting an expression
    ready> extern sin(a);
    ready> Parsed an extern
    ready> ^D
    $

There is a lot of room for extension here. You can define new AST nodes,
extend the language in many ways, etc. In the `next
installment <OCamlLangImpl3.html>`_, we will describe how to generate
LLVM Intermediate Representation (IR) from the AST.

Full Code Listing
=================

Here is the complete code listing for this and the previous chapter.
Note that it is fully self-contained: you don't need LLVM or any
external libraries at all for this (besides the OCaml standard
libraries, of course). To build this, just compile with:

.. code-block:: bash

    # Compile
    ocamlbuild toy.byte
    # Run
    ./toy.byte

Here is the code:

\_tags:
    ::

        <{lexer,parser}.ml>: use_camlp4, pp(camlp4of)

token.ml:
    .. code-block:: ocaml

        (*===----------------------------------------------------------------------===
         * Lexer Tokens
         *===----------------------------------------------------------------------===*)

        (* The lexer returns these 'Kwd' if it is an unknown character, otherwise one of
         * these others for known things. *)
        type token =
          (* commands *)
          | Def | Extern

          (* primary *)
          | Ident of string | Number of float

          (* unknown *)
          | Kwd of char

lexer.ml:
    .. code-block:: ocaml

        (*===----------------------------------------------------------------------===
         * Lexer
         *===----------------------------------------------------------------------===*)

        let rec lex = parser
          (* Skip any whitespace. *)
          | [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream

          (* identifier: [a-zA-Z][a-zA-Z0-9]* *)
          | [< ' ('A' .. 'Z' | 'a' .. 'z' as c); stream >] ->
              let buffer = Buffer.create 1 in
              Buffer.add_char buffer c;
              lex_ident buffer stream

          (* number: [0-9.]+ *)
          | [< ' ('0' .. '9' as c); stream >] ->
              let buffer = Buffer.create 1 in
              Buffer.add_char buffer c;
              lex_number buffer stream

          (* Comment until end of line. *)
          | [< ' ('#'); stream >] ->
              lex_comment stream

          (* Otherwise, just return the character as its ascii value. *)
          | [< 'c; stream >] ->
              [< 'Token.Kwd c; lex stream >]

          (* end of stream. *)
          | [< >] -> [< >]

        and lex_number buffer = parser
          | [< ' ('0' .. '9' | '.' as c); stream >] ->
              Buffer.add_char buffer c;
              lex_number buffer stream
          | [< stream=lex >] ->
              [< 'Token.Number (float_of_string (Buffer.contents buffer)); stream >]

        and lex_ident buffer = parser
          | [< ' ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' as c); stream >] ->
              Buffer.add_char buffer c;
              lex_ident buffer stream
          | [< stream=lex >] ->
              match Buffer.contents buffer with
              | "def" -> [< 'Token.Def; stream >]
              | "extern" -> [< 'Token.Extern; stream >]
              | id -> [< 'Token.Ident id; stream >]

        and lex_comment = parser
          | [< ' ('\n'); stream=lex >] -> stream
          | [< 'c; e=lex_comment >] -> e
          | [< >] -> [< >]

ast.ml:
    .. code-block:: ocaml

        (*===----------------------------------------------------------------------===
         * Abstract Syntax Tree (aka Parse Tree)
         *===----------------------------------------------------------------------===*)

        (* expr - Base type for all expression nodes. *)
        type expr =
          (* variant for numeric literals like "1.0". *)
          | Number of float

          (* variant for referencing a variable, like "a". *)
          | Variable of string

          (* variant for a binary operator. *)
          | Binary of char * expr * expr

          (* variant for function calls. *)
          | Call of string * expr array

        (* proto - This type represents the "prototype" for a function, which captures
         * its name, and its argument names (thus implicitly the number of arguments the
         * function takes). *)
        type proto = Prototype of string * string array

        (* func - This type represents a function definition itself. *)
        type func = Function of proto * expr
 705
parser.ml:
    .. code-block:: ocaml

        (*===---------------------------------------------------------------------===
         * Parser
         *===---------------------------------------------------------------------===*)

        (* binop_precedence - This holds the precedence for each binary operator that is
         * defined *)
        let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10

        (* precedence - Get the precedence of the pending binary operator token. *)
        let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1

        (* primary
         *   ::= identifier
         *   ::= numberexpr
         *   ::= parenexpr *)
        let rec parse_primary = parser
          (* numberexpr ::= number *)
          | [< 'Token.Number n >] -> Ast.Number n

          (* parenexpr ::= '(' expression ')' *)
          | [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e

          (* identifierexpr
           *   ::= identifier
           *   ::= identifier '(' argumentexpr ')' *)
          | [< 'Token.Ident id; stream >] ->
              let rec parse_args accumulator = parser
                | [< e=parse_expr; stream >] ->
                    begin parser
                      | [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e
                      | [< >] -> e :: accumulator
                    end stream
                | [< >] -> accumulator
              in
              let rec parse_ident id = parser
                (* Call. *)
                | [< 'Token.Kwd '(';
                     args=parse_args [];
                     'Token.Kwd ')' ?? "expected ')'">] ->
                    Ast.Call (id, Array.of_list (List.rev args))

                (* Simple variable ref. *)
                | [< >] -> Ast.Variable id
              in
              parse_ident id stream

          | [< >] -> raise (Stream.Error "unknown token when expecting an expression.")

        (* binoprhs
         *   ::= ('+' primary)* *)
        and parse_bin_rhs expr_prec lhs stream =
          match Stream.peek stream with
          (* If this is a binop, find its precedence. *)
          | Some (Token.Kwd c) when Hashtbl.mem binop_precedence c ->
              let token_prec = precedence c in

              (* If this is a binop that binds at least as tightly as the current binop,
               * consume it, otherwise we are done. *)
              if token_prec < expr_prec then lhs else begin
                (* Eat the binop. *)
                Stream.junk stream;

                (* Parse the primary expression after the binary operator. *)
                let rhs = parse_primary stream in

                (* Okay, we know this is a binop. *)
                let rhs =
                  match Stream.peek stream with
                  | Some (Token.Kwd c2) ->
                      (* If BinOp binds less tightly with rhs than the operator after
                       * rhs, let the pending operator take rhs as its lhs. *)
                      let next_prec = precedence c2 in
                      if token_prec < next_prec
                      then parse_bin_rhs (token_prec + 1) rhs stream
                      else rhs
                  | _ -> rhs
                in

                (* Merge lhs/rhs. *)
                let lhs = Ast.Binary (c, lhs, rhs) in
                parse_bin_rhs expr_prec lhs stream
              end
          | _ -> lhs

        (* expression
         *   ::= primary binoprhs *)
        and parse_expr = parser
          | [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream

        (* prototype
         *   ::= id '(' id* ')' *)
        let parse_prototype =
          let rec parse_args accumulator = parser
            | [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e
            | [< >] -> accumulator
          in

          parser
          | [< 'Token.Ident id;
               'Token.Kwd '(' ?? "expected '(' in prototype";
               args=parse_args [];
               'Token.Kwd ')' ?? "expected ')' in prototype" >] ->
              (* success. *)
              Ast.Prototype (id, Array.of_list (List.rev args))

          | [< >] ->
              raise (Stream.Error "expected function name in prototype")

        (* definition ::= 'def' prototype expression *)
        let parse_definition = parser
          | [< 'Token.Def; p=parse_prototype; e=parse_expr >] ->
              Ast.Function (p, e)

        (* toplevelexpr ::= expression *)
        let parse_toplevel = parser
          | [< e=parse_expr >] ->
              (* Make an anonymous proto. *)
              Ast.Function (Ast.Prototype ("", [||]), e)

        (* external ::= 'extern' prototype *)
        let parse_extern = parser
          | [< 'Token.Extern; e=parse_prototype >] -> e

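The listing above relies on camlp4 stream parsers (the ``parser`` keyword and
``[< ... >]`` patterns). As a rough sketch of just the operator-precedence step
in ``parse_bin_rhs``, the version below works on a plain token list with
ordinary pattern matching. The simplified ``token`` and ``expr`` types, the
list-based interface, and the restriction of primaries to numbers and
identifiers are assumptions made for illustration; they are not the tutorial's
API.

.. code-block:: ocaml

    (* Simplified stand-ins for Token.token and Ast.expr (assumptions). *)
    type token = Number of float | Ident of string | Kwd of char

    type expr =
      | Num of float
      | Variable of string
      | Binary of char * expr * expr

    let binop_precedence : (char, int) Hashtbl.t = Hashtbl.create 10

    (* Unknown operators get -1, as in the listing. *)
    let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1

    (* primary ::= number | identifier (parentheses and calls omitted here).
     * Returns the parsed expression and the remaining tokens. *)
    let parse_primary = function
      | Number n :: rest -> (Num n, rest)
      | Ident id :: rest -> (Variable id, rest)
      | _ -> failwith "unknown token when expecting an expression."

    (* binoprhs ::= (binop primary)* *)
    let rec parse_bin_rhs expr_prec lhs tokens =
      match tokens with
      | Kwd c :: rest when Hashtbl.mem binop_precedence c ->
          let token_prec = precedence c in
          (* An operator that binds less tightly than the current one ends
           * this sub-expression; leave it in the token list. *)
          if token_prec < expr_prec then (lhs, tokens)
          else begin
            let rhs, rest = parse_primary rest in
            (* If the next operator binds tighter, it takes rhs as its lhs. *)
            let rhs, rest =
              match rest with
              | Kwd c2 :: _ when token_prec < precedence c2 ->
                  parse_bin_rhs (token_prec + 1) rhs rest
              | _ -> (rhs, rest)
            in
            (* Merge lhs/rhs and keep looking for further operators. *)
            parse_bin_rhs expr_prec (Binary (c, lhs, rhs)) rest
          end
      | _ -> (lhs, tokens)

    (* expression ::= primary binoprhs *)
    let parse_expr tokens =
      let lhs, rest = parse_primary tokens in
      parse_bin_rhs 0 lhs rest

    (* Install the same precedences that toy.ml uses. *)
    let () =
      List.iter
        (fun (op, prec) -> Hashtbl.add binop_precedence op prec)
        [ ('<', 10); ('+', 20); ('-', 20); ('*', 40) ]

    (* Tokens for: a + b * c - d *)
    let ast, leftover =
      parse_expr
        [ Ident "a"; Kwd '+'; Ident "b"; Kwd '*'; Ident "c"; Kwd '-'; Ident "d" ]

With these precedences, ``a + b * c - d`` groups as ``(a + (b * c)) - d``: the
``*`` is folded into the right-hand side of ``+`` by the recursive call with
``token_prec + 1``, while ``-`` has the same precedence as ``+`` and is
therefore merged afterwards, giving left associativity.
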
toplevel.ml:
    .. code-block:: ocaml

        (*===----------------------------------------------------------------------===
         * Top-Level parsing and JIT Driver
         *===----------------------------------------------------------------------===*)

        (* top ::= definition | external | expression | ';' *)
        let rec main_loop stream =
          match Stream.peek stream with
          | None -> ()

          (* ignore top-level semicolons. *)
          | Some (Token.Kwd ';') ->
              Stream.junk stream;
              main_loop stream

          | Some token ->
              begin
                try match token with
                | Token.Def ->
                    ignore(Parser.parse_definition stream);
                    print_endline "parsed a function definition.";
                | Token.Extern ->
                    ignore(Parser.parse_extern stream);
                    print_endline "parsed an extern.";
                | _ ->
                    (* Evaluate a top-level expression into an anonymous function. *)
                    ignore(Parser.parse_toplevel stream);
                    print_endline "parsed a top-level expr";
                with Stream.Error s ->
                  (* Skip token for error recovery. *)
                  Stream.junk stream;
                  print_endline s;
              end;
              print_string "ready> "; flush stdout;
              main_loop stream

toy.ml:
    .. code-block:: ocaml

        (*===----------------------------------------------------------------------===
         * Main driver code.
         *===----------------------------------------------------------------------===*)

        let main () =
          (* Install standard binary operators.
           * 1 is the lowest precedence. *)
          Hashtbl.add Parser.binop_precedence '<' 10;
          Hashtbl.add Parser.binop_precedence '+' 20;
          Hashtbl.add Parser.binop_precedence '-' 20;
          Hashtbl.add Parser.binop_precedence '*' 40;    (* highest. *)

          (* Prime the first token. *)
          print_string "ready> "; flush stdout;
          let stream = Lexer.lex (Stream.of_channel stdin) in

          (* Run the main "interpreter loop" now. *)
          Toplevel.main_loop stream;
        ;;

        main ()

`Next: Implementing Code Generation to LLVM IR <OCamlLangImpl3.html>`_
