1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
2 "http://www.w3.org/TR/html4/strict.dtd">
6 <title>Kaleidoscope: Implementing a Parser and AST</title>
7 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
8 <meta name="author" content="Chris Lattner">
9 <link rel="stylesheet" href="../llvm.css" type="text/css">
14 <div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
16 <div class="doc_author">
17 <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
20 <!-- *********************************************************************** -->
21 <div class="doc_section"><a name="intro">Part 2 Introduction</a></div>
22 <!-- *********************************************************************** -->
24 <div class="doc_text">
26 <p>Welcome to part 2 of the "<a href="index.html">Implementing a language with
27 LLVM</a>" tutorial. This chapter shows you how to use the <a
28 href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
29 href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
30 our Kaleidoscope language and build an <a
31 href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
34 <p>The parser we will build uses a combination of <a
35 href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
36 Parsing</a> and <a href=
37 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
38 Parsing</a> to parse the Kaleidoscope language (the later for binary expression
39 and the former for everything else). Before we get to parsing though, lets talk
40 about the output of the parser: the Abstract Syntax Tree.</p>
44 <!-- *********************************************************************** -->
45 <div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
46 <!-- *********************************************************************** -->
48 <div class="doc_text">
50 <p>The AST for a program captures its behavior in a way that it is easy for
51 later stages of the compiler (e.g. code generation) to interpret. We basically
52 want one object for each construct in the language, and the AST should closely
53 model the language. In Kaleidoscope, we have expressions, a prototype, and a
54 function object. We'll start with expressions first:</p>
56 <div class="doc_code">
58 /// ExprAST - Base class for all expression nodes.
64 /// NumberExprAST - Expression class for numeric literals like "1.0".
65 class NumberExprAST : public ExprAST {
68 explicit NumberExprAST(double val) : Val(val) {}
73 <p>The code above shows the definition of the base ExprAST class and one
74 subclass which we use for numeric literals. The important thing about this is
75 that the NumberExprAST class captures the numeric value of the literal in the
76 class, so that later phases of the compiler can know what it is.</p>
78 <p>Right now we only create the AST, so there are no useful accessor methods on
79 them. It would be very easy to add a virtual method to pretty print the code,
80 for example. Here are the other expression AST node definitions that we'll use
81 in the basic form of the Kaleidoscope language.
84 <div class="doc_code">
86 /// VariableExprAST - Expression class for referencing a variable, like "a".
87 class VariableExprAST : public ExprAST {
90 explicit VariableExprAST(const std::string &name) : Name(name) {}
93 /// BinaryExprAST - Expression class for a binary operator.
94 class BinaryExprAST : public ExprAST {
98 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
99 : Op(op), LHS(lhs), RHS(rhs) {}
102 /// CallExprAST - Expression class for function calls.
103 class CallExprAST : public ExprAST {
105 std::vector<ExprAST*> Args;
107 CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
108 : Callee(callee), Args(args) {}
113 <p>This is all (intentially) rather straight-forward: variables capture the
114 variable name, binary operators capture their opcode (e.g. '+'), and calls
115 capture a function name and list of argument expressions. One thing that is
116 nice about our AST is that it captures the language features without talking
117 about the syntax of the language. Note that there is no discussion about
118 precedence of binary operators, lexical structure etc.</p>
120 <p>For our basic language, these are all of the expression nodes we'll define.
121 Because it doesn't have conditional control flow, it isn't Turing-complete;
122 we'll fix that in a later installment. The two things we need next are a way
123 to talk about the interface to a function, and a way to talk about functions
126 <div class="doc_code">
128 /// PrototypeAST - This class represents the "prototype" for a function,
129 /// which captures its argument names as well as if it is an operator.
132 std::vector<std::string> Args;
134 PrototypeAST(const std::string &name, const std::vector<std::string> &args)
135 : Name(name), Args(args) {}
138 /// FunctionAST - This class represents a function definition itself.
143 FunctionAST(PrototypeAST *proto, ExprAST *body)
144 : Proto(proto), Body(body) {}
149 <p>In Kaleidoscope, functions are typed with just a count of their arguments.
150 Since all values are double precision floating point, this fact doesn't need to
151 be captured anywhere. In a more aggressive and realistic language, the
152 "ExprAST" class would probably have a type field.</p>
154 <p>With this scaffolding, we can now talk about parsing expressions and function
155 bodies in Kaleidoscope.</p>
159 <!-- *********************************************************************** -->
160 <div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
161 <!-- *********************************************************************** -->
163 <div class="doc_text">
165 <p>Now that we have an AST to build, we need to define the parser code to build
166 it. The idea here is that we want to parse something like "x+y" (which is
167 returned as three tokens by the lexer) into an AST that could be generated with
170 <div class="doc_code">
172 ExprAST *X = new VariableExprAST("x");
173 ExprAST *Y = new VariableExprAST("y");
174 ExprAST *Result = new BinaryExprAST('+', X, Y);
178 <p>In order to do this, we'll start by defining some basic helper routines:</p>
180 <div class="doc_code">
182 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
183 /// token the parser it looking at. getNextToken reads another token from the
184 /// lexer and updates CurTok with its results.
186 static int getNextToken() {
187 return CurTok = gettok();
193 This implements a simple token buffer around the lexer. This allows
194 us to look one token ahead at what the lexer is returning. Every function in
195 our lexer will assume that CurTok is the current token that needs to be
198 <p>Again, we define these with global variables; it would be better design to
199 wrap the entire parser in a class and use instance variables for these.
202 <div class="doc_code">
205 /// Error* - These are little helper functions for error handling.
206 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
207 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
208 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
213 The <tt>Error</tt> routines are simple helper routines that our parser will use
214 to handle errors. The error recovery in our parser will not be the best and
215 are not particular user-friendly, but it will be enough for our tutorial. These
216 routines make it easier to handle errors in routines that have various return
217 types: they always return null.</p>
219 <p>With these basic helper functions implemented, we can implement the first
220 piece of our grammar: we'll start with numeric literals.</p>
224 <!-- *********************************************************************** -->
225 <div class="doc_section"><a name="parserprimexprs">Basic Expression
227 <!-- *********************************************************************** -->
229 <div class="doc_text">
231 <p>We start with numeric literals, because they are the simplest to process.
232 For each production in our grammar, we'll define a function which parses that
233 production. For numeric literals, we have:
236 <div class="doc_code">
238 /// numberexpr ::= number
239 static ExprAST *ParseNumberExpr() {
240 ExprAST *Result = new NumberExprAST(NumVal);
241 getNextToken(); // consume the number
247 <p>This routine is very simple: it expects to be called when the current token
248 is a <tt>tok_number</tt> token. It takes the current number value, creates
249 a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
252 <p>There are some interesting aspects of this. The most important one is that
253 this routine eats all of the tokens that correspond to the production, and
254 returns the lexer buffer with the next token (which is not part of the grammar
255 production) ready to go. This is a fairly standard way to go for recursive
256 descent parsers. For a better example, the parenthesis operator is defined like
259 <div class="doc_code">
261 /// parenexpr ::= '(' expression ')'
262 static ExprAST *ParseParenExpr() {
263 getNextToken(); // eat (.
264 ExprAST *V = ParseExpression();
268 return Error("expected ')'");
269 getNextToken(); // eat ).
275 <p>This function illustrates a number of interesting things about the parser:
276 1) it shows how we use the Error routines. When called, this function expects
277 that the current token is a '(' token, but after parsing the subexpression, it
278 is possible that there is not a ')' waiting. For example, if the user types in
279 "(4 x" instead of "(4)", the parser should emit an error. Because errors can
280 occur, the parser needs a way to indicate that they happened: in our parser, we
281 return null on an error.</p>
283 <p>Another interesting aspect of this function is that it uses recursion by
284 calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
285 <tt>ParseParenExpr</tt>). This is powerful because it allows us to handle
286 recursive grammars, and keeps each production very simple. Note that
287 parenthesis do not cause construction of AST nodes themselves. While we could
288 do this, the most important role of parens are to guide the parser and provide
289 grouping. Once the parser constructs the AST, parens are not needed.</p>
291 <p>The next simple production is for handling variable references and function
294 <div class="doc_code">
298 /// ::= identifer '(' expression* ')'
299 static ExprAST *ParseIdentifierExpr() {
300 std::string IdName = IdentifierStr;
302 getNextToken(); // eat identifer.
304 if (CurTok != '(') // Simple variable ref.
305 return new VariableExprAST(IdName);
308 getNextToken(); // eat (
309 std::vector<ExprAST*> Args;
311 ExprAST *Arg = ParseExpression();
315 if (CurTok == ')') break;
318 return Error("Expected ')'");
325 return new CallExprAST(IdName, Args);
330 <p>This routine follows the same style as the other routines (it expects to be
331 called if the current token is a <tt>tok_identifier</tt> token). It also has
332 recursion and error handling. One interesting aspect of this is that it uses
333 <em>look-ahead</em> to determine if the current identifier is a stand alone
334 variable reference or if it is a function call expression. It handles this by
335 checking to see if the token after the identifier is a '(' token, and constructs
336 either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
339 <p>Now that we have all of our simple expression parsing logic in place, we can
340 define a helper function to wrap them up in a class. We call this class of
341 expressions "primary" expressions, for reasons that will become more clear
342 later. In order to parse a primary expression, we need to determine what sort
343 of expression it is:</p>
345 <div class="doc_code">
348 /// ::= identifierexpr
351 static ExprAST *ParsePrimary() {
353 default: return Error("unknown token when expecting an expression");
354 case tok_identifier: return ParseIdentifierExpr();
355 case tok_number: return ParseNumberExpr();
356 case '(': return ParseParenExpr();
362 <p>Now that you see the definition of this function, it makes it more obvious
363 why we can assume the state of CurTok in the various functions. This uses
364 look-ahead to determine which sort of expression is being inspected, and parses
365 it with a function call.</p>
367 <p>Now that basic expressions are handled, we need to handle binary expressions,
368 which are a bit more complex.</p>
372 <!-- *********************************************************************** -->
373 <div class="doc_section"><a name="parserbinops">Binary Expression
375 <!-- *********************************************************************** -->
377 <div class="doc_text">
379 <p>Binary expressions are significantly harder to parse because they are often
380 ambiguous. For example, when given the string "x+y*z", the parser can choose
381 to parse it as either "(x+y)*z" or "x+(y*z)". With common definitions from
382 mathematics, we expect the later parse, because "*" (multiplication) has
383 higher <em>precedence</em> than "+" (addition).</p>
385 <p>There are many ways to handle this, but an elegant and efficient way is to
387 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
388 Parsing</a>. This parsing technique uses the precedence of binary operators to
389 guide recursion. To start with, we need a table of precedences:</p>
391 <div class="doc_code">
393 /// BinopPrecedence - This holds the precedence for each binary operator that is
395 static std::map<char, int> BinopPrecedence;
397 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
398 static int GetTokPrecedence() {
399 if (!isascii(CurTok))
402 // Make sure it's a declared binop.
403 int TokPrec = BinopPrecedence[CurTok];
404 if (TokPrec <= 0) return -1;
409 // Install standard binary operators.
410 // 1 is lowest precedence.
411 BinopPrecedence['<'] = 10;
412 BinopPrecedence['+'] = 20;
413 BinopPrecedence['-'] = 20;
414 BinopPrecedence['*'] = 40; // highest.
420 <p>For the basic form of Kaleidoscope, we will only support 4 binary operators
421 (this can obviously be extended by you, the reader). The
422 <tt>GetTokPrecedence</tt> function returns the precedence for the current token,
423 or -1 if the token is not a binary operator. Having a map makes it easy to add
424 new operators and makes it clear that the algorithm doesn't depend on the
425 specific operators involved, but it would be easy enough to eliminate the map
426 and do the comparisons in the <tt>GetTokPrecedence</tt> function.</p>
428 <p>With the helper above defined, we can now start parsing binary expressions.
429 The basic idea of operator precedence parsing is to break down an expression
430 with potentially ambiguous binary operators into pieces. Consider for example
431 the expression "a+b+(c+d)*e*f+g". Operator precedence parsing considers this
432 as a stream of primary expressions separated by binary operators. As such,
433 it will first parse the leading primary expression "a", then it will see the
434 pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g]. Note that because parentheses
435 are primary expressions that the binary expression parser doesn't need to worry
436 about nested subexpressions like (c+d) at all.
440 To start, an expression is a primary expression potentially followed by a
441 sequence of [binop,primaryexpr] pairs:</p>
443 <div class="doc_code">
446 /// ::= primary binoprhs
448 static ExprAST *ParseExpression() {
449 ExprAST *LHS = ParsePrimary();
452 return ParseBinOpRHS(0, LHS);
457 <p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
458 us. It takes a precedence and a pointer to an expression for the part parsed
459 so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is
460 allowed to be empty, in which case it returns the expression that is passed into
461 it. In our example above, the code passes the expression for "a" into
462 <tt>ParseBinOpRHS</tt> and the current token is "+".</p>
464 <p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
465 minimal operator precedence</em> that the function is allowed to eat. For
466 example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
467 passed in a precedence of 40, it will not consume any tokens (because the
468 precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts
471 <div class="doc_code">
474 /// ::= ('+' primary)*
475 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
476 // If this is a binop, find its precedence.
478 int TokPrec = GetTokPrecedence();
480 // If this is a binop that binds at least as tightly as the current binop,
481 // consume it, otherwise we are done.
482 if (TokPrec < ExprPrec)
487 <p>This code gets the precedence of the current token and checks to see if if is
488 too low. Because we defined invalid tokens to have a precedence of -1, this
489 check implicitly knows that the pair-stream ends when the token stream runs out
490 of binary operators. If this check succeeds, we know that the token is a binary
491 operator and that it will be included in this expression:</p>
493 <div class="doc_code">
495 // Okay, we know this is a binop.
497 getNextToken(); // eat binop
499 // Parse the primary expression after the binary operator.
500 ExprAST *RHS = ParsePrimary();
505 <p>As such, this code eats (and remembers) the binary operator and then parses
506 the following primary expression. This builds up the whole pair, the first of
507 which is [+, b] for the running example.</p>
509 <p>Now that we parsed the left-hand side of an expression and one pair of the
510 RHS sequence, we have to decide which way the expression associates. In
511 particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
512 To determine this, we look ahead at "binop" to determine its precedence and
513 compare it to BinOp's precedence (which is '+' in this case):</p>
515 <div class="doc_code">
517 // If BinOp binds less tightly with RHS than the operator after RHS, let
518 // the pending operator take RHS as its LHS.
519 int NextPrec = GetTokPrecedence();
520 if (TokPrec < NextPrec) {
524 <p>If the precedence of the binop to the right of "RHS" is lower or equal to the
525 precedence of our current operator, then we know that the parentheses associate
526 as "(a+b) binop ...". In our example, since the next operator is "+" and so is
527 our current one, we know that they have the same precedence. In this case we'll
528 create the AST node for "a+b", and then continue parsing:</p>
530 <div class="doc_code">
532 ... if body omitted ...
536 LHS = new BinaryExprAST(BinOp, LHS, RHS);
537 } // loop around to the top of the while loop.
542 <p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
543 iteration of the loop, with "+" as the current token. The code above will eat
544 and remember it and parse "(c+d)" as the primary expression, which makes the
545 current pair be [+, (c+d)]. It will then enter the 'if' above with "*" as the
546 binop to the right of the primary. In this case, the precedence of "*" is
547 higher than the precedence of "+" so the if condition will be entered.</p>
549 <p>The critical question left here is "how can the if condition parse the right
550 hand side in full"? In particular, to build the AST correctly for our example,
551 it needs to get all of "(c+d)*e*f" as the RHS expression variable. The code to
552 do this is surprisingly simple (code from the above two blocks duplicated for
555 <div class="doc_code">
557 // If BinOp binds less tightly with RHS than the operator after RHS, let
558 // the pending operator take RHS as its LHS.
559 int NextPrec = GetTokPrecedence();
560 if (TokPrec < NextPrec) {
561 RHS = ParseBinOpRHS(TokPrec+1, RHS);
562 if (RHS == 0) return 0;
565 LHS = new BinaryExprAST(BinOp, LHS, RHS);
566 } // loop around to the top of the while loop.
571 <p>At this point, we know that the binary operator to the RHS of our primary
572 has higher precedence than the binop we are currently parsing. As such, we know
573 that any sequence of pairs whose operators are all higher precedence than "+"
574 should be parsed together and returned as "RHS". To do this, we recursively
575 invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
576 precedence required for it to continue. In our example above, this will cause
577 it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
578 of the '+' expression.</p>
580 <p>Finally, on the next iteration of the while loop, the "+g" piece is parsed.
581 and added to the AST. With this little bit of code (14 non-trivial lines), we
582 correctly handle fully general binary expression parsing in a very elegant way.
585 <p>This wraps up handling of expressions. At this point, we can point the
586 parser at an arbitrary token stream and build an expression from them, stopping
587 at the first token that is not part of the expression. Next up we need to
588 handle function definitions etc.</p>
592 <!-- *********************************************************************** -->
593 <div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
594 <!-- *********************************************************************** -->
596 <div class="doc_text">
599 The first basic thing missing is that of function prototypes. In Kaleidoscope,
600 these are used both for 'extern' function declarations as well as function body
601 definitions. The code to do this is straight-forward and not very interesting
602 (once you've survived expressions):
605 <div class="doc_code">
608 /// ::= id '(' id* ')'
609 static PrototypeAST *ParsePrototype() {
610 if (CurTok != tok_identifier)
611 return ErrorP("Expected function name in prototype");
613 std::string FnName = IdentifierStr;
617 return ErrorP("Expected '(' in prototype");
619 std::vector<std::string> ArgNames;
620 while (getNextToken() == tok_identifier)
621 ArgNames.push_back(IdentifierStr);
623 return ErrorP("Expected ')' in prototype");
626 getNextToken(); // eat ')'.
628 return new PrototypeAST(FnName, ArgNames);
633 <p>Given this, a function definition is very simple, just a prototype plus
634 and expression to implement the body:</p>
636 <div class="doc_code">
638 /// definition ::= 'def' prototype expression
639 static FunctionAST *ParseDefinition() {
640 getNextToken(); // eat def.
641 PrototypeAST *Proto = ParsePrototype();
642 if (Proto == 0) return 0;
644 if (ExprAST *E = ParseExpression())
645 return new FunctionAST(Proto, E);
651 <p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
652 well as to support forward declaration of user functions. 'externs' are just
653 prototypes with no body:</p>
655 <div class="doc_code">
657 /// external ::= 'extern' prototype
658 static PrototypeAST *ParseExtern() {
659 getNextToken(); // eat extern.
660 return ParsePrototype();
665 <p>Finally, we'll also let the user type in arbitrary top-level expressions and
666 evaluate them on the fly. We will handle this by defining anonymous nullary
667 (zero argument) functions for them:</p>
669 <div class="doc_code">
671 /// toplevelexpr ::= expression
672 static FunctionAST *ParseTopLevelExpr() {
673 if (ExprAST *E = ParseExpression()) {
674 // Make an anonymous proto.
675 PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
676 return new FunctionAST(Proto, E);
683 <p>Now that we have all the pieces, lets build a little driver that will let us
684 actually <em>execute</em> this code we've built!</p>
688 <!-- *********************************************************************** -->
689 <div class="doc_section"><a name="driver">The Driver</a></div>
690 <!-- *********************************************************************** -->
692 <div class="doc_text">
694 <p>The driver for this simply invokes all of the parsing pieces with a top-level
695 dispatch loop. There isn't much interesting here, so I'll just include the
696 top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
697 Parsing" section.</p>
699 <div class="doc_code">
701 /// top ::= definition | external | expression | ';'
702 static void MainLoop() {
704 fprintf(stderr, "ready> ");
706 case tok_eof: return;
707 case ';': getNextToken(); break; // ignore top level semicolons.
708 case tok_def: HandleDefinition(); break;
709 case tok_extern: HandleExtern(); break;
710 default: HandleTopLevelExpression(); break;
717 <p>The most interesting part of this is that we ignore top-level semi colons.
718 Why is this, you ask? The basic reason is that if you type "4 + 5" at the
719 command line, the parser doesn't know that that is the end of what you will
720 type. For example, on the next line you could type "def foo..." in which case
721 4+5 is the end of a top-level expression. Alternatively you could type "* 6",
722 which would continue the expression. Having top-level semicolons allows you to
723 type "4+5;" and the parser will know you are done.</p>
727 <!-- *********************************************************************** -->
728 <div class="doc_section"><a name="code">Conclusions and the Full Code</a></div>
729 <!-- *********************************************************************** -->
731 <div class="doc_text">
733 <p>With just under 400 lines of commented code, we fully defined our minimal
734 language, including a lexer, parser and AST builder. With this done, the
735 executable will validate code and tell us if it is gramatically invalid. For
736 example, here is a sample interaction:</p>
738 <div class="doc_code">
741 ready> def foo(x y) x+foo(y, 4.0);
742 ready> Parsed an function definition.
743 ready> def foo(x y) x+y y;
744 ready> Parsed an function definition.
745 ready> Parsed a top-level expr
746 ready> def foo(x y) x+y );
747 ready> Parsed an function definition.
748 ready> Error: unknown token when expecting an expression
749 ready> extern sin(a);
750 ready> Parsed an extern
756 <p>There is a lot of room for extension here. You can define new AST nodes,
757 extend the language in many ways, etc. In the <a href="LangImpl3.html">next
758 installment</a>, we will describe how to generate LLVM IR from the AST.</p>
762 <!-- *********************************************************************** -->
763 <div class="doc_section"><a name="code">Full Code Listing</a></div>
764 <!-- *********************************************************************** -->
766 <div class="doc_text">
769 Here is the complete code listing for this and the previous chapter.
770 Note that it is fully self-contained: you don't need LLVM or any external
771 libraries at all for this (other than the C and C++ standard libraries of
772 course). To build this, just compile with:</p>
774 <div class="doc_code">
783 <p>Here is the code:</p>
785 <div class="doc_code">
787 #include <cstdio>
788 #include <string>
790 #include <vector>
792 //===----------------------------------------------------------------------===//
794 //===----------------------------------------------------------------------===//
796 // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
797 // of these for known things.
802 tok_def = -2, tok_extern = -3,
805 tok_identifier = -4, tok_number = -5,
808 static std::string IdentifierStr; // Filled in if tok_identifier
809 static double NumVal; // Filled in if tok_number
811 /// gettok - Return the next token from standard input.
812 static int gettok() {
813 static int LastChar = ' ';
815 // Skip any whitespace.
816 while (isspace(LastChar))
817 LastChar = getchar();
819 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
820 IdentifierStr = LastChar;
821 while (isalnum((LastChar = getchar())))
822 IdentifierStr += LastChar;
824 if (IdentifierStr == "def") return tok_def;
825 if (IdentifierStr == "extern") return tok_extern;
826 return tok_identifier;
829 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
833 LastChar = getchar();
834 } while (isdigit(LastChar) || LastChar == '.');
836 NumVal = strtod(NumStr.c_str(), 0);
840 if (LastChar == '#') {
841 // Comment until end of line.
842 do LastChar = getchar();
843 while (LastChar != EOF && LastChar != '\n' & LastChar != '\r');
849 // Check for end of file. Don't eat the EOF.
853 // Otherwise, just return the character as its ascii value.
854 int ThisChar = LastChar;
855 LastChar = getchar();
859 //===----------------------------------------------------------------------===//
860 // Abstract Syntax Tree (aka Parse Tree)
861 //===----------------------------------------------------------------------===//
863 /// ExprAST - Base class for all expression nodes.
866 virtual ~ExprAST() {}
869 /// NumberExprAST - Expression class for numeric literals like "1.0".
870 class NumberExprAST : public ExprAST {
873 explicit NumberExprAST(double val) : Val(val) {}
876 /// VariableExprAST - Expression class for referencing a variable, like "a".
877 class VariableExprAST : public ExprAST {
880 explicit VariableExprAST(const std::string &name) : Name(name) {}
883 /// BinaryExprAST - Expression class for a binary operator.
884 class BinaryExprAST : public ExprAST {
888 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
889 : Op(op), LHS(lhs), RHS(rhs) {}
892 /// CallExprAST - Expression class for function calls.
893 class CallExprAST : public ExprAST {
895 std::vector<ExprAST*> Args;
897 CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
898 : Callee(callee), Args(args) {}
901 /// PrototypeAST - This class represents the "prototype" for a function,
902 /// which captures its argument names as well as if it is an operator.
905 std::vector< Args;
907 PrototypeAST(const std::string &name, const std::vector<std::string> &args)
908 : Name(name), Args(args) {}
912 /// FunctionAST - This class represents a function definition itself.
917 FunctionAST(PrototypeAST *proto, ExprAST *body)
918 : Proto(proto), Body(body) {}
922 //===----------------------------------------------------------------------===//
924 //===----------------------------------------------------------------------===//
926 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
927 /// token the parser it looking at. getNextToken reads another token from the
928 /// lexer and updates CurTok with its results.
930 static int getNextToken() {
931 return CurTok = gettok();
934 /// BinopPrecedence - This holds the precedence for each binary operator that is
936 static std::map<char, int> BinopPrecedence;
938 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
939 static int GetTokPrecedence() {
940 if (!isascii(CurTok))
943 // Make sure it's a declared binop.
944 int TokPrec = BinopPrecedence[CurTok];
945 if (TokPrec <= 0) return -1;
949 /// Error* - These are little helper functions for error handling.
950 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
951 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
952 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
954 static ExprAST *ParseExpression();
958 /// ::= identifer '(' expression* ')'
959 static ExprAST *ParseIdentifierExpr() {
960 std::string IdName = IdentifierStr;
962 getNextToken(); // eat identifer.
964 if (CurTok != '(') // Simple variable ref.
965 return new VariableExprAST(IdName);
968 getNextToken(); // eat (
969 std::vector<ExprAST*> Args;
971 ExprAST *Arg = ParseExpression();
975 if (CurTok == ')') break;
978 return Error("Expected ')'");
985 return new CallExprAST(IdName, Args);
988 /// numberexpr ::= number
989 static ExprAST *ParseNumberExpr() {
990 ExprAST *Result = new NumberExprAST(NumVal);
991 getNextToken(); // consume the number
995 /// parenexpr ::= '(' expression ')'
996 static ExprAST *ParseParenExpr() {
997 getNextToken(); // eat (.
998 ExprAST *V = ParseExpression();
1002 return Error("expected ')'");
1003 getNextToken(); // eat ).
1008 /// ::= identifierexpr
1011 static ExprAST *ParsePrimary() {
1013 default: return Error("unknown token when expecting an expression");
1014 case tok_identifier: return ParseIdentifierExpr();
1015 case tok_number: return ParseNumberExpr();
1016 case '(': return ParseParenExpr();
1021 /// ::= ('+' primary)*
1022 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1023 // If this is a binop, find its precedence.
1025 int TokPrec = GetTokPrecedence();
1027 // If this is a binop that binds at least as tightly as the current binop,
1028 // consume it, otherwise we are done.
1029 if (TokPrec < ExprPrec)
1032 // Okay, we know this is a binop.
1034 getNextToken(); // eat binop
1036 // Parse the primary expression after the binary operator.
1037 ExprAST *RHS = ParsePrimary();
1040 // If BinOp binds less tightly with RHS than the operator after RHS, let
1041 // the pending operator take RHS as its LHS.
1042 int NextPrec = GetTokPrecedence();
1043 if (TokPrec < NextPrec) {
1044 RHS = ParseBinOpRHS(TokPrec+1, RHS);
1045 if (RHS == 0) return 0;
1049 LHS = new BinaryExprAST(BinOp, LHS, RHS);
1054 /// ::= primary binoprhs
1056 static ExprAST *ParseExpression() {
1057 ExprAST *LHS = ParsePrimary();
1060 return ParseBinOpRHS(0, LHS);
1064 /// ::= id '(' id* ')'
1065 static PrototypeAST *ParsePrototype() {
1066 if (CurTok != tok_identifier)
1067 return ErrorP("Expected function name in prototype");
1069 std::string FnName = IdentifierStr;
1073 return ErrorP("Expected '(' in prototype");
1075 std::vector<std::string> ArgNames;
1076 while (getNextToken() == tok_identifier)
1077 ArgNames.push_back(IdentifierStr);
1079 return ErrorP("Expected ')' in prototype");
1082 getNextToken(); // eat ')'.
1084 return new PrototypeAST(FnName, ArgNames);
1087 /// definition ::= 'def' prototype expression
1088 static FunctionAST *ParseDefinition() {
1089 getNextToken(); // eat def.
1090 PrototypeAST *Proto = ParsePrototype();
1091 if (Proto == 0) return 0;
1093 if (ExprAST *E = ParseExpression())
1094 return new FunctionAST(Proto, E);
1098 /// toplevelexpr ::= expression
1099 static FunctionAST *ParseTopLevelExpr() {
1100 if (ExprAST *E = ParseExpression()) {
1101 // Make an anonymous proto.
1102 PrototypeAST *Proto = new PrototypeAST("", std::vector<());
1103 return new FunctionAST(Proto, E);
1108 /// external ::= 'extern' prototype
1109 static PrototypeAST *ParseExtern() {
1110 getNextToken(); // eat extern.
1111 return ParsePrototype();
1114 //===----------------------------------------------------------------------===//
1115 // Top-Level parsing
1116 //===----------------------------------------------------------------------===//
1118 static void HandleDefinition() {
1119 if (FunctionAST *F = ParseDefinition()) {
1120 fprintf(stderr, "Parsed a function definition.\n");
1122 // Skip token for error recovery.
1127 static void HandleExtern() {
1128 if (PrototypeAST *P = ParseExtern()) {
1129 fprintf(stderr, "Parsed an extern\n");
1131 // Skip token for error recovery.
1136 static void HandleTopLevelExpression() {
1137 // Evaluate a top level expression into an anonymous function.
1138 if (FunctionAST *F = ParseTopLevelExpr()) {
1139 fprintf(stderr, "Parsed a top-level expr\n");
1141 // Skip token for error recovery.
1146 /// top ::= definition | external | expression | ';'
1147 static void MainLoop() {
1149 fprintf(stderr, "ready> ");
1151 case tok_eof: return;
1152 case ';': getNextToken(); break; // ignore top level semicolons.
1153 case tok_def: HandleDefinition(); break;
1154 case tok_extern: HandleExtern(); break;
1155 default: HandleTopLevelExpression(); break;
1160 //===----------------------------------------------------------------------===//
1161 // Main driver code.
1162 //===----------------------------------------------------------------------===//
1165 // Install standard binary operators.
1166 // 1 is lowest precedence.
1167 BinopPrecedence['<'] = 10;
1168 BinopPrecedence['+'] = 20;
1169 BinopPrecedence['-'] = 20;
1170 BinopPrecedence['*'] = 40; // highest.
1172 // Prime the first token.
1173 fprintf(stderr, "ready> ");
1183 <!-- *********************************************************************** -->
1186 <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1187 src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1188 <a href="http://validator.w3.org/check/referer"><img
1189 src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!" /></a>
1191 <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1192 <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
1193 Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $