X-Git-Url: http://plrg.eecs.uci.edu/git/?a=blobdiff_plain;ds=sidebyside;f=docs%2FStacker.html;h=b8431d2ebe81206fd1ff99d26f65c2aa3d5d11ff;hb=e5109fae718c996ce52b485328f8108728097843;hp=eabccdf6cf10b9aeb3557876bd6aa741502c4d1b;hpb=07e89e43df34ea6c1bfff9e247040f07f59d0d6c;p=oota-llvm.git

diff --git a/docs/Stacker.html b/docs/Stacker.html
index eabccdf6cf1..b8431d2ebe8 100644
--- a/docs/Stacker.html
+++ b/docs/Stacker.html
@@ -25,8 +25,10 @@
@@ -40,6 +42,8 @@This document is another way to learn about LLVM. Unlike the LLVM Reference Manual or -LLVM Programmer's Manual, this -document walks you through the implementation of a programming language -named Stacker. Stacker was invented specifically as a demonstration of +LLVM Programmer's Manual, here we learn +about LLVM through the experience of creating a simple programming language +named Stacker. Stacker was invented specifically as a demonstration of LLVM. The emphasis in this document is not on describing the -intricacies of LLVM itself, but on how to use it to build your own +intricacies of LLVM itself but on how to use it to build your own compiler system.
Amongst other things, LLVM is a platform for compiler writers. Because of its exceptionally clean and small IR (intermediate representation), compiler writing with LLVM is much easier than with -other system. As proof, the author of Stacker wrote the entire -compiler (language definition, lexer, parser, code generator, etc.) in -about four days! That's important to know because it shows -how quickly you can get a new -language up when using LLVM. Furthermore, this was the first +other system. As proof, I wrote the entire compiler (language definition, +lexer, parser, code generator, etc.) in about four days! +That's important to know because it shows how quickly you can get a new +language running when using LLVM. Furthermore, this was the first language the author ever created using LLVM. The learning curve is included in that four days.
The language described here, Stacker, is Forth-like. Programs -are simple collections of word definitions and the only thing definitions +are simple collections of word definitions, and the only thing definitions can do is manipulate a stack or generate I/O. Stacker is not a "real" -programming language; its very simple. Although it is computationally +programming language; it's very simple. Although it is computationally complete, you wouldn't use it for your next big project. However, -the fact that it is complete, its simple, and it doesn't have +the fact that it is complete, it's simple, and it doesn't have a C-like syntax make it useful for demonstration purposes. It shows -that LLVM could be applied to a wide variety of language syntaxes.
+that LLVM could be applied to a wide variety of languages.
The basic notions behind Stacker are very simple. There's a stack of integers (or character pointers) that the program manipulates. Pretty much the only thing the program can do is manipulate the stack and do @@ -92,11 +95,11 @@ program in Stacker:
: MAIN hello_world ;This has two "definitions" (Stacker manipulates words, not
functions and words have definitions): MAIN
and
-hello_world
. The MAIN
definition is standard, it
+hello_world. The MAIN
definition is standard; it
tells Stacker where to start. Here, MAIN
is defined to
simply invoke the word hello_world
. The
hello_world
definition tells stacker to push the
-"Hello, World!"
string onto the stack, print it out
+"Hello, World!"
string on to the stack, print it out
(>s
), pop it off the stack (DROP
), and
finally print a carriage return (CR
). Although
hello_world
uses the stack, its net effect is null. Well
@@ -106,59 +109,67 @@ written Stacker definitions have that characteristic.
Stacker was written for two purposes: (a) to get the author over the -learning curve and (b) to provide a simple example of how to write a compiler -using LLVM. During the development of Stacker, many lessons about LLVM were +
Stacker was written for two purposes:
+During the development of Stacker, many lessons about LLVM were learned. Those lessons are described in the following subsections.
Although I knew that LLVM used a Single Static Assignment (SSA) format, +
Although I knew that LLVM uses a Static Single Assignment (SSA) format, it wasn't obvious to me how prevalent this idea was in LLVM until I really -started using it. Reading the Programmer's Manual and Language Reference I -noted that most of the important LLVM IR (Intermediate Representation) C++ +started using it. Reading the +Programmer's Manual and Language Reference, +I noted that most of the important LLVM IR (Intermediate Representation) C++ classes were derived from the Value class. The full power of that simple design only became clear once I started constructing executable expressions for Stacker.
This really makes your programming go faster. Think about compiling code -for the following C/C++ expression: (a|b)*((x+1)/(y+1)). You could write a -function using LLVM that does exactly that, this way:
+for the following C/C++ expression:(a|b)*((x+1)/(y+1))
. Assuming
+the values are on the stack in the order a, b, x, y, this could be
+expressed in stacker as: 1 + SWAP 1 + / ROT2 OR *
.
+You could write a function using LLVM that computes this expression like this:
Value*
-expression(BasicBlock*bb, Value* a, Value* b, Value* x, Value* y )
+expression(BasicBlock* bb, Value* a, Value* b, Value* x, Value* y )
{
Instruction* tail = bb->getTerminator();
ConstantSInt* one = ConstantSInt::get( Type::IntTy, 1);
BinaryOperator* or1 =
- new BinaryOperator::create( Instruction::Or, a, b, "", tail );
+ BinaryOperator::create( Instruction::Or, a, b, "", tail );
BinaryOperator* add1 =
- new BinaryOperator::create( Instruction::Add, x, one, "", tail );
+ BinaryOperator::create( Instruction::Add, x, one, "", tail );
BinaryOperator* add2 =
- new BinaryOperator::create( Instruction::Add, y, one, "", tail );
+ BinaryOperator::create( Instruction::Add, y, one, "", tail );
BinaryOperator* div1 =
- new BinaryOperator::create( Instruction::Div, add1, add2, "", tail);
+ BinaryOperator::create( Instruction::Div, add1, add2, "", tail);
BinaryOperator* mult1 =
- new BinaryOperator::create( Instruction::Mul, or1, div1, "", tail );
+ BinaryOperator::create( Instruction::Mul, or1, div1, "", tail );
return mult1;
}
-"Okay, big deal," you say. It is a big deal. Here's why. Note that I didn't +
"Okay, big deal," you say? It is a big deal. Here's why. Note that I didn't have to tell this function which kinds of Values are being passed in. They could be -instructions, Constants, Global Variables, etc. Furthermore, if you specify Values -that are incorrect for this sequence of operations, LLVM will either notice right -away (at compilation time) or the LLVM Verifier will pick up the inconsistency -when the compiler runs. In no case will you make a type error that gets passed -through to the generated program. This really helps you write a compiler -that always generates correct code!
+Instruction
s, Constant
s, GlobalVariable
s, or
+any of the other subclasses of Value
that LLVM supports.
+Furthermore, if you specify Values that are incorrect for this sequence of
+operations, LLVM will either notice right away (at compilation time) or the LLVM
+Verifier will pick up the inconsistency when the compiler runs. In either case
+LLVM prevents you from making a type error that gets passed through to the
+generated program. This really helps you write a compiler that
+always generates correct code!
The second point is that we don't have to worry about branching, registers, stack variables, saving partial results, etc. The instructions we create are the values we use. Note that all that was created in the above code is a Constant value and five operators. Each of the instructions is -the resulting value of that instruction.
+the resulting value of that instruction. This saves a lot of time.The lesson is this: SSA form is very powerful: there is no difference - between a value and the instruction that created it. This is fully +between a value and the instruction that created it. This is fully enforced by the LLVM IR. Use it to your best advantage.
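The idea that an operation is its own result value can be shown outside of LLVM. What follows is a minimal, self-contained C++ sketch and not LLVM's API: `Value`, `Constant`, `BinaryOp`, `Pool`, `buildExpression`, and `demo` are hypothetical stand-ins invented for illustration. It builds (a|b)*((x+1)/(y+1)) by handing each operation object straight to the next operation as an operand, with no temporaries, registers, or stack slots in between.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy stand-ins, NOT the LLVM classes: every node is a Value.
struct Value {
    virtual ~Value() = default;
    virtual int eval() const = 0;
};

struct Constant : Value {
    int v;
    explicit Constant(int v) : v(v) {}
    int eval() const override { return v; }
};

enum class Op { Or, Add, Div, Mul };

// An operation is itself a Value, so it can be used directly as an operand.
struct BinaryOp : Value {
    Op op;
    const Value* lhs;
    const Value* rhs;
    BinaryOp(Op op, const Value* l, const Value* r) : op(op), lhs(l), rhs(r) {}
    int eval() const override {
        int a = lhs->eval(), b = rhs->eval();
        switch (op) {
            case Op::Or:  return a | b;
            case Op::Add: return a + b;
            case Op::Div: return a / b;  // caller must ensure b != 0
            case Op::Mul: return a * b;
        }
        return 0;
    }
};

// Owns every node so the raw operand pointers stay valid.
struct Pool {
    std::vector<std::unique_ptr<Value>> nodes;
    template <class T, class... A>
    const T* make(A... args) {
        nodes.push_back(std::make_unique<T>(args...));
        return static_cast<const T*>(nodes.back().get());
    }
};

// Mirrors the shape of the expression() function above: five operators
// and one constant, each operator consuming earlier values directly.
const Value* buildExpression(Pool& p, const Value* a, const Value* b,
                             const Value* x, const Value* y) {
    const Value* one  = p.make<Constant>(1);
    const Value* or1  = p.make<BinaryOp>(Op::Or, a, b);
    const Value* add1 = p.make<BinaryOp>(Op::Add, x, one);
    const Value* add2 = p.make<BinaryOp>(Op::Add, y, one);
    const Value* div1 = p.make<BinaryOp>(Op::Div, add1, add2);
    return p.make<BinaryOp>(Op::Mul, or1, div1);
}

// Convenience wrapper: build the expression from plain ints and evaluate it.
int demo(int a, int b, int x, int y) {
    Pool p;
    return buildExpression(p, p.make<Constant>(a), p.make<Constant>(b),
                           p.make<Constant>(x), p.make<Constant>(y))->eval();
}
```

Note that `buildExpression` never asks what kind of `Value` it was given; constants and operations are interchangeable as operands, which is the property the lesson describes.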
After a little initial fumbling around, I quickly caught on to how blocks -should be constructed. The use of the standard template library really helps -simply the interface. In general, here's what I learned: +should be constructed. In general, here's what I learned:
getTerminator()
method on a BasicBlock
), it can
always be used as the insert_before
argument to your instruction
constructors. This causes the instruction to automatically be inserted in
- the RightPlace&tm; place, just before the terminating instruction. The
+ the RightPlace™ place, just before the terminating instruction. The
nice thing about this design is that you can pass blocks around and insert
- new instructions into them without ever known what instructions came
+ new instructions into them without ever knowing what instructions came
 before. This makes for some very clean compiler design.
The foregoing is such an important principle, it's worth making an idiom:
-
-
+
BasicBlock* bb = new BasicBlock();
bb->getInstList().push_back( new Branch( ... ) );
new Instruction(..., bb->getTerminator() );
-
-
+
To make this clear, consider the typical if-then-else statement (see StackerCompiler::handle_if() method). We can set this up in a single function using LLVM in the following way:
@@ -228,45 +236,47 @@
 BasicBlock*
 MyCompiler::handle_if( BasicBlock* bb, SetCondInst* condition )
 {
     // Create the blocks to contain code in the structure of if/then/else
-    BasicBlock* then = new BasicBlock();
-    BasicBlock* else = new BasicBlock();
-    BasicBlock* exit = new BasicBlock();
+    BasicBlock* then_bb = new BasicBlock();
+    BasicBlock* else_bb = new BasicBlock();
+    BasicBlock* exit_bb = new BasicBlock();

     // Insert the branch instruction for the "if"
-    bb->getInstList().push_back( new BranchInst( then, else, condition ) );
+    bb->getInstList().push_back( new BranchInst( then_bb, else_bb, condition ) );

     // Set up the terminating instructions
-    then->getInstList().push_back( new BranchInst( exit ) );
-    else->getInstList().push_back( new BranchInst( exit ) );
+    then_bb->getInstList().push_back( new BranchInst( exit_bb ) );
+    else_bb->getInstList().push_back( new BranchInst( exit_bb ) );

     // Fill in the then part .. details excised for brevity
-    this->fill_in( then );
+    this->fill_in( then_bb );

     // Fill in the else part .. details excised for brevity
-    this->fill_in( else );
+    this->fill_in( else_bb );

     // Return a block to the caller that can be filled in with the code
     // that follows the if/then/else construct.
-    return exit;
+    return exit_bb;
 }
Presumably in the foregoing, the calls to the "fill_in" method would add
the instructions for the "then" and "else" parts. They would use the third part
of the idiom almost exclusively (inserting new instructions before the
terminator). Furthermore, they could even recurse back to handle_if
-should they encounter another if/then/else statement and it will all "just work".
-
+should they encounter another if/then/else statement, and it will just work.
Note how cleanly this all works out. In particular, the push_back methods on
the BasicBlock
's instruction list. These are lists of type
-Instruction
which also happen to be Value
s. To create
+Instruction
(which is also of type Value
). To create
the "if" branch we merely instantiate a BranchInst
that takes as
-arguments the blocks to branch to and the condition to branch on. The blocks
-act like branch labels! This new BranchInst
terminates
-the BasicBlock
provided as an argument. To give the caller a way
-to keep inserting after calling handle_if
we create an "exit" block
-which is returned to the caller. Note that the "exit" block is used as the
-terminator for both the "then" and the "else" blocks. This gaurantees that no
-matter what else "handle_if" or "fill_in" does, they end up at the "exit" block.
+arguments the blocks to branch to and the condition to branch on. The
+BasicBlock
objects act like branch labels! This new
+BranchInst
terminates the BasicBlock
provided
+as an argument. To give the caller a way to keep inserting after calling
+handle_if
, we create an exit_bb
block which is
+returned
+to the caller. Note that the exit_bb
block is used as the
+terminator for both the then_bb
and the else_bb
+blocks. This guarantees that no matter what else handle_if
+or fill_in
does, they end up at the exit_bb
block.
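The "terminate first, then insert before the terminator" idiom can be tried out without LLVM. The following is a toy C++ model under stated assumptions: `ToyBlock`, `IfBlocks`, and `handleIf` are hypothetical names, instructions are plain strings, and none of this is LLVM's `BasicBlock` or `BranchInst` interface. It only shows why later insertions never disturb the control flow that was set up at the start.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy model of a basic block, NOT LLVM's BasicBlock.
struct ToyBlock {
    std::string name;
    std::vector<std::string> insts;  // once terminated, the last entry is the terminator
    bool terminated = false;

    explicit ToyBlock(std::string n) : name(std::move(n)) {}

    // Append the terminating branch right after creating the block.
    void terminate(const std::string& branch) {
        assert(!terminated);
        insts.push_back(branch);
        terminated = true;
    }

    // Every later instruction is placed just before the terminator.
    void insertBeforeTerminator(const std::string& inst) {
        assert(terminated);
        insts.insert(insts.end() - 1, inst);
    }
};

struct IfBlocks {
    ToyBlock thenBB;
    ToyBlock elseBB;
    ToyBlock exitBB;  // returned unterminated; the caller decides how it ends
};

// Same shape as handle_if above: wire up the branches first, fill in later.
IfBlocks handleIf(ToyBlock& bb, const std::string& cond) {
    IfBlocks r{ToyBlock("then"), ToyBlock("else"), ToyBlock("exit")};
    bb.terminate("br " + cond + ", then, else");
    r.thenBB.terminate("br exit");
    r.elseBB.terminate("br exit");
    return r;
}
```

However many instructions are later added to the "then" or "else" block through `insertBeforeTerminator`, the final entry of each remains the branch to the exit block.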
It took a little getting used to and several rounds of postings to the LLVM -mail list to wrap my head around this instruction correctly. Even though I had +mailing list to wrap my head around this instruction correctly. Even though I had read the Language Reference and Programmer's Manual a couple times each, I still missed a few very key points:
This means that when you look up an element in the global variable (assuming -its a struct or array), you must deference the pointer first! For many +it's a struct or array), you must dereference the pointer first! For many things, this leads to the idiom:
@@ -312,40 +322,43 @@ pointer. The second index subscripts the array. If you're a "C" programmer, this
will run against your grain because you'll naturally think of the global array
variable and the address of its first element as the same. That tripped me up
for a while until I realized that they really do differ .. by type.
-Remember that LLVM is a strongly typed language itself. Absolutely everything
-has a type. The "type" of the global variable is [24 x int]*. That is, its
+Remember that LLVM is strongly typed. Everything has a type.
+The "type" of the global variable is [24 x int]*. That is, it's
a pointer to an array of 24 ints. When you dereference that global variable with
-a single index, you now have a " [24 x int]" type, the pointer is gone. Although
+a single (0) index, you now have a "[24 x int]" type. Although
the pointer value of the dereferenced global and the address of the zero'th element
in the array will be the same, they differ in their type. The zero'th element has
type "int" while the pointer value has type "[24 x int]".
-Get this one aspect of LLVM right in your head and you'll save yourself
+
Get this one aspect of LLVM right in your head, and you'll save yourself
a lot of compiler writing headaches down the road.
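A rough C++ analogy may help here. It is only an analogy and assumes nothing about LLVM's API: `stack_storage` and `sameAddressDifferentType` are hypothetical names. C++ also distinguishes a pointer to a whole array from a pointer to its first element, even though both hold the same address, and the `static_assert` lines let the compiler confirm the types at each step.

```cpp
#include <cassert>
#include <type_traits>

int stack_storage[24];  // plays the role of the [24 x int] global

// Returns true when the whole-array pointer and the first-element pointer
// hold the same address even though their types differ.
bool sameAddressDifferentType() {
    int (*whole)[24] = &stack_storage;  // analogous to [24 x int]*
    int* first = &(*whole)[0];          // dereference first, then subscript

    // The two pointers have different types...
    static_assert(!std::is_same<decltype(whole), decltype(first)>::value,
                  "pointer types differ");
    // ...one level of dereference yields the array itself...
    static_assert(std::is_same<decltype(*whole), int (&)[24]>::value,
                  "dereferencing yields the array");
    // ...and only the second step reaches an int element.
    static_assert(std::is_same<decltype((*whole)[0]), int&>::value,
                  "subscripting yields an int");

    // ...yet both refer to the same location in memory.
    return static_cast<void*>(whole) == static_cast<void*>(first);
}
```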
Linkage types in LLVM can be a little confusing, especially if your compiler -writing mind has affixed very hard concepts to particular words like "weak", +writing mind has affixed firm concepts to particular words like "weak", "external", "global", "linkonce", etc. LLVM does not use the precise -definitions of say ELF or GCC even though they share common terms. To be fair, +definitions of, say, ELF or GCC, even though they share common terms. To be fair, the concepts are related and similar but not precisely the same. This can lead you to think you know what a linkage type represents but in fact it is slightly different. I recommend you read the Language Reference on this topic very -carefully.
+carefully. Then, read it again.
Here are some handy tips that I discovered along the way:
This section describes the Stacker language
Stacker definitions define what they do to the global stack. Before proceeding, a few words about the stack are in order. The stack is simply a global array of 32-bit integers or pointers. A global index keeps track -of the location of the to of the stack. All of this is hidden from the -programmer but it needs to be noted because it is the foundation of the +of the location of the top of the stack. All of this is hidden from the +programmer, but it needs to be noted because it is the foundation of the conceptual programming model for Stacker. When you write a definition, you are, essentially, saying how you want that definition to manipulate the global stack.
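As a concrete picture of that model, here is a small C++ sketch of a global array plus a global index. The names (`the_stack`, `the_index`, `stk_push`, `stk_pop`, `swap_top_two`) and the capacity of 256 are assumptions made for illustration, and the sketch stores only 32-bit integers rather than integers or pointers; it is not the code the Stacker compiler generates.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names and capacity; the text only says "a global array"
// and "a global index".
const int kStackSize = 256;
std::int32_t the_stack[kStackSize];
int the_index = 0;  // number of values currently on the stack

void stk_push(std::int32_t v) {
    assert(the_index < kStackSize);  // no overflow checking exists in the real model
    the_stack[the_index++] = v;
}

std::int32_t stk_pop() {
    assert(the_index > 0);
    return the_stack[--the_index];
}

// In this model a "definition" is just a function that manipulates the
// global stack. This one has the stack effect  w1 w2 -- w2 w1.
void swap_top_two() {
    std::int32_t w2 = stk_pop();
    std::int32_t w1 = stk_pop();
    stk_push(w2);
    stk_push(w1);
}
```

Pushing 1 and then 2, calling `swap_top_two()`, and popping twice yields 1 and then 2, leaving the stack empty.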
Manipulating the stack can be quite hazardous. There is no distinction given and no checking for the various types of values that can be placed on the stack. Automatic coercion between types is performed. In many -cases this is useful. For example, a boolean value placed on the stack +cases, this is useful. For example, a boolean value placed on the stack can be interpreted as an integer with good results. However, using a word that interprets that boolean value as a pointer to a string to print out will almost always yield a crash. Stacker simply leaves it to the programmer to get it right without any interference or hindering -on interpretation of the stack values. You've been warned :)
+on interpretation of the stack values. You've been warned. :)
So, your typical definition will have the form:
+: name ... ;
+The name
is up to you but it must start with a letter and contain
+only letters, numbers, and underscore. Names are case sensitive and must not be
+the same as the name of a built-in word. The ...
is replaced by
+the stack manipulating words that you wish to define name
as.
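The lexical part of the naming rule is easy to state in code. This is a minimal C++ sketch with a hypothetical function name, `isValidWordName`; it checks only the "starts with a letter, then letters, numbers, and underscore" rule and deliberately does not check for clashes with built-in words, since that list is not reproduced here. It is not taken from the Stacker lexer.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Checks the lexical rule only: first character is a letter, the rest are
// letters, digits, or underscores. Clashes with built-in words are NOT checked.
bool isValidWordName(const std::string& name) {
    if (name.empty() || !std::isalpha(static_cast<unsigned char>(name[0])))
        return false;
    for (char c : name) {
        unsigned char u = static_cast<unsigned char>(c);
        if (!std::isalnum(u) && c != '_')
            return false;
    }
    return true;
}
```

Because names are case sensitive, `MAIN` and `main` both pass this check and would name two different words.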
+ + +
+Stacker supports two types of comments. A hash mark (#) starts a comment + that extends to the end of the line. It is identical to the kind of comments + commonly used in shell scripts. A pair of parentheses also surrounds a comment. + In both cases, the content of the comment is ignored by the Stacker compiler. The + following does nothing in Stacker. +
+
+# This is a comment to end of line
+( This is an enclosed comment )
+
+See the example program to see comments in use in +a real program.
There are three kinds of literal values in Stacker. Integer, Strings, +
There are three kinds of literal values in Stacker: Integers, Strings,
and Booleans. In each case, the stack operation is to simply push the
- value onto the stack. So, for example:
+ value on to the stack. So, for example:
42 " is the answer." TRUE
- will push three values onto the stack: the integer 42, the
- string " is the answer." and the boolean TRUE.
Words in a definition come in two flavors: built-in and programmer defined. Simply mentioning the name of a previously defined or declared -programmer-defined word causes that words definition to be invoked. It +programmer-defined word causes that word's stack actions to be invoked. It is somewhat like a function call in other languages. The built-in -words have various effects, described below.
+words have various effects, described below.Sometimes you need to call a word before it is defined. For this, you can
-use the FORWARD
declaration. It looks like this
+use the FORWARD
declaration. It looks like this:
FORWARD name ;
This simply states to Stacker that "name" is the name of a definition that is defined elsewhere. Generally it means the definition can be found @@ -434,20 +471,21 @@ linking.
The built-in words of the Stacker language are put in several groups depending on what they do. The groups are as follows:
Definition Of Operation Of Built In Words
LOGICAL OPERATIONS
Word | Name | Operation | Description
< | -LT | -w1 w2 -- b | -Two values (w1 and w2) are popped off the stack and