
The LLVM Target-Independent Code Generator

+
1. Introduction
  2. The TargetJITInfo class
  3. -
  4. Machine code description classes +
  5. The "Machine" Code Generator classes +
  6. +
  7. The "MC" Layer +
8. Target-independent code generation algorithms
  9. Instruction folding
  10. Built in register allocators
  11. -
  12. Code Emission -
  13. +
  14. Code Emission
  15. +
  16. VLIW Packetizer + +
  17. +
  18. Implementing a Native Assembler
  19. +
  20. Target-specific Implementation Notes
-


+

Written by the LLVM Team.

-
+

Introduction -

+ -
+

The LLVM target-independent code generator is a framework that provides a suite of reusable components for translating the LLVM internal representation to the machine code for a specified target—either in assembly form (suitable for a static compiler) or in binary machine code format (usable for a JIT compiler). The LLVM target-independent code generator consists of six main components:

    @@ -131,10 +157,17 @@ independently of how they will be used. These interfaces are defined in include/llvm/Target/. -
2. Classes used to represent the code being generated for a target. These classes are intended to be abstract enough to represent the machine code for any target machine. These classes are defined in include/llvm/CodeGen/. At this level, concepts like "constant pool entries" and "jump tables" are explicitly exposed.
4. Classes and algorithms used to represent code at the object file level, the MC Layer. These classes represent assembly-level constructs like labels, sections, and instructions. At this level, concepts like "constant pool entries" and "jump tables" don't exist.
5. Target-independent algorithms used to implement various phases of native code generation (register allocation, scheduling, etc.); these algorithms depend only on the target-description and machine code representation classes, ensuring that they are portable.

    -
- -
+

Required components in the code generator -

+ -
+

The two pieces of the LLVM code generator are the high-level interface to the code generator and the set of reusable components that can be used to build @@ -200,11 +231,11 @@

- + -
+

The LLVM target-independent code generator is designed to support efficient and quality code generation for standard register-based microprocessors. @@ -274,11 +305,11 @@

- + -
+

The target description classes require a detailed description of the target architecture. These target descriptions often have a large amount of common @@ -301,13 +332,15 @@

+
+ - + -
+

The LLVM target description classes (located in the include/llvm/Target directory) provide an abstract description of @@ -323,14 +356,12 @@ TargetMachine class provides accessors that should be implemented by the target.

-
- - + -
+

The TargetMachine class provides virtual methods that are used to access the target-specific implementations of the various target description @@ -346,11 +377,11 @@

- + -
+

The TargetData class is the only required target description class, and it is the only class that is not extensible (you cannot derive a new class from it). It specifies information about how the target lays out memory for structures, the alignment requirements for various data types, the size of pointers in the target, and whether the target is little-endian or big-endian.

- + -
+

The TargetLowering class is used by SelectionDAG based instruction selectors primarily to describe how LLVM code should be lowered to SelectionDAG operations.

- + -
+

The TargetRegisterInfo class is used to describe the register file of the target and any interactions between the registers.

@@ -422,11 +453,11 @@
- + -
+

The TargetInstrInfo class is used to describe the machine instructions supported by the target. It is essentially an array of TargetInstrDescriptor objects, each of which describes one instruction the target supports.

- + -
+

The TargetFrameInfo class is used to provide information about the stack frame layout of the target. It holds the direction of stack growth, the known stack alignment on entry to each function, and the offset to the local area.

- + -
+

The TargetSubtarget class is used to provide information about the specific chip set being targeted. A sub-target informs code generation of which instructions are supported, instruction latencies, and instruction execution itinerary; i.e., which processing units are used, in what order, and for how long.

+ -
+

The TargetJITInfo class exposes an abstract interface used by the Just-In-Time code generator to perform target-specific activities, such as emitting stubs.

+
+ - + -
+

At the high level, LLVM code is translated to a machine-specific representation formed out of MachineFunction, MachineBasicBlock, and MachineInstr instances. This representation supports both an SSA representation for machine code, as well as a register allocated, non-SSA form.

-
- - + -
+

Target machine instructions are represented as instances of the MachineInstr class. This class is an extremely abstract way of @@ -553,14 +584,12 @@

Also if the first operand is a def, it is easier to create instructions whose only def is the first operand.

-
- - + -
+

Machine instructions are created by using the BuildMI functions, located in the include/llvm/CodeGen/MachineInstrBuilder.h file. The @@ -600,18 +629,18 @@ BuildMI(MBB, X86::JNE, 1).addMBB(&MBB);

MI.addReg(Reg, RegState::Define);
 
- + -
+

One important issue that the code generator needs to be aware of is the presence of fixed registers. In particular, there are often places in the @@ -679,11 +708,26 @@ ret

-

+ Call-clobbered registers +

+ +
+ +

Some machine instructions, like calls, clobber a large number of physical registers. Rather than adding <def,dead> operands for all of them, it is possible to use an MO_RegisterMask operand instead. The register mask operand holds a bit mask of preserved registers, and everything else is considered to be clobbered by the instruction.

+
-
+ +

+ Machine code in SSA form +

+ +

MachineInstr's are initially selected in SSA-form, and are maintained in SSA-form until register allocation happens. For the most part, @@ -696,12 +740,14 @@ ret

+
+ - + -
+

The MachineBasicBlock class contains a list of machine instructions (MachineInstr instances). It roughly @@ -714,11 +760,11 @@ ret

- + -
+

The MachineFunction class contains a list of machine basic blocks (MachineBasicBlock instances). It @@ -731,26 +777,258 @@ ret

+ +

+ MachineInstr Bundles +

+ +
+ +

The LLVM code generator can model sequences of instructions as MachineInstr bundles. An MI bundle can model a VLIW group / pack which contains an arbitrary number of parallel instructions. It can also be used to model a sequential list of instructions (potentially with data dependencies) that cannot be legally separated (e.g. ARM Thumb2 IT blocks).

+ +

Conceptually, an MI bundle is an MI with a number of other MIs nested within:

+ +
+
--------------
|   Bundle   | ---------
--------------          \
       |           ----------------
       |           |      MI      |
       |           ----------------
       |                   |
       |           ----------------
       |           |      MI      |
       |           ----------------
       |                   |
       |           ----------------
       |           |      MI      |
       |           ----------------
       |
--------------
|   Bundle   | --------
--------------         \
       |           ----------------
       |           |      MI      |
       |           ----------------
       |                   |
       |           ----------------
       |           |      MI      |
       |           ----------------
       |                   |
       |                  ...
       |
--------------
|   Bundle   | --------
--------------         \
       |
      ...
+ +

MI bundle support does not change the physical representations of MachineBasicBlock and MachineInstr. All the MIs (including top level and nested ones) are stored as a sequential list of MIs. The "bundled" MIs are marked with the 'InsideBundle' flag. A top level MI with the special BUNDLE opcode is used to represent the start of a bundle. It's legal to mix BUNDLE MIs with individual MIs that are not inside bundles nor represent bundles.

+ +

MachineInstr passes should operate on an MI bundle as a single unit. Member methods have been taught to correctly handle bundles and MIs inside bundles. The MachineBasicBlock iterator has been modified to skip over bundled MIs to enforce the bundle-as-a-single-unit concept. An alternative iterator, instr_iterator, has been added to MachineBasicBlock to allow passes to iterate over all of the MIs in a MachineBasicBlock, including those which are nested inside bundles. The top level BUNDLE instruction must have the correct set of register MachineOperand's that represent the cumulative inputs and outputs of the bundled MIs.

+ +

Packing / bundling of MachineInstr's should be done as part of the register allocation super-pass. More specifically, the pass which determines what MIs should be bundled together must be done after the code generator exits SSA form (i.e. after the two-address pass, PHI elimination, and copy coalescing). Bundles should only be finalized (i.e. adding BUNDLE MIs and input and output register MachineOperands) after virtual registers have been rewritten into physical registers. This requirement eliminates the need to add virtual register operands to BUNDLE instructions, which would effectively double the virtual register def and use lists.

+ +
+ +
+ -
- Target-independent code generation algorithms +

+ The "MC" Layer +

+ + +
+ +

+The MC Layer is used to represent and process code at the raw machine code +level, devoid of "high level" information like "constant pools", "jump tables", +"global variables" or anything like that. At this level, LLVM handles things +like label names, machine instructions, and sections in the object file. The +code in this layer is used for a number of important purposes: the tail end of +the code generator uses it to write a .s or .o file, and it is also used by the +llvm-mc tool to implement standalone machine code assemblers and disassemblers. +

+ +

This section describes some of the important classes. There are also a number of important subsystems that interact at this layer; they are described later in this manual.

+ + +

+ The MCStreamer API +

+ +
+ +

MCStreamer is best thought of as an assembler API. It is an abstract API which is implemented in different ways (e.g. to output a .s file, output an ELF .o file, etc) but whose API corresponds directly to what you see in a .s file. MCStreamer has one method per directive, such as EmitLabel, EmitSymbolAttribute, SwitchSection, EmitValue (for .byte, .word), etc, which directly correspond to assembly level directives. It also has an EmitInstruction method, which is used to output an MCInst to the streamer.

+ +

+This API is most important for two clients: the llvm-mc stand-alone assembler is +effectively a parser that parses a line, then invokes a method on MCStreamer. In +the code generator, the Code Emission phase of the code +generator lowers higher level LLVM IR and Machine* constructs down to the MC +layer, emitting directives through MCStreamer.

+ +

+On the implementation side of MCStreamer, there are two major implementations: +one for writing out a .s file (MCAsmStreamer), and one for writing out a .o +file (MCObjectStreamer). MCAsmStreamer is a straight-forward implementation +that prints out a directive for each method (e.g. EmitValue -> .byte), but +MCObjectStreamer implements a full assembler. +

+
+ + +

+ The MCContext class +

+ +
+ +

The MCContext class is the owner of a variety of uniqued data structures at the MC layer, including symbols, sections, etc. As such, this is the class that you interact with to create symbols and sections. This class cannot be subclassed.

+ +
+ + +

+ The MCSymbol class +

+ +
+ +

+The MCSymbol class represents a symbol (aka label) in the assembly file. There +are two interesting kinds of symbols: assembler temporary symbols, and normal +symbols. Assembler temporary symbols are used and processed by the assembler +but are discarded when the object file is produced. The distinction is usually +represented by adding a prefix to the label, for example "L" labels are +assembler temporary labels in MachO. +

+ +

MCSymbols are created by MCContext and uniqued there. This means that +MCSymbols can be compared for pointer equivalence to find out if they are the +same symbol. Note that pointer inequality does not guarantee the labels will +end up at different addresses though. It's perfectly legal to output something +like this to the .s file:

+ +

  foo:
  bar:
    .byte 4
+
+ +

In this case, both the foo and bar symbols will have the same address.

+ +
+ + +

+ The MCSection class +

+ +
+ +

+The MCSection class represents an object-file specific section. It is subclassed +by object file specific implementations (e.g. MCSectionMachO, +MCSectionCOFF, MCSectionELF) and these are created and uniqued +by MCContext. The MCStreamer has a notion of the current section, which can be +changed with the SwitchToSection method (which corresponds to a ".section" +directive in a .s file). +

+ +
+ + +

+ The MCInst class +

+ +
+ +

+The MCInst class is a target-independent representation of an instruction. It +is a simple class (much more so than MachineInstr) +that holds a target-specific opcode and a vector of MCOperands. MCOperand, in +turn, is a simple discriminated union of three cases: 1) a simple immediate, +2) a target register ID, 3) a symbolic expression (e.g. "Lfoo-Lbar+42") as an +MCExpr. +

+ +

MCInst is the common currency used to represent machine instructions at the +MC layer. It is the type used by the instruction encoder, the instruction +printer, and the type generated by the assembly parser and disassembler. +

+ +
+ +
+ + +

+ Target-independent code generation algorithms +

-
+

This section documents the phases described in the high-level design of the code generator. It explains how they work and some of the rationale behind their design.

-
- - + -
+

Instruction Selection is the process of translating LLVM code presented to the code generator into target-specific machine instructions. There are several well-known ways to do this in the literature. LLVM uses a SelectionDAG based instruction selector; portions of the selector are generated from the target's .td files, though currently there are still things that require custom C++ code.

-
- - + -
+

The SelectionDAG provides an abstraction for code representation in a way that is amenable to instruction selection using automatic techniques (e.g. dynamic-programming based optimal pattern matching selectors).

- + -
+

SelectionDAG-based instruction selection consists of the following steps:

@@ -856,9 +1132,9 @@ ret SelectionDAG optimizer is run to clean up redundancies exposed by type legalization. -
  • +
  • Legalize SelectionDAG Ops — + This stage transforms SelectionDAG nodes to eliminate any operations + that are unsupported on the target.
  • Optimize SelectionDAG — The SelectionDAG optimizer is run to eliminate inefficiencies introduced by @@ -908,11 +1184,11 @@ ret
  • - + -
    +

    The initial SelectionDAG is naïvely peephole expanded from the LLVM input by the SelectionDAGLowering class in the @@ -928,11 +1204,11 @@ ret

    - + -
    +

    The Legalize phase is in charge of converting a DAG to only use the types that are natively supported by the target.

    @@ -961,11 +1237,11 @@ ret
    - + -
    +

    The Legalize phase is in charge of converting a DAG to only use the operations that are natively supported by the target.

    @@ -993,12 +1269,13 @@ ret
    - +

    + + SelectionDAG Optimization Phase: the DAG Combiner + +

    -
    +

    The SelectionDAG optimization phase is run multiple times for code generation, immediately after the DAG is built and once after each @@ -1028,11 +1305,11 @@ ret

    - + -
    +

    The Select phase is the bulk of the target-specific code for instruction selection. This phase takes a legal SelectionDAG as input, pattern matches @@ -1041,9 +1318,9 @@ ret

%t1 = fadd float %W, %X
%t2 = fmul float %t1, %Y
%t3 = fadd float %t2, %Z
     
    @@ -1089,8 +1366,8 @@ def FADDS : AForm_2<59, 21,

The portion of the instruction definition in bold indicates the pattern used to match the instruction. The DAG operators (like fmul/fadd) are defined in the include/llvm/Target/TargetSelectionDAG.td file. "F4RC" is the register class of the input and result values.

The TableGen DAG instruction selector generator reads the instruction patterns in the .td file and automatically builds parts of the instruction selector for you.

    - + -
    +

    The scheduling phase takes the DAG of target instructions from the selection phase and assigns an order. The scheduler can pick an order depending on @@ -1210,11 +1487,11 @@ def : Pat<(i32 imm:$imm),

    - + -
    +
    1. Optional function-at-a-time selection.
    2. @@ -1224,18 +1501,20 @@ def : Pat<(i32 imm:$imm),
    +
    + - -

    To Be Written

    + +

    To Be Written

    - + -
    +

    Live Intervals are the ranges (intervals) where a variable is live. They are used by some register allocator passes to @@ -1243,14 +1522,12 @@ def : Pat<(i32 imm:$imm), register are live at the same point in the program (i.e., they conflict). When this situation occurs, one virtual register must be spilled.

    -
    - - + -
    +

    The first step in determining the live intervals of variables is to calculate the set of registers that are immediately dead after the instruction (i.e., @@ -1292,11 +1569,11 @@ def : Pat<(i32 imm:$imm),

    - + -
    +

    We now have the information available to perform the live intervals analysis and build the live intervals themselves. We start off by numbering the basic @@ -1311,12 +1588,14 @@ def : Pat<(i32 imm:$imm),

    +
    + - + -
    +

    The Register Allocation problem consists in mapping a program Pv, that can use an unbounded number of virtual registers, @@ -1326,23 +1605,21 @@ def : Pat<(i32 imm:$imm), accommodate all the virtual registers, some of them will have to be mapped into memory. These virtuals are called spilled virtuals.

    -
    - - + -
    +

In LLVM, physical registers are denoted by integer numbers that normally range from 1 to 1023. To see how this numbering is defined for a particular architecture, you can read the GenRegisterNames.inc file for that architecture. For instance, by inspecting lib/Target/X86/X86GenRegisterInfo.inc we see that the 32-bit register EAX is denoted by 43, and the MMX register MM0 is mapped to 65.

Some architectures contain registers that share the same physical location. A notable example is the X86 platform. For instance, in the X86 architecture, the registers EAX, AX and AL share the first eight bits. These physical registers are marked as aliased in LLVM. Given a particular architecture, you can check which registers are aliased by inspecting its RegisterInfo.td file. Moreover, the method MCRegisterInfo::getAliasSet(p_reg) returns an array containing all the physical registers aliased to the register p_reg.

    Physical registers, in LLVM, are grouped in Register Classes. @@ -1380,23 +1657,30 @@ bool RegMapping_Fer::compatible_class(MachineFunction &mf, for RegisterClass, the last parameter of which is a list of registers. Just commenting some out is one simple way to avoid them being used. A more polite way is to explicitly exclude some registers from - the allocation order. See the definition of the GR register - class in lib/Target/IA64/IA64RegisterInfo.td for an example of this - (e.g., numReservedRegs registers are hidden.)

    + the allocation order. See the definition of the GR8 register + class in lib/Target/X86/X86RegisterInfo.td for an example of this. +

Virtual registers are also denoted by integer numbers. Contrary to physical registers, different virtual registers never share the same number. Whereas physical registers are statically defined in a TargetRegisterInfo.td file and cannot be created by the application developer, that is not the case with virtual registers. In order to create new virtual registers, use the method MachineRegisterInfo::createVirtualRegister(). This method will return a new virtual register. Use an IndexedMap<Foo, VirtReg2IndexFunctor> to hold information per virtual register. If you need to enumerate all virtual registers, use the function TargetRegisterInfo::index2VirtReg() to find the virtual register numbers:

    + +
    +
  for (unsigned i = 0, e = MRI->getNumVirtRegs(); i != e; ++i) {
    unsigned VirtReg = TargetRegisterInfo::index2VirtReg(i);
    stuff(VirtReg);
  }
    +
    +

Before register allocation, the operands of an instruction are mostly virtual registers, although physical registers may also be used. To check which registers an instruction implicitly uses, consult TargetInstrInfo::get(opcode)::ImplicitUses. Pre-colored registers impose constraints on any register allocation algorithm: the register allocator must make sure that none of them are overwritten by the values of virtual registers while still alive.

    - + -
    +

There are two ways to map virtual registers to physical registers (or to memory slots). The first way, that we will call direct mapping, is based on the use of methods of the classes TargetRegisterInfo and MachineOperand; the developer must specify where load and store instructions should be inserted in the target function being compiled in order to get and store values in memory. To assign a physical register to a virtual register present in a given operand, use MachineOperand::setReg(p_reg). To insert a store instruction, use TargetInstrInfo::storeRegToStackSlot(...), and to insert a load instruction, use TargetInstrInfo::loadRegFromStackSlot.

    The indirect mapping shields the application developer from the complexities of inserting load and store instructions. In order to map a virtual register @@ -1486,11 +1770,11 @@ bool RegMapping_Fer::compatible_class(MachineFunction &mf,

    - + -
    +

    With very rare exceptions (e.g., function calls), the LLVM machine code instructions are three address instructions. That is, each instruction is @@ -1522,11 +1806,11 @@ bool RegMapping_Fer::compatible_class(MachineFunction &mf,

    - + -
    +

    An important transformation that happens during register allocation is called the SSA Deconstruction Phase. The SSA form simplifies many analyses @@ -1546,11 +1830,11 @@ bool RegMapping_Fer::compatible_class(MachineFunction &mf,

    - + -
    +

    Instruction folding is an optimization performed during register allocation that removes unnecessary copy instructions. For instance, a @@ -1583,32 +1867,38 @@ bool RegMapping_Fer::compatible_class(MachineFunction &mf, -

    + -
    +

The LLVM infrastructure provides the application developer with four different register allocators:

      -
• Fast — This register allocator is the default for debug builds. It allocates registers on a basic block level, attempting to keep values in registers and reusing registers as appropriate.

• Basic — This is an incremental approach to register allocation. Live ranges are assigned to registers one at a time in an order that is driven by heuristics. Since code can be rewritten on-the-fly during allocation, this framework allows interesting allocators to be developed as extensions. It is not itself a production register allocator but is a potentially useful stand-alone mode for triaging bugs and as a performance baseline.

• Greedy — The default allocator. This is a highly tuned implementation of the Basic allocator that incorporates global live range splitting. This allocator works hard to minimize the cost of spill code.

• PBQP — A Partitioned Boolean Quadratic Programming (PBQP) based register allocator. This allocator works by constructing a PBQP problem representing the register allocation problem under consideration, solving this using a PBQP solver, and mapping the solution back to a register assignment.

The type of register allocator used in llc can be chosen with the command line option -regalloc:

$ llc -regalloc=linearscan file.bc -o ln.s;
$ llc -regalloc=fast file.bc -o fa.s;
$ llc -regalloc=pbqp file.bc -o pbqp.s;
     
    +
    + -
    +

    Prolog/Epilog Code Insertion +

    + +
    + + +

    + Compact Unwind +

    + +
    + +

    Throwing an exception requires unwinding out of a function. The + information on how to unwind a given function is traditionally expressed in + DWARF unwind (a.k.a. frame) info. But that format was originally developed + for debuggers to backtrace, and each Frame Description Entry (FDE) requires + ~20-30 bytes per function. There is also the cost of mapping from an address + in a function to the corresponding FDE at runtime. An alternative unwind + encoding is called compact unwind and requires just 4-bytes per + function.

    + +

    The compact unwind encoding is a 32-bit value, which is encoded in an + architecture-specific way. It specifies which registers to restore and from + where, and how to unwind out of the function. When the linker creates a final + linked image, it will create a __TEXT,__unwind_info + section. This section is a small and fast way for the runtime to access + unwind info for any given function. If we emit compact unwind info for the + function, that compact unwind info will be encoded in + the __TEXT,__unwind_info section. If we emit DWARF unwind info, + the __TEXT,__unwind_info section will contain the offset of the + FDE in the __TEXT,__eh_frame section in the final linked + image.

    + +

    For X86, there are three modes for the compact unwind encoding:

    + +
    +
    Function with a Frame Pointer (EBP or RBP)
    +

    EBP/RBP-based frame, where EBP/RBP is pushed + onto the stack immediately after the return address, + then ESP/RSP is moved to EBP/RBP. Thus to + unwind, ESP/RSP is restored with the + current EBP/RBP value, then EBP/RBP is restored + by popping the stack, and the return is done by popping the stack once + more into the PC. All non-volatile registers that need to be restored must + have been saved in a small range on the stack that + starts EBP-4 to EBP-1020 (RBP-8 + to RBP-1020). The offset (divided by 4 in 32-bit mode and 8 + in 64-bit mode) is encoded in bits 16-23 (mask: 0x00FF0000). + The registers saved are encoded in bits 0-14 + (mask: 0x00007FFF) as five 3-bit entries from the following + table:

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Compact Number    i386 Register    x86-64 Register
      1               EBX              RBX
      2               ECX              R12
      3               EDX              R13
      4               EDI              R14
      5               ESI              R15
      6               EBP              RBP
    + +
    + +
Frameless with a Small Constant Stack Size (EBP or RBP is not used as a frame pointer)
    +

    To return, a constant (encoded in the compact unwind encoding) is added + to the ESP/RSP. Then the return is done by popping the stack + into the PC. All non-volatile registers that need to be restored must have + been saved on the stack immediately after the return address. The stack + size (divided by 4 in 32-bit mode and 8 in 64-bit mode) is encoded in bits + 16-23 (mask: 0x00FF0000). There is a maximum stack size of + 1024 bytes in 32-bit mode and 2048 in 64-bit mode. The number of registers + saved is encoded in bits 9-12 (mask: 0x00001C00). Bits 0-9 + (mask: 0x000003FF) contain which registers were saved and + their order. (See + the encodeCompactUnwindRegistersWithoutFrame() function + in lib/Target/X86FrameLowering.cpp for the encoding + algorithm.)

    + +
Frameless with a Large Constant Stack Size (EBP or RBP is not used as a frame pointer)
    +

    This case is like the "Frameless with a Small Constant Stack Size" + case, but the stack size is too large to encode in the compact unwind + encoding. Instead it requires that the function contains "subl + $nnnnnn, %esp" in its prolog. The compact encoding contains the + offset to the $nnnnnn value in the function in bits 9-12 + (mask: 0x00001C00).

    +
    + +
    +
    -

    To Be Written

    + - -

    To Be Written

    + +

    To Be Written

    + -
    +

    Code Emission +

    + +
    + +

    The code emission step of code generation is responsible for lowering from +the code generator abstractions (like MachineFunction, MachineInstr, etc) down +to the abstractions used by the MC layer (MCInst, +MCStreamer, etc). This is +done with a combination of several different classes: the (misnamed) +target-independent AsmPrinter class, target-specific subclasses of AsmPrinter +(such as SparcAsmPrinter), and the TargetLoweringObjectFile class.

    + +

    Since the MC layer works at the level of abstraction of object files, it +doesn't have a notion of functions, global variables etc. Instead, it thinks +about labels, directives, and instructions. A key class used at this time is +the MCStreamer class. This is an abstract API that is implemented in different +ways (e.g. to output a .s file, output an ELF .o file, etc) that is effectively +an "assembler API". MCStreamer has one method per directive, such as EmitLabel, +EmitSymbolAttribute, SwitchSection, etc, which directly correspond to assembly +level directives. +

    + +

    If you are interested in implementing a code generator for a target, there +are three important things that you have to implement for your target:

    + +
      +
1. First, you need a subclass of AsmPrinter for your target. This class implements the general lowering process, converting MachineFunction's into MC label constructs. The AsmPrinter base class provides a number of useful methods and routines, and also allows you to override the lowering process in some important ways. You should get much of the lowering for free if you are implementing an ELF, COFF, or MachO target, because the TargetLoweringObjectFile class implements much of the common logic.

2. Second, you need to implement an instruction printer for your target. The instruction printer takes an MCInst and renders it to a raw_ostream as text. Most of this is automatically generated from the .td file (when you specify something like "add $dst, $src1, $src2" in the instructions), but you need to implement routines to print operands.

3. Third, you need to implement code that lowers a MachineInstr to an MCInst, usually implemented in "<target>MCInstLower.cpp". This lowering process is often target specific, and is responsible for turning jump table entries, constant pool indices, global variable addresses, etc. into MCLabels as appropriate. This translation layer is also responsible for expanding pseudo ops used by the code generator into the actual machine instructions they correspond to. The MCInsts that are generated by this are fed into the instruction printer or the encoder.
    + +

Finally, at your choosing, you can also implement a subclass of +MCCodeEmitter which lowers MCInst's into machine code bytes and relocations. +This is important if you want to support direct .o file emission, or would like +to implement an assembler for your target.

    + +
    + + +

    + VLIW Packetizer +

    + +
    + +

In a Very Long Instruction Word (VLIW) architecture, the compiler is + responsible for mapping instructions to functional units available on + the architecture. To that end, the compiler creates groups of instructions + called packets or bundles. The VLIW packetizer in LLVM is + a target-independent mechanism to enable the packetization of machine + instructions.

    + + + +

    + Mapping from instructions to functional units +

    + +
    + +

    Instructions in a VLIW target can typically be mapped to multiple functional +units. During the process of packetizing, the compiler must be able to reason +about whether an instruction can be added to a packet. This decision can be +complex since the compiler has to examine all possible mappings of instructions +to functional units. Therefore to alleviate compilation-time complexity, the +VLIW packetizer parses the instruction classes of a target and generates tables +at compiler build time. These tables can then be queried by the provided +machine-independent API to determine if an instruction can be accommodated in a +packet.

    +
    + + +

    + + How the packetization tables are generated and used + +

    + +
    + +

    The packetizer reads instruction classes from a target's itineraries and +creates a deterministic finite automaton (DFA) to represent the state of a +packet. A DFA consists of three major elements: inputs, states, and +transitions. The set of inputs for the generated DFA represents the instruction +being added to a packet. The states represent the possible consumption +of functional units by instructions in a packet. In the DFA, transitions from +one state to another occur on the addition of an instruction to an existing +packet. If there is a legal mapping of functional units to instructions, then +the DFA contains a corresponding transition. The absence of a transition +indicates that a legal mapping does not exist and that the instruction cannot +be added to the packet.

    + +

    To generate tables for a VLIW target, add TargetGenDFAPacketizer.inc +as a target to the Makefile in the target directory. The exported API provides +three functions: DFAPacketizer::clearResources(), +DFAPacketizer::reserveResources(MachineInstr *MI), and +DFAPacketizer::canReserveResources(MachineInstr *MI). These functions +allow a target packetizer to add an instruction to an existing packet and to +check whether an instruction can be added to a packet. See +llvm/CodeGen/DFAPacketizer.h for more information.
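A minimal model of this reserve/can-reserve API might look like the following. This toy class is our own sketch: it tracks the set of feasible functional-unit assignments directly, whereas the real packetizer queries TableGen-generated DFA transition tables instead of recomputing assignments.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Toy packetizer state: each element of States is a bitmask of consumed
// functional units under one feasible assignment so far. An instruction
// class is a list of alternative unit masks it could occupy.
class ToyPacketizer {
  std::set<uint32_t> States{0};
public:
  void clearResources() { States = {0}; }

  bool canReserveResources(const std::vector<uint32_t> &Alts) const {
    for (uint32_t S : States)
      for (uint32_t A : Alts)
        if ((S & A) == 0)  // some free unit mapping exists
          return true;
    return false;
  }

  bool reserveResources(const std::vector<uint32_t> &Alts) {
    std::set<uint32_t> Next;
    for (uint32_t S : States)
      for (uint32_t A : Alts)
        if ((S & A) == 0)
          Next.insert(S | A);
    if (Next.empty())
      return false;  // no transition: instruction doesn't fit the packet
    States = std::move(Next);
    return true;
  }
};
```

With two functional units (bit 0 and bit 1), an instruction class that can only use unit 0 fills that unit, after which a second copy of it no longer fits, while a class that can use either unit still does.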

    + +
    +
    -

    To Be Written

    + +
    + + +

    + Implementing a Native Assembler +

    + + +
    + +

Though you're probably reading this because you want to write or maintain a +compiler backend, LLVM also fully supports building a native assembler. +We've tried hard to automate the generation of the assembler from the .td files +(in particular the instruction syntax and encodings), which means that a large +part of the manual and repetitive data entry can be factored and shared with the +compiler.

    + + +

    Instruction Parsing

    + +

    To Be Written

    + + + +

    + Instruction Alias Processing +

    + +
    +

    Once the instruction is parsed, it enters the MatchInstructionImpl function. +The MatchInstructionImpl function performs alias processing and then does +actual matching.

    + +

Alias processing is the phase that canonicalizes different lexical forms of +the same instructions down to one representation. There are several different +kinds of aliases that can be implemented; they are listed below in the +order in which they are processed (from simplest/weakest to most +complex/powerful). Generally you want to use the first alias mechanism that +meets the needs of your instruction, because it will allow a more concise +description.

    + -
    - Generating Assembly Code +

    Mnemonic Aliases

    + +
    + +

The first phase of alias processing is simple instruction mnemonic +remapping for classes of instructions which are allowed with two different +mnemonics. This phase is a simple and unconditional remapping from one input +mnemonic to one output mnemonic. It isn't possible for this form of alias to +look at the operands at all, so the remapping must apply for all forms of a +given mnemonic. Mnemonic aliases are defined simply; for example, X86 has: +

    + +
    +
    +def : MnemonicAlias<"cbw",     "cbtw">;
    +def : MnemonicAlias<"smovq",   "movsq">;
    +def : MnemonicAlias<"fldcww",  "fldcw">;
    +def : MnemonicAlias<"fucompi", "fucomip">;
    +def : MnemonicAlias<"ud2a",    "ud2">;
    +
    -

    To Be Written

    + +

... and many others. With a MnemonicAlias definition, the mnemonic is +remapped simply and directly. Though MnemonicAlias's can't look at any aspect +of the instruction (such as the operands), they can depend on global modes (the +same ones supported by the matcher), through a Requires clause:

    + +
    +
    +def : MnemonicAlias<"pushf", "pushfq">, Requires<[In64BitMode]>;
    +def : MnemonicAlias<"pushf", "pushfl">, Requires<[In32BitMode]>;
    +
    +
    + +

In this example, the mnemonic gets mapped into a different one depending +on the current instruction set.
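This mode-dependent remapping can be pictured as a table lookup keyed on a (mnemonic, mode) pair. The function below is our own toy illustration of the idea, not the TableGen-generated matcher code; the mode enum stands in for a Requires clause.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Toy stand-in for the matcher's global modes (a Requires<[...]> clause).
enum Mode { In32BitMode, In64BitMode };

// Unconditional per-mode mnemonic remapping: one input mnemonic maps to
// one output mnemonic; unaliased mnemonics pass through unchanged.
std::string remapMnemonic(const std::string &Mnemonic, Mode M) {
  static const std::map<std::pair<std::string, Mode>, std::string> Aliases = {
      {{"cbw",   In32BitMode}, "cbtw"},
      {{"cbw",   In64BitMode}, "cbtw"},
      {{"pushf", In32BitMode}, "pushfl"},
      {{"pushf", In64BitMode}, "pushfq"},
  };
  auto It = Aliases.find({Mnemonic, M});
  return It == Aliases.end() ? Mnemonic : It->second;
}
```

Here "pushf" remaps differently in 32-bit and 64-bit mode, mirroring the two MnemonicAlias definitions above, while a mnemonic with no alias is returned as-is.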

    + +
    + -
    - Generating Binary Machine Code +

    Instruction Aliases

    + +
    + +

    The most general phase of alias processing occurs while matching is +happening: it provides new forms for the matcher to match along with a specific +instruction to generate. An instruction alias has two parts: the string to +match and the instruction to generate. For example: +

    + +
    +
    +def : InstAlias<"movsx $src, $dst", (MOVSX16rr8W GR16:$dst, GR8  :$src)>;
    +def : InstAlias<"movsx $src, $dst", (MOVSX16rm8W GR16:$dst, i8mem:$src)>;
    +def : InstAlias<"movsx $src, $dst", (MOVSX32rr8  GR32:$dst, GR8  :$src)>;
    +def : InstAlias<"movsx $src, $dst", (MOVSX32rr16 GR32:$dst, GR16 :$src)>;
    +def : InstAlias<"movsx $src, $dst", (MOVSX64rr8  GR64:$dst, GR8  :$src)>;
    +def : InstAlias<"movsx $src, $dst", (MOVSX64rr16 GR64:$dst, GR16 :$src)>;
    +def : InstAlias<"movsx $src, $dst", (MOVSX64rr32 GR64:$dst, GR32 :$src)>;
    +
    +
    + +

This shows a powerful example of instruction aliases, matching the +same mnemonic in multiple different ways depending on what operands are present +in the assembly. The result of instruction aliases can include operands in a +different order than the destination instruction, and can use an input +multiple times, for example:

    + +
    +
    +def : InstAlias<"clrb $reg", (XOR8rr  GR8 :$reg, GR8 :$reg)>;
    +def : InstAlias<"clrw $reg", (XOR16rr GR16:$reg, GR16:$reg)>;
    +def : InstAlias<"clrl $reg", (XOR32rr GR32:$reg, GR32:$reg)>;
    +def : InstAlias<"clrq $reg", (XOR64rr GR64:$reg, GR64:$reg)>;
    +
    +
    + +

    This example also shows that tied operands are only listed once. In the X86 +backend, XOR8rr has two input GR8's and one output GR8 (where an input is tied +to the output). InstAliases take a flattened operand list without duplicates +for tied operands. The result of an instruction alias can also use immediates +and fixed physical registers which are added as simple immediate operands in the +result, for example:

    + +
    +
    +// Fixed Immediate operand.
    +def : InstAlias<"aad", (AAD8i8 10)>;
    +
    +// Fixed register operand.
    +def : InstAlias<"fcomi", (COM_FIr ST1)>;
    +
    +// Simple alias.
    +def : InstAlias<"fcomi $reg", (COM_FIr RST:$reg)>;
    +
    +
    + + +

    Instruction aliases can also have a Requires clause to make them +subtarget specific.

    + +

    If the back-end supports it, the instruction printer can automatically emit + the alias rather than what's being aliased. It typically leads to better, + more readable code. If it's better to print out what's being aliased, then + pass a '0' as the third parameter to the InstAlias definition.

    +
    -
    -

    For the JIT or .o file writer

    + +

    Instruction Matching

    + +

    To Be Written

    + +
    - + -
    +

This section of the document explains features or design decisions that are - specific to the code generator for a particular target. We start + with a table that summarizes which features each target supports.

    + specific to the code generator for a particular target. First we start + with a table that summarizes what features are supported by each target.

    + + +

    + Target Feature Matrix +

    + +
    + +

Note that this table does not include the C backend or Cpp backends, since +they do not use the target independent code generator infrastructure. It also +doesn't list features that are not supported fully by any target yet. It +considers a feature to be supported if at least one subtarget supports it. A +feature being supported means that it is useful and works for most cases; it +does not indicate that there are zero known bugs in the implementation. Here +is the key:

    + + + + + + + + + + + + + + + +
Unknown / No support / Partial Support / Complete Support
    + +

    Here is the table:

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    Target
Feature / ARM / CellSPU / Hexagon / MBlaze / MSP430 / Mips / PTX / PowerPC / Sparc / X86 / XCore
    is generally reliable
    assembly parser
    disassembler
    inline asm
    jit*
    .o file writing
    tail calls
    segmented stacks *
    + + +

    Is Generally Reliable

    + +
    +

This box indicates whether the target is considered to be production quality. +This means that the target has been used as a static compiler to +compile large amounts of code by a variety of different people and is in +continuous use.

    +
    + + +

    Assembly Parser

    + +
    +

    This box indicates whether the target supports parsing target specific .s +files by implementing the MCAsmParser interface. This is required for llvm-mc +to be able to act as a native assembler and is required for inline assembly +support in the native .o file writer.

    + +
    + + + +

    Disassembler

    + +
    +

    This box indicates whether the target supports the MCDisassembler API for +disassembling machine opcode bytes into MCInst's.

    + +
    + + +

    Inline Asm

    + +
    +

    This box indicates whether the target supports most popular inline assembly +constraints and modifiers.

    + +
    + + +

    JIT Support

    + +
    +

    This box indicates whether the target supports the JIT compiler through +the ExecutionEngine interface.

    + +

    The ARM backend has basic support for integer code +in ARM codegen mode, but lacks NEON and full Thumb support.

    + +
    + + +

    .o File Writing

    + +
    + +

This box indicates whether the target supports writing .o files (e.g. MachO, +ELF, and/or COFF) directly from the target. Note that the target also +must include an assembly parser and general inline assembly support for full +inline assembly support in the .o writer.

    + +

Targets that don't support this feature can obviously still write out .o +files; they just rely on having an external assembler to translate from a .s +file to a .o file (as is the case for many C compilers).

    + +
    + + +

    Tail Calls

    + +
    + +

This box indicates whether the target supports guaranteed tail calls. These +are calls marked "tail" and use the fastcc +calling convention. Please see the tail call section +for more details.

    + +
    + + +

    Segmented Stacks

    + +
    + +

    This box indicates whether the target supports segmented stacks. This +replaces the traditional large C stack with many linked segments. It +is compatible with the gcc +implementation used by the Go front end.

    + +

    Basic support exists on the X86 backend. Currently +vararg doesn't work and the object files are not marked the way the gold +linker expects, but simple Go programs can be built by dragonegg.

    + +
    - + -
    +

Tail call optimization, in which the callee reuses the stack of the caller, is currently supported on x86/x86-64 and PowerPC. It is performed if:

      -
    • Caller and callee have the calling convention fastcc.
    • +
    • Caller and callee have the calling convention fastcc or + cc 10 (GHC call convention).
    • The call is a tail call - in tail position (ret immediately follows call and ret uses value of call or is void).
    • @@ -1731,31 +2685,68 @@ define fastcc i32 @tailcaller(i32 %in1, i32 %in2) { (because one or more of above constraints are not met) to be followed by a readjustment of the stack. So performance might be worse in such cases.

      -

On x86 and x86-64 one register is reserved for indirect tail calls (e.g. via a - function pointer). So there is one less register for integer argument - passing. For x86 this means 2 registers (if inreg parameter - attribute is used) and for x86-64 this means 5 registers are used.

      +
    + +

    + Sibling call optimization +

    + +
    + +

Sibling call optimization is a restricted form of tail call optimization. + Unlike the tail call optimization described in the previous section, it can be + performed automatically on any tail call when the -tailcallopt option + is not specified.

    + +

    Sibling call optimization is currently performed on x86/x86-64 when the + following constraints are met:

    + +
      +
    • Caller and callee have the same calling convention. It can be either + c or fastcc. + +
    • The call is a tail call - in tail position (ret immediately follows call + and ret uses value of call or is void).
    • + +
    • Caller and callee have matching return type or the callee result is not + used. + +
• If any of the callee arguments are passed on the stack, they must be + available in the caller's own incoming argument stack and the frame offsets + must be the same. 
    + +

    Example:

    +
    +
    +declare i32 @bar(i32, i32)
    +
    +define i32 @foo(i32 %a, i32 %b, i32 %c) {
    +entry:
    +  %0 = tail call i32 @bar(i32 %a, i32 %b)
    +  ret i32 %0
    +}
    +
    +
    - + -
    +

    The X86 code generator lives in the lib/Target/X86 directory. This code generator is capable of targeting a variety of x86-32 and x86-64 processors, and includes support for ISA extensions such as MMX and SSE.

    -
    - - + -
    +

    The following are the known target triples that are supported by the X86 backend. This is not an exhaustive list, and it would be useful to add those @@ -1773,36 +2764,41 @@ define fastcc i32 @tailcaller(i32 %in1, i32 %in2) {

  • i386-pc-mingw32msvc — MingW crosscompiler on Linux
  • i686-apple-darwin* — Apple Darwin on X86
  • + +
  • x86_64-unknown-linux-gnu — Linux
  • - + -
    +

    The following target-specific calling conventions are known to backend:

      -
    • x86_StdCall — stdcall calling convention seen on Microsoft - Windows platform (CC ID = 64).
    • - -
    • x86_FastCall — fastcall calling convention seen on Microsoft - Windows platform (CC ID = 65).
    • +
    • x86_StdCall — stdcall calling convention seen on Microsoft + Windows platform (CC ID = 64).
    • +
    • x86_FastCall — fastcall calling convention seen on Microsoft + Windows platform (CC ID = 65).
    • +
    • x86_ThisCall — Similar to X86_StdCall. Passes first argument + in ECX, others via stack. Callee is responsible for stack cleaning. This + convention is used by MSVC by default for methods in its ABI + (CC ID = 70).
    - + -
    +

    The x86 has a very flexible way of accessing memory. It is capable of forming memory addresses of the following expression directly in integer @@ -1810,35 +2806,38 @@ define fastcc i32 @tailcaller(i32 %in1, i32 %in2) {

    -Base + [1,2,4,8] * IndexReg + Disp32
    +SegmentReg: Base + [1,2,4,8] * IndexReg + Disp32
     
    -

    In order to represent this, LLVM tracks no less than 4 operands for each +

    In order to represent this, LLVM tracks no less than 5 operands for each memory operand of this form. This means that the "load" form of 'mov' has the following MachineOperands in this order:

    -Index:        0     |    1        2       3           4
    -Meaning:   DestReg, | BaseReg,  Scale, IndexReg, Displacement
    -OperandTy: VirtReg, | VirtReg, UnsImm, VirtReg,   SignExtImm
    +Index:        0     |    1        2       3           4          5
    +Meaning:   DestReg, | BaseReg,  Scale, IndexReg, Displacement Segment
    +OperandTy: VirtReg, | VirtReg, UnsImm, VirtReg,   SignExtImm  PhysReg
     

    Stores, and all other instructions, treat the four memory operands in the - same way and in the same order.

    + same way and in the same order. If the segment register is unspecified + (regno = 0), then no segment override is generated. "Lea" operations do not + have a segment register specified, so they only have 4 operands for their + memory reference.
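The way these memory operands combine into an effective address can be sketched as follows. The struct and helper are a hypothetical illustration (LLVM's actual representation uses MachineOperands, and register operands hold register numbers rather than values); the segment override is only noted in a field, not modeled in the arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the x86 memory-operand fields described above:
//   Base + Scale * Index + Displacement, with an optional segment override.
struct X86MemOperand {
  int64_t  Base;     // value of the base register (0 if none)
  int64_t  Scale;    // 1, 2, 4 or 8
  int64_t  Index;    // value of the index register (0 if none)
  int64_t  Disp;     // sign-extended displacement
  unsigned Segment;  // segment register number; 0 = no override
};

int64_t effectiveAddress(const X86MemOperand &M) {
  // A real segment override would add the segment base; here we only
  // model the flat-address arithmetic.
  return M.Base + M.Scale * M.Index + M.Disp;
}
```

For example, a base of 0x1000, scale 4, index 3, and displacement 8 addresses byte 0x1000 + 4*3 + 8.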

    - + -
    +
    -

    x86 has an experimental feature which provides +

    x86 has a feature which provides the ability to perform loads and stores to different address spaces via the x86 segment registers. A segment override prefix byte on an instruction causes the instruction's memory access to go to the specified @@ -1877,11 +2876,11 @@ OperandTy: VirtReg, | VirtReg, UnsImm, VirtReg, SignExtImm

    - + -
    +

An instruction name consists of the base name, a default operand size, and a character per operand with an optional special size. For example:

    @@ -1897,25 +2896,25 @@ MOVSX32rm16 -> movsx, 32-bit register, 16-bit memory
    +
    + - + -
    +

The PowerPC code generator lives in the lib/Target/PowerPC directory. The code generation is retargetable to several variations or subtargets of the PowerPC ISA, including ppc32, ppc64 and altivec.

    -
    - - + -
    +

LLVM follows the AIX PowerPC ABI, with two deviations. LLVM uses PC-relative (PIC) or static addressing for accessing global values, so no TOC

    - + -
    +

    The size of a PowerPC frame is usually fixed for the duration of a function's invocation. Since the frame is fixed size, all references @@ -2078,11 +3077,11 @@ MOVSX32rm16 -> movsx, 32-bit register, 16-bit memory

    - + -
    +

    The llvm prolog and epilog are the same as described in the PowerPC ABI, with the following exceptions. Callee saved registers are spilled after the frame @@ -2095,16 +3094,83 @@ MOVSX32rm16 -> movsx, 32-bit register, 16-bit memory

    - + -
    +

    TODO - More to come.

    +
    + + +

    + The PTX backend +

    + +
    + +

    The PTX code generator lives in the lib/Target/PTX directory. It is + currently a work-in-progress, but already supports most of the code + generation functionality needed to generate correct PTX kernels for + CUDA devices.

    + +

    The code generator can target PTX 2.0+, and shader model 1.0+. The + PTX ISA Reference Manual is used as the primary source of ISA + information, though an effort is made to make the output of the code + generator match the output of the NVidia nvcc compiler, whenever + possible.

    + +

    Code Generator Options:

    + + + + + + + + + + + + + + + + + +
Option: Description
double: If enabled, the map_f64_to_f32 directive is disabled in the PTX output, allowing native double-precision arithmetic
no-fma: Disable generation of Fused-Multiply Add instructions, which may be beneficial for some devices
smxy / computexy: Set shader model/compute capability to x.y, e.g. sm20 or compute13
    + +

    Working:

    +
      +
    • Arithmetic instruction selection (including combo FMA)
    • +
    • Bitwise instruction selection
    • +
    • Control-flow instruction selection
    • +
    • Function calls (only on SM 2.0+ and no return arguments)
    • +
    • Addresses spaces (0 = global, 1 = constant, 2 = local, 4 = + shared)
    • +
    • Thread synchronization (bar.sync)
    • +
    • Special register reads ([N]TID, [N]CTAID, PMx, CLOCK, etc.)
    • +
    + +

    In Progress:

    +
      +
    • Robust call instruction selection
    • +
    • Stack frame allocation
    • +
    • Device-specific instruction scheduling optimizations
    • +
    + + +
    + +

    @@ -2115,7 +3181,7 @@ MOVSX32rm16 -> movsx, 32-bit register, 16-bit memory src="http://www.w3.org/Icons/valid-html401-blue" alt="Valid HTML 4.01"> Chris Lattner
    - The LLVM Compiler Infrastructure
    + The LLVM Compiler Infrastructure
    Last modified: $Date$