X-Git-Url: http://plrg.eecs.uci.edu/git/?a=blobdiff_plain;ds=sidebyside;f=lib%2FTarget%2FARM%2FREADME.txt;h=a75007cd4f896741b50fb3f338e688df43407453;hb=93305bc4620c12042b11a5e721feda7892d2f65d;hp=04bf9588ad42fdabb206270ab7ffad38ecaa7967;hpb=2d1222c060f56909c57c7828b6cad6a3e25017e2;p=oota-llvm.git

diff --git a/lib/Target/ARM/README.txt b/lib/Target/ARM/README.txt
index 04bf9588ad4..a75007cd4f8 100644
--- a/lib/Target/ARM/README.txt
+++ b/lib/Target/ARM/README.txt
@@ -17,20 +17,34 @@ Reimplement 'select' in terms of 'SEL'.
 
 //===---------------------------------------------------------------------===//
 
-The constant island pass is extremely naive. If a constant pool entry is
-out of range, it *always* splits a block and inserts a copy of the cp
-entry inline. It should:
+Crazy idea: Consider code that uses lots of 8-bit or 16-bit values. By the
+time regalloc happens, these values are now in a 32-bit register, usually with
+the top bits known to be sign or zero extended. If spilled, we should be able
+to spill these to an 8-bit or 16-bit stack slot, zero or sign extending as part
+of the reload.
 
-1. Check to see if there is already a copy of this constant nearby. If so,
-   reuse it.
-2. Instead of always splitting blocks to insert the constant, insert it in
-   nearby 'water'.
-3. Constant island references should be ref counted. If a constant reference
-   is out-of-range, and the last reference to a constant is relocated, the
-   dead constant should be removed.
+Doing this reduces the size of the stack frame (important for Thumb etc.), and
+also increases the likelihood that we will be able to reload multiple values
+from the stack with a single load.
 
-This pass has all the framework needed to implement this, but it hasn't
-been done.
+//===---------------------------------------------------------------------===//
+
+The constant island pass is in good shape. Some cleanups might be desirable,
+but there is unlikely to be much improvement in the generated code.
+
+1. There may be some advantage to trying to be smarter about the initial
+placement, rather than putting everything at the end.
+
+2. The handling of 2-byte padding for Thumb is overly conservative. There
+would be a small gain from keeping accurate track of the padding (which would
+require aligning functions containing constant pools to 4-byte boundaries).
+
+3. There might be some compile-time efficiency to be had by representing
+consecutive islands as a single block rather than multiple blocks.
+
+4. Use a priority queue to sort constant pool users in inverse order of
+   position, so we always process the one closest to the end of the function
+   first. This may simplify CreateNewWater.
 
 //===---------------------------------------------------------------------===//
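
(Editor's sketch, not part of the patch: a minimal C illustration of the
8-bit/16-bit spill idea above, using hypothetical names. The external call is
likely to force several of these i8 values to be spilled; today each spill
takes a full 32-bit stack slot and a str/ldr pair, while a 1-byte slot with
strb and a zero-extending ldrb reload would be equally correct and would
shrink the frame.)

extern void spill_point(void);   /* hypothetical; just forces a call site */

void copy_bytes(unsigned char *dst, const unsigned char *src) {
  unsigned char a = src[0], b = src[1], c = src[2], d = src[3], e = src[4],
                f = src[5], g = src[6], h = src[7], i = src[8], j = src[9];
  spill_point();                 /* all ten i8 values are live across this */
  dst[0] = a; dst[1] = b; dst[2] = c; dst[3] = d; dst[4] = e;
  dst[5] = f; dst[6] = g; dst[7] = h; dst[8] = i; dst[9] = j;
}
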
@@ -142,6 +156,29 @@ odd/even pair. However, we probably would pay a penalty if the address is not
 aligned on 8-byte boundary. This requires more information on load / store
 nodes (and MI's?) than we currently carry.
+
+6) struct copies appear to be done field by field
+instead of by words, at least sometimes:
+
+struct foo { int x; short s; char c1; char c2; };
+void cpy(struct foo*a, struct foo*b) { *a = *b; }
+
+llvm code (-O2)
+        ldrb r3, [r1, #+6]
+        ldr r2, [r1]
+        ldrb r12, [r1, #+7]
+        ldrh r1, [r1, #+4]
+        str r2, [r0]
+        strh r1, [r0, #+4]
+        strb r3, [r0, #+6]
+        strb r12, [r0, #+7]
+gcc code (-O2)
+        ldmia r1, {r1-r2}
+        stmia r0, {r1-r2}
+
+In this benchmark, poor handling of aggregate copies has shown up as
+having a large effect on size, and possibly speed as well (we don't have
+a good way to measure on ARM).
 
 //===---------------------------------------------------------------------===//
 
 * Consider this silly example:
@@ -284,53 +321,8 @@ See McCat/18-imp/ComputeBoundingBoxes for an example.
 
 //===---------------------------------------------------------------------===//
 
-We need register scavenging. Currently, the 'ip' register is reserved in case
-frame indexes are too big. This means that we generate extra code for stuff
-like this:
-
-void foo(unsigned x, unsigned y, unsigned z, unsigned *a, unsigned *b, unsigned *c) {
-  short Rconst = (short) (16384.0f * 1.40200 + 0.5 );
-  *a = x * Rconst;
-  *b = y * Rconst;
-  *c = z * Rconst;
-}
-
-we compile it to:
-
-_foo:
-***     stmfd sp!, {r4, r7}
-***     add r7, sp, #4
-        mov r4, #186
-        orr r4, r4, #89, 24 @ 22784
-        mul r0, r0, r4
-        str r0, [r3]
-        mul r0, r1, r4
-        ldr r1, [sp, #+8]
-        str r0, [r1]
-        mul r0, r2, r4
-        ldr r1, [sp, #+12]
-        str r0, [r1]
-***     sub sp, r7, #4
-***     ldmfd sp!, {r4, r7}
-        bx lr
-
-GCC produces:
-
-_foo:
-        ldr ip, L4
-        mul r0, ip, r0
-        mul r1, ip, r1
-        str r0, [r3, #0]
-        ldr r3, [sp, #0]
-        mul r2, ip, r2
-        str r1, [r3, #0]
-        ldr r3, [sp, #4]
-        str r2, [r3, #0]
-        bx lr
-L4:
-        .long 22970
-
-This is apparently all because we couldn't use ip here.
+Register scavenging is now implemented. The example in the previous version
+of this document produces optimal code at -O2.
 
 //===---------------------------------------------------------------------===//
 
@@ -451,3 +443,91 @@ http://www.inf.u-szeged.hu/gcc-arm/
 http://citeseer.ist.psu.edu/debus04linktime.html
 
 //===---------------------------------------------------------------------===//
+
+gcc generates smaller code for this function at -O2 or -Os:
+
+void foo(signed char* p) {
+  if (*p == 3)
+    bar();
+  else if (*p == 4)
+    baz();
+  else if (*p == 5)
+    quux();
+}
+
+llvm decides it's a good idea to turn the repeated if...else into a
+binary tree, as if it were a switch; the resulting code requires one fewer
+compare-and-branch when *p<=2 or *p==5, the same number if *p==4
+or *p>6, and one more if *p==3. So it should be a speed win
+(on balance). However, the revised code is larger, with 4 conditional
+branches instead of 3.
+
+More seriously, there is a byte->word extend before
+each comparison, where there should be only one, and the condition codes
+are not remembered when the same two values are compared twice.
+
+//===---------------------------------------------------------------------===//
+
+More register scavenging work:
+
+1. Use the register scavenger to track frame indices materialized into
+   registers (those that do not fit in addressing modes) to allow reuse in
+   the same BB (see the editor's sketch after this list).
+2. Finish scavenging for Thumb.
+3. We know some spills and restores are unnecessary. The issue is that once
+   live intervals are merged, they are never split. So every def is spilled
+   and every use requires a restore if the register allocator decides the
+   resulting live interval is not assigned a physical register. It may be
+   possible (with the help of the scavenger) to turn some spill / restore
+   pairs into register copies.
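
(Editor's sketch, not part of the patch: a minimal C illustration of item 1,
using hypothetical names. A stack slot whose offset from the base register is
outside the immediate range of ldr/str (+/-4095 bytes in ARM mode, much less
in Thumb) has its address materialized into a scratch register first; the
scavenger could let the accesses below share one materialized copy instead of
rebuilding it for every store.)

int big_frame(int i) {
  volatile int buf[4096];   /* 16 KB frame: the upper slots are far away     */
  buf[4000] = i;            /* offset 16000 does not fit an immediate        */
  buf[4001] = i + 1;        /* ...so the address is built in a register      */
  buf[4002] = i + 2;        /* ...which could be reused for all three stores */
  return buf[4000];
}
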
+
+//===---------------------------------------------------------------------===//
+
+More LSR enhancements possible:
+
+1. Teach LSR about pre- and post-indexed ops, so that the iv increment can be
+   merged into a load / store (see the editor's sketch at the end).
+2. Allow iv reuse even when a type conversion is required. For example, i8
+   and i32 load / store addressing modes are identical.
+
+//===---------------------------------------------------------------------===//
+
+This:
+
+int foo(int a, int b, int c, int d) {
+  long long acc = (long long)a * (long long)b;
+  acc += (long long)c * (long long)d;
+  return (int)(acc >> 32);
+}
+
+Should compile to use SMLAL (Signed Multiply Accumulate Long), which multiplies
+two signed 32-bit values to produce a 64-bit value and accumulates this with
+a 64-bit value.
+
+We currently get this with v6:
+
+_foo:
+        mul r12, r1, r0
+        smmul r1, r1, r0
+        smmul r0, r3, r2
+        mul r3, r3, r2
+        adds r3, r3, r12
+        adc r0, r0, r1
+        bx lr
+
+and this with v4:
+
+_foo:
+        stmfd sp!, {r7, lr}
+        mov r7, sp
+        mul r12, r1, r0
+        smull r0, r1, r1, r0
+        smull lr, r0, r3, r2
+        mul r3, r3, r2
+        adds r3, r3, r12
+        adc r0, r0, r1
+        ldmfd sp!, {r7, pc}
+
+This apparently occurs in real code.
+
+//===---------------------------------------------------------------------===//
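
(Editor's sketch, not part of the patch: a minimal C illustration of LSR
item 1 above, using hypothetical names. With a post-indexed load such as
"ldr r3, [r1], #4" the pointer bump happens as part of the load itself, so the
separate add of the induction variable can disappear from the loop body.)

int sum_words(const int *p, int n) {
  int s = 0;
  while (n-- > 0)
    s += *p++;   /* load + increment: a candidate for a post-indexed ldr */
  return s;
}
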