* We would really like to support UXTAB16, but we need to prove that the
add cannot carry between the two 16-bit chunks.
-* implement predication support
* Implement pre/post increment support. (e.g. PR935)
-* Coalesce stack slots!
* Implement smarter constant generation for binops with large immediates.
-* Consider materializing FP constants like 0.0f and 1.0f using integer
- immediate instructions then copy to FPU. Slower than load into FPU?
+A few ARMv6T2 ops should be pattern matched: BFI, SBFX, and UBFX
+
+Interesting optimization for PIC codegen on arm-linux:
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43129
+
+//===---------------------------------------------------------------------===//
+
+Crazy idea: Consider code that uses lots of 8-bit or 16-bit values. By the
+time regalloc happens, these values are now in a 32-bit register, usually with
+the top-bits known to be sign or zero extended. If spilled, we should be able
+to spill these to an 8-bit or 16-bit stack slot, zero or sign extending as part
+of the reload.
+
+Doing this reduces the size of the stack frame (important for thumb etc), and
+also increases the likelihood that we will be able to reload multiple values
+from the stack with a single load.
//===---------------------------------------------------------------------===//
-The constant island pass is extremely naive. If a constant pool entry is
-out of range, it *always* splits a block and inserts a copy of the cp
-entry inline. It should:
+The constant island pass is in good shape. Some cleanups might be desirable,
+but there is unlikely to be much improvement in the generated code.
-1. Check to see if there is already a copy of this constant nearby. If so,
- reuse it.
-2. Instead of always splitting blocks to insert the constant, insert it in
- nearby 'water'.
-3. Constant island references should be ref counted. If a constant reference
- is out-of-range, and the last reference to a constant is relocated, the
- dead constant should be removed.
+1. There may be some advantage to trying to be smarter about the initial
+placement, rather than putting everything at the end.
-This pass has all the framework needed to implement this, but it hasn't
-been done.
+2. There might be some compile-time efficiency to be had by representing
+consecutive islands as a single block rather than multiple blocks.
+
+3. Use a priority queue to sort constant pool users in inverse order of
+ position so we always process the one closest to the end of the function
+ first. This may simplify CreateNewWater.
//===---------------------------------------------------------------------===//
-We need to start generating predicated instructions. The .td files have a way
-to express this now (see the PPC conditional return instruction), but the
-branch folding pass (or a new if-cvt pass) should start producing these, at
-least in the trivial case.
+Eliminate copysign custom expansion. We are still generating crappy code with
+default expansion + if-conversion.
-Among the obvious wins, doing so can eliminate the need to custom expand
-copysign (i.e. we won't need to custom expand it to get the conditional
-negate).
+//===---------------------------------------------------------------------===//
-This allows us to eliminate one instruction from:
+Eliminate one instruction from:
define i32 @_Z6slow4bii(i32 %x, i32 %y) {
%tmp = icmp sgt i32 %x, %y
movgt r1, r0
mov r0, r1
bx lr
+=>
+
+__Z6slow4bii:
+ cmp r0, r1
+ movle r0, r1
+ bx lr
//===---------------------------------------------------------------------===//
//===---------------------------------------------------------------------===//
-We currently compile abs:
-int foo(int p) { return p < 0 ? -p : p; }
-
-into:
-
-_foo:
- rsb r1, r0, #0
- cmn r0, #1
- movgt r1, r0
- mov r0, r1
- bx lr
-
-This is very, uh, literal. This could be a 3 operation sequence:
- t = (p sra 31);
- res = (p xor t)-t
-
-Which would be better. This occurs in png decode.
-
-//===---------------------------------------------------------------------===//
-
More load / store optimizations:
-1) Look past instructions without side-effects (not load, store, branch, etc.)
- when forming the list of loads / stores to optimize.
-
-2) Smarter register allocation?
-We are probably missing some opportunities to use ldm / stm. Consider:
-
-ldr r5, [r0]
-ldr r4, [r0, #4]
-
-This cannot be merged into a ldm. Perhaps we will need to do the transformation
-before register allocation. Then teach the register allocator to allocate a
-chunk of consecutive registers.
-
-3) Better representation for block transfer? This is from Olden/power:
+1) Better representation for block transfer? This is from Olden/power:
fldd d0, [r4]
fstd d0, [r4, #+32]
If we can spare the registers, it would be better to use fldm and fstm here.
Need major register allocator enhancement though.
-4) Can we recognize the relative position of constantpool entries? i.e. Treat
+2) Can we recognize the relative position of constantpool entries? i.e. Treat
ldr r0, LCPI17_3
ldr r1, LCPI17_4
.long -858993459
.long 1074318540
-5) Can we make use of ldrd and strd? Instead of generating ldm / stm, use
-ldrd/strd instead if there are only two destination registers that form an
-odd/even pair. However, we probably would pay a penalty if the address is not
-aligned on 8-byte boundary. This requires more information on load / store
-nodes (and MI's?) then we currently carry.
+3) struct copies appear to be done field by field
+instead of by words, at least sometimes:
+
+struct foo { int x; short s; char c1; char c2; };
+void cpy(struct foo*a, struct foo*b) { *a = *b; }
+
+llvm code (-O2)
+ ldrb r3, [r1, #+6]
+ ldr r2, [r1]
+ ldrb r12, [r1, #+7]
+ ldrh r1, [r1, #+4]
+ str r2, [r0]
+ strh r1, [r0, #+4]
+ strb r3, [r0, #+6]
+ strb r12, [r0, #+7]
+gcc code (-O2)
+ ldmia r1, {r1-r2}
+ stmia r0, {r1-r2}
+
+In this benchmark poor handling of aggregate copies has shown up as
+having a large effect on size, and possibly speed as well (we don't have
+a good way to measure on ARM).
//===---------------------------------------------------------------------===//
}
_bar:
- sub sp, sp, #16
- str r4, [sp, #+12]
- str r5, [sp, #+8]
- str lr, [sp, #+4]
- mov r4, r0
- mov r5, r1
- ldr r0, LCPI2_0
- bl _foo
- fmsr f0, r0
- fcvtsd d0, f0
- fmdrr d1, r4, r5
- faddd d0, d0, d1
- fmrrd r0, r1, d0
- ldr lr, [sp, #+4]
- ldr r5, [sp, #+8]
- ldr r4, [sp, #+12]
- add sp, sp, #16
- bx lr
+ stmfd sp!, {r4, r5, r7, lr}
+ add r7, sp, #8
+ mov r4, r0
+ mov r5, r1
+ fldd d0, LCPI1_0
+ fmrrd r0, r1, d0
+ bl _foo
+ fmdrr d0, r4, r5
+ fmsr s2, r0
+ fsitod d1, s2
+ faddd d0, d1, d0
+ fmrrd r0, r1, d0
+ ldmfd sp!, {r4, r5, r7, pc}
Ignore the prologue and epilogue stuff for a second. Note
mov r4, r0
//===---------------------------------------------------------------------===//
-We need register scavenging. Currently, the 'ip' register is reserved in case
-frame indexes are too big. This means that we generate extra code for stuff
-like this:
-
-void foo(unsigned x, unsigned y, unsigned z, unsigned *a, unsigned *b, unsigned *c) {
- short Rconst = (short) (16384.0f * 1.40200 + 0.5 );
- *a = x * Rconst;
- *b = y * Rconst;
- *c = z * Rconst;
-}
-
-we compile it to:
-
-_foo:
-*** stmfd sp!, {r4, r7}
-*** add r7, sp, #4
- mov r4, #186
- orr r4, r4, #89, 24 @ 22784
- mul r0, r0, r4
- str r0, [r3]
- mul r0, r1, r4
- ldr r1, [sp, #+8]
- str r0, [r1]
- mul r0, r2, r4
- ldr r1, [sp, #+12]
- str r0, [r1]
-*** sub sp, r7, #4
-*** ldmfd sp!, {r4, r7}
- bx lr
-
-GCC produces:
-
-_foo:
- ldr ip, L4
- mul r0, ip, r0
- mul r1, ip, r1
- str r0, [r3, #0]
- ldr r3, [sp, #0]
- mul r2, ip, r2
- str r1, [r3, #0]
- ldr r3, [sp, #4]
- str r2, [r3, #0]
- bx lr
-L4:
- .long 22970
-
-This is apparently all because we couldn't use ip here.
-
-//===---------------------------------------------------------------------===//
-
Pre-/post- indexed load / stores:
1) We should not make the pre/post- indexed load/store transform if the base ptr
4) Once we added support for multiple result patterns, write indexed loads
patterns instead of C++ instruction selection code.
-5) Use FLDM / FSTM to emulate indexed FP load / store.
-
-//===---------------------------------------------------------------------===//
-
-We should add i64 support to take advantage of the 64-bit load / stores.
-We can add a pseudo i64 register class containing pseudo registers that are
-register pairs. All other ops (e.g. add, sub) would be expanded as usual.
-
-We need to add pseudo instructions (i.e. gethi / getlo) to extract i32 registers
-from the i64 register. These are single moves which can be eliminated if the
-destination register is a sub-register of the source. We should implement proper
-subreg support in the register allocator to coalesce these away.
-
-There are other minor issues such as multiple instructions for a spill / restore
-/ move.
+5) Use VLDM / VSTM to emulate indexed FP load / store.
//===---------------------------------------------------------------------===//
http://citeseer.ist.psu.edu/debus04linktime.html
//===---------------------------------------------------------------------===//
+
+gcc generates smaller code for this function at -O2 or -Os:
+
+void foo(signed char* p) {
+ if (*p == 3)
+ bar();
+ else if (*p == 4)
+ baz();
+ else if (*p == 5)
+ quux();
+}
+
+llvm decides it's a good idea to turn the repeated if...else into a
+binary tree, as if it were a switch; the resulting code requires -1
+compare-and-branches when *p<=2 or *p==5, the same number if *p==4
+or *p>6, and +1 if *p==3. So it should be a speed win
+(on balance). However, the revised code is larger, with 4 conditional
+branches instead of 3.
+
+More seriously, there is a byte->word extend before
+each comparison, where there should be only one, and the condition codes
+are not remembered when the same two values are compared twice.
+
+//===---------------------------------------------------------------------===//
+
+More LSR enhancements possible:
+
+1. Teach LSR about pre- and post- indexed ops to allow the iv increment to be
+ merged into a load / store.
+2. Allow iv reuse even when a type conversion is required. For example, i8
+ and i32 load / store addressing modes are identical.
+
+
+//===---------------------------------------------------------------------===//
+
+This:
+
+int foo(int a, int b, int c, int d) {
+ long long acc = (long long)a * (long long)b;
+ acc += (long long)c * (long long)d;
+ return (int)(acc >> 32);
+}
+
+Should compile to use SMLAL (Signed Multiply Accumulate Long) which multiplies
+two signed 32-bit values to produce a 64-bit value, and accumulates this with
+a 64-bit value.
+
+We currently get this with both v4 and v6:
+
+_foo:
+ smull r1, r0, r1, r0
+ smull r3, r2, r3, r2
+ adds r3, r3, r1
+ adc r0, r2, r0
+ bx lr
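
As a rough illustration of the intended lowering, the C sketch below models the
accumulate step as a single operation. The names smlal_model, foo_smlal and
smlal_self_check are hypothetical and exist only for this sketch; the model
assumes SMLAL wraps modulo 2^64 and that >> on a negative 64-bit value is an
arithmetic shift (implementation-defined in C, but true of the usual ARM
compilers).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of SMLAL: acc += (int64_t)rn * rm, wrapping mod 2^64. */
int64_t smlal_model(int64_t acc, int32_t rn, int32_t rm) {
    int64_t product = (int64_t)rn * (int64_t)rm;
    return (int64_t)((uint64_t)acc + (uint64_t)product);
}

/* The function from the note, written as one SMULL followed by one SMLAL. */
int foo_smlal(int a, int b, int c, int d) {
    int64_t acc = (int64_t)a * (int64_t)b;   /* SMULL */
    acc = smlal_model(acc, c, d);            /* SMLAL */
    return (int)(acc >> 32);                 /* the high word is the result */
}

/* Small sanity checks on values whose result is easy to verify by hand. */
void smlal_self_check(void) {
    assert(foo_smlal(65536, 65536, 0, 0) == 1);                   /* 2^32 >> 32 */
    assert(foo_smlal(1 << 20, 1 << 20, 1 << 20, 1 << 20) == 512); /* 2^41 >> 32 */
    assert(foo_smlal(3, 4, 5, 6) == 0);
}
```

With this shape the separate adds/adc pair disappears, because the accumulate
is part of the second multiply.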
+
+//===---------------------------------------------------------------------===//
+
+This:
+ #include <utility>
+ std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
+ { return std::make_pair(a + b, a + b < a); }
+ bool no_overflow(unsigned a, unsigned b)
+ { return !full_add(a, b).second; }
+
+Should compile to:
+
+_Z8full_addjj:
+ adds r2, r1, r2
+ movcc r1, #0
+ movcs r1, #1
+ str r2, [r0, #0]
+ strb r1, [r0, #4]
+ mov pc, lr
+
+_Z11no_overflowjj:
+ cmn r0, r1
+ movcs r0, #0
+ movcc r0, #1
+ mov pc, lr
+
+not:
+
+__Z8full_addjj:
+ add r3, r2, r1
+ str r3, [r0]
+ mov r2, #1
+ mov r12, #0
+ cmp r3, r1
+ movlo r12, r2
+ str r12, [r0, #+4]
+ bx lr
+__Z11no_overflowjj:
+ add r3, r1, r0
+ mov r2, #1
+ mov r1, #0
+ cmp r3, r0
+ movhs r1, r2
+ mov r0, r1
+ bx lr
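
The desired code relies on the fact that the source-level test "a + b < a" is
exactly the carry out of the 32-bit add, so the flags set by ADDS (or CMN)
already hold the answer. A minimal C sketch of that equivalence, assuming
32-bit unsigned arithmetic (adds_carry, wrapped and full_add_self_check are
hypothetical names used only here):

```c
#include <assert.h>
#include <stdint.h>

/* Carry flag that ADDS would produce for a + b: bit 32 of the wide sum. */
int adds_carry(uint32_t a, uint32_t b) {
    return (int)(((uint64_t)a + (uint64_t)b) >> 32);
}

/* The source-level overflow test used by full_add(). */
int wrapped(uint32_t a, uint32_t b) {
    return (uint32_t)(a + b) < a;
}

/* The two agree on a few boundary cases. */
void full_add_self_check(void) {
    assert(adds_carry(0xFFFFFFFFu, 1u) == 1 && wrapped(0xFFFFFFFFu, 1u) == 1);
    assert(adds_carry(1u, 2u) == 0 && wrapped(1u, 2u) == 0);
    assert(adds_carry(0x80000000u, 0x80000000u) == 1);
    assert(wrapped(0x80000000u, 0x80000000u) == 1);
}
```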
+
+//===---------------------------------------------------------------------===//
+
+Some of the NEON intrinsics may be appropriate for more general use, either
+as target-independent intrinsics or perhaps elsewhere in the ARM backend.
+Some of them may also be lowered to target-independent SDNodes, and perhaps
+some new SDNodes could be added.
+
+For example, maximum, minimum, and absolute value operations are well-defined
+and standard operations, both for vector and scalar types.
+
+The current NEON-specific intrinsics for count leading zeros and count one
+bits could perhaps be replaced by the target-independent ctlz and ctpop
+intrinsics. It may also make sense to add a target-independent "ctls"
+intrinsic for "count leading sign bits". Likewise, the backend could use
+the target-independent SDNodes for these operations.
+
+ARMv6 has scalar saturating and halving adds and subtracts. The same
+intrinsics could possibly be used for both NEON's vector implementations of
+those operations and the ARMv6 scalar versions.
+
+//===---------------------------------------------------------------------===//
+
+Split out LDR (literal) from the normal ARM LDR instruction. Also consider
+splitting LDR into imm12 and so_reg forms. This allows us to clean up some code,
+e.g. ARMLoadStoreOptimizer does not need to look at LDR (literal) and LDR
+(so_reg), while ARMConstantIslandPass only needs to worry about LDR (literal).
+
+//===---------------------------------------------------------------------===//
+
+Constant island pass should make use of full range SoImm values for LEApcrel.
+Be careful though as the last attempt caused infinite looping on lencod.
+
+//===---------------------------------------------------------------------===//
+
+Predication issue. This function:
+
+extern unsigned array[ 128 ];
+int foo( int x ) {
+ int y;
+ y = array[ x & 127 ];
+ if ( x & 128 )
+ y = 123456789 & ( y >> 2 );
+ else
+ y = 123456789 & y;
+ return y;
+}
+
+compiles to:
+
+_foo:
+ and r1, r0, #127
+ ldr r2, LCPI1_0
+ ldr r2, [r2]
+ ldr r1, [r2, +r1, lsl #2]
+ mov r2, r1, lsr #2
+ tst r0, #128
+ moveq r2, r1
+ ldr r0, LCPI1_1
+ and r0, r2, r0
+ bx lr
+
+It would be better to do something like this, to fold the shift into the
+conditional move:
+
+ and r1, r0, #127
+ ldr r2, LCPI1_0
+ ldr r2, [r2]
+ ldr r1, [r2, +r1, lsl #2]
+ tst r0, #128
+ movne r1, r1, lsr #2
+ ldr r0, LCPI1_1
+ and r0, r1, r0
+ bx lr
+
+This saves an instruction and a register.
+
+//===---------------------------------------------------------------------===//
+
+It might be profitable to cse MOVi16 if there are lots of 32-bit immediates
+with the same bottom half.
+
+//===---------------------------------------------------------------------===//
+
+Robert Muth started working on an alternate jump table implementation that
+does not put the tables in-line in the text. This is more like the llvm
+default jump table implementation. This might be useful sometime. Several
+revisions of patches are on the mailing list, beginning at:
+http://lists.cs.uiuc.edu/pipermail/llvmdev/2009-June/022763.html
+
+//===---------------------------------------------------------------------===//
+
+Make use of the "rbit" instruction.
+
+//===---------------------------------------------------------------------===//
+
+Take a look at test/CodeGen/Thumb2/machine-licm.ll. ARM should be taught how
+to licm and cse the unnecessary load from cp#1.
+
+//===---------------------------------------------------------------------===//
+
+The CMN instruction sets the flags like an ADD instruction, while CMP sets
+them like a subtract. Therefore to be able to use CMN for comparisons other
+than the Z bit, we'll need additional logic to reverse the conditionals
+associated with the comparison. Perhaps a pseudo-instruction for the comparison,
+with a post-codegen pass to clean up and handle the condition codes?
+See PR5694 for testcase.
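
To make the problem concrete, the C sketch below models the NZCV flags of
"CMN a, b" (a + b) and "CMP a, b" (a + ~b + 1). It is a simplified model
written for this note, not code from the backend; flags_t, add_flags,
cmn_flags, cmp_flags and cmn_self_check are hypothetical names. Replacing
"CMP a, -b" with "CMN a, b" keeps N and Z, but C differs when b is 0 and V
differs when b is 0x80000000.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int n, z, c, v; } flags_t;

/* Flags for the 32-bit sum a + b + carry_in, as an ARM flag-setting add
   computes them. */
flags_t add_flags(uint32_t a, uint32_t b, uint32_t carry_in) {
    uint64_t wide = (uint64_t)a + (uint64_t)b + (uint64_t)carry_in;
    uint32_t r = (uint32_t)wide;
    flags_t f;
    f.n = (int)(r >> 31);
    f.z = (r == 0);
    f.c = (int)(wide >> 32);
    /* Signed overflow: operands share a sign and the result does not. */
    f.v = (int)((~(a ^ b) & (a ^ r)) >> 31);
    return f;
}

flags_t cmn_flags(uint32_t a, uint32_t b) { return add_flags(a, b, 0); }
flags_t cmp_flags(uint32_t a, uint32_t b) { return add_flags(a, ~b, 1); }

void cmn_self_check(void) {
    /* Z agrees between CMN a, b and CMP a, -b. */
    assert(cmn_flags(5u, 3u).z == cmp_flags(5u, 0u - 3u).z);
    /* C disagrees for b == 0 (note that -0 == 0). */
    assert(cmn_flags(7u, 0u).c == 0 && cmp_flags(7u, 0u).c == 1);
    /* V disagrees for b == 0x80000000 (which is its own negation). */
    assert(cmn_flags(0u, 0x80000000u).v == 0);
    assert(cmp_flags(0u, 0x80000000u).v == 1);
}
```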
+
+//===---------------------------------------------------------------------===//
+
+Given the following on armv5:
+int test1(int A, int B) {
+ return (A&-8388481)|(B&8388480);
+}
+
+We currently generate:
+ ldr r2, .LCPI0_0
+ and r0, r0, r2
+ ldr r2, .LCPI0_1
+ and r1, r1, r2
+ orr r0, r1, r0
+ bx lr
+
+We should be able to replace the second ldr+and with a bic (i.e. reuse the
+constant which was already loaded). Not sure what's necessary to do that.
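
The reuse is possible because the two masks are bitwise complements:
8388480 is 0x007FFF80 and -8388481 is ~0x007FFF80. A small C sketch of the
equivalence (test1_one_constant and bic_self_check are hypothetical names;
"B & ~k" is the operation BIC performs):

```c
#include <assert.h>

/* The function from the note, using two separate constants. */
int test1_two_constants(int A, int B) {
    return (A & -8388481) | (B & 8388480);
}

/* Same result from a single loaded constant k: AND for A, BIC for B. */
int test1_one_constant(int A, int B) {
    int k = -8388481;
    return (A & k) | (B & ~k);   /* B & ~k corresponds to bic r1, r1, r2 */
}

void bic_self_check(void) {
    assert(~(-8388481) == 8388480);
    assert(test1_one_constant(-1, 0) == -8388481);
    assert(test1_one_constant(0, -1) == 8388480);
    assert(test1_two_constants(0x12345678, 0x0FEDCBA9) ==
           test1_one_constant(0x12345678, 0x0FEDCBA9));
}
```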
+
+//===---------------------------------------------------------------------===//
+
+The code generated for bswap on armv4/5 (CPUs without rev) is less than ideal:
+
+int a(int x) { return __builtin_bswap32(x); }
+
+a:
+ mov r1, #255, 24
+ mov r2, #255, 16
+ and r1, r1, r0, lsr #8
+ and r2, r2, r0, lsl #8
+ orr r1, r1, r0, lsr #24
+ orr r0, r2, r0, lsl #24
+ orr r0, r0, r1
+ bx lr
+
+Something like the following would be better (fewer instructions/registers):
+ eor r1, r0, r0, ror #16
+ bic r1, r1, #0xff0000
+ mov r1, r1, lsr #8
+ eor r0, r1, r0, ror #8
+ bx lr
+
+A custom Thumb version would also be a slight improvement over the generic
+version.
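
The shorter ARM sequence can be checked with a step-by-step C model. This is
only a sketch for sanity-checking the idea; ror32, bswap_model and
bswap_self_check are hypothetical helper names, and each statement mirrors one
instruction of the proposed sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right by n bits (0 <= n < 32). */
uint32_t ror32(uint32_t x, unsigned n) {
    return (x >> n) | (x << ((32u - n) & 31u));
}

/* Model of the proposed four-instruction byte swap. */
uint32_t bswap_model(uint32_t r0) {
    uint32_t r1 = r0 ^ ror32(r0, 16);   /* eor r1, r0, r0, ror #16 */
    r1 &= ~0x00FF0000u;                 /* bic r1, r1, #0xff0000   */
    r1 >>= 8;                           /* mov r1, r1, lsr #8      */
    return r1 ^ ror32(r0, 8);           /* eor r0, r1, r0, ror #8  */
}

void bswap_self_check(void) {
    assert(bswap_model(0x12345678u) == 0x78563412u);
    assert(bswap_model(0xAABBCCDDu) == 0xDDCCBBAAu);
    assert(bswap_model(0x000000FFu) == 0xFF000000u);
}
```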
+
+//===---------------------------------------------------------------------===//
+
+Consider the following simple C code:
+
+void foo(unsigned char *a, unsigned char *b, int *c) {
+ if ((*a | *b) == 0) *c = 0;
+}
+
+currently llvm-gcc generates something like this (nice branchless code I'd say):
+
+ ldrb r0, [r0]
+ ldrb r1, [r1]
+ orr r0, r1, r0
+ tst r0, #255
+ moveq r0, #0
+ streq r0, [r2]
+ bx lr
+
+Note that both "tst" and "moveq" are redundant.
+
+//===---------------------------------------------------------------------===//
+
+When loading immediate constants with movt/movw, if there are multiple
+constants needed with the same low 16 bits, and those values are not live at
+the same time, it would be possible to use a single movw instruction, followed
+by multiple movt instructions to rewrite the high bits to different values.
+For example:
+
+ volatile store i32 -1, i32* inttoptr (i32 1342210076 to i32*), align 4, !tbaa !0
+ volatile store i32 -1, i32* inttoptr (i32 1342341148 to i32*), align 4, !tbaa !0
+
+is compiled and optimized to:
+
+ movw r0, #32796
+ mov.w r1, #-1
+ movt r0, #20480
+ str r1, [r0]
+ movw r0, #32796 @ <= this MOVW is not needed, value is there already
+ movt r0, #20482
+ str r1, [r0]
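
The two addresses in this example are 0x5000801C and 0x5002801C, so they differ
only in the upper half-word. A small C sketch of the check an optimization
would need (movw_half, movt_half, can_share_movw and movw_self_check are
hypothetical names, not existing backend functions):

```c
#include <assert.h>
#include <stdint.h>

/* Low 16 bits: the immediate a MOVW would carry. */
uint16_t movw_half(uint32_t v) { return (uint16_t)(v & 0xFFFFu); }

/* High 16 bits: the immediate a MOVT would carry. */
uint16_t movt_half(uint32_t v) { return (uint16_t)(v >> 16); }

/* Two constants can share one MOVW when their low halves match. */
int can_share_movw(uint32_t a, uint32_t b) {
    return movw_half(a) == movw_half(b);
}

void movw_self_check(void) {
    assert(movw_half(1342210076u) == 32796);   /* movw r0, #32796 */
    assert(movt_half(1342210076u) == 20480);   /* movt r0, #20480 */
    assert(movt_half(1342341148u) == 20482);   /* movt r0, #20482 */
    assert(can_share_movw(1342210076u, 1342341148u));
}
```

The sketch ignores liveness; as noted above, the sharing only applies when the
two values are not live at the same time.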
+
+//===---------------------------------------------------------------------===//
+
+Improve codegen for select's:
+if (x != 0) x = 1
+if (x == 1) x = 1
+
+ARM codegen used to look like this:
+ mov r1, r0
+ cmp r1, #1
+ mov r0, #0
+ moveq r0, #1
+
+The naive lowering selects between two different values. It should recognize
+that the test is an equality test, so it is more a conditional move than a
+select:
+ cmp r0, #1
+ movne r0, #0
+
+Currently this is an ARM-specific dag combine. We probably should make it into
+a target-neutral one.
+
+//===---------------------------------------------------------------------===//
+
+Optimize unnecessary checks for zero with __builtin_clz/ctz. Those builtins
+are specified to be undefined at zero, so portable code must check for zero
+and handle it as a special case. That is unnecessary on ARM where those
+operations are implemented in a way that is well-defined for zero. For
+example:
+
+int f(int x) { return x ? __builtin_clz(x) : sizeof(int)*8; }
+
+should just be implemented with a CLZ instruction. Since there are other
+targets, e.g., PPC, that share this behavior, it would be best to implement
+this in a target-independent way: we should probably fold that (when using
+"undefined at zero" semantics) to set the "defined at zero" bit and have
+the code generator expand out the right code.
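
For illustration, the guarded source form and a "defined at zero" count agree
on every input, which is why the check can be folded away on targets whose
instruction returns the bit width for zero. The sketch below assumes a 32-bit
unsigned int and a compiler that provides the GCC/Clang builtin __builtin_clz;
clz32_model, clz32_guarded and clz_self_check are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Reference model of a count-leading-zeros that is defined at zero. The
   value 32 for a zero input is what the ARM CLZ instruction returns. */
unsigned clz32_model(uint32_t x) {
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Guarded form from the note: the builtin is never called with zero. */
unsigned clz32_guarded(uint32_t x) {
    return x ? (unsigned)__builtin_clz(x) : 32u;
}

void clz_self_check(void) {
    assert(clz32_model(0u) == 32 && clz32_guarded(0u) == 32);
    assert(clz32_model(1u) == 31 && clz32_guarded(1u) == 31);
    assert(clz32_model(0x00010000u) == 15 && clz32_guarded(0x00010000u) == 15);
    assert(clz32_model(0x80000000u) == 0 && clz32_guarded(0x80000000u) == 0);
}
```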
+
+
+//===---------------------------------------------------------------------===//
+
+Clean up the test/MC/ARM files to have more robust register choices.
+
+R0 should not be used as a register operand in the assembler tests as it's then
+not possible to distinguish between a correct encoding and a missing operand
+encoding, as zero is the default value for the binary encoder.
+e.g.,
+ add r0, r0 // bad
+ add r3, r5 // good
+
+Register operands should be distinct. That is, when the encoding does not
+require two syntactical operands to refer to the same register, two different
+registers should be used in the test so as to catch errors where the
+operands are swapped in the encoding.
+e.g.,
+ subs.w r1, r1, r1 // bad
+ subs.w r1, r2, r3 // good
+