1 //===- README.txt - Notes for improving PowerPC-specific code gen ---------===//
5 * implement do-loop -> bdnz transform
7 ===-------------------------------------------------------------------------===
9 Support 'update' load/store instructions. These are cracked on the G5, but are
12 With preinc enabled, this:
14 long *%test4(long *%X, long *%dest) {
15 %Y = getelementptr long* %X, int 4
17 store long %A, long* %dest
32 with -sched=list-burr, I get:
41 ===-------------------------------------------------------------------------===
43 We compile the hottest inner loop of viterbi to:
54 bne cr0, LBB1_83 ;bb420.i
56 The CBE manages to produce:
67 This could be much better (bdnz instead of bdz) but it still beats us. If we
68 produced this with bdnz, the loop would be a single dispatch group.
70 ===-------------------------------------------------------------------------===
87 This is effectively a simple form of predication.
89 ===-------------------------------------------------------------------------===
91 Lump the constant pool for each function into ONE pic object, and reference
92 pieces of it as offsets from the start. For functions like this (contrived
93 to have lots of constants obviously):
95 double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }
100 lis r2, ha16(.CPI_X_0)
101 lfd f0, lo16(.CPI_X_0)(r2)
102 lis r2, ha16(.CPI_X_1)
103 lfd f2, lo16(.CPI_X_1)(r2)
105 lis r2, ha16(.CPI_X_2)
106 lfd f1, lo16(.CPI_X_2)(r2)
107 lis r2, ha16(.CPI_X_3)
108 lfd f2, lo16(.CPI_X_3)(r2)
112 It would be better to materialize .CPI_X into a register, then use immediates
113 off of the register to avoid the lis's. This is even more important in PIC
116 Note that this (and the static variable version) is discussed here for GCC:
117 http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
119 ===-------------------------------------------------------------------------===
121 PIC Code Gen IPO optimization:
123 Squish small scalar globals together into a single global struct, allowing the
124 address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
125 of the GOT on targets with one).
127 Note that this is discussed here for GCC:
128 http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
130 ===-------------------------------------------------------------------------===
132 Implement Newton-Rhapson method for improving estimate instructions to the
133 correct accuracy, and implementing divide as multiply by reciprocal when it has
134 more than one use. Itanium will want this too.
136 ===-------------------------------------------------------------------------===
140 int %f1(int %a, int %b) {
141 %tmp.1 = and int %a, 15 ; <int> [#uses=1]
142 %tmp.3 = and int %b, 240 ; <int> [#uses=1]
143 %tmp.4 = or int %tmp.3, %tmp.1 ; <int> [#uses=1]
147 without a copy. We make this currently:
150 rlwinm r2, r4, 0, 24, 27
151 rlwimi r2, r3, 0, 28, 31
155 The two-addr pass or RA needs to learn when it is profitable to commute an
156 instruction to avoid a copy AFTER the 2-addr instruction. The 2-addr pass
157 currently only commutes to avoid inserting a copy BEFORE the two addr instr.
159 ===-------------------------------------------------------------------------===
161 Compile offsets from allocas:
164 %X = alloca { int, int }
165 %Y = getelementptr {int,int}* %X, int 0, uint 1
169 into a single add, not two:
176 --> important for C++.
178 ===-------------------------------------------------------------------------===
180 No loads or stores of the constants should be needed:
182 struct foo { double X, Y; };
183 void xxx(struct foo F);
184 void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }
186 ===-------------------------------------------------------------------------===
188 Darwin Stub LICM optimization:
194 Have to go through an indirect stub if bar is external or linkonce. It would
195 be better to compile it as:
200 which only computes the address of bar once (instead of each time through the
201 stub). This is Darwin specific and would have to be done in the code generator.
202 Probably not a win on x86.
204 ===-------------------------------------------------------------------------===
206 Simple IPO for argument passing, change:
207 void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)
209 the Darwin ABI specifies that any integer arguments in the first 32 bytes worth
210 of arguments get assigned to r3 through r10. That is, if you have a function
211 foo(int, double, int) you get r3, f1, r6, since the 64 bit double ate up the
212 argument bytes for r4 and r5. The trick then would be to shuffle the argument
213 order for functions we can internalize so that the maximum number of
214 integers/pointers get passed in regs before you see any of the fp arguments.
216 Instead of implementing this, it would actually probably be easier to just
217 implement a PPC fastcc, where we could do whatever we wanted to the CC,
218 including having this work sanely.
220 ===-------------------------------------------------------------------------===
222 Fix Darwin FP-In-Integer Registers ABI
224 Darwin passes doubles in structures in integer registers, which is very very
225 bad. Add something like a BIT_CONVERT to LLVM, then do an i-p transformation
226 that percolates these things out of functions.
228 Check out how horrible this is:
229 http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html
231 This is an extension of "interprocedural CC unmunging" that can't be done with
234 ===-------------------------------------------------------------------------===
241 return b * 3; // ignore the fact that this is always 3.
247 into something not this:
252 rlwinm r2, r2, 29, 31, 31
254 bgt cr0, LBB1_2 ; UnifiedReturnBlock
256 rlwinm r2, r2, 0, 31, 31
259 LBB1_2: ; UnifiedReturnBlock
263 In particular, the two compares (marked 1) could be shared by reversing one.
264 This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
265 same operands (but backwards) exists. In this case, this wouldn't save us
266 anything though, because the compares still wouldn't be shared.
268 ===-------------------------------------------------------------------------===
270 The legalizer should lower this:
272 bool %test(ulong %x) {
273 %tmp = setlt ulong %x, 4294967296
277 into "if x.high == 0", not:
293 noticed in 2005-05-11-Popcount-ffs-fls.c.
296 ===-------------------------------------------------------------------------===
298 We should custom expand setcc instead of pretending that we have it. That
299 would allow us to expose the access of the crbit after the mfcr, allowing
300 that access to be trivially folded into other ops. A simple example:
302 int foo(int a, int b) { return (a < b) << 4; }
309 rlwinm r2, r2, 29, 31, 31
313 ===-------------------------------------------------------------------------===
315 Fold add and sub with constant into non-extern, non-weak addresses so this:
318 void bar(int b) { a = b; }
319 void foo(unsigned char *c) {
336 lbz r2, lo16(_a+3)(r2)
340 ===-------------------------------------------------------------------------===
342 We generate really bad code for this:
344 int f(signed char *a, _Bool b, _Bool c) {
350 ===-------------------------------------------------------------------------===
353 int test(unsigned *P) { return *P >> 24; }
368 ===-------------------------------------------------------------------------===
370 On the G5, logical CR operations are more expensive in their three
371 address form: ops that read/write the same register are half as expensive as
372 those that read from two registers that are different from their destination.
374 We should model this with two separate instructions. The isel should generate
375 the "two address" form of the instructions. When the register allocator
376 detects that it needs to insert a copy due to the two-addresness of the CR
377 logical op, it will invoke PPCInstrInfo::convertToThreeAddress. At this point
378 we can convert to the "three address" instruction, to save code space.
380 This only matters when we start generating cr logical ops.
382 ===-------------------------------------------------------------------------===
384 We should compile these two functions to the same thing:
387 void f(int a, int b, int *P) {
388 *P = (a-b)>=0?(a-b):(b-a);
390 void g(int a, int b, int *P) {
394 Further, they should compile to something better than:
400 bgt cr0, LBB2_2 ; entry
417 ... which is much nicer.
419 This theoretically may help improve twolf slightly (used in dimbox.c:142?).
421 ===-------------------------------------------------------------------------===
423 int foo(int N, int ***W, int **TK, int X) {
426 for (t = 0; t < N; ++t)
427 for (i = 0; i < 4; ++i)
428 W[t / X][i][t % X] = TK[i][t];
433 We generate relatively atrocious code for this loop compared to gcc.
435 We could also strength reduce the rem and the div:
436 http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf
438 ===-------------------------------------------------------------------------===
440 float foo(float X) { return (int)(X); }
455 We could use a target dag combine to turn the lwz/extsw into an lwa when the
456 lwz has a single use. Since LWA is cracked anyway, this would be a codesize
459 ===-------------------------------------------------------------------------===
461 We generate ugly code for this:
463 void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
465 if(dx < -dw) code |= 1;
466 if(dx > dw) code |= 2;
467 if(dy < -dw) code |= 4;
468 if(dy > dw) code |= 8;
469 if(dz < -dw) code |= 16;
470 if(dz > dw) code |= 32;
474 ===-------------------------------------------------------------------------===
476 Complete the signed i32 to FP conversion code using 64-bit registers
477 transformation, good for PI. See PPCISelLowering.cpp, this comment:
479 // FIXME: disable this lowered code. This generates 64-bit register values,
480 // and we don't model the fact that the top part is clobbered by calls. We
481 // need to flag these together so that the value isn't live across a call.
482 //setOperationAction(ISD::SINT_TO_FP, MVT::i32, Custom);
484 Also, if the registers are spilled to the stack, we have to ensure that all
485 64-bits of them are save/restored, otherwise we will miscompile the code. It
486 sounds like we need to get the 64-bit register classes going.
488 ===-------------------------------------------------------------------------===
490 %struct.B = type { ubyte, [3 x ubyte] }
492 void %foo(%struct.B* %b) {
494 %tmp = cast %struct.B* %b to uint* ; <uint*> [#uses=1]
495 %tmp = load uint* %tmp ; <uint> [#uses=1]
496 %tmp3 = cast %struct.B* %b to uint* ; <uint*> [#uses=1]
497 %tmp4 = load uint* %tmp3 ; <uint> [#uses=1]
498 %tmp8 = cast %struct.B* %b to uint* ; <uint*> [#uses=2]
499 %tmp9 = load uint* %tmp8 ; <uint> [#uses=1]
500 %tmp4.mask17 = shl uint %tmp4, ubyte 1 ; <uint> [#uses=1]
501 %tmp1415 = and uint %tmp4.mask17, 2147483648 ; <uint> [#uses=1]
502 %tmp.masked = and uint %tmp, 2147483648 ; <uint> [#uses=1]
503 %tmp11 = or uint %tmp1415, %tmp.masked ; <uint> [#uses=1]
504 %tmp12 = and uint %tmp9, 2147483647 ; <uint> [#uses=1]
505 %tmp13 = or uint %tmp12, %tmp11 ; <uint> [#uses=1]
506 store uint %tmp13, uint* %tmp8
516 rlwimi r2, r4, 0, 0, 0
520 We could collapse a bunch of those ORs and ANDs and generate the following
525 rlwinm r4, r2, 1, 0, 0
530 ===-------------------------------------------------------------------------===
534 unsigned test6(unsigned x) {
535 return ((x & 0x00FF0000) >> 16) | ((x & 0x000000FF) << 16);
542 rlwinm r3, r3, 16, 0, 31
551 rlwinm r3,r3,16,24,31
556 ===-------------------------------------------------------------------------===
558 Consider a function like this:
560 float foo(float X) { return X + 1234.4123f; }
562 The FP constant ends up in the constant pool, so we need to get the LR register.
563 This ends up producing code like this:
572 addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
573 lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)
579 This is functional, but there is no reason to spill the LR register all the way
580 to the stack (the two marked instrs): spilling it to a GPR is quite enough.
582 Implementing this will require some codegen improvements. Nate writes:
584 "So basically what we need to support the "no stack frame save and restore" is a
585 generalization of the LR optimization to "callee-save regs".
587 Currently, we have LR marked as a callee-save reg. The register allocator sees
588 that it's callee save, and spills it directly to the stack.
590 Ideally, something like this would happen:
592 LR would be in a separate register class from the GPRs. The class of LR would be
593 marked "unspillable". When the register allocator came across an unspillable
594 reg, it would ask "what is the best class to copy this into that I *can* spill"
595 If it gets a class back, which it will in this case (the gprs), it grabs a free
596 register of that class. If it is then later necessary to spill that reg, so be
599 ===-------------------------------------------------------------------------===