* implement do-loop -> bdnz transform
* implement powerpc-64 for darwin
===-------------------------------------------------------------------------===
Support 'update' load/store instructions. These are cracked on the G5, but are
===-------------------------------------------------------------------------===
Should hint to the branch select pass that it doesn't need to print the second
unconditional branch, so we don't end up with things like:

        b .LBBl42__2E_expand_function_8_674 ; loopentry.24
        b .LBBl42__2E_expand_function_8_42 ; NewDefault
        b .LBBl42__2E_expand_function_8_42 ; NewDefault
===-------------------------------------------------------------------------===
if (X == 0x12345678) bar();
===-------------------------------------------------------------------------===
Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start. For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's. This is even more important in PIC

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
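
A source-level analogy of the pooled layout, as a rough sketch only: the four
constants live in one table and are reached as offsets from a single base
pointer. The names X_pool, X_pooled and X_reference are hypothetical, and a real
compiler may fold the table away, so this shows the intended addressing shape
rather than what the backend emits.

```c
/* Hypothetical pooled constant table for X(): one object, four offsets. */
static const double X_pool[4] = { 1.23, 4.512, 2.34, 14.38 };

/* Same arithmetic as X(), but every constant is loaded relative to one
   base address, which only has to be materialized once. */
double X_pooled(double Y) {
    const double *cp = X_pool;              /* single base pointer */
    return (Y * cp[0] + cp[1]) * cp[2] + cp[3];
}

/* The function from the note above, kept for comparison. */
double X_reference(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }
```

With the base in a register, each load needs only an immediate offset
(0, 8, 16, 24) instead of its own lis.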
===-------------------------------------------------------------------------===
PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
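
What the transformation amounts to at the source level, sketched by hand. The
globals and the struct name merged_globals are made up for illustration; the
optimizer would do this on internal globals without any source change.

```c
/* Before (conceptually): three small globals, each with its own address
   to compute under PIC:
       static int   counter;
       static short flags;
       static char  mode;                                                  */

/* After: one struct, so one address serves all three fields. */
static struct {
    int   counter;
    short flags;
    char  mode;
} merged_globals;

/* All three accesses share the single base address of merged_globals. */
void touch_all(void) {
    merged_globals.counter += 1;
    merged_globals.flags   |= 1;
    merged_globals.mode     = 'r';
}

int merged_counter(void) { return merged_globals.counter; }
int merged_flags(void)   { return merged_globals.flags; }
int merged_mode(void)    { return merged_globals.mode; }
```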
===-------------------------------------------------------------------------===
Implement the Newton-Raphson method for improving estimate instructions to the
correct accuracy, and implement divide as multiply by reciprocal when it has
more than one use. Itanium will want this too.
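
A minimal sketch of the refinement step, assuming the usual reciprocal
iteration x' = x * (2 - d*x). The starting estimate is passed in as a plain
argument here; on real hardware it would come from an estimate instruction.
All function names are hypothetical.

```c
/* One Newton-Raphson step toward 1/d. The relative error is roughly
   squared on every step. */
static double refine_recip(double d, double x) {
    return x * (2.0 - d * x);
}

/* Refine a crude estimate of 1/d for a fixed number of steps. */
double recip_from_estimate(double d, double estimate, int steps) {
    double x = estimate;
    for (int i = 0; i < steps; ++i)
        x = refine_recip(d, x);
    return x;
}

/* Two divides by the same divisor share one refined reciprocal. */
void div_pair(double a, double b, double d, double estimate,
              double *qa, double *qb) {
    double r = recip_from_estimate(d, estimate, 4);
    *qa = a * r;
    *qb = b * r;
}
```

Multiplying by a reciprocal is not guaranteed to be bit-identical to a true
divide, so a real implementation would need to decide when that is acceptable.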
===-------------------------------------------------------------------------===
#define ARRAY_LENGTH 16

    unsigned int field0 : 6;
    unsigned int field1 : 6;
    unsigned int field2 : 6;
    unsigned int field3 : 6;
    unsigned int field4 : 3;
    unsigned int field5 : 4;
    unsigned int field6 : 1;

    unsigned int field6 : 1;
    unsigned int field5 : 4;
    unsigned int field4 : 3;
    unsigned int field3 : 6;
    unsigned int field2 : 6;
    unsigned int field1 : 6;
    unsigned int field0 : 6;

typedef struct program_t {
    union bitfield array[ARRAY_LENGTH];

void AdjustBitfields(program* prog, unsigned int fmt1)
    prog->array[0].bitfields.field0 = fmt1;
    prog->array[0].bitfields.field1 = fmt1 + 1;

We currently generate:

        rlwinm r2, r2, 0, 0, 19
        rlwinm r5, r5, 6, 20, 25
        rlwimi r2, r4, 0, 26, 31

We should teach someone that or (rlwimi, rlwinm) with disjoint masks can be
turned into rlwimi (rlwimi)

The better codegen would be:
===-------------------------------------------------------------------------===
int %f1(int %a, int %b) {
        %tmp.1 = and int %a, 15 ; <int> [#uses=1]
        %tmp.3 = and int %b, 240 ; <int> [#uses=1]
        %tmp.4 = or int %tmp.3, %tmp.1 ; <int> [#uses=1]

without a copy. We make this currently:

        rlwinm r2, r4, 0, 24, 27
        rlwimi r2, r3, 0, 28, 31

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction. The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the two addr instr.
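
To make the two instructions concrete, here is a small C model of rlwinm and
rlwimi (rotate left, then mask or insert under a mask, with bit 0 as the most
significant bit). f1_current mirrors the sequence above; f1_commuted is an
assumed alternative ordering that builds the result directly in the register
holding %a, which is the kind of commutation the note asks for. The helper
names are hypothetical and the mask helper only handles mb <= me.

```c
#include <stdint.h>

/* Rotate left; masking the counts keeps n == 0 well defined. */
static uint32_t rotl32(uint32_t x, unsigned n) {
    return (x << (n & 31)) | (x >> ((32 - n) & 31));
}

/* Bits mb..me set, IBM numbering (bit 0 = MSB). Assumes mb <= me. */
static uint32_t ppc_mask(unsigned mb, unsigned me) {
    return (0xFFFFFFFFu >> mb) & (0xFFFFFFFFu << (31 - me));
}

/* rlwinm: rotate rs left by sh, then AND with the mask. */
static uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me) {
    return rotl32(rs, sh) & ppc_mask(mb, me);
}

/* rlwimi: rotate rs left by sh, insert under the mask into ra. */
static uint32_t rlwimi(uint32_t ra, uint32_t rs, unsigned sh,
                       unsigned mb, unsigned me) {
    uint32_t m = ppc_mask(mb, me);
    return (rotl32(rs, sh) & m) | (ra & ~m);
}

/* Sequence from the note: a is in r3, b is in r4, result lands in r2. */
uint32_t f1_current(uint32_t a, uint32_t b) {
    uint32_t r2 = rlwinm(b, 0, 24, 27);      /* b & 240 */
    r2 = rlwimi(r2, a, 0, 28, 31);           /* insert a & 15 */
    return r2;
}

/* Assumed commuted form: result is built in r3 itself. */
uint32_t f1_commuted(uint32_t a, uint32_t b) {
    uint32_t r3 = rlwinm(a, 0, 28, 31);      /* a & 15 */
    r3 = rlwimi(r3, b, 0, 24, 27);           /* insert b & 240 */
    return r3;
}
```

Both orderings compute (a & 15) | (b & 240); they differ only in which
register ends up holding the result.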
===-------------------------------------------------------------------------===
Compile offsets from allocas:

        %X = alloca { int, int }
        %Y = getelementptr {int,int}* %X, int 0, uint 1

into a single add, not two:

--> important for C++.
===-------------------------------------------------------------------------===
int test3(int a, int b) { return (a < 0) ? a : 0; }

should be branch free code. LLVM is turning it into < 1 because of the RHS.
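
One branch-free formulation, sketched in C. It assumes that right-shifting a
negative int is an arithmetic shift, which C leaves implementation-defined but
which holds on the usual compilers. The name test3_branchless is hypothetical.

```c
#include <limits.h>

/* (a < 0) ? a : 0 without a branch: the shift produces all ones when a is
   negative and zero otherwise, and the AND keeps or clears a accordingly.
   Assumes arithmetic right shift of negative values. */
int test3_branchless(int a) {
    int mask = a >> (int)(sizeof(int) * CHAR_BIT - 1);
    return a & mask;
}
```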
===-------------------------------------------------------------------------===
No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }
===-------------------------------------------------------------------------===
Darwin Stub LICM optimization:

Have to go through an indirect stub if bar is external or linkonce. It would
be better to compile it as:

which only computes the address of bar once (instead of each time through the
stub). This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.
===-------------------------------------------------------------------------===
PowerPC i1/setcc stuff (depends on subreg stuff):

Check out the PPC code we get for 'compare' in this testcase:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19672

Oof. On top of not doing the logical crnand instead of (mfcr, mfcr,
invert, invert, or), we then have to compare the result against zero instead
of using the value already in a CR!

That should be something like

        bne cr0, LBB_compare_4

        rlwinm r7, r7, 30, 31, 31
        rlwinm r8, r8, 30, 31, 31
        bne cr0, LBB_compare_4 ; loopexit

FreeBench/mason has a basic block that looks like this:

        %tmp.130 = seteq int %p.0__, 5 ; <bool> [#uses=1]
        %tmp.134 = seteq int %p.1__, 6 ; <bool> [#uses=1]
        %tmp.139 = seteq int %p.2__, 12 ; <bool> [#uses=1]
        %tmp.144 = seteq int %p.3__, 13 ; <bool> [#uses=1]
        %tmp.149 = seteq int %p.4__, 14 ; <bool> [#uses=1]
        %tmp.154 = seteq int %p.5__, 15 ; <bool> [#uses=1]
        %bothcond = and bool %tmp.134, %tmp.130 ; <bool> [#uses=1]
        %bothcond123 = and bool %bothcond, %tmp.139 ; <bool>
        %bothcond124 = and bool %bothcond123, %tmp.144 ; <bool>
        %bothcond125 = and bool %bothcond124, %tmp.149 ; <bool>
        %bothcond126 = and bool %bothcond125, %tmp.154 ; <bool>
        br bool %bothcond126, label %shortcirc_next.5, label %else.0

This is a particularly important case where handling CRs better will help.
===-------------------------------------------------------------------------===
Simple IPO for argument passing, change:
void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10. That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64-bit double ate up the
argument bytes for r4 and r5. The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.
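
A toy model of the rule as stated in this note only (it is not a full
description of the ABI): each int or pointer takes 4 argument bytes, each
double takes 8, and an integer argument whose bytes start within the first 32
lands in r3 + offset/4. The signature encoding ('i' and 'd' characters) and
both function names are made up for illustration.

```c
#include <stddef.h>

/* Returns the GPR number (3..10) that holds argument `index` of `sig`,
   or -1 if that argument is a double or starts past the first 32 bytes.
   sig is a string of 'i' (int/pointer) and 'd' (double). */
int gpr_for_arg(const char *sig, size_t index) {
    unsigned offset = 0;                       /* running argument bytes */
    for (size_t i = 0; sig[i] != '\0'; ++i) {
        if (i == index) {
            if (sig[i] == 'd' || offset >= 32)
                return -1;
            return 3 + (int)(offset / 4);
        }
        offset += (sig[i] == 'd') ? 8u : 4u;
    }
    return -1;                                 /* index out of range */
}

/* Counts how many integer arguments of `sig` end up in GPRs. */
int ints_in_gprs(const char *sig) {
    int count = 0;
    for (size_t i = 0; sig[i] != '\0'; ++i)
        if (sig[i] == 'i' && gpr_for_arg(sig, i) >= 0)
            ++count;
    return count;
}
```

Under this model "idi" places its ints in r3 and r6, while the reordered "iid"
uses r3 and r4, and a signature such as "ddddi" pushes its only int out of the
register window entirely.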
===-------------------------------------------------------------------------===
Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad. Add something like a BIT_CONVERT to LLVM, then do an interprocedural
transformation that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
===-------------------------------------------------------------------------===
Generate lwbrx and other byteswapping load/store instructions when reasonable.
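
The kind of source idiom this is about, as a sketch: on a big-endian PowerPC
target, reading or writing a little-endian 32-bit value is a byte-reversed
access, which is what lwbrx and stwbrx perform. The function names are
hypothetical, and whether a given compiler recognizes this exact pattern is not
something the note claims.

```c
#include <stdint.h>

/* Portable little-endian 32-bit load, assembled byte by byte. */
uint32_t load_le32(const unsigned char *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Portable little-endian 32-bit store. */
void store_le32(unsigned char *p, uint32_t v) {
    p[0] = (unsigned char)(v & 0xFF);
    p[1] = (unsigned char)((v >> 8) & 0xFF);
    p[2] = (unsigned char)((v >> 16) & 0xFF);
    p[3] = (unsigned char)((v >> 24) & 0xFF);
}
```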
===-------------------------------------------------------------------------===
Implement TargetConstantVec, and set up PPC to custom lower ConstantVec into
TargetConstantVec's if it's one of the many forms that are algorithmically
computable using the spiffy altivec instructions.
===-------------------------------------------------------------------------===
        return b * 3; // ignore the fact that this is always 3.

into something not this:

        rlwinm r2, r2, 29, 31, 31
        bgt cr0, LBB1_2 ; UnifiedReturnBlock
        rlwinm r2, r2, 0, 31, 31
LBB1_2: ; UnifiedReturnBlock

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists. In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.
===-------------------------------------------------------------------------===
The legalizer should lower this:

bool %test(ulong %x) {
        %tmp = setlt ulong %x, 4294967296

into "if x.high == 0", not:

noticed in 2005-05-11-Popcount-ffs-fls.c.
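
The equivalence being asked for, written out in C as a sketch: an unsigned
64-bit value is below 2^32 exactly when its upper 32 bits are zero, so only the
high word needs to be examined. The function names are hypothetical.

```c
#include <stdint.h>

/* The comparison as written in the IR above. */
int test_wide(uint64_t x) {
    return x < 4294967296ULL;
}

/* The desired lowering: look at the high 32 bits only. */
int test_high_only(uint64_t x) {
    uint32_t high = (uint32_t)(x >> 32);
    return high == 0;
}
```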
===-------------------------------------------------------------------------===
We should custom expand setcc instead of pretending that we have it. That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops. A simple example:

int foo(int a, int b) { return (a < b) << 4; }

        rlwinm r2, r2, 29, 31, 31
===-------------------------------------------------------------------------===
Fold add and sub with constant into non-extern, non-weak addresses so this:

void bar(int b) { a = b; }
void foo(unsigned char *c) {

        lbz r2, lo16(_a+3)(r2)
===-------------------------------------------------------------------------===
We generate really bad code for this:

int f(signed char *a, _Bool b, _Bool c) {
===-------------------------------------------------------------------------===
int test(unsigned *P) { return *P >> 24; }
===-------------------------------------------------------------------------===
On the G5, logical CR operations are more expensive in their three
address form: ops that read/write the same register are half as expensive as
those that read from two registers that are different from their destination.

We should model this with two separate instructions. The isel should generate
the "two address" form of the instructions. When the register allocator
detects that it needs to insert a copy due to the two-addressness of the CR
logical op, it will invoke PPCInstrInfo::convertToThreeAddress. At this point
we can convert to the "three address" instruction, to save code space.

This only matters when we start generating cr logical ops.
===-------------------------------------------------------------------------===
We should compile these two functions to the same thing:

void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
void g(int a, int b, int *P) {

Further, they should compile to something better than:

        bgt cr0, LBB2_2 ; entry

... which is much nicer.

This theoretically may help improve twolf slightly (used in dimbox.c:142?).
===-------------------------------------------------------------------------===
int foo(int N, int ***W, int **TK, int X) {
  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];

We generate relatively atrocious code for this loop compared to gcc.