* implement do-loop -> bdnz transform
* implement powerpc-64 for darwin

===-------------------------------------------------------------------------===

Use the stfiwx instruction for:

void foo(float a, int *b) { *b = a; }

===-------------------------------------------------------------------------===

Support 'update' load/store instructions.  These are cracked on the G5, but are
still a codesize win.
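
For reference, the kind of loop that should map onto update-form addressing
(lwzu/lfdu/stwu): a hand-written C sketch, not code from this file.

```c
/* The pattern update-form addressing captures: advance the pointer and
   access through it in one step.  With lfdu, the pointer increment and
   the load below become a single instruction. */
double sum_next(const double *p, int n) {
    double sum = 0.0;
    while (n-- > 0)
        sum += *++p;        /* loads p[1], p[2], ... */
    return sum;
}
```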

===-------------------------------------------------------------------------===

Should hint to the branch select pass that it doesn't need to print the second
unconditional branch, so we don't end up with things like:

	b .LBBl42__2E_expand_function_8_674	; loopentry.24
	b .LBBl42__2E_expand_function_8_42	; NewDefault
	b .LBBl42__2E_expand_function_8_42	; NewDefault

===-------------------------------------------------------------------------===

	if (X == 0x12345678) bar();

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

	lis r2, ha16(.CPI_X_0)
	lfd f0, lo16(.CPI_X_0)(r2)
	lis r2, ha16(.CPI_X_1)
	lfd f2, lo16(.CPI_X_1)(r2)

	lis r2, ha16(.CPI_X_2)
	lfd f1, lo16(.CPI_X_2)(r2)
	lis r2, ha16(.CPI_X_3)
	lfd f2, lo16(.CPI_X_3)(r2)

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC
mode.

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
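
A rough C-level sketch of the transformation (all names here are hypothetical,
for illustration only):

```c
/* Before: two small globals, each needing its own PIC/GOT address load. */
int g_count;
int g_limit;

/* After (the proposed IPO result): one struct global, so one base
   address can be materialized and CSE'd, and members are reached at
   constant offsets from it. */
struct merged_globals {
    int count;
    int limit;
} g;

int over_limit(void) {
    return g.count > g.limit;   /* both accesses share g's address */
}
```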

===-------------------------------------------------------------------------===

Implement the Newton-Raphson method for improving estimate instructions to the
correct accuracy, and implement divide as multiply-by-reciprocal when the
divisor has more than one use.  Itanium will want this too.
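
The refinement step itself, as a C sketch: x0 stands in for the result of a
low-precision estimate instruction like fres, and the iteration is the
standard Newton-Raphson reciprocal step, which roughly doubles the number of
correct bits each time it is applied.

```c
/* One Newton-Raphson step toward 1/d:  x1 = x0 * (2 - d*x0). */
float refine_recip(float d, float x0) {
    return x0 * (2.0f - d * x0);
}
```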

===-------------------------------------------------------------------------===

#define ARRAY_LENGTH 16

union bitfield {
  struct {
#ifdef __BIG_ENDIAN__
    unsigned int field0 : 6;
    unsigned int field1 : 6;
    unsigned int field2 : 6;
    unsigned int field3 : 6;
    unsigned int field4 : 3;
    unsigned int field5 : 4;
    unsigned int field6 : 1;
#else
    unsigned int field6 : 1;
    unsigned int field5 : 4;
    unsigned int field4 : 3;
    unsigned int field3 : 6;
    unsigned int field2 : 6;
    unsigned int field1 : 6;
    unsigned int field0 : 6;
#endif
  } bitfields;
  unsigned int u32All;
};

typedef struct program_t {
  union bitfield array[ARRAY_LENGTH];
} program;

void AdjustBitfields(program* prog, unsigned int fmt1)
{
  prog->array[0].bitfields.field0 = fmt1;
  prog->array[0].bitfields.field1 = fmt1 + 1;
}

We currently generate:

	rlwinm r2, r2, 0, 0, 19
	rlwinm r5, r5, 6, 20, 25
	rlwimi r2, r4, 0, 26, 31

We should teach someone that or(rlwimi, rlwinm) with disjoint masks can be
turned into rlwimi(rlwimi).

The better codegen would be:

===-------------------------------------------------------------------------===

int %f1(int %a, int %b) {
	%tmp.1 = and int %a, 15		; <int> [#uses=1]
	%tmp.3 = and int %b, 240	; <int> [#uses=1]
	%tmp.4 = or int %tmp.3, %tmp.1	; <int> [#uses=1]
	ret int %tmp.4
}

without a copy.  We make this currently:

	rlwinm r2, r4, 0, 24, 27
	rlwimi r2, r3, 0, 28, 31

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction.  The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the two addr instr.

===-------------------------------------------------------------------------===

Compile offsets from allocas:

	%X = alloca { int, int }
	%Y = getelementptr {int,int}* %X, int 0, uint 1

into a single add, not two:

--> important for C++.

===-------------------------------------------------------------------------===

int test3(int a, int b) { return (a < 0) ? a : 0; }

should be branch free code.  LLVM is turning it into < 1 because of the RHS.
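
One branch-free lowering, sketched in C (this assumes 32-bit int and an
arithmetic right shift of negative values, which is implementation-defined in
C but holds on PPC):

```c
/* a >> 31 is all ones when a is negative and all zeros otherwise, so
   the mask selects a or 0 without a branch. */
int test3_branchfree(int a, int b) {
    (void)b;                /* b is unused, as in the original */
    return a & (a >> 31);
}
```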

===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Have to go through an indirect stub if bar is external or linkonce.  It would
be better to compile it as:

which only computes the address of bar once (instead of each time through the
stub).  This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.

===-------------------------------------------------------------------------===

PowerPC i1/setcc stuff (depends on subreg stuff):

Check out the PPC code we get for 'compare' in this testcase:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19672

Oof.  On top of not doing the logical crnand instead of (mfcr, mfcr,
invert, invert, or), we then have to compare it against zero instead of
using the value already in a CR!

That should be something like:

	bne cr0, LBB_compare_4

	rlwinm r7, r7, 30, 31, 31
	rlwinm r8, r8, 30, 31, 31

	bne cr0, LBB_compare_4	; loopexit

FreeBench/mason has a basic block that looks like this:

	%tmp.130 = seteq int %p.0__, 5		; <bool> [#uses=1]
	%tmp.134 = seteq int %p.1__, 6		; <bool> [#uses=1]
	%tmp.139 = seteq int %p.2__, 12		; <bool> [#uses=1]
	%tmp.144 = seteq int %p.3__, 13		; <bool> [#uses=1]
	%tmp.149 = seteq int %p.4__, 14		; <bool> [#uses=1]
	%tmp.154 = seteq int %p.5__, 15		; <bool> [#uses=1]
	%bothcond = and bool %tmp.134, %tmp.130		; <bool> [#uses=1]
	%bothcond123 = and bool %bothcond, %tmp.139	; <bool>
	%bothcond124 = and bool %bothcond123, %tmp.144	; <bool>
	%bothcond125 = and bool %bothcond124, %tmp.149	; <bool>
	%bothcond126 = and bool %bothcond125, %tmp.154	; <bool>
	br bool %bothcond126, label %shortcirc_next.5, label %else.0

This is a particularly important case where handling CRs better will help.

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10.  That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64-bit double ate up the
argument bytes for r4 and r5.  The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.

===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad.  Add something like a BIT_CONVERT to LLVM, then do an i-p transformation
that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
just fastcc.

===-------------------------------------------------------------------------===

Generate lwbrx and other byteswapping load/store instructions when reasonable.
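
For example, here is a byte-reversed load written out in portable C; on
big-endian PPC the whole function should collapse to a single lwbrx (the
function name is just for illustration):

```c
#include <stdint.h>

/* Load a 32-bit little-endian value: four byte loads plus shifts/ors
   as written, one lwbrx on big-endian PPC. */
uint32_t load_le32(const unsigned char *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```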

===-------------------------------------------------------------------------===

Implement TargetConstantVec, and set up PPC to custom lower ConstantVec into
TargetConstantVec's if it's one of the many forms that are algorithmically
computable using the spiffy altivec instructions.

===-------------------------------------------------------------------------===

	return b * 3;  // ignore the fact that this is always 3.

into something not this:

	rlwinm r2, r2, 29, 31, 31

	bgt cr0, LBB1_2	; UnifiedReturnBlock

	rlwinm r2, r2, 0, 31, 31

LBB1_2:	; UnifiedReturnBlock

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists.  In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.

===-------------------------------------------------------------------------===

The legalizer should lower this:

bool %test(ulong %x) {
  %tmp = setlt ulong %x, 4294967296
  ret bool %tmp
}

into "if x.high == 0", not:

noticed in 2005-05-11-Popcount-ffs-fls.c.
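
In C terms, the desired lowering of that setlt only inspects the high word (a
sketch assuming a 64-bit unsigned type):

```c
#include <stdint.h>

/* x < 2^32 is exactly "the high 32 bits are zero"; no full 64-bit
   compare sequence is needed. */
int ult_2_32(uint64_t x) {
    return (uint32_t)(x >> 32) == 0;
}
```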

===-------------------------------------------------------------------------===

We should custom expand setcc instead of pretending that we have it.  That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops.  A simple example:

int foo(int a, int b) { return (a < b) << 4; }

	rlwinm r2, r2, 29, 31, 31

===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses so this:

void bar(int b) { a = b; }
void foo(unsigned char *c) {

	lbz r2, lo16(_a+3)(r2)

===-------------------------------------------------------------------------===

We generate really bad code for this:

int f(signed char *a, _Bool b, _Bool c) {