lib/Target/X86/README.txt

   1 //===---------------------------------------------------------------------===//
   2 // Random ideas for the X86 backend.
   3 //===---------------------------------------------------------------------===//
   4
   5 We should add support for the "movbe" instruction, which does a byte-swapping
   6 copy (3-addr bswap + memory support?)  This is available on Atom processors.
   7
   8 //===---------------------------------------------------------------------===//
   9
  10 CodeGen/X86/lea-3.ll:test3 should be a single LEA, not a shift/move.  The X86
  11 backend knows how to three-addressify this shift, but it appears the register
  12 allocator isn't even asking it to do so in this case.  We should investigate
  13 why this isn't happening, it could have significant impact on other important
  14 cases for X86 as well.
  15
  16 //===---------------------------------------------------------------------===//
  17
  18 This should be one DIV/IDIV instruction, not a libcall:
  19
  20 unsigned test(unsigned long long X, unsigned Y) {
  21         return X/Y;
  22 }
  23
  24 This can be done trivially with a custom legalizer.  What about overflow
  25 though?  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224
  26
  27 //===---------------------------------------------------------------------===//
  28
  29 Improvements to the multiply -> shift/add algorithm:
  30 http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
  31
  32 //===---------------------------------------------------------------------===//
  33
  34 Improve code like this (occurs fairly frequently, e.g. in LLVM):
  35 long long foo(int x) { return 1LL << x; }
  36
  37 http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
  38 http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
  39 http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html
  40
  41 Another useful one would be  ~0ULL >> X and ~0ULL << X.
  42
  43 One better solution for 1LL << x is:
  44         xorl    %eax, %eax
  45         xorl    %edx, %edx
  46         testb   $32, %cl
  47         sete    %al
  48         setne   %dl
  49         sall    %cl, %eax
  50         sall    %cl, %edx
  51
  52 But that requires good 8-bit subreg support.
  53
  54 Also, this might be better.  It's an extra shift, but it's one instruction
  55 shorter, and doesn't stress 8-bit subreg support.
  56 (From http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01148.html,
  57 but without the unnecessary and.)
  58         movl %ecx, %eax
  59         shrl $5, %eax
  60         movl %eax, %edx
  61         xorl $1, %edx
  62         sall %cl, %eax
  63         sall %cl. %edx
  64
  65 64-bit shifts (in general) expand to really bad code.  Instead of using
  66 cmovs, we should expand to a conditional branch like GCC produces.
  67
  68 //===---------------------------------------------------------------------===//
  69
  70 Compile this:
  71 _Bool f(_Bool a) { return a!=1; }
  72
  73 into:
  74         movzbl  %dil, %eax
  75         xorl    $1, %eax
  76         ret
  77
  78 (Although note that this isn't a legal way to express the code that llvm-gcc
  79 currently generates for that function.)
  80
  81 //===---------------------------------------------------------------------===//
  82
  83 Some isel ideas:
  84
  85 1. Dynamic programming based approach when compile time if not an
  86    issue.
  87 2. Code duplication (addressing mode) during isel.
  88 3. Other ideas from "Register-Sensitive Selection, Duplication, and
  89    Sequencing of Instructions".
  90 4. Scheduling for reduced register pressure.  E.g. "Minimum Register
  91    Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
  92    and other related papers.
  93    http://citeseer.ist.psu.edu/govindarajan01minimum.html
  94
  95 //===---------------------------------------------------------------------===//
  96
  97 Should we promote i16 to i32 to avoid partial register update stalls?
  98
  99 //===---------------------------------------------------------------------===//
 100
 101 Leave any_extend as pseudo instruction and hint to register
 102 allocator. Delay codegen until post register allocation.
 103 Note. any_extend is now turned into an INSERT_SUBREG. We still need to teach
 104 the coalescer how to deal with it though.
 105
 106 //===---------------------------------------------------------------------===//
 107
 108 It appears icc use push for parameter passing. Need to investigate.
 109
 110 //===---------------------------------------------------------------------===//
 111
 112 Only use inc/neg/not instructions on processors where they are faster than
 113 add/sub/xor.  They are slower on the P4 due to only updating some processor
 114 flags.
 115
 116 //===---------------------------------------------------------------------===//
 117
 118 The instruction selector sometimes misses folding a load into a compare.  The
 119 pattern is written as (cmp reg, (load p)).  Because the compare isn't
 120 commutative, it is not matched with the load on both sides.  The dag combiner
 121 should be made smart enough to cannonicalize the load into the RHS of a compare
 122 when it can invert the result of the compare for free.
 123
 124 //===---------------------------------------------------------------------===//
 125
 126 In many cases, LLVM generates code like this:
 127
 128 _test:
 129         movl 8(%esp), %eax
 130         cmpl %eax, 4(%esp)
 131         setl %al
 132         movzbl %al, %eax
 133         ret
 134
 135 on some processors (which ones?), it is more efficient to do this:
 136
 137 _test:
 138         movl 8(%esp), %ebx
 139         xor  %eax, %eax
 140         cmpl %ebx, 4(%esp)
 141         setl %al
 142         ret
 143
 144 Doing this correctly is tricky though, as the xor clobbers the flags.
 145
 146 //===---------------------------------------------------------------------===//
 147
 148 We should generate bts/btr/etc instructions on targets where they are cheap or
 149 when codesize is important.  e.g., for:
 150
 151 void setbit(int *target, int bit) {
 152     *target |= (1 << bit);
 153 }
 154 void clearbit(int *target, int bit) {
 155     *target &= ~(1 << bit);
 156 }
 157
 158 //===---------------------------------------------------------------------===//
 159
 160 Instead of the following for memset char*, 1, 10:
 161
 162         movl $16843009, 4(%edx)
 163         movl $16843009, (%edx)
 164         movw $257, 8(%edx)
 165
 166 It might be better to generate
 167
 168         movl $16843009, %eax
 169         movl %eax, 4(%edx)
 170         movl %eax, (%edx)
 171         movw al, 8(%edx)
 172
 173 when we can spare a register. It reduces code size.
 174
 175 //===---------------------------------------------------------------------===//
 176
 177 Evaluate what the best way to codegen sdiv X, (2^C) is.  For X/8, we currently
 178 get this:
 179
 180 define i32 @test1(i32 %X) {
 181     %Y = sdiv i32 %X, 8
 182     ret i32 %Y
 183 }
 184
 185 _test1:
 186         movl 4(%esp), %eax
 187         movl %eax, %ecx
 188         sarl $31, %ecx
 189         shrl $29, %ecx
 190         addl %ecx, %eax
 191         sarl $3, %eax
 192         ret
 193
 194 GCC knows several different ways to codegen it, one of which is this:
 195
 196 _test1:
 197         movl    4(%esp), %eax
 198         cmpl    $-1, %eax
 199         leal    7(%eax), %ecx
 200         cmovle  %ecx, %eax
 201         sarl    $3, %eax
 202         ret
 203
 204 which is probably slower, but it's interesting at least :)
 205
 206 //===---------------------------------------------------------------------===//
 207
 208 We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and rep/movsl
 209 We should leave these as libcalls for everything over a much lower threshold,
 210 since libc is hand tuned for medium and large mem ops (avoiding RFO for large
 211 stores, TLB preheating, etc)
 212
 213 //===---------------------------------------------------------------------===//
 214
 215 Optimize this into something reasonable:
 216  x * copysign(1.0, y) * copysign(1.0, z)
 217
 218 //===---------------------------------------------------------------------===//
 219
 220 Optimize copysign(x, *y) to use an integer load from y.
 221
 222 //===---------------------------------------------------------------------===//
 223
 224 The following tests perform worse with LSR:
 225
 226 lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.
 227
 228 //===---------------------------------------------------------------------===//
 229
 230 Teach the coalescer to coalesce vregs of different register classes. e.g. FR32 /
 231 FR64 to VR128.
 232
 233 //===---------------------------------------------------------------------===//
 234
 235 Adding to the list of cmp / test poor codegen issues:
 236
 237 int test(__m128 *A, __m128 *B) {
 238   if (_mm_comige_ss(*A, *B))
 239     return 3;
 240   else
 241     return 4;
 242 }
 243
 244 _test:
 245         movl 8(%esp), %eax
 246         movaps (%eax), %xmm0
 247         movl 4(%esp), %eax
 248         movaps (%eax), %xmm1
 249         comiss %xmm0, %xmm1
 250         setae %al
 251         movzbl %al, %ecx
 252         movl $3, %eax
 253         movl $4, %edx
 254         cmpl $0, %ecx
 255         cmove %edx, %eax
 256         ret
 257
 258 Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae. There
 259 are a number of issues. 1) We are introducing a setcc between the result of the
 260 intrisic call and select. 2) The intrinsic is expected to produce a i32 value
 261 so a any extend (which becomes a zero extend) is added.
 262
 263 We probably need some kind of target DAG combine hook to fix this.
 264
 265 //===---------------------------------------------------------------------===//
 266
 267 We generate significantly worse code for this than GCC:
 268 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
 269 http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701
 270
 271 There is also one case we do worse on PPC.
 272
 273 //===---------------------------------------------------------------------===//
 274
 275 For this:
 276
 277 int test(int a)
 278 {
 279   return a * 3;
 280 }
 281
 282 We currently emits
 283         imull $3, 4(%esp), %eax
 284
 285 Perhaps this is what we really should generate is? Is imull three or four
 286 cycles? Note: ICC generates this:
 287         movl    4(%esp), %eax
 288         leal    (%eax,%eax,2), %eax
 289
 290 The current instruction priority is based on pattern complexity. The former is
 291 more "complex" because it folds a load so the latter will not be emitted.
 292
 293 Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
 294 should always try to match LEA first since the LEA matching code does some
 295 estimate to determine whether the match is profitable.
 296
 297 However, if we care more about code size, then imull is better. It's two bytes
 298 shorter than movl + leal.
 299
 300 On a Pentium M, both variants have the same characteristics with regard
 301 to throughput; however, the multiplication has a latency of four cycles, as
 302 opposed to two cycles for the movl+lea variant.
 303
 304 //===---------------------------------------------------------------------===//
 305
 306 __builtin_ffs codegen is messy.
 307
 308 int ffs_(unsigned X) { return __builtin_ffs(X); }
 309
 310 llvm produces:
 311 ffs_:
 312         movl    4(%esp), %ecx
 313         bsfl    %ecx, %eax
 314         movl    $32, %edx
 315         cmove   %edx, %eax
 316         incl    %eax
 317         xorl    %edx, %edx
 318         testl   %ecx, %ecx
 319         cmove   %edx, %eax
 320         ret
 321
 322 vs gcc:
 323
 324 _ffs_:
 325         movl    $-1, %edx
 326         bsfl    4(%esp), %eax
 327         cmove   %edx, %eax
 328         addl    $1, %eax
 329         ret
 330
 331 Another example of __builtin_ffs (use predsimplify to eliminate a select):
 332
 333 int foo (unsigned long j) {
 334   if (j)
 335     return __builtin_ffs (j) - 1;
 336   else
 337     return 0;
 338 }
 339
 340 //===---------------------------------------------------------------------===//
 341
 342 It appears gcc place string data with linkonce linkage in
 343 .section __TEXT,__const_coal,coalesced instead of
 344 .section __DATA,__const_coal,coalesced.
 345 Take a look at darwin.h, there are other Darwin assembler directives that we
 346 do not make use of.
 347
 348 //===---------------------------------------------------------------------===//
 349
 350 define i32 @foo(i32* %a, i32 %t) {
 351 entry:
 352         br label %cond_true
 353
 354 cond_true:              ; preds = %cond_true, %entry
 355         %x.0.0 = phi i32 [ 0, %entry ], [ %tmp9, %cond_true ]           ; <i32> [#uses=3]
 356         %t_addr.0.0 = phi i32 [ %t, %entry ], [ %tmp7, %cond_true ]             ; <i32> [#uses=1]
 357         %tmp2 = getelementptr i32* %a, i32 %x.0.0               ; <i32*> [#uses=1]
 358         %tmp3 = load i32* %tmp2         ; <i32> [#uses=1]
 359         %tmp5 = add i32 %t_addr.0.0, %x.0.0             ; <i32> [#uses=1]
 360         %tmp7 = add i32 %tmp5, %tmp3            ; <i32> [#uses=2]
 361         %tmp9 = add i32 %x.0.0, 1               ; <i32> [#uses=2]
 362         %tmp = icmp sgt i32 %tmp9, 39           ; <i1> [#uses=1]
 363         br i1 %tmp, label %bb12, label %cond_true
 364
 365 bb12:           ; preds = %cond_true
 366         ret i32 %tmp7
 367 }
 368 is pessimized by -loop-reduce and -indvars
 369
 370 //===---------------------------------------------------------------------===//
 371
 372 u32 to float conversion improvement:
 373
 374 float uint32_2_float( unsigned u ) {
 375   float fl = (int) (u & 0xffff);
 376   float fh = (int) (u >> 16);
 377   fh *= 0x1.0p16f;
 378   return fh + fl;
 379 }
 380
 381 00000000        subl    $0x04,%esp
 382 00000003        movl    0x08(%esp,1),%eax
 383 00000007        movl    %eax,%ecx
 384 00000009        shrl    $0x10,%ecx
 385 0000000c        cvtsi2ss        %ecx,%xmm0
 386 00000010        andl    $0x0000ffff,%eax
 387 00000015        cvtsi2ss        %eax,%xmm1
 388 00000019        mulss   0x00000078,%xmm0
 389 00000021        addss   %xmm1,%xmm0
 390 00000025        movss   %xmm0,(%esp,1)
 391 0000002a        flds    (%esp,1)
 392 0000002d        addl    $0x04,%esp
 393 00000030        ret
 394
 395 //===---------------------------------------------------------------------===//
 396
 397 When using fastcc abi, align stack slot of argument of type double on 8 byte
 398 boundary to improve performance.
 399
 400 //===---------------------------------------------------------------------===//
 401
 402 Codegen:
 403
 404 int f(int a, int b) {
 405   if (a == 4 || a == 6)
 406     b++;
 407   return b;
 408 }
 409
 410
 411 as:
 412
 413 or eax, 2
 414 cmp eax, 6
 415 jz label
 416
 417 //===---------------------------------------------------------------------===//
 418
 419 GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
 420 simplifications for integer "x cmp y ? a : b".  For example, instead of:
 421
 422 int G;
 423 void f(int X, int Y) {
 424   G = X < 0 ? 14 : 13;
 425 }
 426
 427 compiling to:
 428
 429 _f:
 430         movl $14, %eax
 431         movl $13, %ecx
 432         movl 4(%esp), %edx
 433         testl %edx, %edx
 434         cmovl %eax, %ecx
 435         movl %ecx, _G
 436         ret
 437
 438 it could be:
 439 _f:
 440         movl    4(%esp), %eax
 441         sarl    $31, %eax
 442         notl    %eax
 443         addl    $14, %eax
 444         movl    %eax, _G
 445         ret
 446
 447 etc.
 448
 449 Another is:
 450 int usesbb(unsigned int a, unsigned int b) {
 451        return (a < b ? -1 : 0);
 452 }
 453 to:
 454 _usesbb:
 455         movl    8(%esp), %eax
 456         cmpl    %eax, 4(%esp)
 457         sbbl    %eax, %eax
 458         ret
 459
 460 instead of:
 461 _usesbb:
 462         xorl    %eax, %eax
 463         movl    8(%esp), %ecx
 464         cmpl    %ecx, 4(%esp)
 465         movl    $4294967295, %ecx
 466         cmovb   %ecx, %eax
 467         ret
 468
 469 //===---------------------------------------------------------------------===//
 470
 471 Consider the expansion of:
 472
 473 define i32 @test3(i32 %X) {
 474         %tmp1 = urem i32 %X, 255
 475         ret i32 %tmp1
 476 }
 477
 478 Currently it compiles to:
 479
 480 ...
 481         movl $2155905153, %ecx
 482         movl 8(%esp), %esi
 483         movl %esi, %eax
 484         mull %ecx
 485 ...
 486
 487 This could be "reassociated" into:
 488
 489         movl $2155905153, %eax
 490         movl 8(%esp), %ecx
 491         mull %ecx
 492
 493 to avoid the copy.  In fact, the existing two-address stuff would do this
 494 except that mul isn't a commutative 2-addr instruction.  I guess this has
 495 to be done at isel time based on the #uses to mul?
 496
 497 //===---------------------------------------------------------------------===//
 498
 499 Make sure the instruction which starts a loop does not cross a cacheline
 500 boundary. This requires knowning the exact length of each machine instruction.
 501 That is somewhat complicated, but doable. Example 256.bzip2:
 502
 503 In the new trace, the hot loop has an instruction which crosses a cacheline
 504 boundary.  In addition to potential cache misses, this can't help decoding as I
 505 imagine there has to be some kind of complicated decoder reset and realignment
 506 to grab the bytes from the next cacheline.
 507
 508 532  532 0x3cfc movb     (1809(%esp, %esi), %bl   <<<--- spans 2 64 byte lines
 509 942  942 0x3d03 movl     %dh, (1809(%esp, %esi)
 510 937  937 0x3d0a incl     %esi
 511 3    3   0x3d0b cmpb     %bl, %dl
 512 27   27  0x3d0d jnz      0x000062db <main+11707>
 513
 514 //===---------------------------------------------------------------------===//
 515
 516 In c99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.
 517
 518 //===---------------------------------------------------------------------===//
 519
 520 This could be a single 16-bit load.
 521
 522 int f(char *p) {
 523     if ((p[0] == 1) & (p[1] == 2)) return 1;
 524     return 0;
 525 }
 526
 527 //===---------------------------------------------------------------------===//
 528
 529 We should inline lrintf and probably other libc functions.
 530
 531 //===---------------------------------------------------------------------===//
 532
 533 Use the FLAGS values from arithmetic instructions more.  For example, compile:
 534
 535 int add_zf(int *x, int y, int a, int b) {
 536      if ((*x += y) == 0)
 537           return a;
 538      else
 539           return b;
 540 }
 541
 542 to:
 543        addl    %esi, (%rdi)
 544        movl    %edx, %eax
 545        cmovne  %ecx, %eax
 546        ret
 547 instead of:
 548
 549 _add_zf:
 550         addl (%rdi), %esi
 551         movl %esi, (%rdi)
 552         testl %esi, %esi
 553         cmove %edx, %ecx
 554         movl %ecx, %eax
 555         ret
 556
 557 As another example, compile function f2 in test/CodeGen/X86/cmp-test.ll
 558 without a test instruction.
 559
 560 //===---------------------------------------------------------------------===//
 561
 562 These two functions have identical effects:
 563
 564 unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
 565 unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
 566
 567 We currently compile them to:
 568
 569 _f:
 570         movl 4(%esp), %eax
 571         movl %eax, %ecx
 572         incl %ecx
 573         movl 8(%esp), %edx
 574         cmpl %edx, %ecx
 575         jne LBB1_2      #UnifiedReturnBlock
 576 LBB1_1: #cond_true
 577         addl $2, %eax
 578         ret
 579 LBB1_2: #UnifiedReturnBlock
 580         movl %ecx, %eax
 581         ret
 582 _f2:
 583         movl 4(%esp), %eax
 584         movl %eax, %ecx
 585         incl %ecx
 586         cmpl 8(%esp), %ecx
 587         sete %cl
 588         movzbl %cl, %ecx
 589         leal 1(%ecx,%eax), %eax
 590         ret
 591
 592 both of which are inferior to GCC's:
 593
 594 _f:
 595         movl    4(%esp), %edx
 596         leal    1(%edx), %eax
 597         addl    $2, %edx
 598         cmpl    8(%esp), %eax
 599         cmove   %edx, %eax
 600         ret
 601 _f2:
 602         movl    4(%esp), %eax
 603         addl    $1, %eax
 604         xorl    %edx, %edx
 605         cmpl    8(%esp), %eax
 606         sete    %dl
 607         addl    %edx, %eax
 608         ret
 609
 610 //===---------------------------------------------------------------------===//
 611
 612 This code:
 613
 614 void test(int X) {
 615   if (X) abort();
 616 }
 617
 618 is currently compiled to:
 619
 620 _test:
 621         subl $12, %esp
 622         cmpl $0, 16(%esp)
 623         jne LBB1_1
 624         addl $12, %esp
 625         ret
 626 LBB1_1:
 627         call L_abort$stub
 628
 629 It would be better to produce:
 630
 631 _test:
 632         subl $12, %esp
 633         cmpl $0, 16(%esp)
 634         jne L_abort$stub
 635         addl $12, %esp
 636         ret
 637
 638 This can be applied to any no-return function call that takes no arguments etc.
 639 Alternatively, the stack save/restore logic could be shrink-wrapped, producing
 640 something like this:
 641
 642 _test:
 643         cmpl $0, 4(%esp)
 644         jne LBB1_1
 645         ret
 646 LBB1_1:
 647         subl $12, %esp
 648         call L_abort$stub
 649
 650 Both are useful in different situations.  Finally, it could be shrink-wrapped
 651 and tail called, like this:
 652
 653 _test:
 654         cmpl $0, 4(%esp)
 655         jne LBB1_1
 656         ret
 657 LBB1_1:
 658         pop %eax   # realign stack.
 659         call L_abort$stub
 660
 661 Though this probably isn't worth it.
 662
 663 //===---------------------------------------------------------------------===//
 664
 665 Sometimes it is better to codegen subtractions from a constant (e.g. 7-x) with
 666 a neg instead of a sub instruction.  Consider:
 667
 668 int test(char X) { return 7-X; }
 669
 670 we currently produce:
 671 _test:
 672         movl $7, %eax
 673         movsbl 4(%esp), %ecx
 674         subl %ecx, %eax
 675         ret
 676
 677 We would use one fewer register if codegen'd as:
 678
 679         movsbl 4(%esp), %eax
 680         neg %eax
 681         add $7, %eax
 682         ret
 683
 684 Note that this isn't beneficial if the load can be folded into the sub.  In
 685 this case, we want a sub:
 686
 687 int test(int X) { return 7-X; }
 688 _test:
 689         movl $7, %eax
 690         subl 4(%esp), %eax
 691         ret
 692
 693 //===---------------------------------------------------------------------===//
 694
 695 Leaf functions that require one 4-byte spill slot have a prolog like this:
 696
 697 _foo:
 698         pushl   %esi
 699         subl    $4, %esp
 700 ...
 701 and an epilog like this:
 702         addl    $4, %esp
 703         popl    %esi
 704         ret
 705
 706 It would be smaller, and potentially faster, to push eax on entry and to
 707 pop into a dummy register instead of using addl/subl of esp.  Just don't pop
 708 into any return registers :)
 709
 710 //===---------------------------------------------------------------------===//
 711
 712 The X86 backend should fold (branch (or (setcc, setcc))) into multiple
 713 branches.  We generate really poor code for:
 714
 715 double testf(double a) {
 716        return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
 717 }
 718
 719 For example, the entry BB is:
 720
 721 _testf:
 722         subl    $20, %esp
 723         pxor    %xmm0, %xmm0
 724         movsd   24(%esp), %xmm1
 725         ucomisd %xmm0, %xmm1
 726         setnp   %al
 727         sete    %cl
 728         testb   %cl, %al
 729         jne     LBB1_5  # UnifiedReturnBlock
 730 LBB1_1: # cond_true
 731
 732
 733 it would be better to replace the last four instructions with:
 734
 735         jp LBB1_1
 736         je LBB1_5
 737 LBB1_1:
 738
 739 We also codegen the inner ?: into a diamond:
 740
 741        cvtss2sd        LCPI1_0(%rip), %xmm2
 742         cvtss2sd        LCPI1_1(%rip), %xmm3
 743         ucomisd %xmm1, %xmm0
 744         ja      LBB1_3  # cond_true
 745 LBB1_2: # cond_true
 746         movapd  %xmm3, %xmm2
 747 LBB1_3: # cond_true
 748         movapd  %xmm2, %xmm0
 749         ret
 750
 751 We should sink the load into xmm3 into the LBB1_2 block.  This should
 752 be pretty easy, and will nuke all the copies.
 753
 754 //===---------------------------------------------------------------------===//
 755
 756 This:
 757         #include <algorithm>
 758         inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
 759         { return std::make_pair(a + b, a + b < a); }
 760         bool no_overflow(unsigned a, unsigned b)
 761         { return !full_add(a, b).second; }
 762
 763 Should compile to:
 764
 765
 766         _Z11no_overflowjj:
 767                 addl    %edi, %esi
 768                 setae   %al
 769                 ret
 770
 771 FIXME: That code looks wrong; bool return is normally defined as zext.
 772
 773 on x86-64, not:
 774
 775 __Z11no_overflowjj:
 776         addl    %edi, %esi
 777         cmpl    %edi, %esi
 778         setae   %al
 779         movzbl  %al, %eax
 780         ret
 781
 782
 783 //===---------------------------------------------------------------------===//
 784
 785 The following code:
 786
 787 bb114.preheader:                ; preds = %cond_next94
 788         %tmp231232 = sext i16 %tmp62 to i32             ; <i32> [#uses=1]
 789         %tmp233 = sub i32 32, %tmp231232                ; <i32> [#uses=1]
 790         %tmp245246 = sext i16 %tmp65 to i32             ; <i32> [#uses=1]
 791         %tmp252253 = sext i16 %tmp68 to i32             ; <i32> [#uses=1]
 792         %tmp254 = sub i32 32, %tmp252253                ; <i32> [#uses=1]
 793         %tmp553554 = bitcast i16* %tmp37 to i8*         ; <i8*> [#uses=2]
 794         %tmp583584 = sext i16 %tmp98 to i32             ; <i32> [#uses=1]
 795         %tmp585 = sub i32 32, %tmp583584                ; <i32> [#uses=1]
 796         %tmp614615 = sext i16 %tmp101 to i32            ; <i32> [#uses=1]
 797         %tmp621622 = sext i16 %tmp104 to i32            ; <i32> [#uses=1]
 798         %tmp623 = sub i32 32, %tmp621622                ; <i32> [#uses=1]
 799         br label %bb114
 800
 801 produces:
 802
 803 LBB3_5: # bb114.preheader
 804         movswl  -68(%ebp), %eax
 805         movl    $32, %ecx
 806         movl    %ecx, -80(%ebp)
 807         subl    %eax, -80(%ebp)
 808         movswl  -52(%ebp), %eax
 809         movl    %ecx, -84(%ebp)
 810         subl    %eax, -84(%ebp)
 811         movswl  -70(%ebp), %eax
 812         movl    %ecx, -88(%ebp)
 813         subl    %eax, -88(%ebp)
 814         movswl  -50(%ebp), %eax
 815         subl    %eax, %ecx
 816         movl    %ecx, -76(%ebp)
 817         movswl  -42(%ebp), %eax
 818         movl    %eax, -92(%ebp)
 819         movswl  -66(%ebp), %eax
 820         movl    %eax, -96(%ebp)
 821         movw    $0, -98(%ebp)
 822
 823 This appears to be bad because the RA is not folding the store to the stack
 824 slot into the movl.  The above instructions could be:
 825         movl    $32, -80(%ebp)
 826 ...
 827         movl    $32, -84(%ebp)
 828 ...
 829 This seems like a cross between remat and spill folding.
 830
 831 This has redundant subtractions of %eax from a stack slot. However, %ecx doesn't
 832 change, so we could simply subtract %eax from %ecx first and then use %ecx (or
 833 vice-versa).
 834
 835 //===---------------------------------------------------------------------===//
 836
 837 This code:
 838
 839         %tmp659 = icmp slt i16 %tmp654, 0               ; <i1> [#uses=1]
 840         br i1 %tmp659, label %cond_true662, label %cond_next715
 841
 842 produces this:
 843
 844         testw   %cx, %cx
 845         movswl  %cx, %esi
 846         jns     LBB4_109        # cond_next715
 847
 848 Shark tells us that using %cx in the testw instruction is sub-optimal. It
 849 suggests using the 32-bit register (which is what ICC uses).
 850
 851 //===---------------------------------------------------------------------===//
 852
 853 We compile this:
 854
 855 void compare (long long foo) {
 856   if (foo < 4294967297LL)
 857     abort();
 858 }
 859
 860 to:
 861
 862 compare:
 863         subl    $4, %esp
 864         cmpl    $0, 8(%esp)
 865         setne   %al
 866         movzbw  %al, %ax
 867         cmpl    $1, 12(%esp)
 868         setg    %cl
 869         movzbw  %cl, %cx
 870         cmove   %ax, %cx
 871         testb   $1, %cl
 872         jne     .LBB1_2 # UnifiedReturnBlock
 873 .LBB1_1:        # ifthen
 874         call    abort
 875 .LBB1_2:        # UnifiedReturnBlock
 876         addl    $4, %esp
 877         ret
 878
 879 (also really horrible code on ppc).  This is due to the expand code for 64-bit
 880 compares.  GCC produces multiple branches, which is much nicer:
 881
 882 compare:
 883         subl    $12, %esp
 884         movl    20(%esp), %edx
 885         movl    16(%esp), %eax
 886         decl    %edx
 887         jle     .L7
 888 .L5:
 889         addl    $12, %esp
 890         ret
 891         .p2align 4,,7
 892 .L7:
 893         jl      .L4
 894         cmpl    $0, %eax
 895         .p2align 4,,8
 896         ja      .L5
 897 .L4:
 898         .p2align 4,,9
 899         call    abort
 900
 901 //===---------------------------------------------------------------------===//
 902
 903 Tail call optimization improvements: Tail call optimization currently
 904 pushes all arguments on the top of the stack (their normal place for
 905 non-tail call optimized calls) that source from the callers arguments
 906 or  that source from a virtual register (also possibly sourcing from
 907 callers arguments).
 908 This is done to prevent overwriting of parameters (see example
 909 below) that might be used later.
 910
 911 example:
 912
 913 int callee(int32, int64);
 914 int caller(int32 arg1, int32 arg2) {
 915   int64 local = arg2 * 2;
 916   return callee(arg2, (int64)local);
 917 }
 918
 919 [arg1]          [!arg2 no longer valid since we moved local onto it]
 920 [arg2]      ->  [(int64)
 921 [RETADDR]        local  ]
 922
 923 Moving arg1 onto the stack slot of callee function would overwrite
 924 arg2 of the caller.
 925
 926 Possible optimizations:
 927
 928
 929  - Analyse the actual parameters of the callee to see which would
 930    overwrite a caller parameter which is used by the callee and only
 931    push them onto the top of the stack.
 932
 933    int callee (int32 arg1, int32 arg2);
 934    int caller (int32 arg1, int32 arg2) {
 935        return callee(arg1,arg2);
 936    }
 937
 938    Here we don't need to write any variables to the top of the stack
 939    since they don't overwrite each other.
 940
 941    int callee (int32 arg1, int32 arg2);
 942    int caller (int32 arg1, int32 arg2) {
 943        return callee(arg2,arg1);
 944    }
 945
 946    Here we need to push the arguments because they overwrite each
 947    other.
 948
 949 //===---------------------------------------------------------------------===//
 950
 951 main ()
 952 {
 953   int i = 0;
 954   unsigned long int z = 0;
 955
 956   do {
 957     z -= 0x00004000;
 958     i++;
 959     if (i > 0x00040000)
 960       abort ();
 961   } while (z > 0);
 962   exit (0);
 963 }
 964
 965 gcc compiles this to:
 966
 967 _main:
 968         subl    $28, %esp
 969         xorl    %eax, %eax
 970         jmp     L2
 971 L3:
 972         cmpl    $262144, %eax
 973         je      L10
 974 L2:
 975         addl    $1, %eax
 976         cmpl    $262145, %eax
 977         jne     L3
 978         call    L_abort$stub
 979 L10:
 980         movl    $0, (%esp)
 981         call    L_exit$stub
 982
 983 llvm:
 984
 985 _main:
 986         subl    $12, %esp
 987         movl    $1, %eax
 988         movl    $16384, %ecx
 989 LBB1_1: # bb
 990         cmpl    $262145, %eax
 991         jge     LBB1_4  # cond_true
 992 LBB1_2: # cond_next
 993         incl    %eax
 994         addl    $4294950912, %ecx
 995         cmpl    $16384, %ecx
 996         jne     LBB1_1  # bb
 997 LBB1_3: # bb11
 998         xorl    %eax, %eax
 999         addl    $12, %esp
1000         ret
1001 LBB1_4: # cond_true
1002         call    L_abort$stub
1003
1004 1. LSR should rewrite the first cmp with induction variable %ecx.
1005 2. DAG combiner should fold
1006         leal    1(%eax), %edx
1007         cmpl    $262145, %edx
1008    =>
1009         cmpl    $262144, %eax
1010
1011 //===---------------------------------------------------------------------===//
1012
1013 define i64 @test(double %X) {
1014         %Y = fptosi double %X to i64
1015         ret i64 %Y
1016 }
1017
1018 compiles to:
1019
1020 _test:
1021         subl    $20, %esp
1022         movsd   24(%esp), %xmm0
1023         movsd   %xmm0, 8(%esp)
1024         fldl    8(%esp)
1025         fisttpll        (%esp)
1026         movl    4(%esp), %edx
1027         movl    (%esp), %eax
1028         addl    $20, %esp
1029         #FP_REG_KILL
1030         ret
1031
1032 This should just fldl directly from the input stack slot.
1033
1034 //===---------------------------------------------------------------------===//
1035
1036 This code:
1037 int foo (int x) { return (x & 65535) | 255; }
1038
1039 Should compile into:
1040
1041 _foo:
1042         movzwl  4(%esp), %eax
1043         orl     $255, %eax
1044         ret
1045
1046 instead of:
1047 _foo:
1048         movl    $255, %eax
1049         orl     4(%esp), %eax
1050         andl    $65535, %eax
1051         ret
1052
1053 //===---------------------------------------------------------------------===//
1054
1055 We're codegen'ing multiply of long longs inefficiently:
1056
1057 unsigned long long LLM(unsigned long long arg1, unsigned long long arg2) {
1058   return arg1 *  arg2;
1059 }
1060
1061 We compile to (fomit-frame-pointer):
1062
1063 _LLM:
1064         pushl   %esi
1065         movl    8(%esp), %ecx
1066         movl    16(%esp), %esi
1067         movl    %esi, %eax
1068         mull    %ecx
1069         imull   12(%esp), %esi
1070         addl    %edx, %esi
1071         imull   20(%esp), %ecx
1072         movl    %esi, %edx
1073         addl    %ecx, %edx
1074         popl    %esi
1075         ret
1076
1077 This looks like a scheduling deficiency and lack of remat of the load from
1078 the argument area.  ICC apparently produces:
1079
1080         movl      8(%esp), %ecx
1081         imull     12(%esp), %ecx
1082         movl      16(%esp), %eax
1083         imull     4(%esp), %eax
1084         addl      %eax, %ecx
1085         movl      4(%esp), %eax
1086         mull      12(%esp)
1087         addl      %ecx, %edx
1088         ret
1089
1090 Note that it remat'd loads from 4(esp) and 12(esp).  See this GCC PR:
1091 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
1092
1093 //===---------------------------------------------------------------------===//
1094
1095 We can fold a store into "zeroing a reg".  Instead of:
1096
1097 xorl    %eax, %eax
1098 movl    %eax, 124(%esp)
1099
1100 we should get:
1101
1102 movl    $0, 124(%esp)
1103
1104 if the flags of the xor are dead.
1105
1106 Likewise, we isel "x<<1" into "add reg,reg".  If reg is spilled, this should
1107 be folded into: shl [mem], 1
1108
1109 //===---------------------------------------------------------------------===//
1110
1111 This testcase misses a read/modify/write opportunity (from PR1425):
1112
1113 void vertical_decompose97iH1(int *b0, int *b1, int *b2, int width){
1114     int i;
1115     for(i=0; i<width; i++)
1116         b1[i] += (1*(b0[i] + b2[i])+0)>>0;
1117 }
1118
1119 We compile it down to:
1120
1121 LBB1_2: # bb
1122         movl    (%esi,%edi,4), %ebx
1123         addl    (%ecx,%edi,4), %ebx
1124         addl    (%edx,%edi,4), %ebx
1125         movl    %ebx, (%ecx,%edi,4)
1126         incl    %edi
1127         cmpl    %eax, %edi
1128         jne     LBB1_2  # bb
1129
1130 the inner loop should add to the memory location (%ecx,%edi,4), saving
1131 a mov.  Something like:
1132
1133         movl    (%esi,%edi,4), %ebx
1134         addl    (%edx,%edi,4), %ebx
1135         addl    %ebx, (%ecx,%edi,4)
1136
1137 Here is another interesting example:
1138
1139 void vertical_compose97iH1(int *b0, int *b1, int *b2, int width){
1140     int i;
1141     for(i=0; i<width; i++)
1142         b1[i] -= (1*(b0[i] + b2[i])+0)>>0;
1143 }
1144
1145 We miss the r/m/w opportunity here by using 2 subs instead of an add+sub[mem]:
1146
1147 LBB9_2: # bb
1148         movl    (%ecx,%edi,4), %ebx
1149         subl    (%esi,%edi,4), %ebx
1150         subl    (%edx,%edi,4), %ebx
1151         movl    %ebx, (%ecx,%edi,4)
1152         incl    %edi
1153         cmpl    %eax, %edi
1154         jne     LBB9_2  # bb
1155
1156 Additionally, LSR should rewrite the exit condition of these loops to use
1157 a stride-4 IV, would would allow all the scales in the loop to go away.
1158 This would result in smaller code and more efficient microops.
1159
1160 //===---------------------------------------------------------------------===//
1161
1162 In SSE mode, we turn abs and neg into a load from the constant pool plus a xor
1163 or and instruction, for example:
1164
1165         xorpd   LCPI1_0, %xmm2
1166
1167 However, if xmm2 gets spilled, we end up with really ugly code like this:
1168
1169         movsd   (%esp), %xmm0
1170         xorpd   LCPI1_0, %xmm0
1171         movsd   %xmm0, (%esp)
1172
1173 Since we 'know' that this is a 'neg', we can actually "fold" the spill into
1174 the neg/abs instruction, turning it into an *integer* operation, like this:
1175
1176         xorl 2147483648, [mem+4]     ## 2147483648 = (1 << 31)
1177
1178 you could also use xorb, but xorl is less likely to lead to a partial register
1179 stall.  Here is a contrived testcase:
1180
1181 double a, b, c;
1182 void test(double *P) {
1183   double X = *P;
1184   a = X;
1185   bar();
1186   X = -X;
1187   b = X;
1188   bar();
1189   c = X;
1190 }
1191
1192 //===---------------------------------------------------------------------===//
1193
1194 handling llvm.memory.barrier on pre SSE2 cpus
1195
1196 should generate:
1197 lock ; mov %esp, %esp
1198
1199 //===---------------------------------------------------------------------===//
1200
1201 The generated code on x86 for checking for signed overflow on a multiply the
1202 obvious way is much longer than it needs to be.
1203
1204 int x(int a, int b) {
1205   long long prod = (long long)a*b;
1206   return  prod > 0x7FFFFFFF || prod < (-0x7FFFFFFF-1);
1207 }
1208
1209 See PR2053 for more details.
1210
1211 //===---------------------------------------------------------------------===//
1212
1213 We should investigate using cdq/ctld (effect: edx = sar eax, 31)
1214 more aggressively; it should cost the same as a move+shift on any modern
1215 processor, but it's a lot shorter. Downside is that it puts more
1216 pressure on register allocation because it has fixed operands.
1217
1218 Example:
1219 int abs(int x) {return x < 0 ? -x : x;}
1220
1221 gcc compiles this to the following when using march/mtune=pentium2/3/4/m/etc.:
1222 abs:
1223         movl    4(%esp), %eax
1224         cltd
1225         xorl    %edx, %eax
1226         subl    %edx, %eax
1227         ret
1228
1229 //===---------------------------------------------------------------------===//
1230
1231 Consider:
1232 int test(unsigned long a, unsigned long b) { return -(a < b); }
1233
1234 We currently compile this to:
1235
1236 define i32 @test(i32 %a, i32 %b) nounwind  {
1237         %tmp3 = icmp ult i32 %a, %b             ; <i1> [#uses=1]
1238         %tmp34 = zext i1 %tmp3 to i32           ; <i32> [#uses=1]
1239         %tmp5 = sub i32 0, %tmp34               ; <i32> [#uses=1]
1240         ret i32 %tmp5
1241 }
1242
1243 and
1244
1245 _test:
1246         movl    8(%esp), %eax
1247         cmpl    %eax, 4(%esp)
1248         setb    %al
1249         movzbl  %al, %eax
1250         negl    %eax
1251         ret
1252
1253 Several deficiencies here.  First, we should instcombine zext+neg into sext:
1254
1255 define i32 @test2(i32 %a, i32 %b) nounwind  {
1256         %tmp3 = icmp ult i32 %a, %b             ; <i1> [#uses=1]
1257         %tmp34 = sext i1 %tmp3 to i32           ; <i32> [#uses=1]
1258         ret i32 %tmp34
1259 }
1260
1261 However, before we can do that, we have to fix the bad codegen that we get for
1262 sext from bool:
1263
1264 _test2:
1265         movl    8(%esp), %eax
1266         cmpl    %eax, 4(%esp)
1267         setb    %al
1268         movzbl  %al, %eax
1269         shll    $31, %eax
1270         sarl    $31, %eax
1271         ret
1272
1273 This code should be at least as good as the code above.  Once this is fixed, we
1274 can optimize this specific case even more to:
1275
1276         movl    8(%esp), %eax
1277         xorl    %ecx, %ecx
1278         cmpl    %eax, 4(%esp)
1279         sbbl    %ecx, %ecx
1280
1281 //===---------------------------------------------------------------------===//
1282
1283 Take the following code (from
1284 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16541):
1285
1286 extern unsigned char first_one[65536];
1287 int FirstOnet(unsigned long long arg1)
1288 {
1289   if (arg1 >> 48)
1290     return (first_one[arg1 >> 48]);
1291   return 0;
1292 }
1293
1294
1295 The following code is currently generated:
1296 FirstOnet:
1297         movl    8(%esp), %eax
1298         cmpl    $65536, %eax
1299         movl    4(%esp), %ecx
1300         jb      .LBB1_2 # UnifiedReturnBlock
1301 .LBB1_1:        # ifthen
1302         shrl    $16, %eax
1303         movzbl  first_one(%eax), %eax
1304         ret
1305 .LBB1_2:        # UnifiedReturnBlock
1306         xorl    %eax, %eax
1307         ret
1308
1309 There are a few possible improvements here:
1310 1. We should be able to eliminate the dead load into %ecx
1311 2. We could change the "movl 8(%esp), %eax" into
1312    "movzwl 10(%esp), %eax"; this lets us change the cmpl
1313    into a testl, which is shorter, and eliminate the shift.
1314
1315 We could also in theory eliminate the branch by using a conditional
1316 for the address of the load, but that seems unlikely to be worthwhile
1317 in general.
1318
1319 //===---------------------------------------------------------------------===//
1320
1321 We compile this function:
1322
1323 define i32 @foo(i32 %a, i32 %b, i32 %c, i8 zeroext  %d) nounwind  {
1324 entry:
1325         %tmp2 = icmp eq i8 %d, 0                ; <i1> [#uses=1]
1326         br i1 %tmp2, label %bb7, label %bb
1327
1328 bb:             ; preds = %entry
1329         %tmp6 = add i32 %b, %a          ; <i32> [#uses=1]
1330         ret i32 %tmp6
1331
1332 bb7:            ; preds = %entry
1333         %tmp10 = sub i32 %a, %c         ; <i32> [#uses=1]
1334         ret i32 %tmp10
1335 }
1336
1337 to:
1338
1339 _foo:
1340         cmpb    $0, 16(%esp)
1341         movl    12(%esp), %ecx
1342         movl    8(%esp), %eax
1343         movl    4(%esp), %edx
1344         je      LBB1_2  # bb7
1345 LBB1_1: # bb
1346         addl    %edx, %eax
1347         ret
1348 LBB1_2: # bb7
1349         movl    %edx, %eax
1350         subl    %ecx, %eax
1351         ret
1352
1353 The coalescer could coalesce "edx" with "eax" to avoid the movl in LBB1_2
1354 if it commuted the addl in LBB1_1.
1355
1356 //===---------------------------------------------------------------------===//
1357
1358 See rdar://4653682.
1359
1360 From flops:
1361
1362 LBB1_15:        # bb310
1363         cvtss2sd        LCPI1_0, %xmm1
1364         addsd   %xmm1, %xmm0
1365         movsd   176(%esp), %xmm2
1366         mulsd   %xmm0, %xmm2
1367         movapd  %xmm2, %xmm3
1368         mulsd   %xmm3, %xmm3
1369         movapd  %xmm3, %xmm4
1370         mulsd   LCPI1_23, %xmm4
1371         addsd   LCPI1_24, %xmm4
1372         mulsd   %xmm3, %xmm4
1373         addsd   LCPI1_25, %xmm4
1374         mulsd   %xmm3, %xmm4
1375         addsd   LCPI1_26, %xmm4
1376         mulsd   %xmm3, %xmm4
1377         addsd   LCPI1_27, %xmm4
1378         mulsd   %xmm3, %xmm4
1379         addsd   LCPI1_28, %xmm4
1380         mulsd   %xmm3, %xmm4
1381         addsd   %xmm1, %xmm4
1382         mulsd   %xmm2, %xmm4
1383         movsd   152(%esp), %xmm1
1384         addsd   %xmm4, %xmm1
1385         movsd   %xmm1, 152(%esp)
1386         incl    %eax
1387         cmpl    %eax, %esi
1388         jge     LBB1_15 # bb310
1389 LBB1_16:        # bb358.loopexit
1390         movsd   152(%esp), %xmm0
1391         addsd   %xmm0, %xmm0
1392         addsd   LCPI1_22, %xmm0
1393         movsd   %xmm0, 152(%esp)
1394
1395 Rather than spilling the result of the last addsd in the loop, we should have
1396 insert a copy to split the interval (one for the duration of the loop, one
1397 extending to the fall through). The register pressure in the loop isn't high
1398 enough to warrant the spill.
1399
1400 Also check why xmm7 is not used at all in the function.
1401
1402 //===---------------------------------------------------------------------===//
1403
1404 Legalize loses track of the fact that bools are always zero extended when in
1405 memory.  This causes us to compile abort_gzip (from 164.gzip) from:
1406
1407 target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
1408 target triple = "i386-apple-darwin8"
1409 @in_exit.4870.b = internal global i1 false              ; <i1*> [#uses=2]
1410 define fastcc void @abort_gzip() noreturn nounwind  {
1411 entry:
1412         %tmp.b.i = load i1* @in_exit.4870.b             ; <i1> [#uses=1]
1413         br i1 %tmp.b.i, label %bb.i, label %bb4.i
1414 bb.i:           ; preds = %entry
1415         tail call void @exit( i32 1 ) noreturn nounwind
1416         unreachable
1417 bb4.i:          ; preds = %entry
1418         store i1 true, i1* @in_exit.4870.b
1419         tail call void @exit( i32 1 ) noreturn nounwind
1420         unreachable
1421 }
1422 declare void @exit(i32) noreturn nounwind
1423
1424 into:
1425
1426 _abort_gzip:
1427         subl    $12, %esp
1428         movb    _in_exit.4870.b, %al
1429         notb    %al
1430         testb   $1, %al
1431         jne     LBB1_2  ## bb4.i
1432 LBB1_1: ## bb.i
1433   ...
1434
1435 //===---------------------------------------------------------------------===//
1436
1437 We compile:
1438
1439 int test(int x, int y) {
1440   return x-y-1;
1441 }
1442
1443 into (-m64):
1444
1445 _test:
1446         decl    %edi
1447         movl    %edi, %eax
1448         subl    %esi, %eax
1449         ret
1450
1451 it would be better to codegen as: x+~y  (notl+addl)
1452
1453 //===---------------------------------------------------------------------===//
1454
1455 This code:
1456
1457 int foo(const char *str,...)
1458 {
1459  __builtin_va_list a; int x;
1460  __builtin_va_start(a,str); x = __builtin_va_arg(a,int); __builtin_va_end(a);
1461  return x;
1462 }
1463
1464 gets compiled into this on x86-64:
1465         subq    $200, %rsp
1466         movaps  %xmm7, 160(%rsp)
1467         movaps  %xmm6, 144(%rsp)
1468         movaps  %xmm5, 128(%rsp)
1469         movaps  %xmm4, 112(%rsp)
1470         movaps  %xmm3, 96(%rsp)
1471         movaps  %xmm2, 80(%rsp)
1472         movaps  %xmm1, 64(%rsp)
1473         movaps  %xmm0, 48(%rsp)
1474         movq    %r9, 40(%rsp)
1475         movq    %r8, 32(%rsp)
1476         movq    %rcx, 24(%rsp)
1477         movq    %rdx, 16(%rsp)
1478         movq    %rsi, 8(%rsp)
1479         leaq    (%rsp), %rax
1480         movq    %rax, 192(%rsp)
1481         leaq    208(%rsp), %rax
1482         movq    %rax, 184(%rsp)
1483         movl    $48, 180(%rsp)
1484         movl    $8, 176(%rsp)
1485         movl    176(%rsp), %eax
1486         cmpl    $47, %eax
1487         jbe     .LBB1_3 # bb
1488 .LBB1_1:        # bb3
1489         movq    184(%rsp), %rcx
1490         leaq    8(%rcx), %rax
1491         movq    %rax, 184(%rsp)
1492 .LBB1_2:        # bb4
1493         movl    (%rcx), %eax
1494         addq    $200, %rsp
1495         ret
1496 .LBB1_3:        # bb
1497         movl    %eax, %ecx
1498         addl    $8, %eax
1499         addq    192(%rsp), %rcx
1500         movl    %eax, 176(%rsp)
1501         jmp     .LBB1_2 # bb4
1502
1503 gcc 4.3 generates:
1504         subq    $96, %rsp
1505 .LCFI0:
1506         leaq    104(%rsp), %rax
1507         movq    %rsi, -80(%rsp)
1508         movl    $8, -120(%rsp)
1509         movq    %rax, -112(%rsp)
1510         leaq    -88(%rsp), %rax
1511         movq    %rax, -104(%rsp)
1512         movl    $8, %eax
1513         cmpl    $48, %eax
1514         jb      .L6
1515         movq    -112(%rsp), %rdx
1516         movl    (%rdx), %eax
1517         addq    $96, %rsp
1518         ret
1519         .p2align 4,,10
1520         .p2align 3
1521 .L6:
1522         mov     %eax, %edx
1523         addq    -104(%rsp), %rdx
1524         addl    $8, %eax
1525         movl    %eax, -120(%rsp)
1526         movl    (%rdx), %eax
1527         addq    $96, %rsp
1528         ret
1529
1530 and it gets compiled into this on x86:
1531         pushl   %ebp
1532         movl    %esp, %ebp
1533         subl    $4, %esp
1534         leal    12(%ebp), %eax
1535         movl    %eax, -4(%ebp)
1536         leal    16(%ebp), %eax
1537         movl    %eax, -4(%ebp)
1538         movl    12(%ebp), %eax
1539         addl    $4, %esp
1540         popl    %ebp
1541         ret
1542
1543 gcc 4.3 generates:
1544         pushl   %ebp
1545         movl    %esp, %ebp
1546         movl    12(%ebp), %eax
1547         popl    %ebp
1548         ret
1549
1550 //===---------------------------------------------------------------------===//
1551
1552 Teach tblgen not to check bitconvert source type in some cases. This allows us
1553 to consolidate the following patterns in X86InstrMMX.td:
1554
1555 def : Pat<(v2i32 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
1556                                                   (iPTR 0))))),
1557           (v2i32 (MMX_MOVDQ2Qrr VR128:$src))>;
1558 def : Pat<(v4i16 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
1559                                                   (iPTR 0))))),
1560           (v4i16 (MMX_MOVDQ2Qrr VR128:$src))>;
1561 def : Pat<(v8i8 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
1562                                                   (iPTR 0))))),
1563           (v8i8 (MMX_MOVDQ2Qrr VR128:$src))>;
1564
1565 There are other cases in various td files.
1566
1567 //===---------------------------------------------------------------------===//
1568
1569 Take something like the following on x86-32:
1570 unsigned a(unsigned long long x, unsigned y) {return x % y;}
1571
1572 We currently generate a libcall, but we really shouldn't: the expansion is
1573 shorter and likely faster than the libcall.  The expected code is something
1574 like the following:
1575
1576         movl    12(%ebp), %eax
1577         movl    16(%ebp), %ecx
1578         xorl    %edx, %edx
1579         divl    %ecx
1580         movl    8(%ebp), %eax
1581         divl    %ecx
1582         movl    %edx, %eax
1583         ret
1584
1585 A similar code sequence works for division.
1586
1587 //===---------------------------------------------------------------------===//
1588
1589 These should compile to the same code, but the later codegen's to useless
1590 instructions on X86. This may be a trivial dag combine (GCC PR7061):
1591
1592 struct s1 { unsigned char a, b; };
1593 unsigned long f1(struct s1 x) {
1594     return x.a + x.b;
1595 }
1596 struct s2 { unsigned a: 8, b: 8; };
1597 unsigned long f2(struct s2 x) {
1598     return x.a + x.b;
1599 }
1600
1601 //===---------------------------------------------------------------------===//
1602
1603 We currently compile this:
1604
1605 define i32 @func1(i32 %v1, i32 %v2) nounwind {
1606 entry:
1607   %t = call {i32, i1} @llvm.sadd.with.overflow.i32(i32 %v1, i32 %v2)
1608   %sum = extractvalue {i32, i1} %t, 0
1609   %obit = extractvalue {i32, i1} %t, 1
1610   br i1 %obit, label %overflow, label %normal
1611 normal:
1612   ret i32 %sum
1613 overflow:
1614   call void @llvm.trap()
1615   unreachable
1616 }
1617 declare {i32, i1} @llvm.sadd.with.overflow.i32(i32, i32)
1618 declare void @llvm.trap()
1619
1620 to:
1621
1622 _func1:
1623         movl    4(%esp), %eax
1624         addl    8(%esp), %eax
1625         jo      LBB1_2  ## overflow
1626 LBB1_1: ## normal
1627         ret
1628 LBB1_2: ## overflow
1629         ud2
1630
1631 it would be nice to produce "into" someday.
1632
1633 //===---------------------------------------------------------------------===//
1634
1635 This code:
1636
1637 void vec_mpys1(int y[], const int x[], int scaler) {
1638 int i;
1639 for (i = 0; i < 150; i++)
1640  y[i] += (((long long)scaler * (long long)x[i]) >> 31);
1641 }
1642
1643 Compiles to this loop with GCC 3.x:
1644
1645 .L5:
1646         movl    %ebx, %eax
1647         imull   (%edi,%ecx,4)
1648         shrdl   $31, %edx, %eax
1649         addl    %eax, (%esi,%ecx,4)
1650         incl    %ecx
1651         cmpl    $149, %ecx
1652         jle     .L5
1653
1654 llvm-gcc compiles it to the much uglier:
1655
1656 LBB1_1: ## bb1
1657         movl    24(%esp), %eax
1658         movl    (%eax,%edi,4), %ebx
1659         movl    %ebx, %ebp
1660         imull   %esi, %ebp
1661         movl    %ebx, %eax
1662         mull    %ecx
1663         addl    %ebp, %edx
1664         sarl    $31, %ebx
1665         imull   %ecx, %ebx
1666         addl    %edx, %ebx
1667         shldl   $1, %eax, %ebx
1668         movl    20(%esp), %eax
1669         addl    %ebx, (%eax,%edi,4)
1670         incl    %edi
1671         cmpl    $150, %edi
1672         jne     LBB1_1  ## bb1
1673
1674 The issue is that we hoist the cast of "scaler" to long long outside of the
1675 loop, the value comes into the loop as two values, and
1676 RegsForValue::getCopyFromRegs doesn't know how to put an AssertSext on the
1677 constructed BUILD_PAIR which represents the cast value.
1678
1679 //===---------------------------------------------------------------------===//
1680
1681 Test instructions can be eliminated by using EFLAGS values from arithmetic
1682 instructions. This is currently not done for mul, and, or, xor, neg, shl,
1683 sra, srl, shld, shrd, atomic ops, and others. It is also currently not done
1684 for read-modify-write instructions. It is also current not done if the
1685 OF or CF flags are needed.
1686
1687 The shift operators have the complication that when the shift count is
1688 zero, EFLAGS is not set, so they can only subsume a test instruction if
1689 the shift count is known to be non-zero. Also, using the EFLAGS value
1690 from a shift is apparently very slow on some x86 implementations.
1691
1692 In read-modify-write instructions, the root node in the isel match is
1693 the store, and isel has no way for the use of the EFLAGS result of the
1694 arithmetic to be remapped to the new node.
1695
1696 Add and subtract instructions set OF on signed overflow and CF on unsiged
1697 overflow, while test instructions always clear OF and CF. In order to
1698 replace a test with an add or subtract in a situation where OF or CF is
1699 needed, codegen must be able to prove that the operation cannot see
1700 signed or unsigned overflow, respectively.
1701
1702 //===---------------------------------------------------------------------===//
1703
1704 memcpy/memmove do not lower to SSE copies when possible.  A silly example is:
1705 define <16 x float> @foo(<16 x float> %A) nounwind {
1706         %tmp = alloca <16 x float>, align 16
1707         %tmp2 = alloca <16 x float>, align 16
1708         store <16 x float> %A, <16 x float>* %tmp
1709         %s = bitcast <16 x float>* %tmp to i8*
1710         %s2 = bitcast <16 x float>* %tmp2 to i8*
1711         call void @llvm.memcpy.i64(i8* %s, i8* %s2, i64 64, i32 16)
1712         %R = load <16 x float>* %tmp2
1713         ret <16 x float> %R
1714 }
1715
1716 declare void @llvm.memcpy.i64(i8* nocapture, i8* nocapture, i64, i32) nounwind
1717
1718 which compiles to:
1719
1720 _foo:
1721         subl    $140, %esp
1722         movaps  %xmm3, 112(%esp)
1723         movaps  %xmm2, 96(%esp)
1724         movaps  %xmm1, 80(%esp)
1725         movaps  %xmm0, 64(%esp)
1726         movl    60(%esp), %eax
1727         movl    %eax, 124(%esp)
1728         movl    56(%esp), %eax
1729         movl    %eax, 120(%esp)
1730         movl    52(%esp), %eax
1731         <many many more 32-bit copies>
1732         movaps  (%esp), %xmm0
1733         movaps  16(%esp), %xmm1
1734         movaps  32(%esp), %xmm2
1735         movaps  48(%esp), %xmm3
1736         addl    $140, %esp
1737         ret
1738
1739 On Nehalem, it may even be cheaper to just use movups when unaligned than to
1740 fall back to lower-granularity chunks.
1741
1742 //===---------------------------------------------------------------------===//
1743
1744 Implement processor-specific optimizations for parity with GCC on these
1745 processors.  GCC does two optimizations:
1746
1747 1. ix86_pad_returns inserts a noop before ret instructions if immediately
1748    preceeded by a conditional branch or is the target of a jump.
1749 2. ix86_avoid_jump_misspredicts inserts noops in cases where a 16-byte block of
1750    code contains more than 3 branches.
1751
1752 The first one is done for all AMDs, Core2, and "Generic"
1753 The second one is done for: Atom, Pentium Pro, all AMDs, Pentium 4, Nocona,
1754   Core 2, and "Generic"
1755
1756 //===---------------------------------------------------------------------===//
1757
1758 Testcase:
1759 int a(int x) { return (x & 127) > 31; }
1760
1761 Current output:
1762         movl    4(%esp), %eax
1763         andl    $127, %eax
1764         cmpl    $31, %eax
1765         seta    %al
1766         movzbl  %al, %eax
1767         ret
1768
1769 Ideal output:
1770         xorl    %eax, %eax
1771         testl   $96, 4(%esp)
1772         setne   %al
1773         ret
1774
1775 This should definitely be done in instcombine, canonicalizing the range
1776 condition into a != condition.  We get this IR:
1777
1778 define i32 @a(i32 %x) nounwind readnone {
1779 entry:
1780         %0 = and i32 %x, 127            ; <i32> [#uses=1]
1781         %1 = icmp ugt i32 %0, 31                ; <i1> [#uses=1]
1782         %2 = zext i1 %1 to i32          ; <i32> [#uses=1]
1783         ret i32 %2
1784 }
1785
1786 Instcombine prefers to strength reduce relational comparisons to equality
1787 comparisons when possible, this should be another case of that.  This could
1788 be handled pretty easily in InstCombiner::visitICmpInstWithInstAndIntCst, but it
1789 looks like InstCombiner::visitICmpInstWithInstAndIntCst should really already
1790 be redesigned to use ComputeMaskedBits and friends.
1791
1792
1793 //===---------------------------------------------------------------------===//
1794 Testcase:
1795 int x(int a) { return (a&0xf0)>>4; }
1796
1797 Current output:
1798         movl    4(%esp), %eax
1799         shrl    $4, %eax
1800         andl    $15, %eax
1801         ret
1802
1803 Ideal output:
1804         movzbl  4(%esp), %eax
1805         shrl    $4, %eax
1806         ret
1807
1808 //===---------------------------------------------------------------------===//
1809
1810 Testcase:
1811 int x(int a) { return (a & 0x80) ? 0x100 : 0; }
1812 int y(int a) { return (a & 0x80) *2; }
1813
1814 Current:
1815         testl   $128, 4(%esp)
1816         setne   %al
1817         movzbl  %al, %eax
1818         shll    $8, %eax
1819         ret
1820
1821 Better:
1822         movl    4(%esp), %eax
1823         addl    %eax, %eax
1824         andl    $256, %eax
1825         ret
1826
1827 This is another general instcombine transformation that is profitable on all
1828 targets.  In LLVM IR, these functions look like this:
1829
1830 define i32 @x(i32 %a) nounwind readnone {
1831 entry:
1832         %0 = and i32 %a, 128
1833         %1 = icmp eq i32 %0, 0
1834         %iftmp.0.0 = select i1 %1, i32 0, i32 256
1835         ret i32 %iftmp.0.0
1836 }
1837
1838 define i32 @y(i32 %a) nounwind readnone {
1839 entry:
1840         %0 = shl i32 %a, 1
1841         %1 = and i32 %0, 256
1842         ret i32 %1
1843 }
1844
1845 Replacing an icmp+select with a shift should always be considered profitable in
1846 instcombine.
1847
1848 //===---------------------------------------------------------------------===//
1849
1850 Re-implement atomic builtins __sync_add_and_fetch() and __sync_sub_and_fetch
1851 properly.
1852
1853 When the return value is not used (i.e. only care about the value in the
1854 memory), x86 does not have to use add to implement these. Instead, it can use
1855 add, sub, inc, dec instructions with the "lock" prefix.
1856
1857 This is currently implemented using a bit of instruction selection trick. The
1858 issue is the target independent pattern produces one output and a chain and we
1859 want to map it into one that just output a chain. The current trick is to select
1860 it into a MERGE_VALUES with the first definition being an implicit_def. The
1861 proper solution is to add new ISD opcodes for the no-output variant. DAG
1862 combiner can then transform the node before it gets to target node selection.
1863
1864 Problem #2 is we are adding a whole bunch of x86 atomic instructions when in
1865 fact these instructions are identical to the non-lock versions. We need a way to
1866 add target specific information to target nodes and have this information
1867 carried over to machine instructions. Asm printer (or JIT) can use this
1868 information to add the "lock" prefix.
1869
1870 //===---------------------------------------------------------------------===//