time, not at spiller time). *Note* however that this can only be done
if Y is dead. Here's a testcase:
-%.str_3 = external global [15 x sbyte] ; <[15 x sbyte]*> [#uses=0]
-implementation ; Functions:
-declare void %printf(int, ...)
-void %main() {
+@.str_3 = external global [15 x i8]
+declare void @printf(i32, ...)
+define void @main() {
build_tree.exit:
- br label %no_exit.i7
-no_exit.i7: ; preds = %no_exit.i7, %build_tree.exit
- %tmp.0.1.0.i9 = phi double [ 0.000000e+00, %build_tree.exit ], [ %tmp.34.i18, %no_exit.i7 ] ; <double> [#uses=1]
- %tmp.0.0.0.i10 = phi double [ 0.000000e+00, %build_tree.exit ], [ %tmp.28.i16, %no_exit.i7 ] ; <double> [#uses=1]
- %tmp.28.i16 = add double %tmp.0.0.0.i10, 0.000000e+00
- %tmp.34.i18 = add double %tmp.0.1.0.i9, 0.000000e+00
- br bool false, label %Compute_Tree.exit23, label %no_exit.i7
-Compute_Tree.exit23: ; preds = %no_exit.i7
- tail call void (int, ...)* %printf( int 0 )
- store double %tmp.34.i18, double* null
- ret void
+ br label %no_exit.i7
+
+no_exit.i7: ; preds = %no_exit.i7, %build_tree.exit
+ %tmp.0.1.0.i9 = phi double [ 0.000000e+00, %build_tree.exit ],
+ [ %tmp.34.i18, %no_exit.i7 ]
+ %tmp.0.0.0.i10 = phi double [ 0.000000e+00, %build_tree.exit ],
+ [ %tmp.28.i16, %no_exit.i7 ]
+ %tmp.28.i16 = add double %tmp.0.0.0.i10, 0.000000e+00
+ %tmp.34.i18 = add double %tmp.0.1.0.i9, 0.000000e+00
+ br i1 false, label %Compute_Tree.exit23, label %no_exit.i7
+
+Compute_Tree.exit23: ; preds = %no_exit.i7
+ tail call void (i32, ...)* @printf( i32 0 )
+ store double %tmp.34.i18, double* null
+ ret void
}
We currently emit:
//===---------------------------------------------------------------------===//
-Currently the x86 codegen isn't very good at mixing SSE and FPStack
-code:
-
-unsigned int foo(double x) { return x; }
-
-foo:
- subl $20, %esp
- movsd 24(%esp), %xmm0
- movsd %xmm0, 8(%esp)
- fldl 8(%esp)
- fisttpll (%esp)
- movl (%esp), %eax
- addl $20, %esp
- ret
-
-This will be solved when we go to a dynamic programming based isel.
-
-//===---------------------------------------------------------------------===//
-
Lower memcpy / memset to a series of SSE 128 bit move instructions when it's
feasible.
//===---------------------------------------------------------------------===//
-Teach the coalescer to commute 2-addr instructions, allowing us to eliminate
-the reg-reg copy in this example:
-
-float foo(int *x, float *y, unsigned c) {
- float res = 0.0;
- unsigned i;
- for (i = 0; i < c; i++) {
- float xx = (float)x[i];
- xx = xx * y[i];
- xx += res;
- res = xx;
- }
- return res;
-}
-
-LBB_foo_3: # no_exit
- cvtsi2ss %XMM0, DWORD PTR [%EDX + 4*%ESI]
- mulss %XMM0, DWORD PTR [%EAX + 4*%ESI]
- addss %XMM0, %XMM1
- inc %ESI
- cmp %ESI, %ECX
-**** movaps %XMM1, %XMM0
- jb LBB_foo_3 # no_exit
-
-//===---------------------------------------------------------------------===//
-
Codegen:
if (copysign(1.0, x) == copysign(1.0, y))
into:
//===---------------------------------------------------------------------===//
-"converting 64-bit constant pool entry to 32-bit not necessarily beneficial"
-http://llvm.org/PR1264
-
-For this test case:
-
-define double @foo(double %x) {
- %y = mul double %x, 5.000000e-01
- ret double %y
-}
-
-llc -march=x86-64 currently produces a 32-bit constant pool entry and this code:
-
- cvtss2sd .LCPI1_0(%rip), %xmm1
- mulsd %xmm1, %xmm0
-
-instead of just using a 64-bit constant pool entry with this:
-
- mulsd .LCPI1_0(%rip), %xmm0
-
-This is due to the code in ExpandConstantFP in LegalizeDAG.cpp. It notices that
-x86-64 indeed has an instruction to load a 32-bit float from memory and convert
-it into a 64-bit float in a register, however it doesn't notice that this isn't
-beneficial because it prevents the load from being folded into the multiply.
-
-//===---------------------------------------------------------------------===//
-
These functions:
#include <xmmintrin.h>
define i64 @ccosf(float %z.0, float %z.1) nounwind readonly {
entry:
- %tmp6 = sub float -0.000000e+00, %z.1 ; <float> [#uses=1]
- %tmp20 = tail call i64 @ccoshf( float %tmp6, float %z.0 ) nounwind readonly ; <i64> [#uses=1]
- ret i64 %tmp20
+ %tmp6 = sub float -0.000000e+00, %z.1 ; <float> [#uses=1]
+ %tmp20 = tail call i64 @ccoshf( float %tmp6, float %z.0 ) nounwind readonly
+ ret i64 %tmp20
}
This currently compiles to:
insertions.
See comments in LowerINSERT_VECTOR_ELT_SSE4.
+
+//===---------------------------------------------------------------------===//
+
+On a random note, SSE2 should declare insert/extract of 2 x f64 as legal, not
+Custom. All combinations of insert/extract reg-reg, reg-mem, and mem-reg are
+legal; it'll just take a few extra patterns written in the .td file.
+
+Note: this is not a code quality issue; the custom lowered code happens to be
+right, but we shouldn't have to custom lower anything. This is probably related
+to <2 x i64> ops being so bad.
+
+//===---------------------------------------------------------------------===//
+
+'select' on vectors and scalars could be a whole lot better. We currently
+lower them to conditional branches. On x86-64 for example, we compile this:
+
+double test(double a, double b, double c, double d) { return a<b ? c : d; }
+
+to:
+
+_test:
+ ucomisd %xmm0, %xmm1
+ ja LBB1_2 # entry
+LBB1_1: # entry
+ movapd %xmm3, %xmm2
+LBB1_2: # entry
+ movapd %xmm2, %xmm0
+ ret
+
+instead of:
+
+_test:
+ cmpltsd %xmm1, %xmm0
+ andpd %xmm0, %xmm2
+ andnpd %xmm3, %xmm0
+ orpd %xmm2, %xmm0
+ ret
+
+For unpredictable branches, the latter is much more efficient. This should
+just be a matter of having scalar SSE map to SELECT_CC and custom expanding
+or iseling it.
+
+//===---------------------------------------------------------------------===//
+
+Take the following code:
+
+#include <xmmintrin.h>
+__m128i doload64(short x) {return _mm_set_epi16(x,x,x,x,x,x,x,x);}
+
+LLVM currently generates the following on x86:
+doload64:
+ movzwl 4(%esp), %eax
+ movd %eax, %xmm0
+ punpcklwd %xmm0, %xmm0
+ pshufd $0, %xmm0, %xmm0
+ ret
+
+gcc's generated code:
+doload64:
+ movd 4(%esp), %xmm0
+ punpcklwd %xmm0, %xmm0
+ pshufd $0, %xmm0, %xmm0
+ ret
+
+LLVM should be able to generate the same thing as gcc. This looks like it is
+just a matter of matching (scalar_to_vector (load x)) to movd.
+
+//===---------------------------------------------------------------------===//
+
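Reviewer sketch for the 'select' note in this patch: the branchless
cmpltsd / andpd / andnpd / orpd sequence it asks for can be written in C with
SSE2 intrinsics. This is only an illustration of the mask-and-blend idea, not
what the backend emits and not a proposed implementation. The function name
select_lt is hypothetical, and the code assumes an x86 target with SSE2
enabled (the default on x86-64).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: branchless equivalent of "a < b ? c : d". */
static double select_lt(double a, double b, double c, double d) {
    __m128d va = _mm_set_sd(a);
    __m128d vb = _mm_set_sd(b);
    __m128d vc = _mm_set_sd(c);
    __m128d vd = _mm_set_sd(d);

    /* cmpltsd: low lane becomes all-ones if a < b, all-zeros otherwise
       (including when either operand is NaN). */
    __m128d mask = _mm_cmplt_sd(va, vb);

    /* andpd: keep c where the mask is set. */
    __m128d take_c = _mm_and_pd(mask, vc);

    /* andnpd: keep d where the mask is clear (computes ~mask & d). */
    __m128d take_d = _mm_andnot_pd(mask, vd);

    /* orpd: exactly one of the two operands is non-zero in the low lane. */
    return _mm_cvtsd_f64(_mm_or_pd(take_c, take_d));
}
```

Because the result is assembled from bit masks rather than a conditional jump,
the cost does not depend on how predictable the comparison is, which is the
case the note singles out.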