... which should only be one imul instruction.
+This can be done with a custom expander, but it would be nice to move this to
+generic code.
+
//===---------------------------------------------------------------------===//
This should be one DIV/IDIV instruction, not a libcall:
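
A hedged guess at the intended example (a 64-bit value divided by a 32-bit
value, which on x86-32 currently becomes a libcall such as __udivdi3):

unsigned test(unsigned long long X, unsigned Y) {
        return X / Y;
}
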
This can be done trivially with a custom legalizer. What about overflow
though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224
-//===---------------------------------------------------------------------===//
-
-Some targets (e.g. athlons) prefer freep to fstp ST(0):
-http://gcc.gnu.org/ml/gcc-patches/2004-04/msg00659.html
-
-//===---------------------------------------------------------------------===//
-
-This should use fiadd on chips where it is profitable:
-double foo(double P, int *I) { return P+*I; }
-
-We have fiadd patterns now but the followings have the same cost and
-complexity. We need a way to specify the later is more profitable.
-
-def FpADD32m : FpI<(ops RFP:$dst, RFP:$src1, f32mem:$src2), OneArgFPRW,
- [(set RFP:$dst, (fadd RFP:$src1,
- (extloadf64f32 addr:$src2)))]>;
- // ST(0) = ST(0) + [mem32]
-
-def FpIADD32m : FpI<(ops RFP:$dst, RFP:$src1, i32mem:$src2), OneArgFPRW,
- [(set RFP:$dst, (fadd RFP:$src1,
- (X86fild addr:$src2, i32)))]>;
- // ST(0) = ST(0) + [mem32int]
-
-//===---------------------------------------------------------------------===//
-
-The FP stackifier needs to be global. Also, it should handle simple permutates
-to reduce number of shuffle instructions, e.g. turning:
-
-fld P -> fld Q
-fld Q fld P
-fxch
-
-or:
-
-fxch -> fucomi
-fucomi jl X
-jg X
-
-Ideas:
-http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02410.html
-
-
//===---------------------------------------------------------------------===//
Improvements to the multiply -> shift/add algorithm:
Another useful one would be ~0ULL >> X and ~0ULL << X.
+A better solution for 1LL << x is:
+ xorl %eax, %eax
+ xorl %edx, %edx
+ testb $32, %cl
+ sete %al
+ setne %dl
+ sall %cl, %eax
+ sall %cl, %edx
+
+But that requires good 8-bit subreg support.
+
+64-bit shifts (in general) expand to really bad code. Instead of using
+cmovs, we should expand to a conditional branch like GCC produces.
+
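+A hedged sketch of a branch-based expansion for a 64-bit left shift (the
+function name and argument layout below are illustrative, with the result
+returned in edx:eax):
+
+_lshift64:
+        movl    4(%esp), %eax       # low half of x
+        movl    8(%esp), %edx       # high half of x
+        movb    12(%esp), %cl       # shift count
+        shldl   %cl, %eax, %edx     # high = (high << cl) | (low >> (32-cl))
+        sall    %cl, %eax           # low  = low << (cl & 31)
+        testb   $32, %cl
+        je      1f                  # count < 32: done
+        movl    %eax, %edx          # count >= 32: high = low << (cl-32)
+        xorl    %eax, %eax          #              low  = 0
+1:      ret
+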
//===---------------------------------------------------------------------===//
Compile this:
//===---------------------------------------------------------------------===//
-Add a target specific hook to DAG combiner to handle SINT_TO_FP and
-FP_TO_SINT when the source operand is already in memory.
-
-//===---------------------------------------------------------------------===//
-
-Model X86 EFLAGS as a real register to avoid redudant cmp / test. e.g.
-
- cmpl $1, %eax
- setg %al
- testb %al, %al # unnecessary
- jne .BB7
-
-//===---------------------------------------------------------------------===//
-
Count leading zeros and count trailing zeros:
int clz(int X) { return __builtin_clz(X); }
//===---------------------------------------------------------------------===//
-Open code rint,floor,ceil,trunc:
-http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02006.html
-http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02011.html
-
-//===---------------------------------------------------------------------===//
-
-Combine: a = sin(x), b = cos(x) into a,b = sincos(x).
-
-Expand these to calls of sin/cos and stores:
- double sincos(double x, double *sin, double *cos);
- float sincosf(float x, float *sin, float *cos);
- long double sincosl(long double x, long double *sin, long double *cos);
-
-Doing so could allow SROA of the destination pointers. See also:
-http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687
-
-//===---------------------------------------------------------------------===//
-
The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The dag combiner
//===---------------------------------------------------------------------===//
-LSR should be turned on for the X86 backend and tuned to take advantage of its
-addressing modes.
-
-//===---------------------------------------------------------------------===//
-
-When compiled with unsafemath enabled, "main" should enable SSE DAZ mode and
-other fast SSE modes.
-
-//===---------------------------------------------------------------------===//
-
-Think about doing i64 math in SSE regs.
-
-//===---------------------------------------------------------------------===//
-
-The DAG Isel doesn't fold the loads into the adds in this testcase. The
-pattern selector does. This is because the chain value of the load gets
-selected first, and the loads aren't checking to see if they are only used by
-and add.
-
-.ll:
-
-int %test(int* %x, int* %y, int* %z) {
- %X = load int* %x
- %Y = load int* %y
- %Z = load int* %z
- %a = add int %X, %Y
- %b = add int %a, %Z
- ret int %b
-}
-
-dag isel:
-
-_test:
- movl 4(%esp), %eax
- movl (%eax), %eax
- movl 8(%esp), %ecx
- movl (%ecx), %ecx
- addl %ecx, %eax
- movl 12(%esp), %ecx
- movl (%ecx), %ecx
- addl %ecx, %eax
- ret
-
-pattern isel:
-
-_test:
- movl 12(%esp), %ecx
- movl 4(%esp), %edx
- movl 8(%esp), %eax
- movl (%eax), %eax
- addl (%edx), %eax
- addl (%ecx), %eax
- ret
-
-This is bad for register pressure, though the dag isel is producing a
-better schedule. :)
-
-//===---------------------------------------------------------------------===//
-
-This testcase should have no SSE instructions in it, and only one load from
-a constant pool:
-
-double %test3(bool %B) {
- %C = select bool %B, double 123.412, double 523.01123123
- ret double %C
-}
-
-Currently, the select is being lowered, which prevents the dag combiner from
-turning 'select (load CPI1), (load CPI2)' -> 'load (select CPI1, CPI2)'
-
-The pattern isel got this one right.
-
-//===---------------------------------------------------------------------===//
-
-We need to lower switch statements to tablejumps when appropriate instead of
-always into binary branch trees.
-
-//===---------------------------------------------------------------------===//
-
-SSE doesn't have [mem] op= reg instructions. If we have an SSE instruction
-like this:
-
- X += y
-
-and the register allocator decides to spill X, it is cheaper to emit this as:
-
-Y += [xslot]
-store Y -> [xslot]
+How about intrinsics? An example is:
+ *res = _mm_mulhi_epu16(*A, _mm_mul_epu32(*B, *C));
-than as:
-
-tmp = [xslot]
-tmp += y
-store tmp -> [xslot]
-
-..and this uses one fewer register (so this should be done at load folding
-time, not at spiller time). *Note* however that this can only be done
-if Y is dead. Here's a testcase:
-
-%.str_3 = external global [15 x sbyte] ; <[15 x sbyte]*> [#uses=0]
-implementation ; Functions:
-declare void %printf(int, ...)
-void %main() {
-build_tree.exit:
- br label %no_exit.i7
-no_exit.i7: ; preds = %no_exit.i7, %build_tree.exit
- %tmp.0.1.0.i9 = phi double [ 0.000000e+00, %build_tree.exit ], [ %tmp.34.i18, %no_exit.i7 ] ; <double> [#uses=1]
- %tmp.0.0.0.i10 = phi double [ 0.000000e+00, %build_tree.exit ], [ %tmp.28.i16, %no_exit.i7 ] ; <double> [#uses=1]
- %tmp.28.i16 = add double %tmp.0.0.0.i10, 0.000000e+00
- %tmp.34.i18 = add double %tmp.0.1.0.i9, 0.000000e+00
- br bool false, label %Compute_Tree.exit23, label %no_exit.i7
-Compute_Tree.exit23: ; preds = %no_exit.i7
- tail call void (int, ...)* %printf( int 0 )
- store double %tmp.34.i18, double* null
- ret void
-}
-
-We currently emit:
-
-.BBmain_1:
- xorpd %XMM1, %XMM1
- addsd %XMM0, %XMM1
-*** movsd %XMM2, QWORD PTR [%ESP + 8]
-*** addsd %XMM2, %XMM1
-*** movsd QWORD PTR [%ESP + 8], %XMM2
- jmp .BBmain_1 # no_exit.i7
-
-This is a bugpoint reduced testcase, which is why the testcase doesn't make
-much sense (e.g. its an infinite loop). :)
-
-//===---------------------------------------------------------------------===//
+compiles to
+ pmuludq (%eax), %xmm0
+ movl 8(%esp), %eax
+ movdqa (%eax), %xmm1
+ pmulhuw %xmm0, %xmm1
-None of the FPStack instructions are handled in
-X86RegisterInfo::foldMemoryOperand, which prevents the spiller from
-folding spill code into the instructions.
+The transformation probably requires an X86-specific pass or a target-specific
+DAG combiner hook.
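+
+If part of the problem is the unfolded load of *A, note that pmulhuw is
+commutative, so (as a hedged sketch) that load could fold directly:
+
+        pmuludq (%eax), %xmm0
+        movl    8(%esp), %eax
+        pmulhuw (%eax), %xmm0
+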
//===---------------------------------------------------------------------===//
_test:
movl 8(%esp), %ebx
- xor %eax, %eax
+ xor %eax, %eax
cmpl %ebx, 4(%esp)
setl %al
ret
//===---------------------------------------------------------------------===//
-We should generate 'test' instead of 'cmp' in various cases, e.g.:
-
-bool %test(int %X) {
- %Y = shl int %X, ubyte 1
- %C = seteq int %Y, 0
- ret bool %C
-}
-bool %test(int %X) {
- %Y = and int %X, 8
- %C = seteq int %Y, 0
- ret bool %C
-}
-
-This may just be a matter of using 'test' to write bigger patterns for X86cmp.
-
-//===---------------------------------------------------------------------===//
-
-SSE should implement 'select_cc' using 'emulated conditional moves' that use
-pcmp/pand/pandn/por to do a selection instead of a conditional branch:
-
-double %X(double %Y, double %Z, double %A, double %B) {
- %C = setlt double %A, %B
- %z = add double %Z, 0.0 ;; select operand is not a load
- %D = select bool %C, double %Y, double %z
- ret double %D
-}
-
-We currently emit:
-
-_X:
- subl $12, %esp
- xorpd %xmm0, %xmm0
- addsd 24(%esp), %xmm0
- movsd 32(%esp), %xmm1
- movsd 16(%esp), %xmm2
- ucomisd 40(%esp), %xmm1
- jb LBB_X_2
-LBB_X_1:
- movsd %xmm0, %xmm2
-LBB_X_2:
- movsd %xmm2, (%esp)
- fldl (%esp)
- addl $12, %esp
- ret
-
-//===---------------------------------------------------------------------===//
-
We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important. e.g., for:
//===---------------------------------------------------------------------===//
-It's not clear whether we should use pxor or xorps / xorpd to clear XMM
-registers. The choice may depend on subtarget information. We should do some
-more experiments on different x86 machines.
-
-//===---------------------------------------------------------------------===//
-
Evaluate what the best way to codegen sdiv X, (2^C) is. For X/8, we currently
get this:
//===---------------------------------------------------------------------===//
-Currently the x86 codegen isn't very good at mixing SSE and FPStack
-code:
-
-unsigned int foo(double x) { return x; }
-
-foo:
- subl $20, %esp
- movsd 24(%esp), %xmm0
- movsd %xmm0, 8(%esp)
- fldl 8(%esp)
- fisttpll (%esp)
- movl (%esp), %eax
- addl $20, %esp
- ret
-
-This will be solved when we go to a dynamic programming based isel.
-
-//===---------------------------------------------------------------------===//
-
Should generate min/max for stuff like:
void minf(float a, float b, float *X) {
//===---------------------------------------------------------------------===//
-Investigate whether it is better to codegen the following
-
- %tmp.1 = mul int %x, 9
-to
-
- movl 4(%esp), %eax
- leal (%eax,%eax,8), %eax
-
-as opposed to what llc is currently generating:
-
- imull $9, 4(%esp), %eax
-
-Currently the load folding imull has a higher complexity than the LEA32 pattern.
-
-//===---------------------------------------------------------------------===//
-
We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and rep/movsl
We should leave these as libcalls for everything over a much lower threshold,
since libc is hand tuned for medium and large mem ops (avoiding RFO for large
//===---------------------------------------------------------------------===//
-Lower memcpy / memset to a series of SSE 128 bit move instructions when it's
-feasible.
-
-//===---------------------------------------------------------------------===//
-
-Teach the coalescer to commute 2-addr instructions, allowing us to eliminate
-the reg-reg copy in this example:
-
-float foo(int *x, float *y, unsigned c) {
- float res = 0.0;
- unsigned i;
- for (i = 0; i < c; i++) {
- float xx = (float)x[i];
- xx = xx * y[i];
- xx += res;
- res = xx;
- }
- return res;
-}
-
-LBB_foo_3: # no_exit
- cvtsi2ss %XMM0, DWORD PTR [%EDX + 4*%ESI]
- mulss %XMM0, DWORD PTR [%EAX + 4*%ESI]
- addss %XMM0, %XMM1
- inc %ESI
- cmp %ESI, %ECX
-**** movaps %XMM1, %XMM0
- jb LBB_foo_3 # no_exit
-
-//===---------------------------------------------------------------------===//
-
-Codegen:
- if (copysign(1.0, x) == copysign(1.0, y))
-into:
- if (x^y & mask)
-when using SSE.
-
-//===---------------------------------------------------------------------===//
-
Optimize this into something reasonable:
x * copysign(1.0, y) * copysign(1.0, z)
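
A hedged sketch of what "reasonable" could mean: the two copysign factors only
flip x's sign when y and z have differing signs, so the whole product reduces
to xor'ing the sign-bit difference into x (illustrative C; mulsigns is a
hypothetical name):

#include <stdint.h>
#include <string.h>

double mulsigns(double x, double y, double z) {
  uint64_t xb, yb, zb;
  memcpy(&xb, &x, sizeof xb);
  memcpy(&yb, &y, sizeof yb);
  memcpy(&zb, &z, sizeof zb);
  xb ^= (yb ^ zb) & 0x8000000000000000ULL;  /* flip sign iff signs of y,z differ */
  memcpy(&x, &xb, sizeof x);
  return x;
}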
//===---------------------------------------------------------------------===//
-Use movhps to update upper 64-bits of a v4sf value. Also movlps on lower half
-of a v4sf value.
-
-//===---------------------------------------------------------------------===//
-
-Better codegen for vector_shuffles like this { x, 0, 0, 0 } or { x, 0, x, 0}.
-Perhaps use pxor / xorp* to clear a XMM register first?
-
-//===---------------------------------------------------------------------===//
-
Adding to the list of cmp / test poor codegen issues:
int test(__m128 *A, __m128 *B) {
//===---------------------------------------------------------------------===//
-How to decide when to use the "floating point version" of logical ops? Here are
-some code fragments:
+We generate significantly worse code for this than GCC:
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
+http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701
- movaps LCPI5_5, %xmm2
- divps %xmm1, %xmm2
- mulps %xmm2, %xmm3
- mulps 8656(%ecx), %xmm3
- addps 8672(%ecx), %xmm3
- andps LCPI5_6, %xmm2
- andps LCPI5_1, %xmm3
- por %xmm2, %xmm3
- movdqa %xmm3, (%edi)
+There is also one case where we do worse on PPC.
- movaps LCPI5_5, %xmm1
- divps %xmm0, %xmm1
- mulps %xmm1, %xmm3
- mulps 8656(%ecx), %xmm3
- addps 8672(%ecx), %xmm3
- andps LCPI5_6, %xmm1
- andps LCPI5_1, %xmm3
- orps %xmm1, %xmm3
- movaps %xmm3, 112(%esp)
- movaps %xmm3, (%ebx)
+//===---------------------------------------------------------------------===//
-Due to some minor source change, the later case ended up using orps and movaps
-instead of por and movdqa. Does it matter?
+When it is shorter, we should use things like:
+movzwl %ax, %eax
+instead of:
+andl $65535, %eax
+
+The former can also be used when the two-addressy nature of the 'and' would
+require a copy to be inserted (in X86InstrInfo::convertToThreeAddress).
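+
+For example (a hedged sketch), when the 'and' input is still live afterwards,
+the two-address form needs a copy first:
+
+        movl    %esi, %eax
+        andl    $65535, %eax
+
+while the zero-extending move reads the sub-register directly:
+
+        movzwl  %si, %eax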
//===---------------------------------------------------------------------===//
-Use movddup to splat a v2f64 directly from a memory source. e.g.
+Bad codegen:
+
+char foo(int x) { return x; }
+
+_foo:
+ movl 4(%esp), %eax
+ shll $24, %eax
+ sarl $24, %eax
+ ret
+
+SIGN_EXTEND_INREG can be implemented as (sext (trunc)) to take advantage of
+sub-registers.
+
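+With sub-register support, a hedged sketch of the desired output is:
+
+_foo:
+        movsbl  4(%esp), %eax
+        ret
+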
+//===---------------------------------------------------------------------===//
-#include <emmintrin.h>
+Consider this:
-void test(__m128d *r, double A) {
- *r = _mm_set1_pd(A);
+typedef struct pair { float A, B; } pair;
+void pairtest(pair P, float *FP) {
+ *FP = P.A+P.B;
}
-llc:
+We currently generate this code with llvmgcc4:
-_test:
- movsd 8(%esp), %xmm0
- unpcklpd %xmm0, %xmm0
- movl 4(%esp), %eax
- movapd %xmm0, (%eax)
- ret
+_pairtest:
+ subl $12, %esp
+ movl 20(%esp), %eax
+ movl %eax, 4(%esp)
+ movl 16(%esp), %eax
+ movl %eax, (%esp)
+ movss (%esp), %xmm0
+ addss 4(%esp), %xmm0
+ movl 24(%esp), %eax
+ movss %xmm0, (%eax)
+ addl $12, %esp
+ ret
-icc:
+We should be able to generate:
+_pairtest:
+ movss 4(%esp), %xmm0
+ movl 12(%esp), %eax
+ addss 8(%esp), %xmm0
+ movss %xmm0, (%eax)
+ ret
-_test:
- movl 4(%esp), %eax
- movddup 8(%esp), %xmm0
- movapd %xmm0, (%eax)
+The issue is that llvmgcc4 is forcing the struct to memory, then passing it as
+integer chunks. It does this so that structs like {short,short} are passed in
+a single 32-bit integer stack slot. We should handle the safe cases above much
+better, while still handling the hard cases.
+
+//===---------------------------------------------------------------------===//
+
+Another instruction selector deficiency:
+
+void %bar() {
+ %tmp = load int (int)** %foo
+ %tmp = tail call int %tmp( int 3 )
+ ret void
+}
+
+_bar:
+ subl $12, %esp
+ movl L_foo$non_lazy_ptr, %eax
+ movl (%eax), %eax
+ call *%eax
+ addl $12, %esp
ret
+The current isel scheme will not allow the load to be folded into the call since
+the load's chain result is read by the callseq_start.
+
//===---------------------------------------------------------------------===//
-A Mac OS X IA-32 specific ABI bug wrt returning value > 8 bytes:
-http://llvm.org/bugs/show_bug.cgi?id=729
+Don't forget to find a way to squash noop truncates in the JIT environment.
+
+//===---------------------------------------------------------------------===//
+
+Implement anyext in the same manner as truncate, so that it can likewise be
+eliminated.
+
+//===---------------------------------------------------------------------===//
+
+How about implementing truncate / anyext as a property of machine instruction
+operand? i.e. Print as 32-bit super-class register / 16-bit sub-class register.
+Do this for the cases where a truncate / anyext is guaranteed to be eliminated.
+For IA32 that is truncate from 32 to 16 and anyext from 16 to 32.
+
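+For instance, a truncate from i32 to i16 would emit no instruction at all;
+later uses would simply print the 16-bit sub-class register (a hedged,
+illustrative sketch):
+
+        movl    4(%esp), %eax       # the i32 value lives in %eax
+        movw    %ax, (%ecx)         # the "truncate" is just printing %ax
+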
+//===---------------------------------------------------------------------===//
+
+For this:
+
+int test(int a)
+{
+ return a * 3;
+}
+
+We currently emit:
+ imull $3, 4(%esp), %eax
+
+Perhaps this is what we really should generate? Is imull three or four
+cycles? Note: ICC generates this:
+ movl 4(%esp), %eax
+ leal (%eax,%eax,2), %eax
+
+The current instruction priority is based on pattern complexity. The former is
+more "complex" because it folds a load, so the latter will not be emitted.
+
+Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
+should always try to match LEA first since the LEA matching code does some
+estimate to determine whether the match is profitable.
+
+However, if we care more about code size, then imull is better. It's two bytes
+shorter than movl + leal.
+
+//===---------------------------------------------------------------------===//
+
+Implement CTTZ, CTLZ with bsf and bsr.
+
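+A hedged sketch for CTLZ, valid when the input is known non-zero (bsr leaves
+its destination undefined for a zero input):
+
+_clz:
+        bsrl    4(%esp), %eax       # index of the highest set bit (0..31)
+        xorl    $31, %eax           # 31 - index = number of leading zeros
+        ret
+
+CTTZ maps to bsf the same way, since the index of the lowest set bit is
+exactly the trailing-zero count.
+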
+//===---------------------------------------------------------------------===//
+
+It appears GCC places string data with linkonce linkage in
+.section __TEXT,__const_coal,coalesced instead of
+.section __DATA,__const_coal,coalesced.
+Take a look at darwin.h; there are other Darwin assembler directives that we
+do not make use of.
+
+//===---------------------------------------------------------------------===//
+
+We should handle __attribute__ ((__visibility__ ("hidden"))).
+
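+For example (hedged; the exact directives are from memory):
+
+int foo __attribute__ ((__visibility__ ("hidden"))) = 1;
+
+should emit ".hidden foo" on ELF targets and ".private_extern _foo" on Darwin.
+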
+//===---------------------------------------------------------------------===//
+
+int %foo(int* %a, int %t) {
+entry:
+ br label %cond_true
+
+cond_true: ; preds = %cond_true, %entry
+ %x.0.0 = phi int [ 0, %entry ], [ %tmp9, %cond_true ]
+ %t_addr.0.0 = phi int [ %t, %entry ], [ %tmp7, %cond_true ]
+ %tmp2 = getelementptr int* %a, int %x.0.0
+ %tmp3 = load int* %tmp2 ; <int> [#uses=1]
+ %tmp5 = add int %t_addr.0.0, %x.0.0 ; <int> [#uses=1]
+ %tmp7 = add int %tmp5, %tmp3 ; <int> [#uses=2]
+ %tmp9 = add int %x.0.0, 1 ; <int> [#uses=2]
+ %tmp = setgt int %tmp9, 39 ; <bool> [#uses=1]
+ br bool %tmp, label %bb12, label %cond_true
+
+bb12: ; preds = %cond_true
+ ret int %tmp7
+}
+
+is pessimized by -loop-reduce and -indvars
+
+//===---------------------------------------------------------------------===//
+
+u32 to float conversion improvement:
+
+float uint32_2_float( unsigned u ) {
+ float fl = (int) (u & 0xffff);
+ float fh = (int) (u >> 16);
+ fh *= 0x1.0p16f;
+ return fh + fl;
+}
+
+00000000 subl $0x04,%esp
+00000003 movl 0x08(%esp,1),%eax
+00000007 movl %eax,%ecx
+00000009 shrl $0x10,%ecx
+0000000c cvtsi2ss %ecx,%xmm0
+00000010 andl $0x0000ffff,%eax
+00000015 cvtsi2ss %eax,%xmm1
+00000019 mulss 0x00000078,%xmm0
+00000021 addss %xmm1,%xmm0
+00000025 movss %xmm0,(%esp,1)
+0000002a flds (%esp,1)
+0000002d addl $0x04,%esp
+00000030 ret
+
+//===---------------------------------------------------------------------===//
+
+When using the fastcc ABI, align stack slots of double arguments on an 8-byte
+boundary to improve performance.
+
+//===---------------------------------------------------------------------===//
+
+Codegen:
+
+int f(int a, int b) {
+ if (a == 4 || a == 6)
+ b++;
+ return b;
+}
+
+as (note that (a|2) == 6 exactly when a is 4 or 6):
+
+or eax, 2
+cmp eax, 6
+jz label
+
+//===---------------------------------------------------------------------===//
+
+GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
+simplifications for integer "x cmp y ? a : b". For example, instead of:
+
+int G;
+void f(int X, int Y) {
+ G = X < 0 ? 14 : 13;
+}
+
+compiling to:
+
+_f:
+ movl $14, %eax
+ movl $13, %ecx
+ movl 4(%esp), %edx
+ testl %edx, %edx
+ cmovl %eax, %ecx
+ movl %ecx, _G
+ ret
+
+it could be:
+_f:
+ movl 4(%esp), %eax
+ sarl $31, %eax
+ notl %eax
+ addl $14, %eax
+ movl %eax, _G
+ ret
+
+etc.
+
+//===---------------------------------------------------------------------===//
+
+Currently we don't have elimination of redundant stack manipulations. Consider
+the code:
+
+int %main() {
+entry:
+ call fastcc void %test1( )
+ call fastcc void %test2( sbyte* cast (void ()* %test1 to sbyte*) )
+ ret int 0
+}
+
+declare fastcc void %test1()
+
+declare fastcc void %test2(sbyte*)
+
+
+This currently compiles to:
+
+ subl $16, %esp
+ call _test5
+ addl $12, %esp
+ subl $16, %esp
+ movl $_test5, (%esp)
+ call _test6
+ addl $12, %esp
+
+The addl/subl pair is really unneeded here.
+
+//===---------------------------------------------------------------------===//
+
+We generate really bad code in some cases due to lowering SETCC/SELECT at
+legalize time, which prevents the post-legalize dag combine pass from
+understanding the code. As a silly example, this prevents us from folding
+stuff like this:
+
+bool %test(ulong %x) {
+ %tmp = setlt ulong %x, 4294967296
+ ret bool %tmp
+}
+
+into x.h == 0
+
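+With that fold, a hedged sketch of the desired code (assuming the i64 argument
+starts at 4(%esp), so its high half is at 8(%esp)) is just:
+
+_test:
+        cmpl    $0, 8(%esp)         # high 32 bits of x
+        sete    %al
+        ret
+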
+//===---------------------------------------------------------------------===//
+
+We currently compile sign_extend_inreg into two shifts:
+
+long foo(long X) {
+ return (long)(signed char)X;
+}
+
+becomes:
+
+_foo:
+ movl 4(%esp), %eax
+ shll $24, %eax
+ sarl $24, %eax
+ ret
+
+This could be:
+
+_foo:
+ movsbl 4(%esp),%eax
+ ret
+
+//===---------------------------------------------------------------------===//
+
+Consider the expansion of:
+
+uint %test3(uint %X) {
+ %tmp1 = rem uint %X, 255
+ ret uint %tmp1
+}
+
+Currently it compiles to:
+
+...
+ movl $2155905153, %ecx
+ movl 8(%esp), %esi
+ movl %esi, %eax
+ mull %ecx
+...
+
+This could be "reassociated" into:
+
+ movl $2155905153, %eax
+ movl 8(%esp), %ecx
+ mull %ecx
+
+to avoid the copy. In fact, the existing two-address stuff would do this
+except that mul isn't a commutative 2-addr instruction. I guess this has
+to be done at isel time based on the number of uses of the mul?
+