[X86] Prefer blendps over insertps codegen for one special case
author Sanjay Patel <spatel@rotateright.com>
Fri, 20 Mar 2015 21:19:52 +0000 (21:19 +0000)
committer Sanjay Patel <spatel@rotateright.com>
Fri, 20 Mar 2015 21:19:52 +0000 (21:19 +0000)
commit 39110ecd35f9ed643bf335b94789871b297bf03a
tree 0bd7fd0f1f38d5c987c12be1c4b0358edc01e0df
parent 5155a78d187b8bb9311be87aaf3f8f7046d7ca21

With this patch, for one exact case (inserting the low 32-bit element of
one vector into the low element of another), we'll generate:

  blendps %xmm0, %xmm1, $1

instead of:

  insertps %xmm0, %xmm1, $0

If there's a memory operand available for load folding and we're
optimizing for size, we'll still generate the insertps.
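
The selection rule described above can be sketched as a small predicate. This is a hypothetical illustration, not the code in X86ISelLowering.cpp: LowInsertOp, chooseLowInsertOp, optForSize and canFoldLoad are assumed names. The likely reason for the exception is that insertps accepts a 32-bit memory operand while blendps does not, so folding a scalar load into insertps saves an instruction.

```cpp
// Which instruction to emit for the low-element insert (illustrative enum).
enum class LowInsertOp { BlendPS, InsertPS };

// Hypothetical helper: prefer the higher-throughput blendps unless we are
// optimizing for size AND the inserted scalar could be folded as a load.
constexpr LowInsertOp chooseLowInsertOp(bool optForSize, bool canFoldLoad) {
  return (optForSize && canFoldLoad) ? LowInsertOp::InsertPS
                                     : LowInsertOp::BlendPS;
}
```

Under this sketch, only the combination of both conditions keeps insertps; either condition alone still yields blendps.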

The detailed performance data motivating this change may be found in D7866;
in summary, blendps has 2-3x the throughput of insertps on widely used chips.

Differential Revision: http://reviews.llvm.org/D8332

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@232850 91177308-0d34-0410-b5e6-96231b3b80d8
lib/Target/X86/X86ISelLowering.cpp
test/CodeGen/X86/sse41.ll