Programming Languages Research Group: Git

author	Chandler Carruth <chandlerc@gmail.com>
	Sat, 30 May 2015 10:35:03 +0000 (10:35 +0000)
committer	Chandler Carruth <chandlerc@gmail.com>
	Sat, 30 May 2015 10:35:03 +0000 (10:35 +0000)
commit	fa68750e54ebabefcafece3b0a26d88dbe67280e
tree	f93e92ee7321547990662ee2a3801b75bfbf746a
parent	215bfbf9eabd2f4842c13d1800a748070b5e15ee

[x86] Unify the horizontal adding used for popcount lowering, taking the
best approach of each.

For vNi16, we use the SHL + ADD + SRL pattern, which seems easily the best.
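
To make the vNi16 step concrete, here is a minimal scalar sketch of SHL + ADD
+ SRL on a single 16-bit lane. It is only a model: the real lowering operates
on whole vectors inside X86ISelLowering.cpp, and the function name
hsum_bytes_i16 is hypothetical. It assumes each byte of the lane already holds
the popcount of that byte (0..8), so the byte-wise add cannot carry.

```cpp
#include <cstdint>

// Scalar model of one i16 lane (illustrative only, not the LLVM code).
// Assumption: each byte of `lane` already holds a per-byte popcount in 0..8.
constexpr std::uint16_t hsum_bytes_i16(std::uint16_t lane) {
    // SHL 8: move the low byte's count up into the high byte position.
    // ADD:   the high byte now holds lo + hi (at most 16, so no carry out).
    // SRL 8: shift the sum back down and clear the stale low byte.
    return static_cast<std::uint16_t>(
        static_cast<std::uint16_t>(lane + static_cast<std::uint16_t>(lane << 8)) >> 8);
}
```

For example, a lane of 0x0305 (counts 3 and 5) yields 8.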

For vNi32, we use the PUNPCK + PSADBW + PACKUSWB pattern. In some cases
there is a huge improvement with this in IACA's estimated throughput --
over 2x higher throughput!!!! -- but the measurements are too good to be
true. In one narrow case, the SHL + ADD + SHL + ADD + SRL pattern looks
slightly faster, but I'm not sure I believe any of the measurements at
this point. Both are the exact same uops though. Hard to be confident of
anything past that.
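
As a rough illustration of the two i32 candidates, the following scalar sketch
models both on a single 32-bit lane and shows that they compute the same sum.
The names psadbw_zero, hsum_i32_psadbw and hsum_i32_shift are hypothetical,
and the model says nothing about throughput; it only assumes that each byte
of the lane holds a per-byte popcount in 0..8.

```cpp
#include <cstdint>

// Scalar stand-in for PSADBW against an all-zero operand on one 64-bit lane:
// the sum of absolute differences with zero is just the sum of the 8 bytes.
constexpr std::uint64_t psadbw_zero(std::uint64_t lane) {
    std::uint64_t sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += (lane >> (8 * i)) & 0xFF;
    return sum;
}

// Model of PUNPCK + PSADBW + PACKUSWB for one i32 lane.
constexpr std::uint32_t hsum_i32_psadbw(std::uint32_t lane) {
    std::uint64_t widened = lane;              // PUNPCK with zero: i32 -> 64-bit lane
    std::uint64_t sum = psadbw_zero(widened);  // PSADBW: horizontal byte sum
    return static_cast<std::uint32_t>(sum);    // PACKUSWB: sum <= 32, so no saturation
}

// Model of SHL + ADD + SHL + ADD + SRL for one i32 lane.
constexpr std::uint32_t hsum_i32_shift(std::uint32_t lane) {
    lane += lane << 16;  // top two bytes now hold b3+b1 and b2+b0
    lane += lane << 8;   // top byte now holds b3+b2+b1+b0
    return lane >> 24;   // keep only the top byte
}
```

With byte counts 1, 2, 3, 4 packed as 0x01020304, both functions return 10.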

If anyone wants to collect very detailed (Agner-level) timings with the
result of this patch, or with the i32 case replaced with SHL + ADD + SHL
+ ADD + SRL, I'd be very interested. Note that you'll need to test it on
both Ivybridge and Haswell, with each of SSE3, SSSE3, and AVX selected,
as I saw unique behavior in each of these buckets with IACA, all of
which should be checked against measured performance.

But this patch is still a useful improvement: it drops duplicate work
and gets the much nicer PSADBW lowering for v2i64.
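
The v2i64 case is the simplest one to picture, since PSADBW against zero
already produces one sum per 64-bit lane and no unpack or pack step is needed.
A minimal scalar sketch, with the hypothetical names V2i64, sum_bytes and
hsum_v2i64, and again assuming each byte holds a per-byte popcount:

```cpp
#include <cstdint>

// Two 64-bit lanes standing in for a v2i64 vector (illustrative only).
struct V2i64 {
    std::uint64_t lo;
    std::uint64_t hi;
};

// Sum of the 8 bytes of one lane, i.e. what PSADBW against zero yields per lane.
constexpr std::uint64_t sum_bytes(std::uint64_t lane) {
    std::uint64_t sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += (lane >> (8 * i)) & 0xFF;
    return sum;
}

// Model of the PSADBW lowering: each lane is reduced independently.
constexpr V2i64 hsum_v2i64(V2i64 v) {
    return V2i64{sum_bytes(v.lo), sum_bytes(v.hi)};
}
```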

I'd still like to rephrase this in terms of generic horizontal sum. It's
a bit lame to have a special case of that just for popcount.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238652 91177308-0d34-0410-b5e6-96231b3b80d8

lib/Target/X86/X86ISelLowering.cpp
test/CodeGen/X86/vector-popcnt-128.ll
test/CodeGen/X86/vector-popcnt-256.ll