author    Chandler Carruth <chandlerc@gmail.com>
          Fri, 21 Nov 2014 13:56:05 +0000 (13:56 +0000)
committer Chandler Carruth <chandlerc@gmail.com>
          Fri, 21 Nov 2014 13:56:05 +0000 (13:56 +0000)
commit bd357588a106dc7c828c57ad8048e82003d638de
tree   72dd176dd5ff772e1674d62f57e267551fd7b5d7
parent a5f45765105c19c2342794a7b13d52c74c9aab47
[x86] Teach the x86 vector shuffle lowering to detect mergable 128-bit
lanes.

By special-casing these we can often either significantly reduce the total
number of shuffles or reduce the number of (high-latency on Haswell) AVX2
shuffles that potentially cross 128-bit lanes. Even when these shuffles
don't actually cross lanes, they pay the much higher latency needed to
support lane crossing. Doing two of them plus a blend is worse than doing
a single insert across the 128-bit lanes to blend and then doing a single
interleaved shuffle.
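As a rough illustration of the idea (a hypothetical sketch, not the actual
X86ISelLowering code): a 256-bit shuffle can use this strategy when each
128-bit lane of the result selects elements by the same in-lane pattern, so
one lane merge followed by one repeated in-lane shuffle suffices. The
function name, mask encoding, and the simplification of ignoring which
source lane each element comes from are all assumptions for this sketch.

```cpp
#include <array>

// Hypothetical sketch: for a v8i32-style shuffle mask (8 elements, two
// 128-bit lanes of 4 elements each; negative = undef), check whether both
// lanes use the same position-within-lane pattern. If so, the shuffle can
// be lowered as a single 128-bit lane insert/blend plus one interleaved
// in-lane shuffle, instead of two lane-crossing shuffles and a blend.
// (Real lowering would also have to track which source lane each element
// reads from; this sketch deliberately ignores that.)
bool isRepeatedLaneMask(const std::array<int, 8> &Mask) {
  constexpr int LaneElts = 4;           // 128-bit lane / 32-bit elements
  int Rep[LaneElts] = {-1, -1, -1, -1}; // -1 = not yet constrained
  for (int i = 0; i < 8; ++i) {
    int M = Mask[i];
    if (M < 0)
      continue;                         // undef matches any pattern
    int InLane = M % LaneElts;          // position within its source lane
    if (Rep[i % LaneElts] < 0)
      Rep[i % LaneElts] = InLane;       // first constraint for this slot
    else if (Rep[i % LaneElts] != InLane)
      return false;                     // lanes disagree: not mergeable
  }
  return true;
}
```

For example, a two-input interleave mask like {0, 8, 1, 9, 4, 12, 5, 13}
repeats the in-lane pattern {0, 0, 1, 1} in both lanes, while a fully
lane-crossing mask like {0, 4, 1, 5, 2, 6, 3, 7} does not.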

While this seems like a narrow case, it kept cropping up on me and the
difference is *huge* as you can see in many of the test cases. I first
hit this trying to perfectly fix the interleaving shuffle patterns used
by Halide for AVX2.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222533 91177308-0d34-0410-b5e6-96231b3b80d8
lib/Target/X86/X86ISelLowering.cpp
test/CodeGen/X86/avx-vperm2x128.ll
test/CodeGen/X86/vector-shuffle-256-v16.ll
test/CodeGen/X86/vector-shuffle-256-v32.ll
test/CodeGen/X86/vector-shuffle-256-v4.ll
test/CodeGen/X86/vector-shuffle-256-v8.ll
test/CodeGen/X86/vector-shuffle-512-v8.ll
test/CodeGen/X86/vector-shuffle-combining.ll