Jingyue Wu [Fri, 29 Aug 2014 15:30:20 +0000 (15:30 +0000)]
[NVPTX] Make the alignment an explicit argument to ldu/ldg
Summary:
Instead of specifying the alignment as metadata which may be destroyed by
transformation passes, make the alignment the second argument to ldu/ldg
intrinsic calls.
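For example (a rough sketch; the exact intrinsic mangling here is illustrative), a 4-byte-aligned ldu of an i32 would now be written as
  %v = tail call i32 @llvm.nvvm.ldu.global.i.i32.p1i32(i32 addrspace(1)* %ptr, i32 4)
with the alignment carried in the second argument rather than in metadata attached to the call.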
Test Plan:
ldu-ldg.ll
ldu-i8.ll
ldu-reg-plus-offset.ll
Tim Northover [Fri, 29 Aug 2014 13:05:18 +0000 (13:05 +0000)]
AArch64: skip select/setcc combine in complex case.
In an llvm-stress-generated test, we were trying to create a v0iN type and asserting when that failed. This case could probably be handled by the function, but not without added complexity, and the situation it arises in is sufficiently odd that there's probably no benefit anyway.
[AArch64] FPLoadBalancing: move ownership of the chain to its current accumulator register
and forget about the previously used accumulator.
Coming up with a simple testcase is not easy, as this highly depends on
what the register allocator is doing: this issue showed up while working
with the PBQP allocator, which produced a different allocation scheme.
A testcase would need to come up with a chain starting in D[0-7], then moving to D[8-15], followed by a call to a function whose regmask clobbers the starting accumulator in D[0-7], then another use of the chain.
Fixed some formatting, added some invariant checks while there.
Job Noorman [Fri, 29 Aug 2014 08:23:53 +0000 (08:23 +0000)]
Do not assume the value passed to memset is an i32.
The code in SelectionDAG::getMemset for some reason assumes the value passed to
memset is an i32. This breaks the generated code for targets that only have
registers smaller than 32 bits because the value might get split into multiple
registers by the calling convention. See the test for the MSP430 target included
in the patch for an example.
This patch ensures that nothing is assumed about the type of the value. Instead,
the type is taken from the selected overload of the llvm.memset intrinsic.
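For reference, the intrinsic is overloaded on the length type, so the type can be read directly off the selected declaration. An illustrative 16-bit overload (using the signature of that era, with explicit align and volatile arguments; not copied from the patch) looks like:
  declare void @llvm.memset.p0i8.i16(i8* nocapture, i8, i16, i32, i1)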
Juergen Ributzka [Fri, 29 Aug 2014 00:19:21 +0000 (00:19 +0000)]
[FastISel][AArch64] Don't fold instructions that are not in the same basic block.
This fix checks first if the instruction to be folded (e.g. sign-/zero-extend,
or shift) is in the same machine basic block as the instruction we are folding
into.
Not doing so can result in incorrect code, because the value might not be live out of the basic block where the value is defined.
Reid Kleckner [Thu, 28 Aug 2014 22:42:00 +0000 (22:42 +0000)]
Don't promote byval pointer arguments when padding matters
Don't promote byval pointer arguments when their size in bits is not equal to their alloc size in bits. This can happen for x86_fp80, where the size in bits is 80 but the alloc size in bits is 128.
Promoting these types can break passing unions of x86_fp80s and other
types.
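A rough sketch of the kind of argument affected (illustrative, not taken from the patch): on x86-64 the x86_fp80 member occupies 80 bits while the union's storage is 128 bits, and the padding bits may hold another member's data:
  %union.ld = type { x86_fp80 }
  define void @use(%union.ld* byval %u) {
    ret void
  }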
Jim Grosbach [Thu, 28 Aug 2014 22:08:28 +0000 (22:08 +0000)]
AArch64: More correctly constrain target vector extend lowering.
The AArch64 target lowering for [zs]ext of vectors is set up to handle
simple-type inputs and expects the generic SDag path to do something reasonable
with anything that's not a simple type. The code, however, was only
checking that the result type was a simple type and assuming that
implied that the source type would also be a simple type. That's not a
valid assumption, as operations like "zext <1 x i1> %0 to <1 x i32>"
demonstrate. The fix is to simply explicitly validate the source type
as well as the result type.
Rafael Espindola [Thu, 28 Aug 2014 20:13:31 +0000 (20:13 +0000)]
On MachO, don't put non-private constants in mergeable sections.
On MachO, putting a symbol that doesn't start with a 'L' or 'l' in one of the
__TEXT,__literal* sections prevents the linker from merging the contents of the
section.
Since private GVs are the ones that get mangled to start with 'L' or 'l', we now
only put those on the __TEXT,__literal* sections.
Sanjay Patel [Thu, 28 Aug 2014 18:59:22 +0000 (18:59 +0000)]
Fix a logic bug in x86 vector codegen: sext (zext (x) ) != sext (x) (PR20472).
Remove a block of code from LowerSIGN_EXTEND_INREG() that was added with:
http://llvm.org/viewvc/llvm-project?view=revision&revision=177421
And caused:
http://llvm.org/bugs/show_bug.cgi?id=20472 (more analysis here)
http://llvm.org/bugs/show_bug.cgi?id=18054
The testcases confirm that we (1) don't remove a zext op that is necessary and (2) generate
a pmovz instead of punpck if SSE4.1 is available. Although pmovz is 1 byte longer, it allows
folding of the load, and so saves 3 bytes overall.
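As a quick worked example of why the two differ: for an i8 value of -1 (0xFF), 'sext i8 -1 to i32' yields -1, while 'zext i8 -1 to i16' yields 255 and a following sext to i32 still yields 255, so dropping the zext changes the result.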
Owen Anderson [Thu, 28 Aug 2014 17:49:58 +0000 (17:49 +0000)]
Do not introduce new shuffle patterns after operation legalization if SHUFFLE_VECTOR
was marked custom. The target independent DAG combine has no way to know if
the shuffles it is introducing are ones that the target could support or not.
David Majnemer [Thu, 28 Aug 2014 10:08:37 +0000 (10:08 +0000)]
InstCombine: Remove redundant combines
InstSimplify already handles icmp (X+Y), X (and things like it)
appropriately. The first thing that InstCombine does is run
InstSimplify on the instruction.
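For instance (an illustrative pattern of that shape, not lifted from the patch):
  %sum = add nuw i32 %x, %y
  %cmp = icmp ult i32 %sum, %x
can be simplified straight to false, since the nuw add can never wrap back below %x, so there is nothing left for the removed InstCombine code to do.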
Erik Eckstein [Thu, 28 Aug 2014 07:04:02 +0000 (07:04 +0000)]
Fix: SLPVectorizer tried to move an instruction which was replaced by a vector instruction.
For a detailed description of the problem see the comment in the test file.
The problematic moveBefore() calls are not required anymore because the new
scheduling algorithm ensures a correct ordering anyway.
Chandler Carruth [Thu, 28 Aug 2014 03:52:45 +0000 (03:52 +0000)]
[x86] Inline an SSE4 helper function for INSERT_VECTOR_ELT lowering, no
functionality changed.
Separating this into two functions wasn't helping. There was a decent
amount of boilerplate duplicated, and some subsequent refactorings here
will pull even more common code out.
David Majnemer [Thu, 28 Aug 2014 03:34:28 +0000 (03:34 +0000)]
InstSimplify: Move a transform from InstCombine to InstSimplify
Several combines involving icmp (shl C2, %X) C1 can be simplified
without introducing any new instructions. Move them to InstSimplify;
while we are at it, make them more powerful.
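An illustrative case of the kind of fold that now lives in InstSimplify (sketched here, not copied from the patch):
  %s = shl nuw i32 3, %x
  %c = icmp ult i32 %s, 3
With nuw the shift can only keep the value at 3 or grow it, so the compare folds to false without creating any new instruction.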
Juergen Ributzka [Thu, 28 Aug 2014 02:06:55 +0000 (02:06 +0000)]
[FastISel] Undo phi node updates when falling-back to SelectionDAG.
The included test case would fail, because the MI PHI node would have two
operands from the same predecessor.
This problem occurs when a switch instruction couldn't be selected. This always happens, because there is no default switch support for FastISel to begin with.
The problem was that FastISel would first add the operand to the PHI nodes and
then fall-back to SelectionDAG, which would then in turn add the same operands
to the PHI nodes again.
This fix removes these duplicate PHI node operands by resetting the
PHINodesToUpdate to its original state before FastISel tried to select the
instruction.
Juergen Ributzka [Thu, 28 Aug 2014 00:09:46 +0000 (00:09 +0000)]
[FastISel]
Currently, instructions are folded very aggressively into the memory operation for AArch64, which can lead to the use of killed operands:
%vreg1<def> = ADDXri %vreg0<kill>, 2
%vreg2<def> = LDRBBui %vreg0, 2
... = ... %vreg1 ...
This usually happens when the result is also used by another non-memory
instruction in the same basic block, or any instruction in another basic block.
This fix teaches hasTrivialKill to not only check in the LLVM IR that the value has a single use, but also to check whether the register that represents that value has already been used. This can happen when the instruction with the use was folded
into another instruction (in this particular case a load instruction).
Juergen Ributzka [Wed, 27 Aug 2014 22:52:33 +0000 (22:52 +0000)]
[FastISel][AArch64] Don't fold instructions too aggressively into the memory operation.
Currently instructions are folded very aggressively into the memory operation,
which can lead to the use of killed operands:
%vreg1<def> = ADDXri %vreg0<kill>, 2
%vreg2<def> = LDRBBui %vreg0, 2
... = ... %vreg1 ...
This usually happens when the result is also used by another non-memory
instruction in the same basic block, or any instruction in another basic block.
If the computed address is used by only memory operations in the same basic
block, then it is safe to fold them. This is because all memory operations will
fold the address computation and the original computation will never be emitted.
Juergen Ributzka [Wed, 27 Aug 2014 21:38:33 +0000 (21:38 +0000)]
[FastISel][AArch64] Fix simplify address when the address comes from a shift.
When the address comes directly from a shift instruction then the address
computation cannot be folded into the memory instruction, because the zero
register is not available as a base register. The address simplification needs to emit the shift instruction and use the result as the base register.
Juergen Ributzka [Wed, 27 Aug 2014 21:04:52 +0000 (21:04 +0000)]
[FastISel][AArch64] Use the zero register for stores.
Use the zero register directly when possible to avoid an unnecessary register
copy and a wasted register at -O0. This also uses integer stores to store a
positive floating-point zero. This saves us from materializing the positive zero
in a register and then storing it.
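Roughly, for IR like (an illustrative line, not from the patch):
  store float 0.000000e+00, float* %p
we can now emit an integer store of wzr instead of first materializing +0.0 in a floating-point register and storing that.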
Juergen Ributzka [Wed, 27 Aug 2014 20:47:33 +0000 (20:47 +0000)]
[FastISel] Fix a potential bug in FastEmitInst_ri
FastEmitInst_ri was constraining the first operand without checking if it is
a virtual register. Use constrainOperandRegClass, as all the other FastEmitInst_* functions do.
David Blaikie [Wed, 27 Aug 2014 20:14:18 +0000 (20:14 +0000)]
Convert a few more cases of direct initialization of unique_ptrs from MemoryBuffer::getMemBuffer to move initialization, now that it returns a unique_ptr instead of a raw pointer.
Reid Kleckner [Wed, 27 Aug 2014 20:10:38 +0000 (20:10 +0000)]
X86 MC: Handle instructions like fxsave that match multiple operand sizes
Instructions like 'fxsave' and control flow instructions like 'jne'
match any operand size. The loop I added to the Intel syntax matcher
assumed that using a different size would give a different instruction.
Now it handles the case where we get the same instruction for different
memory operand sizes.
This also allows us to remove the hack we had for unsized absolute
memory operands, because we can successfully match things like 'jnz'
without reporting ambiguity. Removing this hack uncovered a test case
involving 'fadd' that was ambiguous. The memory operand could have been
single or double precision.
David Majnemer [Wed, 27 Aug 2014 20:08:37 +0000 (20:08 +0000)]
InstCombine: Combine gep X, (Y-X) to Y
We try to perform this transform in InstSimplify but we aren't always
able to. Sometimes, we need to insert a bitcast if X and Y don't have
the same type.
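A minimal sketch of the pattern (illustrative, not lifted from the patch):
  %xi = ptrtoint i8* %x to i64
  %yi = ptrtoint i8* %y to i64
  %d  = sub i64 %yi, %xi
  %p  = getelementptr i8* %x, i64 %d
Here %p is just %y, and the combine replaces it accordingly (inserting the bitcast first when the pointee types differ).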
Alexey Samsonov [Wed, 27 Aug 2014 19:36:53 +0000 (19:36 +0000)]
Use BitVector instead of int in R600 SIISelLowering.
int may not have enough bits in it, which was detected by UBSan bootstrap (it reported a left shift by a too-large constant).
David Majnemer [Wed, 27 Aug 2014 18:03:46 +0000 (18:03 +0000)]
InstSimplify: Compute comparison ranges for left shift instructions
'shl nuw CI, x' produces [CI, CI << CLZ(CI)]
'shl nsw CI, x' produces [CI << CLO(CI)-1, CI] if CI is negative
'shl nsw CI, x' produces [CI, CI << CLZ(CI)-1] if CI is non-negative
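As a worked example (not taken from the patch): with CI = 3 as an i8, CLZ(3) = 6, so 'shl nuw i8 3, %x' always lies in [3, 3 << 6] = [3, 192], and any comparison asking whether that result is unsigned-greater-than 192 can fold to false.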
Oliver Stannard [Wed, 27 Aug 2014 16:16:04 +0000 (16:16 +0000)]
Teach the AArch64 backend about v4f16 and v8f16
This teaches the AArch64 backend to handle the operations on v4f16 and v8f16 that are exposed by NEON intrinsics, plus the add, sub, mul and div operations.
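So, for example, plain IR such as (an illustrative line, not from the patch):
  %sum = fadd <4 x half> %a, %b
is now handled by the backend, along with the corresponding sub, mul and div forms.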
Chandler Carruth [Wed, 27 Aug 2014 11:39:47 +0000 (11:39 +0000)]
[x86] Fix a regression introduced with r213897 for 32-bit targets where
we stopped efficiently lowering sextload using the SSE41 instructions
for that operation.
This is a consequence of a bad predicate I used while thinking about the memory access needs. The code actually handles the cases where the predicate
doesn't apply, and handles them much better. =] Simple fix and a test
case added. Fixes PR20767.
Chandler Carruth [Wed, 27 Aug 2014 11:22:16 +0000 (11:22 +0000)]
[SDAG] Re-instate r215611 with a fix to a pesky X86 DAG combine.
This combine is essentially combining target-specific nodes back into target
independent nodes that it "knows" will be combined yet again by a target
independent DAG combine into a different set of target-independent nodes that
are legal (not custom though!) and thus "ok". This seems... deeply flawed. The
crux of the problem is that we don't combine un-legalized shuffles that are
introduced by legalizing other operations, and thus we don't see a very
profitable combine opportunity. So the backend just forces the input to that
combine to re-appear.
However, for this to work, the conditions detected to re-form the unlegalized
nodes must be *exactly* right. Previously, failing this would have caused poor
code (if you're lucky) or a crasher when we failed to select instructions.
After r215611 we would fall back into the legalizer. In some cases, this just
"fixed" the crasher by produces bad code. But in the test case added it caused
the legalizer and the dag combiner to iterate forever.
The fix is to make the alignment checking in the x86 side of things match the
alignment checking in the generic DAG combine exactly. This isn't really a
satisfying or principled fix, but it at least makes the code work as intended.
It also highlights that it would be nice to detect the availability of under-aligned loads for a given type rather than bailing on this optimization. I've
left a FIXME to document this.
Original commit message for r215611, which covers the rest of the change:
[SDAG] Fix a case where we would iteratively legalize a node during
combining by replacing it with something else but not re-process the
node afterward to remove it.
In a truly remarkable stroke of bad luck, this would (in the test case
attached) end up getting some other node combined into it without ever
getting re-processed. By adding it back on to the worklist, in addition
to deleting the dead nodes more quickly we also ensure that if it
*stops* being dead for any reason it makes it back through the
legalizer. Without this, the test case will end up failing during
instruction selection due to an and node with a type we don't have an
instruction pattern for.
It took many million runs of the shuffle fuzz tester to find this.
We supported transforming:
(gep i8* X, -(ptrtoint Y))
to:
(inttoptr (sub (ptrtoint X), (ptrtoint Y)))
However, this only fired if 'X' had type i8*. Generalize this to
support various types of different sizes. This results in much better
CodeGen, especially for pointers to packed structs.
Juergen Ributzka [Wed, 27 Aug 2014 00:58:30 +0000 (00:58 +0000)]
[FastISel][AArch64] Fix address simplification.
When a shift with extension or an add with shift and extension cannot be folded
into the memory operation, then the address calculation has to be materialized
separately. While doing so, the code forgot to consider a possible sign-/zero-extension. This fix now also folds the sign-/zero-extension into the add or shift instruction that is used to materialize the address.
Reid Kleckner [Tue, 26 Aug 2014 20:32:34 +0000 (20:32 +0000)]
MC: Split the x86 asm matcher implementations by dialect
The existing matcher has lots of AT&T assembly dialect assumptions baked
into it. In particular, the hack for resolving the size of a memory
operand by appending the four most common suffixes doesn't work at all.
The Intel assembly dialect mnemonic table has ambiguous entries, so we
need to try matching multiple times with different operand sizes, since
that's the only way to choose different instruction variants.
This makes us more compatible with gas's implementation of Intel
assembly syntax. MSVC assumes you want byte-sized operations for the
instructions that we reject as ambiguous.