Tom Stellard [Thu, 2 Jun 2016 21:01:43 +0000 (21:01 +0000)]
Merging part of r259297:
We need to correctly initialize the AMDGPUPromoteAlloca pass, because
later commits will add tests that try to pass the -amdgpu-promote-alloca
flag to opt.
AMDGPU/SI: Fix commuting of 32-bit VOPC instructions
Summary:
We didn't have entries in the commuting table for the 32-bit
instructions. I don't think we hit this problem now, but we
will once uniform branching is enabled. Tests will come in
a later commit.
When no device name is specified, default to kaveri
for HSA since SI is not supported and it woud fail.
Default to "tahiti" instead of "SI" since these are
effectively the same, and tahiti is an actual device.
Move default device handling to the TargetMachine
rather than the AMDGPUSubtarget. The module ISA version
is computed from the device name provided with the target
machine, so the attributes printed by the AsmPrinter were
inconsistent with those computed in the subtarget.
Also remove DevName field from subtarget since it's redundant
with getCPU() in the superclass.
The promote alloca pass didn't handle these intrinsics and crashed.
These intrinsics should accept any address space, but for now just
erase them to avoid breaking.
When MemoryDependenceAnalysis hits a CFG with many transparent blocks,
the algorithm easily degrades into quadratic memory and time complexity.
The easiest example is a long chain of BBs that don't otherwise use a
location. The caching will add an entry for every intermediate block and
limiting the number of results doesn't help as no results are produced
until a definition is found.
Introduce a limit similar to the existing instructions-per-block limit.
This limit counts the total number of blocks checked. If the limit is
reached, entries are considered unknown. The initial value is 1000,
which avoids regressions for normal sized functions while still
limiting edge cases to reasnable memory consumption and execution time.
Fix typing on generated LXV2DX/STXV2DX instructions
[PPC] Previously when casting generic loads to LXV2DX/ST instructions we
would leave the original load return type in place allowing for an
assertion failure when we merge two equivalent LXV2DX nodes with
different types.
This patch fixes a bug (PR26827) when using anti-aliasing in store
merging. This sets the chain users of the component stores to point to
the new store instead of the component stores chain parent.
FastIsel is not supported for microMIPS, thus it needs to be disabled.
Test micromips-zero-mat-uses.ll is deleted since the tested sequence of instructions is not generated for microMIPS without FastISel.
Differential Revision: http://reviews.llvm.org/D15892
[MIPS][LLVM-MC] Fix Disassemble of Negative Offset
Patch by Nitesh Jain.
Summary: The type of Imm in MipsDisassembler.cpp was incorrect since SignExtend64 return int64_t type.As per the MIPSr6 doc ,the offset is added to the address of the instruction following the branch (not the branch itself), to form a PC-relative effective target address hence “4” is added to the offset. The offset of some test case are update to reflect the changes due to “ + 4 ” offset and new test case for negative offset are added.
The new X86 shuffle lowering can do just fine without transforming vselects
into vector_shuffles. It looks like the only thing this code does right now
is cause trouble - in particular, it can lead to combine/legalization infinite
loops.
Note that it's not completely NFC, since some of the shuffle masks get inverted,
which may cause slight differences further down the line. We may want to find
a way to invert those masks, but that's orthogonal to this commit.
This means that LTO_SYMBOL_SCOPE_DEFAULT_CAN_BE_HIDDEN will not be set
in a few cases.
This should have no impact in ld64 since it doesn't use lazy loading
when merging modules and that is when it checks
LTO_SYMBOL_SCOPE_DEFAULT_CAN_BE_HIDDEN.
[X86] Correctly select registers to pop into for x86_64
When trying to replace an add to esp with pops, we need to choose dead
registers to pop into. Registers clobbered by the call and not imp-def'd
by it should be safe. Except that it's not enough to check the register
itself isn't defined, we also need to make sure no overlapping registers
are defined either.
Tom Stellard [Wed, 11 May 2016 13:54:46 +0000 (13:54 +0000)]
Merging r265097:
Partial-rebuilding /home/tstellar/llvm/.git/svn/refs/remotes/origin/master/.rev_map.91177308-0d34-0410-b5e6-96231b3b80d8 ...
Currently at 268516 = 990ef3411fc39ac61eae0bcfaf25f824209a76a7
r268517 = d1310b73a355972c610bc15c043906e0f4b2e072
Done rebuilding /home/tstellar/llvm/.git/svn/refs/remotes/origin/master/.rev_map.91177308-0d34-0410-b5e6-96231b3b80d8
------------------------------------------------------------------------
r265097 | cycheng | 2016-03-31 19:05:29 -0700 (Thu, 31 Mar 2016) | 17 lines
Fix Sub-register Rewriting in Aggressive Anti-Dependence Breaker
Previously, HandleLastUse would delete RegRef information for sub-registers
if they were dead even if their corresponding super-register were still live.
If the super-register were later renamed, then the definitions of the
sub-register would not be updated appropriately. This patch alters the
behavior so that RegInfo information for sub-registers is only deleted when
the sub-register and super-register are both dead.
This resolves PR26775. This is the mirror image of Hal's r227311 commit.
Author: Tom Jablin (tjablin)
Reviewers: kbarton uweigand nemanjai hfinkel
[X86] Make sure we do not clobber RBX with cmpxchg when used as a base pointer.
cmpxchg[8|16]b uses RBX as one of its argument.
In other words, using this instruction clobbers RBX as it is defined to hold one
the input. When the backend uses dynamically allocated stack, RBX is
used as a reserved register for the base pointer.
Reserved registers have special semantic that only the target understands and
enforces, because of that, the register allocator don’t use them, but also,
don’t try to make sure they are used properly (remember it does not know how
they are supposed to be used).
Therefore, when RBX is used as a reserved register but defined by something that
is not compatible with that use, the register allocator will not fix the
surrounding code to make sure it gets saved and restored properly around the
broken code. This is the responsibility of the target to do the right thing with
its reserved register.
To fix that, when the base pointer needs to be preserved, we use a different
pseudo instruction for cmpxchg that save rbx.
That pseudo takes two more arguments than the regular instruction:
- One is the value to be copied into RBX to set the proper value for the
comparison.
- The other is the virtual register holding the save of the value of RBX as the
base pointer. This saving is done as part of isel (i.e., we emit a copy from
rbx).
Note: The actual modeling of the pseudo is a bit more complicated to make sure
the interferes that appears after the pseudo gets expanded are properly modeled
before that expansion.
[X86] Make sure it is safe to clobber EFLAGS, if need be, when choosing
the prologue.
Do not use basic blocks that have EFLAGS live-in as prologue if we need
to realign the stack. Realigning the stack uses AND instruction and this
clobbers EFLAGS.
An other alternative would have been to save and restore EFLAGS around
the stack realignment code, but this is likely inefficient.
Merging r264335:
------------------------------------------------------------------------
r264335 | dim | 2016-03-24 21:39:17 +0100 (Thu, 24 Mar 2016) | 17 lines
Add <atomic> to ThreadPool.h, since std::atomic is used
Summary:
Apparently, when compiling with gcc 5.3.2 for powerpc64, the order of
headers is such that it gets an error about std::atomic<> use in
ThreadPool.h, since this header is not included explicitly. See also:
https://llvm.org/bugs/show_bug.cgi?id=27058
Fix this by including <atomic>. Patch by Bryan Drewery.
This patch corresponds to review:
http://reviews.llvm.org/D17294
It ensures that whatever block we are emitting the prologue/epilogue into, we
have the necessary scratch registers. It takes away the hard-coded register
numbers for use as scratch registers as registers that are guaranteed to be
available in the function prologue/epilogue are not guaranteed to be available
within the function body. Since we shrink-wrap, the prologue/epilogue may end
up in the function body.
------------------------------------------------------------------------
The patch has a necessary call to a function inside an assert. Which is fine
when you have asserts turned on. Not so much when they're off. Sorry about
the regression.
------------------------------------------------------------------------
This is what was meant to be in the initial commit to fix this bug. The
parens were missing. This commit also adds a test case for the bug and
has undergone full testing on PPC and X86.
------------------------------------------------------------------------
[X86ISelLowering] Fix TLSADDR lowering when shrink-wrapping is enabled.
TLSADDR nodes are lowered into actuall calls inside MC. In order to prevent
shrink-wrapping from pushing prologue/epilogue past them (which result
in TLS variables being accessed before the stack frame is set up), we
put markers, so that the stack gets adjusted properly.
Thanks to Quentin Colombet for guidance/help on how to fix this problem!
Hans Wennborg [Fri, 19 Feb 2016 21:35:00 +0000 (21:35 +0000)]
Merging r261360:
------------------------------------------------------------------------
r261360 | dim | 2016-02-19 12:14:11 -0800 (Fri, 19 Feb 2016) | 19 lines
Fix incorrect selection of AVX512 sqrt when OptForSize is on
Summary:
When optimizing for size, sqrt calls can be incorrectly selected as
AVX512 VSQRT instructions. This is because X86InstrAVX512.td has a
`Requires<[OptForSize]>` in its `avx512_sqrt_scalar` multiclass
definition. Even if the target does not support AVX512, the class can
apparently still be chosen, leading to an incorrect selection of
`vsqrtss`.
In PR26625, this lead to an assertion: Reg >= X86::FP0 && Reg <=
X86::FP6 && "Expected FP register!", because the `vsqrtss` instruction
requires an XMM register, which is not available on i686 CPUs.
[IR] Straighten out bundle overload of IRBuilder::CreateCall
IRBuilder has two ways of putting bundle operands on calls: the default
operand bundle, and an overload of CreateCall that takes an operand
bundle list.
Previously, this overload used a default argument of None. This made it
impossible to distinguish between the case were the caller doesn't care
about bundles, and the case where the caller explicitly wants no
bundles. We behaved as if they wanted the latter behavior rather than
the former, which led to problems with simplifylibcalls and WinEH.
This change fixes it by making the parameter non-optional, so we can
distinguish these two cases.
------------------------------------------------------------------------
[X86] Fix a shrink-wrapping miscompile around __chkstk
__chkstk clobbers EAX. If EAX is live across the prologue, then we have
to take extra steps to save it. We already had code to do this if EAX
was a register parameter. This change adapts it to work when shrink
wrapping is used.
------------------------------------------------------------------------
PruneEH had functionality idential to removeUnwindEdge.
Consolidate around removeUnwindEdge.
No functionality change is intended.
------------------------------------------------------------------------
[LoopStrengthReduce] Don't rewrite PHIs with incoming values from CatchSwitches
Bail out if we have a PHI on an EHPad that gets a value from a
CatchSwitchInst. Because the CatchSwitchInst cannot be split, there is
no good place to stick any instructions.
This fixes PR26373.
------------------------------------------------------------------------
[SPARC] Repair floating-point condition encodings in assembly parser.
The encodings for floating point conditions A(lways) and N(ever) were
incorrectly specified for the assembly parser, per Sparc manual v8 page
121. This change corrects that mistake.
Also, strangely, all of the branch instructions already had MC test
cases, except for the broken ones. Added the tests.
Summary:
Document the LLVM_BUILD_LLVM_DYLIB and LLVM_LINK_LLVM_DYLIB
CMake options, move BUILD_SHARED_LIBS out of frequently-used,
and add a note/warning to BUILD_SHARED_LIBS.
Hans Wennborg [Fri, 12 Feb 2016 19:03:12 +0000 (19:03 +0000)]
Merging r260703:
------------------------------------------------------------------------
r260703 | hans | 2016-02-12 11:02:39 -0800 (Fri, 12 Feb 2016) | 11 lines
[CMake] don't build libLTO when LLVM_ENABLE_PIC is OFF
When cmake is run with -DLLVM_ENABLE_PIC=OFF, build fails while
linking shared library libLTO.so, because its dependencies are built
with -fno-PIC. More details here: https://llvm.org/bugs/show_bug.cgi?id=26484.
This diff reverts r252652 (git 9fd4377ddb83aee3c049dc8757e7771edbb8ee71),
which removed check NOT LLVM_ENABLE_PIC before disabling build for libLTO.so.
AMDGPU: Release the scavenged offset register during VGPR spill
Summary:
This fixes a crash where subsequent spills would be unable to scavenge
a register. In particular, it fixes a crash in piglit's
spec@glsl-1.50@execution@geometry@max-input-components (the test still
has a shader that fails to compile because of too many SGPR spills, but
at least it doesn't crash any more).
Hans Wennborg [Fri, 12 Feb 2016 01:42:38 +0000 (01:42 +0000)]
Merging r260587:
------------------------------------------------------------------------
r260587 | pete | 2016-02-11 13:10:40 -0800 (Thu, 11 Feb 2016) | 13 lines
Set load alignment on aggregate loads.
When optimizing a extractvalue(load), we generate a load from the
aggregate type. This load didn't have alignment set and so would
get the alignment of the type. This breaks when the type is packed
and so the alignment should be lower.
For example, loading { int, int } would give us alignment of 4, but
the original load from this type may have an alignment of 1 if packed.
[DWARFDebug] Fix another case of overlapping ranges
Summary:
In r257979, I added code to ensure that we wouldn't merge DebugLocEntries if
the pieces they describe overlap. Unfortunately, I failed to cover the case,
where there may have multiple active Expressions in the entry, in which case we
need to make sure that no two values overlap before we can perform the merge.
[SystemZ] Fix wrong-code generation for certain always-false conditions
We've found another bug in the code generation logic conditions for a
certain class of always-false conditions, those of the form
if ((a & 1) < 0)
These only reach the back end when compiling without optimization.
The bug was introduced by the choice of using TEST UNDER MASK
to implement a check for
if ((a & MASK) < VAL)
as
if ((a & MASK) == 0)
where VAL is less than the the lowest bit of MASK. This is correct
in all cases except for VAL == 0, in which case the original
condition is always false, but the replacement isn't.
This is a simple fix for a PowerPC intrinsic that was incorrectly defined
(the return type was incorrect).
------------------------------------------------------------------------
Using the load immediate only when the immediate (whether signed or unsigned)
can fit in a 16-bit signed field. Namely, from -32768 to 32767 for signed and
0 to 65535 for unsigned. This patch also ensures that we sign-extend under the
right conditions.
------------------------------------------------------------------------
Enable the %s modifier in inline asm template string
This patch corresponds to review:
http://reviews.llvm.org/D16847
There are some files in glibc that use the output operand modifier even though
it was deprecated in GCC. This patch just adds support for it to prevent issues
with such files.
------------------------------------------------------------------------
Address NDEBUG-related linkage issues for Value::assertModuleIsMaterialized()
The IR/Value class had a linkage issue present when LLVM was built
as a library, and the LLVM library build time had different settings
for NDEBUG than the client of the LLVM library. Clients could get
into a state where the LLVM lib expected
Value::assertModuleIsMaterialized() to be inline-defined in the header
but clients expected that method to be defined in the LLVM library.
See this llvm-commits thread for more details:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20160201/329667.html
------------------------------------------------------------------------
Since LI/LIS sign extend the constant passed into the instruction we should
check that the sign extended constant fits into 16-bits if we want a
zero extended value, otherwise go ahead and put it together piecemeal.
This regresses a test in LoopVectorize, so I'll need to go away and think about how to solve this in a way that isn't broken.
From the writeup in PR26071:
What's happening is that ComputeKnownZeroes is telling us that all bits except the LSB are zero. We're then deciding that only the LSB needs to be demanded from the icmp's inputs.
This is where we're wrong - we're assuming that after simplification the bits that were known zero will continue to be known zero. But they're not - during trivialization the upper bits get changed (because an XOR isn't shrunk), so the icmp fails.
The fault is in demandedbits - its contract does clearly state that a non-demanded bit may either be zero or one.
------------------------------------------------------------------------
ARM: don't mangle DAG constant if it has more than one use
The basic optimisation was to convert (mul $LHS, $complex_constant) into
roughly "(shl (mul $LHS, $simple_constant), $simple_amt)" when it was expected
to be cheaper. The original logic checks that the mul only has one use (since
we're mangling $complex_constant), but when used in even more complex
addressing modes there may be an outer addition that can pick up the wrong
value too.
I *think* the ARM addressing-mode problem is actually unreachable at the
moment, but that depends on complex assessments of the profitability of
pre-increment addressing modes so I've put a real check in there instead of an
assertion.
------------------------------------------------------------------------
[InstCombine] avoid an insertelement transformation that induces the opposite extractelement fold (PR26354)
We would infinite loop because we created a shufflevector that was wider than
needed and then failed to combine that with the insertelement. When subsequently
visiting the extractelement from that shuffle, we see that it's unnecessary,
delete it, and trigger another visit to the insertelement.