Simon Dardis [Thu, 26 Jan 2017 10:19:02 +0000 (10:19 +0000)]
[mips] N64 static relocation model support
This patch makes one change to GOT handling and two changes to N64's
relocation model handling. Furthermore, the jumptable encodings have
been corrected for static N64.
Big GOT handling is now done via a new SDNode MipsGotHi - this node is
unconditionally lowered to an lui instruction.
The first change to N64's relocation handling is the lifting of the
restriction that N64 always uses PIC. Now it is possible to target static
environments.
The second change adds support for 64 bit symbols and enables them by
default. Previously N64 had patterns for sym32 mode only. In this mode all
symbols are assumed to have 32 bit addresses. sym32 mode support
is selectable with attribute 'sym32'. A follow on patch for clang will
add the necessary frontend parameter.
Diana Picus [Thu, 26 Jan 2017 09:20:47 +0000 (09:20 +0000)]
[ARM] GlobalISel: Load i1, i8 and i16 args from stack
Add support for loading i1, i8 and i16 arguments from the stack, with or without
the ABI extension flags.
When the ABI extension flags are present, we load a 4-byte value, otherwise we
preserve the size of the load and let the instruction selector replace it with a
LDRB/LDRH. This generates the same thing as DAGISel.
Chandler Carruth [Thu, 26 Jan 2017 08:31:54 +0000 (08:31 +0000)]
[PM] Use PoisoningVH correctly when merely deleting entries in a map
with it.
This code was dereferencing the PoisoningVH which isn't allowed once it
is poisoned. But the code itself really doesn't need to access the
pointer, it is just doing the safe stuff of clearing out data structures
keyed on the pointer value.
Change the code to use iterators to erase directly from a DenseMap. This
is also substantially more efficient as it avoids lots of hashing and
lookups to do the erasure. DenseMap supports iterating behind the
iteration which is fairly easy to implement.
Sadly, I don't have a test case here. I'm not even close and I don't
know that I ever will be. The issue is that several of the tricky
aspects of fixing this only show up when you cause the stack's
SmallVector to be in *EXACTLY* the right location. I only ever got
a reproduction for those with Clang, and only with *exactly* the right
command line flags. Any adjustment, even to seemingly unrelated flags,
would make partial and half-way solutions magically start to "work". In
good news, all of this was caught with the LLVM test suite. Also, there
is no *specific* code here that is untested, just that the old pattern
of code won't immediately fail on any test case I've managed to
contrive.
Jonas Paulsson [Thu, 26 Jan 2017 07:03:25 +0000 (07:03 +0000)]
[TargetTransformInfo] Refactor and improve getScalarizationOverhead()
Refactoring to remove duplications of this method.
New method getOperandsScalarizationOverhead() that looks at the present unique
operands and add extract costs for them. Old behaviour was to just add extract
costs for one operand of the type always, which still happens in
getArithmeticInstrCost() if no operands are provided by the caller.
This is a good start of improving on this, but there are more places
that can be improved by using getOperandsScalarizationOverhead().
Review: Hal Finkel
https://reviews.llvm.org/D29017
Craig Topper [Thu, 26 Jan 2017 05:17:13 +0000 (05:17 +0000)]
[X86] Add demanded elts support for the inputs to pclmul intrinsic
This intrinsic uses bit 0 and bit 4 of an immediate argument to determine which bits of its inputs to read. This patch uses this information to simplify the demanded elements of the input vectors.
Chandler Carruth [Thu, 26 Jan 2017 02:13:50 +0000 (02:13 +0000)]
[PM] Simplify the new PM interface to the loop unroller and expose two
factory functions for the two modes the loop unroller is actually used
in in-tree: simplified full-unrolling and the entire thing including
partial unrolling.
I've also wired these up to nice names so you can express both of these
being in a pipeline easily. This is a precursor to actually enabling
these parts of the O2 pipeline.
Chandler Carruth [Thu, 26 Jan 2017 02:07:20 +0000 (02:07 +0000)]
[Loops] Restructure the LoopInfo verify function so that it more
directly walks the current loop structure verifying that a matching
structure can be found in a freshly computed version.
Also pull things out of containers when necessary once an issue is found
and print them directly.
This makes it substantially easier to debug verification failures as
the process stops at the exact point in the loop nest where they diverge
and has in easily accessed local variables (or printed to stderr
already) the loops and other information needed to analyze the failure.
[LoopUnroll] Properly update loopinfo for runtime unrolling by 2
Even when we don't create a remainder loop (that is, when we unroll by 2), we
may duplicate nested loops into the remainder. This is complicated by the fact
the remainder may itself be either inserted into an outer loop, or at the top
level. In the latter case, we may need to create new top-level loops.
Kevin Enderby [Wed, 25 Jan 2017 23:57:32 +0000 (23:57 +0000)]
Change the test added in r293099 so it does not have the string "llvm-nm" to fix
the clang-x86-windows-msvc2015 bot as the name is "llvm-nm.EXE" in that case.
Adam Nemet [Wed, 25 Jan 2017 23:20:33 +0000 (23:20 +0000)]
New OptimizationRemarkEmitter pass for MIR
This allows MIR passes to emit optimization remarks with the same level
of functionality that is available to IR passes.
It also hooks up the greedy register allocator to report spills. This
allows for interesting use cases like increasing interleaving on a loop
until spilling of registers is observed.
I still need to experiment whether reporting every spill scales but this
demonstrates for now that the functionality works from llc
using -pass-remarks*=<pass>.
Adam Nemet [Wed, 25 Jan 2017 23:20:25 +0000 (23:20 +0000)]
[OptDiag] Split code region out of DiagnosticInfoOptimizationBase
Code region is the only part of this class that is IR-specific. Code
region is moved down in the inheritance tree to a new derived class,
called DiagnosticInfoIROptimization.
All the existing remarks are derived from this new class now.
This allows the new MIR pass-remark classes to be derived from
DiagnosticInfoOptimizationBase.
Also because we keep the name DiagnosticInfoOptimizationBase, the clang
parts don't need any adjustment.
Kevin Enderby [Wed, 25 Jan 2017 21:33:38 +0000 (21:33 +0000)]
Add a warning when the llvm-nm -print-size flag is used on a Mach-O file as
Mach-O files don’t have size information about the symbols in the object file
format unlike ELF.
Also add the part of the fix to llvm-nm that was missed with r290001 so
-arch armv7m works.
Daniel Jasper [Wed, 25 Jan 2017 21:21:08 +0000 (21:21 +0000)]
Revert "[PPC] Give unaligned memory access lower cost on processor that supports it"
This reverts commit r292680. It is causing significantly worse
performance and test timeouts in our internal builds. I have already
routed reproduction instructions your way.
Zachary Turner [Wed, 25 Jan 2017 21:17:40 +0000 (21:17 +0000)]
[pdb] Correctly parse the hash adjusters table from TPI stream.
This is not a list of pairs, it is a hash table data structure. We now
correctly parse this out and dump it from llvm-pdbdump.
We still need to understand the conditions that lead to a type
getting an entry in the hash adjuster table. That will be done
in a followup investigation / patch.
Tim Northover [Wed, 25 Jan 2017 20:58:26 +0000 (20:58 +0000)]
SDag: fix how initial loads are formed when splitting vector ops.
Later code expects the vector loads produced to be directly
concatenable, which means we shouldn't pad anything except the last load
produced with UNDEF.
Daniel Berlin [Wed, 25 Jan 2017 20:56:19 +0000 (20:56 +0000)]
MemorySSA: Link all defs together into an intrusive defslist, to make updater easier
Summary:
This is the first in a series of patches to add a simple, generalized updater to MemorySSA.
For MemorySSA, every def is may-def, instead of the normal must-def.
(the best way to think of memoryssa is "everything is really one variable, with different versions of that variable at different points in the program).
This means when updating, we end up having to do a bunch of work to touch defs below and above us.
In order to support this quickly, i have ilist'd all the defs for each block. ilist supports tags, so this is quite easy. the only slightly messy part is that you can't have two iplists for the same type that differ only whether they have the ownership part enabled or not, because the traits are for the value type.
The verifiers have been updated to test that the def order is correct.
Serge Rogatch [Wed, 25 Jan 2017 20:21:49 +0000 (20:21 +0000)]
[XRay][AArch64] More staging for tail call support in XRay on AArch64 - in LLVM
Summary:
This patch prepares more for tail call support in XRay. Until the logging part supports tail calls, this is just staging, so it seems LLVM part is mostly ready with this patch.
Related: https://reviews.llvm.org/D28948 (compiler-rt)
Matthias Braun [Wed, 25 Jan 2017 17:12:10 +0000 (17:12 +0000)]
PowerPC: Slight cleanup of getReservedRegs(); NFC
Change getReservedRegs() to not mark a register as reserved and then
revert that decision in some cases. Motivated by the discussion in
https://reviews.llvm.org/D29056
Artur Pilipenko [Wed, 25 Jan 2017 16:00:44 +0000 (16:00 +0000)]
[Guards] Introduce loop-predication pass
This patch introduces guard based loop predication optimization. The new LoopPredication pass tries to convert loop variant range checks to loop invariant by widening checks across loop iterations. For example, it will convert
for (i = 0; i < n; i++) {
guard(i < len);
...
}
to
for (i = 0; i < n; i++) {
guard(n - 1 < len);
...
}
After this transformation the condition of the guard is loop invariant, so loop-unswitch can later unswitch the loop by this condition which basically predicates the loop by the widened condition:
if (n - 1 < len)
for (i = 0; i < n; i++) {
...
}
else
deoptimize
This patch relies on an NFC change to make ScalarEvolution::isMonotonicPredicate public (revision 293062).
Martin Bohme [Wed, 25 Jan 2017 14:28:19 +0000 (14:28 +0000)]
[ARM] GlobalISel: Fix stack-use-after-scope bug.
Summary:
Lifetime extension wasn't triggered on the result of BuildMI because the
reference was non-const. However, instead of adding a const, I've
removed the reference entirely as RVO should kick in anyway.
Alexey Bataev [Wed, 25 Jan 2017 09:54:38 +0000 (09:54 +0000)]
[SLP] Improve horizontal vectorization for non-power-of-2 number of
instructions.
If number of instructions in horizontal reduction list is not power of 2
then only PowerOf2Floor(NumberOfInstructions) last elements are actually
vectorized, other instructions remain scalar. Patch tries to vectorize
the remaining elements either.
whitequark [Wed, 25 Jan 2017 09:32:30 +0000 (09:32 +0000)]
Mark @llvm.powi.* as safe to speculatively execute.
Floating point intrinsics in LLVM are generally not speculatively
executed, since most of them are defined to behave the same as libm
functions, which set errno.
However, the @llvm.powi.* intrinsics do not correspond to any libm
function, and lacks any defined error handling semantics in LangRef.
It most certainly does not alter errno.
Artur Pilipenko [Wed, 25 Jan 2017 08:53:31 +0000 (08:53 +0000)]
[DAGCombiner] Match load by bytes idiom and fold it into a single load. Attempt #2.
The previous patch (https://reviews.llvm.org/rL289538) got reverted because of a bug. Chandler also requested some changes to the algorithm.
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20161212/413479.html
This is an updated patch. The key difference is that collectBitProviders (renamed to calculateByteProvider) now collects the origin of one byte, not the whole value. It simplifies the implementation and allows to stop the traversal earlier if we know that the result won't be used.
From the original commit:
Match a pattern where a wide type scalar value is loaded by several narrow loads and combined by shifts and ors. Fold it into a single load or a load and a bswap if the targets supports it.
Assuming little endian target:
i8 *a = ...
i32 val = a[0] | (a[1] << 8) | (a[2] << 16) | (a[3] << 24)
=>
i32 val = *((i32)a)
This optimization was discussed on llvm-dev some time ago in "Load combine pass" thread. We came to the conclusion that we want to do this transformation late in the pipeline because in presence of atomic loads load widening is irreversible transformation and it might hinder other optimizations.
Eventually we'd like to support folding patterns like this where the offset has a variable and a constant part:
i32 val = a[i] | (a[i + 1] << 8) | (a[i + 2] << 16) | (a[i + 3] << 24)
Matching the pattern above is easier at SelectionDAG level since address reassociation has already happened and the fact that the loads are adjacent is clear. Understanding that these loads are adjacent at IR level would have involved looking through geps/zexts/adds while looking at the addresses.
The general scheme is to match OR expressions by recursively calculating the origin of individual bytes which constitute the resulting OR value. If all the OR bytes come from memory verify that they are adjacent and match with little or big endian encoding of a wider value. If so and the load of the wider type (and bswap if needed) is allowed by the target generate a load and a bswap if needed.
Diana Picus [Wed, 25 Jan 2017 08:10:40 +0000 (08:10 +0000)]
[ARM] GlobalISel: Support i8/i16 ABI extensions
At the moment, this means supporting the signext/zeroext attribute on the return
type of the function. For function arguments, signext/zeroext should be handled
by the caller, so there's nothing for us to do until we start lowering calls.
Note that this does not include support for other extensions (i8 to i16), those
will be added later.
Serge Pavlov [Wed, 25 Jan 2017 07:58:10 +0000 (07:58 +0000)]
Do not verify dominator tree if it has no roots
If dominator tree has no roots, the pass that calculates it is
likely to be skipped. It occures, for instance, in the case of
entities with linkage available_externally. Do not run tree
verification in such case.
Coby Tayree [Wed, 25 Jan 2017 07:09:42 +0000 (07:09 +0000)]
[X86]Enable the use of 'mov' with a 64bit GPR and a large immediate
Enable the next form (intel style):
"mov <reg64>, <largeImm>"
which is should be available,
where <largeImm> stands for immediates which exceed the range of a singed 32bit integer
Akira Hatanaka [Wed, 25 Jan 2017 06:21:51 +0000 (06:21 +0000)]
[SimplifyCFG] Do not sink and merge inline-asm instructions.
Conservatively disable sinking and merging inline-asm instructions as doing so
can potentially create arguments that cannot satisfy the inline-asm constraints.
For example, SimplifyCFG used to do the following transformation:
Matt Arsenault [Wed, 25 Jan 2017 06:08:42 +0000 (06:08 +0000)]
DAG: Recognize no-signed-zeros-fp-math attribute
clang already emits this with -cl-no-signed-zeros, but codegen
doesn't do anything with it. Treat it like the other fast math
attributes, and change one place to use it.
Matt Arsenault [Wed, 25 Jan 2017 04:25:02 +0000 (04:25 +0000)]
AMDGPU: Implement early ifcvt target hooks.
Leave early ifcvt disabled for now since there are some
shader-db regressions.
This causes some immediate improvements, but could be better.
The cost checking that the pass does is based on critical path
length for out of order CPUs which we do not want so it skips out
on many cases we want.
Chandler Carruth [Wed, 25 Jan 2017 02:49:01 +0000 (02:49 +0000)]
[PM] Teach LoopUnroll to update the LPM infrastructure as it unrolls
loops.
We do this by reconstructing the newly added loops after the unroll
completes to avoid threading pass manager details through all the mess
of the unrolling infrastructure.
I've enabled some extra assertions in the LPM to try and catch issues
here and enabled a bunch of unroller tests to try and make sure this is
sane.
Currently, I'm manually running loop-simplify when needed. That should
go away once it is folded into the LPM infrastructure.
Ahmed Bougacha [Wed, 25 Jan 2017 02:41:38 +0000 (02:41 +0000)]
[GlobalISel] Generate selector for more integer binop patterns.
This surprisingly isn't NFC because there are patterns to select GPR
sub to SUBSWrr (rather than SUBWrr/rs); SUBS is later optimized to
SUB if NZCV is dead. From ISel's perspective, both are fine.
Gor Nishanov [Wed, 25 Jan 2017 02:25:54 +0000 (02:25 +0000)]
[coroutines] Spill the result of the invoke instruction correctly
Summary:
When we decide that the result of the invoke instruction need to be spilled, we need to insert the spill into a block that is on the normal edge coming out of the invoke instruction. (Prior to this change the code would insert the spill immediately after the invoke instruction, which breaks the IR, since invoke is a terminator instruction).
In the following example, we will split the edge going into %cont and insert the spill there.
Justin Bogner [Wed, 25 Jan 2017 00:16:53 +0000 (00:16 +0000)]
GlobalISel: Use the correct types when translating landingpad instructions
There was a bug here where we were using p0 instead of s32 for the
selector type in the landingpad. Instead of hardcoding these types we
should get the types from the landingpad instruction directly.
Note that we replicate an assert from SDAG here to only support
two-valued landingpads.
Matt Arsenault [Tue, 24 Jan 2017 22:18:39 +0000 (22:18 +0000)]
AMDGPU: Remove spurious out branches after a kill
The sequence like this:
v_cmpx_le_f32_e32 vcc, 0, v0
s_branch BB0_30
s_cbranch_execnz BB0_30
; BB#29:
exp null off, off, off, off done vm
s_endpgm
BB0_30:
; %endif110
is likely wrong. The s_branch instruction will unconditionally jump
to BB0_30 and the skip block (exp done + endpgm) inserted for
performing the kill instruction will never be executed. This results
in a GPU hang with Star Ruler 2.
The s_branch instruction is added during the "Control Flow Optimizer"
pass which seems to re-organize the basic blocks, and we assume
that SI_KILL_TERMINATOR is always the last instruction inside a
basic block. Thus, after inserting a skip block we just go to the
next BB without looking at the subsequent instructions after the
kill, and the s_branch op is never removed.
Instead, we should remove the unconditional out branches and let
skip the two instructions if the exec mask is non-zero.
This patch fixes the GPU hang and doesn't introduce any regressions
with "make check".