Matt Arsenault [Wed, 25 Jan 2017 04:25:02 +0000 (04:25 +0000)]
AMDGPU: Implement early ifcvt target hooks.
Leave early ifcvt disabled for now since there are some
shader-db regressions.
This causes some immediate improvements, but could be better.
The cost checking that the pass does is based on critical-path
length for out-of-order CPUs, which is not what we want here, so it
skips many cases we do want.
Chandler Carruth [Wed, 25 Jan 2017 02:49:01 +0000 (02:49 +0000)]
[PM] Teach LoopUnroll to update the LPM infrastructure as it unrolls
loops.
We do this by reconstructing the newly added loops after the unroll
completes to avoid threading pass manager details through all the mess
of the unrolling infrastructure.
I've enabled some extra assertions in the LPM to try and catch issues
here and enabled a bunch of unroller tests to try and make sure this is
sane.
Currently, I'm manually running loop-simplify when needed. That should
go away once it is folded into the LPM infrastructure.
Ahmed Bougacha [Wed, 25 Jan 2017 02:41:38 +0000 (02:41 +0000)]
[GlobalISel] Generate selector for more integer binop patterns.
This surprisingly isn't NFC because there are patterns to select GPR
sub to SUBSWrr (rather than SUBWrr/rs); SUBS is later optimized to
SUB if NZCV is dead. From ISel's perspective, both are fine.
Gor Nishanov [Wed, 25 Jan 2017 02:25:54 +0000 (02:25 +0000)]
[coroutines] Spill the result of the invoke instruction correctly
Summary:
When we decide that the result of the invoke instruction needs to be spilled, we need to insert the spill into a block that is on the normal edge coming out of the invoke instruction. (Prior to this change, the code would insert the spill immediately after the invoke instruction, which breaks the IR, since invoke is a terminator instruction.)
In the following example, we will split the edge going into %cont and insert the spill there.
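That example is not reproduced here. As a rough C++ sketch only (the function name, spill slot, and dominator tree are assumed context, not the actual CoroSplit code), the fix amounts to:
#include "llvm/IR/Dominators.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Transforms/Utils/BasicBlockUtils.h"
static void spillInvokeResult(llvm::InvokeInst *II, llvm::AllocaInst *Slot,
                              llvm::DominatorTree &DT) {
  using namespace llvm;
  // Split the edge to the normal destination: the new block runs only
  // when the invoke did not unwind, so the result is valid there, and
  // the invoke itself stays a proper terminator.
  BasicBlock *SpillBB = SplitEdge(II->getParent(), II->getNormalDest(), &DT);
  IRBuilder<> Builder(SpillBB->getTerminator());
  Builder.CreateStore(II, Slot);
}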
Justin Bogner [Wed, 25 Jan 2017 00:16:53 +0000 (00:16 +0000)]
GlobalISel: Use the correct types when translating landingpad instructions
There was a bug here where we were using p0 instead of s32 for the
selector type in the landingpad. Instead of hardcoding these types we
should get the types from the landingpad instruction directly.
Note that we replicate an assert from SDAG here to only support
two-valued landingpads.
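A hedged sketch of the idea (not the verbatim IRTranslator change; DL is the module's DataLayout):
#include "llvm/CodeGen/LowLevelType.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Instructions.h"
#include <cassert>
static void landingPadTypes(const llvm::LandingPadInst &LP,
                            const llvm::DataLayout &DL) {
  llvm::Type *Ty = LP.getType();
  // Mirror SDAG's restriction to the common two-field form.
  assert(Ty->isStructTy() && Ty->getStructNumElements() == 2 &&
         "only two-valued landingpads are supported");
  // Derive both low-level types from the instruction instead of
  // hardcoding p0 for the pointer and s32 for the selector.
  llvm::LLT ExnTy = llvm::getLLTForType(*Ty->getStructElementType(0), DL);
  llvm::LLT SelTy = llvm::getLLTForType(*Ty->getStructElementType(1), DL);
  (void)ExnTy;
  (void)SelTy;
}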
Matt Arsenault [Tue, 24 Jan 2017 22:18:39 +0000 (22:18 +0000)]
AMDGPU: Remove spurious out branches after a kill
A sequence like this:
v_cmpx_le_f32_e32 vcc, 0, v0
s_branch BB0_30
s_cbranch_execnz BB0_30
; BB#29:
exp null off, off, off, off done vm
s_endpgm
BB0_30:
; %endif110
is likely wrong. The s_branch instruction will unconditionally jump
to BB0_30 and the skip block (exp done + endpgm) inserted for
performing the kill instruction will never be executed. This results
in a GPU hang with Star Ruler 2.
The s_branch instruction is added during the "Control Flow Optimizer"
pass which seems to re-organize the basic blocks, and we assume
that SI_KILL_TERMINATOR is always the last instruction inside a
basic block. Thus, after inserting a skip block we just go to the
next BB without looking at the subsequent instructions after the
kill, and the s_branch op is never removed.
Instead, we should remove the unconditional out branches and simply
skip the two instructions when the exec mask is non-zero.
This patch fixes the GPU hang and doesn't introduce any regressions
with "make check".
Dehao Chen [Tue, 24 Jan 2017 21:05:51 +0000 (21:05 +0000)]
Explicitly promote indirect calls before sample profile annotation.
Summary: In iterative sample PGO, where the profile is collected from a PGO-optimized binary, we may see indirect call targets promoted and inlined in the profile. Before profile annotation, we need to make the same promotion and inlining happen on the IR in order to annotate it correctly. This patch explicitly promotes these indirect calls and inlines them before profile annotation.
Demangle: correct demangling for CV-qualified functions
When demangling a CV-qualified function type with a final reference type
parameter, we would accidentally treat the reference type parameter as an
r-value ref. This would result in the improper decoration of the
function type itself.
Ivan Krasin [Tue, 24 Jan 2017 19:58:59 +0000 (19:58 +0000)]
Revert [AMDGPU][mc][tests][NFC] Add coverage/smoke tests for Gfx7 and Gfx8.
Reason: broke ASAN bots with a global buffer overflow.
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-fast/builds/2291
Each test contains 20-30K test cases but takes only several (from 4 to 10)
seconds to complete on an average machine. The tests cover the majority of
AMDGPU Gfx7/Gfx8 instructions, including many dark corners, and are intended
to quickly find out if something is broken.
Daniel Berlin [Tue, 24 Jan 2017 19:55:36 +0000 (19:55 +0000)]
Remove the load hoisting code of MLSM, it is completely subsumed by GVNHoist
Summary:
GVNHoist performs all the optimizations that MLSM does to loads, in a
more general way, and in a faster time bound (MLSM is N^3 in most
cases, N^4 in a few edge cases).
This disables the load portion.
Note that the way ld_hoist_st_sink.ll is written makes one think that
the loads should be moved to the while.preheader block, but
1. Neither MLSM nor GVNHoist do it (they both move them to identical places).
2. MLSM couldn't possibly do it anyway, as the while.preheader block
is not the head of the diamond, while.body is. (GVNHoist could do it
if it was legal).
3. At a glance, it's not legal anyway, because the in-loop loads
conflict with the in-loop store, so the loads must stay in-loop.
I am happy to update the test to use update_test_checks so that
checking is tighter; I was just going to do that as a followup.
Note that I can find no particular benefit to the store portion on any
real testcase/benchmark I have (even size-wise). If we really still
want it, I am happy to commit to writing a targeted store sinker, just
taking the code from the MemorySSA port of MergedLoadStoreMotion
(which is N^2 worst case, and N most of the time).
We can do what it does in a much better time bound.
We also should be both hoisting and sinking stores, not just sinking
them, anyway, since whether we should hoist or sink to merge depends
basically on luck of the draw of where the blockers are placed.
When demangling a CV-qualified function type with a final parameter of
reference type, we would insert the CV qualification on the parameter
rather than the function, and in the process adjust the insertion point
one position too far, splitting the type name. This change avoids doing
so, even though the attribution is still incorrect.
Regalloc creates COPY instructions which do not formally use VALU.
That results in v_mov instructions being moved to after an exec mask
modification. One pass which does this is SIOptimizeExecMasking, but
potentially it can be done by other passes too.
This patch adds a pass immediately after regalloc that adds an implicit
exec use operand to all VGPR copy instructions.
Sanjay Patel [Tue, 24 Jan 2017 17:03:24 +0000 (17:03 +0000)]
[InstSimplify] try to eliminate icmp Pred (add nsw X, C1), C2
I was surprised to see that we're missing icmp folds based on 'add nsw' in InstCombine,
but we should handle the InstSimplify cases first because that could make the InstCombine
code simpler.
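As a hedged sketch of the arithmetic involved (the helper name is hypothetical; the actual InstSimplify change may be structured differently): for signed predicates, 'add nsw' makes X + C1 the exact mathematical sum, so the compare constant can be folded across the add whenever C2 - C1 is itself representable.
#include "llvm/ADT/APInt.h"
// icmp pred (add nsw X, C1), C2  ->  icmp pred X, NewC  with NewC = C2 - C1.
static bool foldConstantAcrossNswAdd(const llvm::APInt &C1,
                                     const llvm::APInt &C2,
                                     llvm::APInt &NewC) {
  bool Overflow;
  NewC = C2.ssub_ov(C1, Overflow); // C2 - C1 with a signed-overflow check
  return !Overflow;                // only valid if the subtraction is exact
}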
Simon Pilgrim [Tue, 24 Jan 2017 16:56:23 +0000 (16:56 +0000)]
[X86][AVX2] Removed FIXME comment and regenerated test.
The comment talked about replacing vpmovzxwd+vpslld+vpsrad with vpmovsxwd, which isn't valid, as we're sign-extending an <8 x i1> bool vector, not an all/no-bits <8 x i16>.
Geoff Berry [Tue, 24 Jan 2017 16:36:07 +0000 (16:36 +0000)]
[SelectionDAG] Handle inverted conditions when splitting into multiple branches.
Summary:
When conditional branches with complex conditions are split into
multiple branches in SelectionDAGBuilder::FindMergedConditions, also
handle inverted conditions. These may sometimes appear without having
been optimized by InstCombine when CodeGenPrepare decides to sink and
duplicate cmp instructions, causing them to have only one use. This
problem can be exacerbated by e.g. GVNHoist hiding more cmps from
InstCombine by combining equivalent cmps from different blocks.
For example, codegen X & !(Y | Z) as:
jmp_if_X TmpBB
jmp FBB
TmpBB:
jmp_if_notY Tmp2BB
jmp FBB
Tmp2BB:
jmp_if_notZ TBB
jmp FBB
Chandler Carruth [Tue, 24 Jan 2017 12:55:57 +0000 (12:55 +0000)]
[PM] Replace uses of AssertingVH from members of analysis results with
a lazy-asserting PoisoningVH.
AssertingVH is fundamentally incompatible with cache invalidation of
analysis results. The invalidation happens after the AssertingVH has
already fired. Instead, use a PoisoningVH that will assert if the
dangling handle is ever used rather than merely be assigned or
destroyed.
This patch also removes all of the (numerous) doomed attempts to work
around this fundamental incompatibility. It is a pretty significant
simplification IMO.
The most interesting change is in the Inliner where we still do some
clearing because we don't want to rely on the coarse grained
invalidation strategy of the containing pass manager. However, I prefer
the approach that contains this logic to the cleanup phase of the
Inliner, and I think we could enhance the CGSCC analysis management
layer to make this even better in the future if desired.
The rest is straight cleanup.
I've also added a test for one of the harder cases to work around: when
a *module analysis* contains many AssertingVHes pointing at functions.
Chandler Carruth [Tue, 24 Jan 2017 12:34:47 +0000 (12:34 +0000)]
[PM] Introduce a PoisoningVH as a (more expensive) alternative to
AssertingVH that delays any reported error until the handle is *used*.
This allows data structures to contain handles which become dangling
provided the data structure is cleaned up afterward rather than used for
anything interesting.
The implementation is moderately horrible in part because it works to
leave AssertingVH in place, undisturbed. If at some point there is
consensus that this is simply how AssertingVH should be used, it can be
substantially simplified.
This remains a boring pointer in a non-asserts build as you would
expect. The only place we pay cost is in asserts builds.
I plan to use this as a basis for replacing the asserting VHs that
currently dangle in the new PM until invalidation occurs in both LVI and
SCEV.
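A hedged sketch of the contract (the instruction and its parent block are assumed):
#include "llvm/IR/Instruction.h"
#include "llvm/IR/ValueHandle.h"
static void poisoningVHExample(llvm::Instruction *I) {
  llvm::PoisoningVH<llvm::Instruction> VH(I);
  I->eraseFromParent(); // VH now dangles; merely dangling is allowed
  // llvm::Instruction *P = VH; // a *use* like this would assert
} // destroying the dangling handle on scope exit is also fine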
Artem Tamazov [Tue, 24 Jan 2017 12:22:01 +0000 (12:22 +0000)]
[AMDGPU][mc][tests][NFC] Add coverage/smoke tests for Gfx7 and Gfx8.
Each test contains 20-30K test cases but takes only several (from 4 to 10)
seconds to complete on an average machine. The tests cover the majority of
AMDGPU Gfx7/Gfx8 instructions, including many dark corners, and are intended
to quickly find out if something is broken.
Pavel Labath [Tue, 24 Jan 2017 11:35:26 +0000 (11:35 +0000)]
Fix fs::set_current_path unit test
The test fails when there is a symlink on the path because then the path
returned by current_path will not match the one we have set. Instead of
doing a string match, check the unique id of the two files.
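A hedged sketch of the check (fixture names are assumed; ASSERT_NO_ERROR is the helper macro these unit tests use):
#include "llvm/Support/FileSystem.h"
#include "gtest/gtest.h"
// Compare file identity rather than path strings, so a symlinked temp
// directory no longer fails the test.
llvm::sys::fs::UniqueID Expected, Actual;
ASSERT_NO_ERROR(llvm::sys::fs::getUniqueID(TestDirectory, Expected));
ASSERT_NO_ERROR(llvm::sys::fs::getUniqueID(ReportedPath, Actual));
EXPECT_EQ(Expected, Actual);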
Pavel Labath [Tue, 24 Jan 2017 10:57:01 +0000 (10:57 +0000)]
[Support] Use O_CLOEXEC only when declared
Summary:
Use the O_CLOEXEC flag only when it is available. Some old systems (e.g.
SLES10) do not support this flag. POSIX explicitly guarantees that this
flag can be checked for using #if, so there is no need for a CMake
check.
In case O_CLOEXEC is not supported, fall back to fcntl(FD_CLOEXEC)
instead.
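A hedged sketch of the pattern (not the exact Unix/Path.inc code):
#include <fcntl.h>
// POSIX guarantees O_CLOEXEC may be tested with the preprocessor, so no
// configure-time (CMake) check is needed.
static int openCloseOnExec(const char *Path) {
#ifdef O_CLOEXEC
  return open(Path, O_RDONLY | O_CLOEXEC);
#else
  int FD = open(Path, O_RDONLY);
  if (FD >= 0)
    fcntl(FD, F_SETFD, FD_CLOEXEC); // fallback: set close-on-exec after open
  return FD;
#endif
}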
Summary:
This adds a cross-platform way of setting the current working directory
analogous to the existing current_path() function used for retrieving
it. The function will be used in lldb.
Greg Parker [Tue, 24 Jan 2017 09:58:02 +0000 (09:58 +0000)]
[lit] Allow boolean expressions in REQUIRES and XFAIL and UNSUPPORTED
A `lit` condition line is now a comma-separated list of boolean expressions.
Comma-separated expressions act as if each expression were on its own
condition line:
For REQUIRES, if every expression is true then the test will run.
For UNSUPPORTED, if every expression is false then the test will run.
For XFAIL, if every expression is false then the test is expected to succeed.
As a special case "XFAIL: *" expects the test to fail.
Examples:
# Test is expected to fail on 64-bit Apple simulators and pass everywhere else
XFAIL: x86_64 && apple && !macosx
# Test is unsupported on Windows and on non-Ubuntu Linux
# and supported everywhere else
UNSUPPORTED: linux && !ubuntu, system-windows
Syntax:
* '&&', '||', '!', '(', ')'. 'true' is true. 'false' is false.
* Each test feature is a true identifier.
* Substrings of the target triple are true identifiers for UNSUPPORTED
and XFAIL, but not for REQUIRES. (This matches the current behavior.)
* All other identifiers are false.
* Identifiers are [-+=._a-zA-Z0-9]+
Greg Parker [Tue, 24 Jan 2017 08:45:50 +0000 (08:45 +0000)]
[lit] Allow boolean expressions in REQUIRES and XFAIL and UNSUPPORTED
A `lit` condition line is now a comma-separated list of boolean expressions.
Comma-separated expressions act as if each expression were on its own
condition line:
For REQUIRES, if every expression is true then the test will run.
For UNSUPPORTED, if every expression is false then the test will run.
For XFAIL, if every expression is false then the test is expected to succeed.
As a special case "XFAIL: *" expects the test to fail.
Examples:
# Test is expected to fail on 64-bit Apple simulators and pass everywhere else
XFAIL: x86_64 && apple && !macosx
# Test is unsupported on Windows and on non-Ubuntu Linux
# and supported everywhere else
UNSUPPORTED: linux && !ubuntu, system-windows
Syntax:
* '&&', '||', '!', '(', ')'. 'true' is true. 'false' is false.
* Each test feature is a true identifier.
* Substrings of the target triple are true identifiers for UNSUPPORTED
and XFAIL, but not for REQUIRES. (This matches the current behavior.)
* All other identifiers are false.
* Identifiers are [-+=._a-zA-Z0-9]+
Lang Hames [Tue, 24 Jan 2017 06:13:47 +0000 (06:13 +0000)]
[Orc][RPC] Refactor ParallelCallGroup to decouple it from RPCEndpoint.
This refactor allows parallel calls to be made via an arbitrary async call
dispatcher. In particular, this allows ParallelCallGroup to be used with
derived RPC classes that expose custom async RPC call operations.
Serge Pavlov [Tue, 24 Jan 2017 05:52:07 +0000 (05:52 +0000)]
Make VerifyDomInfo and VerifyLoopInfo global variables
Verifications of dominator tree and loop info are expensive operations
so they are disabled by default. They can be enabled by command line
options -verify-dom-info and -verify-loop-info. These options however
enable checks only in files Dominators.cpp and LoopInfo.cpp. If some
transformation changes dominator tree and/or loop info, it would be
convenient to place similar checks to the files implementing the
transformation.
This change makes corresponding flags global, so they can be used in
any file to optionally turn verification on.
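A hedged usage sketch (the declaring header and the verification entry point are assumptions; DT is the pass's DominatorTree):
#include "llvm/IR/Dominators.h" // assumed to declare extern bool VerifyDomInfo
// Inside a transformation, after mutating the CFG:
if (llvm::VerifyDomInfo)
  DT.verifyDomTree(); // honors -verify-dom-info outside Dominators.cpp too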
Jonas Paulsson [Tue, 24 Jan 2017 05:43:03 +0000 (05:43 +0000)]
[SystemZ] Gracefully fail in GeneralShuffle::add() instead of asserting.
The GeneralShuffle::add() method used to have an assert that made sure that
source elements were at least as big as the destination elements. This was
wrong, since it is actually expected that an EXTRACT_VECTOR_ELT node with a
smaller source element type than the return type gets extended.
Therefore, instead of asserting this, it is now just checked, and if this is
the case, 'false' is returned from the GeneralShuffle::add() method. This case
should be very rare and is not handled further by the backend.
Allow DenseSet::iterators to be converted to and compared with const_iterators
Summary:
This seemed to be an oversight seeing as DenseMap has these conversions.
This patch does the following:
- Adds a default constructor to the iterators.
- Allows DenseSet::ConstIterators to be copy constructed from DenseSet::Iterators
- Allows mutual comparison between Iterators and ConstIterators.
All of these are available in the DenseMap implementation, so the implementation here is trivial.
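A hedged usage sketch of the new conversions:
#include "llvm/ADT/DenseSet.h"
static void denseSetIteratorExample() {
  llvm::DenseSet<int> S;
  S.insert(1);
  llvm::DenseSet<int>::iterator I = S.find(1);
  // Copy-construct a const_iterator from an iterator...
  llvm::DenseSet<int>::const_iterator CI = I;
  // ...and compare the two directly, as DenseMap already allows.
  bool Same = (CI == I);
  (void)Same;
}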
Craig Topper [Tue, 24 Jan 2017 02:36:59 +0000 (02:36 +0000)]
[SelectionDAG] Teach getNode to simplify a couple easy cases of EXTRACT_SUBVECTOR
Summary:
This teaches getNode to simplify extracting from Undef. This is similar to what is done for EXTRACT_VECTOR_ELT. It also adds support for extracting from CONCAT_VECTOR when we can reuse one of the inputs to the concat. These seem like simple non-target specific optimizations.
For X86 we currently handle undef in extractSubvector, but not all EXTRACT_SUBVECTOR creations go through there.
Ultimately, my motivation here is to simplify extractSubvector and remove custom lowering for EXTRACT_SUBVECTOR since we don't do anything but handle undef and BUILD_VECTOR optimizations, but those should be DAG combines.
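Roughly, as a hedged sketch of the two cases (operand names follow getNode's convention, N1 being the source vector and N2 the constant start index; this is not the verbatim patch):
// Inside getNode, for ISD::EXTRACT_SUBVECTOR:
if (N1.isUndef())
  return getUNDEF(VT); // any subvector of undef is undef
// If the source is a CONCAT_VECTORS and the extracted type matches one
// concatenated input exactly, reuse that input instead of extracting.
if (N1.getOpcode() == ISD::CONCAT_VECTORS &&
    VT == N1.getOperand(0).getValueType()) {
  unsigned NumElts = VT.getVectorNumElements();
  if (auto *CIdx = dyn_cast<ConstantSDNode>(N2)) {
    uint64_t Idx = CIdx->getZExtValue();
    if (Idx % NumElts == 0) // must start on an input boundary
      return N1.getOperand(Idx / NumElts);
  }
}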
Craig Topper [Tue, 24 Jan 2017 02:10:15 +0000 (02:10 +0000)]
[APInt] Remove calls to clearUnusedBits from XorSlowCase and operator^=
Summary:
There's a comment in XorSlowCase that says "0^0==1" which isn't true. 0 xored with 0 is still 0. So I don't think we need to clear any unused bits here.
Now there is no difference between XorSlowCase and AndSlowCase/OrSlowCase other than the operation being performed
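For illustration, a trivial standalone check (not the patch itself):
#include "llvm/ADT/APInt.h"
#include <cassert>
int main() {
  // A width over 64 bits exercises the multi-word (slow case) path.
  llvm::APInt A(100, 0), B(100, 0);
  assert((A ^ B) == 0 && "0 xor 0 is 0, not 1; no unused bits get set");
  return 0;
}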
Matthias Braun [Tue, 24 Jan 2017 01:12:58 +0000 (01:12 +0000)]
LiveIntervalAnalysis: Calculate liveness even if a superreg is reserved.
A register unit may be allocatable and non-reserved but some of the
register(tuples) built with it are reserved. We still need to calculate
liveness in this case.
Note to out of tree targets: If you start seeing machine verifier errors
with this commit, it probably means that you do not properly mark super
registers of reserved registers as reserved. See r292836 or r292870 for
examples of how to fix that.
Matthias Braun [Tue, 24 Jan 2017 01:12:30 +0000 (01:12 +0000)]
PowerPC: Mark super regs of reserved regs reserved.
When a register like R1 is reserved, X1 should be reserved as well. This
was already done "manually" when 64-bit code was enabled; however, using
the markSuperRegs() function on the base register is more convenient and
allows using the checksAllSuperRegsMarked() function even in 32-bit mode
to avoid accidental breakage in the future.
This is also necessary to allow https://reviews.llvm.org/D28881
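A hedged sketch of the getReservedRegs() change (the exact reserved set varies):
// In PPCRegisterInfo::getReservedRegs (sketch):
BitVector Reserved(getNumRegs());
// Marking the base register also marks every super-register, so
// reserving R1 reserves X1 in 32-bit and 64-bit mode alike.
markSuperRegs(Reserved, PPC::R1);
assert(checksAllSuperRegsMarked(Reserved) &&
       "a reserved register must not have unreserved super-registers");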
[sanitizer-coverage] Emit __sanitizer_cov_trace_pc_guard without a preceding 'if' by default. Update the docs, and add deprecation notes around other parts of sanitizer coverage.
Running non-LCSSA-preserving LoopSimplify followed by LCSSA on (roughly) the
same loop is incorrect, since LoopSimplify may break LCSSA arbitrarily higher
in the loop nest. Instead, run LCSSA first, and then run LCSSA-preserving
LoopSimplify on the result.
David L. Jones [Mon, 23 Jan 2017 23:16:46 +0000 (23:16 +0000)]
[Analysis] Add LibFunc_ prefix to enums in TargetLibraryInfo. (NFC)
Summary:
The LibFunc::Func enum holds enumerators named for libc functions.
Unfortunately, there are real situations, including libc implementations, where
function names are actually macros (musl uses "#define fopen64 fopen", for
example; any other transitively visible macro would have similar effects).
Strictly speaking, a conforming C++ Standard Library should provide any such
macros as functions instead (via <cstdio>). However, there are some "library"
functions which are not part of the standard, and thus not subject to this
rule (fopen64, for example). So, in order to be both portable and consistent,
the enum should not use the bare function names.
The old enum naming used a namespace LibFunc and an enum Func, with bare
enumerators. This patch changes LibFunc to be an enum with enumerators prefixed
with "LibFFunc_". (Unfortunately, a scoped enum is not sufficient to override
macros.)
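A hedged usage sketch of the renamed enumerators:
#include "llvm/Analysis/TargetLibraryInfo.h"
// 'LibFunc_fopen64' survives a libc doing '#define fopen64 fopen', since
// the prefixed spelling is never a bare libc function name.
static bool isFopen64(const llvm::Function &F,
                      const llvm::TargetLibraryInfo &TLI) {
  llvm::LibFunc TheLibFunc;
  return TLI.getLibFunc(F, TheLibFunc) &&
         TheLibFunc == llvm::LibFunc_fopen64;
}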
[RDF] Add registers to live set even if they are live already
When calculating kills, a register may be considered live because a part
of it is live, but if there is a use of that (whole) register, the whole
register (and its subregisters) need to be added to the live set.
[libFuzzer] Mutate empty input using the regular mutators (instead of a custom dummy one). This way, when we mutate an empty input, there is a chance we will get a dictionary word.
Tim Shen [Mon, 23 Jan 2017 22:39:35 +0000 (22:39 +0000)]
[APFloat] Switch from (PPCDoubleDoubleImpl, IEEEdouble) layout to (IEEEdouble, IEEEdouble)
Summary:
This patch changes the layout of DoubleAPFloat, and adjust all
operations to do either:
1) (IEEEdouble, IEEEdouble) -> (uint64_t, uint64_t) -> PPCDoubleDoubleImpl,
then run the old algorithm.
2) Do the right thing directly.
I could split this into two patches, e.g. use
1) for all operations first, then incrementally change some of them to
2). I didn't do that, because 1) involves code that converts data between
PPCDoubleDoubleImpl and (IEEEdouble, IEEEdouble) back and forth, and may
pessimize the compiler. Instead, I find easy functions and use
approach 2) for them directly.
Next step is to move multiply and divide from 1) to 2). I don't
have plans for other functions in 1).
Kevin Enderby [Mon, 23 Jan 2017 21:13:29 +0000 (21:13 +0000)]
Add support for the x86_thread_state32_t, and in llvm-objdump for
Mach-O files add the printing of the x86_thread_state32_t in the
same format as otool-classic(1) on darwin.
To do this, the 32-bit x86 general thread state needed to be defined
in include/llvm/Support/MachO.h.