Reid Kleckner [Wed, 28 Dec 2016 19:05:12 +0000 (19:05 +0000)]
[WinEH] Don't assume endFunction is called while in .text
Jump table emission can switch to .rdata before
WinException::endFunction gets called. Just remember the appropriate
text section we started in and reset back to it when we end the
function. We were already switching sections back from .xdata anyway.
Fixes the first problem in PR31488, so that now COFF switch tables can
live in .rdata if we want them to.
Chandler Carruth [Wed, 28 Dec 2016 11:07:33 +0000 (11:07 +0000)]
[PM] Introduce a devirtualization iteration layer for the new PM.
This is an orthogonal and separated layer instead of being embedded
inside the pass manager. While it adds a small amount of complexity, it
is fairly minimal and the composability and control seems worth the
cost.
The logic for this ends up being nicely isolated and targeted. It should
be easy to experiment with different iteration strategies wrapped around
the CGSCC bottom-up walk using this kind of facility.
The mechanism used to track devirtualization is the simplest one I came
up with. I think it handles most of the cases the existing iteration
machinery handles, but I haven't done a *very* in depth analysis. It
does however match the basic intended semantics, and we can tweak or
tune its exact behavior incrementally as necessary. One thing that we
may want to revisit is freshly building the value handle set on each
iteration. While I don't think this will be a significant cost (it is
strictly fewer value handles but more churn of value handes than the old
call graph), it is conceivable that we'll want a somewhat more clever
tracking mechanism. My hope is to layer that on as a follow up patch
with data supporting any implementation complexity it adds.
This code also provides for a basic count heuristic: if the number of
indirect calls decreases and the number of direct calls increases for
a given function in the SCC, we assume devirtualization is responsible.
This matches the heuristics currently used in the legacy pass manager.
Chandler Carruth [Wed, 28 Dec 2016 10:34:50 +0000 (10:34 +0000)]
[PM] Teach the CGSCC's CG update utility to more carefully invalidate
analyses when we're about to break apart an SCC.
We can't wait until after breaking apart the SCC to invalidate things:
1) Which SCC do we then invalidate? All of them?
2) Even if we invalidate all of them, a newly created SCC may not have
a proxy that will convey the invalidation to functions!
Previously we only invalidated one of the SCCs and too late. This led to
stale analyses remaining in the cache. And because the caching strategy
actually works, they would get used and chaos would ensue.
Doing invalidation early is somewhat pessimizing though if we *know*
that the SCC structure won't change. So it turns out that the design to
make the mutation API force the caller to know the *kind* of mutation in
advance was indeed 100% correct and we didn't do enough of it. So this
change also splits two cases of switching a call edge to a ref edge into
two separate APIs so that callers can clearly test for this and take the
easy path without invalidating when appropriate. This is particularly
important in this case as we expect most inlines to be between functions
in separate SCCs and so the common case is that we don't have to so
aggressively invalidate analyses.
The LCG API change in turn needed some basic cleanups and better testing
in its unittest. No interesting functionality changed there other than
more coverage of the returned sequence of SCCs.
While this seems like an obvious improvement over the current state, I'd
like to revisit the core concept of invalidating within the CG-update
layer at all. I'm wondering if we would be better served forcing the
callers to handle the invalidation beforehand in the cases that they
can handle it. An interesting example is when we want to teach the
inliner to *update and preserve* analyses. But we can cross that bridge
when we get there.
With this patch, the new pass manager an build all of the LLVM test
suite at -O3 and everything passes. =D I haven't bootstrapped yet and
I'm sure there are still plenty of bugs, but this gives a nice baseline
so I'm going to increasingly focus on fleshing out the missing
functionality, especially the bits that are just turned off right now in
order to let us establish this baseline.
Gadi Haber [Wed, 28 Dec 2016 10:12:48 +0000 (10:12 +0000)]
This is a large patch for X86 AVX-512 of an optimization for reducing code size by encoding EVEX AVX-512 instructions using the shorter VEX encoding when possible.
There are cases of AVX-512 instructions that have two possible encodings. This is the case with instructions that use vector registers with low indexes of 0 - 15 and do not use the zmm registers or the mask k registers.
The EVEX encoding prefix requires 4 bytes whereas the VEX prefix can take only up to 3 bytes. Consequently, using the VEX encoding for these instructions results in a code size reduction of ~2 bytes even though it is compiled with the AVX-512 features enabled.
Reviewers: Craig Topper, Zvi Rackoover, Elena Demikhovsky
Differential Revision: https://reviews.llvm.org/D27901
Chandler Carruth [Wed, 28 Dec 2016 03:13:12 +0000 (03:13 +0000)]
[PM] Teach the inliner's call graph update to handle inserting new edges
when they are call edges at the leaf but may (transitively) be reached
via ref edges.
It turns out there is a simple rule: insert everything as a ref edge
which is a safe conservative default. Then we let the existing update
logic handle promoting some of those to call edges.
Note that it would be fairly cheap to make these call edges right away
if that is desirable by testing whether there is some existing call path
from the source to the target. It just seemed like slightly more
complexity in this code path that isn't strictly necessary. If anyone
feels strongly about handling this differently I'm happy to change it.
Chandler Carruth [Wed, 28 Dec 2016 02:24:55 +0000 (02:24 +0000)]
[PM] Disable the loop vectorizer from the new PM's pipeline as it
currenty relies on the old PM's dependency system forming LCSSA.
The new PM will require a different design for this, and for now this is
causing most of the issues I'm currently seeing in testing. I'd like to
get to a testable baseline and then work on re-enabling things one at
a time.
[libFuzzer] add an experimental flag -experimental_len_control=1 that sets max_len to 1M and tries to increases the actual max sizes of mutations very gradually (second attempt)
Replace the use of grep with FileCheck. Tidy up some of the tests. A
few of the tests have been left as weak as previously, though some have
been made more stringent.
Chandler Carruth [Tue, 27 Dec 2016 18:04:11 +0000 (18:04 +0000)]
[PM] Remove a pointless optimization.
There is no need to do this within an analysis. That method shouldn't
even be reached if this predicate holds as the actual useful
optimization is in the analysis manager itself.
Chandler Carruth [Tue, 27 Dec 2016 17:59:22 +0000 (17:59 +0000)]
[PM] Add more dedicated testing to cover the invalidation logic added to
BasicAA in r290603.
I've kept the basic testing in the new PM test file as that also covers
the AAManager invalidation logic. If/when there is a good place for
broader AA testing it could move there.
This test is somewhat unsatisfying as I can't get it to fail even with
ASan outside of explicit checks of the invalidation. Apparently we don't
yet have any test coverage of the BasicAA code paths using either the
domtree or loopinfo -- I made both of them always be null and check-llvm
passed.
Teresa Johnson [Tue, 27 Dec 2016 17:45:09 +0000 (17:45 +0000)]
[ThinLTO] Fix "||" vs "|" mixup.
The effect of the bug was that we would incorrectly create summaries
for global and weak values defined in module asm (since we were
essentially testing for bit 1 which is SF_Undefined, and the
RecordStreamer ignores local undefined references). This would have
resulted in conservatively disabling importing of anything referencing
globals and weaks defined in module asm. Added these cases to the test
which now fails without this bug fix.
Artem Tamazov [Tue, 27 Dec 2016 16:00:11 +0000 (16:00 +0000)]
[AMDGPU][llvm-mc] Predefined symbols to access register counts (.kernel.{v|s}gpr_count)
The feature allows for conditional assembly, filling the entries
of .amd_kernel_code_t etc.
Symbols are defined with value 0 at the beginning of each kernel scope.
After each register usage, the respective symbol is set to:
value = max( value, ( register index + 1 ) )
Thus, at the end of scope the value represents a count of used registers.
Kernel scopes begin at .amdgpu_hsa_kernel directive, end at the
next .amdgpu_hsa_kernel (or EOF, whichever comes first). There is also
dummy scope that lies from the beginning of source file til the
first .amdgpu_hsa_kernel.
Piotr Padlewski [Tue, 27 Dec 2016 15:06:07 +0000 (15:06 +0000)]
[MemDep] Operand visited twice bugfix
Because operand was not marked as seen it was visited twice.
It doesn't change behavior of optimization, it just saves redudant
visit, so no test changes.
Chandler Carruth [Tue, 27 Dec 2016 10:30:45 +0000 (10:30 +0000)]
[PM] Teach BasicAA how to invalidate its result object.
This requires custom handling because BasicAA caches handles to other
analyses and so it needs to trigger indirect invalidation.
This fixes one of the common crashes when using the new PM in real
pipelines. I've also tweaked a regression test to check that we are at
least handling the most immediate case.
I'm going to work at re-structuring this test some to both scale better
(rather than all being in one file) and check more invalidation paths in
a follow-up commit, but I wanted to get the basic bug fix in place.
Chandler Carruth [Tue, 27 Dec 2016 10:16:46 +0000 (10:16 +0000)]
[PM] Disable more of the loop passes -- LCSSA and LoopSimplify are also
not really wired into the loop pass manager in a way that will let us
productively use these passes yet.
This lets the new PM get farther in basic testing which is useful for
establishing a good baseline of "doesn't explode". There are still
plenty of crashers in basic testing though, this just gets rid of some
noise that is well understood and not representing a specific or narrow
bug.
Chandler Carruth [Tue, 27 Dec 2016 08:44:39 +0000 (08:44 +0000)]
[PM] Teach the AAManager and AAResults layer (the worst offender for
inter-analysis dependencies) to use the new invalidation infrastructure.
This teaches it to invalidate itself when any of the peer function
AA results that it uses become invalid. We do this by just tracking the
originating IDs. I've kept it in a somewhat clunky API since some users
of AAResults are outside the new PM right now. We can clean this API up
if/when those users go away.
Secondly, it uses the registration on the outer analysis manager proxy
to trigger deferred invalidation when a module analysis result becomes
invalid.
I've included test cases that specifically try to trigger use-after-free
in both of these cases and they would crash or hang pretty horribly for
me even without ASan. Now they work nicely.
The `InvalidateAnalysis` utility pass required some tweaking to be
useful in this context and it still is pretty garbage. I'd like to
switch it back to the previous implementation and teach the explicit
invalidate method on the AnalysisManager to take care of correctly
triggering indirect invalidation, but I wanted to go ahead and send this
out so folks could see how all of this stuff works together in practice.
And, you know, that it does actually work. =]
Chandler Carruth [Tue, 27 Dec 2016 08:40:39 +0000 (08:40 +0000)]
[PM] Introduce the facilities for registering cross-IR-unit dependencies
that require deferred invalidation.
This handles the other real-world invalidation scenario that we have
cases of: a function analysis which caches references to a module
analysis. We currently do this in the AA aggregation layer and might
well do this in other places as well.
Since this is relative rare, the technique is somewhat more cumbersome.
Analyses need to register themselves when accessing the outer analysis
manager's proxy. This proxy is already necessarily present to allow
access to the outer IR unit's analyses. By registering here we can track
and trigger invalidation when that outer analysis goes away.
To make this work we need to enhance the PreservedAnalyses
infrastructure to support a (slightly) more explicit model for "sets" of
analyses, and allow abandoning a single specific analyses even when
a set covering that analysis is preserved. That allows us to describe
the scenario of preserving all Function analyses *except* for the one
where deferred invalidation has triggered.
We also need to teach the invalidator API to support direct ID calls
instead of always going through a template to dispatch so that we can
just record the ID mapping.
I've introduced testing of all of this both for simple module<->function
cases as well as for more complex cases involving a CGSCC layer.
Much like the previous patch I've not tried to fully update the loop
pass management layer because that layer is due to be heavily reworked
to use similar techniques to the CGSCC to handle updates. As that
happens, we'll have a better testing basis for adding support like this.
Many thanks to both Justin and Sean for the extensive reviews on this to
help bring the API design and documentation into a better state.
Chandler Carruth [Tue, 27 Dec 2016 07:18:43 +0000 (07:18 +0000)]
[PM] Turn on the new PM's inliner in addition to the current one for
most of the inliner test cases.
The inliner involves a bunch of interesting code and tends to be where
most of the issues I've seen experimenting with the new PM lie. All of
these test cases pass, but I'd like to keep some more thorough coverage
here so doing a fairly blanket enabling.
There are a handful of interesting tests I've not enabled yet because
they're focused on the always inliner, or on functionality that doesn't
(yet) exist in the inliner.
Chandler Carruth [Tue, 27 Dec 2016 06:46:20 +0000 (06:46 +0000)]
[PM] Add one of the features left out of the initial inliner patch:
skipping indirectly recursive inline chains.
To do this, we implicitly build an inline stack for each callsite and
check prior to inlining that doing so would not form a cycle. This uses
the exact same technique and even shares some code with the legacy PM
inliner.
This solution remains deeply unsatisfying to me because it means we
cannot actually iterate the inliner externally. Doing so would not be
able to easily detect and avoid such cycles. Some day I would very much
like to have a solution that works without this internal state to detect
cycles, but this is not that day.
Chandler Carruth [Tue, 27 Dec 2016 06:46:16 +0000 (06:46 +0000)]
[PM] Wire up another test to the new pass manager.
Nothing really interesting here, but I had to improve the test to use
variables rather than hard coding value names as we happen to end up
with different value names in the new PM.
[Analysis] Ignore `nobuiltin` on `allocsize` function calls.
We currently ignore the `allocsize` attribute on functions calls with
the `nobuiltin` attribute when trying to lower `@llvm.objectsize`. We
shouldn't care about `nobuiltin` here: `allocsize` is explicitly added
by the user, not inferred based on a function's symbol.
This also makes us no longer check for `allocsize` on intrinsic calls.
This shouldn't matter, since intrinsics should provide the information
we get from `allocsize` on their own.
Chandler Carruth [Tue, 27 Dec 2016 05:00:45 +0000 (05:00 +0000)]
[LCG] Teach the LazyCallGraph to handle visiting the blockaddress
constant expression and to correctly form function reference edges
through them without crashing because one of the operands (the
`BasicBlock` isn't actually a constant despite being an operand of
a constant).
Craig Topper [Tue, 27 Dec 2016 03:46:05 +0000 (03:46 +0000)]
[AVX-512] Add 512-bit unmasked intrinsics for pmuldq and pmuludq so we can add them to InstCombine with the 128 and 256 bit versions.
The 128 and 256 bit masked intrinsics are currently unused by clang. The sse and avx2 unmasked intrinsics are used instead. The new 512-bit intrinsic will be used to do the same. Then all masked versions will removed and autoupgraded.
Chandler Carruth [Tue, 27 Dec 2016 02:47:37 +0000 (02:47 +0000)]
[Inliner] Modernize all of the inliner tests that were using grep.
This mostly involved converting from grep to FileCheck and tidying up
the IR used.
In one case (invoke_test-3.ll) the test had become completely pointless
as we use 'resume' rather than 'unwind' now, and even then it did not
occur at the end of the line.
Craig Topper [Tue, 27 Dec 2016 01:56:30 +0000 (01:56 +0000)]
[AVX-512][InstCombine] Teach InstCombine to turn masked scalar add/sub/mul/div with rounding intrinsics into normal IR operations if the rounding mode is CUR_DIRECTION.
An earlier commit added support for unmasked scalar operations. At that time isel wouldn't generate an optimal sequence for masked operations, but that has now been fixed.
Chandler Carruth [Tue, 27 Dec 2016 01:24:50 +0000 (01:24 +0000)]
[PM] Move the collection of call sites to a more appropriate place
inside of `InlineFunction`. Prior to this, call instructions are
specifically being rewritten and replaced within the inlined region,
invalidating some of the call sites.
Several of these regions are using the same technique to walk the
inlined region so this seems clearly safe up to this point.
I've also added a short circuit to the scan for call sites based on what
other code is doing.
With this, the most common crash I've found in the new inliner code is
fixed. I've turned it on for another test case that covers this
scenario.
I'll make my way through most of the other inliner test cases
just to get some easy coverage next.
Craig Topper [Tue, 27 Dec 2016 00:23:16 +0000 (00:23 +0000)]
[AVX-512][InstCombine] Teach InstCombine to turn packed add/sub/mul/div with rounding intrinsics into normal IR operations if the rounding mode is CUR_DIRECTION.
Chandler Carruth [Mon, 26 Dec 2016 23:43:27 +0000 (23:43 +0000)]
[PM] Teach the always inliner in the new pass manager to support
removing fully-dead comdats without removing dead entries in comdats
with live members.
This factors the core logic out of the current inliner's internals to
a reusable utility and leverages that in both places. The factored out
code should also be (minorly) more efficient in cases where we have very
few dead functions or dead comdats to consider.
I've added a test case to cover this behavior of the always inliner.
This is the last significant bug in the new PM's always inliner I've
found (so far).
Davide Italiano [Mon, 26 Dec 2016 18:10:09 +0000 (18:10 +0000)]
[NewGVN] Change test to reflect difference between GVN and NewGVN.
The current GVN algorithm folds unconditional branches to, it claims,
expose more PRE oportunities. The folding, if really needed,
(which is not sure, as it's not really proved it improves analysis)
can be done by an earlier cleanup pass instead of GVN itself.
Ack'ed/SGTM'd by Daniel Berlin.
Chandler Carruth [Mon, 26 Dec 2016 08:54:01 +0000 (08:54 +0000)]
Test the different scenarios of GlobalDCE and comdats more
systematically and document in the test what all is going on.
This replaces the PR-named test that was the only coverage for GlobalDCE
and comdats previously. I wrote this because I wasn't certain how
comdat DCE was supposed to work and wanted to step through what
GlobalDCE did to fully understand it. After talking to folks and reading
the code and really staring at things it all makes sense but it seemed
good to help write down some of this in a more explicit and fully
covering test case.
For example, it seemed like a bug that GlobalDCE didn't consider comdat
participation of ifuncs. Specifically it seemed like an accident because
testing didn't really cover that case. But in fact, ifuncs specifically
cannot participate in a comdat despite having that API. The new test
case covers this and explicitly documents that DCE gets to fire here
even though there are comdats involved.
Also, we didn't have any positive tests for the challenging cases such
as usage cycles between comdat participants that might make them seem
alive except that there is no external edge into the cycle.
Craig Topper [Mon, 26 Dec 2016 06:33:19 +0000 (06:33 +0000)]
[AVX-512][InstCombine] Teach InstCombine to turn scalar add/sub/mul/div with rounding intrinsics into normal IR operations if the rounding mode is CUR_DIRECTION.
Summary:
I only do this for unmasked cases for now because isel is failing to fold the mask. I'll try to fix that soon.
I'll do the same thing for packed add/sub/mul/div in a future patch.
Craig Topper [Sun, 25 Dec 2016 23:58:57 +0000 (23:58 +0000)]
[AVX-512][InstCombine] Teach InstCombine to converted masked vpermv intrinsics into shufflevector instructions
Summary:
This patch adds support for converting the masked vpermv intrinsics into shufflevector instructions if the indices are constants.
We also need to wrap a select instruction around the shuffle to take care of the masking part. InstCombine will take care of optimizing the select if the mask is constant so I didn't bother checking for that.
Chandler Carruth [Sun, 25 Dec 2016 23:41:14 +0000 (23:41 +0000)]
[ADT] Add a generic concatenating iterator and range (take 2).
This recommits r290512 that was reverted when MSVC failed to compile it. Since
then I've played with various approaches using rextester.com (where I was able
to reproduce the failure) and think that I have a solution thanks in part to
the help of Dave Blaikie! It seems MSVC just has a defective `decltype` in this
version. Manually writing out the type seems to do the trick, even though it is
.... quite complicated.
Original commit message:
This allows both defining convenience iterator/range accessors on types
which walk across N different independent ranges within the object, and
more direct and simple usages with range based for loops such as shown
in the unittest. The same facilities are used for both. They end up
quite small and simple as it happens.
I've also switched an iterator on `Module` to use this. I would like to
add another convenience iterator that includes even more sequences as
part of it and seeing this one already present motivated me to actually
abstract it away and introduce a general utility.
Lang Hames [Sun, 25 Dec 2016 21:55:05 +0000 (21:55 +0000)]
[Orc][RPC] Add a ParallelCallGroup utility for dispatching and waiting on
multiple asynchronous RPC calls.
ParallelCallGroup allows multiple asynchronous calls to be dispatched,
and provides a wait method that blocks until all asynchronous calls have
been executed on the remote and all return value handlers run on the
local machine.
This will allow, for example, the JIT client to issue memory allocation calls
for all sections in parallel, then block until all memory has been allocated
on the remote and the allocated addresses registered with the client, at which
point the JIT client can proceed to applying relocations.
Chandler Carruth [Sun, 25 Dec 2016 09:36:24 +0000 (09:36 +0000)]
Revert r290512: [ADT] Add a generic concatenating iterator and range.
This code doesn't work on MSVC for reasons that elude me and I've not
yet covinced a workaround to compile cleanly so reverting for now while
I play with it.
Chandler Carruth [Sun, 25 Dec 2016 08:22:50 +0000 (08:22 +0000)]
[ADT] Add a generic concatenating iterator and range.
This allows both defining convenience iterator/range accessors on types
which walk across N different independent ranges within the object, and
more direct and simple usages with range based for loops such as shown
in the unittest. The same facilities are used for both. They end up
quite small and simple as it happens.
I've also switched an iterator on `Module` to use this. I would like to
add another convenience iterator that includes even more sequences as
part of it and seeing this one already present motivated me to actually
abstract it away and introduce a general utility.
Mehdi Amini [Sun, 25 Dec 2016 04:22:54 +0000 (04:22 +0000)]
MetadataLoader: replace the tracking of ForwardReferences and UnresolvedNodes with a set-based solution (NFC)
This makes it explicit what is the exact list to handle, and it
looks much more easy to manipulate and understand that the
previous custom tracking of min/max to express the range where
to look for.