Adam Nemet [Wed, 1 Mar 2017 04:31:15 +0000 (04:31 +0000)]
[LV] These remark should have been missed remarks
The practice in LV is that we emit analysis remarks and then finally report
either a missed or applied remark on the final decision whether vectorization
is taking place. On this code path, we were closing with an analysis remark.
Mohammad Shahid [Wed, 1 Mar 2017 03:51:54 +0000 (03:51 +0000)]
[SLP] Fixes the bug due to absence of in order uses of scalars which needs to be available
for VectorizeTree() API.This API uses it for proper mask computation to be used in shufflevector IR.
The fix is to compute the mask for out of order memory accesses while building the vectorizable tree
instead of actual vectorization of vectorizable tree.
Matt Arsenault [Wed, 1 Mar 2017 03:36:04 +0000 (03:36 +0000)]
AMDGPU: Re-do update for branch-relaxation test
Modify the test so that it is still testing something
closer to what it was intended to originally.
I think the original intent was to test the situation where
there was a branch on execz and then unconditional branch
required relaxing.With the change in r296539,
there was no longer and execz branch.
Change the test so that there is now an execz branch inserted.
There is no longer an unconditional branch after the execz branch,
so this might need to be tricked in some other way to keep that
there.
Zachary Turner [Wed, 1 Mar 2017 01:04:16 +0000 (01:04 +0000)]
[PDB] Add an additional test for BinaryStreamRef.
A bug was uncovered where if you have a StreamRef whose ViewOffset
is > 0, then when you call readLongestContiguousChunk it will
succeed even when it shouldn't, and it always return you a
buffer that was taken as if the ViewOffset was 0.
Ahmed Bougacha [Wed, 1 Mar 2017 00:43:42 +0000 (00:43 +0000)]
[CodeGen] Remove dead FastISel code after SDAG emitted a tailcall.
When SDAGISel (top-down) selects a tail-call, it skips the remainder
of the block.
If, before that, FastISel (bottom-up) selected some of the (no-op) next
few instructions, we can end up with dead instructions following the
terminator (selected by SDAGISel).
We need to erase them, as we know they aren't necessary (in addition to
being incorrect).
We already do this when FastISel falls back on the tail-call itself.
Also remove the FastISel-emitted code if we fallback on the
instructions between the tail-call and the return.
Ahmed Bougacha [Wed, 1 Mar 2017 00:43:39 +0000 (00:43 +0000)]
[GlobalISel] Replace all combined G_EXTRACT uses.
Iterating on the use-list we're modifying doesn't work: after the first
iteration, the use-list iterator will point to a MachineOperand
referencing the new register. This caused us to skip the other uses to
replace.
Instead, use MRI.replaceRegWith(), which accounts for this behavior.
Adrian Prantl [Tue, 28 Feb 2017 23:48:42 +0000 (23:48 +0000)]
Teach the IR verifier to reject conflicting debug info for function arguments.
Conflicting debug info for function arguments causes hard-to-debug
assertions in the DWARF backend, so the Verifier should reject it.
For performance reasons this only checks function arguments from
non-inlined debug intrinsics for now.
Dan Gohman [Tue, 28 Feb 2017 23:37:04 +0000 (23:37 +0000)]
[WebAssembly] Convert the remaining unit tests to the new wasm-object-file target.
To facilitate this, add a new hidden command-line option to disable
the explicit-locals pass. That causes llc to emit invalid code that doesn't
have all locals converted to get_local/set_local, however it simplifies
testwriting in many cases.
Daniel Berlin [Tue, 28 Feb 2017 22:57:50 +0000 (22:57 +0000)]
Fix PR 24415 (at least), by making our post-dominator tree behavior sane.
Summary:
Currently, our post-dom tree tries to ignore and remove the effects of
infinite loops. It fails miserably at this, because it tries to do it
ahead of time, and thus can only detect self-loops, and any other type
of infinite loop, it pretends doesn't exist at all.
This can, in a bunch of cases, lead to wrong answers and a completely
empty post-dom tree.
This is a trivial modification of the testcase for PR 6047
Note that we pretend bb3.i doesn't exist.
We also pretend that bb35 post-dominates entry.
While it's true that it does not exit in a theoretical sense, it's not
really helpful to try to ignore the effect and pretend that bb35
post-dominates entry. Worse, we pretend the infinite loop does
nothing (it's usually considered a side-effect), and doesn't even
exist, even when it calls a function. Sadly, this makes it impossible
to use when you are trying to move code safely. All compilers also
create virtual or real single exit nodes (including us), and connect
infinite loops there (which this patch does). In fact, others have
worked around our behavior here, to the point of building their own
post-dom trees:
https://zneak.github.io/fcd/2016/02/17/structuring.html and pointing
out the region infrastructure is near-useless for them with postdom in
this state :(
bb2: ; No predecessors!
ret void
}
```
Printing analysis 'Post-Dominator Tree Construction' for function 'foo':
=============================--------------------------------
Inorder PostDominator Tree:
[1] <<exit node>> {0,1}
:(
(note that even if you ignore the effects of infinite loops, bb2
should be present as an exit node that post-dominates nothing).
This patch changes post-dom to properly handle infinite loops and does
root finding during calculation to prevent empty tress in such cases.
We match gcc's (and the canonical theoretical) behavior for infinite
loops (find the backedge, connect it to the exit block).
Testcases coming as soon as i finish running this on a ton of random graphs :)
Summary:
Update the XRay docs to mention new subcomands to the llvm-xray tool,
and details on FDR mode logging. Also list down available libraries for
use part of the LLVM distribution.
Kevin Enderby [Tue, 28 Feb 2017 21:47:07 +0000 (21:47 +0000)]
Actually add error handling to unpacking the dyld compact bind and
other tables. Providing a helpful error message to what the error is and
where the error occurred based on which opcode it was associated with.
There have been handful of bug fixes dealing with bad bind info in
object files, r294021 and r249845, which only put a band aid on the
problem after a bad bind table was created after unpacking from
its compact info. In these cases a bind table should have never been
created and an error should have simply been generated.
This change puts in place the plumbing to allow checking and returning
of an error when the compact info is unpacked. This follows the model
of iterators that can fail that Lang Hanes designed when fixing the problem
for bad archives r275316 (or r275361).
This change uses one of the existing test cases that now causes an
error instead of printing <<bad library ordinal>> after a bad bind table
is created. The error uses the offset into the opcode table as shown with
the macOS dyldinfo(1) tool to indicate where the error is and which
opcode and which parameter is in error.
For example the exiting test case has this lazy binding opcode table:
In the test case the binary only has one library so setting the library
ordinal to the value of 2 in the BIND_OPCODE_SET_DYLIB_ORDINAL_IMM
opcode at 0x0002 above is an error. This now produces this error message:
% llvm-objdump -lazy-bind bad-ordinal.macho-x86_64
…
llvm-objdump: 'bad-ordinal.macho-x86_64': truncated or malformed object (for BIND_OPCODE_SET_DYLIB_ORDINAL_ULEB bad library ordinal: 2 (max 1) for opcode at: 0x2)
This change provides the plumbing for the error handling and one example
of an error message. Other error checks and test cases will be added in follow
on commits.
Matt Arsenault [Tue, 28 Feb 2017 20:15:43 +0000 (20:15 +0000)]
AMDGPU: Add definitions for ds_{read|write}_b{96|128}
It's not clear to me if this is always better than
doing ds_write2_b64 This adds the constraint of
a 128-bit register input instead of a pair of
64-bit.
If during scheduling we have identified that we cannot keep optimistic
occupancy increase critical register pressure limit and try scheduling
of the whole function again. In this case blocks with smaller pressure
will have a chance for better scheduling.
Dehao Chen [Tue, 28 Feb 2017 18:09:44 +0000 (18:09 +0000)]
Add function importing info from samplepgo profile to the module summary.
Summary: For SamplePGO, the profile may contain cross-module inline stacks. As we need to make sure the profile annotation happens when all the hot inline stacks are expanded, we need to pass this info to the module importer so that it can import proper functions if necessary. This patch implemented this feature by emitting cross-module targets as part of function entry metadata. In the module-summary phase, the metadata is used to build call edges that points to functions need to be imported.
James Y Knight [Tue, 28 Feb 2017 18:05:41 +0000 (18:05 +0000)]
Workaround MSVC bug when using TrailingObjects from a template.
MSVC appears to be getting confused as to whether OverloadToken is
supposed to be public or not.
This was discovered by code in Swift, and has been reported to
microsoft by hughbe:
https://connect.microsoft.com/VisualStudio/feedback/details/3116517
This change introduces new method to estimate register pressure in
GCNScheduler. Standard RPTracker gives huge error due to the following
reasons:
1. It does not account for live-ins or live-outs if value is not used
in the region itself. That creates a huge error in a very common case
if there are a lot of live-thu registers.
2. It does not properly count subregs.
3. It assumes a register used as an input operand can be reused as an
output. This is not always possible by itself, this is not what RA
will finally do in many cases for various reasons not limited to RA's
inability to do so, and this is not so if the value is actually a
live-thu.
In addition we can now see clear separation between live-in pressure
which we cannot change with the scheduling and tentative pressure
which we can change.
Adrian Prantl [Tue, 28 Feb 2017 16:58:13 +0000 (16:58 +0000)]
Strip debug info when inlining into a nodebug function.
The LLVM backend cannot produce any debug info for an llvm::Function
without a DISubprogram attachment. When inlining a debug-info-carrying
function into a nodebug function, there is therefore no reason to keep
any debug info intrinsic calls or debug locations on the instructions.
Craig Topper [Tue, 28 Feb 2017 16:52:05 +0000 (16:52 +0000)]
[DAGISel] When checking if chain node is foldable, make sure the intermediate nodes have a single use across all results not just the result that was used to reach the chain node.
This recovers a test case that was severely broken by r296476, my making sure we don't create ADD/ADC that loads and stores when there is also a flag dependency.
[AMDGPU] Fix read-undef flags when schedule is reverted
If two subregs of the same register are defined and we need to revert
schedule changing def order, we will end up with both instructions
having def,read-undef flags because adjustLaneLiveness() will only set
this flag but will not remove it.
Fix this by removing read-undef flags before calling adjustLaneLiveness.
David Bozier [Tue, 28 Feb 2017 16:02:37 +0000 (16:02 +0000)]
[Stack Protection] Add diagnostic information for why stack protection was applied to a function
Stack Smash Protection is not completely free, so in hot code, the overhead it causes can cause performance issues. By adding diagnostic information for which functions have SSP and why, a user can quickly determine what they can do to stop SSP being applied to a specific hot function.
This change adds a remark that is reported by the stack protection code when an instruction or attribute is encountered that causes SSP to be applied.
Nirav Dave [Tue, 28 Feb 2017 14:24:15 +0000 (14:24 +0000)]
In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled.
Recommiting after fixup of 32-bit aliasing sign offset bug in DAGCombiner.
* Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through
simplified store merging search and chain alias analysis which only
checks for parallel stores through the chain subgraph. This is cleaner
as the separation of non-interfering loads/stores from the
store-merging logic.
When merging stores search up the chain through a single load, and
finds all possible stores by looking down from through a load and a
TokenFactor to all stores visited.
This improves the quality of the output SelectionDAG and the output
Codegen (save perhaps for some ARM cases where we correctly constructs
wider loads, but then promotes them to float operations which appear
but requires more expensive constant generation).
Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the chain aggregation in the merged stores across code
paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seems sufficient to not cause regressions in
tests.
5. Remove Chain dependencies of Memory operations on CopyfromReg
nodes as these are captured by data dependence
6. Forward loads-store values through tokenfactors containing
{CopyToReg,CopyFromReg} Values.
7. Peephole to convert buildvector of extract_vector_elt to
extract_subvector if possible (see
CodeGen/AArch64/store-merge.ll)
8. Store merging for the ARM target is restricted to 32-bit as
some in some contexts invalid 64-bit operations are being
generated. This can be removed once appropriate checks are
added.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable, improving load-store forwarding. One test in
particular is worth noting:
CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
forwarding converts a load-store pair into a parallel store and
a memory-realized bitcast of the same value. However, because we
lose the sharing of the explicit and implicit store values we
must create another local store. A similar transformation
happens before SelectionDAG as well.
Daniel Sanders [Tue, 28 Feb 2017 14:21:31 +0000 (14:21 +0000)]
[globalisel] Change LLT constructor string into an LLT subclass that knows how to generate it.
Summary:
This will allow future patches to inspect the details of the LLT. The implementation is now split between
the Support and CodeGen libraries to allow TableGen to use this class without introducing layering concerns.
Thanks to Ahmed Bougacha for finding a reasonable way to avoid the layering issue and providing the version of this patch without that problem.
Diana Picus [Tue, 28 Feb 2017 14:17:53 +0000 (14:17 +0000)]
[ARM] GlobalISel: Lower i32 and fp call parameters on the stack
Lower i32, float and double parameters that need to live on the stack. This
boils down to creating some G_GEPs starting from the stack pointer and storing
the values there. During the process we also keep track of the stack size and
use the final value in the ADJCALLSTACKDOWN/UP instructions.
We currently assert for smaller types, since they usually require extensions.
They will be handled in a separate patch.
Sanne Wouda [Tue, 28 Feb 2017 10:34:48 +0000 (10:34 +0000)]
[Assembler] Add test for !srcloc references in assembler diags
Summary:
clang adds !srcloc metadata to inline assembly in LLVM bitcode generated
for inline assembly in C. The value of this !srcloc is passed to the
diagnostics handler if the inline assembly generates a diagnostic.
clang is able to turn this cookie back to a location in the C source
file.
To test this functionality without a dependency, make llc print the
!srcloc metadata if it is present. The added test uses this mechanism
to test that the correct !srclocs are passed to the diag handler.
Vassil Vassilev [Tue, 28 Feb 2017 07:11:59 +0000 (07:11 +0000)]
Allow externally dlopen-ed libraries to be registered as permanent libraries.
This is also useful in cases when llvm is in a shared library. First we dlopen
the llvm shared library and then we register it as a permanent library in order
to keep the JIT and other services working.
Sanjoy Das [Tue, 28 Feb 2017 07:04:49 +0000 (07:04 +0000)]
[ImplicitNullCheck] Add alias analysis usage
Summary:
With this change ImplicitNullCheck optimization uses alias analysis
and can use load/store memory access for implicit null check if there
are other load/store before but memory accesses do not alias.
Matthias Braun [Tue, 28 Feb 2017 00:33:32 +0000 (00:33 +0000)]
Add MIR-level outlining pass
This is a patch for the outliner described in the RFC at:
http://lists.llvm.org/pipermail/llvm-dev/2016-August/104170.html
The outliner is a code-size reduction pass which works by finding
repeated sequences of instructions in a program, and replacing them with
calls to functions. This is useful to people working in low-memory
environments, where sacrificing performance for space is acceptable.
This adds an interprocedural outliner directly before printing assembly.
For reference on how this would work, this patch also includes X86
target hooks and an X86 test.
[CGP] Split some critical edges coming out of indirect branches
Splitting critical edges when one of the source edges is an indirectbr
is hard in general (because it requires changing the memory the indirectbr
reads). But if a block only has a single indirectbr predecessor (which is
the common case), we can simulate splitting that edge by splitting
the destination block, and retargeting the *direct* branches.
This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame()
ends up using an indirect branch with ~100 successors, and passing a constant to
each of those. Since MachineSink can't break indirect critical edges on demand
(and doing this in MIR doesn't look feasible), this causes us to emit about ~100
defs of registers containing constants, which we in the predecessor block, where
only one of those constants is used in each successor. So, at each computed goto,
we needlessly spill about a 100 constants to stack. The end result is that a
clang-compiled python interpreter can be about ~2.5x slower on a simple python
reduction loop than a gcc-compiled interpreter.
Zachary Turner [Tue, 28 Feb 2017 00:04:07 +0000 (00:04 +0000)]
[PDB] Make streams carry their own endianness.
Before the endianness was specified on each call to read
or write of the StreamReader / StreamWriter, but in practice
it's extremely rare for streams to have data encoded in
multiple different endiannesses, so we should optimize for the
99% use case.
This makes the code cleaner and more general, but otherwise
has NFC.
Dan Gohman [Mon, 27 Feb 2017 22:44:37 +0000 (22:44 +0000)]
[MC] Factor out non-COFF handling of COFF-specific directives.
Instead of requiring every non-COFF MCObjectStreamer to implement the
COFF hooks just to do an llvm_unreachable to say that they're not
supported, do the llvm_unreachable in the default implementation, as
suggested by rnk in https://reviews.llvm.org/D26722.