r371901 was overeager: widenScalarDst() and similar helpers in the legalizer
attempt to increment the given insertion point in order to add new instructions
after the instruction currently being legalized. In cases where the insertion
point is not exactly the current instruction, callers need to compensate for
this behaviour by decrementing the insertion iterator before calling them.
That's not a nice state of affairs; for now, just undo the problematic parts of
the change.
[PowerPC] Cust lower fpext v2f32 to v2f64 from extract_subvector v4f32
Add the missing piece of r372029.
Somehow when the patch for review D61961 was committed, only the test case
went in and the code didn't. This of course caused all kinds of buildbot
breaks.
This patch just adds the missing code.
Author: Lei Huang
Differential revision: https://reviews.llvm.org/D61961
Joel E. Denny [Mon, 16 Sep 2019 21:22:29 +0000 (21:22 +0000)]
[lit] Make internal diff work in pipelines
When using lit's internal shell, RUN lines like the following
accidentally execute an external `diff` instead of lit's internal
`diff`:
```
# RUN: program | diff file -
# RUN: not diff file1 file2 | FileCheck %s
```
Such cases exist now, in `clang/test/Analysis` for example. We are
preparing patches to ensure lit's internal `diff` is called in such
cases, which will then fail because lit's internal `diff` cannot
currently be used in pipelines and doesn't recognize `-` as a
command-line option.
To enable pipelines, this patch moves lit's `diff` implementation into
an out-of-process script, similar to lit's `cat` implementation. A
follow-up patch will implement `-` to mean stdin.
Lei Huang [Mon, 16 Sep 2019 20:04:15 +0000 (20:04 +0000)]
[PowerPC] Cust lower fpext v2f32 to v2f64 from extract_subvector v4f32
This is a follow-up patch to https://reviews.llvm.org/D57857 to handle
extract_subvector v4f32. For cases where we fpext v2f32 to v2f64 from an
extract_subvector, we currently generate the following on P9:
Steven Wu [Mon, 16 Sep 2019 18:49:54 +0000 (18:49 +0000)]
[LTO][Legacy] Add new C interface to query libcall functions
Summary:
This is needed to implement the same approach as lld (implemented in r338434)
for handling symbols that can be generated by the LTO code generator but are
not present in the symbol table, for linkers that use the legacy C APIs.
libLTO is in charge of providing the list of symbols. The linker is in
charge of implementing the eager loading from static libraries using
the list of symbols.
[PGO] Use linkonce_odr linkage for __profd_ variables in comdat groups
This fixes relocations against __profd_ symbols in discarded sections,
which is PR41380.
In general, instrumentation happens very early, and optimization and
inlining happens afterwards. The counters for a function are calculated
early, and after inlining, counters for an inlined function may be
widely referenced by other functions.
For C++ inline functions of all kinds (linkonce_odr &
available_externally mainly), instr profiling wants to deduplicate these
__profc_ and __profd_ globals. Otherwise the binary would be quite
large.
I made __profd_ and __profc_ comdat in r355044, but I chose to make
__profd_ internal. At the time, I was only dealing with coverage, and in
that case, none of the instrumentation needs to reference __profd_.
However, if you use PGO, then instrumentation passes add calls to
__llvm_profile_instrument_range which reference __profd_ globals. The
solution is to make these globals externally visible by using
linkonce_odr linkage for data as was done for counters.
This is safe because PGO adds a CFG hash to the names of the data and
counter globals, so if different TUs have different globals, they will
get different data and counter arrays.
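As a rough illustration of the fix's shape, here is a minimal sketch using public LLVM IR APIs (not the actual InstrProfiling pass code; the helper name is made up):
```
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Emit a profile data global that deduplicates across TUs: linkonce_odr
// linkage in a comdat keyed on the (CFG-hash-suffixed) symbol name.
static GlobalVariable *emitProfData(Module &M, Type *DataTy, Constant *Init,
                                    StringRef Name) {
  auto *GV = new GlobalVariable(M, DataTy, /*isConstant=*/false,
                                GlobalValue::LinkOnceODRLinkage, Init, Name);
  GV->setComdat(M.getOrInsertComdat(Name)); // one comdat group per symbol
  return GV;
}
```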
David Green [Mon, 16 Sep 2019 17:29:07 +0000 (17:29 +0000)]
[ARM] A predicate cast of a predicate cast is a predicate cast
This adds some very basic folding of PREDICATE_CASTs, removing cases where they
are chained together. These would already be removed eventually, as these are
lowered to copies. This just allows it to happen earlier, which can help other
simplifications.
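A hedged sketch of the fold, written as it might appear in the ARM backend's DAG combining (illustrative only, not the committed code):
```
// Sketch in the context of ARMISelLowering.cpp: if a PREDICATE_CAST's
// operand is itself a PREDICATE_CAST, rebuild the outer cast directly
// from the inner operand, collapsing the chain.
static SDValue performPredicateCastCombine(SDNode *N, SelectionDAG &DAG) {
  SDValue Op = N->getOperand(0);
  if (Op.getOpcode() == ARMISD::PREDICATE_CAST)
    return DAG.getNode(ARMISD::PREDICATE_CAST, SDLoc(N), N->getValueType(0),
                       Op.getOperand(0));
  return SDValue();
}
```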
Roman Lebedev [Mon, 16 Sep 2019 16:18:24 +0000 (16:18 +0000)]
[SimplifyCFG] FoldTwoEntryPHINode(): consider *total* speculation cost, not per-BB cost
Summary:
Previously, if the threshold was 2, we were willing to speculatively
execute 2 cheap instructions in both basic blocks (thus we were willing
to speculatively execute cost = 4), but weren't willing to speculate
when one BB had 3 instructions and the other had none,
even though that would have a total cost of 3.
This looks inconsistent to me.
I don't think `cmov`-like instructions will start executing
until both of their inputs are available: https://godbolt.org/z/zgHePf
So I don't see why the existing behavior is the correct one.
Also, let's add its own `cl::opt` for this threshold,
with default=4, so it is no stricter than the previous threshold:
it will allow folding when there are 2 BBs, each with cost=2.
And since the logic has changed, it will also allow folding when
one BB has cost=3 and the other cost=1, or when there is only one BB with cost=4.
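A minimal sketch of such an option (the option name and wording here are illustrative, not necessarily what was committed):
```
#include "llvm/Support/CommandLine.h"

using namespace llvm;

// Total speculation cost allowed across both blocks when flattening a
// two-entry PHI node into a select.
static cl::opt<unsigned> TwoEntryPHINodeFoldingThreshold(
    "two-entry-phi-node-folding-threshold",
    cl::desc("Maximum total instruction cost to speculatively execute when "
             "folding a two-entry PHI node into a select"),
    cl::init(4), cl::Hidden);
```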
This is an alternative solution to D65148.
This fix is mainly motivated by `signbit-like-value-extension.ll` test.
That pattern comes up in JPEG decoding, see e.g.
`Figure F.12 – Extending the sign bit of a decoded value in V`
of `ITU T.81` (JPEG specification).
That branch is not predictable, and it is within the innermost loop,
so the fact that that pattern ends up being stuck with a branch
instead of `select` (i.e. `CMOV` for x86) is unlikely to be beneficial.
[InstCombine] remove unneeded one-use checks for icmp fold
Related folds were added in:
rL125734
...the code comment about register pressure is discussed in
more detail in:
https://bugs.llvm.org/show_bug.cgi?id=2698
But 10 years later, perf testing bzip2 with this change now
shows a slight (0.2% average) improvement on Haswell, although
that's probably within test noise.
Given that this is IR canonicalization, we shouldn't be worried
about register pressure though; the backend should be able to
adjust for that as needed.
This is part of solving PR43310 the theoretically right way:
https://bugs.llvm.org/show_bug.cgi?id=43310
...i.e., if we don't cripple basic transforms, then we won't
need to add special-case code to detect larger patterns.
rL371940 and rL371981 are related patches in this series.
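For readers unfamiliar with these checks, here is a generic sketch of what dropping a one-use restriction looks like in InstCombine's matchers (the fold shown is illustrative, not the one touched here):
```
#include "llvm/IR/PatternMatch.h"

using namespace llvm;
using namespace llvm::PatternMatch;

static bool matchAndOperand(Value *Op, Value *&X) {
  // Before: bail out unless the 'and' had exactly one user.
  //   return match(Op, m_OneUse(m_And(m_Value(X), m_Value())));
  // After: fold regardless of extra uses; register pressure is the
  // backend's problem, not the canonicalizer's.
  return match(Op, m_And(m_Value(X), m_Value()));
}
```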
Simon Pilgrim [Mon, 16 Sep 2019 15:19:11 +0000 (15:19 +0000)]
[ExecutionEngine] Don't dereference a dyn_cast result. NFCI.
The static analyzer is warning about potential null dereferences of dyn_cast<> results. In these cases we can safely use cast<> directly, as we know the values should all be of the correct type (which is why it's working at the moment), and cast<> will assert if they aren't.
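The pattern behind these cleanups, sketched with a generic type (illustrative, not the touched code):
```
#include "llvm/IR/Function.h"
#include "llvm/Support/Casting.h"

using namespace llvm;

static StringRef exampleName(Value *V) {
  // Before: the analyzer flags this, since dyn_cast<> returns null on a
  // type mismatch and the result is dereferenced unconditionally.
  //   auto *F = dyn_cast<Function>(V);
  //   return F->getName();
  // After: cast<> documents that V is known to be a Function and asserts
  // in debug builds if it isn't.
  return cast<Function>(V)->getName();
}
```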
Now that the vectorizer can do tail-folding (rL367592) and the ARM backend
understands MVE masked loads/stores (rL371932), it's time to add the MVE
tail-folding equivalent of the X86 tests that I added.
Jonas Paulsson [Mon, 16 Sep 2019 14:49:36 +0000 (14:49 +0000)]
[SystemZ] Call erase() on the right MBB in SystemZTargetLowering::emitSelect()
Since MBB was split *before* MI, the MI(s) will reside in JoinMBB (MBB) at
the point of erasing them, so calling StartMBB->erase() is actually wrong,
although it is "working" by all appearances.
David Green [Mon, 16 Sep 2019 13:02:41 +0000 (13:02 +0000)]
[ARM] Fold VCMP into VPT
MVE has VPT instructions, which perform the duties of both a VCMP and a VPST in
a single instruction, performing the compare and starting the VPT block in one.
This teaches the MVEVPTBlockPass to fold them, searching back through the
basic block for a valid VCMP and creating the VPT from its operands.
There are some changes to the VPT instructions to accommodate this, altering
the order of the operands to match the VCMP better, and changing P0 register
defs to be VPR defs, as is used in other places.
[InstCombine] remove unneeded one-use checks for icmp fold
This fold and several others were added in:
rL125734
...with no explanation for the one-use checks other than the code
comments about register pressure.
Given that this is IR canonicalization, we shouldn't be worried
about register pressure though; the backend should be able to
adjust for that as needed.
This is part of solving PR43310 the theoretically right way:
https://bugs.llvm.org/show_bug.cgi?id=43310
...i.e., if we don't cripple basic transforms, then we won't
need to add special-case code to detect larger patterns.
Simon Pilgrim [Mon, 16 Sep 2019 11:22:44 +0000 (11:22 +0000)]
[VPlanSLP] Don't dereference a cast_or_null<VPInstruction> result. NFCI.
The static analyzer is warning about a potential null dereference of the cast_or_null result. I've split the cast_or_null check from the ->getUnderlyingInstr() call to avoid this, but it appears that we weren't seeing any null pointers in the dumped bundles in the first place.
Simon Pilgrim [Mon, 16 Sep 2019 10:35:09 +0000 (10:35 +0000)]
[SLPVectorizer] Don't dereference a dyn_cast result. NFCI.
The static analyzer is warning about potential null dereferences of dyn_cast<> results. In these cases we can safely use cast<> directly, as we know the values should all be of the correct type (which is why it's working at the moment), and cast<> will assert if they aren't.
[SVE][Inline-Asm] Add constraints for SVE predicate registers
Summary:
Adds the following inline asm constraints for SVE:
- Upl: One of the low eight SVE predicate registers, P0 to P7 inclusive
- Upa: SVE predicate register with full range, P0 to P15
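A hedged usage sketch (not from the patch; assumes clang with SVE enabled and the ACLE types from arm_sve.h):
```
#include <arm_sve.h>

svbool_t all_lanes_true() {
  svbool_t P;
  // "Upa" allows any of P0-P15; "Upl" would restrict the choice to P0-P7.
  asm("ptrue %0.b" : "=Upa"(P));
  return P;
}
```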
Jonas Paulsson [Mon, 16 Sep 2019 07:29:37 +0000 (07:29 +0000)]
[SystemZ] Merge the SystemZExpandPseudo pass into SystemZPostRewrite.
SystemZExpandPseudo's only job was to expand LOCRMux instructions into jump
sequences. This needs to be done if expandLOCRPseudo() or expandSELRPseudo()
fails to find a legal opcode (all registers "high" or "low"). This task has
now been moved to SystemZPostRewrite while removing the SystemZExpandPseudo
pass.
It is in fact preferred to expand these pseudos directly after register
allocation in SystemZPostRewrite since the hinted register combinations are
then not subject to later optimizations.
D53362 gives a prototype heap-to-stack conversion pass. With the addition of new attributes in the Attributor, this can now be revisited and improved. This will place it in the Attributor to make it easier to use new attributes (e.g. nofree, nosync, willreturn, etc.) and other Attributor features.
[InstCombine] remove unneeded one-use checks for icmp fold
This fold and several others were added in:
rL125734
...with no explanation for the one-use checks other than the code
comments about register pressure.
Given that this is IR canonicalization, we shouldn't be worried
about register pressure though; the backend should be able to
adjust for that as needed.
There are similar checks, as noted in the TODO comments. I'm
hoping to remove those restrictions too, but if any of these
does cause a regression, it should be easier to correct by making
small, individual commits.
This is part of solving PR43310 the theoretically right way:
https://bugs.llvm.org/show_bug.cgi?id=43310
...i.e., if we don't cripple basic transforms, then we won't
need to add special-case code to detect larger patterns.
Simon Pilgrim [Sun, 15 Sep 2019 15:38:26 +0000 (15:38 +0000)]
[DebugInfo] Don't dereference a dyn_cast<PDBSymbolData> result. NFCI.
The static analyzer is warning about a potential null dereference, but as we're in DataMemberLayoutItem we should be able to guarantee that the Symbol is a PDBSymbolData, allowing us to use cast<PDBSymbolData>; if not, the assert will fire for us.
David Green [Sun, 15 Sep 2019 14:14:47 +0000 (14:14 +0000)]
[ARM] Masked loads and stores
Masked loads and stores fit naturally with MVE, the instructions being easily
predicated. This adds lowering for the simple cases of masked loads and stores.
It does not yet deal with widening/narrowing or pre/post inc, and so is
currently behind an option.
The llvm masked load intrinsic will accept a "passthru" value, dictating the
values used for the lanes where the mask is zero. In MVE the instructions
write 0 to the zero-predicated lanes, so we need to match a passthru that
isn't 0 (or undef) with a select instruction to pull in the correct data
after the load.
[SLP] limit vectorization of Constant subclasses (PR33958)
This is a fix for:
https://bugs.llvm.org/show_bug.cgi?id=33958
It seems universally true that we would not want to transform this kind of
sequence on any target, but if that's not correct, then we could view this
as a target-specific cost model problem. We could also whitelist ConstantInt,
ConstantFP, etc. rather than blacklist Global and ConstantExpr.
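A hedged sketch of the guard being described (illustrative, not the exact in-tree check):
```
#include "llvm/IR/Constants.h"
#include "llvm/IR/GlobalValue.h"

using namespace llvm;

// Treat only "simple" constants as vectorizable scalars: a ConstantExpr may
// expand to extra instructions, and a GlobalValue is an address rather than
// an immediate, so building vectors out of them rarely pays off.
static bool isVectorizableConstant(Constant *C) {
  return !isa<ConstantExpr>(C) && !isa<GlobalValue>(C);
}
```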
The Inst parameter returns the encoded instruction; the Scratch parameter is used internally for manipulating operands and is exposed so that the underlying storage can be reused between calls to getBinaryCodeForInstr. The goal is to elide any APInt constructions that we can.
Similarly, the operand encoding prototype changes to:
Fangrui Song [Sat, 14 Sep 2019 01:36:31 +0000 (01:36 +0000)]
[llvm-objcopy] Ignore -B --binary-architecture=
GNU objcopy documents that -B is only useful with architecture-less
input (i.e. "binary" or "ihex"). After D67144, -O defaults to -I, and
-B is essentially a NOP.
* If -O is binary/ihex, GNU objcopy ignores -B.
* If -O is elf*, -B provides the e_machine field in GNU objcopy.
So to convert a blob to an ELF, `-I binary -B i386:x86-64 -O elf64-x86-64` has to be specified.
`-I binary -B i386:x86-64 -O elf64-x86-64` creates an ELF with its
e_machine field set to EM_NONE in GNU objcopy, but a regular x86_64 ELF
in elftoolchain elfcopy. Follow the elftoolchain approach (ignoring -B)
to simplify code. Users who expect their command lines to be portable should
specify -B.
Fangrui Song [Sat, 14 Sep 2019 01:18:47 +0000 (01:18 +0000)]
[llvm-ar] Uncapitalize error messages and delete full stop
Most GNU binutils don't append full stops in error messages. This
convention has been adopted by a bunch of LLVM binary utilities. Make
llvm-ar follow the convention as well.
This adds a reproducer dump command which makes it possible to inspect
a reproducer from inside LLDB. Currently it supports the Files, Commands
and Version providers. I'm planning to add support for the GDB Remote
provider in a follow-up patch.
Thomas Lively [Fri, 13 Sep 2019 22:54:41 +0000 (22:54 +0000)]
[WebAssembly] Narrowing and widening SIMD ops
Summary:
Implements target-specific LLVM intrinsics and clang builtins for
these new SIMD operations, as described at https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md#integer-to-integer-narrowing.
[GlobalISel] Fix insertion point of new instructions to be after PHIs.
For some reason we sometimes insert new instructions one instruction before
the first non-PHI when legalizing. This can result in having non-PHI
instructions before PHIs, which means that PHI elimination doesn't catch them.
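The fix's idea, sketched (illustrative; the helper is made up):
```
#include "llvm/CodeGen/MachineBasicBlock.h"

using namespace llvm;

// Inserting at getFirstNonPHI() keeps every PHI grouped at the top of the
// block, so PHI elimination still sees a well-formed block.
static MachineBasicBlock::iterator
legalizerInsertPt(MachineBasicBlock &MBB) {
  return MBB.getFirstNonPHI();
}
```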
* std::move the error extracted from parser creation to avoid asserts
* print a newline after the error message
* create the parser from the metadata
Because memory intrinsics are handled differently than other calls, we need to
check them for tail call eligibility in the legalizer. This allows us to still
inline them when it's beneficial to do so, but also tail call when possible.
This adds simple tail calling support for when the intrinsic is followed by a
return.
It ports the attribute checks from `TargetLowering::isInTailCallPosition` into
a similarly-named function in LegalizerHelper.cpp. The target-specific
`isUsedByReturnOnly` hook is not ported here.
Update tailcall-mem-intrinsics.ll to show that GlobalISel can now tail call
memory intrinsics.
Update legalize-memcpy-et-al.mir to have a case where we don't tail call.
Sebastian Pop [Fri, 13 Sep 2019 19:28:30 +0000 (19:28 +0000)]
[aarch64] move custom isel of extract_vector_elt to td file - NFC
In preparation for def-pat selection of dot product instructions,
this patch moves the custom instruction selection of extract_vector_elt
to the td file. Without this change it is impossible to catch a pattern that
starts with an extract_vector_elt: the custom C++ code is executed
ahead of the patterns in the td files, which are only executed at the end of
the switch statement in SelectCode(Node).
With this patch applied, it becomes possible to select a different pattern
that starts with extract_vector_elt by selecting a higher complexity than
this pattern.
The patch has been tested on aarch64-linux with make check-all.