Robert Lougher [Tue, 19 Mar 2019 20:24:28 +0000 (20:24 +0000)]
[TailCallElim] Add tailcall elimination pass to LTO pipelines
LTO provides additional opportunities for tailcall elimination due to
link-time inlining and visibility of nocapture attribute. Testing showed
negligible impact on compilation times.
Philip Reames [Tue, 19 Mar 2019 20:10:00 +0000 (20:10 +0000)]
Demanded elements support for masked.load and masked.gather
Teach instcombine to propagate demanded elements through a masked load or masked gather instruction. This is in the broader context of improving vector pointer instcombine under https://reviews.llvm.org/D57140.
Matt Arsenault [Tue, 19 Mar 2019 19:33:12 +0000 (19:33 +0000)]
CodeGen: Refactor regallocator command line and target selection
This will allow targets more flexibility to replace the
register allocator core passes. In a future commit,
AMDGPU will run the core register assignment passes
twice, and will also want to disallow using the
standard -regalloc option.
Matt Arsenault [Tue, 19 Mar 2019 19:16:04 +0000 (19:16 +0000)]
RegAllocFast: Do not allocate registers for undef uses
Do not actually allocate a register for an undef use. Previously we we
would create unnecessary reload instruction for undef uses where the
register wasn't live.
Matt Arsenault [Tue, 19 Mar 2019 19:01:34 +0000 (19:01 +0000)]
RegAllocFast: Remove early selection loop, the spill calculation will report cost 0 anyway for free regs
The 2nd loop calculates spill costs but reports free registers as cost
0 anyway, so there is little benefit from having a separate early
loop.
Surprisingly this is not NFC, as many register are marked regDisabled
so the first loop often picks up later registers unnecessarily instead
of the first one available in the allocation order...
Philip Reames [Tue, 19 Mar 2019 18:27:18 +0000 (18:27 +0000)]
Allow unordered loads to be considered invariant in CodeGen
The actual code change is fairly straight forward, but exercising it isn't. First, it turned out we weren't adding the appropriate flags in SelectionDAG. Second, it turned out that we've got some optimization gaps, so obvious test cases don't work.
My first attempt (in atomic-unordered.ll) points out a deficiency in our peephole-opt folding logic which I plan to fix separately. Instead, I'm exercising this through MachineLICM.
[Remarks] Add a new Remark / RemarkParser abstraction
This adds a Remark class that allows us to share code when working with
remarks.
The C API has been updated to reflect this. Instead of the parser
generating C structs, it's now using a C++ object that is used through
opaque pointers in C. This gives us much more flexibility on what
changes we can make to the internal state of the object and interacts
much better with scenarios where the library is used through dlopen.
* C API updates:
* move from C structs to opaque pointers and functions
* the remark type is now an enum instead of a string
* unit tests updates:
* use mostly the C++ API
* keep one test for the C API
* rename to YAMLRemarksParsingTest
* a typo was fixed: AnalysisFPCompute -> AnalysisFPCommute.
* a new error message was added: "expected a remark tag."
* llvm-opt-report has been updated to use the C++ parser instead of the
C API
Nikita Popov [Tue, 19 Mar 2019 17:53:56 +0000 (17:53 +0000)]
[ValueTracking] Use computeConstantRange() for unsigned add/sub overflow
Improve computeOverflowForUnsignedAdd/Sub in ValueTracking by
intersecting the computeConstantRange() result into the ConstantRange
created from computeKnownBits(). This allows us to detect some
additional never/always overflows conditions that can't be determined
from known bits.
This revision also adds basic handling for constants to
computeConstantRange(). Non-splat vectors will be handled in a followup.
The signed case will also be handled in a followup, as it needs some
more groundwork.
Philip Reames [Tue, 19 Mar 2019 17:20:49 +0000 (17:20 +0000)]
[AtomicExpand] Fix a crash bug when lowering unordered loads to cmpxchg
Add tests for wider atomic loads and stores. In the process, fix a crasher where we appearently handled unorder stores, but not loads, when lowering to cmpxchg idioms.
Justin Bogner [Tue, 19 Mar 2019 16:52:00 +0000 (16:52 +0000)]
[DAGCombine] Fix a miscompile when reducing BUILD_VECTORs to a shuffle
In r311255 we added a case where we split vectors whose elements are
all derived from the same input vector so that we could shuffle it
more efficiently. In doing so, createBuildVecShuffle was taught to
adjust for the fact that all indices would be based off of the first
vector when this happens, but it's possible for the code that checked
that to fire incorrectly if we happen to have a BUILD_VECTOR of
extracts from subvectors and don't hit this new optimization.
Instead of trying to detect if we've split the vector by checking if
we have extracts from the same base vector, we can just pass that
information into createBuildVecShuffle, avoiding the miscompile.
Philip Reames [Tue, 19 Mar 2019 16:46:56 +0000 (16:46 +0000)]
[Tests] Update to newer ISA
There are some issues w/missed opts on older platforms, but that's not the purpose of this test. Using a newer API points out that some TODOs are already handled, and allows addition of tests to exercise other issues (future patch.)
Sanjay Patel [Tue, 19 Mar 2019 16:39:17 +0000 (16:39 +0000)]
[InstCombine] fold logic-of-nan-fcmps (PR41069)
Combine 2 fcmps that are checking for nan-ness:
and (fcmp ord X, 0), (and (fcmp ord Y, 0), Z) --> and (fcmp ord X, Y), Z
or (fcmp uno X, 0), (or (fcmp uno Y, 0), Z) --> or (fcmp uno X, Y), Z
This is an exact match for a minimal reassociation pattern.
If we want to handle this more generally that should go in
the reassociate pass and allow removing this code.
This should fix:
https://bugs.llvm.org/show_bug.cgi?id=41069
Simon Pilgrim [Tue, 19 Mar 2019 16:24:55 +0000 (16:24 +0000)]
[SelectionDAG] Handle unary SelectPatternFlavor for ABS case in SelectionDAGBuilder::visitSelect
These changes are related to PR37743 and include:
SelectionDAGBuilder::visitSelect handles the unary SelectPatternFlavor::SPF_ABS case to build ABS node.
Delete the redundant recognizer of the integer ABS pattern from the DAGCombiner.
Add promoting the integer ABS node in the LegalizeIntegerType.
Expand-based legalization of integer result for the ABS nodes.
Expand-based legalization of ABS vector operations.
Add some integer abs testcases for different typesizes for Thumb arch
Add the custom ABS expanding and change the SAD pattern recognizer for X86 arch: The i64 result of the ABS is expanded to:
tmp = (SRA, Hi, 31)
Lo = (UADDO tmp, Lo)
Hi = (XOR tmp, (ADDCARRY tmp, hi, Lo:1))
Lo = (XOR tmp, Lo)
The "detectZextAbsDiff" function is changed for the recognition of pattern with the ABS node. Given a ABS node, detect the following pattern:
(ABS (SUB (ZERO_EXTEND a), (ZERO_EXTEND b))).
Change integer abs testcases for codegen with the ABS node support for AArch64.
Indicate that the ABS is legal for the i64 type when the NEON is supported.
Change the integer abs testcases to show changing of codegen.
Add combine and legalization of ABS nodes for Thumb arch.
Extend 'matchSelectPattern' to recognize the ABS patterns with ICMP_SGE condition.
For discussion, see https://bugs.llvm.org/show_bug.cgi?id=37743
Jordan Rupprecht [Tue, 19 Mar 2019 16:09:54 +0000 (16:09 +0000)]
[llvm-ar] Support N [count] modifier
Summary:
GNU ar supports the 'N' count modifier for the extract (x) and delete (d) operations. When an archive contains multiple members with the same name, this can be used to extract (or delete) them individually. For example:
```
$ llvm-ar t archive.a
foo
foo
$ llvm-ar x archive.a
-> Writes foo twice, overwriting it the second time :( :(
$ llvm-ar xN 1 archive.a foo && mv foo foo.1
$ llvm-ar xN 2 archive.a foo && mv foo foo.2
-> Write foo twice, renaming it in between invocations to preserve all versions
```
Neil Henning [Tue, 19 Mar 2019 15:50:24 +0000 (15:50 +0000)]
[AMDGPU] Ban i8 min3 promotion.
I found this really weird WWM-related case whereby through the WWM
transformations our isel lowering was trying to promote 2 min's into a
min3 for the i8 type, which our hardware doesn't support.
The new min3_i8.ll test case would previously spew the error:
Simon Atanasyan [Tue, 19 Mar 2019 15:15:35 +0000 (15:15 +0000)]
[mips] Fix crash on recursive using of .set
Switch to the `MCParserUtils::parseAssignmentExpression` for parsing
assignment expressions in the `.set` directive reduces code and allows
to print an error message instead of crashing in case of incorrect
recursive using of the `.set`.
Fix for the bug https://bugs.llvm.org/show_bug.cgi?id=41053.
As discussed on PR41125 and D59363, we have a mismatch between icmp eq/ne cases with an undef operand:
When the other operand is constant we fold to undef (handled in ConstantFoldCompareInstruction)
When the other operand is non-constant we fold to a bool constant based on isTrueWhenEqual (handled in SimplifyICmpInst).
Neither is really wrong, but this patch changes the logic in SimplifyICmpInst to consistently fold to undef.
The NewGVN test change is annoying (as with most heavily reduced tests) but AFAICT I have kept the purpose of the test based on rL291968.
Petar Jovanovic [Tue, 19 Mar 2019 13:49:03 +0000 (13:49 +0000)]
[DebugInfoMetadata] Move main subprogram DIFlag into DISPFlags
Moving subprogram specific flags into DISPFlags makes IR code more readable.
In addition, we provide free space in DIFlags for other
'non-subprogram-specific' debug info flags.
Markus Lavin [Tue, 19 Mar 2019 13:16:28 +0000 (13:16 +0000)]
[DebugInfo] Introduce DW_OP_LLVM_convert
Introduce a DW_OP_LLVM_convert Dwarf expression pseudo op that allows
for a convenient way to perform type conversions on the Dwarf expression
stack. As an additional bonus it paves the way for using other Dwarf
v5 ops that need to reference a base_type.
The new DW_OP_LLVM_convert is used from lib/Transforms/Utils/Local.cpp
to perform sext/zext on debug values but mainly the patch is about
preparing terrain for adding other Dwarf v5 ops that need to reference a
base_type.
For Dwarf v5 the op maps to DW_OP_convert and for earlier versions a
complex shift & mask pattern is generated to emulate sext/zext.
This is a recommit of r356442 with trivial fixes for the failing tests.
Serge Guelton [Tue, 19 Mar 2019 09:14:09 +0000 (09:14 +0000)]
Use response file when generating LLVM-C.dll
As discovered in D56774 the command line gets to long, so use a response file
to give the script the libs. This change has been tested and is confirmed
working for me.
Commited on behalf of Jakob Bornecrantz.
Differential Revision: https://reviews.llvm.org/D56781
Markus Lavin [Tue, 19 Mar 2019 08:48:19 +0000 (08:48 +0000)]
[DebugInfo] Introduce DW_OP_LLVM_convert
Introduce a DW_OP_LLVM_convert Dwarf expression pseudo op that allows
for a convenient way to perform type conversions on the Dwarf expression
stack. As an additional bonus it paves the way for using other Dwarf
v5 ops that need to reference a base_type.
The new DW_OP_LLVM_convert is used from lib/Transforms/Utils/Local.cpp
to perform sext/zext on debug values but mainly the patch is about
preparing terrain for adding other Dwarf v5 ops that need to reference a
base_type.
For Dwarf v5 the op maps to DW_OP_convert and for earlier versions a
complex shift & mask pattern is generated to emulate sext/zext.
Heejin Ahn [Tue, 19 Mar 2019 05:10:39 +0000 (05:10 +0000)]
[WebAssembly] Improve readability of irreducibility tests
Summary:
This adds `preds` comment lines to BB names for readability, while also
fixes some of existing incorrect comment lines. Also deletes a few
unnecessary attributes. Autogenerated by `opt`.
Craig Topper [Mon, 18 Mar 2019 22:06:19 +0000 (22:06 +0000)]
[X86] Add coverage for 16-bit and 64-bit versions of bsf/bsr/bt/btc/btr/bts in the assembly tests that are supposed to provide full coverage. Add coverage for cwtl/cltq/cwtd/cqto as well.
Nikita Popov [Mon, 18 Mar 2019 21:35:19 +0000 (21:35 +0000)]
[ValueTracking][InstSimplify] Support min/max selects in computeConstantRange()
Add support for min/max flavor selects in computeConstantRange(),
which allows us to fold comparisons of a min/max against a constant
in InstSimplify. This was suggested by spatel as an alternative
approach to D59378. I've also added the infinite looping test from
that revision here.
Sam Clegg [Mon, 18 Mar 2019 21:21:12 +0000 (21:21 +0000)]
[WebAssembly] Don't override default implementation of isOffsetFoldingLegal. NFC.
The default implementation does we want and is going to more compatible
with dynamic linking (-fPIC) support that is planned.
This is NFC because currently we only build wasm with
`-relocation-model=static` which in turn means that the default
`isOffsetFoldingLegal` always returns true today.
Nikita Popov [Mon, 18 Mar 2019 21:20:03 +0000 (21:20 +0000)]
[ValueTracking][InstSimplify] Move abs handling into computeConstantRange(); NFC
This is preparation for D59506. The InstructionSimplify abs handling
is moved into computeConstantRange(), which is the general place for
such calculations. This is NFC and doesn't affect the existing tests
in test/Transforms/InstSimplify/icmp-abs-nabs.ll.
Jake Ehrlich [Mon, 18 Mar 2019 20:35:18 +0000 (20:35 +0000)]
[llvm-objcopy] Make .build-id linking atomic
This change makes linking into .build-id atomic and safe to use.
Some users under particular workflows are reporting that this races
more than half the time under particular conditions.
Alexandre Ganea [Mon, 18 Mar 2019 19:13:23 +0000 (19:13 +0000)]
[DebugInfo][PDB] Don't write empty debug streams
Before, empty debug streams were written as 8 bytes (4 bytes signature + 4 bytes for the GlobalRefs count).
With this patch, unused empty streams aren't emitted anymore. Modules now encode 65535 as an 'unused stream' value, by convention.
Also fix the * Linker * contrib section which wasn't correctly emitted previously.
Warren Ristow [Mon, 18 Mar 2019 18:52:35 +0000 (18:52 +0000)]
[SCEV] Guard movement of insertion point for loop-invariants
This reinstates r347934, along with a tweak to address a problem with
PHI node ordering that that commit created (or exposed). (That commit
was reverted at r348426, due to the PHI node issue.)
Original commit message:
r320789 suppressed moving the insertion point of SCEV expressions with
dev/rem operations to the loop header in non-loop-invariant situations.
This, and similar, hoisting is also unsafe in the loop-invariant case,
since there may be a guard against a zero denominator. This is an
adjustment to the fix of r320789 to suppress the movement even in the
loop-invariant case.
This patch follows some ideas from r352866 to optimize the floating
point materialization even further. It changes isFPImmLegal to
considere up to 2 mov instruction or up to 5 in case subtarget has
fused literals.
The rationale is the cost is the same for mov+fmov vs. adrp+ldr; but
the mov+fmov sequence is always better because of the reduced d-cache
pressure. The timings are still the same if you consider movw+movk+fmov
vs. adrp+ldr will be fused (although one instruction longer).
[AArch64] Refactor floating point materialization. NFC
It splits the login of actual instruction emission away from the logic
that figures out the appropriate sequence on AArch64ExpandPseudo::expandMOVImm.
The new function AArch64_IMM::expandMOVImm, which return the list of the
instructions to materialize the immediate constant, is implemented on a
separated unit because it will be used in a subsequent patch to optimize
floating point materialization.
Nirav Dave [Mon, 18 Mar 2019 17:02:38 +0000 (17:02 +0000)]
[DAG] Cleanup unused node in SimplifySelectCC.
Delete temporarily constructed node uses for analysis after it's use,
holding onto original input nodes. Ideally this would be rewritten
without making nodes, but this appears relatively complex.
Neil Henning [Mon, 18 Mar 2019 14:44:28 +0000 (14:44 +0000)]
[AMDGPU] Add an experimental buffer fat pointer address space.
Add an experimental buffer fat pointer address space that is currently
unhandled in the backend. This commit reserves address space 7 as a
non-integral pointer repsenting the 160-bit fat pointer (128-bit buffer
descriptor + 32-bit offset) that is heavily used in graphics workloads
using the AMDGPU backend.
Sanjay Patel [Mon, 18 Mar 2019 14:10:11 +0000 (14:10 +0000)]
[InstCombine] extend rotate-left-by-constant canonicalization to funnel shift
Follow-up to:
rL356338
Rotates are a special case of funnel shift where the 2 input operands
are the same value, but that does not need to be a restriction for the
canonicalization when the shift amount is a constant.
Roman Lebedev [Mon, 18 Mar 2019 11:32:37 +0000 (11:32 +0000)]
[llvm-exegesis] Separate tool options into three categories.
Results in much nicer -help output:
```
$ ./bin/llvm-exegesis -help
USAGE: llvm-exegesis [options]
OPTIONS:
Color Options:
-color - Use colors in output (default=autodetect)
General options:
-enable-cse-in-irtranslator - Should enable CSE in irtranslator
-enable-cse-in-legalizer - Should enable CSE in Legalizer
Generic Options:
-help - Display available options (-help-hidden for more)
-help-list - Display list of available options (-help-list-hidden for more)
-version - Display the version of this program
llvm-exegesis analysis options:
-analysis-clustering-epsilon=<number> - dbscan epsilon for benchmark point clustering
-analysis-clusters-output-file=<string> -
-analysis-display-unstable-clusters - if there is more than one benchmark for an opcode, said benchmarks may end up not being clustered into the same cluster if the measured performance characteristics are different. by default all such opcodes are filtered out. this flag will instead show only such unstable opcodes
-analysis-inconsistencies-output-file=<string> -
-analysis-inconsistency-epsilon=<number> - epsilon for detection of when the cluster is different from the LLVM schedule profile values
-analysis-numpoints=<uint> - minimum number of points in an analysis cluster
llvm-exegesis benchmark options:
-ignore-invalid-sched-class - ignore instructions that do not define a sched class
-mode=<value> - the mode to run
=latency - Instruction Latency
=inverse_throughput - Instruction Inverse Throughput
=uops - Uop Decomposition
=analysis - Analysis
-num-repetitions=<uint> - number of time to repeat the asm snippet
-opcode-index=<int> - opcode to measure, by index
-opcode-name=<string> - comma-separated list of opcodes to measure, by name
-snippets-file=<string> - code snippets to measure
llvm-exegesis options:
-benchmarks-file=<string> - File to read (analysis mode) or write (latency/uops/inverse_throughput modes) benchmark results. “-” uses stdin/stdout.
-mcpu=<string> - cpu name to use for pfm counters, leave empty to autodetect
```
Christof Douma [Mon, 18 Mar 2019 09:21:06 +0000 (09:21 +0000)]
[AArch64] Fix bug 35094 atomicrmw on Armv8.1-A+lse
Fixes https://bugs.llvm.org/show_bug.cgi?id=35094
The Dead register definition pass should leave alone the atomicrmw
instructions on AArch64 (LTE extension). The reason is the following
statement in the Arm ARM:
"The ST<OP> instructions, and LD<OP> instructions where the destination
register is WZR or XZR, are not regarded as doing a read for the purpose
of a DMB LD barrier."
A good example was given in the gcc thread by Will Deacon (linked in the
bugzilla ticket 35094):
P2 (atomic_int* y) {
int r1 = atomic_load_explicit(y,memory_order_relaxed);
}
My understanding is that it is forbidden for r0 == 0 and r1 == 2 after
this test has executed. However, if the relaxed add in P1 compiles to
STADD and the subsequent acquire fence is compiled as DMB LD, then we
don't have any ordering guarantees in P1 and the forbidden result could
be observed.
Matt Arsenault [Sun, 17 Mar 2019 21:31:35 +0000 (21:31 +0000)]
AMDGPU: Partially fix default device for HSA
There are a few different issues, mostly stemming from using
generation based checks for anything instead of subtarget
features. Stop adding flat-address-space as a feature for HSA, as it
should only be a device property. This was incorrectly allowing flat
instructions to select for SI.
Increase the default generation for HSA to avoid the encoding error
when emitting objects. This has some other side effects from various
checks which probably should be separate subtarget features (in the
cost model and for dealing with the DS offset folding issue).
Partial fix for bug 41070. It should probably be an error to try using
amdhsa without flat support.
Nikita Popov [Sun, 17 Mar 2019 21:25:26 +0000 (21:25 +0000)]
[ValueTracking] Use ConstantRange overflow check for signed add; NFC
This is the same change as rL356290, but for signed add. It replaces
the existing ripple logic with the overflow logic in ConstantRange.
This is NFC in that it should return NeverOverflow in exactly the
same cases as the previous implementation. However, it does make
computeOverflowForSignedAdd() more powerful by now also determining
AlwaysOverflows conditions. As none of its consumers handle this yet,
this has no impact on optimization. Making use of AlwaysOverflows
in with.overflow folding will be handled as a followup.
Craig Topper [Sun, 17 Mar 2019 21:21:37 +0000 (21:21 +0000)]
[X86] Remove the _alt forms of XOP VPCOM instructions. Use a combination of custom printing and custom parsing to achieve the same result and more
Previously we had a regular form of the instruction used when the immediate was 0-7. And _alt form that allowed the full 8 bit immediate. Codegen would always use the 0-7 form since the immediate was always checked to be in range. Assembly parsing would use the 0-7 form when a mnemonic like vpcomtrueb was used. If the immediate was specified directly the _alt form was used. The disassembler would prefer to use the 0-7 form instruction when the immediate was in range and the _alt form otherwise. This way disassembly would print the most readable form when possible.
The assembly parsing for things like vpcomtrueb relied on splitting the mnemonic into 3 pieces. A "vpcom" prefix, an immediate representing the "true", and a suffix of "b". The tablegenerated printing code would similarly print a "vpcom" prefix, decode the immediate into a string, and then print "b".
The _alt form on the other hand parsed and printed like any other instruction with no specialness.
With this patch we drop to one form and solve the disassembly printing issue by doing custom printing when the immediate is 0-7. The parsing code has been tweaked to turn "vpcomtrueb" into "vpcomb" and then the immediate for the "true" is inserted either before or after the other operands depending on at&t or intel syntax.
I'd rather not do the custom printing, but I tried using an InstAlias for each possible mnemonic for all 8 immediates for all 16 combinations of element size, signedness, and memory/register. The code emitted into printAliasInstr ended up checking the number of operands, the register class of each operand, and the immediate for all 256 aliases. This was repeated for both the at&t and intel printer. Despite a lot of common checks between all of the aliases, when compiled with clang at least this commonality was not well optimized. Nor do all the checks seem necessary. Since I want to do a similar thing for vcmpps/pd/ss/sd which have 32 immediate values and 3 encoding flavors, 3 register sizes, etc. This didn't seem to scale well for clang binary size. So custom printing seemed a better trade off.
I also considered just using the InstAlias for the matching and not the printing. But that seemed like it would add a lot of extra rows to the matcher table. Especially given that the 32 immediates for vpcmpps have 46 strings associated with them.