Warren Ristow [Mon, 18 Mar 2019 18:52:35 +0000 (18:52 +0000)]
[SCEV] Guard movement of insertion point for loop-invariants
This reinstates r347934, along with a tweak to address a problem with
PHI node ordering that that commit created (or exposed). (That commit
was reverted at r348426, due to the PHI node issue.)
Original commit message:
r320789 suppressed moving the insertion point of SCEV expressions with
dev/rem operations to the loop header in non-loop-invariant situations.
This, and similar, hoisting is also unsafe in the loop-invariant case,
since there may be a guard against a zero denominator. This is an
adjustment to the fix of r320789 to suppress the movement even in the
loop-invariant case.
This patch follows some ideas from r352866 to optimize the floating
point materialization even further. It changes isFPImmLegal to
considere up to 2 mov instruction or up to 5 in case subtarget has
fused literals.
The rationale is the cost is the same for mov+fmov vs. adrp+ldr; but
the mov+fmov sequence is always better because of the reduced d-cache
pressure. The timings are still the same if you consider movw+movk+fmov
vs. adrp+ldr will be fused (although one instruction longer).
[AArch64] Refactor floating point materialization. NFC
It splits the login of actual instruction emission away from the logic
that figures out the appropriate sequence on AArch64ExpandPseudo::expandMOVImm.
The new function AArch64_IMM::expandMOVImm, which return the list of the
instructions to materialize the immediate constant, is implemented on a
separated unit because it will be used in a subsequent patch to optimize
floating point materialization.
Nirav Dave [Mon, 18 Mar 2019 17:02:38 +0000 (17:02 +0000)]
[DAG] Cleanup unused node in SimplifySelectCC.
Delete temporarily constructed node uses for analysis after it's use,
holding onto original input nodes. Ideally this would be rewritten
without making nodes, but this appears relatively complex.
Neil Henning [Mon, 18 Mar 2019 14:44:28 +0000 (14:44 +0000)]
[AMDGPU] Add an experimental buffer fat pointer address space.
Add an experimental buffer fat pointer address space that is currently
unhandled in the backend. This commit reserves address space 7 as a
non-integral pointer repsenting the 160-bit fat pointer (128-bit buffer
descriptor + 32-bit offset) that is heavily used in graphics workloads
using the AMDGPU backend.
Sanjay Patel [Mon, 18 Mar 2019 14:10:11 +0000 (14:10 +0000)]
[InstCombine] extend rotate-left-by-constant canonicalization to funnel shift
Follow-up to:
rL356338
Rotates are a special case of funnel shift where the 2 input operands
are the same value, but that does not need to be a restriction for the
canonicalization when the shift amount is a constant.
Roman Lebedev [Mon, 18 Mar 2019 11:32:37 +0000 (11:32 +0000)]
[llvm-exegesis] Separate tool options into three categories.
Results in much nicer -help output:
```
$ ./bin/llvm-exegesis -help
USAGE: llvm-exegesis [options]
OPTIONS:
Color Options:
-color - Use colors in output (default=autodetect)
General options:
-enable-cse-in-irtranslator - Should enable CSE in irtranslator
-enable-cse-in-legalizer - Should enable CSE in Legalizer
Generic Options:
-help - Display available options (-help-hidden for more)
-help-list - Display list of available options (-help-list-hidden for more)
-version - Display the version of this program
llvm-exegesis analysis options:
-analysis-clustering-epsilon=<number> - dbscan epsilon for benchmark point clustering
-analysis-clusters-output-file=<string> -
-analysis-display-unstable-clusters - if there is more than one benchmark for an opcode, said benchmarks may end up not being clustered into the same cluster if the measured performance characteristics are different. by default all such opcodes are filtered out. this flag will instead show only such unstable opcodes
-analysis-inconsistencies-output-file=<string> -
-analysis-inconsistency-epsilon=<number> - epsilon for detection of when the cluster is different from the LLVM schedule profile values
-analysis-numpoints=<uint> - minimum number of points in an analysis cluster
llvm-exegesis benchmark options:
-ignore-invalid-sched-class - ignore instructions that do not define a sched class
-mode=<value> - the mode to run
=latency - Instruction Latency
=inverse_throughput - Instruction Inverse Throughput
=uops - Uop Decomposition
=analysis - Analysis
-num-repetitions=<uint> - number of time to repeat the asm snippet
-opcode-index=<int> - opcode to measure, by index
-opcode-name=<string> - comma-separated list of opcodes to measure, by name
-snippets-file=<string> - code snippets to measure
llvm-exegesis options:
-benchmarks-file=<string> - File to read (analysis mode) or write (latency/uops/inverse_throughput modes) benchmark results. “-” uses stdin/stdout.
-mcpu=<string> - cpu name to use for pfm counters, leave empty to autodetect
```
Christof Douma [Mon, 18 Mar 2019 09:21:06 +0000 (09:21 +0000)]
[AArch64] Fix bug 35094 atomicrmw on Armv8.1-A+lse
Fixes https://bugs.llvm.org/show_bug.cgi?id=35094
The Dead register definition pass should leave alone the atomicrmw
instructions on AArch64 (LTE extension). The reason is the following
statement in the Arm ARM:
"The ST<OP> instructions, and LD<OP> instructions where the destination
register is WZR or XZR, are not regarded as doing a read for the purpose
of a DMB LD barrier."
A good example was given in the gcc thread by Will Deacon (linked in the
bugzilla ticket 35094):
P2 (atomic_int* y) {
int r1 = atomic_load_explicit(y,memory_order_relaxed);
}
My understanding is that it is forbidden for r0 == 0 and r1 == 2 after
this test has executed. However, if the relaxed add in P1 compiles to
STADD and the subsequent acquire fence is compiled as DMB LD, then we
don't have any ordering guarantees in P1 and the forbidden result could
be observed.
Matt Arsenault [Sun, 17 Mar 2019 21:31:35 +0000 (21:31 +0000)]
AMDGPU: Partially fix default device for HSA
There are a few different issues, mostly stemming from using
generation based checks for anything instead of subtarget
features. Stop adding flat-address-space as a feature for HSA, as it
should only be a device property. This was incorrectly allowing flat
instructions to select for SI.
Increase the default generation for HSA to avoid the encoding error
when emitting objects. This has some other side effects from various
checks which probably should be separate subtarget features (in the
cost model and for dealing with the DS offset folding issue).
Partial fix for bug 41070. It should probably be an error to try using
amdhsa without flat support.
Nikita Popov [Sun, 17 Mar 2019 21:25:26 +0000 (21:25 +0000)]
[ValueTracking] Use ConstantRange overflow check for signed add; NFC
This is the same change as rL356290, but for signed add. It replaces
the existing ripple logic with the overflow logic in ConstantRange.
This is NFC in that it should return NeverOverflow in exactly the
same cases as the previous implementation. However, it does make
computeOverflowForSignedAdd() more powerful by now also determining
AlwaysOverflows conditions. As none of its consumers handle this yet,
this has no impact on optimization. Making use of AlwaysOverflows
in with.overflow folding will be handled as a followup.
Craig Topper [Sun, 17 Mar 2019 21:21:37 +0000 (21:21 +0000)]
[X86] Remove the _alt forms of XOP VPCOM instructions. Use a combination of custom printing and custom parsing to achieve the same result and more
Previously we had a regular form of the instruction used when the immediate was 0-7. And _alt form that allowed the full 8 bit immediate. Codegen would always use the 0-7 form since the immediate was always checked to be in range. Assembly parsing would use the 0-7 form when a mnemonic like vpcomtrueb was used. If the immediate was specified directly the _alt form was used. The disassembler would prefer to use the 0-7 form instruction when the immediate was in range and the _alt form otherwise. This way disassembly would print the most readable form when possible.
The assembly parsing for things like vpcomtrueb relied on splitting the mnemonic into 3 pieces. A "vpcom" prefix, an immediate representing the "true", and a suffix of "b". The tablegenerated printing code would similarly print a "vpcom" prefix, decode the immediate into a string, and then print "b".
The _alt form on the other hand parsed and printed like any other instruction with no specialness.
With this patch we drop to one form and solve the disassembly printing issue by doing custom printing when the immediate is 0-7. The parsing code has been tweaked to turn "vpcomtrueb" into "vpcomb" and then the immediate for the "true" is inserted either before or after the other operands depending on at&t or intel syntax.
I'd rather not do the custom printing, but I tried using an InstAlias for each possible mnemonic for all 8 immediates for all 16 combinations of element size, signedness, and memory/register. The code emitted into printAliasInstr ended up checking the number of operands, the register class of each operand, and the immediate for all 256 aliases. This was repeated for both the at&t and intel printer. Despite a lot of common checks between all of the aliases, when compiled with clang at least this commonality was not well optimized. Nor do all the checks seem necessary. Since I want to do a similar thing for vcmpps/pd/ss/sd which have 32 immediate values and 3 encoding flavors, 3 register sizes, etc. This didn't seem to scale well for clang binary size. So custom printing seemed a better trade off.
I also considered just using the InstAlias for the matching and not the printing. But that seemed like it would add a lot of extra rows to the matcher table. Especially given that the 32 immediates for vpcmpps have 46 strings associated with them.
Tim Renouf [Sun, 17 Mar 2019 20:48:54 +0000 (20:48 +0000)]
[ARM] Fixed an assumption of power-of-2 vector MVT
I am about to introduce some non-power-of-2 width vector MVTs. This
commit fixes a power-of-2 assumption that my forthcoming change would
otherwise break, as shown by test/CodeGen/ARM/vcvt_combine.ll and
vdiv_combine.ll.
Nikita Popov [Sun, 17 Mar 2019 20:24:02 +0000 (20:24 +0000)]
[ConstantRange] Add fromKnownBits() method
Following the suggestion in D59450, I'm moving the code for constructing
a ConstantRange from KnownBits out of ValueTracking, which also allows us
to test this code independently.
I'm adding this method to ConstantRange rather than KnownBits (which
would have been a bit nicer API wise) to avoid creating a dependency
from Support to IR, where ConstantRange lives.
Sanjay Patel [Sun, 17 Mar 2019 19:08:00 +0000 (19:08 +0000)]
[InstCombine] canonicalize rotate right by constant to rotate left
This was noted as a backend problem:
https://bugs.llvm.org/show_bug.cgi?id=41057
...and subsequently fixed for x86:
rL356121
But we should canonicalize these in IR for the benefit of all targets
and improve IR analysis such as CSE.
David Green [Sun, 17 Mar 2019 16:11:22 +0000 (16:11 +0000)]
[ARM] Search backwards for CMP when combining into CBZ
The constant island pass currently only looks at the instruction immediately
before a branch for a CMP to fold into a CBZ/CBNZ. This extends it to search
backwards for the instruction that defines CPSR. We need to ensure that the
register is not overridden between the CMP and the branch.
Nikita Popov [Sun, 17 Mar 2019 15:45:38 +0000 (15:45 +0000)]
[DAGCombine] Fold (x & ~y) | y patterns
Fold (x & ~y) | y and it's four commuted variants to x | y. This pattern
can in particular appear when a vselect c, x, -1 is expanded to
(x & ~c) | (-1 & c) and combined to (x & ~c) | c.
This change has some overlap with D59066, which avoids creating a
vselect of this form in the first place during uaddsat expansion.
Sanjay Patel [Sun, 17 Mar 2019 14:57:40 +0000 (14:57 +0000)]
[TargetLowering] improve the default expansion of uaddsat/usubsat
This is a subset of what was proposed in:
D59006
...and may overlap with test changes from:
D59174
...but it seems like a good general optimization to turn selects
into bitwise-logic when possible because we never know exactly
what can happen at this stage of DAG combining depending on how
the target has defined things.
Alex Bradbury [Sun, 17 Mar 2019 12:00:58 +0000 (12:00 +0000)]
[RISCV] Fix RISCVAsmParser::ParseRegister and add tests
RISCVAsmParser::ParseRegister is called from AsmParser::parseRegisterOrNumber,
which in turn is called when processing CFI directives. The RISC-V
implementation wasn't setting RegNo, and so was incorrect. This patch address
that and adds cfi directive tests that demonstrate the fix. A follow-up patch
will factor out the register parsing logic shared between ParseRegister and
parseRegister.
Simon Pilgrim [Sat, 16 Mar 2019 17:36:26 +0000 (17:36 +0000)]
[DAGCombine] combineShuffleOfScalars - handle non-zero SCALAR_TO_VECTOR indices (PR41097)
rL356292 reduces the size of scalar_to_vector if we know the upper bits are undef - which means that shuffles may find they are suddenly referencing scalar_to_vector elements other than zero - so make sure we handle this as undef.
Yonghong Song [Sat, 16 Mar 2019 15:36:31 +0000 (15:36 +0000)]
[BPF] Add BTF Var and DataSec Support
Two new kinds, BTF_KIND_VAR and BTF_KIND_DATASEC, are added.
BTF_KIND_VAR has the following specification:
btf_type.name: var name
btf_type.info: type kind
btf_type.type: var type
// btf_type is followed by one u32
u32: varinfo (currently, only 0 - static, 1 - global allocated in elf sections)
Not all globals are supported in this patch. The following globals are supported:
. static variables with or without section attributes
. global variables with section attributes
The inclusion of globals with section attributes
is for future potential extraction of key/value
type id's from map definition.
BTF_KIND_DATASEC has the following specification:
btf_type.name: section name associated with variable or
one of .data/.bss/.readonly
btf_type.info: type kind and vlen for # of variables
btf_type.size: 0
#vlen number of the following:
u32: id of corresponding BTF_KIND_VAR
u32: in-session offset of the var
u32: the size of memory var occupied
At the time of debug info emission, the data section
size is unknown, so the btf_type.size = 0 for
BTF_KIND_DATASEC. The loader can patch it during
loading time.
The in-session offseet of the var is only available
for static variables. For global variables, the
loader neeeds to assign the global variable symbol value in
symbol table to in-section offset.
The size of memory is used to specify the amount of the
memory a variable occupies. Typically, it equals to
the type size, but for certain structures, e.g.,
struct tt {
int a;
int b;
char c[];
};
static volatile struct tt s2 = {3, 4, "abcdefghi"};
The static variable s2 has size of 20.
Note that for BTF_KIND_DATASEC name, the section name
does not contain object name. The compiler does have
input module name. For example, two cases below:
. clang -target bpf -O2 -g -c test.c
The compiler knows the input file (module) is test.c
and can generate sec name like test.data/test.bss etc.
. clang -target bpf -O2 -g -emit-llvm -c test.c -o - |
llc -march=bpf -filetype=obj -o test.o
The llc compiler has the input file as stdin, and
would generate something like stdin.data/stdin.bss etc.
which does not really make sense.
For any user specificed section name, e.g.,
static volatile int a __attribute__((section("id1")));
static volatile const int b __attribute__((section("id2")));
The DataSec with name "id1" and "id2" does not contain
information whether the section is readonly or not.
The loader needs to check the corresponding elf section
flags for such information.
A simple example:
-bash-4.4$ cat t.c
int g1;
int g2 = 3;
const int g3 = 4;
static volatile int s1;
struct tt {
int a;
int b;
char c[];
};
static volatile struct tt s2 = {3, 4, "abcdefghi"};
static volatile const int s3 = 4;
int m __attribute__((section("maps"), used)) = 4;
int test() { return g1 + g2 + g3 + s1 + s2.a + s3 + m; }
-bash-4.4$ clang -target bpf -O2 -g -S t.c
Checking t.s, 4 BTF_KIND_VAR's are generated (s1, s2, s3 and m).
4 BTF_KIND_DATASEC's are generated with names
".data", ".bss", ".rodata" and "maps".
Signed-off-by: Yonghong Song <yhs@fb.com>
Differential Revision: https://reviews.llvm.org/D59441
Heejin Ahn [Sat, 16 Mar 2019 05:38:57 +0000 (05:38 +0000)]
[WebAssembly] Make rethrow take an except_ref type argument
Summary:
In the new wasm EH proposal, `rethrow` takes an `except_ref` argument.
This change was missing in r352598.
This patch adds `llvm.wasm.rethrow.in.catch` intrinsic. This is an
intrinsic that's gonna eventually be lowered to wasm `rethrow`
instruction, but this intrinsic can appear only within a catchpad or a
cleanuppad scope. Also this intrinsic needs to be invokable - otherwise
EH pad successor for it will not be correctly generated in clang.
This also adds lowering logic for this intrinsic in
`SelectionDAGBuilder::visitInvoke`. This routine is basically a
specialized and simplified version of
`SelectionDAGBuilder::visitTargetIntrinsic`, but we can't use it
because if is only for `CallInst`s.
This deletes the previous `llvm.wasm.rethrow` intrinsic and related
tests, which was meant to be used within a `__cxa_rethrow` library
function. Turned out this needs some more logic, so the intrinsic for
this purpose will be added later.
LateEHPrepare takes a result value of `catch` and inserts it into
matching `rethrow` as an argument.
`RETHROW_IN_CATCH` is a pseudo instruction that serves as a link between
`llvm.wasm.rethrow.in.catch` and the real wasm `rethrow` instruction. To
generate a `rethrow` instruction, we need an `except_ref` argument,
which is generated from `catch` instruction. But `catch` instrutions are
added in LateEHPrepare pass, so we use `RETHROW_IN_CATCH`, which takes
no argument, until we are able to correctly lower it to `rethrow` in
LateEHPrepare.
Heejin Ahn [Sat, 16 Mar 2019 04:46:05 +0000 (04:46 +0000)]
[WebAssembly] Method order change in LateEHPrepare (NFC)
Summary:
Currently the order of these methods does not matter, but the following
CL needs to have this order changed. Merging the order change and the
semantics change within a CL complicates the diff, so submitting the
order change first.
Heejin Ahn [Sat, 16 Mar 2019 03:00:19 +0000 (03:00 +0000)]
[WebAssembly] Irreducible control flow rewrite
Summary:
Rewrite WebAssemblyFixIrreducibleControlFlow to a simpler and cleaner
design, which directly computes reachability and other properties
itself. This avoids previous complexity and bugs. (The new graph
analyses are very similar to how the Relooper algorithm would find loop
entries and so forth.)
This fixes a few bugs, including where we had a false positive and
thought fannkuch was irreducible when it was not, which made us much
larger and slower there, and a reverse bug where we missed
irreducibility. On fannkuch, we used to be 44% slower than asm2wasm and
are now 4% faster.
Fedor Sergeev [Fri, 15 Mar 2019 22:15:23 +0000 (22:15 +0000)]
[TimePasses] allow -time-passes reporting into a custom stream
TimePassesHandler object (implementation of time-passes for new pass manager)
gains ability to report into a stream customizable per-instance (per pipeline).
Intended use is to specify separate time-passes output stream per each compilation,
setting up TimePasses member of StandardInstrumentation during PassBuilder setup.
That allows to get independent non-overlapping pass-times reports for parallel
independent compilations (in JIT-like setups).
By default it still puts timing reports into the info-output-file stream
(created by CreateInfoOutputFile every time report is requested).
Unit-test added for non-default case, and it also allowed to discover that print() does not work
as declared - it did not reset the timers, leading to yet another report being printed into the default stream.
Fixed print() to actually reset timers according to what was declared in print's comments before.
Eli Friedman [Fri, 15 Mar 2019 21:44:49 +0000 (21:44 +0000)]
[ARM] Add MachineVerifier logic for some Thumb1 instructions.
tMOVr and tPUSH/tPOP/tPOP_RET have register constraints which can't be
expressed in TableGen, so check them explicitly. I've unfortunately run
into issues with both of these recently; hopefully this saves some time
for someone else in the future.
Roman Lebedev [Fri, 15 Mar 2019 21:17:53 +0000 (21:17 +0000)]
[X86] Promote i8 CMOV's (PR40965)
Summary:
@mclow.lists brought up this issue up in IRC, it came up during
implementation of libc++ `std::midpoint()` implementation (D59099)
https://godbolt.org/z/oLrHBP
Currently LLVM X86 backend only promotes i8 CMOV if it came from 2x`trunc`.
This differential proposes to always promote i8 CMOV.
There are several concerns here:
* Is this actually more performant, or is it just the ASM that looks cuter?
* Does this result in partial register stalls?
* What about branch predictor?
# Indeed, performance should be the main point here.
Let's look at a simple microbenchmark: {F8412076}
```
#include "benchmark/benchmark.h"
template <typename T>
static void CustomArguments(benchmark::internal::Benchmark* b) {
const size_t L2SizeBytes = 2 * 1024 * 1024;
// What is the largest range we can check to always fit within given L2 cache?
const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 /
/*maximal elt size*/ sizeof(T) / /*safety margin*/ 2;
b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN);
}
// Both of the values are random.
// The comparison is unpredictable.
BENCHMARK_TEMPLATE(BM_StdMidpoint, int32_t, RandRand)
->Apply(CustomArguments<int32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, RandRand)
->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int64_t, RandRand)
->Apply(CustomArguments<int64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, RandRand)
->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int16_t, RandRand)
->Apply(CustomArguments<int16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, RandRand)
->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int8_t, RandRand)
->Apply(CustomArguments<int8_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, RandRand)
->Apply(CustomArguments<uint8_t>);
// One value is always zero, and another is bigger or equal than zero.
// The comparison is predictable.
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, ZeroRand)
->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, ZeroRand)
->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, ZeroRand)
->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, ZeroRand)
->Apply(CustomArguments<uint8_t>);
```
What can we tell from the benchmark?
* `BM_StdMidpoint<[u]int8_t, RandRand>` indeed has the worst performance.
* All `BM_StdMidpoint<uint{8,16,32}_t, ZeroRand>` are all performant, even the 8-bit case.
That is because there we are computing mid point between zero and some random number,
thus if the branch predictor is in use, it is in optimal situation.
* Promoting 8-bit CMOV did improve performance of `BM_StdMidpoint<[u]int8_t, RandRand>`, by -59%..-64%.
# What about branch predictor?
* `BM_StdMidpoint<uint8_t, ZeroRand>` was faster than `BM_StdMidpoint<uint{16,32,64}_t, ZeroRand>`,
which may mean that well-predicted branch is better than `cmov`.
* Promoting 8-bit CMOV degraded performance of `BM_StdMidpoint<uint8_t, ZeroRand>`,
`cmov` is up to +10% worse than well-predicted branch.
* However, i do not believe this is a concern. If the branch is well predicted, then the PGO
will also say that it is well predicted, and LLVM will happily expand cmov back into branch:
https://godbolt.org/z/P5ufig
# What about partial register stalls?
I'm not really able to answer that.
What i can say is that if the branch is unpredictable (if it is predictable, then use PGO and you'll have branch)
in ~50% of cases you will have to pay branch misprediction penalty.
```
$ grep -i MispredictPenalty X86Sched*.td
X86SchedBroadwell.td: let MispredictPenalty = 16;
X86SchedHaswell.td: let MispredictPenalty = 16;
X86SchedSandyBridge.td: let MispredictPenalty = 16;
X86SchedSkylakeClient.td: let MispredictPenalty = 14;
X86SchedSkylakeServer.td: let MispredictPenalty = 14;
X86ScheduleBdVer2.td: let MispredictPenalty = 20; // Minimum branch misdirection penalty.
X86ScheduleBtVer2.td: let MispredictPenalty = 14; // Minimum branch misdirection penalty
X86ScheduleSLM.td: let MispredictPenalty = 10;
X86ScheduleZnver1.td: let MispredictPenalty = 17;
```
.. which it can be as small as 10 cycles and as large as 20 cycles.
Partial register stalls do not seem to be an issue for AMD CPU's.
For intel CPU's, they should be around ~5 cycles?
Is that actually an issue here? I'm not sure.
In short, i'd say this is an improvement, at least on this microbenchmark.
Nikita Popov [Fri, 15 Mar 2019 21:04:34 +0000 (21:04 +0000)]
[AArch64] Turn BIC immediate creation into a DAG combine
Switch BIC immediate creation for vector ANDs from custom lowering
to a DAG combine, which gives generic DAG combines a change to
apply first. In particular this avoids (and x, -1) being turned into
a (bic x, 0) instead of being eliminated entirely.
Changpeng Fang [Fri, 15 Mar 2019 21:02:48 +0000 (21:02 +0000)]
AMDGPU: Fix a SIAnnotateControlFlow issue when there are multiple backedges.
Summary:
At the exit of the loop, the compiler uses a register to remember and accumulate
the number of threads that have already exited. When all active threads exit the
loop, this register is used to restore the exec mask, and the execution continues
for the post loop code.
When there is a "continue" in the loop, the compiler made a mistake to reset the
register to 0 when the "continue" backedge is taken. This will result in some
threads not executing the post loop code as they are supposed to.
Philip Reames [Fri, 15 Mar 2019 19:54:06 +0000 (19:54 +0000)]
[SimplifyDemandedVec] Strengthen handling all undef lanes (particularly GEPs)
A change of two parts:
1) A generic enhancement for all callers of SDVE to exploit the fact that if all lanes are undef, the result is undef.
2) A GEP specific piece to strengthen/fix the vector index undef element handling, and call into the generic infrastructure when visiting the GEP.
The result is that we replace a vector gep with at least one undef in each lane with a undef. We can also do the same for vector intrinsics. Once the masked.load patch (D57372) has landed, I'll update to include call tests as well.
Nikita Popov [Fri, 15 Mar 2019 18:37:45 +0000 (18:37 +0000)]
[ValueTracking] Use ConstantRange overflow checks for unsigned add/sub; NFC
Use the methods introduced in rL356276 to implement the
computeOverflowForUnsigned(Add|Sub) functions in ValueTracking, by
converting the KnownBits into a ConstantRange.
This is NFC: The existing KnownBits based implementation uses the same
logic as the the ConstantRange based one. This is not the case for the
signed equivalents, so I'm only changing unsigned here.
This is in preparation for D59386, which will also intersect the
computeConstantRange() result into the range determined from KnownBits.
Philip Reames [Fri, 15 Mar 2019 17:50:30 +0000 (17:50 +0000)]
[X86][GlobalISEL] Support lowering aligned unordered atomics
The existing lowering code is accidentally correct for unordered atomics as far as I can tell. An unordered atomic has no memory ordering, and simply requires the actual load or store to be done as a single well aligned instruction. As such, relax the restriction while adding tests to ensure the lowering remains correct in the future.
Yonghong Song [Fri, 15 Mar 2019 17:39:10 +0000 (17:39 +0000)]
[BPF] handle external global properly
Previous commit 6bc58e6d3dbd ("[BPF] do not generate unused local/global types")
tried to exclude global variable from type generation. The condition is:
if (Global.hasExternalLinkage())
continue;
This is not right. It also excluded initialized globals.
The correct condition (from AssemblyWriter::printGlobal()) is:
if (!GV->hasInitializer() && GV->hasExternalLinkage())
Out << "external ";
Let us do the same in BTF type generation. Also added a test for it.
Signed-off-by: Yonghong Song <yhs@fb.com>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356279 91177308-0d34-0410-b5e6-96231b3b80d8
Nikita Popov [Fri, 15 Mar 2019 17:29:05 +0000 (17:29 +0000)]
[ConstantRange] Add overflow check helpers
Add functions to ConstantRange that determine whether the
unsigned/signed addition/subtraction of two ConstantRanges
may/always/never overflows. This will allow checking overflow
conditions based on known constant ranges in addition to known bits.
I'm implementing these methods on ConstantRange to allow them to be
unit tested independently of any ValueTracking machinery. The tests
include exhaustive testing on 4-bit ranges, to make sure the result
is both conservatively correct and maximally precise.
The OverflowResult enum is redeclared on ConstantRange, because
I wanted to avoid a dependency in either direction between
ValueTracking.h and ConstantRange.h.
Pavel Labath [Fri, 15 Mar 2019 15:34:10 +0000 (15:34 +0000)]
YAMLIO: Improve endian type support
Summary:
Now that endian types support enumerations (D59141), the existing yaml
support for them is somewhat insufficient. The current solution was to
define the ScalarTraits class for these types, which always forwards to
the ScalarTraits of the underlying type. However, the enum types will
usually have ScalarEnumerationTraits of ScalarBitsetTraits.
In this patch I add the two extra Traits types to the endian types. In
order to properly SFINAE-ize them, I've also added an extra "Enable"
template argument to the Traits template classes.
Teresa Johnson [Fri, 15 Mar 2019 15:11:38 +0000 (15:11 +0000)]
[ThinLTO] Restructure AliasSummary to contain ValueInfo of Aliasee
Summary:
The AliasSummary previously contained the AliaseeGUID, which was only
populated when reading the summary from bitcode. This patch changes it
to instead hold the ValueInfo of the aliasee, and always populates it.
This enables more efficient access to the ValueInfo (specifically in the
recent patch r352438 which needed to perform an index hash table lookup
using the aliasee GUID).
As noted in the comments in AliasSummary, we no longer technically need
to keep a pointer to the corresponding aliasee summary, since it could
be obtained by walking the list of summaries on the ValueInfo looking
for the summary in the same module. However, I am concerned that this
would be inefficient when walking through the index during the thin
link for various analyses. That can be reevaluated in the future.
By always populating this new field, we can remove the guard and special
handling for a 0 aliasee GUID when dumping the dot graph of the summary.
An additional improvement in this patch is when reading the summaries
from LLVM assembly we now set the AliaseeSummary field to the aliasee
summary in that same module, which makes it consistent with the behavior
when reading the summary from bitcode.
Mikael Holmen [Fri, 15 Mar 2019 13:51:05 +0000 (13:51 +0000)]
[CodeGenPrepare] avoid crashing from replacing a phi twice
Summary:
This is a fix to bug 41052:
https://bugs.llvm.org/show_bug.cgi?id=41052
While trying to optimize a memory instruction in a dead basic block, we end up registering the same phi for replacement twice. This patch avoids registering more than the first replacement candidate for a phi.
Michael Liao [Fri, 15 Mar 2019 12:42:21 +0000 (12:42 +0000)]
[AMDGPU] Fix SGPR fixing through SCC chaining
Summary:
- During the fixing of SGPR copying from VGPR, ensure users of SCC is
properly propagated, i.e.
* only propagate through live def of SCC,
* skip the SCC-def inst itself, and
* stop the propagation on the other SCC-def inst after checking its
SCC-use first.
Florian Hahn [Fri, 15 Mar 2019 12:17:36 +0000 (12:17 +0000)]
[LSR] Check for signed overflow in NarrowSearchSpaceByDetectingSupersets.
We are adding a sign extended IR value to an int64_t, which can cause
signed overflows, as in the attached test case, where we have a formula
with BaseOffset = -1 and a constant with numeric_limits<int64_t>::min().
If the addition would overflow, skip the simplification for this
formula. Note that the target triple is required to trigger the failure.
James Henderson [Fri, 15 Mar 2019 10:35:27 +0000 (10:35 +0000)]
[yaml2obj]Allow explicit setting of p_filesz, p_memsz, and p_offset
yaml2obj currently derives the p_filesz, p_memsz, and p_offset values of
program headers from their sections. This makes writing tests for
certain formats more complex, and sometimes impossible. This patch
allows setting these fields explicitly, overriding the default value,
when relevant.