The approach I took only works for functions marked `noreturn`. In
general, a call that is not known to be noreturn may be followed by
unreachable for other reasons. For example, there could be multiple call
sites to a function that throws sometimes, and at some call sites, it is
known to always throw, so it is followed by unreachable. We need to
insert an `int3` in these cases to pacify the Windows unwinder.
I think this probably deserves its own standalone, Win64-only fixup pass
that runs after block placement. Implementing that will take some time,
so let's revert to TrapUnreachable in the mean time.
[WebAssembly] Compare functions by names in Emscripten Sjlj
Summary:
This removes all string constants for function names and compares
functions by string directly when needed. Many of these constants are
used only once or twice so the benefit of defining them separately is
not very clear, and this actually fixes a bug.
When we already have a `malloc` declaration which is an alias to
something else within the module,
```
@malloc = weak hidden alias i8* (i32), i8* (i32)* @dlmalloc
```
(this happens compiling with emscripten with `-s WASM_OBJECT_FILES=0`
because all bc files are merged before being fed into `wasm-ld` which
runs the backend optimizations as LTO)
`Module::getFunction("malloc")` in `canLongjmp` returns `nullptr`
because `Module::getFunction` dyncasts pointer into `Function`, but the
alias is a `GlobalValue` but not a `Function`. This makes `canLongjmp`
return false for `malloc` in this case, and we end up adding a lot of
longjmp handling code around malloc. This is not only a code size
increase but actually a bug because `malloc` is used in the entry block
when preparing for setjmp tables for emscripten sjlj handling, and this
makes initial setjmp preparation, which has to happen in the entry
block, move to another split block, and this interferes with SSA update
later.
This also adds two more functions, `getTempRet0` and `setTempRet0`, in
the list of not longjmp-able functions.
[llvm-profdata] Add mode to recover from profile read failures
Add a mode in which profile read errors are not immediately treated as
fatal. In this mode, merging makes forward progress and reports failure
only if no inputs can be read.
Philip Reames [Tue, 3 Sep 2019 21:56:17 +0000 (21:56 +0000)]
[GVN] Remove a todo introduced w/rL370791
When I dug into this, it turns out to be *much* more involved than I'd realized and doesn't actually simplify anything.
The general purpose of the leader table is that we want to find the most-dominating definition quickly. The problem for equivalance folding is slightly different; we want to find the most dominating *value* whose definition block dominates our use quickly.
To make this change, we'd end up having to restructure the leader table (either the sorting thereof, or maybe even introducing multiple leader tables per value) and that complexity is just not worth it.
[GlobalISel][CallLowering] Add support for splitting types according to calling conventions.
On AArch64, s128 types have to be split into s64 GPRs when passed as arguments.
This change adds the generic support in call lowering for dealing with multiple
registers, for incoming and outgoing args.
Support for splitting for return types not yet implemented.
Add the no-capture argument attribute deduction to the Attributor
fixpoint framework.
The new string attributed "no-capture-maybe-returned" is introduced to
allow deduction of no-capture through functions that "capture" an
argument but only by "returning" it. It is only used by the Attributor
for testing.
[CodeGen] Use FSHR in DAGTypeLegalizer::ExpandIntRes_MULFIX
Summary:
Simplify the right shift of the intermediate result (given
in four parts) by using funnel shift.
There are some impact on lit tests, but that seems to be
related to register allocation differences due to how FSHR
is expanded on X86 (giving a slightly different operand order
for the OR operations compared to the old code).
[MC] Pass through .code16/32/64 and .syntax unified for COFF
These flags should simply be passed through to the target, which will do
the right thing. Add an MC/X86 test that uses these directives with the
three primary object file formats and shows that they disassemble the
same everywhere.
There is a missing test for .code32 on Windows ARM, since I'm not sure
exactly how to construct one.
Philip Reames [Tue, 3 Sep 2019 17:31:19 +0000 (17:31 +0000)]
[GVN] Propagate simple equalities from assumes within the tail of the block
This extends the existing logic for propagating constant expressions in an analogous manner for what we do across basic blocks. The core point is that we chose some order of operands, and canonicalize uses towards that one.
The heuristic used is inspired by the one used across blocks; in a follow up change, I'd plan to common them so that the cross block version uses the slightly stronger ordering herein.
As noted by the TODOs in the code, there's a good amount of room for improving the existing code and making it more powerful. Some follow up work planned.
David Green [Tue, 3 Sep 2019 11:30:54 +0000 (11:30 +0000)]
[ARM] Ignore Implicit CPSR regs when lowering from Machine to MC operands
The code here seems to date back to r134705, when tablegen lowering was first
being added. I don't believe that we need to include CPSR implicit operands on
the MCInst. This now works more like other backends (like AArch64), where all
implicit registers are skipped.
This allows the AliasInst for CSEL's to match correctly, as can be seen in the
test changes.
David Green [Tue, 3 Sep 2019 11:06:24 +0000 (11:06 +0000)]
[ARM] Invert CSEL predicates if the opposite is a simpler constant to materialise
This moves ConstantMaterializationCost into ARMBaseInstrInfo so that it can
also be used in ISel Lowering, adding codesize values to the computed costs, to
be able to compare either approximate instruction counts or codesize costs.
It also adds a HasLowerConstantMaterializationCost, which compares the
ConstantMaterializationCost of two values, returning true if the first is
smaller either in instruction count/codesize, or falling back to the other in
the case that they are equal.
This is used in constant CSEL lowering to invert the predicate if the opposite
is easier to materialise.
David Green [Tue, 3 Sep 2019 10:53:07 +0000 (10:53 +0000)]
[ARM] Generate 8.1-m CSINC, CSNEG and CSINV instructions.
Arm 8.1-M adds a number of related CSEL instructions, including CSINC, CSNEG and CSINV. These choose between two values given the content in CPSR and a condition, performing an increment, negation or inverse of the false value.
This adds some selection for them, either from constant values or patterns. It does not include CSEL directly, which is currently not always making code better. It is still useful, but we will have to check more carefully where it should and shouldn't be used.
Code by Ranjeet Singh and Simon Tatham, with some modifications from me.
Simon Atanasyan [Tue, 3 Sep 2019 10:24:07 +0000 (10:24 +0000)]
[mips] Switch to the `.text` section after emitting asm file preamble
Now the last `.section` directive in the MIPS asm file preamble
is the `.section .mdebug.abi`. If assembler code injected for example
by the LLVM `module asm` or the C ` __asm` directives do not contain
explicit switching to the `.text` section it goes to the `.mdebug.abi`
section. It might be unexpected to the user and in fact for example
breaks building some existing code like FreeBSD libc [1].
The patch forces switching to the `.text` section after emitting MIPS
assembler file preamble.
David Green [Tue, 3 Sep 2019 09:57:02 +0000 (09:57 +0000)]
[ARM] Fix MVE ldst offset ranges
We were using isShiftedInt<7, Shift>(RHSC) to detect the ranges of offsets to
fold into MVE loads/stores. The instructions actually take a 7 bit unsigned
integer which is either added or subtracted. So something more like
isShiftedUInt<7, Shift>(abs(RHSC)).
Instead I've changes this to use the isScaledConstantInRange method, same as in
SelectT2AddrModeImm7Offset used by pre/post inc, which seemed to already be
getting this correct.
Oliver Stannard [Tue, 3 Sep 2019 09:51:19 +0000 (09:51 +0000)]
Bug fix on function epilog optimization (ARM backend)
To save a 'add sp,#val' instruction by adding registers to the final pop instruction,
the first register transferred by this pop instruction need to be found.
If the function to be optimized has a non-void return value, the operand list contains
r0 (implicit) which prevents the optimization to take place.
Therefore implicit register references should be skipped in the search loop,
because this registers are never popped from the stack.
[LV] Fix miscompiles by adding non-header PHI nodes to AllowedExit
Summary:
Fold-tail currently supports reduction last-vector-value live-out's,
but has yet to support last-scalar-value live-outs, including
non-header phi's. As it relies on AllowedExit in order to detect
them and bail out we need to add the non-header PHI nodes to
AllowedExit, otherwise we end up with miscompiles.
James Molloy [Tue, 3 Sep 2019 08:20:31 +0000 (08:20 +0000)]
[MachinePipeliner] Add a way to unit-test the schedule emitter
Emitting a schedule is really hard. There are lots of corner cases to take care of; in fact, of the 60+ SWP-specific testcases in the Hexagon backend most of those are testing codegen rather than the schedule creation itself.
One issue is that to test an emission corner case we must craft an input such that the generated schedule uses that corner case; sometimes this is very hard and convolutes testcases. Other times it is impossible but we want to test it anyway.
This patch adds a simple test pass that will consume a module containing a loop and generate pipelined code from it. We use post-instr-symbols as a way to annotate instructions with the stage and cycle that we want to schedule them at.
We also provide a flag that causes the MachinePipeliner to generate these annotations instead of actually emitting code; this allows us to generate an input testcase with:
[X86] Simplify the setOperationAction handling for fp_to_uint by improving the Custom handler a bit.
This merges the 32-bit and 64-bit mode code to just use Custom
for both i32 and i64. We already had most of the handling in
the custom handling due to the AVX512 having legal fp_to_uint.
Just needed to add the i32->i64 promotion handling. Refactor
the fp_to_uint code in the custom handler to simplify the
number of times we check things.
Tweak cost model tables to match the default handling we were
getting due to Expand before.
[X86] Don't use Expand for i32 fp_to_uint on SSE1/2 targets on 32-bit target.
Use Custom lowering instead. Fall back to default expansion only
when the scalar FP type belongs in an XMM register. This improves
lowering for i32 to fp80, and also i32 to double on SSE1 only.
[X86] Enable fp128 as a legal type with SSE1 rather than with MMX.
FP128 values are passed in xmm registers so should be asssociated
with an SSE feature rather than MMX which uses a different set
of registers.
llc enables sse1 and sse2 by default with x86_64. But does not
enable mmx. Clang enables all 3 features by default.
I've tried to add command lines to test with -sse
where possible, but any test that returns a value in an xmm
register fails with a fatal error with -sse since we have no
defined ABI for that scenario.
Summary:
Adds the following inline asm constraints for SVE:
- w: SVE vector register with full range, Z0 to Z31
- x: Restricted to registers Z0 to Z15 inclusive.
- y: Restricted to registers Z0 to Z7 inclusive.
This change also adds the "z" modifier to interpret a register as an SVE register.
Not all of the bitconvert patterns added by this patch are used, but they have been included here for completeness.
George Rimar [Mon, 2 Sep 2019 14:57:35 +0000 (14:57 +0000)]
Recommit r370661 "[llvm-nm] - Add a test case for case when we dump a symbol that belongs to a section with a broken sh_name."
Fix: add a 'consumeError()' call to ObjectFile.cpp.
This error was never checked.
Original commit message:
It adds a test case for a problem fixed by D66976 <https://reviews.llvm.org/D66976>.
It was introduced by me in D66089 <https://reviews.llvm.org/D66089>.
The error reported was never consumed because of a wrong variable name used,
so it could fail when LLVM_ENABLE_ABI_BREAKING_CHECKS is used.
[DAGCombiner] try to form test+set out of shift+mask patterns
The motivating bugs are:
https://bugs.llvm.org/show_bug.cgi?id=41340
https://bugs.llvm.org/show_bug.cgi?id=42697
As discussed there, we could view this as a failure of IR canonicalization,
but then we would need to implement a backend fixup with target overrides
to get this right in all cases. Instead, we can just view this as a codegen
opportunity. It's not even clear for x86 exactly when we should favor
test+set; some CPUs have better theoretical throughput for the ALU ops than
bt/test.
This patch is made more complicated than I expected because there's an early
DAGCombine for 'and' that can change types of the intermediate ops via
trunc+anyext.
Jay Foad [Mon, 2 Sep 2019 14:40:57 +0000 (14:40 +0000)]
Partially revert D61491 "AMDGPU: Be explicit about whether the high-word in SI_PC_ADD_REL_OFFSET is 0"
Summary:
D61491 caused us to use relocs when they're not strictly necessary, to
refer to symbols in the text section. This is a pessimization and it's a
problem for some loaders that don't support relocs yet.
Summary:
Commit r366897 introduced the possibility to set a variable from an
expression, such as [[#VAR2:VAR1+3]]. While introducing this feature, it
introduced extra logic to allow using such a variable on the same line
later on. Unfortunately that extra logic is flawed as it relies on a
mapping from variable to expression defining it when the mapping is from
variable definition to expression. This flaw causes among other issues
PR42896.
This commit avoids the problem by forbidding all use of a variable
defined on the same line, and removes the now useless logic. Redesign
will be done in a later commit because it will require some amount of
refactoring first for the solution to be clean. One example is the need
for some sort of transaction mechanism to set a variable temporarily and
from an expression and rollback if the CHECK pattern does not match so
that diagnostics show the right variable values.
George Rimar [Mon, 2 Sep 2019 13:54:45 +0000 (13:54 +0000)]
[llvm-nm] - Add a test case for case when we dump a symbol that belongs to a section with a broken sh_name.
It adds a test case for a problem fixed by D66976.
It was introduced by me in D66089.
The error reported was never consumed because of a wrong variable name used,
so it could fail when LLVM_ENABLE_ABI_BREAKING_CHECKS is used.
[InstCombine] recognize bswap disguised as shufflevector
bitcast <N x i8> (shuf X, undef, <N, N-1,...0>) to i{N*8} --> bswap (bitcast X to i{N*8})
In PR43146:
https://bugs.llvm.org/show_bug.cgi?id=43146
...we have a more complicated case where SLP is making a mess of bswap. This patch won't
do anything for that currently, but we need to improve bswap recognition in instcombine,
SLP, and/or a standalone pass to avoid that problem.
This is limited using the data-layout so we don't try to do this transform with actual
vector types. The backend does not appear to have folds to convert in either direction,
so we don't want to mess up something that is actually better lowered as a shuffle.
Martin Storsjo [Mon, 2 Sep 2019 13:28:07 +0000 (13:28 +0000)]
[llvm-dlltool] Remove support for implying output name
I don't see GNU dlltool supporting doing this; with only a -d option
and no -l option, GNU dlltool runs successfully but doesn't write any
output file.
[X86][BtVer2] Fix latency and throughput of conditional SIMD store instructions.
On BtVer2 conditional SIMD stores are heavily microcoded.
The latency is directly proportional to the number of packed elements extracted
from the input vector. Also, according to micro-benchmarks, most of the
computation seems to be done in the integer unit.
Only a minority of the uOPs is executed by the FPU. The observed behaviour on
the FPU looks similar to this:
- The input MASK value is moved to the Integer Unit
-- [ a VMOVMSK-like uOP-executed on JFPU0].
- In parallel, each element of the input XMM/YMM is extracted and then sent to
the IntegerUnit through JFPU1.
As expected, a (conditional) store is executed for every extracted element.
Interestingly, a (speculative) load is executed for every extracted element too.
It is as-if a "LOAD - BIT_EXTRACT- CMOV" sequence of uOPs is repeated by the
integer unit for every contionally stored element.
VMASKMOVDQU is a special case: the number of speculative loads is always 2
(presumably, one load per quadword). That means, extra shifts and masking is
performed on (one of) the loaded quadwords before each conditional store (that
also explains the big number of non-FP uOPs retired).
This patch replaces the existing writes for conditional SIMD stores (i.e.
WriteFMaskedStore, and WriteFMaskedStoreY) with the following new writes:
Added a wrapper class named X86SchedWriteMaskMove in X86Schedule.td to describe
both RM and MR variants for conditional SIMD moves in a single tablegen
definition.
Instances of that class are then passed in input to multiclass avx_movmask_rm
when constructing MASKMOVPS/PD definitions.
Since this patch introduces new writes, I had to update all the X86 scheduling
models.
Jeremy Morse [Mon, 2 Sep 2019 12:28:36 +0000 (12:28 +0000)]
[DebugInfo] LiveDebugValues: correctly discriminate kinds of variable locations
The missing line added by this patch ensures that only spilt variable
locations are candidates for being restored from the stack. Otherwise,
register or constant-value information can be interpreted as a spill
location, through a union.
The added regression test replicates a scenario where this occurs: the
stack load from [rsp] causes the register-location DBG_VALUE to be
"restored" to rsi, when it should be left alone. See PR43058 for details.
Un x-fail a test that was suffering from this from a previous patch.
[Clang Interpreter] Initial patch for the constexpr interpreter
Summary:
This patch introduces the skeleton of the constexpr interpreter,
capable of evaluating a simple constexpr functions consisting of
if statements. The interpreter is described in more detail in the
RFC. Further patches will add more features.
[X86] Add initial support for unfolding broadcast loads from arithmetic instructions to enable LICM hoisting of the load
MachineLICM can hoist an invariant load, but if that load is folded it needs to be unfolded. On AVX512 sometimes this load is an broadcast load which we were previously unable to unfold. This patch adds initial support for that with a very basic list of supported instructions as a starting point.
[DAGCombiner] improve throughput of shift+logic+shift
The motivating case for this is a long way from here:
https://bugs.llvm.org/show_bug.cgi?id=43146
...but I think this is where we have to start.
We need to canonicalize/optimize sequences of shift and logic to ease
pattern matching for things like bswap and improve perf in general.
But without the artificial limit of '!LegalTypes' (early combining),
there are a lot of test diffs, and not all are good.
In the minimal tests added for this proposal, x86 should have better
throughput in all cases. AArch64 is neutral for scalar tests because
it can fold shifts into bitwise logic ops.
There are 3 shift opcodes and 3 logic opcodes for a total of 9 possible patterns:
https://rise4fun.com/Alive/VlI
https://rise4fun.com/Alive/n1m
https://rise4fun.com/Alive/1Vn
Roman Lebedev [Sun, 1 Sep 2019 11:56:52 +0000 (11:56 +0000)]
[ConstantFolding] Fix 'undef' folding for @llvm.[us]{add,sub}.with.overflow ops (PR43188)
As we have already established/fixed in
https://bugs.llvm.org/show_bug.cgi?id=42209
https://reviews.llvm.org/D63065
https://reviews.llvm.org/rL363522
the InstSimplify handling for @llvm.with.overflow ops with undefs
is correct. Therefore if ConstantFolding produces different results,
then it is wrong.
This duplication of code hints at the need for some refactoring,
but for now address the brokenness of ConstantFolding by
copying the known-good handling from rL363522.
[TargetLowering] Fix Bugzilla ID 43183 to avoid soften comparison broken with constant inputs
Summary:
This fixes the bugzilla id 43183 which triggerd by the following commit:
[RISCV] Avoid generating AssertZext for LP64 ABI when lowering floating LibCall
[GlobalISel][NFC] Regression test cases for aarch64 legalizer (s128 sext+icmp).
There were legalizer asserts in aarch64 globalisel (in debug mode) with s128
sext+icmp before r367060 and r366943 landed. These are just a couple reduced
mir and ir regression tests that came from a build where these were encountered.
Craig Topper [Sat, 31 Aug 2019 23:52:25 +0000 (23:52 +0000)]
[X86] Replace some COPY_TO_REGCLASS from GR32/GR64 to VR128 in isel patterns with VMOVDI2PDIrr/VMOV64toPQIrr.
This is what the copies will eventually be turned into. We don't
use COPY_TO_REGCLASS for scalar_to_vector patterns. So we should
use the real instruction here too.
Simon Pilgrim [Sat, 31 Aug 2019 16:21:31 +0000 (16:21 +0000)]
[X86] EltsFromConsecutiveLoads - Don't confuse elt count with vector element count (PR43170)
EltsFromConsecutiveLoads was assuming that the number of input elts was the same as the number of elements in the output vector type when creating a zeroing shuffle, causing an assert when subvectors were being combined instead of just scalars.