Matt Arsenault [Thu, 20 Jun 2019 21:58:24 +0000 (21:58 +0000)]
AMDGPU: Always use s33 for global scratch wave offset
Every called function could possibly need this to calculate the
absolute address of stack objectst, and this avoids inserting a copy
around every call site in the kernel. It's also somewhat cleaner to
keep this in a callee saved SGPR.
Eli Friedman [Thu, 20 Jun 2019 21:56:47 +0000 (21:56 +0000)]
[ARM GlobalISel] Add support for s64 G_ADD and G_SUB.
Teach RegisterBankInfo to use the correct register class, and tell the
legalizer it's legal. Everything else just works.
The one thing that's slightly weird about this compared to SelectionDAG
isel is that legalization can't distinguish between i64 and <1 x i64>,
so we might end up with more NEON instructions than the user expects.
Rainer Orth [Thu, 20 Jun 2019 21:27:06 +0000 (21:27 +0000)]
[profile] Solaris ld supports __start___llvm_prof_data etc. labels
Currently, many profiling tests on Solaris FAIL like
Command Output (stderr):
--
Undefined first referenced
symbol in file
__llvm_profile_register_names_function /tmp/lit_tmp_Nqu4eh/infinite_loop-9dc638.o
__llvm_profile_register_function /tmp/lit_tmp_Nqu4eh/infinite_loop-9dc638.o
Solaris 11.4 ld supports the non-standard GNU ld extension of adding
__start_SECNAME and __stop_SECNAME labels to sections whose names are valid
as C identifiers. Given that we already use Solaris 11.4-only features
like ld -z gnu-version-script-compat and fully working .preinit_array
support in compiler-rt, we don't need to worry about older versions of
Solaris ld.
The patch documents that support (although the comment in
lib/Transforms/Instrumentation/InstrProfiling.cpp
(needsRuntimeRegistrationOfSectionRange) is quite cryptic what it's
actually about), and adapts the affected testcase not to expect the
alternativeq __llvm_profile_register_functions and __llvm_profile_init.
It fixes all affected tests.
Alina Sbirlea [Thu, 20 Jun 2019 21:09:09 +0000 (21:09 +0000)]
[LICM & MSSA] Limit unsafe sinking and hoisting.
Summary:
The getClobberingMemoryAccess API checks for clobbering accesses in a loop by walking the backedge. This may check if a memory access is being
clobbered by the loop in a previous iteration, depending how smart AA got over the course of the updates in MemorySSA (it does not occur when built from scratch).
If no clobbering access is found inside the loop, it will optimize to an access outside the loop. This however does not mean that access is safe to sink.
Given:
```
for i
load a[i]
store a[i]
```
The access corresponding to the load can be optimized to outside the loop, and the load can be hoisted. But it is incorrect to sink it.
In order to sink the load, we'd need to check no Def clobbers the Use in the same iteration. With this patch we currently restrict sinking to either
Defs not existing in the loop, or Defs preceding the load in the same block. An easy extension is to ensure the load (Use) post-dominates all Defs.
Caught by PR42294.
This issue also shed light on the converse problem: hoisting stores in this same scenario would be illegal. With this patch we restrict
hoisting of stores to the case when their corresponding Defs are dominating all Uses in the loop.
Sanjay Patel [Thu, 20 Jun 2019 21:04:14 +0000 (21:04 +0000)]
[InstSimplify] add tests for known-not-a-power-of-2; NFC
I added a canonicalization to create this general pattern in:
rL363956
But as noted in PR42314:
https://bugs.llvm.org/show_bug.cgi?id=42314#c11
...we have a (potentially expensive) simplification for the version
of the code that we just canonicalized away from, so we should
add/adjust that code to match.
Leonard Chan [Thu, 20 Jun 2019 19:44:51 +0000 (19:44 +0000)]
[clang][NewPM] Do not eliminate available_externally durng `-O2 -flto` runs
This fixes CodeGen/available-externally-suppress.c when the new pass manager is
turned on by default. available_externally was not emitted during -O2 -flto
runs when it should still be retained for link time inlining purposes. This can
be fixed by checking that we aren't LTOPrelinking when adding the
EliminateAvailableExternallyPass.
Philip Reames [Thu, 20 Jun 2019 18:45:06 +0000 (18:45 +0000)]
[LFTR] Fix a (latent?) bug related to nested loops
I can't actually come up with a test case this triggers on without an out of tree change, but in theory, it's a bug in the recently added multiple exit LFTR support. The root issue is that an exiting block common to two loops can (in theory) have computable exit counts for both loops. Rewriting the exit of an inner loop in terms of the outer loops IV would cause the inner loop to either a) run forever, or b) terminate on the first iteration.
In practice, we appear to get lucky and not have the exit count computable for the outer loop, except when it's trivially zero. Given we bail on zero exit counts, we don't appear to ever trigger this. But I can't come up with a reason we *can't* compute an exit count for the outer loop on the common exiting block, so this may very well be triggering in some cases.
Craig Topper [Thu, 20 Jun 2019 17:52:53 +0000 (17:52 +0000)]
[X86] Add BLSI to isUseDefConvertible.
Summary:
BLSI sets the C flag is the input is not zero. So if its followed
by a TEST of the input where only the Z flag is consumed, we can
replace it with the opposite check of the C flag.
We should be able to do the same for BLSMSK and BLSR, but the
naive test case for those is being optimized to a subo by
CodeGenPrepare.
Sanjay Patel [Thu, 20 Jun 2019 17:41:15 +0000 (17:41 +0000)]
[InstCombine] canonicalize check for power-of-2
The form that compares against 0 is better because:
1. It removes a use of the input value.
2. It's the more standard form for this pattern: https://graphics.stanford.edu/~seander/bithacks.html#DetermineIfPowerOf2
3. It results in equal or better codegen (tested with x86, AArch64, ARM, PowerPC, MIPS).
This is a root cause for PR42314, but probably doesn't completely answer the codegen request:
https://bugs.llvm.org/show_bug.cgi?id=42314
Philip Reames [Thu, 20 Jun 2019 17:16:53 +0000 (17:16 +0000)]
[Tests] Add a tricky LFTR case for documentation purposes
Thought of this case while working on something else. We appear to get it right in all of the variations I tried, but that's by accident. So, add a test which would catch the potential bug.
Matt Arsenault [Thu, 20 Jun 2019 17:03:23 +0000 (17:03 +0000)]
AMDGPU: Fix ignoring DisableFramePointerElim in leaf functions
The attribute can specify elimination for leaf or non-leaf, so it
should always be considered. I copied this bug from AArch64, which
probably should also be fixed.
Simon Tatham [Thu, 20 Jun 2019 15:16:56 +0000 (15:16 +0000)]
[ARM] Add a batch of MVE integer instructions.
This includes integer arithmetic of various kinds (add/sub/multiply,
saturating and not), and the immediate forms of VMOV and VMVN that
load an immediate into all lanes of a vector.
AMDGPU target needs to override getRegClass() used during
instruction selection. We now may have either 32 or 64 bit
conditional registers used in the same instructions. For
that purpose special SReg_1 register class is created which
is dynamically resolved to either SReg_64 or SGPR_32 depending
on the subtarget attributes.
George Rimar [Thu, 20 Jun 2019 14:44:48 +0000 (14:44 +0000)]
[yaml2obj] - Convert `ELFState<ELFT>::addSymbols` method to `toELFSymbols` helper. NFCI.
ELFState<ELFT>::addSymbols method looks a bit strange.
User code have to create the destination symbols vector outside,
add a null symbol and then pass it to addSymbols when it seems
the more natural logic is to isolate all work with symbols inside some
function, build the list right there and return it.
Simon Pilgrim [Thu, 20 Jun 2019 12:48:49 +0000 (12:48 +0000)]
[DAGCombine] Add TODOs for some combines that should support non-uniform vectors
We tend to only test for scalar/scalar consts when really we could support non-uniform vectors using ISD::matchUnaryPredicate/matchBinaryPredicate etc.
Sjoerd Meijer [Thu, 20 Jun 2019 09:33:11 +0000 (09:33 +0000)]
TargetParserTest.ARMExtensionFeatures run out of memory on 32-bit (PR42316)
Nothing of these tests made much sense. Loops were iterating too much, and I
also don't think it was actually testing anything. I think we simply want to
check that AEK_SOME_EXT returns "+some_ext".
I've given the AArch64 tests the same treatment as they very similarly didn't
made any sense either.
Craig Topper [Thu, 20 Jun 2019 04:58:40 +0000 (04:58 +0000)]
[X86] Remove memory instructions form isUseDefConvertible.
The caller of this is looking for comparisons of the input
to these instructions with 0. But the memory instructions
input is an addess not a value input in a register.
Matt Arsenault [Thu, 20 Jun 2019 00:51:28 +0000 (00:51 +0000)]
AMDGPU: Don't clobber VCC in MUBUF addr64 emulation
Introducing VCC defs during SIFixSGPRCopies is generally
problematic. Avoid it by starting with the VOP3 form with the general
condition register. This is the easiest to fix instance, but doesn't
solve any specific problems I'm looking at.
Eli Friedman [Thu, 20 Jun 2019 00:29:40 +0000 (00:29 +0000)]
[llvm-objdump] Switch between ARM/Thumb based on mapping symbols.
The ARMDisassembler changes allow changing between ARM and Thumb mode
based on the MCSubtargetInfo, rather than the Target, which simplifies
the other changes a bit.
I'm not really happy with adding more target-specific logic to
tools/llvm-objdump/, but there isn't any easy way around it: the logic
in question specifically applies to disassembling an object file, and
that code simply isn't located in lib/Target, at least at the moment.
Matt Arsenault [Wed, 19 Jun 2019 23:54:58 +0000 (23:54 +0000)]
AMDGPU: Consolidate some getGeneration checks
This is incomplete, and ideally these would all be removed, but it's
better to localize them to the subtarget first with comments about
what they're for.
[FileCheck] Stop qualifying expressions as numeric
Summary:
Stop referring to "numeric expression", using simply the term
"expression" instead. Likewise for numeric operation since operations
are only used in numeric expressions.
Simon Pilgrim [Wed, 19 Jun 2019 22:14:24 +0000 (22:14 +0000)]
[DAGCombine] Use ConstantSDNode::getAPIntValue() instead of getZExtValue().
Use getAPIntValue() in a few more places. Most of the time getZExtValue() is fine, but occasionally there's fuzzed code or someone decides to create i65536 or something.....
Philip Reames [Wed, 19 Jun 2019 22:05:47 +0000 (22:05 +0000)]
[Util] Add a helper script for converting -print-before-all output into a file based equivelent
Simple little utility which takes a opt logfile generated with "opt -print-before-all -print-module-scope -o /dev/null <args> 2&>1", and splits into a series of individual "chunk-X.ll" files. The intended purpose is to help automate one step in failure reduction.
The imagined workflow is:
New crasher bug reported against clang or other frontend
Frontend run with -emit-llvm equivalent and manually confirmed that opt -O2 <emit.ll> crashes
Run this splitter script
Manually map pass name to invocation command (next on the to automate list)
Run bugpoint on last chunk file + manual command
I chose to dump every chunk rather than only the last since miscompile debugging frequently requires either manual step by step reduction, or cross feeding IR into different compiler versions. Not an immediate target, but there may be applications.
Philip Reames [Wed, 19 Jun 2019 21:58:25 +0000 (21:58 +0000)]
LFTR for multiple exit loops
Teach IndVarSimply's LinearFunctionTestReplace transform to handle multiple exit loops. LFTR does two key things 1) it rewrites (all) exit tests in terms of a common IV potentially eliminating one in the process and 2) it moves any offset/indexing/f(i) style logic out of the loop.
This turns out to actually be pretty easy to implement. SCEV already has all the information we need to know what the backedge taken count is for each individual exit. (We use that when computing the BE taken count for the loop as a whole.) We basically just need to iterate through the exiting blocks and apply the existing logic with the exit specific BE taken count. (The previously landed NFC makes this super obvious.)
I chose to go ahead and apply this to all loop exits instead of only latch exits as originally proposed. After reviewing other passes, the only case I could find where LFTR form was harmful was LoopPredication. I've fixed the latch case, and guards aren't LFTRed anyways. We'll have some more work to do on the way towards widenable_conditions, but that's easily deferred.
I do want to note that I added one bit after the review. When running tests, I saw a new failure (no idea why didn't see previously) which pointed out LFTR can rewrite a constant condition back to a loop varying one. This was theoretically possible with a single exit, but the zero case covered it in practice. With multiple exits, we saw this happening in practice for the eliminate-comparison.ll test case because we'd compute a ExitCount for one of the exits which was guaranteed to never actually be reached. Since LFTR ran after simplifyAndExtend, we'd immediately turn around and undo the simplication work we'd just done. The solution seemed obvious, so I didn't bother with another round of review.
Alina Sbirlea [Wed, 19 Jun 2019 21:33:09 +0000 (21:33 +0000)]
[MemorySSA] Cleanup trivial phis.
Summary:
This is unfortunately needed for correctness, if we are to extend the tolerance of the update API to the way simple loop unswitch is doing cloning.
In simple loop unswitch (as opposed to loop unswitch), not all blocks are cloned. This can create unreachable cloned blocks (no predecessor), which are later cleaned up.
In MemorySSA, the APIs for supporting these kind of updates (clone + update exit blocks), make certain assumption on the integrity of the CFG. When cloning, if something was not cloned, it's values in MemorySSA default to LiveOnEntry. When updating exit blocks, it is safe to assume that we can first insert phis in the blocks merging two clones, then add additional phis in the IDF of the blocks that received phis. This no longer holds true if one of the clones being merged comes from an unreachable block. We'd conservatively need to add all phis before filling in their incoming definitions. In practice this restriction can be relaxed if we clean up trivial phis after the first round of insertion.
Matt Arsenault [Wed, 19 Jun 2019 20:44:15 +0000 (20:44 +0000)]
AMDGPU: Fix folding immediate into readfirstlane through reg_sequence
The def instruction for the vreg may not match, because it may be
folding through a reg_sequence. The assert was overly conservative and
not necessary. It's not actually important if DefMI really defined the
register, because the fold that will be done cares about the def of
the value that will be folded.
For some reason copies aren't making it through the reg_sequence,
although they should.
Philip Reames [Wed, 19 Jun 2019 20:41:28 +0000 (20:41 +0000)]
[LFTR] Rename variable to minimize confusion [NFC]
(Recommit of r363293 which was reverted when a dependent patch was.)
As pointed out by Nikita in D62625, BackedgeTakenCount is generally used to refer to the backedge taken count of the loop. A conditional backedge taken count - one which only applies if a particular exit is taken - is called a ExitCount in SCEV code, so be consistent here.
Philip Reames [Wed, 19 Jun 2019 19:23:19 +0000 (19:23 +0000)]
[SCEV] Revise a method description to match actual behavior [NFC]
Reword the ScalarEvolution::getExitCount comment in the same terminology as used by getBackedgeTakenCount since they're equivelent for single exit loops. Also, strengthen the comment to indicate exiting on the exact iteration specified is guaranteed. Several transforms implicitly rely on this; and the actual implementation checks for it (via dominating latch checks). So, spell out the guarantee in the comment.
Sanjay Patel [Wed, 19 Jun 2019 18:13:47 +0000 (18:13 +0000)]
[x86] avoid vector load narrowing with extracted store uses (PR42305)
This is an exception to the rule that we should prefer xmm ops to ymm ops.
As shown in PR42305:
https://bugs.llvm.org/show_bug.cgi?id=42305
...the store folding opportunity with vextractf128 may result in better
perf by reducing the instruction count.
Jordan Rupprecht [Wed, 19 Jun 2019 18:03:36 +0000 (18:03 +0000)]
[test] Fix TargetParserTest runtime.
r363780 fixes extreme memory growth by using a new std::vector every loop iteration, but causes runtime to go up (and occasionally timeout in certain situations) because of constructor cost every loop iteration. Fix this by moving the constructor back out, but clearing contents in the loop.
Also apply this to the AArch64 features test case, which seems to use the same pattern.
Simon Pilgrim [Wed, 19 Jun 2019 17:21:15 +0000 (17:21 +0000)]
[X86][SSE] Combine shuffles to ANY_EXTEND/ANY_EXTEND_VECTOR_INREG.
We already do this for ZERO_EXTEND/ZERO_EXTEND_VECTOR_INREG - this just extends the pattern matcher to recognize cases where we don't need the zeros in the extension.
This includes all the obvious bitwise operations (AND, OR, BIC, ORN,
MVN) in register-to-register forms, and the immediate forms of
AND/OR/BIC/ORN; byte-order reverse instructions; and the VMOVs that
access a single lane of a vector.
Some of those VMOVs (specifically, the ones that access a 32-bit lane)
share an encoding with existing instructions that were disassembled as
accessing half of a d-register (e.g. `vmov.32 r0, d1[0]`), but in
8.1-M they're now written as accessing a quarter of a q-register (e.g.
`vmov.32 r0, q0[2]`). The older syntax is still accepted by the
assembler.
The original line was there from when this test was added, but it is
checking for a switch that doesn't exist, so really has no purpose, at
least any more.
Hubert Tong [Wed, 19 Jun 2019 16:04:24 +0000 (16:04 +0000)]
[NFC][llvm-objcopy] Fix overly restrictive od output check
The check against the output of `od` in the affected tests expect a
specific input offset format. They also expect a specific offset value,
not consistent with the EXAMPLE section for `od` in POSIX.1-2017
Chapter 4, while using the `-j` option. In particular, the example shows
that the input offset begins at 0 following the bytes skipped.
This patch adjusts the matching of the input offset to be more generic.
In order to avoid false matches, it restricts the number of bytes to be
formatted.
Stefan Granitz [Wed, 19 Jun 2019 15:25:32 +0000 (15:25 +0000)]
Specify log level for CMake messages (less stderr)
Summary:
Specify message levels in CMake. Prefer STATUS (stdout).
As the default message mode (i.e. level) is NOTICE in CMake, more then necessary messages get printed to stderr. Some tools, noticably ccmake treat this as an error and require additional confirmation and re-running CMake's configuration step.
This commit specifies a mode (either STATUS or WARNING or FATAL_ERROR) instead of the default.
* I used `csearch -f 'llvm-project/.+(CMakeLists\.txt|cmake)' -l 'message\("'` to find all locations.
* Reviewers were chosen by the most common authors of specific files. If there are more suitable reviewers for these CMake changes, please let me know.