Sam Parker [Mon, 11 Feb 2019 11:35:42 +0000 (11:35 +0000)]
[ARM] Add v8m.base pattern for add negative imm
The v8m.base ISA contains movw, which can operate on an unsigned
16-bit value. Add a pattern that converts an add of a negative
immediate, whose negation fits in 16 bits, into a sub of that positive
value.
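A rough sketch of the immediate check behind this kind of fold (illustrative C++ helper, not the actual tablegen pattern; the helper name is made up):

    #include <cstdint>
    #include <optional>

    // If an add immediate is negative and its negation fits in an unsigned
    // 16-bit movw operand, return the value to use for an equivalent sub.
    std::optional<uint32_t> addImmAsSubImm(int32_t Imm) {
      if (Imm < 0 && -static_cast<int64_t>(Imm) <= 0xFFFF)
        return static_cast<uint32_t>(-static_cast<int64_t>(Imm));
      return std::nullopt; // keep the original add
    }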
Diana Picus [Mon, 11 Feb 2019 10:30:22 +0000 (10:30 +0000)]
test-release.sh: Add option to use ninja
Allow the use of ninja instead of make. This is useful on some
platforms where we'd like to be able to limit the number of link jobs
without slowing down the other steps of the release.
This patch adds a -use-ninja command line option, which sets the
generator to Ninja both for LLVM and the test-suite. It also deals with
some differences between make and ninja:
* DESTDIR handling - ninja doesn't like this to be listed after the
target, but both make and ninja can handle it before the command
* Verbose mode - ninja uses -v, make uses VERBOSE=1
* Keep going mode - make has a -k mode, which builds as much as possible
even when failures are encountered; for ninja we need to set a hard
limit (we use 100 since most people won't look at 100 failures anyway)
[DWARF] LLVM ERROR: Broken function found, while removing Debug Intrinsics.
Check that when SimplifyCFG is flattening a 'br', all related debug intrinsic instructions are removed, including any dbg.label referencing a label associated with the basic blocks being removed.
As the test case involves a CFG transformation, move it to the correct location.
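A minimal sketch of the kind of cleanup described (illustrative helper using the public LLVM API, not the actual SimplifyCFG change):

    #include "llvm/ADT/STLExtras.h"
    #include "llvm/IR/BasicBlock.h"
    #include "llvm/IR/IntrinsicInst.h"

    // Drop every debug intrinsic (dbg.value, dbg.declare, dbg.label, ...)
    // from a block that is about to be folded away, so no dbg.label is left
    // referencing a label of a removed block.
    static void dropDebugIntrinsics(llvm::BasicBlock &BB) {
      for (llvm::Instruction &I : llvm::make_early_inc_range(BB))
        if (llvm::isa<llvm::DbgInfoIntrinsic>(I))
          I.eraseFromParent();
    }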
Sjoerd Meijer [Mon, 11 Feb 2019 09:37:42 +0000 (09:37 +0000)]
[ARM] LoadStoreOptimizer: reorder limit
The whole design of generating LDMs/STMs is fragile and unreliable: it depends on
rescheduling here in the LoadStoreOptimizer, which is not register-pressure aware,
and on regalloc, which is not aware that LDMs/STMs are being generated.
This patch adds a (hidden) option to control the total number of instructions that
can be re-ordered. I appreciate this looks only a tiny bit better than a hard-coded
constant, but at least it allows easier experimentation with different values
for now. Ideally we calculate this reorder limit based on some heuristics, and take
register pressure into account. I might be looking into that next.
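The hidden option would typically be declared with cl::opt; a sketch with an assumed flag name and default value (both are guesses, not the ones in ARMLoadStoreOptimizer.cpp):

    #include "llvm/Support/CommandLine.h"

    // Hidden knob limiting how many instructions the pass may re-order.
    static llvm::cl::opt<unsigned> InstReorderLimit(
        "arm-load-store-reorder-limit", llvm::cl::Hidden,
        llvm::cl::desc("Maximum number of instructions the load/store "
                       "optimizer may re-order"),
        llvm::cl::init(8));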
Michal Gorny [Mon, 11 Feb 2019 09:07:07 +0000 (09:07 +0000)]
[llvm] [cmake] Use current directory in GenerateVersionFromVCS
Find dependent scripts of GenerateVersionFromVCS in the current directory
rather than in ../../cmake/modules. I do not see any reason why the former
would not work, and the latter is incorrect when GenerateVersionFromVCS
is used from the install directory (i.e. in stand-alone builds).
Chandler Carruth [Mon, 11 Feb 2019 07:42:30 +0000 (07:42 +0000)]
[CallSite removal] Migrate the statepoint GC infrastructure to use the
`CallBase` class rather than `CallSite` wrappers.
I pushed this change down through most of the statepoint infrastructure,
completely removing the use of CallSite where I could reasonably do so.
I ended up making a couple of cut-points: generic call handling
(instcombine, TLI, SDAG). As soon as it hit truly generic handling with
users outside the immediate code, I simply transitioned into or out of
a `CallSite` to make this a reasonably sized chunk.
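The flavor of the migration, as a hedged before/after sketch (the helper is illustrative, not taken from the statepoint code):

    #include "llvm/IR/Function.h"
    #include "llvm/IR/InstrTypes.h" // llvm::CallBase

    // Before: bool isGCCall(llvm::CallSite CS) {
    //           const llvm::Function *F = CS.getCalledFunction();
    //           return F && F->hasGC();
    //         }
    // After: take the call instruction itself, no wrapper needed.
    static bool isGCCall(const llvm::CallBase &Call) {
      const llvm::Function *F = Call.getCalledFunction();
      return F && F->hasGC();
    }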
Simon Pilgrim [Sun, 10 Feb 2019 17:04:00 +0000 (17:04 +0000)]
[DAGCombine] Simplify funnel shifts with undef/zero args to bitshifts
Now that we have SimplifyDemandedBits support for funnel shifts (rL353539), we need to simplify funnel shifts back to bitshifts in cases where either argument has been folded to undef/zero.
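For reference, restating the LangRef semantics being relied on (bit width w):

    fshl(X, Y, Z) = high w bits of ((X concat Y) << (Z mod w))
    fshr(X, Y, Z) = low  w bits of ((X concat Y) >> (Z mod w))

So fshl(X, zero/undef, Z) collapses to X << (Z mod w), and fshr(zero/undef, Y, Z) collapses to Y >> (Z mod w): plain bitshifts with a masked shift amount.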
Sanjay Patel [Sun, 10 Feb 2019 15:22:06 +0000 (15:22 +0000)]
[x86] narrow 256-bit horizontal ops via demanded elements
256-bit horizontal math ops are an x86 monstrosity (and thankfully have
not been extended to 512-bit AFAIK).
The two 128-bit halves operate on separate halves of the inputs. So if we
don't demand anything in the upper half of the result, we can extract the
low halves of the inputs, do the math, and then insert that result into a
256-bit output.
All of the extract/insert is free (ymm<-->xmm), so we're left with a
narrower (cheaper) version of the original op.
In the affected tests based on:
https://bugs.llvm.org/show_bug.cgi?id=33758
https://bugs.llvm.org/show_bug.cgi?id=38971
...we see that the h-op narrowing can result in further narrowing of other
math via existing generic transforms.
I originally drafted this patch as an exact pattern match starting from
extract_vector_elt, but I thought we might see diffs starting from
extract_subvector too, so I changed it to a more general demanded elements
solution. There are no extra existing regression test improvements from
that switch though, so we could go back.
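A hedged sketch of the shape of the transform (illustrative SelectionDAG code, not the actual X86ISelLowering change):

    #include "llvm/CodeGen/SelectionDAG.h"

    using namespace llvm;

    // If only the low 128 bits of a 256-bit horizontal op are demanded, run
    // the op on the low halves of its inputs and reinsert the narrow result;
    // the upper half of the output is left undef.
    static SDValue narrowHorizontalOp(SDValue HOp, SelectionDAG &DAG,
                                      const SDLoc &DL) {
      EVT VT = HOp.getValueType(); // e.g. v8f32
      EVT HalfVT = EVT::getVectorVT(*DAG.getContext(),
                                    VT.getVectorElementType(),
                                    VT.getVectorNumElements() / 2);
      SDValue Idx0 = DAG.getIntPtrConstant(0, DL);
      SDValue L = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HalfVT,
                              HOp.getOperand(0), Idx0);
      SDValue R = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HalfVT,
                              HOp.getOperand(1), Idx0);
      SDValue Narrow = DAG.getNode(HOp.getOpcode(), DL, HalfVT, L, R);
      return DAG.getNode(ISD::INSERT_SUBVECTOR, DL, VT, DAG.getUNDEF(VT),
                         Narrow, Idx0);
    }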
Sanjay Patel [Sun, 10 Feb 2019 14:29:57 +0000 (14:29 +0000)]
[TargetLowering] refactor setcc folds to fix another miscompile (PR40657)
SimplifySetCC still has much room for improvement, but this should
fix the remaining problem examples from:
https://bugs.llvm.org/show_bug.cgi?id=40657
George Rimar [Sun, 10 Feb 2019 08:35:38 +0000 (08:35 +0000)]
[yaml2obj] - Fix .dynamic section entries writing for 32bit targets.
This was introduced by me in r353613.
I tried to fix the big-endian bot and replaced
uintX_t with ELFT::Xword. But ELFT::Xword is a packed<uint64_t>,
so it is always 8 bytes, which was obviously incorrect.
My intention was to use something like packed<uint>, whose
size is target dependent.
The patch fixes this bug and adds a test case, since no bot seems to have reported this.
Simon Pilgrim [Sat, 9 Feb 2019 20:34:59 +0000 (20:34 +0000)]
[X86] CombineOr - fold to generic funnel shifts
As discussed on D57389, this is a first step towards moving the SHLD/SHRD matching code to DAGCombiner using FSHL/FSHR instead.
There's a bit of work to do before I can do that, so this just folds to FSHL/FSHR in the existing code (handling the different SHRD/FSHR argument ordering), which fixes the issue we had with i16 shift amounts not being correctly masked.
Sanjay Patel [Sat, 9 Feb 2019 17:03:59 +0000 (17:03 +0000)]
[TargetLowering] add tests to show effect of setcc sub->shift; NFC
There's effectively no difference for the cases with variables.
We just trade a sub for an add on those. But the case with a
subtract from a constant would require an extra move instruction
on x86, so this looks like a reasonable generic combine.
George Rimar [Sat, 9 Feb 2019 15:18:52 +0000 (15:18 +0000)]
[yaml2elf.cpp] - Fix compilation under Linux.
Fixes errors like:
/home/ssglocal/clang-cmake-x86_64-sde-avx512-linux/clang-cmake-x86_64-sde-avx512-linux/llvm/tools/yaml2obj/yaml2elf.cpp:597:5: error: need ‘typename’ before ‘ELFT:: Xword’ because ‘ELFT’ is a dependent scope
ELFT::Xword Tag = (ELFT::Xword)DE.Tag;
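The usual fix for this class of error (illustrative; not necessarily the exact change committed) is to mark the dependent type with 'typename':

    // Inside a template, a dependent member type such as ELFT::Xword must be
    // prefixed with 'typename' so the compiler knows it names a type.
    typename ELFT::Xword Tag = static_cast<typename ELFT::Xword>(DE.Tag);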
George Rimar [Sat, 9 Feb 2019 15:03:19 +0000 (15:03 +0000)]
[yaml2elf] - An attempt to fix the s390x BB after r353607.
s390x is big-endian, and it seems r353607 had an issue with endianness.
The bot was unhappy:
http://lab.llvm.org:8011/builders/clang-s390x-linux-lnt/builds/11168/steps/ninja%20check%201/logs/stdio
Simon Pilgrim [Sat, 9 Feb 2019 13:13:59 +0000 (13:13 +0000)]
[X86][SSE] Generalize X86ISD::BLENDI support to more value types
D42042 introduced the ability for the ExecutionDomainFixPass to more easily change between BLENDPD/BLENDPS/PBLENDW as the domains required.
With this ability, we can avoid most bitcasts/scaling in the DAG that occurred with X86ISD::BLENDI lowering/combining, blend with the vXi32/vXi64 vectors directly, and use isel patterns to lower to the equivalent float-vector forms.
This helps the shuffle combining and SimplifyDemandedVectorElts be more aggressive as we lose track of fewer UNDEF elements than when we go up/down through bitcasts.
I've introduced a basic blend(bitcast(x),bitcast(y)) -> bitcast(blend(x,y)) fold; there are more generalizations I can do there (e.g. widening/scaling and handling the tricky v16i16 repeated mask case).
The vector-reduce-smin/smax regressions will be fixed in a future improvement to SimplifyDemandedBits to peek through bitcasts and support X86ISD::BLENDV.
Fangrui Song [Sat, 9 Feb 2019 09:18:37 +0000 (09:18 +0000)]
[GlobalOpt] Simplify __cxa_atexit elimination
cxxDtorIsEmpty checks callers recursively to determine if the
__cxa_atexit-registered function is empty, and eliminates the
__cxa_atexit call accordingly.
This recursive check is unnecessary, as redundant instructions and
function calls can be removed by early-cse and the inliner. In addition,
cxxDtorIsEmpty does not mark visited functions and may visit a
function an exponential number of times (multiplication principle).
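A hedged sketch of what a non-recursive emptiness check could look like (illustrative, not the committed GlobalOpt code):

    #include "llvm/IR/Function.h"
    #include "llvm/IR/Instructions.h"

    // Treat a registered destructor as empty only if its entry block contains
    // nothing but a return; deeper cleanup is left to early-cse/inliner.
    static bool cxxDtorIsTriviallyEmpty(const llvm::Function &Dtor) {
      if (Dtor.isDeclaration() || !Dtor.hasExactDefinition())
        return false;
      const llvm::BasicBlock &Entry = Dtor.getEntryBlock();
      return Entry.size() == 1 && llvm::isa<llvm::ReturnInst>(Entry.front());
    }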
Hubert Tong [Sat, 9 Feb 2019 02:11:51 +0000 (02:11 +0000)]
[MC] Clean up unused inline function and non-anchor defaulted destructors; NFCI
Summary:
Take care of some missing clean-ups that belong with r249548 and some
other copy/paste that had happened. In particular, the destructors are
no longer vtable anchors after r249548; and `setSectionName` in
`MCSectionWasm` is private and unused since r313058 culled its only
caller. The destructors are now implicitly defined, and the unused
function is removed.
Gabor Buella [Sat, 9 Feb 2019 01:44:28 +0000 (01:44 +0000)]
Extra processing for BitCast + PHI in InstCombine
For some specific cases with bitcast A->B->A with intervening PHI nodes, the InstCombiner::optimizeBitCastFromPhi transformation creates extra PHI nodes, which are actually copies of an already-created PHI, or in other words, they are redundant. These extra PHI nodes could lead to extra move instructions generated after the DeSSA transformation. This happens when several conditions are met:
- SROA kicks in and creates new alloca;
- there is a simple assignment L = R, which falls under the 'canonicalize loads' transformation done by combineLoadToOperationType (this transformation is enabled by default). Exactly this transformation is the reason the bitcasts are generated;
- the alloca is then used in A->B->A + PHI chain;
- there is a loop unrolling.
As a result, optimizeBitCastFromPhi creates as many PHI nodes for each new SROA alloca as the loop unrolling factor. All of these extra PHI nodes except one are redundant and should not be created. Moreover, the idea of optimizeBitCastFromPhi is to get rid of the cast (when possible), but that does not happen under these conditions.
The proposed fix is to perform the cast replacement for the whole calculated/accumulated PHI closure, not only for the single cast that is passed as an argument to optimizeBitCastFromPhi. This helps accomplish several things: 1) extra PHI nodes are not generated, as all casts that may trigger the optimizeBitCastFromPhi transformation are replaced; 2) the bitcasts are replaced; and 3) more opportunities are created to remove dead code, which appears after the replacement.
A new test case shows that it's possible to get rid of all bitcasts completely and get quite good code reduction.
Author: Igor Tsimbalist <igor.v.tsimbalist@intel.com>
Craig Topper [Fri, 8 Feb 2019 20:48:56 +0000 (20:48 +0000)]
Implementation of asm-goto support in LLVM
This patch accompanies the RFC posted here:
http://lists.llvm.org/pipermail/llvm-dev/2018-October/127239.html
This patch adds a new CallBr IR instruction to support GCC-style asm-goto
inline assembly as used by the Linux kernel. This
instruction is both a call instruction and a terminator
instruction with multiple successors. Only inline assembly
usage is supported today.
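For context, the GCC construct being modeled looks roughly like this (illustrative C++ using the GNU inline-asm extension; not taken from the patch or its tests):

    // An "asm goto" statement may branch to one of the listed C labels in
    // addition to falling through; CallBr gives this a first-class IR
    // representation (a call that is also a terminator with successors).
    int is_nonzero(int x) {
      asm goto("testl %0, %0\n\t"
               "jz %l[was_zero]"
               : /* asm goto allows no outputs */
               : "r"(x)
               : "cc"
               : was_zero);
      return 1; // fallthrough path
    was_zero:
      return 0; // reached via the asm's jump
    }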
This also adds a new INLINEASM_BR opcode to SelectionDAG and
MachineIR to represent an INLINEASM block that is also
considered a terminator instruction.
There will likely be more bug fixes and optimizations to follow
this, but we felt it had reached a point where we would like to
switch to an incremental development model.
Patch by Craig Topper, Alexander Ivchenko, Mikhail Dvoretckii
Vedant Kumar [Fri, 8 Feb 2019 20:48:04 +0000 (20:48 +0000)]
[CodeExtractor] Restore outputs after creating exit stubs
When CodeExtractor saves the result of an InvokeInst at the first insertion
point of the 'normal destination' basic block, that block can be omitted
from the outlined region, so the store is placed outside of the function.
The suggested solution is to process the saving of outputs after creating
exit stubs for the new function; the stores will then be placed in those
blocks before the return.
Matt Arsenault [Fri, 8 Feb 2019 19:59:32 +0000 (19:59 +0000)]
AMDGPU: Eliminate GPU specific SubtargetFeatures
Inline compatibility is determined from the individual feature
bits. These are just sets of the separate features, but will always be
treated as incompatible unless they are specifically ignored.
Defining the ISA version number here in tablegen would be nice, but it
turns out this wasn't actually used.
[DAGCombine] Optimize pow(X, 0.75) to sqrt(X) * sqrt(sqrt(X))
The sqrt case is faster and we already do this for the case where
the exponent is 0.25. This adds the 0.75 case which is also not
sensitive to signed zeros.
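The identity being exploited (for inputs where pow is defined):

    X^0.75 = X^0.5 * X^0.25 = sqrt(X) * sqrt(sqrt(X))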
Summary:
The motivating use case is eliminating duplicate profile data registered
for the same inline function in two object files. Before this change,
users would observe multiple symbol definition errors with VC link, but
links with LLD would succeed.
Users (Mozilla) have reported that PGO works well with clang-cl and LLD,
but when using LLD without this static registration, we would get into a
"relocation against a discarded section" situation. I'm not sure what
happens in that situation, but I suspect that duplicate, unused profile
information was retained. If so, this change will reduce the size of
such binaries with LLD.
Now, Windows uses static registration and is in line with all the other
platforms.
Simon Pilgrim [Fri, 8 Feb 2019 18:57:38 +0000 (18:57 +0000)]
[TargetLowering] Use ISD::FSHR in expandFixedPointMul
Replace OR(SHL,SRL) pattern with ISD::FSHR (legalization expands this later if necessary) - this helps with the scale == 0 'undefined' drop-through case that was discussed on D55720.
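For reference, the equivalence being used (bit width w, 0 < Scale < w):

    fshr(Hi, Lo, Scale) = (Hi << (w - Scale)) | (Lo >> Scale)

while fshr(Hi, Lo, 0) is simply Lo, which is presumably what makes the scale == 0 drop-through case cleaner than the raw OR(SHL,SRL) form.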
Teresa Johnson [Fri, 8 Feb 2019 17:08:27 +0000 (17:08 +0000)]
ArgumentPromotion should copy all metadata to new Function
Summary:
ArgumentPromotion had code to specifically move the dbg metadata over to
the new function, but other metadata, such as the function_entry_count
!prof metadata, was not moved. Replace the code that moved dbg metadata with a call
to copyMetadata. The old metadata is automatically removed when the old
Function is removed.
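A hedged sketch of the kind of call involved (the wrapper function and its name are illustrative):

    #include "llvm/IR/Function.h"

    // GlobalObject::copyMetadata copies every metadata attachment
    // (!dbg, !prof, ...), so no kind-specific handling is needed.
    static void transferMetadata(llvm::Function &OldF, llvm::Function &NewF) {
      NewF.copyMetadata(&OldF, /*Offset=*/0);
    }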
Craig Topper [Fri, 8 Feb 2019 17:07:54 +0000 (17:07 +0000)]
[X86] Remove isReMaterializable from X87 floating point constant loads and constant pool loads.
Summary: These instructions update FPSW, so they aren't generically safe to rematerialize into any location if FPSW is live for a comparison result. They also use FPCW for exception masking control, though the only exception they can generate is stack overflow, and we manage the stack ourselves, so that's not really going to occur.
Carl Ritson [Fri, 8 Feb 2019 15:41:11 +0000 (15:41 +0000)]
[AMDGPU] Fix CS scratch setup on pre-GCN3 ASICs
Summary:
Prior to GCN3 s_load_dword offsets are in dwords rather than bytes.
Thus the scratch buffer descriptor offset must be adjusted for pre-GCN3 ASICs.
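The adjustment in rough form (hypothetical helper and names; the real change lives in the AMDGPU scratch setup code):

    // Pre-GCN3, s_load_dword takes its offset in dwords, so a byte offset
    // must be scaled down before being used to load the scratch descriptor.
    static unsigned scratchDescOffset(unsigned ByteOffset, bool IsGCN3OrLater) {
      return IsGCN3OrLater ? ByteOffset : ByteOffset / 4; // bytes -> dwords
    }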
Petar Avramovic [Fri, 8 Feb 2019 14:27:23 +0000 (14:27 +0000)]
[MIPS GlobalISel] Select any extending load and truncating store
Make the behavior of G_LOAD in widenScalar the same as for G_ZEXTLOAD and
G_SEXTLOAD, that is, perform widenScalarDst to the size given by the target
and avoid additional checks in common code. Targets can reorder or add
additional rules in the LegalizeRuleSet for the opcode to achieve the desired
behavior.
Select an extending load that does not have a specified extension type
into a zero-extending load.
Select a truncating store that stores the number of bytes indicated by the
size in the MachineMemOperand.