[SelectionDAG] Use KnownBits::computeForAddSub/computeForAddCarry
Summary:
Use KnownBits::computeForAddSub/computeForAddCarry
in SelectionDAG::computeKnownBits when doing value
tracking for addition/subtraction.
This should improve the precision of the known bits,
as we only used to make a simple estimate of known
zeroes. The KnownBits support functions are also
able to deduce bits that are known to be one in the
result.
[X86] Regenerate checks for domain-reassignment.mir
Apparently there are some stray IMPLICIT_DEF operations that weren't in the
checks. Not sure if they've always been there or something changed at some
point.
[GlobalISel] Enable CSE in the IRTranslator & legalizer for -O0 with constants only.
Other opcodes shouldn't be CSE'd until we can be sure debug info quality won't
be degraded.
This change also improves the IRTranslator so that in most places, but not all,
it creates constants using the MIRBuilder directly instead of first creating a
new destination vreg and then creating a constant. By doing this, the
buildConstant() method can just return the vreg of an existing G_CONSTANT
instead of having to create a COPY from it.
I measured a 0.2% improvement in compile time and a 0.9% improvement in code
size at -O0 ARM64.
[GlobalISel] Introduce a CSEConfigBase class to allow targets to define their own CSE configs.
Because CodeGen can't depend on GlobalISel, we need a way to encapsulate the CSE
configs that can be passed between TargetPassConfig and the targets' custom
pass configs. This CSEConfigBase allows targets to create custom CSE configs
which is then used by the GISel passes for the CSEMIRBuilder.
This support will be used in a follow up commit to allow constant-only CSE for
-O0 compiles in D60580.
[X86] Redefine KUNPCK instructions to take a narrower source register class than destination register class. Remove copies from the isel output pattern.
There's no reason for the inputs to be the destination register class. This just
forces an unnecessary copy in the output patterns.
[X86] Change IMUL with immediate instruction order to ri8 instructions come before ri/ri32 instructions.
This will ensure IMUL64ri8 is tried before IMUL64ri32. For IMUL32 and IMUL16 the
order doesn't really matter because only the ri8 versions use a predicate. That
automatically gives them priority.
[X86] Move VPTESTM matching from the isel table to custom code in X86ISelDAGToDAG.
We had many tablegen patterns for these instructions. And due to the
commutability of the patterns, tablegen expands them to even more patterns. All
together VPTESTMD patterns accounted for more the 50K of the 610K isel table.
This had gotten bad when we stopped canonicalizing AND to vXi64. This required
a pattern for every combination of bitcast input type.
This change moves the matching to custom code where it is easier to look through
the bitcasts without being concerned with the specific types.
The test changes are because we are now stricter with one use checks as its
required to make load folding legal. We now require the AND and any BITCAST to
only have a single use. This prevents forming VPTESTM and a VPAND with the same
inputs.
We now support broadcast loads for 128/256 patterns without VLX. We'll widen to
512-bit like and still fold the broadcast since the amount of memory read
doesn't change.
There are a few tests that got slightly longer because are now prefering
load + VPTESTM over XOR+VPCMPEQ for (seteq (load), allzeros). Previously we were
able to share the XOR with multiple VPTESTM instructions.
[X86] Don't form masked vpcmp/vcmp/vptestm operations if the setcc node has more than one use.
We're better of emitting a single compare + kand rather than a compare for the
other use and a masked compare.
I'm looking into using custom instruction selection for VPTESTM to reduce the
ridiculous number of permutations of patterns in the isel table. Putting a one
use check on all masked compare folding makes load fold matching in the custom
code easier.
Fangrui Song [Sun, 14 Apr 2019 07:20:03 +0000 (07:20 +0000)]
[Mem2Reg] Simplify and micro optimize
* Rearrange continu/break
* BBNumbers.lookup(A) -> BBNumbers.find(A)->second
BBNumbers has been computed, thus we can assume the value exists in the predicate.
Fangrui Song [Sun, 14 Apr 2019 04:45:04 +0000 (04:45 +0000)]
[ConstantRange] Delete unused getSetSize
getSetSize returns an APInt that is 1 bit wider. The APInt is typically 65-bit and requires memory allocation. isSizeStrictlySmallerThan and isSizeLargerThan are preferred. The last use of this helper method was removed by rL302385.
Philip Reames [Sat, 13 Apr 2019 22:12:56 +0000 (22:12 +0000)]
[Tests] Add tests for D60659, and make adjustments to others to make diff clear
Three related changes:
1) auto-gen several test files
2) Add the new tests at the bottom of said files
3) Adjust a couple of other test files not to use stores to constants when trying to test constexpr address handling
[ConstantRange] Disallow NUW | NSW in makeGuaranteedNoWrapRegion()
As motivated in D60598, this drops support for specifying both NUW and
NSW in makeGuaranteedNoWrapRegion(). None of the users of this function
currently make use of this.
When both NUW and NSW are specified, the exact nowrap region has two
disjoint parts and makeGNWR() returns one of them. This result doesn't
seem to be useful for anything, but makes the semantics of the function
fuzzier.
As pointed out in D60518 folding mulo(%x, undef) to {undef, undef}
isn't correct. As a correct version of this already exists in
InstructionSimplify (https://github.com/llvm-mirror/llvm/blob/bd8056ef326e075cc500f3f0cfcd1193bc200594/lib/Analysis/InstructionSimplify.cpp#L4750-L4757) this is just
dead code though. Drop it together with the mul(%x, 0) -> {0, false}
fold that is also already handled by InstSimplify.
Don Hinton [Sat, 13 Apr 2019 16:55:28 +0000 (16:55 +0000)]
[CommandLineParser] Add DefaultOption flag
Summary: Add DefaultOption flag to CommandLineParser which provides a
default option or alias, but allows users to override it for some
other purpose as needed.
Also, add `-h` as a default alias to `-help`, which can be seamlessly
overridden by applications like llvm-objdump and llvm-readobj which
use `-h` as an alias for other options.
Philip Reames [Sat, 13 Apr 2019 03:55:13 +0000 (03:55 +0000)]
[StackMaps] Update llvm-readobj to parse V3 Stackmaps
This updates the StackMap parser in the llvm-readobj tool to parse version 3 StackMaps, which were bumped in https://reviews.llvm.org/D32629.
Version 3 StackMaps differ in that they have a uint16 sized "location size" field which was added to the Location block in a StackMap record. The record has additional padding for alignment. This was a backwards incompatible change resulting in a StackMap version bump.
Patch By: jacob.hughes@kcl.ac.uk (with a rewrite of tests by me)
Differential Revision: https://reviews.llvm.org/D59020
Philip Reames [Sat, 13 Apr 2019 03:08:45 +0000 (03:08 +0000)]
[StackMaps] Add location size to llvm-readobj -stackmap output
The size field of a location can be different for each entry, so it is useful to have this displayed in the output of llvm-readobj -stackmap. Below is an example of how the output would look:
[AArch64][GlobalISel] Enable copy elision in the pre-legalizer combine and fix a crash.
This enables the simple copy combine that already exists in the CombinerHelper.
However, it exposed a bug in the GISelChangeObserver where it wouldn't clear a
set of MIs to process, and so would end up causing a crash when deleted MIs were
being added to the combiner worklist again.
Thomas Lively [Fri, 12 Apr 2019 22:27:48 +0000 (22:27 +0000)]
[WebAssembly] Add DataCount section to object files
Summary:
This ensures that object files will continue to validate as
WebAssembly modules in the presence of bulk memory operations. Engines
that don't support bulk memory operations will not recognize the
DataCount section and will report validation errors, but that's ok
because object files aren't supposed to be run directly anyway.
[GlobalISel] Fix a crash when handling an invalid MVT during call lowering.
This crash was introduced in r358032 as we try to construct an EVT from an MVT
in order to find the register type for the calling conv. Fall back instead of
trying to do this with an invalid MVT coming from i256.
[MemorySSA] Add previous def to cache when found, even if trivial.
Summary:
When inserting a new Def, MemorySSA may be have non-minimal number of Phis.
While inserting, the walk to find the previous definition may cleanup minimal Phis.
When the last definition is trivial to obtain, we do not cache it.
It is possible while getting the previous definition for a Def to get two different answers:
- one that was straight-forward to find when walking the first path (a trivial phi in this case), and
- another that follows a cleanup of the trivial phi, it determines it may need additional Phi nodes, it inserts them and returns a new phi in the same position as the former trivial one.
While the Phis added for the second path are all redundant, they are not complete (the walk is only done upwards), and they are not properly cleaned up afterwards.
A way to fix this problem is to cache the straight-forward answer we got on the first walk.
The caching is only kept for the duration of a getPreviousDef call, and for Phis we use TrackingVH, so removing the trivial phi will lead to replacing it with the next dominating phi in the cache.
Resolves PR40749.
[AArch64][GlobalISel] Fix a crash when selecting shufflevectors with an undef mask element.
If a shufflevector's mask vector has an element with "undef" then the generic
instruction defining that element register is a G_IMPLICT_DEF instead of G_CONSTANT.
This fixes the selector to handle this case, and for now assumes that undef just means
zero. In future we'll optimize this case properly.
makeGuaranteedNoWrapRegion() is actually makeExactNoWrapRegion() as
long as only one of NUW or NSW is specified. This is not obvious from
the current documentation, and some code seems to think that it is
only exact for single-element ranges. Clarify docs and add tests to
be more confident this really holds.
There are currently no users of makeGuaranteedNoWrapRegion() that
pass both NUW and NSW. I think it would be best to drop support for
this entirely and then rename the function to makeExactNoWrapRegion().
Knowing that the no-wrap region is exact is useful, because we can
backwards-constrain values. What I have in mind in particular is
that LVI should be able to constrain values on edges where the
with.overflow overflow flag is false.
Summary:
Create a method to forget everything in SCEV.
Add a cl::opt and PassManagerBuilder option to use this in LoopUnroll.
Motivation: Certain Halide applications spend a very long time compiling in forgetLoop, and prefer to forget everything and rebuild SCEV from scratch.
Sample difference in compile time reduction: 21.04 to 14.78 using current ToT release build.
Testcase showcasing this cannot be opensourced and is fairly large.
The option disabled by default, but it may be desirable to enable by
default. Evidence in favor (two difference runs on different days/ToT state):
File Before (s) After (s)
clang-9.bc 7267.91 6639.14
llvm-as.bc 194.12 194.12
llvm-dis.bc 62.50 62.50
opt.bc 1855.85 1857.53
File Before (s) After (s)
clang-9.bc 8588.70 7812.83
llvm-as.bc 196.20 194.78
llvm-dis.bc 61.55 61.97
opt.bc 1739.78 1886.26
Summary:
After introducing the limit for clobber walking, `walkToPhiOrClobber` would assert that the limit is at least 1 on entry.
The test included triggered that assert.
The callsite in `tryOptimizePhi` making the calls to `walkToPhiOrClobber` is structured like this:
```
while (true) {
if (getBlockingAccess()) { // calls walkToPhiOrClobber
}
for (...) {
walkToPhiOrClobber();
}
}
```
The cleanest fix is to check if the limit was reached inside `walkToPhiOrClobber`, and give an allowence of 1.
This approach not make any alias() calls (no calls to instructionClobbersQuery), so the performance condition is enforced.
The limit is set back to 0 if not used, as this provides info on the fact that we stopped before reaching a true clobber.
Philip Reames [Fri, 12 Apr 2019 18:26:56 +0000 (18:26 +0000)]
[InstCombine] Fix a nasty miscompile introduced w/masked.gather demanded elts
This fixes a miscompile which was introduced in r356510 (https://reviews.llvm.org/D57372).
The problem is that the original patch removed pointer operands where the load results we're demanded, but without considering the legality of the load itself. If the masked.gather had active, but undemanded, lanes, then we could end up creating a load which loaded from an undef address. The result could be a segfault, or, in theory, an arbitrary read from a random memory location into an used register.
[CVP] Set NSW/NUW flags when simplifying with.overflow
When CVP determines that a with.overflow intrinsic cannot overflow,
it currently inserts a simple add/sub. As we already determined that
there can be no overflow, we should add the appropriate NUW/NSW flag.
This is for D60460. computeForAddSub() essentially already supports
carries because it has to deal with subtractions. This revision
extracts a lower-level computeForAddCarry() function, which allows
computing the known bits for add (carry known zero), sub (carry known
one) and addcarry (carry unknown).
As we don't seem to have any yet, I've added a unit test file for
KnownBits and exhaustive tests for the new computeForAddCarry()
functionality, as well the existing computeForAddSub() function.
Lang Hames [Fri, 12 Apr 2019 18:07:28 +0000 (18:07 +0000)]
Simplify decoupling between RuntimeDyld/RuntimeDyldChecker, add 'got_addr' util.
This patch reduces the number of functions in the interface between RuntimeDyld
and RuntimeDyldChecker by combining "GetXAddress" and "GetXContent" functions
into "GetXInfo" functions that return a struct describing both the address and
content. The GetStubOffset function is also replaced with a pair of utilities,
GetStubInfo and GetGOTInfo, that fit the new scheme. For RuntimeDyld both of
these functions will return the same result, but for the new JITLink linker
(https://reviews.llvm.org/D58704) these will provide the addresses of PLT stubs
and GOT entries respectively.
For JITLink's use, a 'got_addr' utility has been added to the rtdyld-check
language, and the syntax of 'got_addr' and 'stub_addr' has been changed: both
functions now take two arguments, a 'stub container name' and a target symbol
name. For llvm-rtdyld/RuntimeDyld the stub container name is the object file
name and section name, separated by a slash. E.g.:
rtdyld-check: *{8}(stub_addr(foo.o/__text, y)) = y
For the upcoming llvm-jitlink utility, which creates stubs on a per-file basis
rather than a per-section basis, the container name is just the file name. E.g.:
[Hexagon] Fix reuse bug in Vector Loop Carried Reuse pass
The Hexagon Vector Loop Carried Reuse pass was allowing reuse between
two shufflevectors with different masks. The reason is that the masks
are not instruction objects, so the code that checks each operand
just skipped over the operands.
This patch fixes the bug by checking if the operands are the same
when they are not instruction objects. If the objects are not the
same, then the code assumes that reuse cannot occur.
[DAGCombiner] narrow shuffle of concatenated vectors
// shuffle (concat X, undef), (concat Y, undef), Mask -->
// concat (shuffle X, Y, Mask0), (shuffle X, Y, Mask1)
The ARM changes with 'vtrn' and narrowed 'vuzp' are improvements.
The x86 changes look neutral or better. There's one test with an
extra instruction, but that could be reversed for a subtarget with
the right attributes. But by default, we want to avoid the 256-bit
op when possible (in my motivating benchmark, a handful of ymm ops
sprinkled into a sequence of xmm ops are triggering frequency
throttling on Haswell resulting in significantly worse perf).
Currently combineHorizontalPredicateResult only handles anyof/allof reduction patterns of legal types, which can be tricky to match as type legalization of bools can introduce bitcasts/truncs/extensions.
This patch extends combineHorizontalPredicateResult to recognise vXi1 bool reductions as well and uses the existing combineBitcastvxi1 helper to create the MOVMSK necessary to then compare the signmask result.
This ensures the accuracy of the reduction costs added in D60403 which assume the MOVMSK generation.
Hans Wennborg [Fri, 12 Apr 2019 12:54:52 +0000 (12:54 +0000)]
Revert r358268 "[DebugInfo] DW_OP_deref_size in PrologEpilogInserter."
It causes clang to crash while building Chromium. See https://crbug.com/952230
for reproducer.
> The PrologEpilogInserter need to insert a DW_OP_deref_size before
> prepending a memory location expression to an already implicit
> expression to avoid having the existing expression act on the memory
> address instead of the value behind it.
>
> The reason for using DW_OP_deref_size and not plain DW_OP_deref is that
> big-endian targets need to read the right size as simply truncating a
> larger read would yield the wrong result (LSB bytes are not at the lower
> address).
>
> Differential Revision: https://reviews.llvm.org/D59687
Kang Zhang [Fri, 12 Apr 2019 09:59:40 +0000 (09:59 +0000)]
[PowerPC] Add initialization for some ppc passes
Summary:
Some llc debug options need pass-name as the parameters.
But if we use the pass-name ppc-early-ret, we will get below error:
llc test.ll -stop-after ppc-early-ret
LLVM ERROR: "ppc-early-ret" pass is not registered.
Below pass-names have the pass is not registered error:
ppc-ctr-loops
ppc-ctr-loops-verify
ppc-loop-preinc-prep
ppc-toc-reg-deps
ppc-vsx-copy
ppc-early-ret
ppc-vsx-fma-mutate
ppc-vsx-swaps
ppc-reduce-cr-ops
ppc-qpx-load-splat
ppc-branch-coalescing
ppc-branch-select
Jeremy Morse [Fri, 12 Apr 2019 09:47:35 +0000 (09:47 +0000)]
[DebugInfo] Fix pr41175 Dead Store Elimination missing debug loc
Bug: https://bugs.llvm.org/show_bug.cgi?id=41175
In the bug test case the DSE pass is shortening the range of memory that a
memset is working on. A getelementptr is generated so that the new
starting address can be passed to memset. This instruction was not given
a DebugLoc.
To fix the bug, copy the DebugLoc from the memset instruction.
Markus Lavin [Fri, 12 Apr 2019 08:23:55 +0000 (08:23 +0000)]
[DebugInfo] DW_OP_deref_size in PrologEpilogInserter.
The PrologEpilogInserter need to insert a DW_OP_deref_size before
prepending a memory location expression to an already implicit
expression to avoid having the existing expression act on the memory
address instead of the value behind it.
The reason for using DW_OP_deref_size and not plain DW_OP_deref is that
big-endian targets need to read the right size as simply truncating a
larger read would yield the wrong result (LSB bytes are not at the lower
address).
Hans Wennborg [Fri, 12 Apr 2019 08:23:28 +0000 (08:23 +0000)]
Fix missing arguments in tutorial
In tutorial "8. Kaleidoscope: Compiling to Object Code" a call to
TargetMachine->addPassesToEmitFile(pass, dest, FileType) is missing
nullptr as its 3rd value.
Eric Christopher [Fri, 12 Apr 2019 07:42:35 +0000 (07:42 +0000)]
Move getNumFrameInfos and getDwarfFrameInfos out of line and remove
the MCDwarf.h include.
This removes 50 transitive dependencies for a modification of
MCDwarf.h in a build of llc for a pair of out of line functions
and reduces the build overhead of 'touch MCDwarf.h" by 15% without
impacting test time of check-llvm.
Eric Christopher [Fri, 12 Apr 2019 06:57:45 +0000 (06:57 +0000)]
Move addInitialFrameState out of line and remove the MCDwarf.h include.
This removes 50 transitive dependencies for a modification of
MCDwarf.h in a build of llc for a single out of line function
and reduces the build overhead by 20% without impacting test
time of check-llvm.
[TargetLowering][X86] Teach SimplifyDemandedBits to use ShrinkDemandedOp on ISD::SHL nodes.
If the upper bits of the SHL result aren't used, we might be able to use a narrower shift. For example, on X86 this can turn a 64-bit into 32-bit enabling a smaller encoding.
Kang Zhang [Fri, 12 Apr 2019 06:35:15 +0000 (06:35 +0000)]
[PowerPC] Add initialization for some ppc passes
Summary:
Some llc debug options need pass-name as the parameters.
But if we use the pass-name ppc-early-ret, we will get below error:
llc test.ll -stop-after ppc-early-ret
LLVM ERROR: "ppc-early-ret" pass is not registered.
Below pass-names have the pass is not registered error:
ppc-ctr-loops
ppc-ctr-loops-verify
ppc-loop-preinc-prep
ppc-toc-reg-deps
ppc-vsx-copy
ppc-early-ret
ppc-vsx-fma-mutate
ppc-vsx-swaps
ppc-reduce-cr-ops
ppc-qpx-load-splat
ppc-branch-coalescing
ppc-branch-select
Eric Christopher [Fri, 12 Apr 2019 06:31:59 +0000 (06:31 +0000)]
Move addFrameInst out of line and remove the MCDwarf.h include.
This removes 500 transitive dependencies for a modification of
MCDwarf.h in a build of llc for a single out of line function
and reduces the build overhead by more than half without impacting
test time of check-llvm.
Zi Xuan Wu [Fri, 12 Apr 2019 05:21:31 +0000 (05:21 +0000)]
[PowerPC] More precise exploitation of P9 maddld instruction when operands are constant
There are 3 operands of maddld, (add (mul %1, %2), %3) and sometimes
they are constant. If there is constant operand, it takes extra li to
materialize the operand, and one more extra register too. So it's not
profitable to use maddld to optimize mul-add pattern.
Fangrui Song [Fri, 12 Apr 2019 02:16:15 +0000 (02:16 +0000)]
[MC] Fix typo: .symtab_shndxr -> .symtab_shndx
This special section is named .symtab_shndx, according to gABI Chapter 4
Sections, and the name is used by some other tools. Though the section
type SHT_SYMTAB_SHNDX is what really matters, let's fix the typo
introduced in rL204769 :)
Nick Desaulniers [Thu, 11 Apr 2019 22:47:13 +0000 (22:47 +0000)]
[X86AsmPrinter] refactor static functions into private methods. NFC
Summary:
A lot of the code for printing special cases of operands in this
translation unit are static functions. While I too have suffered many
years of abuse at the hands of C, we should prefer private methods,
particularly when you start passing around *this as your first argument,
which is a code smell.
This will help make generic vs arch specific asm printing easier, as it
brings X86AsmPrinter more in line with other arch's derived AsmPrinters.
We will then be able to more easily move architecture generic code to
the base class, and architecture specific code to the derived classes.
Some other small refactorings while we're here:
- the parameter Op is now consistently OpNo
- add spaces around binary expressions. I know we're not millionaires
but c'mon.
The isLoopCarriedDep function does not correctly compute loop
carried dependences when the array index offset is negative
or the stride is smallar than the access size.
Same as the other ConstantRange overflow checking methods, but for
unsigned mul. In this case there is no cheap overflow criterion, so
using umul_ov for the implementation.
Summary:
There is a bug in add_tablegen which causes cmake to fail with the following
error message if LLVM_TABLEGEN is set.
CMake Error at cmake/modules/TableGen.cmake:147 (add_dependencies):
The dependency target "LLVM-tablegen-host" of target "CLANG-tablegen-host"
does not exist.
Call Stack (most recent call first):
tools/clang/utils/TableGen/CMakeLists.txt:3 (add_tablegen)
The issue happens because setting LLVM_TABLEGEN causes cmake to skip generating
the LLVM-tablegen-host target. As a result, a non-existent target was added for
CLANG-tablegen-host causing cmake to fail.
In order to fix this issue, this patch adds a guard to check the validity of the
dependency target before adding it as a dependency.
Rong Xu [Thu, 11 Apr 2019 20:54:17 +0000 (20:54 +0000)]
[PGO] Better handling of profile hash mismatch
We currently assume profile hash conflicts will be caught by an upfront
check and we assert for the cases that escape the check. The assumption
is not always true as there are chances of conflict. This patch prints
a warning and skips annotating the function for the escaped cases,.