PACKUSWB converts Signed word to Unsigned byte, (the same about DW) and it can't be used for umin+truncate pattern.
AVX-512 VPMOVUS* instructions fit the pattern since they convert Unsigned to Unsigned.
Chandler Carruth [Sun, 29 Jan 2017 08:03:19 +0000 (08:03 +0000)]
[ArgPromote] Run clang-format to normalize remarkably idiosyncratic
formatting that has evolved here over the past years prior to making
somewhat invasive changes to thread new PM support through the business
logic.
Chandler Carruth [Sun, 29 Jan 2017 08:03:16 +0000 (08:03 +0000)]
[ArgPromote] Re-arrange the code in a more typical, logical way.
This arranges the static helpers in an order where they are defined
prior to their use to avoid the need of forward declarations, and
collect the core pass components at the bottom below their helpers.
This also folds one trivial function into the pass itself. Factoring
this 'runImpl' was an attempt to help porting to the new pass manager,
however in my attempt to begin this port in earnest it turned out to not
be a substantial help. I think it will be easier to factor things
without it.
This is an NFC change and does a minimal amount of edits over all.
Subsequent NFC cleanups will normalize the formatting with clang-format
and improve the basic doxygen commenting.
Craig Topper [Sun, 29 Jan 2017 04:38:19 +0000 (04:38 +0000)]
[DAGCombiner] Remove unnecessary check on the size of the type of the index of EXTRACT_SUBVECTOR.
The type system already requires that the number of vector elements must fit in 32-bits so an index should as well. Even if the type of the index were larger all we care about is that the constant index can fit in 64-bits so that we can call getZExtValue.
David Majnemer [Sun, 29 Jan 2017 01:27:08 +0000 (01:27 +0000)]
[Target] Add NoSignedZerosFPMath to the TargetOptions constructor
Most flags were already initialized by the TargetOptions constructor but
we missed out on one. Also, simplify the constructor by using field
initializers when possible.
Lang Hames [Sun, 29 Jan 2017 00:51:17 +0000 (00:51 +0000)]
[Orc][RPC] Remove a couple of redundant calls to abandonAllPendingResponses.
appendCallAsync, which all RPC call functions ultimately build on, will call
abandonAllPendingResponses on channel error. These extra calls are redundant.
Mohammad Shahid [Sat, 28 Jan 2017 17:59:44 +0000 (17:59 +0000)]
[SLP] Vectorize loads of consecutive memory accesses, accessed in non-consecutive (jumbled) way.
The jumbled scalar loads will be sorted while building the tree and these accesses will be marked to generate shufflevector after the vectorized load with proper mask.
Support for barrier synchronization between a subset of threads
in a CTA through one of sixteen explicitly specified barriers.
These intrinsics are not directly exposed in CUDA but are
critical for forthcoming support of OpenMP on NVPTX GPUs.
The intrinsics allow the synchronization of an arbitrary
(multiple of 32) number of threads in a CTA at one of 16
distinct barriers. The two intrinsics added are as follows:
call void @llvm.nvvm.barrier.n(i32 10)
waits for all threads in a CTA to arrive at named barrier #10.
call void @llvm.nvvm.barrier(i32 15, i32 992)
waits for 992 threads in a CTA to arrive at barrier #15.
Detailed description of these intrinsics are available in the PTX manual.
http://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions
Vadim Chugunov [Sat, 28 Jan 2017 07:39:52 +0000 (07:39 +0000)]
This addresses LLDB bug 31699, which was caused by LLVM using static linking on Windows.
In order to make sure that LLVM continues to work on machines that do not have the Universal CRT yet,
we'll need to ship a copy of UCRT in the Windows installation package. Fortunately, CMake 3.6+ already
supports app-local deployment of UCRT dlls, we just need to turn this on.
Taewook Oh [Sat, 28 Jan 2017 07:05:43 +0000 (07:05 +0000)]
[InstCombine] Merge DebugLoc when speculatively hoisting store instruction
Summary: Along with https://reviews.llvm.org/D27804, debug locations need to be merged when hoisting store instructions as well. Not sure if just dropping debug locations would make more sense for this case, but as the branch instruction will have at least different discriminator with the hoisted store instruction, I think there will be no difference in practice.
Quentin Colombet [Sat, 28 Jan 2017 02:23:48 +0000 (02:23 +0000)]
[RegisterBankInfo] Emit proper type for remapped registers.
When the OperandsMapper creates virtual registers, it used to just create
plain scalar register with the right size. This may confuse the
instruction selector because we lose the information of the instruction
using those registers what supposed to do. The MachineVerifier complains
about that already.
With this patch, the OperandsMapper still creates plain scalar register,
but the expectation is for the mapping function to remap the type
properly. The default mapping function has been updated to do that.
Matthias Braun [Sat, 28 Jan 2017 02:02:38 +0000 (02:02 +0000)]
Cleanup dump() functions.
We had various variants of defining dump() functions in LLVM. Normalize
them (this should just consistently implement the things discussed in
http://lists.llvm.org/pipermail/cfe-dev/2014-January/034323.html
For reference:
- Public headers should just declare the dump() method but not use
LLVM_DUMP_METHOD or #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
- The definition of a dump method should look like this:
#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
LLVM_DUMP_METHOD void MyClass::dump() {
// print stuff to dbgs()...
}
#endif
Daniel Berlin [Sat, 28 Jan 2017 01:23:13 +0000 (01:23 +0000)]
Introduce a basic MemorySSA updater, that supports insertDef,
insertUse, moveBefore and moveAfter operations.
Summary:
This creates a basic MemorySSA updater that handles arbitrary
insertion of uses and defs into MemorySSA, as well as arbitrary
movement around the CFG. It replaces the current splice API.
It can be made to handle arbitrary control flow changes.
Currently, it uses the same updater algorithm from D28934.
The main difference is because MemorySSA is single variable, we have
the complete def and use list, and don't need anyone to give it to us
as part of the API. We also have to rename stores below us in some
cases.
If we go that direction in that patch, i will merge all the updater
implementations (using an updater_traits or something to provide the
get* functions we use, called read*/write* in that patch).
Sadly, the current SSAUpdater algorithm is way too slow to use for
what we are doing here.
I have updated the tests we have to basically build memoryssa
incrementally using the updater api, and make sure it still comes out
the same.
Quentin Colombet [Sat, 28 Jan 2017 01:05:27 +0000 (01:05 +0000)]
[RegisterCoalescing] Recommit the patch "Remove partial redundent copy".
In r292621, the recommit fixes a bug related with live interval update
after the partial redundent copy is moved.
This recommit solves an additional bug related to the lack of update of
subranges.
The original patch is to solve the performance problem described in
PR27827. Register coalescing sometimes cannot remove a copy because of
interference. But if we can find a reverse copy in one of the predecessor
block of the copy, the copy is partially redundent and we may remove the
copy partially by moving it to the predecessor block without the
reverse copy.
Evgeniy Stepanov [Sat, 28 Jan 2017 00:46:30 +0000 (00:46 +0000)]
Fix memory leak in globalisel.
  #0 0x89cdeb in operator new[](unsigned long) /code/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:84:37
  #1 0x4ec87c4 in llvm::RegisterBankInfo::ValueMapping const* llvm::RegisterBankInfo::getOperandsMapping<llvm::RegisterBankInfo::ValueMapping const* const*>(llvm::RegisterBankInfo::ValueMapping const* const*, llvm::RegisterBankInfo::ValueMapping const* const*) const /code/llvm/lib/CodeGen/GlobalISel/RegisterBankInfo.cpp:297:9
  #2 0x9327ee in llvm::AArch64RegisterBankInfo::getInstrMapping(llvm::MachineInstr const&) const /code/llvm/lib/Target/AArch64/AArch64RegisterBankInfo.cpp:540:30
  #3 0x4eb8d07 in llvm::RegBankSelect::assignInstr(llvm::MachineInstr&) /code/llvm/lib/CodeGen/GlobalISel/RegBankSelect.cpp:546:24
  #4 0x4eb9dd2 in llvm::RegBankSelect::runOnMachineFunction(llvm::MachineFunction&) /code/llvm/lib/CodeGen/GlobalISel/RegBankSelect.cpp:624:12
  #5 0x3141875 in llvm::MachineFunctionPass::runOnFunction(llvm::Function&) /code/llvm/lib/CodeGen/MachineFunctionPass.cpp:62:13
  #6 0x396128d in llvm::FPPassManager::runOnFunction(llvm::Function&) /code/llvm/lib/IR/LegacyPassManager.cpp:1513:27
  #7 0x3961832 in llvm::FPPassManager::runOnModule(llvm::Module&) /code/llvm/lib/IR/LegacyPassManager.cpp:1534:16
  #8 0x3962540 in runOnModule /code/llvm/lib/IR/LegacyPassManager.cpp:1590:27
  #9 0x3962540 in llvm::legacy::PassManagerImpl::run(llvm::Module&) /code/llvm/lib/IR/LegacyPassManager.cpp:1693
  #10 0x8ae368 in compileModule(char**, llvm::LLVMContext&) /code/llvm/tools/llc/llc.cpp:562:8
  #11 0x8a7a1b in main /code/llvm/tools/llc/llc.cpp:316:22
Tim Northover [Fri, 27 Jan 2017 23:54:31 +0000 (23:54 +0000)]
GlobalISel: don't leak super-entry BB when merging with IR-level one.
We have to delete the block manually or it leaks. That triggers failures in
-fsanitize=leak bots (unsurprisingly), which should be fixed by this patch.
Sanjay Patel [Fri, 27 Jan 2017 23:26:27 +0000 (23:26 +0000)]
[InstCombine] move icmp transforms that might be recognized as min/max and inf-loop (PR31751)
This is a minimal patch to avoid the infinite loop in:
https://llvm.org/bugs/show_bug.cgi?id=31751
But the general problem is bigger: we're not canonicalizing all of the min/max forms reported
by value tracking's matchSelectPattern(), and we don't define min/max consistently. Some code
uses matchSelectPattern(), other code uses matchers like m_Umax, and others have their own
inline definitions which may be subtly different from any of the above.
The reason that the test cases in this patch need a cast op to trigger is because we don't
(yet) canonicalize all min/max forms based on matchSelectPattern() in
canonicalizeMinMaxWithConstant(), but we do make min/max+cast transforms based on
matchSelectPattern() in visitSelectInst().
The location of the icmp transforms that trigger the inf-loop seems arbitrary at best, so
I'm moving those behind the min/max fence in visitICmpInst() as the quick fix.
Artem Tamazov [Fri, 27 Jan 2017 22:19:42 +0000 (22:19 +0000)]
[AMDGPU][mc] Fix memory corruption uncovered by AddressSanitizer during coverage/smoke Gfx7/8 testing.
Coverage/smoke Gfx7/8 tests were committed r292922 but then reverted
by r292974 due to AddressSanitizer failure, which is fixed by this patch.
Tests to be re-committed soon.
Mehdi Amini [Fri, 27 Jan 2017 19:48:57 +0000 (19:48 +0000)]
Global DCE performance improvement
Change the original algorithm so that it scales better when meeting
very large bitcode where every instruction does not implies a global.
The target query is "how to you get all the globals referenced by
another global"?
Before this patch, it was doing this by walking the body (or the
initializer) and collecting the references. What this patch is doing,
it precomputing the answer to this query for the whole module by
walking the use-list of every global instead.
Matthias Braun [Fri, 27 Jan 2017 18:53:07 +0000 (18:53 +0000)]
ScheduleDAGInstrs: Do not try to toggle kill flags on debug uses
Preparation for upcoming changes. No testcase as none of the public
targets bundles early enough and has a post machine scheduler enabled at
the same time. The error is also easily catched by asserts.
Matthew Simpson [Fri, 27 Jan 2017 17:33:16 +0000 (17:33 +0000)]
[ARM/AArch64] Relocate and update InterleavedAccessPass tests (NFC)
The interleaved access pass is an IR-to-IR transformation that runs before code
generation. It matches interleaved memory operations to target-specific
intrinsics (that are later lowered to load and store multiple instructions on
ARM/AArch64). We place tests for similar passes (e.g., GlobalMergePass) under
test/Transforms. This patch moves the InterleavedAccessPass tests out of
test/CodeGen and into target-specific directories under
test/Transforms/InterleavedAccess.
Although the pass is an IR pass, many of the existing tests were llc tests
rather opt tests. For example, the tests would check for ldN/stN instructions
generated by llc rather than the intrinsic calls the pass actually inserts.
Thus, this patch updates all tests to be opt tests that check for the inserted
intrinsics. We already have separate CodeGen tests that ensure we lower the
interleaved access intrinsics to their corresponding ldN/stN instructions. In
addition to migrating the tests to opt, this patch also performs some minor
clean-up (to ensure consistent naming, etc.).
Mehdi Amini [Fri, 27 Jan 2017 16:12:22 +0000 (16:12 +0000)]
Fix BasicAA incorrect assumption on GEP
This is fixing pr31761: BasicAA is deducing NoAlias
on the result of the GEP if the base pointer is itself NoAlias.
This is possible only if the NoAlias on the base pointer is
deduced with a non-sized query: this should guarantee that
the pointers are belonging to different memory allocation
and that the GEP can't legally jump from one to another.
There, NextMetadataNo gets incremented, and since the order
of arguments evaluation is not specified, that can happen
before or after other arguments are evaluated.
In a few cases the other arguments indirectly use NextMetadataNo.
For instance, it's
Therefore, the order of evaluation becomes important. That caused
a very subtle LLD crash that only happens if compiled with GCC or
if LLD is built with LTO. In the case if LLD is compiled with Clang
and regular linking mode, everything worked as intended.
This change extracts incrementing of NextMetadataNo outside of
the arguments list to guarantee the correct order of evaluation.
For the record, this has taken 3 days to track to the origin. It all
started with a ThinLTO bot in Chrome not being able to link a target
if debug info is enabled.
Simon Dardis [Fri, 27 Jan 2017 11:36:52 +0000 (11:36 +0000)]
[mips] Recommit: "N64 static relocation model support"
This patch makes one change to GOT handling and two changes to N64's
relocation model handling. Furthermore, the jumptable encodings have
been corrected for static N64.
Big GOT handling is now done via a new SDNode MipsGotHi - this node is
unconditionally lowered to an lui instruction.
The first change to N64's relocation handling is the lifting of the
restriction that N64 always uses PIC. Now it is possible to target static
environments.
The second change adds support for 64 bit symbols and enables them by
default. Previously N64 had patterns for sym32 mode only. In this mode all
symbols are assumed to have 32 bit addresses. sym32 mode support
is selectable with attribute 'sym32'. A follow on patch for clang will
add the necessary frontend parameter.
This partially resolves PR/23485.
Thanks to Brooks Davis for reporting the issue!
This version corrects a "Conditional jump or move depends on uninitialised
value(s)" error detected by valgrind present in the original commit.
Alexey Bataev [Fri, 27 Jan 2017 10:54:04 +0000 (10:54 +0000)]
[SLP] Refactoring of horizontal reduction analysis, NFC.
Some checks in SLP horizontal reduction analysis function are performed
several times, though it is enough to perform these checks only once
during an initial attempt at adding candidate for the reduction
instruction/reduced value.
Chandler Carruth [Fri, 27 Jan 2017 10:27:32 +0000 (10:27 +0000)]
[LICM] When we are recomputing the alias sets for a subloop, we cannot
skip sub-subloops.
The logic to skip subloops dated from when this code was shared with the
cached case. Once it was factored out to only run in the case of
recomputed subloops it became a dangerous bug. If a subsubloop contained
an interfering instruction it would be silently skipped from the alias
sets for LICM.
With the old pass manager this was extremely hard to trigger as it would
require failing to visit these subloops with the LICM pass but then
visiting the outer loop somehow. I've not yet contrived any test case
that actually manages to trigger this.
But with the new pass manager we don't do the cross-loop caching hack
that the old PM does and so we recompute alias set information from
first principles. While this seems much cleaner and simpler it exposed
this bug and would subtly miscompile code due to failing to correctly
model the aliasing constraints of deeply nested loops.
Jonas Paulsson [Fri, 27 Jan 2017 07:46:26 +0000 (07:46 +0000)]
[DAGTypeLegalizer] Handle SIGN/ZERO_EXTEND in WidenVecRes_Convert().
In case of a SIGN/ZERO_EXTEND of an incomplete vector type (using only a
partial number of available vector elements), WidenVecRes_Convert() used to
resort to scalarization.
This patch adds a handling of the (common) case where an input vector can be
found of same width as the widened result vector, by converting the node to
SIGN/ZERO_EXTEND_VECTOR_INREG.
Adam Nemet [Fri, 27 Jan 2017 06:39:08 +0000 (06:39 +0000)]
[opt-viewer] Remove message from the key
This is causing problems because the rendering of the text will depend on
varying global state to show relative hotness or a link in the inlining
context.
Adam Nemet [Fri, 27 Jan 2017 06:38:31 +0000 (06:38 +0000)]
[opt-viewer] Put critical items in parallel
Summary:
Put opt-viewer critical items in parallel
Patch by Brian Cain!
Requires features from Python 2.7
**Performance**
Below are performance results across various configurations. These were taken on an i5-5200U (dual core + HT). They were taken with a small subset of the YAML output of building Python 3.6.0b3 with LTO+PGO. 60 YAML files.
"ImportError" vs "class<...CLoader>" below are just confirming the expected configuration (with/without CLoader).
The below was measured on AMD A8-5500B (4 cores) with 224 input YAML files, showing a ~1.75x speed increase over the baseline with libYAML. I suspect it would scale well on high-end servers.
The Windows on ARM target uses custom division for normal division as
the backend needs to insert division-by-zero checks. However, it is
designed to only handle non-vectorized division. ARM has custom
lowering for vectorized division as that can avoid loading registers
with the values and invoke a division routine for each one, preferring
to lower using NEON instructions. Fall back to the custom lowering for
the NEON instructions if we encounter a vectorized division.
Daniel Berlin [Fri, 27 Jan 2017 02:37:11 +0000 (02:37 +0000)]
NewGVN: Add basic dead and redundant store elimination
Summary:
This adds basic dead and redundant store elimination to
NewGVN. Unlike our current DSE, it will happily do cross-block DSE if
it meets our requirements.
We get a bunch of DSE's simple.ll cases, and some stuff it doesn't.
Unlike DSE, however, we only try to eliminate stores of the same value
to the same memory location, not just general stores to the same
memory location.
Tim Shen [Fri, 27 Jan 2017 02:11:07 +0000 (02:11 +0000)]
[APFloat] Reduce some dispatch boilerplates. NFC.
Summary: This is an attempt to reduce the verbose manual dispatching code in APFloat. This doesn't handle multiple dispatch on single discriminator (e.g. APFloat::add(const APFloat&)), nor handles multiple dispatch on multiple discriminators (e.g. APFloat::convert()).
Justin Lebar [Fri, 27 Jan 2017 00:58:58 +0000 (00:58 +0000)]
[NVPTX] Upgrade NVVM intrinsics in InstCombineCalls.
Summary:
There are many NVVM intrinsics that we can't entirely get rid of, but
that nonetheless often correspond to target-generic LLVM intrinsics.
For example, if flush denormals to zero (ftz) is enabled, we can convert
@llvm.nvvm.ceil.ftz.f to @llvm.ceil.f32. On the other hand, if ftz is
disabled, we can't do this, because @llvm.ceil.f32 will be lowered to a
non-ftz PTX instruction. In this case, we can, however, simplify the
non-ftz nvvm ceil intrinsic, @llvm.nvvm.ceil.f, to @llvm.ceil.f32.
These transformations are particularly useful because they let us
constant fold instructions that appear in libdevice, the bitcode library
that ships with CUDA and essentially functions as its libm.
Justin Lebar [Fri, 27 Jan 2017 00:58:34 +0000 (00:58 +0000)]
[ValueTracking] Add comment that CannotBeOrderedLessThanZero does the wrong thing for powi.
Summary:
CannotBeOrderedLessThanZero(powi(x, exp)) returns true if
CannotBeOrderedLessThanZero(x). But powi(-0, exp) is negative if exp is
odd, so we actually want to return SignBitMustBeZero(x).
Except that also isn't right, because we want to return true if x is
NaN, even if x has a negative sign bit.
What we really need in order to fix this is a consistent approach in
this function to handling the sign bit of NaNs. Without this it's very
difficult to say what the correct behavior here is.
This is technically UB as the LangRef is written today if %a is ever less than
-0. But emitting code that's compliant with the current definition of sqrt
would require a branch, which would then prevent us from matching this idiom in
SelectionDAG (which we do today -- ISD::FSQRT has defined behavior on negative
inputs), because SelectionDAG looks at one BB at a time.
Nothing in LLVM takes advantage of this undefined behavior, as far as we can
tell, and the fact that llvm.sqrt has UB dates from its initial addition to the
LangRef.
Chandler Carruth [Fri, 27 Jan 2017 00:50:21 +0000 (00:50 +0000)]
[PM] Flesh out almost all of the late loop passes.
With this the per-module pass pipeline is *extremely* close to the
legacy PM. The missing pieces are:
- PruneEH (or some equivalent)
- ArgumentPromotion
- LoopLoadElimination
- LoopUnswitch
I'm going to work through those in essentially that order but this seems
like a worthwhile incremental step toward the end state.
One difference in what I have here from the legacy PM is that I've
consolidated some of the per-function passes at the very end of the
pipeline into the main optimization function pipeline. The intervening
passes are *really* uninteresting and so this seems very likely to have
any effect other than minor improvement to locality.
Note that there are still some failures in the test suite, but the
compiler doesn't crash or assert.