Rafael Espindola [Mon, 30 Jan 2017 15:49:20 +0000 (15:49 +0000)]
Bring back r293480. It is safe now.
Original message:
Fix the values of two xcore ELF flags.
The values in llvm grew from pre-MC days, when they would not show up
in .o files, and they are outside of the SHF_MASKPROC mask.
Fortunately the MC output is not currently used, as xcore has its own
assembler and that assembler uses valid values. This updates llvm to
use the same values as the xmos assembler.
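For reference, a minimal sketch of the constraint involved; SHF_MASKPROC comes from the ELF spec, but the flag value below is an illustrative placeholder, not the real xcore value:

  #include <cassert>
  #include <cstdint>

  // ELF reserves these bits for processor-specific section flags.
  constexpr uint32_t SHF_MASKPROC = 0xf0000000;

  // Illustrative placeholder, not an actual xcore flag value.
  constexpr uint32_t SHF_XCORE_EXAMPLE = 0x10000000;

  int main() {
    // A valid processor-specific flag must lie entirely inside the mask.
    assert((SHF_XCORE_EXAMPLE & ~SHF_MASKPROC) == 0);
  }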
Rafael Espindola [Mon, 30 Jan 2017 15:38:43 +0000 (15:38 +0000)]
Only print architecture dependent flags for that architecture.
Different architectures can assign different meanings to flags in the
SHF_MASKPROC mask, so we should always check which architecture is in
use before checking the flag.
NFC for now, but will allow fixing the value of an xmos flag.
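A minimal sketch of the shape of such a check (EM_XCORE is the real ELF e_machine number; the actual flag decoding is elided):

  #include <cstdint>

  constexpr uint32_t SHF_MASKPROC = 0xf0000000;
  constexpr uint16_t EM_XCORE = 203; // ELF e_machine value for XCore

  // Only decode processor-specific bits for the architecture that owns them.
  const char *procFlagName(uint16_t Machine, uint32_t Flag) {
    if ((Flag & SHF_MASKPROC) == 0)
      return nullptr; // not a processor-specific flag
    switch (Machine) {
    case EM_XCORE:
      return "xcore-specific flag"; // real decoding elided
    default:
      return "unknown processor-specific flag";
    }
  }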
Tom Stellard [Mon, 30 Jan 2017 15:07:01 +0000 (15:07 +0000)]
TableGen: Fix infinite recursion in RegisterBankEmitter
Summary:
AMDGPU has two register classes with the same set of registers, which
was causing this TableGen backend to get stuck in infinite recursion.
Rafael Espindola [Mon, 30 Jan 2017 14:07:43 +0000 (14:07 +0000)]
Fix the values of two xcore ELF flags.
The values in llvm grew from pre-MC days, when they would not show up
in .o files, and they are outside of the SHF_MASKPROC mask.
Fortunately the MC output is not currently used, as xcore has its own
assembler and that assembler uses valid values. This updates llvm to
use the same values as the xmos assembler.
By calling getScalarizationOverhead with the CallInst instead of the types of
its arguments, we make sure that only unique call arguments are added to the
scalarization cost.
getScalarizationOverhead() is extended to handle calls by passing on only
the actual call arguments (which are not all of the operands).
This also eliminates a wrapper function with the same name.
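A self-contained sketch of the deduplication idea (stand-in types, not the actual TargetTransformInfo code):

  #include <set>
  #include <vector>

  struct Value {}; // stand-in for llvm::Value

  // Charge scalarization cost once per *unique* call argument, so a
  // value passed twice is not double-counted.
  unsigned scalarizationOverhead(const std::vector<const Value *> &CallArgs,
                                 unsigned PerArgCost) {
    std::set<const Value *> Unique(CallArgs.begin(), CallArgs.end());
    return static_cast<unsigned>(Unique.size()) * PerArgCost;
  }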
Craig Topper [Sun, 29 Jan 2017 22:53:33 +0000 (22:53 +0000)]
[AVX-512] Fix lowering for mask register concatenation with undef in the lower half.
Previously this test case fired an assertion in getNode because we tried to create an insert_subvector with both input types the same size and the index pointing to half the vector width.
Matthias Braun [Sun, 29 Jan 2017 18:20:42 +0000 (18:20 +0000)]
MachineInstr: Remove parameter from dump()
The primary use of the dump() functions in LLVM is for use in a
debugger. Unfortunately lldb does not seem to handle default arguments
so using `p SomeMI.dump()` fails and you have to type the longer `p
SomeMI.dump(nullptr)`. Remove the parameter to make the most common use
easy. (You can always construct something like `p
SomeMI.print(dbgs(), MyTII)` if you need more features.)
Simon Pilgrim [Sun, 29 Jan 2017 18:13:37 +0000 (18:13 +0000)]
[X86][SSE] Lower scalar_to_vector(0) to zero vector
Replaces an xor+movd/movq with an xorps, which is shorter in codesize, avoids an int-fpu transfer, allows modern cores to fast-path the result during decode, and helps other combines recognise an all-zero vector.
The only reason I can think of that we'd want to keep scalar_to_vector in this case is to help recognise that the upper elements are undef, but this doesn't seem to be a problem.
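In source terms, the pattern being improved looks roughly like this (an intrinsics sketch; the exact instructions chosen depend on the subtarget):

  #include <emmintrin.h>

  __m128i zero_via_scalar() {
    // scalar_to_vector(0): previously lowered as a GPR xor plus a movd.
    return _mm_cvtsi32_si128(0);
  }

  __m128i zero_direct() {
    // Recognised as an all-zero vector: a single xorps/pxor.
    return _mm_setzero_si128();
  }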
Matthias Braun [Sun, 29 Jan 2017 17:52:03 +0000 (17:52 +0000)]
llvm-c: Keep LLVMDumpModule() even in release builds
While this should probably be considered a debugging dump utility, the C
API currently has no other way to print a module to stderr for error
reporting purposes, so keep it even in release builds.
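For reference, a minimal use of the C API entry point (the module here is empty; real callers would dump a populated module):

  #include <llvm-c/Core.h>

  int main(void) {
    LLVMModuleRef M = LLVMModuleCreateWithName("demo");
    LLVMDumpModule(M); // prints the module's IR to stderr
    LLVMDisposeModule(M);
    return 0;
  }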
Support lowering AEABI TLS access (__aeabi_read_tp) with long calls.
This requires adjusting the call sequence to use an indirect call to get
full addressability.
PACKUSWB converts signed words to unsigned bytes with saturation (and the same holds for the doubleword variant), so it can't be used for the umin+truncate pattern.
The AVX-512 VPMOVUS* instructions fit the pattern, since they convert unsigned to unsigned.
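A scalar sketch of why signed-input saturation breaks the pattern (values illustrative):

  #include <algorithm>
  #include <cassert>
  #include <cstdint>

  int main() {
    uint16_t W = 0x8000; // a large *unsigned* word
    // umin + truncate saturates against the unsigned byte maximum:
    uint8_t Want = static_cast<uint8_t>(std::min<uint16_t>(W, 0xff)); // 0xff
    // PACKUSWB reads its input as *signed*; 0x8000 is negative and
    // clamps to 0 instead:
    int16_t S = static_cast<int16_t>(W);
    uint8_t Packed = S < 0 ? 0 : (S > 0xff ? 0xff : static_cast<uint8_t>(S));
    assert(Want == 0xff && Packed == 0);
    return 0;
  }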
Chandler Carruth [Sun, 29 Jan 2017 08:03:19 +0000 (08:03 +0000)]
[ArgPromote] Run clang-format to normalize remarkably idiosyncratic
formatting that has evolved here over the past years prior to making
somewhat invasive changes to thread new PM support through the business
logic.
Chandler Carruth [Sun, 29 Jan 2017 08:03:16 +0000 (08:03 +0000)]
[ArgPromote] Re-arrange the code in a more typical, logical way.
This arranges the static helpers in an order where they are defined
prior to their use, to avoid the need for forward declarations, and
collects the core pass components at the bottom below their helpers.
This also folds one trivial function into the pass itself. Factoring
out this 'runImpl' was an attempt to help with porting to the new pass
manager; however, when I began this port in earnest, it turned out not
to be a substantial help. I think it will be easier to factor things
without it.
This is an NFC change and makes a minimal number of edits overall.
Subsequent NFC cleanups will normalize the formatting with clang-format
and improve the basic doxygen commenting.
Craig Topper [Sun, 29 Jan 2017 04:38:19 +0000 (04:38 +0000)]
[DAGCombiner] Remove unnecessary check on the size of the type of the index of EXTRACT_SUBVECTOR.
The type system already requires that the number of vector elements must fit in 32 bits, so an index should as well. Even if the type of the index were larger, all we care about is that the constant index fits in 64 bits so that we can call getZExtValue.
David Majnemer [Sun, 29 Jan 2017 01:27:08 +0000 (01:27 +0000)]
[Target] Add NoSignedZerosFPMath to the TargetOptions constructor
Most flags were already initialized by the TargetOptions constructor but
we missed out on one. Also, simplify the constructor by using field
initializers when possible.
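The general shape of the simplification (illustrative member names, not the actual TargetOptions fields):

  // Before: every flag spelled out in the constructor's init list.
  struct OptionsBefore {
    OptionsBefore() : FlagA(false), FlagB(false), FlagC(false) {}
    bool FlagA, FlagB, FlagC;
  };

  // After: in-class field initializers, so the constructor no longer
  // has to mention each member (and can't forget one).
  struct OptionsAfter {
    bool FlagA = false;
    bool FlagB = false;
    bool FlagC = false;
  };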
Lang Hames [Sun, 29 Jan 2017 00:51:17 +0000 (00:51 +0000)]
[Orc][RPC] Remove a couple of redundant calls to abandonAllPendingResponses.
appendCallAsync, which all RPC call functions ultimately build on, will call
abandonAllPendingResponses on channel error. These extra calls are redundant.
Mohammad Shahid [Sat, 28 Jan 2017 17:59:44 +0000 (17:59 +0000)]
[SLP] Vectorize loads of consecutive memory accesses, accessed in a non-consecutive (jumbled) way.
The jumbled scalar loads are sorted while building the tree, and these accesses are marked so that a shufflevector with the proper mask is generated after the vectorized load.
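An example of the access pattern this enables (a source-level sketch; the transform itself runs on LLVM IR):

  void jumbled(int *a, int *b) {
    // Consecutive addresses a[0..3], accessed out of order.
    b[0] = a[1];
    b[1] = a[0];
    b[2] = a[3];
    b[3] = a[2];
    // After SLP: one <4 x i32> load of a[0..3], a shufflevector with
    // mask <1, 0, 3, 2>, then one vector store to b.
  }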
Support for barrier synchronization between a subset of threads
in a CTA through one of sixteen explicitly specified barriers.
These intrinsics are not directly exposed in CUDA but are
critical for forthcoming support of OpenMP on NVPTX GPUs.
The intrinsics allow the synchronization of an arbitrary
(multiple of 32) number of threads in a CTA at one of 16
distinct barriers. The two intrinsics added are as follows:
  call void @llvm.nvvm.barrier.n(i32 10)
waits for all threads in a CTA to arrive at named barrier #10.
  call void @llvm.nvvm.barrier(i32 15, i32 992)
waits for 992 threads in a CTA to arrive at barrier #15.
A detailed description of these intrinsics is available in the PTX manual.
http://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions
Vadim Chugunov [Sat, 28 Jan 2017 07:39:52 +0000 (07:39 +0000)]
This addresses LLDB bug 31699, which was caused by LLVM using static linking on Windows.
In order to make sure that LLVM continues to work on machines that do not have the Universal CRT yet,
we'll need to ship a copy of the UCRT in the Windows installation package. Fortunately, CMake 3.6+ already
supports app-local deployment of UCRT DLLs; we just need to turn this on.
Taewook Oh [Sat, 28 Jan 2017 07:05:43 +0000 (07:05 +0000)]
[InstCombine] Merge DebugLoc when speculatively hoisting store instruction
Summary: Along with https://reviews.llvm.org/D27804, debug locations need to be merged when hoisting store instructions as well. Not sure whether just dropping the debug locations would make more sense for this case, but since the branch instruction will have at least a different discriminator from the hoisted store instruction, I think there will be no difference in practice.
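A sketch of the merging step, assuming the DILocation::getMergedLocation helper referenced in D27804 (the exact signature may differ in this revision):

  #include "llvm/IR/DebugInfoMetadata.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // Give the hoisted store a location merged from both branches,
  // rather than inheriting just one of them verbatim.
  void mergeHoistedStoreLoc(StoreInst *Hoisted, StoreInst *Other) {
    DILocation *A = Hoisted->getDebugLoc();
    DILocation *B = Other->getDebugLoc();
    if (A && B)
      Hoisted->setDebugLoc(DILocation::getMergedLocation(A, B));
  }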
Quentin Colombet [Sat, 28 Jan 2017 02:23:48 +0000 (02:23 +0000)]
[RegisterBankInfo] Emit proper type for remapped registers.
When the OperandsMapper creates virtual registers, it used to just create
plain scalar registers of the right size. This may confuse the
instruction selector, because we lose the information about what the
instruction using those registers is supposed to do. The MachineVerifier
already complains about that.
With this patch, the OperandsMapper still creates plain scalar registers,
but the expectation is for the mapping function to remap the type
properly. The default mapping function has been updated to do that.
Matthias Braun [Sat, 28 Jan 2017 02:02:38 +0000 (02:02 +0000)]
Cleanup dump() functions.
We had various variants of defining dump() functions in LLVM. Normalize
them (this should just consistently implement the things discussed in
http://lists.llvm.org/pipermail/cfe-dev/2014-January/034323.html).
For reference:
- Public headers should just declare the dump() method but not use
LLVM_DUMP_METHOD or #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
- The definition of a dump method should look like this:
    #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
    LLVM_DUMP_METHOD void MyClass::dump() {
      // print stuff to dbgs()...
    }
    #endif
Daniel Berlin [Sat, 28 Jan 2017 01:23:13 +0000 (01:23 +0000)]
Introduce a basic MemorySSA updater that supports insertDef,
insertUse, moveBefore and moveAfter operations.
Summary:
This creates a basic MemorySSA updater that handles arbitrary
insertion of uses and defs into MemorySSA, as well as arbitrary
movement around the CFG. It replaces the current splice API.
It can be made to handle arbitrary control flow changes.
Currently, it uses the same updater algorithm from D28934.
The main difference is that, because MemorySSA is single-variable, we
have the complete def and use list and don't need anyone to give it to
us as part of the API. We also have to rename stores below us in some
cases.
If we go that direction in that patch, I will merge all the updater
implementations (using an updater_traits or something to provide the
get* functions we use, called read*/write* in that patch).
Sadly, the current SSAUpdater algorithm is way too slow to use for
what we are doing here.
I have updated the existing tests to build MemorySSA incrementally
using the updater API and to make sure it still comes out the same.
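A sketch of the intended usage, going by the operation names above (header path and signatures approximate for this revision):

  #include "llvm/Transforms/Utils/MemorySSAUpdater.h" // location may differ
  using namespace llvm;

  // After creating the MemoryDef for a newly inserted store, hand it to
  // the updater; it fixes up the accesses below the insertion point
  // (renaming stores where needed) instead of rebuilding MemorySSA.
  void noteNewStore(MemorySSAUpdater &Updater, MemoryDef *NewDef) {
    Updater.insertDef(NewDef);
  }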
Quentin Colombet [Sat, 28 Jan 2017 01:05:27 +0000 (01:05 +0000)]
[RegisterCoalescing] Recommit the patch "Remove partial redundant copy".
In r292621, the recommit fixes a bug related to live interval update
after the partially redundant copy is moved.
This recommit solves an additional bug related to the lack of update of
subranges.
The original patch solves the performance problem described in
PR27827. Register coalescing sometimes cannot remove a copy because of
interference. But if we can find a reverse copy in one of the predecessor
blocks of the copy, the copy is partially redundant and we may remove the
copy partially by moving it to the predecessor block without the
reverse copy.
Tim Northover [Fri, 27 Jan 2017 23:54:31 +0000 (23:54 +0000)]
GlobalISel: don't leak super-entry BB when merging with IR-level one.
We have to delete the block manually or it leaks. That triggers failures in
-fsanitize=leak bots (unsurprisingly), which should be fixed by this patch.
Sanjay Patel [Fri, 27 Jan 2017 23:26:27 +0000 (23:26 +0000)]
[InstCombine] move icmp transforms that might be recognized as min/max and cause an inf-loop (PR31751)
This is a minimal patch to avoid the infinite loop in:
https://llvm.org/bugs/show_bug.cgi?id=31751
But the general problem is bigger: we're not canonicalizing all of the min/max forms reported
by value tracking's matchSelectPattern(), and we don't define min/max consistently. Some code
uses matchSelectPattern(), other code uses matchers like m_Umax, and others have their own
inline definitions which may be subtly different from any of the above.
The reason that the test cases in this patch need a cast op to trigger is that we don't
(yet) canonicalize all min/max forms based on matchSelectPattern() in
canonicalizeMinMaxWithConstant(), but we do make min/max+cast transforms based on
matchSelectPattern() in visitSelectInst().
The location of the icmp transforms that trigger the inf-loop seems arbitrary at best, so
I'm moving those behind the min/max fence in visitICmpInst() as the quick fix.
Artem Tamazov [Fri, 27 Jan 2017 22:19:42 +0000 (22:19 +0000)]
[AMDGPU][mc] Fix memory corruption uncovered by AddressSanitizer during coverage/smoke Gfx7/8 testing.
Coverage/smoke Gfx7/8 tests were committed r292922 but then reverted
by r292974 due to AddressSanitizer failure, which is fixed by this patch.
Tests to be re-committed soon.
Mehdi Amini [Fri, 27 Jan 2017 19:48:57 +0000 (19:48 +0000)]
Global DCE performance improvement
Change the original algorithm so that it scales better when processing
very large bitcode files where not every instruction implies a global.
The target query is "how do you get all the globals referenced by
another global?"
Before this patch, it was answered by walking the body (or the
initializer) and collecting the references. What this patch does is
precompute the answer to this query for the whole module by
walking the use-list of every global instead.
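A sketch of the precomputation (self-contained stand-ins, not the actual GlobalDCE code):

  #include <map>
  #include <set>
  #include <vector>

  struct Global {};                 // stand-in for llvm::GlobalValue
  struct Use {                      // a use of one global inside another
    Global *UsedValue;
    Global *ContainingGlobal;
  };

  // One pass over every global's use-list answers "which globals does G
  // reference?" for the whole module, instead of re-walking G's body
  // each time the query is made.
  std::map<Global *, std::set<Global *>>
  buildRefMap(const std::vector<Use> &AllUses) {
    std::map<Global *, std::set<Global *>> Refs;
    for (const Use &U : AllUses)
      Refs[U.ContainingGlobal].insert(U.UsedValue);
    return Refs;
  }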
Matthias Braun [Fri, 27 Jan 2017 18:53:07 +0000 (18:53 +0000)]
ScheduleDAGInstrs: Do not try to toggle kill flags on debug uses
Preparation for upcoming changes. No testcase, as none of the public
targets bundles early enough and has a post machine scheduler enabled at
the same time. The error is also easily caught by asserts.