Mikael Holmen [Tue, 14 Feb 2017 06:37:42 +0000 (06:37 +0000)]
[LSR] Pointers with different address spaces are considered incompatible.
Summary:
Function isCompatibleIVType is already used as a guard before the call to
SE.getMinusSCEV(OperExpr, PrevExpr);
in LSRInstance::ChainInstruction. getMinusSCEV requires the expressions
to be of the same type, so we now consider two pointers with different
address spaces to be incompatible, since it is possible that the pointers
in fact have different sizes.
Lang Hames [Tue, 14 Feb 2017 05:40:01 +0000 (05:40 +0000)]
[Orc][RPC] Remove lanch policies in favor of async handlers.
Launch policies provided a mechanism for running RPC handlers on a background
thread (unblocking the main RPC receiver thread). Async handlers generalize
this by passing the responder function (the function that sends the RPC return
value) as an argument to the handler. The handler can optionally do its work on
a background thread (the same way launch policies do), but can also (a) can
inspect the call arguments before deciding to run the work on a different
thread, or (b) can use the responder in a subsequent RPC call (e.g. in the
handler of a callAsync), allowing the handler to call back to the originator (or
to a 3rd party) without blocking the listener thread, and without launching a
new thread.
Philip Reames [Tue, 14 Feb 2017 01:38:31 +0000 (01:38 +0000)]
[LICM] Make store promotion work in the face of unordered atomics
Extend our store promotion code to deal with unordered atomic accesses. Ordered atomics continue to be unhandled.
Most of the change is straight-forward, the only complicated bit is in the reasoning around mixing of atomic and non-atomic memory access. Rather than trying to reason about the complex semantics in these cases, I simply disallowed promotion when both atomic and non-atomic accesses are present. This is conservatively correct.
It seems really tempting to just promote all access to atomics, but the original accesses might have been conditional. Since we can't lower an arbitrary atomic type, it might not be safe to promote all access to atomic. Consider a loop like the following:
while(b) {
load i128 ...
if (can lower i128 atomic)
store atomic i128 ...
else
store i128
}
It could be there's no race on the location and thus the code is perfectly well defined even if we can't lower a i128 atomically.
It's not clear we need to be this conservative - arguably the program above is brocken since it can't be lowered unless the branch is folded - but I didn't want to have to fix any fallout which might result.
Reid Kleckner [Tue, 14 Feb 2017 01:21:39 +0000 (01:21 +0000)]
Use std::call_once on Windows
Previously we could not use it because std::once_flag's default
constructor was not constexpr. Today, all supported versions of VS
correctly mark it constexpr. I confirmed that MSVC 2015 does not emit
any problematic racy dynamic initialization code, so we should be safe
to use this now.
Andrew Kaylor [Mon, 13 Feb 2017 23:38:52 +0000 (23:38 +0000)]
[X86] Add MXCSR register
This adds MXCSR to the set of recognized registers for X86 targets and updates the instructions that read or write it. I do not intend for all of the various floating point instructions that implicitly use the control bits or update the status bits of this register to ever have that usage modeled by default. However, when constrained floating point modes (such as strict FP exception status modeling or dynamic rounding modes) are enabled, implicit use/def information for MXCSR will be added to those instructions.
Until those additional updates are made this should cause (almost?) no functional changes. Theoretically, this will prevent instructions like LDMXCSR and STMXCSR from being moved past one another, but that should be prevented anyway and I haven't found a case where it is happening now.
Sanjoy Das [Mon, 13 Feb 2017 23:19:07 +0000 (23:19 +0000)]
[LangRef] Explicitly allow readnone and reaodnly functions to unwind
Summary:
This change edits the language reference to explicitly allow the
existence of readnone and readonly functions that can throw. Full
discussion at
http://lists.llvm.org/pipermail/llvm-dev/2017-January/108637.html
Sanjoy Das [Mon, 13 Feb 2017 23:14:03 +0000 (23:14 +0000)]
[LangRef] Update the TBAA section
Summary:
Update the TBAA section to mention the struct path TBAA that LLVM
implements today. This is not a proposal or change in semantics -- it
is intended only to **document** what LLVM already does today.
This is related to https://reviews.llvm.org/D26438 where I've tried to
implement some of the constraints as verifier checks.
Sanjay Patel [Mon, 13 Feb 2017 23:10:51 +0000 (23:10 +0000)]
[FunctionAttrs] try to extend nonnull-ness of arguments from a callsite back to its parent function
As discussed here:
http://lists.llvm.org/pipermail/llvm-dev/2016-December/108182.html
...we should be able to propagate 'nonnull' info from a callsite back to its parent.
The original motivation for this patch is our botched optimization of "dyn_cast" (PR28430),
but this won't solve that problem.
The transform is currently disabled by default while we wait for clang to work-around
potential security problems:
http://lists.llvm.org/pipermail/cfe-dev/2017-January/052066.html
IR: Type ID summary extensions for WPD; thread summary into WPD pass.
Make the whole thing testable by adding YAML I/O support for the WPD
summary information and adding some negative tests that exercise the
YAML support.
Taewook Oh [Mon, 13 Feb 2017 18:15:31 +0000 (18:15 +0000)]
Make MachineBasicBlock::updateTerminator to update DebugLoc as well
Summary:
Currently MachineBasicBlock::updateTerminator simply drops DebugLoc for newly created branch instructions, which may cause incorrect stepping and/or imprecise sample profile data. Below is an example:
```
1 extern int bar(int x);
2
3 int foo(int *begin, int *end) {
4 int *i;
5 int ret = 0;
6 for (
7 i = begin ;
8 i != end ;
9 i++)
10 {
11 ret += bar(*i);
12 }
13 return ret;
14 }
```
Below is a bitcode of 'foo' at the end of LLVM-IR level optimizations with -O3:
. As you can see, the terminator of 'entry' block, which is a loop control branch, has a DebugLoc of line 6, column 3. Howerver, after the execution of 'MachineBlock::updateTerminator' function, which is triggered by MachineSinking pass, the DebugLoc info is dropped as below (see there's no debug-location for JNE_1):
Matthew Simpson [Mon, 13 Feb 2017 18:02:35 +0000 (18:02 +0000)]
Revert "[LV] Extend trunc optimization to all IVs with constant integer steps"
This reverts commit r294967. This patch caused execution time slowdowns in a
few LLVM test-suite tests, as reported by the clang-cmake-aarch64-quick bot.
I'm reverting to investigate.
Quentin Colombet [Mon, 13 Feb 2017 17:38:59 +0000 (17:38 +0000)]
[FastISel] Add a diagnostic to warm on fallback.
This is consistent with what we do for GlobalISel. That way, it is easy
to see whether or not FastISel is able to fully select a function.
At some point we may want to switch that to an optimization remark.
Matthew Simpson [Mon, 13 Feb 2017 16:48:00 +0000 (16:48 +0000)]
[LV] Extend trunc optimization to all IVs with constant integer steps
This patch extends the optimization of truncations whose operand is an
induction variable with a constant integer step. Previously we were only
applying this optimization to the primary induction variable. However, the cost
model assumes the optimization is applied to the truncation of all integer
induction variables (even regardless of step type). The transformation is now
applied to the other induction variables, and I've updated the cost model to
ensure it is better in sync with the transformation we actually perform.
Simon Dardis [Mon, 13 Feb 2017 16:06:48 +0000 (16:06 +0000)]
[mips] divide macro instruction cleanup.
Clean up the implementation of divide macro expansion by getting rid of a
FIXME regarding magic numbers and branch instructions. Match GAS' behaviour
for expansion of ddiv / div in the two and three operand cases. Add the two
operand alias for MIPSR6. Finally, optimize macro expansion cases where the
divisior is the $zero register.
Sanne Wouda [Mon, 13 Feb 2017 14:07:45 +0000 (14:07 +0000)]
[CodeGen] fix alignment of JUMPTABLE_INSTS on v8M.base
Summary:
The attached test case fails with "fatal error: error in backend:
misaligned pc-relative fixup value" as the jump table is misaligned.
The EmitAlignment existed already for ARM and Thumb-1 code, but was
missing for Thumb-2.
The test checks that the fatal error disappears when generating an obj
file, as well as checking the align directive is there when producing an
asm file.
James Molloy [Mon, 13 Feb 2017 14:07:39 +0000 (14:07 +0000)]
[Thumb-1] TBB generation: spot redefinitions of index register
We match a sequence of 3-4 instructions into a tTBB pseudo. One of our checks is that
a particular register in that sequence is killed (so it can be clobbered by the pseudo).
We weren't noticing if an errant MOV or other instruction had infiltrated the
sequence we were walking. If it had, and it defined the register we've already
identified as killed, it makes it live across the tBR_JT and thus unclobberable.
Sanne Wouda [Mon, 13 Feb 2017 13:58:00 +0000 (13:58 +0000)]
[Assembler] Improve diagnostics for inline assembly.
Summary:
Keep a vector of LocInfos around; one for each call to EmitInlineAsm.
Since each call to EmitInlineAsm creates a new buffer in the inline asm
SourceMgr, we can use the buffer number to map to the right LocInfo.
James Molloy [Mon, 13 Feb 2017 12:32:47 +0000 (12:32 +0000)]
[ARM] Use VCMP, not VCMPE, for floating point equality comparisons
When generating a floating point comparison we currently unconditionally
generate VCMPE. This has the sideeffect of setting the cumulative Invalid
bit in FPSCR if any of the operands are QNaN.
It is expected that use of a relational predicate on a QNaN value should
raise Invalid. Quoting from the C standard:
The relational and equality operators support the usual mathematical
relationships between numeric values. For any ordered pair of numeric
values exactly one of relationships the less, greater, equal and is true.
Relational operators may raise the floating-point exception when argument
values are NaNs.
The standard doesn't explicitly state the expectation for equality operators,
but the implication and obvious expectation is that equality operators
should not raise Invalid on a QNaN input, as those predicates are wholly
defined on unordered inputs (to return not equal).
Therefore, add a new operand to ARMISD::FPCMP and FPCMPZ indicating if
QNaN should raise Invalid, and pipe that through to TableGen.
Compile time decreasing in the case we're dealing with Machine Combiner.
Before this patch compile time was about 21s (see below). After this patch
we have less than 2s (see bellow).
Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
DAGCombiner - trunk
time ./llc spill_fdiv.ll -o /dev/null -enable-unsafe-fp-math
real 0m1.685s
DAGCombiner + Speed patch
time ./llc spill_fdiv.ll -o /dev/null -enable-unsafe-fp-math
real 0m1.655s
MachineCombiner w/o Speed patch
time ./llc spill_fdiv.ll -o /dev/null -enable-unsafe-fp-math
real 0m21.614s
MachineCombiner + Speed patch
time ./llc spill_fdiv.ll -o /dev/null -enable-unsafe-fp-math
real 0m1.593s
The test spill_fdiv.ll is attached to D29627
D29627 should be closed.
Alexey Bataev [Mon, 13 Feb 2017 08:01:26 +0000 (08:01 +0000)]
[SLP] Fix for PR31690: Allow using of extra values in horizontal
reductions.
Currently, LLVM supports vectorization of horizontal reduction
instructions with initial value set to 0. Patch supports vectorization
of reduction with non-zero initial values. Also, it supports a
vectorization of instructions with some extra arguments, like:
```
float f(float x[], int a, int b) {
float p = a % b;
p += x[0] + 3;
for (int i = 1; i < 32; i++)
p += x[i];
return p;
}
```
Patch allows vectorization of this kind of horizontal reductions.
Craig Topper [Mon, 13 Feb 2017 04:53:33 +0000 (04:53 +0000)]
[DAGCombiner] Teach DAG combine that inserting an extract_subvector result into the same location of a an undef vector can just use the original input to the extract.
Craig Topper [Mon, 13 Feb 2017 04:53:29 +0000 (04:53 +0000)]
[X86] Genericize the handling of INSERT_SUBVECTOR from an EXTRACT_SUBVECTOR to support 512-bit vectors with 128-bit or 256-bit subvectors.
We now detect that both the extract and insert indices are non-zero and convert to a shuffle. This will be lowered as a blend for 256-bit vectors or as a vshuf operations for 512-bit vectors.
Craig Topper [Sun, 12 Feb 2017 23:49:46 +0000 (23:49 +0000)]
[X86] Don't let LowerEXTRACT_SUBVECTOR call getNode for EXTRACT_SUBVECTOR.
This results in the simplifications inside of getNode running while we're legalizing nodes popped off the worklist during the final DAG combine. This basically makes a DAG combine like operation occur during this legalize step, but we don't handle something quite the same way. I think we don't recursively added the removed nodes to the DAG combiner worklist.
Daniel Berlin [Sun, 12 Feb 2017 23:24:47 +0000 (23:24 +0000)]
NewGVN: We really pass TBAA if we enable DCE and fix the test. Note that GVN eliminates no-use readonly/readnone calls, even if they are not marked nounwind. NewGVN only eliminates them if they are marked nounwind, and thus, trivially dead.
Sanjay Patel [Sun, 12 Feb 2017 23:07:52 +0000 (23:07 +0000)]
[TargetLowering] fix SETCC SETLT folding with FP types
The bug was introduced with:
https://reviews.llvm.org/rL294863
...and manifests as a selection failure in x86, but that's actually
another bug. This fix prevents wrong codegen with -0.0, but in the
more common case when we have NSZ and NNAN (-ffast-math), we should
still be able to fold this setcc/compare.
Craig Topper [Sun, 12 Feb 2017 18:47:44 +0000 (18:47 +0000)]
[AVX-512] Add VMOV64toSDZrm CodeGenOnly instruction based on the same instruction from AVX/SSE.
I can't prove that we can select this instruction or the AVX/SSE version, but I'm adding it for consistency for now so I can continue matching the load folding tables.
Dorit Nuzman [Sun, 12 Feb 2017 09:32:53 +0000 (09:32 +0000)]
[LV/LoopAccess] Check statically if an unknown dependence distance can be
proven larger than the loop-count
This fixes PR31098: Try to resolve statically data-dependences whose
compile-time-unknown distance can be proven larger than the loop-count,
instead of resorting to runtime dependence checking (which are not always
possible).
For vectorization it is sufficient to prove that the dependence distance
is >= VF; But in some cases we can prune unknown dependence distances early,
and even before selecting the VF, and without a runtime test, by comparing
the distance against the loop iteration count. Since the vectorized code
will be executed only if LoopCount >= VF, proving distance >= LoopCount
also guarantees that distance >= VF. This check is also equivalent to the
Strong SIV Test.
Chandler Carruth [Sun, 12 Feb 2017 05:34:04 +0000 (05:34 +0000)]
[PM] Enable GlobalsAA in the new PM's pipeline by default.
All the invalidation issues and bugs in this seem to be fixed, it has
survived a full build of the test suite plus SPEC with asserts and ASan
enabled on the Clang binary used.
Craig Topper [Sat, 11 Feb 2017 22:57:12 +0000 (22:57 +0000)]
[X86] Move code for using blendi for insert_subvector out to an isel pattern. This gives the DAG combiner more opportunity to optimize without needing to dig through the blend.
Mehdi Amini [Sat, 11 Feb 2017 21:26:52 +0000 (21:26 +0000)]
Update Kaleidoscope tutorial and improve Windows support
Many quoted code blocks were not in sync with the actual toy.cpp
files. Improve tutorial text slightly in several places.
Added some step descriptions crucial to avoid crashes (like
InitializeNativeTarget* calls).
Solve/workaround problems with Windows (JIT'ed method not found, using
custom and standard library functions from host process).