Chris Bieneman [Thu, 17 Nov 2016 04:36:59 +0000 (04:36 +0000)]
[CMake] [Darwin] Add support for debugging tablegen dependencies
This patch adds an option to the build system LLVM_DEPENDENCY_DEBUGGING. Over time I plan to extend this to do more complex verifications, but the initial patch causes compile errors wherever there is missing a dependency on intrinsics_gen.
Because intrinsics_gen is a compile-time dependency not a link-time dependency, everything that relies on the headers generated in intrinsics_gen needs an explicit dependency.
This patch updates a bunch of places where add_dependencies was being explicitly called to add dependencies on intrinsics_gen to instead use the DEPENDS named parameter. This cleanup is needed for a patch I'm working on to add a dependency debugging mode to the build system.
Dehao Chen [Thu, 17 Nov 2016 01:17:02 +0000 (01:17 +0000)]
Use profile info to adjust loop unroll threshold.
Summary:
For flat loop, even if it is hot, it is not a good idea to unroll in runtime, thus we set a lower partial unroll threshold.
For hot loop, we set a higher unroll threshold and allows expensive tripcount computation to allow more aggressive unrolling.
Sanjay Patel [Wed, 16 Nov 2016 22:34:05 +0000 (22:34 +0000)]
[x86] allow FP-logic ops when one operand is FP and result is FP
We save an inter-register file move this way. If there's any CPU where
the FP logic is slower, we could transform this back to int-logic in
MachineCombiner.
This helps, but doesn't solve, PR6137:
https://llvm.org/bugs/show_bug.cgi?id=6137
The 'andn' test shows that we're missing a pattern match to
recognize the xor with -1 constant as a 'not' op.
Ahmed Bougacha [Wed, 16 Nov 2016 22:24:56 +0000 (22:24 +0000)]
[CodeGen] Pull MMI helpers from FunctionLoweringInfo to MMI. NFC.
They're not SelectionDAG- or FunctionLoweringInfo-specific. They
are, however, specific to building MMI from IR.
We could make them members, but it's nice having MMI be a "simple" data
structure and this logic kept separate.
Kevin Enderby [Wed, 16 Nov 2016 22:17:38 +0000 (22:17 +0000)]
General clean up of error handling in llvm-objdump to remove its use of report_fatal_error().
No real functional change with this commit.
The problem with report_fatal_error() is it does not include the tool name
and the file name the for which the error message was generated.
Uses of report_fatal_error() were change to report_error() or error()
to get a better error and to make the code smaller and cleaner.
Also changed things like error(errorToErrorCode(SOrErr.takeError())) to
use report_error() with a file name and the llvm::Error (as well as the
ArchitectureName if available) so the error message is printed.
Dylan McKay [Wed, 16 Nov 2016 21:58:04 +0000 (21:58 +0000)]
[AVR] Add the pseudo instruction expansion pass
Summary:
A lot of the pseudo instructions are required because LLVM assumes that
all integers of the same size as the pointer size are legal. This means
that it will not currently expand 16-bit instructions to their 8-bit
variants because it thinks 16-bit types are legal for the operations.
This also adds all of the CodeGen tests that required the pass to run.
We only ever create TargetConstantPool, TargetJumpTable, TargetExternalSymbol,
TargetGlobalAddress, TargetGlobalTLSAddress, MCSymbol and TargetBlockAddress
nodes as operands of X86ISD::Wrapper nodes, so we can remove one check and
invert the other.
Also update the documentation comment for X86ISD::Wrapper.
Sanjoy Das [Wed, 16 Nov 2016 21:45:22 +0000 (21:45 +0000)]
[ImplicitNullChecks] Do not not handle call MachineInstrs
We don't track callee clobbered registers correctly, so avoid hoisting
across calls.
Note: for this bug to trigger we need a `readonly` call target, since we
already have logic to not hoist across potentially storing instructions
either.
Tim Northover [Wed, 16 Nov 2016 20:54:28 +0000 (20:54 +0000)]
ARM: fix CodeGen for 64-bit shifts.
One half of the shifts obviously needed conditional selection based on whether
the shift amount is more than 32-bits, but leaving the other half as the
natural shift isn't acceptable either: it's undefined behaviour to shift a
32-bit value by more than 31.
Rong Xu [Wed, 16 Nov 2016 20:50:06 +0000 (20:50 +0000)]
Make block placement deterministic
We fail to produce bit-to-bit matching stage2 and stage3 compiler in PGO
bootstrap build. The reason is because LoopBlockSet is of SmallPtrSet type
whose iterating order depends on the pointer value.
This patch fixes this issue by changing to use SmallSetVector.
Geoff Berry [Wed, 16 Nov 2016 19:35:19 +0000 (19:35 +0000)]
[AArch64] Handle vector types in replaceZeroVectorStore.
Summary:
Extend replaceZeroVectorStore to handle more vector type stores,
floating point zero vectors and set alignment more accurately on split
stores.
Tom Stellard [Wed, 16 Nov 2016 18:42:17 +0000 (18:42 +0000)]
AMDGPU/SI: Avoid creating unnecessary copies in the SIFixSGPRCopies pass
Summary:
1. Don't try to copy values to and from the same register class.
2. Replace copies with of registers with immediate values with v_mov/s_mov
instructions.
The main purpose of this change is to make MachineSink do a better job of
determining when it is beneficial to split a critical edge, since the pass
assumes that copies will become move instructions.
This prevents a regression in uniform-cfg.ll if we enable critical edge
splitting for AMDGPU.
Sanjay Patel [Wed, 16 Nov 2016 17:42:40 +0000 (17:42 +0000)]
[x86] add fake scalar FP logic instructions to ReplaceableInstrs to save some bytes
We can replace "scalar" FP-bitwise-logic with other forms of bitwise-logic instructions.
Scalar SSE/AVX FP-logic instructions only exist in your imagination and/or the bowels of
compilers, but logically equivalent int, float, and double variants of bitwise-logic
instructions are reality in x86, and the float variant may be a shorter instruction
depending on which flavor (SSE or AVX) of vector ISA you have...so just prefer float all
the time.
This is a preliminary step towards solving PR6137:
https://llvm.org/bugs/show_bug.cgi?id=6137
Lang Hames [Wed, 16 Nov 2016 17:31:09 +0000 (17:31 +0000)]
[Orc] Re-enable the RPC unit test disabled in r286917.
This unit test infinite-looped on s390x due to a thread_yield being optimized
out. I've updated the QueueChannel class (where thread_yield was called) to use
a condition variable instead. This should cause the unit test to behave
correctly.
Simon Pilgrim [Wed, 16 Nov 2016 14:48:32 +0000 (14:48 +0000)]
[X86][AVX512] Autoupgrade lossless i32/u32 to f64 conversion intrinsics with generic IR
Both the (V)CVTDQ2PD (i32 to f64) and (V)CVTUDQ2PD (u32 to f64) conversion instructions are lossless and can be safely represented as generic SINT_TO_FP/UINT_TO_FP calls instead of x86 intrinsics without affecting final codegen.
Simon Dardis [Wed, 16 Nov 2016 11:29:07 +0000 (11:29 +0000)]
[mips] Fix unsigned/signed type error
MipsFastISel uses a a class to represent addresses with a signed member
to represent the offset. MipsFastISel::emitStore, emitLoad and computeAddress
all treated the offset as being positive. In cases where the offset was
actually negative and a frame pointer was used, this would cause the constant
synthesis routine to crash as it would generate an unexpected instruction
sequence when frame indexes are replaced.
Craig Topper [Wed, 16 Nov 2016 05:24:10 +0000 (05:24 +0000)]
[X86] Remove the scalar intrinsics for fadd/fsub/fdiv/fmul
Summary: These intrinsics have been unused for clang for a while. This patch removes them. We auto upgrade them to extractelements, a scalar operation and then an insertelement. This matches the sequence used by clangs intrinsic file.
Davide Italiano [Wed, 16 Nov 2016 05:10:28 +0000 (05:10 +0000)]
[ELF] Convert ELF.h to Expected<T>.
This has two advantages:
1) We slowly move away from ErrorOr to the new handling interface,
in the hope of having an uniform error handling in LLVM, eventually.
2) We're starting to have *meaningful* error messages for invalid
object ELF files, rather than a generic "parse error". At some point
we should include also the offset to improve the quality of the
diagnostic.
[XRay][docs] Define requirements on installed log handlers.
Summary:
We update the documentation to define what the requirements are for the
provided XRay log handler. This is to make it clear that the function
pointer provided must do internal synchronisation and that there are no
guarantees provided by XRay on when the function shall be invoked once
it has been installed as a log handler.
Quentin Colombet [Wed, 16 Nov 2016 01:07:12 +0000 (01:07 +0000)]
[RegAllocGreedy] Record missed hint for late recoloring.
In https://reviews.llvm.org/D25347, Geoff noticed that we still have
useless copy that we can eliminate after register allocation. At the
time the allocation is chosen for those copies, they are not useless
but, because of changes in the surrounding code, later on they might
become useless.
The Greedy allocator already has a mechanism to deal with such cases
with a late recoloring. However, we missed to record the some of the
missed hints.
Fixed the lost FastMathFlags for CALL operations in SLPVectorizer.
Reviewer: Michael Zolotukhin.
Differential Revision: https://reviews.llvm.org/D26575
Justin Lebar [Wed, 16 Nov 2016 00:44:47 +0000 (00:44 +0000)]
[BypassSlowDivision] Handle division by constant numerators better.
Summary:
We don't do BypassSlowDivision when the denominator is a constant, but
we do do it when the numerator is a constant.
This patch makes two related changes to BypassSlowDivision when the
numerator is a constant:
* If the numerator is too large to fit into the bypass width, don't
bypass slow division (because we'll never run the smaller-width
code).
* If we bypass slow division where the numerator is a constant, don't
OR together the numerator and denominator when determining whether
both operands fit within the bypass width. We need to check only the
denominator.
Rui Ueyama [Wed, 16 Nov 2016 00:38:33 +0000 (00:38 +0000)]
Fix Modi and File count if there are more than 65535 modules/files.
These numbers are intended to be capped at 65535, but
`std::max<uint16_t>(UINT16_MAX, N)` always returns N for any N because
the expression is the same as `std::max((uint16_t)UINT16_MAX, (uint16_t)N)`.
Always use relative jump table encodings on PowerPC64.
For the default, small and medium code model, use the existing
difference from the jump table towards the label. For all other code
models, setup the picbase and use the difference between the picbase and
the block address.
Overall, this results in smaller data tables at the expensive of one or
two more arithmetic operation at the jump site. Given that we only create
jump tables with a lot more than two entries, it is a net win in size.
For larger code models the assumption remains that individual functions
are no larger than 2GB.
Kevin Enderby [Tue, 15 Nov 2016 23:07:41 +0000 (23:07 +0000)]
General clean up of Mach-O error handling in llvm-objdump.
To get a good error message for all files that could contain Mach-O
files the code in llvm-objdump needs to use the archive member name
and name of the architecture of a slice of a universal file in those cases
where the error come from a Mach-O file in an archive or a universal file.
Most of this is fixed by moving the call to checkSymbolTable() into
ProcessMachO() and calling it when the operation needs the symbol
table. And then calling the form of report_error() that has the
ArchiveName and ArchitectureName arguments. One other place
needed to call this form of report_error() also with these arguments.
Also changed the code in MachODump.cpp to not use report_fatal_error()
and use report_error() instead to make the code smaller and cleaner. All
cases of this are for errors with the symbol table which should now never
be tripped since checkSymbolTable() should be called first to get a good
error message in these cases.
Kuba Brecka [Tue, 15 Nov 2016 21:07:03 +0000 (21:07 +0000)]
Fix llvm-symbolizer to correctly sort a symbol array and calculate symbol sizes
Sometimes, llvm-symbolizer gives wrong results due to incorrect sizes of some symbols. The reason for that was an incorrectly sorted array in computeSymbolSizes. The comparison function used subtraction of unsigned types, which is incorrect. Let's change this to return explicit -1 or 1.
The wave barrier represents the discardable barrier. Its main purpose is to
carry convergent attribute, thus preventing illegal CFG optimizations. All lanes
in a wave come to convergence point simultaneously with SIMT, thus no special
instruction is needed in the ISA. The barrier is discarded during code generation.
Sanjay Patel [Tue, 15 Nov 2016 18:44:53 +0000 (18:44 +0000)]
[x86] auto-generate checks; NFC
Also, fix the test params to use an attribute rather than a CPU model
and remove the AVX run because that does nothing but check for a 'v'
prefix in all of these tests.
Wei Mi [Tue, 15 Nov 2016 18:35:53 +0000 (18:35 +0000)]
[LSR] Allow formula containing Reg for SCEVAddRecExpr related with outerloop.
In RateRegister of existing LSR, if a formula contains a Reg which is a SCEVAddRecExpr,
and this SCEVAddRecExpr's loop is an outerloop, the formula will be marked as Loser
and dropped.
Suppose we have an IR that %for.body is outerloop and %for.body2 is innerloop. LSR only
handle inner loop now so only %for.body2 will be handled.
Using the logic above, formula like
reg(%array) + reg({1,+, %size}<%for.body>) + 1*reg({0,+,1}<%for.body2>) will be dropped
no matter what because reg({1,+, %size}<%for.body>) is a SCEVAddRecExpr type reg related
with outerloop. Only formula like
reg(%array) + 1*reg({{1,+, %size}<%for.body>,+,1}<nuw><nsw><%for.body2>) will be kept
because the SCEVAddRecExpr related with outerloop is folded into the initial value of the
SCEVAddRecExpr related with current loop.
But in some cases, we do need to share the basic induction variable
reg{0 ,+, 1}<%for.body2> among LSR Uses to reduce the final total number of induction
variables used by LSR, so we don't want to drop the formula like
reg(%array) + reg({1,+, %size}<%for.body>) + 1*reg({0,+,1}<%for.body2>) unconditionally.
From the existing comment, it tries to avoid considering multiple level loops at the same time.
However, existing LSR only handles innermost loop, so for any SCEVAddRecExpr with a loop other
than current loop, it is an invariant and will be simple to handle, and the formula doesn't have
to be dropped.