Zachary Turner [Tue, 21 Feb 2017 19:29:56 +0000 (19:29 +0000)]
Remove svn:eol-style property from 2 files.
There are still over 3400 files remaining with this property set, but there are tens of thousands more with the property not set. Until we decide what to do on a global scale, this at least unblocks me temporarily.
Matt Arsenault [Tue, 21 Feb 2017 19:12:08 +0000 (19:12 +0000)]
AMDGPU: Don't use stack space for SGPR->VGPR spills
Before frame offsets are calculated, try to eliminate the frame indexes used by SGPR spills, then delete them afterwards.
I think for now we can be sure that no other instruction
will be re-using the same frame indexes. It should be easy
to notice if this assumption ever breaks since everything
asserts if it tries to use a dead frame index later.
The unused emergency stack slot seems to still be left behind,
so an additional 4 bytes is still wasted.
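A minimal sketch of the idea (the list name and placement below are assumptions, not the actual AMDGPU code): once an SGPR spill has been rewritten to use VGPR lanes, its frame index can be removed before frame offsets are assigned so it no longer consumes stack space.

    #include "llvm/ADT/ArrayRef.h"
    #include "llvm/CodeGen/MachineFrameInfo.h"
    #include "llvm/CodeGen/MachineFunction.h"

    using namespace llvm;

    // Sketch: release frame indices that were only referenced by SGPR spills
    // after those spills have been lowered; any later use of them will assert.
    static void removeSpillFrameIndices(MachineFunction &MF,
                                        ArrayRef<int> SpilledSGPRFrameIndices) {
      MachineFrameInfo &MFI = MF.getFrameInfo();
      for (int FI : SpilledSGPRFrameIndices)   // hypothetical bookkeeping list
        if (!MFI.isDeadObjectIndex(FI))
          MFI.RemoveStackObject(FI);
      }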
Adrian Prantl [Tue, 21 Feb 2017 19:03:15 +0000 (19:03 +0000)]
Teach the IR verifier to reject conflicting debug info for function arguments.
Conflicting debug info for function arguments causes hard-to-debug
assertions in the DWARF backend, so the Verifier should reject it.
For performance reasons this only checks function arguments from
non-inlined debug intrinsics for now.
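As a rough illustration of the kind of conflict being rejected (a sketch only, not the actual Verifier code): two non-inlined llvm.dbg.declare intrinsics attaching different variables to the same argument slot of one function.

    #include "llvm/ADT/DenseMap.h"
    #include "llvm/IR/DebugInfoMetadata.h"
    #include "llvm/IR/Function.h"
    #include "llvm/IR/InstIterator.h"
    #include "llvm/IR/IntrinsicInst.h"

    using namespace llvm;

    // Sketch: report a conflict if two non-inlined dbg.declare intrinsics
    // describe the same argument number with different local variables.
    static bool hasConflictingArgDebugInfo(Function &F) {
      DenseMap<unsigned, DILocalVariable *> SeenArg;
      for (Instruction &I : instructions(F)) {
        auto *DDI = dyn_cast<DbgDeclareInst>(&I);
        if (!DDI)
          continue;
        const DebugLoc &DL = DDI->getDebugLoc();
        if (DL && DL.getInlinedAt())
          continue;                      // only non-inlined intrinsics
        DILocalVariable *Var = DDI->getVariable();
        unsigned ArgNo = Var->getArg();  // 0 means "not a parameter"
        if (!ArgNo)
          continue;
        auto It = SeenArg.insert({ArgNo, Var});
        if (!It.second && It.first->second != Var)
          return true;                   // two different variables, one slot
      }
      return false;
    }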
Geoff Berry [Tue, 21 Feb 2017 18:53:14 +0000 (18:53 +0000)]
[CodeGenPrepare] Sink and duplicate more 'and' instructions.
Summary:
Rework the code that was sinking/duplicating (icmp and, 0) sequences
into blocks where they were being used by conditional branches to form
more tbz instructions on AArch64. The new code is more general in that it just looks for 'and's whose users are all compares against zero, with a target hook used to select which subset of 'and' instructions to consider.
This change also enables 'and' sinking for X86, where it is more widely
beneficial than on AArch64.
The 'and' sinking/duplicating code is moved into the optimizeInst phase of CodeGenPrepare, where it can take advantage of the fact that OptimizeCmpExpression has already sunk/duplicated any icmps into the blocks where they are used. One minor complication from this change is that optimizeLoadExt needed to be updated to always add to the InsertedInsts set any 'and' it has determined should stay in the same block as its feeding load, in order to avoid an infinite loop of hoisting and sinking the same 'and'.
This change fixes a regression on X86 in the tsan runtime caused by
moving GVNHoist to a later place in the optimization pipeline (see
PR31382).
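For context, this is the kind of source-level pattern involved (illustrative helpers, not from the patch): a single-bit 'and' whose only user is a compare against zero feeding a conditional branch, which AArch64 can lower to tbz/tbnz once the 'and' lives in the same block as the branch.

    bool handleHot();   // hypothetical helpers, declared only for illustration
    bool handleCold();

    bool dispatch(unsigned Flags) {
      // and + icmp ne 0 + br: a candidate for sinking into the branch's block.
      if (Flags & 0x8)
        return handleHot();
      return handleCold();
    }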
John Brawn [Tue, 21 Feb 2017 16:41:29 +0000 (16:41 +0000)]
[ARM] Correct SP/PC handling in t2MOVr
PC isn't allowed in the source operand of t2MOVr, so change the register class
to one without PC. SP handling is slightly trickier and changes depending on whether we're in ARMv8, so do that in checkTargetMatchPredicate.
The patch introduces a new way of narrowing complex (>UINT16 variants) solutions.
The new method is introduced under the "-lsr-exp-narrow" option (currently set to true).
Summary:
The method is based on the mathematical expectation of the number of registers and should generally be closer to the optimal solution.
Please see details in comments to
"LSRInstance::NarrowSearchSpaceByDeletingCostlyFormulas()" function
(in lib/Transforms/Scalar/LoopStrengthReduce.cpp).
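One way to read "mathematical expectation of the number of registers" (a conceptual illustration only, not necessarily the exact heuristic; see the function comments referenced above): if each use independently picked one of its remaining formulas uniformly at random, then by linearity of expectation

    E[#regs] = \sum_R \Pr[R\ \text{is used}]
             = \sum_R \Bigl(1 - \prod_U \bigl(1 - k_{R,U}/n_U\bigr)\Bigr)

where n_U is the number of candidate formulas for use U and k_{R,U} is how many of them contain register R.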
Craig Topper [Tue, 21 Feb 2017 06:39:13 +0000 (06:39 +0000)]
[X86] Use SHLD with both inputs from the same register to implement rotate on Sandy Bridge and later Intel CPUs
Summary:
Sandy Bridge and later CPUs have better throughput using a SHLD to implement rotate versus the normal rotate instructions. Additionally it saves one uop and avoids a partial flag update dependency.
This patch implements this change on any Sandy Bridge or later processor without BMI2 instructions. With BMI2 we will use RORX as we currently do.
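Illustrative only: a rotate by a constant like the one below is the kind of pattern that can be emitted as SHLD with both inputs from the same register on these CPUs.

    #include <cstdint>

    // Rotate left by 8; without BMI2, Sandy Bridge and later can use
    // `shld reg, reg, 8` (same register twice) instead of `rol`.
    uint32_t rotl8(uint32_t X) {
      return (X << 8) | (X >> 24);
    }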
Matthias Braun [Tue, 21 Feb 2017 01:27:33 +0000 (01:27 +0000)]
ScheduleDAG: Cleanup; NFC
- Fix doxygen comments (do not repeat documented name, remove the definition comment if there is already one at the declaration, add \p, ...); see the sketch after this list
- Add some const modifiers
- Use range based for
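A tiny before/after sketch of the comment style being applied (made-up declaration, for illustration):

    #include "llvm/CodeGen/ScheduleDAG.h"

    // Before (style being removed): the comment repeats the documented name
    // and does not use \p for parameters:
    //   /// computeLatency - Compute node latency for SU.
    //
    // After (style being applied):
    /// Compute the latency of \p SU.
    void computeLatency(llvm::SUnit *SU);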
Matthias Braun [Tue, 21 Feb 2017 01:27:29 +0000 (01:27 +0000)]
SubtargetFeature: Cleanup; NFC
- Fix doxygen comments
- Remove duplicated comments
- Remove section comments (which became wrong over time)
- Use more `const` and references but less `auto`
Taewook Oh [Tue, 21 Feb 2017 00:12:38 +0000 (00:12 +0000)]
[BranchFolding] Update the debug location along with the update of the branch instruction.
Summary:
Currently, BranchFolder drops DebugLoc for branch instructions in some places. For example, for the test code attached, the branch instruction of 'entry' block has a DILocation of
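A hedged sketch of the shape of the fix (not the actual BranchFolder code; the opcode and helper are placeholders): when a branch is rewritten, the new branch should carry the DebugLoc of the one it replaces instead of an empty location.

    #include "llvm/CodeGen/MachineBasicBlock.h"
    #include "llvm/CodeGen/MachineInstrBuilder.h"
    #include "llvm/IR/DebugLoc.h"
    #include "llvm/Target/TargetInstrInfo.h"

    using namespace llvm;

    static void replaceWithUncondBranch(MachineBasicBlock &MBB,
                                        MachineInstr &OldBr,
                                        MachineBasicBlock *Target,
                                        const TargetInstrInfo &TII,
                                        unsigned UncondBrOpc) {
      // Preserve the location of the branch being replaced.
      DebugLoc DL = OldBr.getDebugLoc();
      OldBr.eraseFromParent();
      BuildMI(&MBB, DL, TII.get(UncondBrOpc)).addMBB(Target);
    }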
Daniel Sanders [Mon, 20 Feb 2017 15:30:43 +0000 (15:30 +0000)]
[globalisel] OperandPredicateMatchers shouldn't need to generate the MachineOperand expr. NFC
Summary:
Each OperandPredicateMatcher shouldn't need to know how to generate the expression
to reference a MachineOperand. The OperandMatcher should provide it.
In addition to separating responsibilities, this also lays some groundwork for
decoupling source patterns from destination patterns to allow invented operands
or operands provided by GlobalISel's equivalent to the ComplexPattern<> class.
Diana Picus [Mon, 20 Feb 2017 14:45:58 +0000 (14:45 +0000)]
[ARM] GlobalISel: Don't select atomic loads
There used to be a check in the IRTranslator that prevented us from having to
deal with atomic loads/stores. That check has been removed in r294993 and the
AArch64 backend was updated accordingly. This commit does the same thing for the
ARM backend.
In general, in the ARM backend we introduce fences during the atomic expand
pass, so we don't have to worry about atomics, *except* for the 32-bit ARMv8
target, which handles atomics more like AArch64. Since we don't want to worry
about that yet, just bail out of instruction selection if we find any atomic
loads.
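A hedged sketch of what such a bail-out can look like (assumed names and structure, not the exact ARM selector code):

    #include "llvm/CodeGen/MachineInstr.h"
    #include "llvm/CodeGen/MachineMemOperand.h"
    #include "llvm/Support/AtomicOrdering.h"
    #include "llvm/Target/TargetOpcodes.h"

    using namespace llvm;

    // Return true if instruction selection should give up on this G_LOAD
    // because its memory operand carries an atomic ordering.
    static bool shouldBailOnAtomicLoad(const MachineInstr &I) {
      if (I.getOpcode() != TargetOpcode::G_LOAD || !I.hasOneMemOperand())
        return false;
      const MachineMemOperand &MMO = **I.memoperands_begin();
      return MMO.getOrdering() != AtomicOrdering::NotAtomic;
    }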
Daniel Sanders [Mon, 20 Feb 2017 14:31:27 +0000 (14:31 +0000)]
[globalisel] Separate the SelectionDAG importer from the emitter. NFC
Summary:
In the near future the rules will be sorted between these two steps to ensure that more important rules are not prevented from matching by less important ones.
Igor Breger [Mon, 20 Feb 2017 14:16:29 +0000 (14:16 +0000)]
[X86] Fix EXTRACT_VECTOR_ELT with variable index from v32i16 and v64i8 vector.
It's more profitable to go through memory (1 cycle throughput) than to use a VMOVD + VPERMV/PSHUFB sequence (2/3 cycles throughput) to implement EXTRACT_VECTOR_ELT with a variable index.
The IACA tool was used to get the performance estimate (https://software.intel.com/en-us/articles/intel-architecture-code-analyzer).
For example, for the var_shuffle_v16i8_v16i8_xxxxxxxxxxxxxxxx_i8 test from vector-shuffle-variable-128.ll I get 26 cycles vs 79 cycles.
Also remove the VINSERT node; we don't need it any more.
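A rough source-level illustration of the "go through memory" approach (illustrative code, not the backend's actual lowering; requires AVX-512F):

    #include <immintrin.h>
    #include <cstdint>

    // Extract a 16-bit lane at a runtime index from a 512-bit vector by
    // spilling it to an aligned buffer instead of building a shuffle sequence.
    static int16_t extract_i16(__m512i V, unsigned Idx) {
      alignas(64) int16_t Buf[32];
      _mm512_store_si512(Buf, V);
      return Buf[Idx & 31];
    }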
Sjoerd Meijer [Mon, 20 Feb 2017 10:57:54 +0000 (10:57 +0000)]
AArch64AsmParser: tablegen the isBranchTarget helper functions
Use tablegen to autogenerate isBranchTarget helper functions. This is a cleanup
that removes almost identical functions that differ only in a few constants.
Ayman Musa [Mon, 20 Feb 2017 08:27:54 +0000 (08:27 +0000)]
[X86][AVX] Extend hasVEX_WPrefix bit to accept WIG value (W Ignore) + update all AVX instructions with the new value.
Add the WIG value to all AVX instructions that ignore the W-bit in their encoding, instead of giving them the default value of 0.
This patch is needed for follow-up work on the EVEX2VEX pass (replacing EVEX-encoded instructions with their corresponding VEX version when possible).
The current ObjectLinkingLayer (now RTDyldObjectLinkingLayer) links objects
in-process using MCJIT's RuntimeDyld class. In the near future I hope to add new
object linking layers (e.g. a remote linking layer that links objects in the JIT
target process, rather than the client), so I'm renaming this class to be more
descriptive.
Simon Pilgrim [Sun, 19 Feb 2017 19:40:31 +0000 (19:40 +0000)]
[X86][SSE] Use getTargetConstantBitsFromNode to find zeroable shuffle elements.
Replaces the existing approach, which could only search BUILD_VECTOR nodes.
Requires getTargetConstantBitsFromNode to discriminate cases with all/partial UNDEF bits in each element - this should also be useful when we get around to supporting getTargetShuffleMaskIndices with UNDEF elements.
Simon Pilgrim [Sun, 19 Feb 2017 17:19:38 +0000 (17:19 +0000)]
[X86][SSE] Enable initial support for domain crossing at high shuffle combine depths.
As discussed on D27692, this permits another domain to be used to combine a shuffle at high depths.
We currently set the required depth at 4 or more combined shuffles; this is probably too high for most targets but is a good starting point and already helps avoid a number of costly variable shuffles.
Craig Topper [Sun, 19 Feb 2017 08:03:23 +0000 (08:03 +0000)]
[AVX-512] Add patterns to show missed opportunities for folding vpternlog with broadcast loads. Also demonstrates a bug in the commuting of broadcast vpternlog instructions when we are able to select them.