Matthias Braun [Tue, 19 Dec 2017 20:24:12 +0000 (20:24 +0000)]
TargetLoweringBase: Fix darwinHasSinCos()
Another follow-up to my refactoring in r321036: it turns out we can end up
with an x86 darwin target that is not macOS (simulator triples can look
like i386-apple-ios), so we need the x86/32-bit check in all cases.
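A reconstructed sketch of the shape of the resulting check (assumed details, not copied verbatim from the patch):

  #include "llvm/ADT/Triple.h"
  #include <cassert>
  using namespace llvm;

  static bool darwinHasSinCos(const Triple &TT) {
    assert(TT.isOSDarwin() && "should be called with darwin triple");
    // The 32-bit x86 check must come first: simulator triples such as
    // i386-apple-ios are x86 but not macOS.
    if (TT.getArch() == Triple::x86)
      return false;
    // macOS < 10.9 has no sincos_stret.
    if (TT.isMacOSX())
      return !TT.isMacOSXVersionLT(10, 9) && TT.isArch64Bit();
    // iOS < 7.0 has no sincos_stret.
    if (TT.isiOS())
      return !TT.isOSVersionLT(7, 0);
    // Any other darwin flavour (watchOS/tvOS) is new enough.
    return true;
  }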
Mark Searles [Tue, 19 Dec 2017 19:26:23 +0000 (19:26 +0000)]
[AMDGPU] Turn off MergeConsecutiveStores() before Instruction Selection for AMDGPU.
Commit dbbb6c5fc3642987430866dffdf710df4f616ac7 turned on
MergeConsecutiveStores() before Instruction Selection for all targets.
Enough AMDGPU compiles go into an infinite loop (MergeConsecutiveStores()
merges two stores; LegalizeStoreOps() un-merges; MergeConsecutiveStores()
re-merges; etc.) to warrant turning it off until the issues can be addressed.
Simon Pilgrim [Tue, 19 Dec 2017 16:54:07 +0000 (16:54 +0000)]
[X86][AVX512] Attempt target shuffle combining to different types instead of early-out
We try to prevent shuffle combining to value types that would stop the folding of masked operations, but by just returning early, we were failing to try different shuffle types.
The TODOs are all still relevant here to improve codegen but we're lacking test examples.
Ben Dunbobbin [Tue, 19 Dec 2017 14:49:33 +0000 (14:49 +0000)]
[ThinLTO][C-API] Correct API comments
Negative values never disabled the pruning - they simply set high values for the pruning interval.
The behaviour now is that negative values set the maximum pruning interval (which appears to have been the intention from the start); see https://reviews.llvm.org/D41231.
I have adjusted the comments to reflect this, removed any inaccurate statements, and corrected any typos I spotted in the English.
Previously, Interval was unsigned (see CachePruning.h); replacing the type with std::chrono::seconds (which is signed) caused a regression in behaviour, because the C API intends negative values to translate to large positive intervals to *effectively* disable the pruning (see the comments on setCachePruningInterval()).
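A minimal sketch of the intended mapping, with a stand-in Policy type (names assumed, based on the comments referenced above):

  #include <chrono>

  struct Policy { std::chrono::seconds Interval{}; }; // stand-in for CachePruningPolicy

  // Negative values map to the maximum interval, effectively disabling pruning.
  void setCachePruningInterval(Policy &P, int Interval) {
    P.Interval = Interval < 0 ? std::chrono::seconds::max()
                              : std::chrono::seconds(Interval);
  }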
Simon Pilgrim [Tue, 19 Dec 2017 13:16:43 +0000 (13:16 +0000)]
[X86][SSE] Add cpu feature for aggressive combining to variable shuffles
As mentioned in D38318 and D40865, modern Intel processors prefer to combine multiple shuffles into a single variable shuffle mask (PSHUFB/VPERMPS etc.) instead of using multiple stages of 'fixed' shuffles, which put more pressure on Port 5 (at the expense of extra shuffle mask loads).
This patch provides a FeatureFastVariableShuffle target flag for Haswell+ CPUs that prefers combining 2 or more fixed shuffles to a single variable shuffle (default is 3 shuffles).
The long-term aim is to drive more of this from scheduling data (probably via the MC layer), but we're not close to being ready for that yet.
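Hypothetically, the flag is consumed as a simple threshold in the shuffle combiner (a sketch with assumed names, not code from the patch):

  // With the fast-variable-shuffle feature, two or more fixed shuffles are
  // worth replacing with one variable shuffle; otherwise three or more.
  static bool shouldCombineToVariableShuffle(bool HasFastVariableShuffle,
                                             int NumFixedShuffles) {
    int Threshold = HasFastVariableShuffle ? 2 : 3;
    return NumFixedShuffles >= Threshold;
  }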
Pavel Labath [Tue, 19 Dec 2017 12:15:50 +0000 (12:15 +0000)]
[Support] Add WritableMemoryBuffer class
Summary:
The motivation here is LLDB, where we need to fix up relocations in
mmapped files before their contents can be read correctly. The
MemoryBuffer class does exactly what we need, *except* that it maps the
file in read-only mode.
WritableMemoryBuffer reuses the existing machinery for opening and
mmapping a file. The only difference is in the argument to the
mapped_file_region constructor -- we create a private copy-on-write
mapping, so that we can make changes to the mapped data, but the changes
aren't carried over to the underlying file.
This patch is based on an initial version by Zachary Turner.
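A hedged usage sketch (the exact factory signature may differ):

  #include "llvm/Support/MemoryBuffer.h"
  using namespace llvm;

  // Map a file copy-on-write and patch it in memory; the file on disk is
  // left untouched.
  bool patchFirstByte(const char *Path) {
    auto BufOrErr = WritableMemoryBuffer::getFile(Path);
    if (!BufOrErr)
      return false;
    std::unique_ptr<WritableMemoryBuffer> Buf = std::move(*BufOrErr);
    if (Buf->getBufferSize() > 0)
      Buf->getBufferStart()[0] = 0; // e.g. apply a relocation fixup
    return true;
  }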
Simon Pilgrim [Tue, 19 Dec 2017 12:02:40 +0000 (12:02 +0000)]
[X86][SSE] Use (V)PHMINPOSUW for vXi8 SMAX/SMIN/UMAX/UMIN horizontal reductions (PR32841)
Extension to D39729, which performed this for vXi16. With the same bit flipping to handle the SMAX/SMIN/UMAX cases, vXi8 UMIN horizontal reductions can be performed as well.
This makes use of the fact that, by performing a pair-wise i8 SHUFFLE/UMIN before PHMINPOSUW, we not only get the UMIN of each pair but also zero-extend the upper bits ready for v8i16.
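For reference, the bit-flipping trick in scalar form (an illustration, not the patch's code): an unsigned-minimum primitive is enough to derive the signed-min and unsigned-max variants.

  #include <algorithm>
  #include <cstdint>

  // SMIN via UMIN: flipping the sign bit converts signed order to unsigned.
  uint16_t smin_via_umin(uint16_t A, uint16_t B) {
    return std::min<uint16_t>(A ^ 0x8000, B ^ 0x8000) ^ 0x8000;
  }

  // UMAX via UMIN: complementing the inputs converts max to min.
  uint16_t umax_via_umin(uint16_t A, uint16_t B) {
    uint16_t NA = ~A, NB = ~B;
    return static_cast<uint16_t>(~std::min(NA, NB));
  }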
Simon Dardis [Tue, 19 Dec 2017 11:16:22 +0000 (11:16 +0000)]
[mips] Handle the emission of microMIPSr6 sll instruction when used as a nop.
This instruction is encoded as zero, so we have to handle that case when checking
for unimplemented opcodes when producing the encoding for an instruction.
[dwarfdump] Lookup needs to be an unsigned long long parameter.
Before this patch, dwarfdump's lookup parameter only accepted unsigned values.
Given that for many current platforms the load address already exceeds
unsigned (e.g. arm64 w/ 0x100000000), dwarfdump needs an unsigned long
long parameter.
Patch by: Dr. Michael 'Mickey' Lauer <mickey@vanille-media.de>
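A minimal sketch of the widened option (name and description assumed):

  #include "llvm/Support/CommandLine.h"

  // A 64-bit option type so load addresses above 4 GiB can be expressed.
  static llvm::cl::opt<unsigned long long>
      Lookup("lookup",
             llvm::cl::desc("Lookup an address in the debug information"));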
Max Kazantsev [Tue, 19 Dec 2017 09:10:21 +0000 (09:10 +0000)]
[JumpThreading] Restrict PRE across instructions that don't pass control to successors
PRE in JumpThreading should not be able to hoist copies of non-speculatable loads across
instructions that don't always transfer execution to their successors; otherwise it may
introduce an unsafe load which otherwise would not be executed.
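As a hand-wavy illustration (not from the patch):

  void mayThrow(); // may unwind instead of returning

  // Hoisting the load of *P above mayThrow() would introduce a load that
  // the original program might never execute.
  int use(int *P) {
    mayThrow(); // does not always transfer execution to its successor
    return *P;
  }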
Craig Topper [Tue, 19 Dec 2017 06:59:10 +0000 (06:59 +0000)]
[X86] Don't extend v16i8 non-uniform shifts to v16i32 if we have BWI. Use v16i16 instead.
BWI supports shifting by word amounts. Even if VLX isn't supported we can still widen to v32i16 and extract the lower half. For SKX it's preferable not to use 512-bit vectors if we can avoid them.
Craig Topper [Tue, 19 Dec 2017 06:29:00 +0000 (06:29 +0000)]
[X86] Use a specific list of MVTs in combineShiftRightArithmetic instead of iterating over every integer VT and checking their size.
Previously, we were checking for MVTs with sizes between 8 and 64, which only includes i8, i16, i32, and i64 today. But I don't think we should assume that, and we should instead list the types that are legal for x86. I also don't think we need i64, since type legalization is guaranteed to split those up.
Craig Topper [Tue, 19 Dec 2017 04:52:04 +0000 (04:52 +0000)]
[X86] Use ZERO_EXTEND instead of ANY_EXTEND when extending the shift amount for a non-uniform shift.
My reading of the SDM says that all bits of the shift amount are used. If the value of the element is larger than the number of bits in the result, the shift result is zero. So I think we need to zero_extend here to avoid garbage in the upper bits.
In reality we lower any_extend as zero_extend, so in most cases it would be hard to hit this.
Serguei Katkov [Tue, 19 Dec 2017 04:27:39 +0000 (04:27 +0000)]
Fix APFloat from string conversion for Inf
The method IEEEFloat::convertFromStringSpecials() does not recognize
the "+Inf" and "-Inf" strings but these strings are printed for
the double Infinities by the IEEEFloat::toString().
This patch adds the "+Inf" and "-Inf" strings to the list of recognized
patterns in IEEEFloat::convertFromStringSpecials().
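A small round-trip sketch of the symptom being fixed:

  #include "llvm/ADT/APFloat.h"
  #include "llvm/ADT/SmallString.h"
  using namespace llvm;

  void roundTripInf() {
    APFloat F = APFloat::getInf(APFloat::IEEEdouble()); // +infinity
    SmallString<16> S;
    F.toString(S); // prints "+Inf"
    // Before this patch, "+Inf" was not recognized as a special value
    // when parsed back; now the string round-trips.
    APFloat G(APFloat::IEEEdouble(), S.str());
    (void)G;
  }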
Quentin Colombet [Tue, 19 Dec 2017 02:57:23 +0000 (02:57 +0000)]
[TableGen][GlobalISel] Reset the internal map of RuleMatchers just before the emission
Between the creation of the last InstructionMatcher and the first
emission of the related Rule, we need to clear the internal map of IDs.
We used to do that right after the creation of the main
InstructionMatcher when building the rule, and although that worked, it
was fragile: if for some reason some later code decided to create more
InstructionMatchers before the final call to emit, the IDs would be
completely messed up.
Move that to the beginning of "emit" so that the IDs are guaranteed to be
consistent.
Justin Bogner [Tue, 19 Dec 2017 00:49:04 +0000 (00:49 +0000)]
update_mir_test_checks: Accept IR as input as well as MIR
We need to handle IR for tests that want to do lowering (or just
-stop-after with IR as input). I've run this on one AArch64 test to
demonstrate what it looks like.
Matthias Braun [Mon, 18 Dec 2017 23:19:42 +0000 (23:19 +0000)]
X86/AArch64/ARM: Factor out common sincos_stret logic; NFCI
Note:
- X86ISelLowering: setLibcallName(SINCOS) was superfluous as
InitLibcalls() already does it.
- ARMISelLowering: Setting the libcall names for sincos/sincosf seemed
superfluous: in the darwin case they wouldn't be used, while for all
other cases InitLibcalls() already does it.
Quentin Colombet [Mon, 18 Dec 2017 22:12:13 +0000 (22:12 +0000)]
[TableGen][GlobalISel] Make the arguments of the Instruction and Operand Matchers consistent
Move InsnVarID and OpIdx at the beginning of the list of arguments
for all the constructors of the OperandMatcher subclasses.
This matches what we do for the InstructionMatcher.
Bob Haarman [Mon, 18 Dec 2017 22:10:14 +0000 (22:10 +0000)]
Fix buffer overrun in WindowsResourceCOFFWriter::writeSymbolTable()
Summary:
We were using sprintf(..., "$R%06X", <some uint32_t>) to create strings
that are expected to be exactly length 8, but this results in longer
strings if the uint32_t is greater than 0xffffff. This change modifies
the behavior as follows:
- Uses the loop counter instead of the data offset. This gives us
sequential symbol names, avoiding collisions as much as possible.
- Masks the value to 0xffffff to avoid generating names longer than 8
bytes.
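A sketch of the resulting format logic (assumed shape, not the exact code):

  #include <cstdint>
  #include <cstdio>

  // "$R" plus six hex digits is always exactly 8 characters; masking the
  // sequential counter to 24 bits guarantees the width.
  void writeSymbolName(char (&Buf)[9], uint32_t Counter) {
    std::snprintf(Buf, sizeof(Buf), "$R%06X", Counter & 0xFFFFFF);
  }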
Quentin Colombet [Mon, 18 Dec 2017 21:25:53 +0000 (21:25 +0000)]
[TableGen][GlobalISel] Refactor the optimizeRules-related bits to allow code reuse
In theory, reapplying optimizeRules on each group's matchers should give
us a second nesting level in the matching table. In practice, we need
more work to make that happen because all the predicates are actually
not directly available through the predicate matchers list.
Ivan A. Kosarev [Mon, 18 Dec 2017 20:05:20 +0000 (20:05 +0000)]
[Analysis] Generate more precise TBAA tags when one access encloses the other
There are cases when two tags with different base types denote
accesses to the same direct or indirect member of a structure
type. Currently, merging of such tags results in a tag that
represents an access to an object that has the type of that
member. This patch changes this so that if one of the accesses
encloses the other, then the generic tag is the one of the
enclosed access.
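An illustrative case (hypothetical types):

  struct Inner { int I; };
  struct Outer { Inner In; };

  int readDirect(Inner *P) { return P->I; }    // access tag based on Inner
  int readNested(Outer *P) { return P->In.I; } // access tag based on Outer

  // Both functions access the same int member. The Outer-based access
  // encloses the Inner-based one, so the merged (generic) tag is now the
  // tag of the enclosed access, i.e. the one based on Inner.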
Teresa Johnson [Mon, 18 Dec 2017 20:02:43 +0000 (20:02 +0000)]
[PGO] Fix handling of cold entry count for instrumented PGO
Summary:
In r277849, getEntryCount was changed to return None when the entry
count was 0, specifically for SamplePGO where it means no samples were
recorded. However, for instrumentation PGO a 0 entry count should be
returned directly, since it does mean that the function was completely
cold. Otherwise we end up treating these functions conservatively
in isFunctionEntryCold() and isColdBB().
Instead, for SamplePGO use -1 when there are no samples, and change
getEntryCount to return None when the value is -1.
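A simplified sketch of the new semantics (not the actual Function API):

  #include "llvm/ADT/Optional.h"
  #include <cstdint>

  // SamplePGO stores -1 when no samples were recorded; that maps to None.
  // A real 0 from instrumented PGO is returned as-is and means "cold".
  llvm::Optional<uint64_t> interpretEntryCount(uint64_t Raw) {
    if (Raw == static_cast<uint64_t>(-1))
      return llvm::None;
    return Raw;
  }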
Quentin Colombet [Mon, 18 Dec 2017 19:47:41 +0000 (19:47 +0000)]
[TableGen][GlobalISel] Optimize MatchTable for faster instruction selection
*** Context ***
Prior to this patch, the table generated for matching instructions was
straightforward but highly inefficient.
Basically, each pattern generated its own set of self-contained checks
and actions.
E.g., TableGen generated:
// First pattern
CheckNumOperand 3
CheckOpcode G_ADD
...
Build ADDrr
// Second pattern
CheckNumOperand 3
CheckOpcode G_ADD
...
Build ADDri
// Third pattern
CheckNumOperand 3
CheckOpcode G_SUB
...
Build SUBrr
*** Problem ***
Because of that generation scheme, a *lot* of checks were redundant
between patterns and were re-run every single time until we reached the
pattern that matched.
E.g., taking the previous table, let's say we are matching a G_SUB; that
means we are going to check all the rules for G_ADD before looking at
the G_SUB rule. In particular we are going to do:
check 3 operands; PASS
check G_ADD; FAIL
; Next rule
check 3 operands; PASS (but we already knew that!)
check G_ADD; FAIL (well it is still not true)
; Next rule
check 3 operands; PASS (really!!)
check G_SUB; PASS (at last :P)
*** Proposed Solution ***
This patch introduces a concept of group of rules (GroupMatcher) that
share some predicates and only get checked once for the whole group.
This patch only creates groups with one nesting level. Conceptually
there is nothing preventing us from having deeper nesting levels. However,
the current implementation is not smart enough to share the recording
(aka capturing) of values. That limits its ability to do more sharing.
For the given example the current patch will generate:
// First group
CheckOpcode G_ADD
// First pattern
CheckNumOperand 3
...
Build ADDrr
// Second pattern
CheckNumOperand 3
...
Build ADDri
// Second group
CheckOpcode G_SUB
// Third pattern
CheckNumOperand 3
...
Build SUBrr
But if we allowed several nesting levels, it could create a subgroup
for the CheckNumOperand 3.
(We would need to call optimizeRules on the rules within a group.)
*** Result ***
With only one level of nesting, the instruction selection pass is up
to 4x faster. For instance, one instruction now takes 500 checks,
instead of 24k! With more nesting, I believe we could get into the tens.
Jessica Paquette [Mon, 18 Dec 2017 19:33:21 +0000 (19:33 +0000)]
[MachineOutliner] Recommit r320229
LR was undefined entering outlined functions that contain calls. This made the
machine verifier unhappy when expensive checks were enabled. This fixes that.
Don Hinton [Mon, 18 Dec 2017 19:15:15 +0000 (19:15 +0000)]
[cmake] Update experimental target error message
Summary:
Update this error message to indicate that this test only ensures that
experimental targets were passed via LLVM_EXPERIMENTAL_TARGETS_TO_BUILD.
Originally, this test validated all targets, but in r184923, it was moved
after the LLVMBUILDTOOL test, which also validates all targets, making
that part of the test redundant.
Teresa Johnson [Mon, 18 Dec 2017 18:00:32 +0000 (18:00 +0000)]
[ThinLTO] Make distributed indexes test more robust
Modify test so that it passes in the reverse-iteration bot.
We use DenseMap instead of std::map for the summaries to emit into
distributed index files. The iteration order is not defined, but
it is deterministic, which is good enough.
Sander de Smalen [Mon, 18 Dec 2017 16:48:53 +0000 (16:48 +0000)]
[AArch64][SVE] Asm: Improve diagnostics further when +sve is not specified
Summary: Patch [4/4] in a series to add parsing of predicates and properly parse SVE ZIP1/ZIP2 instructions. This patch further improves diagnostic messages for when the SVE feature is not specified.
Sander de Smalen [Mon, 18 Dec 2017 14:34:24 +0000 (14:34 +0000)]
[TableGen][AsmMatcherEmitter] Only choose specific diagnostic for enabled instruction
Summary:
When emitting a diagnostic for an invalid operand, a specific diagnostic
should only be reported when the instruction being matched is actually
enabled by the feature flags.
Patch [3/4] in a series to add parsing of predicates and properly parse SVE
ZIP1/ZIP2 instructions. This patch fixes bogus diagnostic messages for when
the SVE feature is not specified.
Diana Picus [Mon, 18 Dec 2017 13:22:28 +0000 (13:22 +0000)]
[ARM GlobalISel] Fix G_(UN)MERGE_VALUES handling after r319524
r319524 has made more G_MERGE_VALUES/G_UNMERGE_VALUES pairs legal than
are supported by the rest of the pipeline. Restrict that to only the
cases that we can currently handle: packing 32-bit values into 64-bit
ones, when we have hardware FP.
Max Kazantsev [Mon, 18 Dec 2017 13:01:32 +0000 (13:01 +0000)]
[ConstantRange] Support for ashr in ConstantRange computation
Extend the ConstantRange implementation to compute the range of possible values resulting from an arithmetic right shift operation.
There will be a follow up patch to leverage this constant range infrastructure in LazyValueInfo.
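A hedged usage sketch of the new ConstantRange::ashr:

  #include "llvm/IR/ConstantRange.h"
  using namespace llvm;

  void ashrExample() {
    ConstantRange LHS(APInt(8, 16), APInt(8, 128)); // values 16..127
    ConstantRange Amt(APInt(8, 2), APInt(8, 3));    // shift by exactly 2
    ConstantRange Res = LHS.ashr(Amt);              // roughly [4, 32)
    (void)Res;
  }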
Tim Northover [Mon, 18 Dec 2017 10:36:00 +0000 (10:36 +0000)]
AArch64: work around how Cyclone handles "movi.2d vD, #0".
For Cyclone, the instruction "movi.2d vD, #0" is executed incorrectly in some rare
circumstances. Work around the issue conservatively by avoiding the instruction entirely.
This patch changes CodeGen so that problematic instructions are never
generated, and the AsmParser so that an equivalent instruction is used (with a
warning).
Sam Parker [Mon, 18 Dec 2017 10:04:27 +0000 (10:04 +0000)]
[DAGCombine] Move AND nodes to multiple load leaves
Search from AND nodes to find whether they can be propagated back to
loads, so that the AND and load can be combined into a narrow load.
We search through OR, XOR and other AND nodes, and all bar one of the
leaves are required to be loads or constants. The exception node then
needs to be masked off, meaning that the 'and' isn't removed, but the
load(s) are still narrowed.
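An illustrative source pattern (hypothetical, not a test from the patch):

  #include <cstdint>

  // All leaves of the OR are loads, so the AND can be folded away by
  // narrowing both loads (to byte loads at the appropriate offsets).
  uint32_t lowByteOfOr(const uint32_t *P, const uint32_t *Q) {
    return (*P | *Q) & 0xFF;
  }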
Hiroshi Inoue [Mon, 18 Dec 2017 06:47:37 +0000 (06:47 +0000)]
[SROA] Disable non-whole-alloca splits by default
This patch introduces a switch to control splitting of non-whole-alloca slices, defaulting to off.
The switch will default to on again once the issue reported in PR35657 is fixed.
Craig Topper [Mon, 18 Dec 2017 04:50:05 +0000 (04:50 +0000)]
[X86] Fix mistake that I made when splitting up the setOperationAction calls recently.
The block into which I moved things that need BWI and either 512-bit vectors or VLX was incorrectly qualified with just hasBWI || hasVLX. Here I've qualified it with hasBWI && (hasAVX512 || hasVLX), where the hasAVX512 check will be replaced with a check for allowing 512-bit vectors in an upcoming patch.