Quentin Colombet [Mon, 18 Dec 2017 22:12:13 +0000 (22:12 +0000)]
[TableGen][GlobalISel] Make the arguments of the Instruction and Operand Matchers consistent
Move InsnVarID and OpIdx at the beginning of the list of arguments
for all the constructors of the OperandMatcher subclasses.
This matches what we do for the InstructionMatcher.
Bob Haarman [Mon, 18 Dec 2017 22:10:14 +0000 (22:10 +0000)]
Fix buffer overrun in WindowsResourceCOFFWriter::writeSymbolTable()
Summary:
We were using sprintf(..., "$R06X", <some uint32_t>) to create strings
that are expected to be exactly length 8, but this results in longer
strings if the uint32_t is greater than 0xffffff. This change modifies
the behavior as follows:
- Uses the loop counter instead of the data offset. This gives us
sequential symbol names, avoiding collisions as much as possible.
- Masks the value to 0xffffff to avoid generating names longer than 8
bytes.
Quentin Colombet [Mon, 18 Dec 2017 21:25:53 +0000 (21:25 +0000)]
[TableGen][GlobalISel] Refactor optimizeRules related bit to allow code reuse
In theory, reapplying optimizeRules on each group matchers should give
us a second nesting level on the matching table. In practice, we need
more work to make that happen because all the predicates are actually
not directly available through the predicate matchers list.
Ivan A. Kosarev [Mon, 18 Dec 2017 20:05:20 +0000 (20:05 +0000)]
[Analysis] Generate more precise TBAA tags when one access encloses the other
There are cases when two tags with different base types denote
accesses to the same direct or indirect member of a structure
type. Currently, merging of such tags results in a tag that
represents an access to an object that has the type of that
member. This patch changes this so that if one of the accesses
encloses the other, then the generic tag is the one of the
enclosed access.
Teresa Johnson [Mon, 18 Dec 2017 20:02:43 +0000 (20:02 +0000)]
[PGO] Fix handling of cold entry count for instrumented PGO
Summary:
In r277849, getEntryCount was changed to return None when the entry
count was 0, specifically for SamplePGO where it means no samples were
recorded. However, for instrumentation PGO a 0 entry count should be
returned directly, since it does mean that the function was completely
cold. Otherwise we end up treating these functions conservatively
in isFunctionEntryCold() and isColdBB().
Instead, for SamplePGO use -1 when there are no samples, and change
getEntryCount to return None when the value is -1.
Quentin Colombet [Mon, 18 Dec 2017 19:47:41 +0000 (19:47 +0000)]
[TableGen][GlobalISel] Optimize MatchTable for faster instruction selection
*** Context ***
Prior to this patchw, the table generated for matching instruction was
straight forward but highly inefficient.
Basically, each pattern generates its own set of self contained checks
and actions.
E.g., TableGen generated:
// First pattern
CheckNumOperand 3
CheckOpcode G_ADD
...
Build ADDrr
// Second pattern
CheckNumOperand 3
CheckOpcode G_ADD
...
Build ADDri
// Third pattern
CheckNumOperand 3
CheckOpcode G_SUB
...
Build SUBrr
*** Problem ***
Because of that generation, a *lot* of check were redundant between each
pattern and were checked every single time until we reach the pattern
that matches.
E.g., Taking the previous table, let say we are matching a G_SUB, that
means we were going to check all the rules for G_ADD before looking at
the G_SUB rule. In particular we are going to do:
check 3 operands; PASS
check G_ADD; FAIL
; Next rule
check 3 operands; PASS (but we already knew that!)
check G_ADD; FAIL (well it is still not true)
; Next rule
check 3 operands; PASS (really!!)
check G_SUB; PASS (at last :P)
*** Proposed Solution ***
This patch introduces a concept of group of rules (GroupMatcher) that
share some predicates and only get checked once for the whole group.
This patch only creates groups with one nesting level. Conceptually
there is nothing preventing us for having deeper nest level. However,
the current implementation is not smart enough to share the recording
(aka capturing) of values. That limits its ability to do more sharing.
For the given example the current patch will generate:
// First group
CheckOpcode G_ADD
// First pattern
CheckNumOperand 3
...
Build ADDrr
// Second pattern
CheckNumOperand 3
...
Build ADDri
// Second group
CheckOpcode G_SUB
// Third pattern
CheckNumOperand 3
...
Build SUBrr
But if we allowed several nesting level, it could create a sub group
for the checknumoperand 3.
(We would need to call optimizeRules on the rules within a group.)
*** Result ***
With only one level of nesting, the instruction selection pass is up
to 4x faster. For instance, one instruction now takes 500 checks,
instead of 24k! With more nesting we could get in the tens I believe.
Jessica Paquette [Mon, 18 Dec 2017 19:33:21 +0000 (19:33 +0000)]
[MachineOutliner] Recommit r320229
LR was undefined entering outlined functions that contain calls. This made the
machine verifier unhappy when expensive checks were enabled. This fixes that.
Don Hinton [Mon, 18 Dec 2017 19:15:15 +0000 (19:15 +0000)]
[cmake] Update experimental target error message
Summary:
Update this error message indicate this test only ensures experimental
targets were passed via LLVM_EXPERIMENTAL_TARGETS_TO_BUILD.
Originally, this test validated all targets, but in r184923, it was moved
after the LLVMBUILDTOOL test, which also validates all targets, making
that part of the test redundant.
Teresa Johnson [Mon, 18 Dec 2017 18:00:32 +0000 (18:00 +0000)]
[ThinLTO] Make distributed indexes test more robust
Modify test so that it passes in the reverse-iteration bot.
We use DenseMap instead of std::map for the summaries to emit into
distributed index files. The iteration order is not defined, but
it is deterministic, which is good enough.
Sander de Smalen [Mon, 18 Dec 2017 16:48:53 +0000 (16:48 +0000)]
[AArch64][SVE] Asm: Improve diagnostics further when +sve is not specified
Summary: Patch [4/4] in a series to add parsing of predicates and properly parse SVE ZIP1/ZIP2 instructions. This patch further improves diagnostic messages for when the SVE feature is not specified.
Sander de Smalen [Mon, 18 Dec 2017 14:34:24 +0000 (14:34 +0000)]
[TableGen][AsmMatcherEmitter] Only choose specific diagnostic for enabled instruction
Summary:
When emitting a diagnostic for an invalid operand, a specific diagnostic
should only be reported when the instruction being matched is actually
enabled by the feature flags.
Patch [3/4] in a series to add parsing of predicates and properly parse SVE
ZIP1/ZIP2 instructions. This patch fixes bogus diagnostic messages for when
the SVE feature is not specified.
Diana Picus [Mon, 18 Dec 2017 13:22:28 +0000 (13:22 +0000)]
[ARM GlobalISel] Fix G_(UN)MERGE_VALUES handling after r319524
r319524 has made more G_MERGE_VALUES/G_UNMERGE_VALUES pairs legal than
are supported by the rest of the pipeline. Restrict that to only the
cases that we can currently handle: packing 32-bit values into 64-bit
ones, when we have hardware FP.
Max Kazantsev [Mon, 18 Dec 2017 13:01:32 +0000 (13:01 +0000)]
[ConstantRange] Support for ashr in ConstantRange computation
Extend the ConstantRange implementation to compute the range of possible values resulting from an arithmetic right shift operation.
There will be a follow up patch to leverage this constant range infrastructure in LazyValueInfo.
Tim Northover [Mon, 18 Dec 2017 10:36:00 +0000 (10:36 +0000)]
AArch64: work around how Cyclone handles "movi.2d vD, #0".
For Cylone, the instruction "movi.2d vD, #0" is executed incorrectly in some rare
circumstances. Work around the issue conservatively by avoiding the instruction entirely.
This patch changes CodeGen so that problematic instructions are never
generated, and the AsmParser so that an equivalent instruction is used (with a
warning).
Sam Parker [Mon, 18 Dec 2017 10:04:27 +0000 (10:04 +0000)]
[DAGCombine] Move AND nodes to multiple load leaves
Search from AND nodes to find whether they can be propagated back to
loads, so that the AND and load can be combined into a narrow load.
We search through OR, XOR and other AND nodes and all bar one of the
leaves are required to be loads or constants. The exception node then
needs to be masked off meaning that the 'and' isn't removed, but the
loads(s) are narrowed still.
Hiroshi Inoue [Mon, 18 Dec 2017 06:47:37 +0000 (06:47 +0000)]
[SROA] Disable non-whole-alloca splits by default
This patch introduce a switch to control splitting of non-whole-alloca slices with default off.
The switch will be default on again after fixing an issue reported in PR35657.
Craig Topper [Mon, 18 Dec 2017 04:50:05 +0000 (04:50 +0000)]
[X86] Fix mistake that I made when splitting up the setOperationAction calls recently.
The block I moved things that need BWI and 512-bit or VLX is incorrectly qualified with just hasBWI || hasVLX. Here I've qualified it with hasBWI && (hasAVX512 || hasVLX) where the hasAVX512 will be replaced with allowing 512-bit vectors in an upcoming patch.
Serguei Katkov [Mon, 18 Dec 2017 04:25:07 +0000 (04:25 +0000)]
[CGP] Fix the handling select inst in complex addressing mode
When we put the value in select placeholder we must pass
the value through simplification tracker due to the value might
be already simplified and erased.
Craig Topper [Sun, 17 Dec 2017 18:23:45 +0000 (18:23 +0000)]
[X86] Make the code that creates fmaddsub from build_vector of extracts and inserts functional and add tests.
Summary:
We had no tests for this and we couldn't do the optimization because of a bad use count check. We need to know how many non-undef pieces of the build vector were filled in and ensure our use count is equal to that. But on the shuffle combine version we need the use count to be 2.
The missing coverage was noticed during the review of D40335.
Simon Pilgrim [Sat, 16 Dec 2017 23:32:18 +0000 (23:32 +0000)]
[X86][AVX] lowerVectorShuffleAsBroadcast - aggressively peek through BITCASTs
Assuming we can safely adjust the broadcast index for the new type to keep it suitably aligned, then peek through BITCASTs when looking for the broadcast source.
Sean Fertile [Sat, 16 Dec 2017 22:41:39 +0000 (22:41 +0000)]
[Memcpy Loop Lowering] Only calculate residual size/bytes copied when needed.
If the loop operand type is int8 then there will be no residual loop for the
unknown size expansion. Dont create the residual-size and bytes-copied values
when they are not needed.
Craig Topper [Sat, 16 Dec 2017 21:12:24 +0000 (21:12 +0000)]
[X86] Don't pass a zero input to the passthru operand of getVectorMaskingNode/getScalarMaskingNode when its going to emit an ISD::OR/ISD::AND. NFCI
In those cases, the pass thru operand of the methods isn't used. The calls to the scalar version were passing a MVT::i1 zero, which is an illegal type at the stage this code runs.
Craig Topper [Sat, 16 Dec 2017 19:31:36 +0000 (19:31 +0000)]
[X86] When using vpopcntdq for ctpop of v8i16 vectors, only promote to v8i32.
Previously we promoted to v8i64, but we don't need to go all the way to 512-bits. If we have VLX we can use the 256-bit instruction. And even if we don't have VLX we can widen v8i32 to v16i32 and drop the upper half.
Craig Topper [Sat, 16 Dec 2017 18:35:31 +0000 (18:35 +0000)]
[X86] Combine some more scheduler model entries using regular expressions.
We had a lot of separate 32 and 64 instructions that had the same scheduling data. This merges them into the same regular expression. This is pretty consistent with a lot of other instructions.
We want to do this for 2 reasons:
1. Value tracking does not recognize the ashr variant, so it would fail to match for cases like D39766.
2. DAGCombiner does better at producing optimal codegen when we have the cmp+sel pattern.
More detail about what happens in the backend:
1. DAGCombiner has a generic transform for all targets to convert the scalar cmp+sel variant of abs
into the shift variant. That is the opposite of this IR canonicalization.
2. DAGCombiner has a generic transform for all targets to convert the vector cmp+sel variant of abs
into either an ABS node or the shift variant. That is again the opposite of this IR canonicalization.
3. DAGCombiner has a generic transform for all targets to convert the exact shift variants produced by #1 or #2
into an ISD::ABS node. Note: It would be an efficiency improvement if we had #1 go directly to an ABS node
when that's legal/custom.
4. The pattern matching above is incomplete, so it is possible to escape the intended/optimal codegen in a
variety of ways.
a. For #2, the vector path is missing the case for setlt with a '1' constant.
b. For #3, we are missing a match for commuted versions of the shift variants.
5. Therefore, this IR canonicalization can only help get us to the optimal codegen. The version of cmp+sel
produced by this patch will be recognized in the DAG and converted to an ABS node when possible or the
shift sequence when not.
6. In the following examples with this patch applied, we may get conditional moves rather than the shift
produced by the generic DAGCombiner transforms. The conditional move is created using a target-specific
decision for any given target. Whether it is optimal or not for a particular subtarget may be up for debate.
define <4 x i32> @abs_shifty_vec(<4 x i32> %x) {
%signbit = ashr <4 x i32> %x, <i32 31, i32 31, i32 31, i32 31>
%add = add <4 x i32> %signbit, %x
%abs = xor <4 x i32> %signbit, %add
ret <4 x i32> %abs
}
define <4 x i32> @abs_cmpsubsel_vec(<4 x i32> %x) {
%cmp = icmp slt <4 x i32> %x, zeroinitializer
%sub = sub <4 x i32> zeroinitializer, %x
%abs = select <4 x i1> %cmp, <4 x i32> %sub, <4 x i32> %x
ret <4 x i32> %abs
}
Hal Finkel [Sat, 16 Dec 2017 02:55:24 +0000 (02:55 +0000)]
[LV] Extend InstWidening with CM_Widen_Recursive
Changes to the original scalar loop during LV code gen cause the return value
of Legal->isConsecutivePtr() to be inconsistent with the return value during
legal/cost phases (further analysis and information of the bug is in D39346).
This patch is an alternative fix to PR34965 following the CM_Widen approach
proposed by Ayal and Gil in D39346. It extends InstWidening enum with
CM_Widen_Reverse to properly record the widening decision for consecutive
reverse memory accesses and, consequently, get rid of the
Legal->isConsetuviePtr() call in LV code gen. I think this is a simpler/cleaner
solution to PR34965 than the one in D39346.
Vitaly Buka [Sat, 16 Dec 2017 02:10:00 +0000 (02:10 +0000)]
[LTO] Make processing of combined module more consistent
Summary:
1. Use stream 0 only for combined module. Previously if combined module was not
processes ThinLTO used the stream for own output. However small changes in input,
could trigger combined module and shuffle outputs making life of llvm::LTO harder.
2. Always process combined module and write output to stream 0. Processing empty
combined module is cheap and allows llvm::LTO users to avoid implementing processing
which is already done in llvm::LTO.
Hal Finkel [Sat, 16 Dec 2017 01:12:50 +0000 (01:12 +0000)]
[LV] NFC patch for moving VP*Recipe class definitions from LoopVectorize.cpp to VPlan.h
This is a small step forward to move VPlan stuff to where it should belong (i.e., VPlan.*):
1. VP*Recipe classes in LoopVectorize.cpp are moved to VPlan.h.
2. Many of VP*Recipe::print() and execute() definitions are still left in
LoopVectorize.cpp since they refer to things declared in LoopVectorize.cpp. To
be moved to VPlan.cpp at a later time.
3. InterleaveGroup class is moved from anonymous namespace to llvm namespace.
Referencing it in anonymous namespace from VPlan.h ended up in warning.
Teresa Johnson [Sat, 16 Dec 2017 00:18:12 +0000 (00:18 +0000)]
[ThinLTO] Enable importing of aliases as copy of aliasee
Summary:
This implements a missing feature to allow importing of aliases, which
was previously disabled because alias cannot be available_externally.
We instead import an alias as a copy of its aliasee.
Some additional work was required in the IndexBitcodeWriter for the
distributed build case, to ensure that the aliasee has a value id
in the distributed index file (i.e. even when it is not being
imported directly).
This is a performance win in codes that have many aliases, e.g. C++
applications that have many constructor and destructor aliases.