[X86][LLVM] Expand support for lowerInterleavedStore() in X86InterleavedAccess.
This patch expands the support of lowerInterleavedStore to 32x8i stride 4.
LLVM creates suboptimal shuffle code-gen for AVX2. Overall, this patch is a specific fix for the pattern (Stride=4, VF=32), and we plan to include more patterns in the future. To reach that goal of "more patterns", we include two mask creators. The first function creates a shuffle mask equivalent to the unpacklo/unpackhi instructions. The other creates a mask equivalent to a concatenation of two half vectors (high/low).
The patch's goal is to optimize the following sequence, at the end of which
ymm2, ymm0, ymm12 and ymm3 each hold 32 chars.
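As an illustration of the two mask creators, here is a minimal standalone sketch; the function names are ours, and it is simplified to ignore the per-128-bit-lane behavior of the real x86 unpack instructions:

#include <cstdio>
#include <vector>

// Mask equivalent to unpacklo (Lo = true) or unpackhi (Lo = false) on two
// NumElts-wide vectors: interleave the chosen halves element by element.
// Indices >= NumElts select from the second vector, as in shufflevector.
std::vector<int> unpackMask(int NumElts, bool Lo) {
  std::vector<int> Mask;
  int Base = Lo ? 0 : NumElts / 2;
  for (int i = 0; i < NumElts / 2; ++i) {
    Mask.push_back(Base + i);           // element from the first vector
    Mask.push_back(Base + i + NumElts); // matching element from the second
  }
  return Mask;
}

// Mask equivalent to concatenating the low (Lo = true) or high halves of
// two NumElts-wide vectors into one NumElts-wide result.
std::vector<int> concatHalfMask(int NumElts, bool Lo) {
  std::vector<int> Mask;
  int Base = Lo ? 0 : NumElts / 2;
  for (int i = 0; i < NumElts / 2; ++i)
    Mask.push_back(Base + i);           // half of the first vector
  for (int i = 0; i < NumElts / 2; ++i)
    Mask.push_back(Base + i + NumElts); // matching half of the second
  return Mask;
}

int main() {
  for (int M : unpackMask(8, true))
    std::printf("%d ", M); // prints: 0 8 1 9 2 10 3 11
  std::printf("\n");
  return 0;
}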
TargetLowering: Change isShuffleMaskLegal's mask argument type to ArrayRef<int>. NFCI.
Changing mask argument type from const SmallVectorImpl<int>& to
ArrayRef<int>.
This came up in D35700 where a mask is received as an ArrayRef<int> and
we want to pass it to TargetLowering::isShuffleMaskLegal().
Also saves a few lines of code.
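Simplified, and modulo the exact qualifiers in the tree at the time, the hook's signature changes like this:

// Before:
virtual bool isShuffleMaskLegal(const SmallVectorImpl<int> &Mask, EVT VT) const;
// After:
virtual bool isShuffleMaskLegal(ArrayRef<int> Mask, EVT VT) const;

An ArrayRef<int> binds to a SmallVector, a std::vector, or a plain array alike, which is what lets the D35700 call site pass its mask through directly.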
[X86][LLVM] Expand support for lowerInterleavedStore() in X86InterleavedAccess, part 1.
This splits patch D34601 into two parts. This part changes the location of two functions.
The second part will be based on this patch. This was requested by @RKSimon.
Max Kazantsev [Wed, 26 Jul 2017 04:55:54 +0000 (04:55 +0000)]
[SCEV] Cache results of computeExitLimit
This patch adds a cache for computeExitLimit to save compilation time. Examples of
tests that take extensive time to compile are attached to bug 33494.
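A minimal sketch of the memoization pattern involved (types and names are hypothetical stand-ins, not the patch's actual code): results are keyed on the (loop, exiting block) pair and reused on repeated queries.

#include <map>
#include <utility>

struct Loop;
struct BasicBlock;
struct ExitLimit { /* exact and max exit counts, etc. */ };

class ExitLimitCache {
  std::map<std::pair<const Loop *, const BasicBlock *>, ExitLimit> Cache;

public:
  ExitLimit get(const Loop *L, const BasicBlock *ExitingBB) {
    auto Key = std::make_pair(L, ExitingBB);
    auto It = Cache.find(Key);
    if (It != Cache.end())
      return It->second;                 // hit: skip the recomputation
    ExitLimit Result = computeExitLimitImpl(L, ExitingBB);
    Cache.emplace(Key, Result);          // miss: compute once, remember
    return Result;
  }

private:
  ExitLimit computeExitLimitImpl(const Loop *, const BasicBlock *) {
    return ExitLimit{};                  // stand-in for the real analysis
  }
};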
[X86] Prevent selecting masked aligned load instructions if the load should be non-temporal
Summary: The aligned load predicates don't suppress themselves if the load is non-temporal the way the unaligned predicates do. For the most part this isn't a problem, because the aligned predicates are mostly used for instructions that only load, and the non-temporal loads have priority over those. The exception is masked loads.
Sanjoy Das [Wed, 26 Jul 2017 01:32:19 +0000 (01:32 +0000)]
[SCEV] Remove unnecessary call to forgetMemoizedResults
`SCEVUnknown::allUsesReplacedWith` does not need to call `forgetMemoizedResults`
since RAUW does a value-equivalent replacement by assumption. If this
assumption was false then the later setValPtr(New) call would be incorrect too.
This is a non-trivial performance optimization for functions with a large number
of loops since `forgetMemoizedResults` walks all loop backedge taken counts to
see if any of them use the SCEVUnknown being RAUWed. However, this improvement
is difficult to demonstrate without checking in an excessively large IR file.
Eric Beckmann [Wed, 26 Jul 2017 01:21:55 +0000 (01:21 +0000)]
Move manifest utils into separate lib, to reduce libxml2 deps.
Summary:
Previously these were in Support. Since many things depend on Support,
they were all forced to also depend on libxml2, which we only want in a few cases.
This puts all the libxml2 deps in a separate lib to be used only in a few
places.
[DWARF] Generalized verification of the .apple_names accelerator table to be applicable to any accelerator table. Added verification for the .apple_types, .apple_namespaces and .apple_objc sections.
[PDB] Improve GSI hash table dumping for publics and globals
The PDB "symbol stream" actually contains symbol records for the publics
and the globals stream. The globals and publics streams are essentially
hash tables that point into a single stream of records. In order to
match cvdump's behavior, we need to only dump symbol records referenced
from the hash table. This patch implements that, and then implements
global stream dumping, since it's just a subset of public stream
dumping.
Now we shouldn't see S_PROCREF or S_GDATA32 records when dumping
publics; instead we should see those records in the globals stream.
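The strategy, as a sketch with hypothetical types: collect the record offsets that the hash table's buckets actually reference, then dump only those records from the symbol stream.

#include <cstdint>
#include <set>
#include <vector>

// One bucket entry in the globals/publics hash stream; it refers to a
// symbol record by its offset into the shared symbol record stream.
struct PSHashRecord {
  uint32_t SymOffset;
};

// The sorted, deduplicated set of record offsets referenced by the hash
// table is what gets dumped, matching cvdump's behavior.
std::vector<uint32_t>
referencedOffsets(const std::vector<PSHashRecord> &HashRecords) {
  std::set<uint32_t> Offsets;
  for (const PSHashRecord &HR : HashRecords)
    Offsets.insert(HR.SymOffset);
  return std::vector<uint32_t>(Offsets.begin(), Offsets.end());
}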
Wei Mi [Tue, 25 Jul 2017 23:37:17 +0000 (23:37 +0000)]
Disable loop unswitching for some patterns containing equality comparison with undef.
This is a workaround for the bug described in PR31652 and
http://lists.llvm.org/pipermail/llvm-dev/2017-July/115497.html. The temporary
solution is to add a function EqualityPropUnSafe. In EqualityPropUnSafe, for
some simple patterns we can know that the equality comparison may contain undef,
so we regard such comparisons as unsafe and do not perform loop-unswitching on
them. We also need to disable the select simplification when one of the select
operands is undef and its result feeds into an equality comparison.
The patch cannot fully eliminate the safety issue caused by the bug, but it
suppresses the issue to some extent.
Michal Gorny [Tue, 25 Jul 2017 22:38:31 +0000 (22:38 +0000)]
[lit] Fix UnboundLocalError for invalid shtest redirects
Replace the incorrect variable reference used when an invalid redirect is encountered.
This fixes the following issue:
File "/usr/src/llvm/utils/lit/lit/TestRunner.py", line 316, in processRedirects
raise InternalShellError(cmd, "Unsupported redirect: %r" % (r,))
UnboundLocalError: local variable 'r' referenced before assignment
which in turn broke shtest-shell.py and max-failures.py lit tests.
The breakage was introduced during refactoring in rL307310.
Petr Hosek [Tue, 25 Jul 2017 22:38:08 +0000 (22:38 +0000)]
Reland "[LLVM][llvm-objcopy] Added basic plumbing to get things started"
As discussed on llvm-dev I've implemented the first basic steps towards
llvm-objcopy/llvm-objtool (name pending).
This change adds the ability to copy (without modification) 64-bit
little endian ELF executables that have SHT_PROGBITS, SHT_NOBITS,
SHT_NULL and SHT_STRTAB sections.
[libFuzzer] don't disable msan for TracePC::CollectFeatures: this started to cause false positives in msan. No tests for libFuzzer+msan yet -- tests will need to wait until we move libFuzzer to compiler-rt
Petr Hosek [Tue, 25 Jul 2017 21:16:33 +0000 (21:16 +0000)]
Reland "[LLVM][llvm-objcopy] Added basic plumbing to get things started"
As discussed on llvm-dev I've implemented the first basic steps towards
llvm-objcopy/llvm-objtool (name pending).
This change adds the ability to copy (without modification) 64-bit
little endian ELF executables that have SHT_PROGBITS, SHT_NOBITS,
SHT_NULL and SHT_STRTAB sections.
Teresa Johnson [Tue, 25 Jul 2017 19:42:32 +0000 (19:42 +0000)]
[LTO] Prevent dead stripping and internalization of symbols with sections
Summary:
ELF linkers generate __start_<secname> and __stop_<secname> symbols
when there is a value in a section <secname> where the name is a valid
C identifier. If dead stripping determines that the values declared
in section <secname> are dead, and we then internalize (and delete)
such a symbol, programs that reference the corresponding start and end
section symbols will get undefined reference linking errors.
To fix this, add the section name to the IRSymtab entry when a symbol is
defined in a specific section. Then use this in the gold-plugin to mark
the symbol as external and visible from outside the summary when the
section name is a valid C identifier.
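For reference, this is the idiom being protected, as a standalone example for ELF targets with GNU-style linkers (the section name here is made up):

#include <cstdio>

// Values placed in section "mysec"; because "mysec" is a valid C
// identifier, the ELF linker synthesizes __start_mysec/__stop_mysec.
__attribute__((section("mysec"))) int A = 1;
__attribute__((section("mysec"))) int B = 2;

extern int __start_mysec; // defined by the linker, not by any object file
extern int __stop_mysec;

int main() {
  // If LTO internalized and dead-stripped A and B, these references to
  // the start/stop symbols would become undefined at link time.
  for (int *P = &__start_mysec; P != &__stop_mysec; ++P)
    std::printf("%d\n", *P);
  return 0;
}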
Simon Pilgrim [Tue, 25 Jul 2017 17:04:37 +0000 (17:04 +0000)]
[X86][CGP] Reduce memcmp() expansion to 2 load pairs (PR33914)
D35067/rL308322 attempted to support up to 4 load pairs for memcmp inlining which resulted in regressions for some optimized libc memcmp implementations (PR33914).
Until we can match these more optimal cases, this patch reduces the memcmp expansion to a maximum of 2 load pairs (which matches what we do for -Os).
This patch should be considered for the 5.0.0 release branch as well.
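For illustration, an equality-only 16-byte memcmp expanded with a 2-load-pair budget has roughly this shape (a sketch of the expansion, not the actual generated code):

#include <cstdint>
#include <cstring>

// Roughly what memcmp(A, B, 16) == 0 becomes: two 8-byte loads from each
// buffer and a fused compare, with no libcall.
bool equal16(const void *A, const void *B) {
  uint64_t A0, A1, B0, B1;
  std::memcpy(&A0, A, 8);                                // first load pair
  std::memcpy(&B0, B, 8);
  std::memcpy(&A1, static_cast<const char *>(A) + 8, 8); // second load pair
  std::memcpy(&B1, static_cast<const char *>(B) + 8, 8);
  return ((A0 ^ B0) | (A1 ^ B1)) == 0;                   // OR the XORs, test once
}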
Simon Pilgrim [Tue, 25 Jul 2017 16:36:44 +0000 (16:36 +0000)]
[DAG] Move DAGCombiner::GetDemandedBits to SelectionDAG
This patch moves the DAGCombiner::GetDemandedBits function to SelectionDAG::GetDemandedBits as a first step towards making it easier for targets to get to the source of any demanded bits without the limitations of SimplifyDemandedBits.
[Sparc] invalid adjustments in TLS_LE/TLS_LDO relocations removed
Summary:
Some SPARC TLS relocations were applying nontrivial adjustments
to a zero value, leading to unexpected non-zero values in the ELF output and then
to Solaris linker failures.
[LIR] Teach LIR to avoid extending the BE count prior to adding one to
it when safe.
Very often the BE count is the trip count minus one, and the plus one
here should fold with that minus one. But because the BE count might in
theory be UINT_MAX or some such, adding one before we extend could in
some cases wrap to zero and break when we scale things.
This patch checks to see whether it is safe to add one first, because the
specific case that would cause a wrap is guarded against prior to entering the
preheader. This should handle essentially all of the common loop idioms
coming out of C/C++ code once canonicalized by LLVM.
Before this patch, both forms of loop in the added test cases ended up
subtracting one from the size, extending it, scaling it up by 8, and then
adding 8 back onto it. This is really silly, and it turns out this made it
all the way into generated code very often, so this is a surprisingly
important cleanup to do.
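To make the wrap hazard concrete, here is a small standalone illustration (function names are ours; the scale factor of 8 matches the example above) of extend-then-add versus add-then-extend:

#include <cassert>
#include <cstdint>

// Extend first: correct for any backedge-taken count, but the +1 can no
// longer fold with the "trip count minus one" that produced bec.
uint64_t byteSizeExtendFirst(uint32_t bec) {
  return (static_cast<uint64_t>(bec) + 1) * 8;
}

// Add first: the +1 folds away against the -1, but bec + 1 wraps to 0
// when bec == UINT32_MAX. The patch proves this case is impossible
// because it is already guarded against before the preheader.
uint64_t byteSizeAddFirst(uint32_t bec) {
  assert(bec != UINT32_MAX && "excluded by the loop guard");
  return static_cast<uint64_t>(bec + 1) * 8;
}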
Many thanks to Sanjoy for showing me how to do this with SCEV.
Sam Parker [Tue, 25 Jul 2017 08:51:30 +0000 (08:51 +0000)]
[ARM] Enable partial and runtime unrolling
Enable runtime and partial loop unrolling of simple loops without
calls on M-class cores. The thresholds are calculated based on
whether the target is Thumb or Thumb-2.
Martin Storsjo [Tue, 25 Jul 2017 05:20:01 +0000 (05:20 +0000)]
[AArch64] Reserve a 16 byte aligned amount of fixed stack for win64 varargs
Create a dummy 8 byte fixed object for the unused slot below the first
stored vararg.
Alternative ideas tested but skipped: One could try to align the whole
fixed object to 16, but I haven't found how to add an offset to the stack
frame used in LowerWin64_VASTART.
If only the size of the fixed stack object is padded but not the offset, via
MFI.CreateFixedObject(alignTo(GPRSaveSize, 16), -(int)GPRSaveSize, false),
PrologEpilogInserter crashes due to "Attempted to reset backwards range!".
This fixes misconceptions about where registers are spilled, since
AArch64FrameLowering.cpp assumes the offset from fixed objects is
aligned to 16 bytes (and the Win64 case there already manually aligns
the offset to 16 bytes).
This fixes cases where local stack allocations could overwrite callee
saved registers on the stack.
Marek Sokolowski [Tue, 25 Jul 2017 00:25:18 +0000 (00:25 +0000)]
Add an empty shell of llvm-rc.
This starts development on one of the MS Visual Studio binutils, the
Resource Converter. The tool compiles resource scripts (.rc)
into binary resource files (.res).
The current implementation does nothing but parse the command
line arguments. It is going to be extended in the future.
Kevin Enderby [Mon, 24 Jul 2017 20:33:41 +0000 (20:33 +0000)]
Small tweak to one check in the error handling for the dyld compact export
entries in libObject (done in r308690). In the case when the last node
has no children, setting State.Current = Children + 1, where that would be past
Trie.end(), is actually OK, since the pointer is not used when there are zero children.
Tom Stellard [Mon, 24 Jul 2017 19:28:30 +0000 (19:28 +0000)]
test-release.sh: Fix phase2 and phase3 binary comparison
Summary:
scudo_utils.cpp.o from compiler-rt has one of the host compiler's builtin
include paths stored in the .debug_line section. So we need to run
sed 's,Phase1,Phase2,g' on the Phase2 object file so it matches Phase3.
Leo Li [Mon, 24 Jul 2017 17:26:28 +0000 (17:26 +0000)]
[CMake] Remove redundant logic in runtimes/CMakeList.txt
Summary:
`SUB_CHECK_TARGETS` contains all test targets in `SUB_COMPONENTS` when
we load `Components.cmake`. We don't need to add those targets
again, and having duplicate targets violates the CMake policy CMP0002.
Benjamin Kramer [Mon, 24 Jul 2017 16:18:09 +0000 (16:18 +0000)]
[CodeGenPrepare] Cut off FindAllMemoryUses if there are too many uses.
This avoids excessive compile time. The case I'm looking at is
Function.cpp from an old version of LLVM that still had the giant memcmp
string matcher in it. Before r308322 this compiled in about 2 minutes,
after it, clang takes infinite* time to compile it. With this patch
we're at 5 min, which is still bad but this is a pathological case.
The cut-off at 20 uses was chosen by looking at other cut-offs in LLVM
for user scanning. It's probably too high, but does the job and is very
unlikely to regress anything.
Fixes PR33900.
* I'm impatient and aborted after 15 minutes, on the bug report it was
killed after 2h.
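The shape of the fix, as a sketch with hypothetical types: the recursive scan of uses gives up once it has visited more than the threshold, and the caller treats the result as inconclusive instead of scanning everything.

#include <vector>

struct Value {
  std::vector<const Value *> Users;
};

// Visit the transitive users of V, but bail out once more than Limit
// users have been seen, rather than burning compile time on huge use
// lists. With the patch's threshold, Limit would be 20.
bool scanUses(const Value *V, unsigned &Seen, unsigned Limit) {
  for (const Value *U : V->Users) {
    if (++Seen > Limit)
      return false; // too many uses: give up conservatively
    if (!scanUses(U, Seen, Limit))
      return false;
  }
  return true;
}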
Propagates the GraphT template argument to the default value of
the AnalysisGraphTraitsT template argument. This allows specializing
DefaultAnalysisGraphTraits<AnalysisT,GraphT> for an analysis with a
graph type different from the analysis type, and it will automatically
get picked up.
Note: This was probably the intended purpose and should not result in any
functional change.
Chad Rosier [Sun, 23 Jul 2017 16:38:08 +0000 (16:38 +0000)]
[AArch64] Redundant Copy Elimination - remove more zero copies.
This patch removes unnecessary zero copies in BBs that are targets of b.eq/b.ne
when we know the result of the compare instruction is zero. For example,
Max Kazantsev [Sun, 23 Jul 2017 15:40:19 +0000 (15:40 +0000)]
[SCEV] Limit max size of AddRecExpr during evolving
When SCEV calculates product of two SCEVAddRecs from the same loop, it
tries to combine them into one big AddRecExpr. If the sizes of the initial
SCEVs were `S1` and `S2`, the size of their product is `S1 + S2 - 1`; every
operand of the resulting SCEV is combined from operands of the initial SCEVs and
has much higher complexity than they do.
As a result, if we try to calculate something like:
%x1 = {a,+,b}
%x2 = mul i32 %x1, %x1
%x3 = mul i32 %x2, %x1
%x4 = mul i32 %x3, %x2
...
The size of such SCEVs grows as `2^N` (here %x2 has size 3, %x3 size 4,
%x4 size 6, and so on), and the operands become more and more complex
as we go. This leads to long compilation times and huge memory consumption.
This patch sets a limit after which we don't try to combine two
`SCEVAddRecExpr`s into one. By default, max allowed size of the
resulting AddRecExpr is set to 16.
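The guard amounts to the following check (names are hypothetical; 16 is the default limit mentioned above):

// The product of AddRecExprs with S1 and S2 operands has S1 + S2 - 1
// operands, so refuse to combine once that would exceed the limit.
static const unsigned MaxAddRecSize = 16;

bool worthCombining(unsigned S1, unsigned S2) {
  return S1 + S2 - 1 <= MaxAddRecSize;
}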
[Modules] Rework r274270. Let Clang targets depend on intrinsics_gen.
This gets rid of almost all LLVM targets unconditionally depending on intrinsics_gen.
Clang's modules still have weird dependencies, and it is hard to remove intrinsics_gen in a better way.
For now, it is better to have all Clang targets depend on intrinsics_gen.
[X86] Add patterns for memory forms of SARX/SHLX/SHRX with careful complexity adjustment to keep shift by immediate using the legacy instructions.
These patterns were previously omitted in order to favor using the legacy instructions when the shift is a constant. With careful adjustment of the pattern complexity we can make sure the immediate instructions still have priority over these patterns.