Recommit [MachineCombiner] Update instruction depths incrementally for large BBs.
This version of the patch fixes an off-by-one error causing PR34596. We
do not need to use std::next(BlockIter) when calling updateDepths, as
BlockIter already points to the next element.
Original commit message:
> For large basic blocks with lots of combinable instructions, the
> MachineTraceMetrics computations in MachineCombiner can dominate the compile
> time, as computing the trace information is quadratic in the number of
> instructions in a BB and it's relevant successors/predecessors.
> In most cases, knowing the instruction depth should be enough to make
> combination decisions. As we already iterate over all instructions in a basic
> block, the instruction depth can be computed incrementally. This reduces the
> cost of machine-combine drastically in cases where lots of instructions
> are combined. The major drawback is that AFAIK, computing the critical path
> length cannot be done incrementally. Therefore we only compute
> instruction depths incrementally, for basic blocks with more
> instructions than inc_threshold. The -machine-combiner-inc-threshold
> option can be used to set the threshold and allows for easier
> experimenting and checking if using incremental updates for all basic
> blocks has any impact on the performance.
>
> Reviewers: sanjoy, Gerolf, MatzeB, efriedma, fhahn
>
> Reviewed By: fhahn
>
> Subscribers: kiranchandramohan, javed.absar, efriedma, llvm-commits
>
> Differential Revision: https://reviews.llvm.org/D36619
Mohammad Shahid [Wed, 20 Sep 2017 08:18:28 +0000 (08:18 +0000)]
[SLP] Vectorize jumbled memory loads.
Summary:
This patch tries to vectorize loads of consecutive memory accesses, accessed
in non-consecutive or jumbled way. An earlier attempt was made with patch D26905
which was reverted back due to some basic issue with representing the 'use mask' of
jumbled accesses.
This patch fixes the mask representation by recording the 'use mask' in the usertree entry.
[X86] Remove isel checks for immediate size on floating point compare and xop compare instructions. NFCI
If these checks fail we end up not selecting an instruction at all. So we are already relying on the immediate being checked upstream of isel. So doing the check in isel is just bloat to the isel table. Interestingly, we didn't check on the AVX512 version of the instructions anyway.
Sanjoy Das [Wed, 20 Sep 2017 02:31:57 +0000 (02:31 +0000)]
Tighten the invariants around LoopBase::invalidate
Summary:
With this change:
- Methods in LoopBase trip an assert if the receiver has been invalidated
- LoopBase::clear frees up the memory held the LoopBase instance
This change also shuffles things around as necessary to work with this stricter invariant.
Many svn-based buildbots seem to be getting stuck continually
in tree conflicts due to the output of pyc files. I'm disabling
these as a temporary measure in an attempt to get everything
stable again.
I'll try to remove this code once I understand the problem
better.
[MIRPrinter] Print empty successor lists when they cannot be guessed
This re-applies commit r313685, this time with the proper updates to
the test cases.
Original commit message:
Unreachable blocks in the machine instr representation are these
weird empty blocks with no successors.
The MIR printer used to not print empty lists of successors. However,
the MIR parser now treats non-printed list of successors as "please
guess it for me". As a result, the parser tries to guess the list of
successors and given the block is empty, just assumes it falls through
the next block (if any).
For instance, the following test case used to fail the verifier.
The MIR printer would print
entry
/ \
true (def) false (no list of successors)
|
split.true (use)
Sanjoy Das [Tue, 19 Sep 2017 23:19:00 +0000 (23:19 +0000)]
[LoopInfo] Make LoopBase and Loop destructors non-public
Summary:
See comment for why I think this is a good idea.
This change also:
- Removes an SCEV test case. The SCEV test was not testing anything useful (most of it was `#if 0` ed out) and it would need to be updated to deal with a private ~Loop::Loop.
- Updates the loop pass manager test case to deal with a private ~Loop::Loop.
- Renames markAsRemoved to markAsErased to contrast with removeLoop, via the usual remove vs. erase idiom we already have for instructions and basic blocks.
Adam Nemet [Tue, 19 Sep 2017 23:00:55 +0000 (23:00 +0000)]
Allow ORE.emit to take a closure to delay building the remark object
In the lambda we are now returning the remark by value so we need to preserve
its type in the insertion operator. This requires making the insertion
operator generic.
I've also converted a few cases to use the new API. It seems to work pretty
well. See the LoopUnroller for a slightly more interesting case.
Summary: Introduces the llvm-cfi-verify tool to llvm. Includes the design document (docs/CFIVerify.rst). Current implementation of the tool is simply a disassembler that identifies and prints the indirect control flow instructions.
[MIRPrinter] Print empty successor lists when they cannot be guessed
Unreachable blocks in the machine instr representation are these
weird empty blocks with no successors.
The MIR printer used to not print empty lists of successors. However,
the MIR parser now treats non-printed list of successors as "please
guess it for me". As a result, the parser tries to guess the list of
successors and given the block is empty, just assumes it falls through
the next block (if any).
For instance, the following test case used to fail the verifier.
The MIR printer would print
entry
/ \
true (def) false (no list of successors)
|
split.true (use)
The MIR parser would understand this:
entry
/ \
true (def) false
| / <-- invalid edge
split.true (use)
Because of the invalid edge, we get the "def does not
dominate all uses" error.
The fix consists in printing empty successor lists, so that the parser
knows what to do for unreachable blocks.
Reland "[llvm-objcopy] Add support for nested and overlapping segments"
I didn't initialize a pointer to be nullptr that I needed to.
This change adds support for nested and even overlapping segments. This means
that PT_PHDR, PT_GNU_RELRO, PT_TLS, and PT_DYNAMIC can be supported properly.
Jonathan Roelofs [Tue, 19 Sep 2017 21:23:19 +0000 (21:23 +0000)]
[ARM] Relax 'cpsie'/'cpsid' flag parsing.
The ARM docs suggest in examples that the flags can have either case, and there
are applications in the wild that (libopencm3, for example) that expect to be
able to use the uppercase spelling.
Revert "[DebugInfo] Insert DW_OP_deref when spilling indirect DBG_VALUEs"
This reverts r313640, originally r313400, one more time for essentially
the same issue. My BitVector of spilled location numbers isn't working
because we coalesce identical DBG_VALUE locations as we rewrite them,
invalidating the location numbers used to index the BitVector.
Import all inlined indirect call targets for SamplePGO.
Summary: In the ThinLTO compilation, if a function is inlined in the profiling binary, we need to inline it before annotation. If the callee is not available in the primary module, a first step is needed to import that callee function. For the current implementation, if the call is an indirect call, which has been promoted to >1 targets and inlined, SamplePGO will only import one target with the largest sample count. This patch fixed the bug to import all targets instead.
[TableGen] Generate formatted DAGISelEmitter without relying on formatted_raw_ostream.
The generated DAG isel file currently makes use of formatted_raw_ostream primarily for generating a hierarchical representation while also skipping over the initial comment that contains the current index.
It was reported in D37957 that this formatting might be slow due to the need to keep track of column numbers by monitoring all the written data for new lines.
This patch attempts to rewrite the emitter to make use of simpler formatting mechanisms to generate a fairly similar output. The main difference is that the number in the index comment is now right justified and padded with spaces inside the comment. Previously we appended the spaces after the comment.
This reverts commit SVN r313654. Seems that it is triggering an
assertion on Windows specifically. Revert until I can build on Windows
and look into what is happening there.
[llvm-objcopy] Add support for .dynamic, .dynsym, and .dynstr
This change adds support for sections involved in dynamic loading such
as SHT_DYNAMIC, SHT_DYNSYM, and allocated string tables.
The two added binaries used for tests can be downloaded [[
https://drive.google.com/file/d/0B3gtIAmiMwZXOXE3T0RobFg4ZTg/view?usp=sharing
| here ]] and [[
https://drive.google.com/file/d/0B3gtIAmiMwZXTFJSQUJZMGxNSXc/view?usp=sharing
| here ]]
David Blaikie [Tue, 19 Sep 2017 19:20:08 +0000 (19:20 +0000)]
Fix test to not depend on another subdirectories Input directory
Inputs should be placed local to the test (or possibly in a common
parent? I think we do that in some places - but the only common parent
between these two directories is 'test' which seems a bit overly broad).
Recommit r313647 now that GCC seems to accept the offering
Add some member types to MachineValueTypeSet::const_iterator so that
iterator_traits can work with it.
Improve TableGen performance of -gen-dag-isel (motivated by X86 backend)
The introduction of parameterized register classes in r313271 caused the
matcher generation code in TableGen to run much slower, particularly so
in the unoptimized (debug) build. This patch recovers some of the lost
performance.
Summary of changes:
- Cache the set of legal types in TypeInfer::getLegalTypes. The contents
of this set do not change.
- Add LLVM_ATTRIBUTE_ALWAYS_INLINE to several small functions. Normally
this would not be necessary, but in the debug build TableGen is not
optimized, so this helps a little bit.
- Add an early exit from TypeSetByHwMode::operator== for the case when
one or both arguments are "simple", i.e. only have one mode. This
saves some time in GenerateVariants.
- Finally, replace the underlying storage type in TypeSetByHwMode::SetType
with MachineValueTypeSet based on std::array instead of std::set.
This significantly reduces the number of memory allocation calls.
I've done a number of experiments with the underlying type of InfoByHwMode.
The type is a map, and for targets that do not use the parameterization,
this map has only one entry. The best (unoptimized) performance, somewhat
surprisingly came from std::map, followed closely by std::unordered_map.
DenseMap was the slowest by a large margin.
Various hand-crafted solutions (emulating enough of the map interface
not to make sweeping changes to the users) did not yield any observable
improvements.
David Blaikie [Tue, 19 Sep 2017 18:36:11 +0000 (18:36 +0000)]
dwarfdump/symbolizer: Avoid loading unneeded CUs from a DWP
When symbolizing large binaries, parsing every CU in a DWP file is a
significant performance penalty. Instead, use the index to only load the
CUs that are needed.
Summary: Fix the bug when promoted call return type mismatches with the promoted function, we should not try to inline it. Otherwise it may lead to compiler crash.
[llvm-objcopy] Add support for nested and overlapping segments
This change adds support for nested and even overlapping segments. This means
that PT_PHDR, PT_GNU_RELRO, PT_TLS, and PT_DYNAMIC can be supported properly.
Add support for the R_AARCH64_ABS{16,32} relocations in the execution
engine. This is primarily used for DWARF debug information relocations
and needed by the LLVM JIT to support JITing for lldb.
Improve TableGen performance of -gen-dag-isel (motivated by X86 backend)
The introduction of parameterized register classes in r313271 caused the
matcher generation code in TableGen to run much slower, particularly so
in the unoptimized (debug) build. This patch recovers some of the lost
performance.
Summary of changes:
- Cache the set of legal types in TypeInfer::getLegalTypes. The contents
of this set do not change.
- Add LLVM_ATTRIBUTE_ALWAYS_INLINE to several small functions. Normally
this would not be necessary, but in the debug build TableGen is not
optimized, so this helps a little bit.
- Add an early exit from TypeSetByHwMode::operator== for the case when
one or both arguments are "simple", i.e. only have one mode. This
saves some time in GenerateVariants.
- Finally, replace the underlying storage type in TypeSetByHwMode::SetType
with MachineValueTypeSet based on std::array instead of std::set.
This significantly reduces the number of memory allocation calls.
I've done a number of experiments with the underlying type of InfoByHwMode.
The type is a map, and for targets that do not use the parameterization,
this map has only one entry. The best (unoptimized) performance, somewhat
surprisingly came from std::map, followed closely by std::unordered_map.
DenseMap was the slowest by a large margin.
Various hand-crafted solutions (emulating enough of the map interface
not to make sweeping changes to the users) did not yield any observable
improvements.
Resubmit "Fix llvm-lit script generation in libcxx."
After speaking with the libcxx owners, they agreed that this is
a bug in the bot that needs to be fixed by the bot owners, and
the CMake changes are correct.
Re-land r313400 "[DebugInfo] Insert DW_OP_deref when spilling indirect DBG_VALUEs"
I forgot to zero out the BitVector when reusing it between UserValues.
Later uses of the same location number for a different UserValue would
falsely indicate that they were spilled. Usually this would lead to
incorrect debug info, but in some cases they would indicate something
nonsensical like a memory location based on a vector register (Q8 on
ARM).
Tony Jiang [Tue, 19 Sep 2017 16:14:37 +0000 (16:14 +0000)]
[PowerPC Peephole] Constants into a join add, use ADDI over LI/ADD.
Two blocks prior to the join each perform an li and the the join block has an
add using the initialized register. Optimize each predecessor block to instead
use addi and delete the li's and add.
David Blaikie [Tue, 19 Sep 2017 15:13:55 +0000 (15:13 +0000)]
dwarfdump: Delay parsing abbreviations until they're needed
This speeds up dumping specific DIEs by not parsing abbreviations for
units that are not used.
(this is also handy to have in eventually to speed up llvm-symbolizer
for .dwp files, where parsing most of the DWP file can be avoided by
using the index)
Nikolai Bozhenov [Tue, 19 Sep 2017 11:54:29 +0000 (11:54 +0000)]
[Nios2] Subtarget, basic infrastructure for frame, instructions and registers
This is the second minimal patch keeping Nios2 target buildable.
I'm adding subtarget here and other stuff for frame lowering, instruction,
register information methods. I do not add any test cases, as still there
are missing parts like DAG selector and assembly printing. I plan to include
them into the next patch.
Patch by Andrei Grischenko <andrei.l.grischenko@intel.com>
This change:
- makes nodes ISD::ADDCARRY and ISD::SUBCARRY legal for i32
- lowering is done by first converting the boolean value into the carry flag
using (_, C) ← (ARMISD::ADDC R, -1) and converted back to an integer value
using (R, _) ← (ARMISD::ADDE 0, 0, C). An ARMISD::ADDE between the two
operations does the actual addition.
- for subtraction, given that ISD::SUBCARRY second result is actually a
borrow, we need to invert the value of the second operand and result before
and after using ARMISD::SUBE. We need to invert the carry result of
ARMISD::SUBE to preserve the semantics.
- given that the generic combiner may lower ISD::ADDCARRY and
ISD::SUBCARRYinto ISD::UADDO and ISD::USUBO we need to update their lowering
as well otherwise i64 operations now would require branches. This implies
updating the corresponding test for unsigned.
- add new combiner to remove the redundant conversions from/to carry flags
to/from boolean values (ARMISD::ADDC (ARMISD::ADDE 0, 0, C), -1) → C
- fixes PR34045
- fixes PR34564
Gadi Haber [Tue, 19 Sep 2017 06:19:27 +0000 (06:19 +0000)]
[X86][Skylake] Adding the scheduling information for the SkylakeClient target
This patch adds the instruction scheduling information for the SkylakeClient (SKL) architecture target by adding the file X86SchedSkylakeClient.td located under the X86 Target.
We used the scheduling information retrieved from the Skylake architects in order to create the file.
The scheduling information includes latency, number of micro-Ops and used ports by each SKL instruction.
The patch continues the scheduling replacement and insertion effort started with the SNB target in r307529 and r310792 and for HSW in r311879.
Please expect some performance fluctuations due to code alignment effects.
There is a bot that is checking out libcxx and lit with nothing
else and then running lit.py against the test tree. Since there's
no LLVM source tree, there's no LLVM CMake. CMake actually
reports this as a warning saying unsupported libcxx configuration,
but I guess someone is depending on it anyway.
[Coverage] Use gap regions to select better line exec counts
After clang started emitting deferred regions (r312818), llvm-cov has
had a hard time picking reasonable line execuction counts. There have
been one or two generic improvements in this area (e.g r310012), but
line counts can still report coverage for whitespace instead of code
(llvm.org/PR34612).
To fix the problem:
* Introduce a new region kind so that frontends can explicitly label
gap areas.
This is done by changing the encoding of the columnEnd field of
MappingRegion. This doesn't substantially increase binary size, and
makes it easy to maintain backwards-compatibility.
* Don't set the line count to a count from a gap area, unless the count
comes from a wrapped segment.
Since the path a user specifies to the llvm-lit script might be
different than the source tree they built from (since they could
be behind different symlinks), we need to use realpath to make
sure that path comparisons work as expected.
Even better would be to use a custom dictionary comparison with
actual file equivalence comparison semantics, but this is the
least friction to unbreak things for now.
Hans Wennborg [Mon, 18 Sep 2017 23:08:42 +0000 (23:08 +0000)]
Revert r313400 "[DebugInfo] Insert DW_OP_deref when spilling indirect DBG_VALUEs"
This caused asserts in Chromium. See http://crbug.com/766261
> Summary:
> This comes up in optimized debug info for C++ programs that pass and
> return objects indirectly by address. In these programs,
> llvm.dbg.declare survives optimization, which causes us to emit indirect
> DBG_VALUE instructions. The fast register allocator knows to insert
> DW_OP_deref when spilling indirect DBG_VALUE instructions, but the
> LiveDebugVariables did not until this change.
>
> This fixes part of PR34513. I need to look into why this doesn't work at
> -O0 and I'll send follow up patches to handle that.
>
> Reviewers: aprantl, dblaikie, probinson
>
> Subscribers: qcolombet, hiraditya, llvm-commits
>
> Differential Revision: https://reviews.llvm.org/D37911