LowerTypeTests: Implement exporting of type identifiers.
Type identifiers are exported by:
- Adding coarse-grained information about how to test the type
identifier to the summary.
- Creating symbols in the object file (aliases and absolute symbols)
containing fine-grained information about the type identifier.
Dehao Chen [Thu, 19 Jan 2017 00:44:11 +0000 (00:44 +0000)]
Add -debug-info-for-profiling to emit more debug info for sample pgo profile collection
Summary:
SamplePGO binaries built with -gmlt to collect profile. The current -gmlt debug info is limited, and we need some additional info:
* start line of all subprograms
* linkage name of all subprograms
* standalone subprograms (functions that has neither inlined nor been inlined)
This patch adds these information to the -gmlt binary. The impact on speccpu2006 binary size (size increase comparing with -g0 binary, also includes data for -g binary, which does not change with this patch):
Eli Friedman [Wed, 18 Jan 2017 23:56:42 +0000 (23:56 +0000)]
[SCEV] Make getUDivExactExpr handle non-nuw multiplies correctly.
To avoid regressions, make ScalarEvolution::createSCEV a bit more
clever.
Also get rid of some useless code in ScalarEvolution::howFarToZero
which was hiding this bug.
No new testcase because it's impossible to actually expose this bug:
we don't have any in-tree users of getUDivExactExpr besides the two
functions I just mentioned, and they both dodged the problem. I'll
try to add some interesting users in a followup.
Eli Friedman [Wed, 18 Jan 2017 23:26:37 +0000 (23:26 +0000)]
Preserve domtree and loop-simplify for runtime unrolling.
Mostly straightforward changes; we just didn't do the computation before.
One sort of interesting change in LoopUnroll.cpp: we weren't handling
dominance for children of the loop latch correctly, but
foldBlockIntoPredecessor hid the problem for complete unrolling.
Currently punting on loop peeling; made some minor changes to isolate
that problem to LoopUnrollPeel.cpp.
Adds a flag -unroll-verify-domtree; it verifies the domtree immediately
after we finish updating it. This is on by default for +Asserts builds.
Mehdi Amini [Wed, 18 Jan 2017 21:37:11 +0000 (21:37 +0000)]
Improve the `-filter-print-funcs` option to skip the banner for CGSCC pass when nothing is to be printed
Before, it would print a sequence of:
*** IR Dump After Function Integration/Inlining ******
*** IR Dump After Function Integration/Inlining ******
*** IR Dump After Function Integration/Inlining ******
...
[LV] Allow reductions that have several uses outside the loop
We currently check whether a reduction has a single outside user. We don't
really need to require that - we just need to make sure a single value is
used externally. The number of external users of that value shouldn't actually
matter.
Justin Bogner [Wed, 18 Jan 2017 19:01:58 +0000 (19:01 +0000)]
cmake: Only sanitize use-after-scope if the host compiler supports it
In r292256, we started adding -fsanitize-use-after-scope when using
the address sanitizer, but that flag wasn't always available. This
fixes the config to only add the flag if the host compiler supports
it.
Evandro Menezes [Wed, 18 Jan 2017 18:57:08 +0000 (18:57 +0000)]
[AArch64] Generate literals by the little end
ARM seems to prefer that long literals be formed from their little end in
order to promote the fusion of the instrs pairs MOV/MOVK and MOVK/MOVK on
Cortex A57 and others (v. "Cortex A57 Software Optimisation Guide", section
4.14).
Mehdi Amini [Wed, 18 Jan 2017 18:36:21 +0000 (18:36 +0000)]
[ThinLTO] Add a recursive step in Metadata lazy-loading
Summary:
Without this, we're stressing the RAUW of unique nodes,
which is a costly operation. This is intended to limit
the number of RAUW, and is very effective on the total
link-time of opt with ThinLTO, before:
Graydon Hoare [Wed, 18 Jan 2017 18:12:20 +0000 (18:12 +0000)]
[lit] Support sharding testsuites, for parallel execution.
Summary:
This change equips lit.py with two new options, --num-shards=M and
--run-shard=N (set by default from env vars LIT_NUM_SHARDS and LIT_RUN_SHARD).
The options must be used together, and N must be in 1..M.
Together these options effect only test selection: they partition the testsuite
into M equal-sized "shards", then select only the Nth shard. They can be used
in a cluster of test machines to achieve a very crude (static) form of
parallelism, with minimal configuration work.
[AMDGPU] Do not allow register coalescer to create big superregs
Limit register coalescer by not allowing it to artificially increase
size of registers beyond dword. Such super-registers are in fact
register sequences and not distinct HW registers.
With more super-regs we would need to allocate adjacent registers
and constraint regalloc more than needed. Moreover, our super
registers are overlapping. For instance we have VGPR0_VGPR1_VGPR2,
VGPR1_VGPR2_VGPR3, VGPR2_VGPR3_VGPR4 etc, which complicates registers
allocation even more, resulting in excessive spilling.
Teresa Johnson [Wed, 18 Jan 2017 16:58:43 +0000 (16:58 +0000)]
Don't create a comdat group for a dropped def with initializer
Non-prevailing weak/linkonce odr symbols will be dropped by ThinLTO to
available_externally when possible. If they had an initializer in the
global_ctors list, a comdat group was being created. This code
already had logic to skip available_externally defs, but now the
EliminateAvailableExternally pass will drop these symbols to
declarations earlier. Change the check to skip all declarations for
linker (which includes available_externally along with declarations).
Sam Parker [Wed, 18 Jan 2017 15:52:11 +0000 (15:52 +0000)]
[ARM] Create SubtargetFeatures from build attrs
An ELFObjectFile can now create SubtargetFeatures from the available
ARM build attributes, in a similar manner to MIPS. I've moved the
MIPS code into its own function and the ARM handler also has a
separate function.
Florian Hahn [Wed, 18 Jan 2017 15:01:22 +0000 (15:01 +0000)]
[thumb,framelowering] Reset NoVRegs in Thumb1FrameLowering::emitPrologue.
Summary:
In this function, virtual registers can be introduced (for example
through calls to emitThumbRegPlusImmInReg). doScavengeFrameVirtualRegs
will replace those virtual registers with concrete registers later on
in PrologEpilogInserter, which sets NoVRegs again.
This patch fixes the Codegen/Thumb/segmented-stacks.ll test case which
failed with expensive checks.
https://llvm.org/bugs/show_bug.cgi?id=27484
Daniel Sanders [Wed, 18 Jan 2017 14:17:50 +0000 (14:17 +0000)]
Re-commit: [globalisel] Tablegen-erate current Register Bank Information
Summary:
Adds a RegisterBank tablegen class that can be used to declare the register
banks and an associated tablegen pass to generate the necessary code.
Changes since last commit:
The new tablegen pass is now correctly guarded by LLVM_BUILD_GLOBAL_ISEL and
this should fix the buildbots however it may not be the whole fix. The previous
buildbot failures suggest there may be a memory bug lurking that I'm unable to
reproduce (including when using asan) or spot in the source. If they re-occur
on this commit then I'll need assistance from the bot owners to track it down.
Sam Parker [Wed, 18 Jan 2017 13:52:12 +0000 (13:52 +0000)]
[ARM] Create objdump subtarget from build attrs
Enable an ELFObjectFile to read the its arm build attributes to
produce a target triple with a specific ARM architecture.
llvm-objdump now uses this functionality to automatically produce
a more accurate target.
As discussed on D28777 - we don't need to handle 'all element' shuffles inside InstCombiner::visitCallInst as InstCombiner::SimplifyDemandedVectorElts will do everything we need.
[X86] Improve mul combine for negative multiplayer (2^c - 1)
This patch improves the mul instruction combine function (combineMul)
by adding new layer of logic.
In this patch, we are adding the ability to fold (mul x, -((1 << c) -1))
or (mul x, -((1 << c) +1)) into (neg(X << c) -x) or (neg((x << c) + x) respective.
Marina Yatsina [Wed, 18 Jan 2017 08:07:51 +0000 (08:07 +0000)]
[X86] Fix for bugzilla 31576 - add support for "data32" instruction prefix
This patch fixes bugzilla 31576 (https://llvm.org/bugs/show_bug.cgi?id=31576).
"data32" instruction prefix was not defined in the llvm.
An exception had to be added to the X86 tablegen and AsmPrinter because both "data16" and "data32" are encoded to 0x66 (but in different modes).
Eric Fiselier [Wed, 18 Jan 2017 00:12:41 +0000 (00:12 +0000)]
[LIT] Make util.executeCommand python3 friendly
Summary: The parameter `input` to `subprocess.Popen.communicate(...)` must be an object of type `bytes` . This is strictly enforced in python3. This patch (1) allows `to_bytes` to be safely called redundantly. (2) Explicitly convert `input` within `executeCommand`. This allows for usages like `executeCommand(['clang++', '-'], input='int main() {}\n')`.
Justin Lebar [Wed, 18 Jan 2017 00:09:01 +0000 (00:09 +0000)]
[NVPTX] Implement min/max in tablegen, rather than with custom DAGComine logic.
Summary:
This change also lets us use max.{s,u}16. There's a vague warning in a
test about this maybe being less efficient, but I could not come up with
a case where the resulting SASS (sm_35 or sm_60) was different with or
without max.{s,u}16. It's true that nvcc seems to emit only
max.{s,u}32, but even ptxas 7.0 seems to have no problem generating
efficient SASS from max.{s,u}16 (the casts up to i32 and back down to
i16 seem to be implicit and nops, happening via register aliasing).
In the absence of evidence, better to have fewer special cases, emit
more straightforward code, etc. In particular, if a new GPU has 16-bit
min/max instructions, we want to be able to use them.
Justin Lebar [Wed, 18 Jan 2017 00:08:27 +0000 (00:08 +0000)]
[NVPTX] Improve lowering of llvm.ctpop.
Summary:
Avoid an unnecessary conversion operation when using the result of
ctpop.i32 or ctpop.i16 as an i32, as in both cases the ptx instruction
we run returns an i32.
(Previously if we used the value as an i32, we'd do an unnecessary
zext+trunc.)
Justin Lebar [Wed, 18 Jan 2017 00:07:35 +0000 (00:07 +0000)]
[NVPTX] Improve lowering of llvm.ctlz.
Summary:
* Disable "ctlz speculation", which inserts a branch on every ctlz(x) which
has defined behavior on x == 0 to check whether x is, in fact zero.
* Add DAG patterns that avoid re-truncating or re-expanding the result
of the 16- and 64-bit ctz instructions.
The patch is to solve the performance problem described in PR27827.
Register coalescing sometimes cannot remove a copy because of interference.
But if we can find a reverse copy in one of the predecessor block of the copy,
the copy is partially redundent and we may remove the copy partially by moving
it to the predecessor block without the reverse copy.
Tim Northover [Tue, 17 Jan 2017 22:30:10 +0000 (22:30 +0000)]
GlobalISel: correctly handle varargs
Some platforms (notably iOS) use a different calling convention for unnamed vs
named parameters in varargs functions, so we need to keep track of this
information when translating calls.
Since not many platforms are involved, the guts of the special handling is in
the ValueHandler class (with a generic implementation that should work for most
targets).
Chandler Carruth [Tue, 17 Jan 2017 22:28:52 +0000 (22:28 +0000)]
[LoopDeletion] (cleanup, NFC) Use the dedicated helper to get a single
unique exit block if available rather than rolling it ourselves.
This is a little disappointing because that helper doesn't do anything
clever to short-circuit the (surprisingly expensive) computation of all
exit blocks. What's worse is that the way we compute this is hopelessly,
hilariously inefficient. We're literally computing the same information
two different ways and multiple times each way:
- hasDedicatedExits computes the exit block set and then looks at the
predecessors of each
- getExitingBlocks computes the set of loop blocks which have exiting
successors
- getUniqueExitBlock(s) computes the set of non-loop blocks reached from
loop blocks (sound familiar?)
Anyways, at some point we should clean all of this up in the LoopInfo
API, but for now just simplifying the user I'm about to touch.
Tim Northover [Tue, 17 Jan 2017 22:13:50 +0000 (22:13 +0000)]
[GlobalISel] track predecessor mapping during switch lowering.
Correctly populating Machine PHIs relies on knowing exactly how the IR level
CFG was lowered to MachineIR. This needs to be tracked by any translation
phases that meddle (currently only SwitchInst handling).
This reapplies r291973 which was reverted because of testing failures. Fixes:
+ Don't return an ArrayRef to a local temporary.
+ Incorporate Kristof's suggested comment improvements.
Chandler Carruth [Tue, 17 Jan 2017 22:00:52 +0000 (22:00 +0000)]
[LoopDeletion] (cleanup, NFC) Stop passing around reference to a vector
that we know has exactly one element when all we are going to do is get
that one element out of it.
Instead, pass around that one element.
There are more simplifications to come in this code...
Chandler Carruth [Tue, 17 Jan 2017 21:51:39 +0000 (21:51 +0000)]
[PM] Clean up variable and parameter names to match modern LLVM naming
conventions more conistently before hacking on this code to integrate
nicely with new PM's loop pass infrastructure. NFC.
Aaron Ballman [Tue, 17 Jan 2017 21:48:31 +0000 (21:48 +0000)]
Silence some Sphinx diagnostics in an attempt to get the documentation builder back to green (http://lab.llvm.org:8011/builders/llvm-sphinx-docs/builds/1895).