(also not sure about target triple/object file type for this tool - do I
really need a whole triple just to write an object file that contains
purely static/hardcoded bytes in each section? & I guess I should just
pick it based on the first input, maybe, rather than hardcoding for now
- but we only produce .dwo on ELF platforms with objcopy for now anyway)
Extend debug info for function parameters in SDAG.
SDAG currently can emit debug location for function parameters when
an llvm.dbg.declare points to either a function argument SSA temp,
or to an AllocaInst. This change extends this logic by adding a
fallback case when neither of the above is true.
This is required for SafeStack, which may copy the contents of a
byval function argument into something that is not an alloca, and
then describe the target as the new location of the said argument.
Cong Hou [Tue, 1 Dec 2015 00:02:51 +0000 (00:02 +0000)]
Replace all weight-based interfaces in MBB with probability-based interfaces, and update all uses of old interfaces.
The patch in http://reviews.llvm.org/D13745 is broken into four parts:
1. New interfaces without functional changes (http://reviews.llvm.org/D13908).
2. Use new interfaces in SelectionDAG, while in other passes treat probabilities
as weights (http://reviews.llvm.org/D14361).
3. Use new interfaces in all other passes.
4. Remove old interfaces.
This patch is 3+4 above. In this patch, MBB won't provide weight-based
interfaces any more, which are totally replaced by probability-based ones.
The interface addSuccessor() is redesigned so that the default probability is
unknown. We allow unknown probabilities but don't allow using it together
with known probabilities in successor list. That is to say, we either have a
list of successors with all known probabilities, or all unknown
probabilities. In the latter case, we assume each successor has 1/N
probability where N is the number of successors. An assertion checks if the
user is attempting to add a successor with the disallowed mixed use as stated
above. This can help us catch many misuses.
All uses of weight-based interfaces are now updated to use probability-based
ones.
Simon Pilgrim [Mon, 30 Nov 2015 22:22:06 +0000 (22:22 +0000)]
[X86][FMA4] Prefer FMA4 to FMA
We currently output FMA instructions on targets which support both FMA4 + FMA (i.e. later Bulldozer CPUS bdver2/bdver3/bdver4).
This patch flips this so FMA4 is preferred; this is for several reasons:
1 - FMA4 is non-destructive reducing the need for mov instructions.
2 - Its more straighforward to commute and fold inputs (although the recent work on FMA has reduced this difference).
3 - All supported targets have FMA4 performance equal or better to FMA - Piledriver (bdver2) in particular has half the throughput when executing FMA instructions.
Its looks like no future AMD processor lines will support FMA4 after the Bulldozer series so we're not causing problems for later CPUs.
Rafael Espindola [Mon, 30 Nov 2015 22:01:43 +0000 (22:01 +0000)]
Start deciding earlier what to link.
A traditional linker is roughly split in symbol resolution and "copying
stuff".
The two tasks are badly mixed in lib/Linker.
This starts splitting them apart.
With this patch there are no direct call to linkGlobalValueBody or
linkGlobalValueProto. Everything is linked via WapValue.
This also includes a few fixes:
* A GV goes undefined if the comdat is dropped (comdat11.ll).
* We error if an internal GV goes undefined (comdat13.ll).
* We don't link an unused comdat.
The first two match the behavior of an ELF linker. The second one is
equivalent to running globaldce on the input.
Matt Arsenault [Mon, 30 Nov 2015 21:16:03 +0000 (21:16 +0000)]
AMDGPU: Rework how private buffer passed for HSA
If we know we have stack objects, we reserve the registers
that the private buffer resource and wave offset are passed
and use them directly.
If not, reserve the last 5 SGPRs just in case we need to spill.
After register allocation, try to pick the next available registers
instead of the last SGPRs, and then insert copies from the inputs
to the reserved registers in the progloue.
This also only selectively enables all of the input registers
which are really required instead of always enabling them.
Matt Arsenault [Mon, 30 Nov 2015 21:15:53 +0000 (21:15 +0000)]
AMDGPU: Remove SIPrepareScratchRegs
It does not work because of emergency stack slots.
This pass was supposed to eliminate dummy registers for the
spill instructions, but the register scavenger can introduce
more during PrologEpilogInserter, so some would end up
left behind if they were needed.
The potential for spilling the scratch resource descriptor
and offset register makes doing something like this
overly complicated. Reserve registers to use for the resource
descriptor and use them directly in eliminateFrameIndex.
Also removes creating another scratch resource descriptor
when directly selecting scratch MUBUF instructions.
The choice of which registers are reserved is temporary.
For now it attempts to pick the next available registers
after the user and system SGPRs.
David Majnemer [Mon, 30 Nov 2015 19:04:19 +0000 (19:04 +0000)]
[X86] Add RIP to GR64_TCW64
The MachineVerifier wants to check that the register operands of an
instruction belong to the instruction's register class. RIP-relative
control flow instructions violated this by referencing RIP. While this
was fixed for SysV, it was never fixed for Win64.
Zoran Jovanovic [Mon, 30 Nov 2015 12:56:18 +0000 (12:56 +0000)]
[mips][microMIPS] Fix issue with offset operand of BALC and BC instructions
Value of offset operand for microMIPS BALC and BC instructions is currently shifted 2 bits, but it should be 1 bit.
Differential Revision: http://reviews.llvm.org/D14770
Daniel Sanders [Mon, 30 Nov 2015 09:52:00 +0000 (09:52 +0000)]
[mips][ias] Removed MSA instructions from base architecture valid-xfail.s's.
valid-xfail.s is for instructions that should be valid in the given ISA but
incorrectly fail. MSA instructions are correct to fail since MSA is not enabled.
Craig Topper [Mon, 30 Nov 2015 00:13:24 +0000 (00:13 +0000)]
[AVX512] The vpermi2 instructions require an integer vector for the index vector. This is reflected correctly in the intrinsics, but was not refelected in the isel patterns.
For the floating point types, this requires adding a bitcast to the index vector when its passed through to the output.
This one is enabled only under -ffast-math. There are cases where the
difference between the value computed and the correct value is huge
even for ffast-math, e.g. as Steven pointed out:
x = -1, y = -4
log(pow(-1), 4) = 0
4*log(-1) = NaN
I checked what GCC does and apparently they do the same optimization
(which result in the dramatic difference). Future work might try to
make this (slightly) less worse.
Diego Novillo [Sun, 29 Nov 2015 18:23:26 +0000 (18:23 +0000)]
SamplePGO - Do not use std::to_string in diagnostics.
This fixes buildbots in systems that std::to_string is not present. It
also tidies the output of the diagnostic to render doubles a bit better
(thanks Ben Kramer for help with string streams and format).
Simon Pilgrim [Sun, 29 Nov 2015 16:41:04 +0000 (16:41 +0000)]
[X86][SSE] Added support for lowering to ADDSUBPS/ADDSUBPD with commuted inputs
We could already recognise shuffle(FSUB, FADD) -> ADDSUB, this allow us to recognise shuffle(FADD, FSUB) -> ADDSUB by commuting the shuffle mask prior to matching.
[PGO] Move value profile format related structures and APIs to common file
This is the last step to enable profile runtime to share the same value prof
data format and reader/writer code with llvm host tools. The VP related
data structures are moved to a section in InstrProfData.inc enabled with macro
INSTR_PROF_VALUE_PROF_DATA, and common API implementations are enabled with
INSTR_PROF_COMMON_API_IMPL. There should be no functional change.
Jonas Paulsson [Sat, 28 Nov 2015 11:02:32 +0000 (11:02 +0000)]
[Stack realignment] Handling of aligned allocas.
This patch implements dynamic realignment of stack objects for targets
with a non-realigned stack pointer. Behaviour in FunctionLoweringInfo
is changed so that for a target that has StackRealignable set to
false, over-aligned static allocas are considered to be variable-sized
objects and are handled with DYNAMIC_STACKALLOC nodes.
It would be good to group aligned allocas into a single big alloca as
an optimization, but this is yet todo.
SystemZ benefits from this, due to its stack frame layout.
New tests SystemZ/alloca-03.ll for aligned allocas, and
SystemZ/alloca-04.ll for "no-realign-stack" attribute on functions.
Review and help from Ulrich Weigand and Hal Finkel.
[PGO] Allow value profile writer interface to allocated target buffer
Raw profile writer needs to write all data of one kind in one continuous block,
so the buffer needs to be pre-allocated and passed to the writer method in
pieces for function profile data. The change adds the support for raw value data
writing.
Diego Novillo [Fri, 27 Nov 2015 23:14:51 +0000 (23:14 +0000)]
SamplePGO - Add initial support for inliner annotations.
This adds two thresholds to the sample profiler to affect inlining
decisions: the concept of global hotness and coldness.
Functions that have accumulated more than a certain fraction of samples at
runtime, are annotated with the InlineHint attribute. Conversely,
functions that accumulate less than a certain fraction of samples, are
annotated with the Cold attribute.
This is very similar to the hints emitted by Clang when using
instrumentation profiles.
Notice that this is a very blunt instrument. A function may have
globally collected a significant fraction of samples, but that does not
necessarily mean that every callsite for that function is hot.
Ideally, we would annotate each callsite with the samples collected at
that callsite. This way, the inliner can incorporate all these weights
into its cost model.
Once the inliner offers this functionality, we can change the hints
emitted here to a more precise per-callsite annotation. For now, this is
providing some measure of speedups with our internal benchmarks. I've
observed speedups of up to 23% (though the geo mean is about 3%). I expect
these numbers to improve as the inliner gets better annotations.
Diego Novillo [Fri, 27 Nov 2015 23:14:49 +0000 (23:14 +0000)]
SamplePGO - Fix default threshold for hot callsites.
Based on testing of internal benchmarks, I'm lowering this threshold to
a value of 0.1%. This means that SamplePGO will respect 99.9% of the
original inline decisions when following a profile.
The performance difference is noticeable in some tests. With the
previous threshold, the speedups over baseline -O2 was about 0.63%. With
the new default, the speedups are around 3% on average.
The point of this threshold is not to do more aggressive inlining. When
an inlined callsite crosses this threshold, SamplePGO will redo the
inline decision so that it can better apply the input profile.
By respecting most original inline decisions, we can apply more of the
input profile because the shape of the code follows the profile more
closely.
In the next series, I'll be looking at adding some inline hints for the
cold callsites and for toplevel functions that are hot/cold as well.
Rafael Espindola [Fri, 27 Nov 2015 20:28:19 +0000 (20:28 +0000)]
Simplify the linking of recursive data.
Now the ValueMapper has two callbacks. The first one maps the
declaration. The ValueMapper records the mapping and then materializes
the body/initializer.