This one is enabled only under -ffast-math. There are cases where the
difference between the value computed and the correct value is huge
even for ffast-math, e.g. as Steven pointed out:
x = -1, y = -4
log(pow(-1), 4) = 0
4*log(-1) = NaN
I checked what GCC does and apparently they do the same optimization
(which result in the dramatic difference). Future work might try to
make this (slightly) less worse.
Diego Novillo [Sun, 29 Nov 2015 18:23:26 +0000 (18:23 +0000)]
SamplePGO - Do not use std::to_string in diagnostics.
This fixes buildbots in systems that std::to_string is not present. It
also tidies the output of the diagnostic to render doubles a bit better
(thanks Ben Kramer for help with string streams and format).
Simon Pilgrim [Sun, 29 Nov 2015 16:41:04 +0000 (16:41 +0000)]
[X86][SSE] Added support for lowering to ADDSUBPS/ADDSUBPD with commuted inputs
We could already recognise shuffle(FSUB, FADD) -> ADDSUB, this allow us to recognise shuffle(FADD, FSUB) -> ADDSUB by commuting the shuffle mask prior to matching.
[PGO] Move value profile format related structures and APIs to common file
This is the last step to enable profile runtime to share the same value prof
data format and reader/writer code with llvm host tools. The VP related
data structures are moved to a section in InstrProfData.inc enabled with macro
INSTR_PROF_VALUE_PROF_DATA, and common API implementations are enabled with
INSTR_PROF_COMMON_API_IMPL. There should be no functional change.
Jonas Paulsson [Sat, 28 Nov 2015 11:02:32 +0000 (11:02 +0000)]
[Stack realignment] Handling of aligned allocas.
This patch implements dynamic realignment of stack objects for targets
with a non-realigned stack pointer. Behaviour in FunctionLoweringInfo
is changed so that for a target that has StackRealignable set to
false, over-aligned static allocas are considered to be variable-sized
objects and are handled with DYNAMIC_STACKALLOC nodes.
It would be good to group aligned allocas into a single big alloca as
an optimization, but this is yet todo.
SystemZ benefits from this, due to its stack frame layout.
New tests SystemZ/alloca-03.ll for aligned allocas, and
SystemZ/alloca-04.ll for "no-realign-stack" attribute on functions.
Review and help from Ulrich Weigand and Hal Finkel.
[PGO] Allow value profile writer interface to allocated target buffer
Raw profile writer needs to write all data of one kind in one continuous block,
so the buffer needs to be pre-allocated and passed to the writer method in
pieces for function profile data. The change adds the support for raw value data
writing.
Diego Novillo [Fri, 27 Nov 2015 23:14:51 +0000 (23:14 +0000)]
SamplePGO - Add initial support for inliner annotations.
This adds two thresholds to the sample profiler to affect inlining
decisions: the concept of global hotness and coldness.
Functions that have accumulated more than a certain fraction of samples at
runtime, are annotated with the InlineHint attribute. Conversely,
functions that accumulate less than a certain fraction of samples, are
annotated with the Cold attribute.
This is very similar to the hints emitted by Clang when using
instrumentation profiles.
Notice that this is a very blunt instrument. A function may have
globally collected a significant fraction of samples, but that does not
necessarily mean that every callsite for that function is hot.
Ideally, we would annotate each callsite with the samples collected at
that callsite. This way, the inliner can incorporate all these weights
into its cost model.
Once the inliner offers this functionality, we can change the hints
emitted here to a more precise per-callsite annotation. For now, this is
providing some measure of speedups with our internal benchmarks. I've
observed speedups of up to 23% (though the geo mean is about 3%). I expect
these numbers to improve as the inliner gets better annotations.
Diego Novillo [Fri, 27 Nov 2015 23:14:49 +0000 (23:14 +0000)]
SamplePGO - Fix default threshold for hot callsites.
Based on testing of internal benchmarks, I'm lowering this threshold to
a value of 0.1%. This means that SamplePGO will respect 99.9% of the
original inline decisions when following a profile.
The performance difference is noticeable in some tests. With the
previous threshold, the speedups over baseline -O2 was about 0.63%. With
the new default, the speedups are around 3% on average.
The point of this threshold is not to do more aggressive inlining. When
an inlined callsite crosses this threshold, SamplePGO will redo the
inline decision so that it can better apply the input profile.
By respecting most original inline decisions, we can apply more of the
input profile because the shape of the code follows the profile more
closely.
In the next series, I'll be looking at adding some inline hints for the
cold callsites and for toplevel functions that are hot/cold as well.
Rafael Espindola [Fri, 27 Nov 2015 20:28:19 +0000 (20:28 +0000)]
Simplify the linking of recursive data.
Now the ValueMapper has two callbacks. The first one maps the
declaration. The ValueMapper records the mapping and then materializes
the body/initializer.
Artyom Skrobov [Fri, 27 Nov 2015 15:30:51 +0000 (15:30 +0000)]
[ARM] Generate ABI_optimization_goals build attribute, as described in the ARM ARM.
Summary:
Since this build attribute corresponds to a whole module, and
different functions in a module may differ in the optimizations
enabled for them, this attribute is emitted after all functions,
and only in the case that the optimization goals for all
functions match.
Oliver Stannard [Fri, 27 Nov 2015 13:04:48 +0000 (13:04 +0000)]
[AArch64] Add ARMv8.2-A FP16 scalar instructions
ARMv8.2-A adds 16-bit floating point versions of all existing VFP
floating-point instructions. This is an optional extension, so all of
these instructions require the FeatureFullFP16 subtarget feature.
Most of these instructions are the same as the 32- and 64-bit versions,
but with the type field (bits 23-22) set to 0b11. Previously the top bit
of the size field was always 0, so the instruction classes only provided
a 1-bit size field, which I have widened to 2 bits.
This patch changes the DFSan instrumentation for aarch64 to instead
of using fixes application mask defined by SANITIZER_AARCH64_VMA
to read the application shadow mask value from compiler-rt. The value
is initialized based on runtime VAM detection.
Along with this patch a compiler-rt one will also be added to export
the shadow mask variable.
Craig Topper [Fri, 27 Nov 2015 05:44:04 +0000 (05:44 +0000)]
[TableGen] Sort pattern predicates before concatenating into a string so that different orders of the same set will produce the same string. This can reduce the number of unique predicates in the isel tables. NFC
Andrew Wilkins [Fri, 27 Nov 2015 04:44:51 +0000 (04:44 +0000)]
Use $GO_EXECUTABLE in Go-based lit tests
Summary:
When running tests, pass the GO_EXECUTABLE CMake
cache variable to llvm-go. The "go" binary may
not be in $PATH, or may be different to the one
passed to CMake.
MC: Simplify handling of temporary symbols in COFF writer.
The COFF object writer was previously adding unnecessary symbols to its
temporary data structures and cleaning them up later. This made the code
harder to understand and caused a bug (aliases classed as temporary symbols
would cause an assertion failure). A much simpler way of handling such
symbols is to ask the layout for their section-relative position when needed.
Tested with a bootstrap on Windows and by building Chrome.
Charlie Turner [Thu, 26 Nov 2015 20:39:51 +0000 (20:39 +0000)]
[LoopVectorize] Use MapVector rather than DenseMap for MinBWs.
The order in which instructions are truncated in truncateToMinimalBitwidths
effects code generation. Switch to a map with a determinisic order, since the
iteration order over a DenseMap is not defined.
This code is not hot, so the difference in container performance isn't
interesting.
Many thanks to David Blaikie for making me aware of MapVector!
Rafael Espindola [Thu, 26 Nov 2015 19:53:12 +0000 (19:53 +0000)]
Add a few passing lto tests.
I found these while trying to get a prototype to bootstrap.
They cover things like
* Handling of non linker visible stuff (append, available_externally)
* Type merging
* Alias to dropped globals
* Dropping linkage when converting to a declaration.
These should hopefully be generally useful for anyone refactoring the
plugin.
Hal Finkel [Thu, 26 Nov 2015 19:23:49 +0000 (19:23 +0000)]
[bugpoint] Fix "Alias must point to a definition" problems
GlobalAliases may reference function definitions, but not function declarations.
bugpoint would sometimes create invalid IR by deleting a function's body (thus
mutating a function definition into a declaration) without first 'fixing' any
GlobalAliases that reference that function definition.
This change iteratively prevents that issue. Before deleting a function's body,
it scans the module for GlobalAliases which reference that function. When
found, it eliminates them using replaceAllUsesWith.
Rafael Espindola [Thu, 26 Nov 2015 19:22:59 +0000 (19:22 +0000)]
Disallow aliases to available_externally.
They are as much trouble as aliases to declarations. They are requiring
the code generator to define a symbol with the same value as another
symbol, but the second symbol is undefined.
If representing this is important for some optimization, we could add
support for available_externally aliases. They would be *required* to
point to a declaration (or available_externally definition).
Daniel Sanders [Thu, 26 Nov 2015 16:35:41 +0000 (16:35 +0000)]
[mips][ias] Range check uimm5 operands and fix several bugs this revealed.
Summary:
The bugs were:
* append, prepend, and balign were not tested
* balign takes a uimm2 not a uimm5.
* drotr32 was correctly implemented with a uimm5 but the tests expected
'52' to be valid.
* li/la were implemented with a uimm5 instead of simm32. simm32 isn't
completely correct either but I'll fix that when I get to simm32.
A notable omission are some of the shift instructions. Several of these
have been implemented using a single uimm6 instruction (rather than two
uimm5 instructions and a CodeGen-only uimm6 pseudo). These will be updated
in the uimm6 patch.
Oliver Stannard [Thu, 26 Nov 2015 15:34:44 +0000 (15:34 +0000)]
[AArch64] Add ARMv8.2-A new AT instruction variants
ARMv8.2-A adds new variants of the "at" (address translate) system
instruction, which take the PSTATE.PAN bit (added in ARMv8.1-A). These
are a required part of ARMv8.2-A, so no additional subtarget features
are required.
Oliver Stannard [Thu, 26 Nov 2015 15:32:30 +0000 (15:32 +0000)]
[AArch64] Add ARMv8.2-A UAO PSTATE bit
ARMv8.2-A adds a new PSTATE bit, PSTATE.UAO, which allows the LDTR/STTR
instructions to behave the same as LDR/STR with respect to execute-only
pages at higher privilege levels. New variants of the MSR/MRS
instructions are added to allow reading and writing this bit. It is a
required part of ARMv8.2-A, so no additional subtarget features are
required.
ARMv8.2-A adds the "dc cvap" instruction, which is a system instruction
that cleans caches to the point of persistence (for systems that have
persistent memory). It is a required part of ARMv8.2-A, so no additional
subtarget features are required.
Oliver Stannard [Thu, 26 Nov 2015 15:26:10 +0000 (15:26 +0000)]
[AArch64] Add ARMv8.2-A ID_A64MMFR2_EL1 register
ARMv8.2-A adds a new ID register, ID_A64MMFR2_EL1, which behaves in the
same way as ID_A64MMFR0_EL1 and ID_A64MMFR1_EL1. It is a required part
of ARMv8.2-A, so no additional subtarget features are required.
Oliver Stannard [Thu, 26 Nov 2015 15:23:32 +0000 (15:23 +0000)]
[AArch64] Add subtarget features for ARMv8.2-A
This adds subtarget features for ARMv8.2-A, which builds on (and
requires the features from) ARMv8.1-A. Most assembler-visible features
of ARMv8.2-A are system instructions, and are all required parts of the
architecture, so just depend on the HasV8_2aOps subtarget feature. There
is also one large, optional feature, which adds 16-bit floating point
versions of all existing floating-point instructions (VFP and SIMD),
this is represented by the FeatureFullFP16 subtarget feature.
Daniel Sanders [Thu, 26 Nov 2015 11:23:03 +0000 (11:23 +0000)]
[mips][ias] Explicitly disable IAS on tests that depend on not assembling.
Summary:
no-odd-spreg-msa.ll: This test deliberately uses an odd-numbered register
in inline assembly and expects the compiler to insert a move to an
even-numbered register.
inlineasm-operand-code.ll and inlineasm_constraint.ll:
Checks for IAS's output will be added once a matcher bug is resolved. This bug
causes the canonical output emitted by IAS to be incorrect for uimm16 constants
with the MSB set. We will still need the non-IAS checks at this point since
these tests primarily test formatting of operands.
X86-FMA3: Improved/enabled the memory folding optimization for scalar loads
generated for _mm_losd_s{s,d}() intrinsics and used in scalar FMAs generated
for FMA intrinsics _mm_f{madd,msub,nmadd,nmsub}_s{s,d}().
Reviewer: David Kreitzer
Differential Revision: http://reviews.llvm.org/D14762
Teach LLVM optimize to more precisely in the presence of "deopt" operand
bundles. "deopt" operand bundles imply that the call they're attached
to is at least `readonly` (i.e. they don't imply clobber semantics), and
they don't capture their bundle operands.
[PGO] Implement ValueProfiling Closure interfaces for runtime value profile data
This is one of the many steps to commonize value profiling support between profile
runtime and compiler/llvm tools.
After this change, profiler runtime now can share the same C APIs to do VP
serialization/deseriazation with LLVM host tools (and produces value data
in identical format between indexed and raw profile).
It is not yet enabled in profiler runtime yet.
Also added a unit test case to test runtime profile data serialization/deserialization
interfaces implemented using common closure code.
Richard Diamond [Wed, 25 Nov 2015 22:49:48 +0000 (22:49 +0000)]
Fix a use-after-free in `llvm-config`.
Summary:
This could happen if `GetComponentNames` is true, because `Name` from
`VisitComponent` would reference a stack instance of `std::string` in
`ComputeLibsForComponents`.