[ValueTracking] Prevent a call to computeKnownBits if we already know the state of the bit we would calculate. Also reuse a temporary APInt instead of creating a new one.
Anna Thomas [Thu, 13 Apr 2017 18:59:25 +0000 (18:59 +0000)]
[LV] Fix the vector code generation for first order recurrence
Summary:
In first-order recurrences where phis are used outside the loop,
we should generate an additional vector.extract of the second-to-last element
from the vectorized phi update.
This is because we require the phi itself (which is the value at the
second-to-last iteration of the vector loop) and not the phi's update within
the loop.
Also fix the code generation when we just unroll, but don't vectorize.
Fixes PR32396.
[InstCombine] fold X == 0 || X == -1 to one compare (PR32524)
This is effectively a retry of:
https://reviews.llvm.org/rL299851
but now we have tests and an assert to make sure the bug
that was exposed with that attempt will not happen again.
I'll fix the code duplication and missing sibling fold next,
but I want to make this change as small as possible to reduce
risk since I messed it up last time.
This should fix:
https://bugs.llvm.org/show_bug.cgi?id=32524
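The underlying identity is easy to check standalone. Here is a minimal C++ sketch (not the InstCombine code itself, just the arithmetic fact it relies on): for unsigned X, `X == 0 || X == -1` holds exactly when `X + 1 u< 2`.
```
#include <cassert>
#include <cstdint>

int main() {
  // Exhaustively check the identity on 16-bit values:
  // (X == 0 || X == all-ones) iff (X + 1) u< 2.
  // Only X == 0 (X + 1 == 1) and X == 0xFFFF (X + 1 == 0) land below 2.
  for (uint32_t V = 0; V <= 0xFFFF; ++V) {
    uint16_t X = (uint16_t)V;
    bool Orig = (X == 0 || X == 0xFFFF);
    bool Folded = (uint16_t)(X + 1) < 2;
    assert(Orig == Folded);
  }
  return 0;
}
```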
[ArgPromotion] Don't drop !prof metadata on promoted calls
Noticed by inspection while doing attribute work. DAE, InstCombineCalls,
and ArgPromotion have a fair amount of duplicated code for hacking on
call sites, and you can find bugs by comparing them.
[InstCombine] use similar ops for related folds; NFCI
It's less efficient to produce 'ule' than 'ult' since we know we're going to
canonicalize to 'ult', but we shouldn't have duplicated code for these folds.
As a trade-off, this was a pretty terrible way to make a '2'. :)
  if (LHSC == SubOne(RHSC))
    AddC = ConstantExpr::getSub(AddOne(RHSC), LHSC);
(With LHSC == RHSC - 1, this computes (RHSC + 1) - (RHSC - 1), i.e. 2.)
The next steps are to share the code to fix PR32524 and add the missing 'and'
fold that was left out when PR14708 was fixed:
https://bugs.llvm.org/show_bug.cgi?id=14708
Brian Gesiak [Thu, 13 Apr 2017 16:44:25 +0000 (16:44 +0000)]
[Analysis] Support bitreverse in -demanded-bits pass
Summary:
* Add a bitreverse case in the demanded bits analysis pass.
* Add tests for the bitreverse (and bswap) intrinsic in the
demanded bits pass.
* Add a test case to the BDCE tests verifying that manipulations of
high-order bits are eliminated once the bits are reversed and then
right-shifted.
LTO: Pass SF_Executable flag through to InputFile::Symbol
Summary:
The linker needs to be able to determine whether a symbol is text or data to
handle the case of a common being overridden by a strong definition in an
archive. If the archive contains a text member of the same name as the common,
that function is discarded. However, if the archive contains a data member of
the same name, that strong definition overrides the common. This is a behavior
of ld.bfd, which the Qualcomm linker also supports in LTO.
Here's a test case to illustrate:
####
cat > 1.c << \!
int blah;
!
cat > 2.c << \!
int blah() {
  return 0;
}
!
cat > 3.c << \!
int blah = 20;
!
clang -c 1.c
clang -c 2.c
clang -c 3.c
ar cr lib.a 2.o 3.o
ld 1.o lib.a -t
####
The correct output is:
1.o
(lib.a)3.o
Thanks to Shankar Easwaran and Hemant Kulkarni for the test case!
If we had these tests, the bug caused by https://reviews.llvm.org/rL299851 would have been caught sooner.
There's also an assert in the code that should have caught that bug, but the assert line itself has a bug.
Use methods to access data stored with frame instructions
Instructions CALLSEQ_START..CALLSEQ_END and their target-dependent
counterparts keep data like frame size, stack adjustment, etc. This data
is accessed by getOperand using hard-coded indices, which is error prone.
This change implements the access through dedicated methods, which improves
readability and allows changing the data representation without massive
changes to index values.
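As a hedged illustration of the pattern (the type and accessor names below are hypothetical, not the actual LLVM ones): named accessors hide the operand layout so it can change in one place.
```
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an instruction with ordered operands.
struct FrameInstr {
  std::vector<uint64_t> Operands;

  // Before: callers wrote Operands[0] / Operands[1] with magic indices.
  // After: named accessors hide the layout, so it can change in one place.
  uint64_t getFrameSize() const { return Operands.at(0); }
  uint64_t getStackAdjustment() const { return Operands.at(1); }
};

int main() {
  FrameInstr CallSeqStart{{64, 16}};
  assert(CallSeqStart.getFrameSize() == 64);
  assert(CallSeqStart.getStackAdjustment() == 16);
  return 0;
}
```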
Ayman Musa [Thu, 13 Apr 2017 10:03:45 +0000 (10:03 +0000)]
[X86] Added missing mayLoad/mayStore attributes to some X86 instructions.
This missing information was encountered during the effort of automatically generating the X86 memory folding tables.
This is preparatory work for a future patch that automates the generation of these tables.
[LV] Refactor ILV to provide vectorizeInstruction(); NFC
Refactoring InnerLoopVectorizer's vectorizeBlockInLoop() to provide
vectorizeInstruction(). Aligning DeadInstructions with its only user.
Facilitates driving the transformation by VPlan - follows
https://reviews.llvm.org/D28975 and its tentative breakdown.
[APInt] Reorder fields to avoid a hole in the middle of the class
Summary:
APInt is currently implemented with an unsigned BitWidth field first and then a uint64_t/pointer union. Due to the 64-bit size and alignment of the union, there is a hole after the BitWidth.
Putting the union first allows the class to be packed, making it 12 bytes instead of 16. An APSInt goes from 20 bytes to 16 bytes.
This shows a 4k reduction in the size of the opt binary on my local x86-64 build, so it enables some other improvements to the code as well.
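A minimal sketch of the layout effect with mock structs (not the real APInt; sizes assume a typical LP64 target with the Itanium C++ ABI, where a derived class may reuse a non-POD base's tail padding):
```
#include <cstdint>
#include <cstdio>

// BitWidth first: a 4-byte hole sits between it and the 8-byte-aligned union.
struct OldLayout {
  unsigned BitWidth;
  union { uint64_t VAL; uint64_t *pVal; } U;
  OldLayout() : BitWidth(0) { U.VAL = 0; } // non-POD, like APInt
};

// Union first: data ends at offset 12, and the 4 bytes of tail padding can
// be reused by a derived class (e.g. APSInt's bool flag).
struct NewLayout {
  union { uint64_t VAL; uint64_t *pVal; } U;
  unsigned BitWidth;
  NewLayout() : BitWidth(0) { U.VAL = 0; }
};

struct OldDerived : OldLayout { bool IsUnsigned = false; };
struct NewDerived : NewLayout { bool IsUnsigned = false; };

int main() {
  // Typical LP64 output: old 16/24, new 16/16.
  std::printf("old: %zu, derived: %zu\n", sizeof(OldLayout), sizeof(OldDerived));
  std::printf("new: %zu, derived: %zu\n", sizeof(NewLayout), sizeof(NewDerived));
  return 0;
}
```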
[APInt] Generalize the implementation of tcIncrement to support adding a full 'word' by introducing tcAddPart. Use this to support tcIncrement, operator++ and operator+=(uint64_t). Do the same for subtract. NFCI.
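A hedged sketch of the tcAddPart idea (not the in-tree code): add one full word into a multi-word integer, propagating the carry and reporting whether it falls off the top.
```
#include <cassert>
#include <cstdint>

// Add a single 64-bit part into a multi-part integer stored least
// significant part first. Returns true if a carry propagates out of the
// most significant part.
static bool addPart(uint64_t *Dst, uint64_t Part, unsigned Parts) {
  for (unsigned I = 0; I < Parts; ++I) {
    Dst[I] += Part;
    if (Dst[I] >= Part)
      return false; // no wraparound here: the carry chain stops
    Part = 1;       // wrapped: carry one into the next part
  }
  return true; // carry out of the whole number
}

int main() {
  uint64_t N[2] = {~0ULL, 5};       // 5 * 2^64 + (2^64 - 1)
  bool CarryOut = addPart(N, 1, 2); // increment, as tcIncrement would
  assert(!CarryOut && N[0] == 0 && N[1] == 6);
  return 0;
}
```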
At the very least, we have CallInst::setIsNoInline() for adding the
noinline attribute to callsites, and I'm told alwaysinline seems to
work.
I thought of adding "not all attributes are guaranteed to work here". If
someone thinks that would be better (or has a better way of phrasing it,
etc.), I'm happy to add it.
Lang Hames [Thu, 13 Apr 2017 03:51:35 +0000 (03:51 +0000)]
[ORC] Add RPC and serialization support for Errors and Expecteds.
This patch allows Error and Expected types to be passed to and returned from
RPC functions.
Serializers and deserializers for custom error types (types deriving from the
ErrorInfo class template) can be registered with the SerializationTraits for
a given channel type (see registerStringError in RPCSerialization.h for an
example), allowing a given custom type to be sent/received. Unregistered types
will be serialized/deserialized as StringErrors using the custom type's log
message as the error string.
This is a magic header file supported by the build system that provides a
single definition, LLVM_REVISION, containing an LLVM revision identifier,
if available. This functionality previously lived in the LTO library, but
I am moving it out to lib/Support because I want to also start using it in
lib/Object to create the IR symbol table.
This change also fixes a bug where LLVM_REVISION was never actually being
used in lib/LTO because the macro HAS_LLVM_REVISION was never defined (it
was misspelled as HAVE_SVN_VERSION_INC in lib/LTO/CMakeLists.txt, and was
only being defined in a non-existent file Version.cpp).
I also changed the code to use "git rev-parse --git-dir" to locate the .git
directory, instead of looking for it in the LLVM source root directory,
which makes this compatible with monorepos as well as git worktrees.
[IR] Take func, ret, and arg attrs separately in AttributeList::get
This seems like a much more natural API, based on Derek Schuff's
comments on r300015. It further hides the implementation detail of
AttributeList that function attributes come last and appear at index
~0U, which is easy for the user to screw up. git diff says it saves code
as well: 97 insertions(+), 137 deletions(-)
This also makes it easier to change the implementation, which I want to
do next.
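Usage looks roughly like the following sketch, written against the current headers (the overload as of this commit may have differed in detail; the specific attributes chosen here are just for illustration):
```
#include "llvm/IR/Attributes.h"
#include "llvm/IR/LLVMContext.h"
using namespace llvm;

// Build the three attribute groups separately; the API hides where each
// group lives inside the AttributeList (no more magic ~0U index).
AttributeList makeAttrs(LLVMContext &Ctx) {
  AttrBuilder FnB(Ctx), RetB(Ctx), ArgB(Ctx);
  FnB.addAttribute(Attribute::NoUnwind); // function attribute
  RetB.addAttribute(Attribute::NonNull); // return attribute
  ArgB.addAttribute(Attribute::NoAlias); // attribute for argument 0
  return AttributeList::get(Ctx, AttributeSet::get(Ctx, FnB),
                            AttributeSet::get(Ctx, RetB),
                            {AttributeSet::get(Ctx, ArgB)});
}
```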
[IR] Remove the APIntMoveTy typedef from ConstantRange. Use APInt type directly.
This typedef used to be conditional on whether rvalue references were supported. It looks like it got left behind when we switched to always having rvalue references with C++11. I don't think it provides any value now.
Fix compiler error in Attributes.cpp
```
Compiling Attributes.cpp ...
../../../Attributes.cpp: In member function 'std::__1::pair<unsigned int, llvm::Optional<unsigned int> > llvm::AttributeSet::getAllocSizeArgs() const':
../../../Attributes.cpp:542:69: error: operands to ?: have different types 'std::__1::pair<unsigned int, llvm::Optional<unsigned int> >' and 'std::__1::pair<int, int>'
return SetNode ? SetNode->getAllocSizeArgs() : std::make_pair(0, 0);
^
../../../Attributes.cpp:543:1: error: control reaches end of non-void function [-Werror=return-type]
}
^
```
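The fix is to make both arms of the conditional yield the same pair type. A distilled repro, using std::optional as a stand-in for llvm::Optional:
```
#include <optional>
#include <utility>

using RetTy = std::pair<unsigned, std::optional<unsigned>>;

RetTy getAllocSizeArgs(bool HasNode) {
  // Broken: the arms have types RetTy and std::pair<int, int>, which have
  // no common type, so the ?: fails to compile (as in the log above).
  //   return HasNode ? RetTy(1u, 2u) : std::make_pair(0, 0);
  // Fixed: spell out a value of the expected type in the else arm.
  return HasNode ? RetTy(1u, 2u) : RetTy(0u, std::nullopt);
}
```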
Richard Smith [Wed, 12 Apr 2017 23:19:51 +0000 (23:19 +0000)]
ArgList: cache index ranges containing arguments with each ID
Improve performance of argument list parsing with large numbers of IDs and
large numbers of arguments, by tracking a conservative range of indexes within
the argument list that might contain an argument with each ID. In the worst
case (when the first and last argument with a given ID are at the opposite ends
of the argument list), this still results in a linear-time walk of the list,
but it helps substantially in the common case where each ID occurs only once,
or a few times close together in the list.
This gives a ~10x speedup to clang's `test/Driver/response-file.c`, which
constructs a very large set of command line arguments and feeds them to the
clang driver.
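A hedged sketch of the caching idea (not clang's actual ArgList code): remember a conservative [First, Last] index range per option ID and scan only that slice.
```
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Conservative [First, Last] index range for one option ID.
struct IndexRange {
  size_t First = SIZE_MAX, Last = 0;
  void extend(size_t I) {
    if (I < First) First = I;
    if (I > Last) Last = I;
  }
};

struct ArgIndex {
  std::vector<int> Args; // option ID of each parsed argument, in order
  std::unordered_map<int, IndexRange> Ranges;

  void append(int ID) {
    Ranges[ID].extend(Args.size());
    Args.push_back(ID);
  }

  // Return the index of the last argument with this ID, or -1 if absent.
  // Worst case (ID spans the whole list) is still a linear walk, but the
  // common case of one occurrence, or a few close together, is fast.
  long getLastArg(int ID) const {
    auto It = Ranges.find(ID);
    if (It == Ranges.end())
      return -1;
    for (size_t I = It->second.Last + 1; I-- > It->second.First;)
      if (Args[I] == ID)
        return (long)I;
    return -1;
  }
};
```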
[llvm-pdbdump] Minor preparatory refactor of Class Def Dumper.
In a followup patch I intend to introduce an additional dumping
mode which dumps a graphical representation of a class's layout.
In preparation for this, the text-based layout printer needs to
be split out from the graphical layout printer, and both need
to be able to use the same code for printing the intro and outro
of a class's definition (e.g. base class list, etc).
This patch does so, and in the process introduces a skeleton
definition for the graphical printer, while currently making
the graphical printer just print nothing.
[llvm-pdbdump] More advanced class definition dumping.
Previously the dumping of class definitions was very primitive,
and it made it hard to do more than the most trivial of output
formats when dumping. As such, we would only dump one line for
each field, and then dump non-layout items like nested types
and enums.
With this patch, we do a complete analysis of the object
hierarchy including aggregate types, bases, virtual bases,
vftable analysis, etc. The only immediately visible effects
of this are that a) we can now dump a line for the vfptr where
before we would treat that as padding, and b) we now don't
treat virtual bases that come at the end of a class as padding
since we have a more detailed analysis of the class's storage
usage.
In subsequent patches, we should be able to use this analysis
to display a complete graphical view of a class's layout including
recursing arbitrarily deep into an object's base class / aggregate
member hierarchy.
The test fails on Darwin because Fuzzer::DeathCallback (which calls
DumpCurrentUnit("crash-")) is called before DumpCurrentUnit("oom-") is
called in Fuzzer::RssLimitCallback. DeathCallback is transitively called
from __sanitizer_print_memory_profile.
This should fix the fuzzer bot that has been failing for a while:
[ValueTracking] Teach GetUnderlyingObject to stop when it reaches an alloca instruction.
Previously it tried to call SimplifyInstruction, which doesn't know anything about alloca and so defers to constant folding, which also doesn't do anything with alloca. This results in wasted cycles making calls that won't do anything. Given the frequency with which this function is called, this time adds up.
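The shape of the change, as a hedged sketch simplified from the real GetUnderlyingObject loop:
```
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Operator.h"
using namespace llvm;

// Hedged sketch, simplified from ValueTracking's GetUnderlyingObject:
// strip GEPs and bitcasts, but return immediately on an alloca instead of
// falling through to SimplifyInstruction, which can do nothing with it.
static const Value *getUnderlyingObjectSketch(const Value *V,
                                              unsigned MaxLookup = 6) {
  for (unsigned I = 0; I < MaxLookup; ++I) {
    if (auto *GEP = dyn_cast<GEPOperator>(V)) {
      V = GEP->getPointerOperand();
    } else if (auto *BC = dyn_cast<BitCastOperator>(V)) {
      V = BC->getOperand(0);
    } else if (isa<AllocaInst>(V)) {
      return V; // the early-out this commit adds: nothing below an alloca
    } else {
      // The real code may consult SimplifyInstruction here; allocas never
      // reach this point any more, saving those wasted calls.
      return V;
    }
  }
  return V;
}
```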
Piotr Padlewski [Wed, 12 Apr 2017 20:45:12 +0000 (20:45 +0000)]
Remove readnone from invariant.group.barrier
Summary:
The readnone attribute would cause CSE of two barriers with
the same argument, which is invalid, as shown by this example:
  struct Base {
    virtual int foo() { return 42; }
  };
  struct Derived1 : Base {
    int foo() override { return 50; }
  };
  struct Derived2 : Base {
    int foo() override { return 100; }
  };
  void foo() {
    Base *x = new Base{};
    new (x) Derived1{};
    int a = std::launder(x)->foo();
    new (x) Derived2{};
    int b = std::launder(x)->foo();
  }
Here the two calls to std::launder will each produce an
@llvm.invariant.group.barrier call; these would be merged into one,
causing devirtualization to resolve the second call to Derived1::foo()
instead of Derived2::foo().
[Support] Add support for unique_ptr<> to Casting.h.
Often you have a unique_ptr<T> where T supports LLVM's
casting methods, and you wish to cast it to a unique_ptr<U>.
Prior to this patch, this requires doing hacky things like:
  unique_ptr<U> Casted;
  if (isa<U>(Orig.get()))
    Casted.reset(cast<U>(Orig.release()));
This is overly verbose, and it would be nice to just be able
to use unique_ptr directly with cast and dyn_cast. To this end,
this patch updates cast<> to work directly with unique_ptr<T>,
so you can now write:
auto Casted = cast<U>(std::move(Orig));
Since it's possible for dyn_cast<> to fail, however, we choose
to use a slightly different API here, because it's awkward to
write
if (auto Casted = dyn_cast<U>(std::move(Orig))) {}
when Orig may end up not having been moved at all. So the
interface for dyn_cast is
if (auto Casted = unique_dyn_cast<U>(Orig)) {}
where the inclusion of `unique` in the name of the cast operator
reaffirms that, regardless of the success or failure of the cast,
exactly one of the input value and the return value will contain
a non-null result.
[InstCombine] Teach SimplifyMultipleUseDemandedBits to handle And/Or/Xor known bits using the LHS/RHS known bits it already acquired without recursing back into computeKnownBits.
This replicates the known-bits and constant-creation code from the single-use case for these instructions and adds it here. The computeKnownBits and constant-creation code for other instructions is now in the default case of the opcode switch.
[InstCombine] Remove unreachable code for turning an And where all demanded bits on both sides are known to be zero into a constant 0.
We already handled a superset check that included the known ones too and folded to a constant that may include ones, but that check also covers the case of no ones.
[InstCombine] fix wrong undef handling when converting select to shuffle
As discussed in:
https://bugs.llvm.org/show_bug.cgi?id=32486
...the canonicalization of vector select to shufflevector does not hold up
when undef elements are present in the condition vector.
Try to make the undef handling clear in the code and the LangRef.
CodeGen: BlockPlacement: Add comment about DenseMap Safety.
The use of a DenseMap in precomputeTriangleChains does not cause
non-determinism, even though it is iterated over, as the only thing the
iteration does is to insert entries into a new DenseMap, which is not iterated.
Comment only change.
[InstCombine] Teach SimplifyDemandedInstructionBits that even if we reach an instruction that has multiple uses, if we know all the bits for the demanded bits for this context we can go ahead and create a constant.
Currently, if we reach an instruction with multiple uses, we know we can't do any optimizations to that instruction itself since we only have the demanded bits for one of the users. But if we know all of the bits are zero/one for that one user, we can still go ahead and create a constant to give to that user.
This might then reduce the instruction to having a single use and allow additional optimizations on the other path.
This picks up an additional case that r300075 didn't catch.
MachineScheduler: Skip acyclic latency heuristic for in-order cores
The current heuristic is triggered on `InFlightCount > BufferLimit`
which isn't really helpful on in-order cores where BufferLimit is zero.
Note that we already get latency-hiding effects for in-order cores
from instructions staying in the pending queue on stalls; the additional
latency scheduling heuristics have only minimal effects after that, while
occasionally increasing register pressure too much, resulting in extra
spills.
My motivation here is additional spills/reloads ending up in a loop in
the 464.h264ref / BlockMotionSearch function, resulting in a 4% overall
regression on an in-order core. rdar://30264380
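The gist of the guard, as a hedged mock (not the in-tree GenericScheduler code):
```
// Mocked scheduling-model query: the acyclic latency heuristic only makes
// sense when the core can actually buffer micro-ops.
struct SchedModelInfo {
  unsigned MicroOpBufferSize; // 0 on in-order cores
};

static bool shouldApplyAcyclicLatencyHeuristic(const SchedModelInfo &SM,
                                               unsigned InFlightCount) {
  if (SM.MicroOpBufferSize == 0)
    return false; // in-order: the pending queue already hides latency
  return InFlightCount > SM.MicroOpBufferSize; // the existing trigger
}
```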
Teach SimplifyDemandedUseBits that adding or subtracting 0s from every bit below the highest demanded bit can be simplified
If we are adding/subtracting 0s below the highest demanded bit we can just use the other operand and remove the operation.
My primary motivation is observing that we can call ShrinkDemandedConstant for the add/sub and create a 0 constant, rather than removing the add completely. In the case I saw, we modified the constant on an add instruction to 0, but the add was not put into the worklist, so we didn't revisit it until the next InstCombine iteration. This caused an IR modification to remove the add and a subsequent iteration to be run.
With this change we bypass the add in the first iteration and prevent the second iteration from changing anything.
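A quick standalone check of the bit-level fact the transform relies on: if the constant has no set bits at or below the highest demanded bit, the add/sub cannot change any demanded bit.
```
#include <cassert>
#include <cstdint>

int main() {
  // Highest demanded bit is bit 7; C has only 0s at and below that bit,
  // so modulo 2^8 the add/sub leaves the demanded bits of X untouched.
  const uint32_t DemandedMask = 0xFF;
  const uint32_t C = 0x300; // bits 0..7 are all zero
  for (uint32_t X = 0; X < (1u << 12); ++X) {
    assert(((X + C) & DemandedMask) == (X & DemandedMask));
    assert(((X - C) & DemandedMask) == (X & DemandedMask));
  }
  return 0;
}
```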
[InstCombine] morph an existing instruction instead of creating a new one
One potential way to make InstCombine (very slightly?) faster is to recycle instructions
when possible instead of creating new ones. It's not explicitly stated AFAIK, but we don't
consider this an "InstSimplify". We could, however, make a new layer to house transforms
like this if that makes InstCombine more manageable (just throwing out an idea; not sure
how much opportunity is actually here).
Ed Maste [Wed, 12 Apr 2017 13:51:00 +0000 (13:51 +0000)]
Fix detection of backtrace() availability on FreeBSD
On FreeBSD, backtrace() is not part of libc and depends on libexecinfo
being available. Instead of using manual checks, we can use the built-in
CMake module FindBacktrace.cmake to detect the availability of backtrace()
in a portable way.
Patch By: Alex Richardson
Differential Revision: https://reviews.llvm.org/D27143
Jonas Paulsson [Wed, 12 Apr 2017 13:29:25 +0000 (13:29 +0000)]
[SLPVectorizer] Pass the right type argument to getCmpSelInstrCost()
In getEntryCost(), make the scalar type for a compare instruction that of the
operands, not i1. This is needed in order to call getCmpSelInstrCost() for a
compare in a sensible way, the same way as the LoopVectorizer does.
New test: test/Transforms/SLPVectorizer/SystemZ/SLP-cmp-cost-query.ll
Review: Matthew Simpson
https://reviews.llvm.org/D31601
Jonas Paulsson [Wed, 12 Apr 2017 13:13:15 +0000 (13:13 +0000)]
[LoopVectorizer] Improve handling of branches during cost estimation.
The cost of a branch after vectorization is very different depending on
whether the vectorizer will if-convert the block (the branch is eliminated),
or whether scalarized and predicated blocks will be produced (the branch is
duplicated before each block). There is also the case of remaining scalar
branches, such as the back-edge branch.
This patch handles these cases differently with TTI based cost estimates.
Review: Matthew Simpson
https://reviews.llvm.org/D31175
Jonas Paulsson [Wed, 12 Apr 2017 12:41:37 +0000 (12:41 +0000)]
[LoopVectorizer, TTI] New method supportsEfficientVectorElementLoadStore()
Since SystemZ supports vector element load/store instructions, there is no
need for extracts/inserts if a vector load/store gets scalarized.
This patch lets a target specify that it supports such instructions by means
of a new TTI hook that defaults to false.
This is used in the LoopVectorizer's getScalarizationOverhead() method,
which with this patch produces a smaller sum for a vector load/store on
SystemZ.
New test: test/Transforms/LoopVectorize/SystemZ/load-store-scalarization-cost.ll
Review: Adam Nemet
https://reviews.llvm.org/D30680
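The hook pattern, as a hedged sketch (the hook name comes from the commit; the surrounding plumbing here is mocked):
```
#include <cassert>

// Mocked TTI-style hook: targets that have vector element load/store
// instructions override the conservative default answer.
struct TTIMock {
  virtual ~TTIMock() = default;
  virtual bool supportsEfficientVectorElementLoadStore() const {
    return false; // default for all targets
  }
};

struct SystemZTTIMock : TTIMock {
  bool supportsEfficientVectorElementLoadStore() const override {
    return true;
  }
};

// Sketch of the cost-model use: skip the extract/insert overhead that a
// scalarized vector load/store would otherwise be charged.
unsigned scalarizationOverhead(const TTIMock &TTI, unsigned NumElts,
                               unsigned PerEltCost) {
  if (TTI.supportsEfficientVectorElementLoadStore())
    return 0; // elements can be loaded/stored directly
  return NumElts * PerEltCost;
}

int main() {
  assert(scalarizationOverhead(SystemZTTIMock{}, 4, 1) == 0);
  assert(scalarizationOverhead(TTIMock{}, 4, 1) == 4);
  return 0;
}
```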