Gadi Haber [Thu, 28 Dec 2017 15:00:41 +0000 (15:00 +0000)]
[X86][PREFETCH]: Adding full coverage of MC encoding for the PREFETCH isa sets.<NFC>
NFC.
Adding MC regressions tests to cover the PREFETCH isa sets for both 32 and 64 bit.
This patch is part of a larger task to cover MC encoding of all X86 ISA Sets started in revision: https://reviews.llvm.org/D39952
[dsymutil][NFC] Replace calls to CoreFoundation with LLVM equivalent.
This patch replaces a block of logic that was implemented using
CoreFoundations calls with functionally equivalent logic that makes use
of LLVM libraries.
Max Kazantsev [Thu, 28 Dec 2017 12:03:12 +0000 (12:03 +0000)]
[RewriteStatepoints] Fix incorrect assertion
`RewriteStatepointsForGC` iterates over function blocks and their predecessors
in order of declaration. One of outcomes of this is that callsites are placed in
arbitrary order which has nothing to do with travelsar order.
On the other hand, function `recomputeLiveInValues` asserts that bases are
added to `Info.PointerToBase` before their deried pointers are updated. But
if call sites are processed in order different from RPOT, this is not necessarily
true. We cannot guarantee that the base was placed there before every
pointer derived from it. All we can guarantee is that this base was marked as
known base by this point.
This patch replaces the fact that we assert from checking that the base was
added to the map with assert that the base was marked as known base.
Simon Pilgrim [Thu, 28 Dec 2017 10:05:49 +0000 (10:05 +0000)]
[X86][SSE] Use PMADDWD for v4i32 multiplies with 17 or more leading zeros
If there are 17 or more leading zeros to the v4i32 elements, then we can use PMADD for the integer multiply when PMULLD is unavailable or slow.
The 17 bits need to be zero as the PMADDWD performs a v8i16 signed-mul-extend + pairwise-add - the upper 16 so we're adding a zero pair and the 17th bit so we don't incorrectly sign extend.
Reid Kleckner [Thu, 28 Dec 2017 05:10:33 +0000 (05:10 +0000)]
Revert "[memcpyopt] Teach memcpyopt to optimize across basic blocks"
This reverts r321138. It seems there are still underlying issues with
memdep. PR35519 seems to still be present if debug info is enabled. We
end up losing a memcpy. Somehow during store to memset merging, we
insert the memset after the memcpy or fail to update the memdep analysis
to account for the newly inserted memset of a pair.
int __attribute__((optnone)) main() {
// Put some data in the vector and then remove it so we take the push_back
// fast path.
std::vector<std::pair<std::string, std::vector<std::string>>> crl_set;
crl_set.push_back({"asdf", {}});
crl_set.pop_back();
printf("first word in vector storage: %p\n", *(void**)crl_set.data());
// Do the push_back which may fail to initialize the data.
do_push_back(&crl_set);
auto* first = &crl_set.back().first;
printf("first word in vector storage (should be zero): %p\n",
*(void**)crl_set.data());
assert(first->empty());
puts("ok");
}
Compile with libc++, enable optimizations, and enable debug info:
$ clang++ -stdlib=libc++ -g -O2 t.cpp -o t.exe -Wl,-rpath=llvm/build/lib
Craig Topper [Wed, 27 Dec 2017 19:09:40 +0000 (19:09 +0000)]
[X86] Reimplement r321437 using custom lowering instead of as a DAG combine.
My original implementation ran as a DAG combine post type legalization, but it turns out we don't run that DAG combine step if type legalization didn't change anything. Attempts to make the combine run before type legalization as well hit other issues.
So just do it in LowerMUL where we can catch more cases.
Matthew Simpson [Wed, 27 Dec 2017 15:25:01 +0000 (15:25 +0000)]
[AArch64] Change order of candidate FMLS patterns
r319980 added new patterns to the machine combiner for transforming (fsub (fmul
x y) z) into (fmla (fneg z) x y). That is, fsub's where the first source
operand is an fmul are transformed. We previously only matched the case where
the second source operand of an fsub was an fmul, transforming (fsub z (fmul x
y)) into (fmls z x y). Now, if we have an fsub where both source operands are
fmuls, both of the above patterns are applicable.
However, the order in which we add the patterns to the list of candidates
determines the transformation that takes place, since only the first pattern
that matches will be used. This patch changes the order these two patterns are
added to the list of candidates such that we prefer the case where the second
source operand is an fmul (the fmls case), rather than the other one (the
fmla/fneg case). When both source operands are fmuls, this ordering results in
fewer instructions.
Mikael Holmen [Wed, 27 Dec 2017 08:48:33 +0000 (08:48 +0000)]
[Lint] Don't warn about noalias argument aliasing if other argument is byval
Summary:
When using byval, the data is effectively copied as part of the call
anyway, so we aren't actually passing the pointer and thus there is no
reason to issue a warning.
Gadi Haber [Wed, 27 Dec 2017 08:35:57 +0000 (08:35 +0000)]
[X86][RD]: Adding full coverage of MC encoding for RD isa sets.<NFC>
NFC.
Adding MC regressions tests to cover RDPMC, RDRAND, RDRAND, RDSEED, RDTSCP, DWRFSGS isa sets.
This patch is part of a larger task to cover MC encoding of all X86 isa sets started in revision: https://reviews.llvm.org/D39952
Serguei Katkov [Wed, 27 Dec 2017 08:26:22 +0000 (08:26 +0000)]
[SCEV] Be careful with nuw/nsw/exact in InsertBinop
InsertBinop tries to find an appropriate instruction instead of
creating a new instruction. When it checks whether instruction is
the same as we need to create it ignores nuw/nsw/exact flags.
It leads to invalid behavior when poison instruction can be used
when it was not expected. Specifically, for example Expander
expands the SCEV built for instruction
%a = add i32 %v, 1
It is possible that InsertBinop can find an instruction
% b = add nuw nsw i32 %v, 1
and will use it instead of version w/o nuw nsw.
It is incorrect.
The patch conservatively ignores all instructions with any of
poison flags installed.
Serguei Katkov [Wed, 27 Dec 2017 07:15:23 +0000 (07:15 +0000)]
[SCEV] Do not insert if it is already in cache
This is fix for the crash caused by ScalarEvolution::getTruncateExpr.
It expects that if it checked the condition that SCEV is not in UniqueSCEVs cache in
the beginning that it will not be there inside this method.
However during recursion and transformation/simplification for sub expression,
it is possible that these modifications will end up with the same SCEV as we started from.
So we must always check whether SCEV is in cache and do not insert item if it is already there.
Sanjay Patel [Tue, 26 Dec 2017 15:09:19 +0000 (15:09 +0000)]
[ValueTracking] ignore FP signed-zero when detecting a casted-to-integer fmin/fmax pattern
This is a preliminary step for the patch discussed in D41136 (and denoted here with the FIXME comment).
When we match an FP min/max that is cast to integer, any intermediate difference between +0.0 or -0.0
should be muted in the result by the conversion (either fptosi or fptoui) of the result. Thus, we can
enable 'nsz' for the purpose of matching fmin/fmax.
Note that there's probably room to generalize this more, possibly by fixing the current calls to the
weak version of isKnownNonZero() in matchSelectPattern() to the more powerful recursive version.
George Rimar [Mon, 25 Dec 2017 09:41:00 +0000 (09:41 +0000)]
[MC] - Disallow invalid section groups declarations.
This fixes parseGroup() so that it always sets error condition on error.
Previously it was not done, because parseIdentifier looks never do that,
assuming that caller should do it if he wants to.
So previously cases from test were silently accepted and produced broken output.
Max Kazantsev [Mon, 25 Dec 2017 09:35:10 +0000 (09:35 +0000)]
[SafepointIRVerifier] Allow non-dereferencing uses of unrelocated or poisoned PHI nodes
PHI that has at least one unrelocated input cannot cause any issues by itself,
though its uses should be carefully verified. With this patch PHIs are allowed
to have any inputs but when all inputs are unrelocated the PHI is marked as
unrelocated and if not all inputs are unrelocated then the PHI is marked as
poisoned. Poisoned pointers can be used only in three ways: to derive new
pointers, in PHIs or in comparisons against constants that are exclusively
derived from null.
Craig Topper [Mon, 25 Dec 2017 06:47:10 +0000 (06:47 +0000)]
[X86] Add a DAG combines to turn vXi64 muls into VPMULDQ/VPMULUDQ if the upper bits are all sign bits or zeros.
Normally we catch this during lowering, but vXi64 mul is considered legal when we have AVX512DQ.
This DAG combine allows us to avoid PMULLQ with AVX512DQ if we can prove its unnecessary. PMULLQ is 3 uops that take 4 cycles each. While pmuldq/pmuludq is only one 4 cycle uop.
Don Hinton [Mon, 25 Dec 2017 01:23:09 +0000 (01:23 +0000)]
[cmake] Always respect existing CMAKE_REQUIRED_FLAGS when adding additional ones.
Summary:
Always respect existing CMAKE_REQUIRED_FLAGS when adding
additional ones. This is important when cross compiling where
--sysroot and -target were already added.
In particular, this is needed when cross compiling from Darwin to
Linux, since --sysroot is required to find headers and libraries.
Cmake has a similar bug in check_include_file[_cxx] where
CMAKE_REQUIRED_LIBRARIES isn't passed, which causes
try_compile to fail.
(please see https://gitlab.kitware.com/cmake/cmake/merge_requests/1620)
Simon Pilgrim [Sun, 24 Dec 2017 12:20:21 +0000 (12:20 +0000)]
[X86][X87] Mark pseudo memory fold instructions as load/sideeffects (PR21160, PR34080, PR34454).
Match regular x87 memory fold instructions with load/sideeffects tags, to prevent the schedulers from re-ordering them across the fnstcw/fldcw sequences for truncating stores while they are still pseudo during the stack conversion pass.
Craig Topper [Sun, 24 Dec 2017 06:51:36 +0000 (06:51 +0000)]
[X86] Fix (v2f64 (s/uint_to_fp (v2i1))) to avoid scalarization without AVX512DQ.
Previously we extended v2i1 to v2f64 and then tried to use cvtuqq2pd/cvtqq2pd, but that only works with avx512dq. So we ended up scalarizing it. Now we widen to v4i1 first and extend to v4i32.
Craig Topper [Sun, 24 Dec 2017 02:05:18 +0000 (02:05 +0000)]
[DAGCombiners] Don't turn ANDs to shuffles with zero so early. Give some other combines a chance to run.
This moves the combine for turning ANDs into shuffle with zero out of SimplifyVBinOps and places it only in visitAND below the reassociate handling. This fixes the specific case I noticed where we failed to combine two ands with constants.
Davide Italiano [Sat, 23 Dec 2017 15:06:30 +0000 (15:06 +0000)]
[SCCP] Manually fold branches on undef.
This code was originally removed and replace with an assertion
because believed unnecessary. It turns out there was simply
no test coverage for this case, and the constant folder doesn't
yet know about patterns like `br undef %label1, %label2`.
Presumably at some point the constant folder might learn about
these patterns, but it's a broader change.
A testcase will be added to make sure this doesn't regress again
in the future.
Craig Topper [Sat, 23 Dec 2017 02:54:50 +0000 (02:54 +0000)]
[SelectionDAG][X86] Don't use ->getValueType(0) after a call to getOperand to get the type of the operand.
getOperand returns an SDValue that contains the node and the result number. There is no guarantee that the result number if 0. By using the -> operator we are calling SDNode::getValueType rather than SDValue::getValueType. This requires supplying a result number and we shouldn't assume it was 0.
I don't have a test case. Just noticed while cleaning up some other code and saw that it occurred in other places.
Walter Lee [Fri, 22 Dec 2017 21:19:13 +0000 (21:19 +0000)]
[git-llvm] Handle files ignored by svn correctly
Summary: Correctly handle files ignored by svn (such as .o files,
which are ignored by default) by adding "--no-ignore" flag to "svn
status" and "svn add".
Alina Sbirlea [Fri, 22 Dec 2017 19:54:03 +0000 (19:54 +0000)]
[MemorySSA] Allow reordering of loads that alias in the presence of volatile loads.
Summary:
Make MemorySSA allow reordering of two loads that may alias, when one is volatile.
This makes MemorySSA less conservative and behaving the same as the AliasSetTracker.
For more context, see D16875.
LLVM language reference: "The optimizers must not change the number of volatile operations or change their order of execution relative to other volatile operations. The optimizers may change the order of volatile operations relative to non-volatile operations. This is not Java’s “volatile” and has no cross-thread synchronization behavior."
Guozhi Wei [Fri, 22 Dec 2017 18:54:04 +0000 (18:54 +0000)]
[SimplifyCFG] Don't do if-conversion if there is a long dependence chain
If after if-conversion, most of the instructions in this new BB construct a long and slow dependence chain, it may be slower than cmp/branch, even if the branch has a high miss rate, because the control dependence is transformed into data dependence, and control dependence can be speculated, and thus, the second part can execute in parallel with the first part on modern OOO processor.
This patch checks for the long dependence chain, and give up if-conversion if find one.
In https://reviews.llvm.org/rL321077 and https://reviews.llvm.org/D41231 I fixed a regression in the c-api which prevented the pruning from being *effectively* disabled.
However this approach, helpfully recommended by @labath, is cleaner.
It is also nice to remove the weasel words about effectively disabling from the api comments.
Sanjoy Das [Fri, 22 Dec 2017 18:21:59 +0000 (18:21 +0000)]
(Re-landing) Expose a TargetMachine::getTargetTransformInfo function
Re-land r321234. It had to be reverted because it broke the shared
library build. The shared library build broke because there was a
missing LLVMBuild dependency from lib/Passes (which calls
TargetMachine::getTargetIRAnalysis) to lib/Target. As far as I can
tell, this problem was always there but was somehow masked
before (perhaps because TargetMachine::getTargetIRAnalysis was a
virtual function).
Original commit message:
This makes the TargetMachine interface a bit simpler. We still need
the std::function in TargetIRAnalysis to avoid having to add a
dependency from Analysis to Target.
See discussion:
http://lists.llvm.org/pipermail/llvm-dev/2017-December/119749.html
I avoided adding all of the backend owners to this review since the
change is simple, but let me know if you feel differently about this.
Craig Topper [Fri, 22 Dec 2017 17:18:13 +0000 (17:18 +0000)]
[SelectionDAG] Reverse the order of operands in the ISD::ADD created by TargetLowering::getVectorElementPointer so that the FrameIndex is on the left.
This seems to improve X86's ability to match this into an address computation. Otherwise the other operand gets assigned to the base register and the stack pointer + frame index ends up in the index register. But index registers can't encode ESP/RSP so we end up having to move it into another register to meet the constraint.
I could try to improve the address matcher in X86, but swapping the producer seemed easier. Several other places already have the operands in this order so this is at least consistent.
Craig Topper [Fri, 22 Dec 2017 17:18:11 +0000 (17:18 +0000)]
[X86] When lowering insert_vector_elt/extract_vector_elt of vXi1 with a non-constant index just use either a 128-bit type or the vXi8 type with the correct number of elements.
Despite what the comment said there isn't better codegen for 512-bit vectors. The 128/256/512 bit implementation jus stores to memory and loads an element. There's no advantage to doing that with a larger size. In fact in many cases it causes a stack realignment and generates worse code.
Haicheng Wu [Fri, 22 Dec 2017 17:09:09 +0000 (17:09 +0000)]
[InlineCost] Find more free binary operations
Currently, inline cost model considers a binary operator as free only if both
its operands are constants. Some simple cases are missing such as a + 0, a - a,
etc. This patch modifies visitBinaryOperator() to call SimplifyBinOp() without
going through simplifyInstruction() to get rid of the constant restriction.
Thus, visitAnd() and visitOr() are not needed.
Diana Picus [Fri, 22 Dec 2017 11:09:18 +0000 (11:09 +0000)]
[ARM GlobalISel] Support pointer constants
Pointer constants are pretty rare, since we usually represent them as
integer constants and then cast to pointer. One notable exception is the
null pointer constant, which is represented directly as a G_CONSTANT 0
with pointer type. Mark it as legal and make sure it is selected like
any other integer constant.
Chandler Carruth [Fri, 22 Dec 2017 06:41:23 +0000 (06:41 +0000)]
Rewrite the cached map used for locating the most precise DIE among
inlined subroutines for a given address.
This is essentially the hot path of llvm-symbolizer when extracting
inlined frames during symbolization. Previously, we would read every
subprogram and every inlined subroutine, building a std::map across the
entire PC space to the best DIE, and then do only a handful of queries
as we symbolized a backtrace. A huge fraction of the time was spent
building the map itself.
This patch changes it two a two-level system. First, we just build a map
from PC-interval to DWARF subprograms. These are required to be disjoint
and so constructing this is pretty easy. Second, we build a map *just*
for the inlined subroutines within the subprogram containing the query
address. This allows us to look at far fewer DIEs and build a *much*
smaller set of cached maps in the llvm-symbolizer case where only a few
address get symbolized during the entire run.
It also builds both interval maps in a very different way. It constructs
a single flat vector of pairs that maps from offset -> index. The
indices point into collections of DIE objects, but can also be
"tombstones" (-1) to mark gaps. In the case of subprograms, this mostly
just simplifies the data structure a bit. For inlined subroutines,
because we carefully split them as we build the map, we end up in many
cases having no holes and not having to store both start and stop
offsets.
Finally, the PC ranges for the inlined subroutines are compressed into
32-bits by making them relative to the base PC of the outer subprogram.
This means that if you have a single function body with over 2gb of
executable code in it, we will stop mapping address past the first 2gb
of that function into inlined subroutines and just give you the
subprogram. This doesn't seem like a problem. ;]
All of this combines to make llvm-symbolizer *well* over 2x faster for
symbolizing backtraces out of LLVM's unittests. Death-test heavy unit
tests are running >2x faster. I'm still going to look at completely
disabling symbolization there, but figured while I had a good benchmark
we should make symbolization a bit better.
Sadly, the logic to build the flat interval map for the inlined
subroutines is fairly complex. I'm not super happy about this and
welcome any simplifying suggestions.
Huge thanks to Dave Blaikie who helped walk me through what the various
things I needed to do in DWARF to make this work.