[PowerPC] On PPC32, 128-bit shifts might be runtime calls
The counter-loops formation pass needs to know what operations might be
function calls (because they can't appear in counter-based loops). On PPC32,
128-bit shifts might be runtime calls (even though you can't use __int128 on
PPC32, it seems that SROA might form them).
autoconf: Fix soname for libLLVM-Major.Minor.so (2nd try)
We were using libLLVM-Major.Minor.Patch.so for the soname, but we
need the soname to stay consistent for all Major.Minor.* releases
otherwise operating system distributors will need to rebuild all
packages that link with LLVM every time there is a new point release.
This patch also reverses the compatibility symlink, so
libLLVM-Major.Minor.Patch.so is now a symlink that points
to libLLVM-Major-Minor.so.
Fix diassembler handling of rex.b when mod=00/01/10 and bbb=101. Mod=00 should ignore the base register entirely. Mod=01/10 should treat this as R13 plus displacment. Fixes PR18860.
BasicAA: Use reachabilty instead of dominance for checking value equality in phi
cycles
This allows the value equality check to work even if we don't have a dominator
tree. Also add some more comments.
I was worried about compile time impacts and did not implement reachability but
used the dominance check in the initial patch. The trade-off was that the
dominator tree was required.
The llvm utility function isPotentiallyReachable cuts off the recursive search
after 32 visits. Testing did not show any compile time regressions showing my
worries unjustfied.
No compile time or performance regressions at O3 -flto -mavx on test-suite +
externals.
When there are cycles in the value graph we have to be careful interpreting
"Value*" identity as "value" equivalence. We interpret the value of a phi node
as the value of its operands.
When we check for value equivalence now we make sure that the "Value*" dominates
all cycles (phis).
[PowerPC] Add a full condition code register to make the "cc" clobber work
gcc inline asm supports specifying "cc" as a clobber of all condition
registers. Add just enough modeling of the full register to make this work.
Fixed PR19326.
Fix PR19144: Incorrect offset generated for int-to-fp conversion at -O0.
When converting a signed 32-bit integer to double-precision floating point on
hardware without a lfiwax instruction, we have to instead use a lfd followed
by fcfid. We were erroneously offsetting the address by 4 bytes in
preparation for either a lfiwax or lfiwzx when generating the lfd. This fixes
that silly error.
This was not caught in the test suite since the conversion tests were run with
-mcpu=pwr7, which implies availability of lfiwax. I've added another test
case for older hardware that checks the code we expect in the absence of
lfiwax and other flavors of fcfid. There are fewer tests in this test case
because we punt to DAG selection in more cases on older hardware. (We must
generate complex fiddly sequences in those cases, and there is marginal
benefit in duplicating that logic in fast-isel.)
GPRC_NOR0 is not a subclass of GPRC (because it also contains the ZERO pseudo
register). As a result, we also need to check for it in the spilling code.
My understanding (from reading just the llvm code) is that
* most ppc cpus have a "sync n" instruction and an msync alias that is
* "sync 0".
* "book e" cpus instead have a msync instruction and not the more
general "sync n"
This patch reflects that in the .td files, allowing a single codepath
for
asm ond obj streamer and incidentelly fixes a crash when EmitRawText was
called on a obj streamer.
For PPC64 SVR (and Darwin), the stores that take byval aggregate parameters
from registers into the stack frame had MachinePointerInfo objects with
incorrect offsets. These offsets are relative to the object itself, not to the
stack frame base.
This fixes self hosting on PPC64 when compiling with -enable-aa-sched-mi.
LoopVectorizer: A reduction that has multiple uses of the reduction value is not
a reduction.
Really. Under certain circumstances (the use list of an instruction has to be
set up right - hence the extra pass in the test case) we would not recognize
when a value in a potential reduction cycle was used multiple times by the
reduction cycle.
Fix loop rerolling pass failure with non-consant loop lower bound
The loop rerolling pass was failing with an assertion failure from a
failed cast on loops like this:
void foo(int *A, int *B, int m, int n) {
for (int i = m; i < n; i+=4) {
A[i+0] = B[i+0] * 4;
A[i+1] = B[i+1] * 4;
A[i+2] = B[i+2] * 4;
A[i+3] = B[i+3] * 4;
}
}
The code was casting the SCEV-expanded code for the new
induction variable to a phi-node. When the loop had a non-constant
lower bound, the SCEV expander would end the code expansion with an
add insted of a phi node and the cast would fail.
It looks like the cast to a phi node was only needed to get the
induction variable value coming from the backedge to compute the end
of loop condition. This patch changes the loop reroller to compare
the induction variable to the number of times the backedge is taken
instead of the iteration count of the loop. In other words, we stop
the loop when the current value of the induction variable ==
IterationCount-1. Previously, the comparison was comparing the
induction variable value from the next iteration == IterationCount.
This problem only seems to occur on 32-bit targets. For some reason,
the loop is not rerolled on 64-bit targets.
Put a limit on ScheduleDAGSDNodes::ClusterNeighboringLoads to avoid blowing up compile time.
Fixes PR16365 - Extremely slow compilation in -O1 and -O2.
The SD scheduler has a quadratic implementation of load clustering
which absolutely blows up compile time for large blocks with constant
pool loads. The MI scheduler has a better implementation of load
clustering. However, we have not done the work yet to completely
eliminate the SD scheduler. Some benchmarks still seem to benefit from
early load clustering, although maybe by chance.
As an intermediate term fix, I just put a nice limit on the number of
DAG users to search before finding a match. With this limit there are no
binary differences in the LLVM test suite, and the PR16365 test case
does not suffer any compile time impact from this routine.
Issue outcomes from DAGCombiner::MergeConsequtiveStores, more precisely from
mem-ops sequence sorting.
Consider, how MergeConsequtiveStores works for next example:
store i8 1, a[0]
store i8 2, a[1]
store i8 3, a[1] ; a[1] again.
return ; DAG starts here
1. Method will collect all the 3 stores.
2. It sorts them by distance from the base pointer (farthest with highest
index).
3. It takes first consecutive non-overlapping stores and (if possible) replaces
them with a single store instruction.
The point is, we can't determine here which 'store' instruction
would be the second after sorting ('store 2' or 'store 3').
It happens that 'store 3' would be the second, and 'store 2' would be the third.
So after merging we have the next result:
store i16 (1 | 3 << 8), base ; is a[0] but bit-casted to i16
store i8 2, a[1]
So actually we swapped 'store 3' and 'store 2' and got wrong contents in a[1].
Fix: In sort routine just also take into account mem-op sequence number.
[LPM] A terribly simple fix to a terribly complex bug: PR18773.
The crux of the issue is that LCSSA doesn't preserve stateful alias
analyses. Before r200067, LICM didn't cause LCSSA to run in the LTO pass
manager, where LICM runs essentially without any of the other loop
passes. As a consequence the globalmodref-aa pass run before that loop
pass manager was able to survive the loop pass manager and be used by
DSE to eliminate stores in the function called from the loop body in
Adobe-C++/loop_unroll (and similar patterns in other benchmarks).
When LICM was taught to preserve LCSSA it had to require it as well.
This caused it to be run in the loop pass manager and because it did not
preserve AA, the stateful AA was lost. Most of LLVM's AA isn't stateful
and so this didn't manifest in most cases. Also, in most cases LCSSA was
already running, and so there was no interesting change.
The real kicker is that LCSSA by its definition (injecting PHI nodes
only) trivially preserves AA! All we need to do is mark it, and then
everything goes back to working as intended. It probably was blocking
some other weird cases of stateful AA but the only one I have is
a 1000-line IR test case from loop_unroll, so I don't really have a good
test case here.
Hopefully this fixes the regressions on performance that have been seen
since that revision.
Fixed old typo in ScalarEvolution, that caused wrong SCEVs zext operation.
Detailed description is here:
http://llvm.org/bugs/show_bug.cgi?id=18000#c16
For participation in bugfix process special thanks to David Wiberg.
[patch] Adjust behavior of FDE cross-section relocs for targets that don't support abs-differences.
Modern versions of OSX/Darwin's ld (ld64 > 97.17) have an optimisation present that allows the back end to omit relocations (and replace them with an absolute difference) for FDE some text section refs.
This patch allows a backend to opt-in to this behaviour by setting "DwarfFDESymbolsUseAbsDiff". At present, this is only enabled for modern x86 OSX ports.
R600/SI: Initialize M0 and emit S_WQM_B64 whenever DS instructions are
used
DS instructions that access local memory can only uses addresses that
are less than or equal to the value of M0. When M0 is uninitialized,
then we experience undefined behavior.
This patch also changes the behavior to emit S_WQM_B64 on pixel shaders
no matter what kind of DS instruction is used.
This doesn't change any functionality, since we only have two shader
types (compute and pixel) that use local memory. We're just changing
the logic to match the documentation.
R600: Remove successive JUMP in AnalyzeBranch when AllowModify is true
This fixes a crash in the OpenCV OpenCL test suite.
There is no lit test for this, because the test would be very large
and could easily be invalidated by changes to the scheduler
or other parts of the compiler.
This pattern uses an SDNodeXForm, which isn't being emitted for some
reason. I can get it to work by attaching the PatLeaf that has the
XForm to the argument in the output pattern, but this results in an
immediate being used in a register operand, which the backend can't
handle yet.
Add patch level to llvm version in CMake and Autoconf
The shared library generated by autoconf will now be called
libLLVM-$(VERSION_MAJOR).$(VERSION_MINOR).$(VERSION_PATCH)$(VERSION_SUFFIX).so
and a symlink named
libLLVM-$(VERSION_MAJOR).$(VERSION_MINOR)$(VERSION_SUFFIX).so will
also be created in the install directory.
Revert "Revert "Mark vastart_save_xmm_regs as changing EFLAGS""
This reverts commit r197481, recommiting r197469 with an extra fix.
The vastart_save_xmm_regs pseudo-instruction expands to a test and a
branch, so it modifies EFLAGS. Mark it so, or else the scheduler might
place it in the middle of another test+branch.
This fixes a bug exposed by r192750, which changed the initial scheduler
to source-order as part of enabling the MI Scheduler for X86.
This re-commit changes the VASTART_SAVE_XMM_REGS custom inserter not to
try to save %flags, and adds a test that catches the bad behavior of
r197469.
The reason is that EltsFromConsecutiveLoads could emit such load instruction
both before and after legalize stage. Though this instruction is not legal for
machines with SSSE3 and lower.
The fix: In EltsFromConsecutiveLoads, if we have passed legalize stage, we
check whether nodes it emits are legal.
P.S.: If you get failure in time from 12:00 and till 22:00 (UTC-8),
perhaps I'll slow with response, so you better reject this commit. Thanks!
Bill Wendling [Fri, 20 Dec 2013 04:26:57 +0000 (04:26 +0000)]
Merging r197718:
------------------------------------------------------------------------
r197718 | hans | 2013-12-19 12:32:44 -0800 (Thu, 19 Dec 2013) | 10 lines
Make sys::ThreadLocal<> zero-initialized on non-thread builds (PR18205)
According to the docs, ThreadLocal<>::get() should return NULL
if no object has been set. This patch makes that the case also for non-thread
builds and adds a very basic unit test to check it.
(This was causing PR18205 because PrettyStackTraceHead didn't get zero-
initialized and we'd crash trying to read past the end of that list. We didn't
notice this so much on Linux since we'd crash after printing all the entries,
but on Mac we print into a SmallString, and would crash before printing that.)
------------------------------------------------------------------------
Bill Wendling [Mon, 16 Dec 2013 03:48:58 +0000 (03:48 +0000)]
Merging r195411:
------------------------------------------------------------------------
r195411 | mgottesman | 2013-11-21 21:00:51 -0800 (Thu, 21 Nov 2013) | 1 line
[block-freq] Update data in test case to be unsigned long long to fix mingw build.
------------------------------------------------------------------------
[inliner] Fix PR18206 by preventing inlining functions that call setjmp
through an invoke instruction.
The original patch for this was written by Mark Seaborn, but I've
reworked his test case into the existing returns_twice test case and
implemented the fix by the prior refactoring to actually run the cost
analysis over invoke instructions, and then here fixing our detection of
the returns_twice attribute to work for both calls and invokes. We never
noticed because we never saw an invoke. =[
------------------------------------------------------------------------
[inliner] Completely change (and fix) how the inline cost analysis
handles terminator instructions.
The inline cost analysis inheritted some pretty rough handling of
terminator insts from the original cost analysis, and then made it much,
much worse by factoring all of the important analyses into a separate
instruction visitor. That instruction visitor never visited the
terminator.
This works fine for things like conditional branches, but for many other
things we simply computed The Wrong Value. First example are
unconditional branches, which should be free but were counted as full
cost. This is most significant for conditional branches where the
condition simplifies and folds during inlining. We paid a 1 instruction
tax on every branch in a straight line specialized path. =[
Oh, we also claimed that the unreachable instruction had cost.
But it gets worse. Let's consider invoke. We never applied the call
penalty. We never accounted for the cost of the arguments. Nope. Worse
still, we didn't handle the *correctness* constraints of not inlining
recursive invokes, or exception throwing returns_twice functions. Oops.
See PR18206. Sadly, PR18206 requires yet another fix, but this
refactoring is at least a huge step in that direction.
------------------------------------------------------------------------
Fix a use-after-free error in GlobalOpt CleanupConstantGlobalUsers
GlobalOpt's CleanupConstantGlobalUsers function uses a worklist array to manage
constant users to be visited. The pointers in this array need to be weak
handles because when we delete a constant array, we may also be holding a
pointer to one of its elements (or an element of one of its elements if we're
dealing with an array of arrays) in the worklist.
X86: When lowering shl_parts, don't emit shift amounts larger than the bit width.
While it's safe for the X86-specific shift nodes, dag combining will
kill generic nodes. Insert an AND to make it safe, isel will nuke it
as x86's shift instructions have an implicit AND.
Fixes PR16108, which contains a contraption to hit this case in between
constant folders.
------------------------------------------------------------------------
With some test case changes:
- intrinsic test added to the existing /test/CodeGen/AArch64/neon-aba-abd.ll.
- New test cases to cover movi 1D scenario without using the intrinsic in
test/CodeGen/AArch64/neon-mov.ll.
Debug Info: drop debug info via upgrading path if version number does not match.
Add a helper function getDebugInfoVersionFromModule to return the debug info
version number for a module.
"Verifier/module-flags-1.ll" checks for verification errors.
It will seg fault when calling getDebugInfoVersionFromModule because of the
incorrect format for module flags in the testing case. We make
getModuleFlagsMetadata more robust by checking for error conditions.
Debug Info: update testing cases to specify the debug info version number.
We are going to drop debug info without a version number or with a different
version number, to make sure we don't crash when we see bitcode files with
different debug info metadata format.
Make tests more robust by removing hard-coded metadata numbers in CHECK lines.
Debug Info: update testing cases to specify the debug info version number.
We are going to drop debug info without a version number or with a different
version number, to make sure we don't crash when we see bitcode files with
different debug info metadata format.
Tim Northover [Mon, 9 Dec 2013 10:48:32 +0000 (10:48 +0000)]
Merge rest of r196210. Some bits strayed into r196701, turning 3.4 red. This
should fix the issue.
------------------------------------------------------------------------
r196210 | haoliu | 2013-12-03 06:06:55 +0000 (Tue, 03 Dec 2013) | 3 lines
[AArch64]Add missing floating point convert, round and misc intrinsics.
E.g. int64x1_t vcvt_s64_f64(float64x1_t a) -> FCVTZS Dd, Dn
Tim Northover [Mon, 9 Dec 2013 09:05:30 +0000 (09:05 +0000)]
Merge r196725 (conflicts on same API as before):
------------------------------------------------------------------------
r196725 | tnorthover | 2013-12-08 15:56:50 +0000 (Sun, 08 Dec 2013) |
19 lines
ARM: fix folding of stack-adjustment (yet again).
When trying to eliminate an "sub sp, sp, #N" instruction by folding
it into an existing push/pop using dummy registers, we need to account
for the fact that this might affect precisely how "fp" gets set in the
prologue.
We were attempting this, but assuming that *whenever* we performed a
fold it would make a difference. This is false, for example, in:
push {r4, r7, lr}
add fp, sp, #4
vpush {d8}
sub sp, sp, #8
we can fold the "sub" into the "vpush", forming "vpush {d7, d8}".
However, in that case the "add fp" instruction mustn't change, which
we were getting wrong before.
Should fix PR18160.
------------------------------------------------------------------------