Tim Northover [Fri, 14 Nov 2014 01:30:14 +0000 (01:30 +0000)]
X86: use getConstant rather than getTargetConstant behind BUILD_VECTOR.
getTargetConstant should only be used when you can guarantee the instruction
selected will be able to cope with the raw value. BUILD_VECTOR is rather too
generic for this so we should use getConstant instead. In that case, an
instruction can still consume the constant, but if it doesn't it'll be
materialised through its own round of ISel.
Stop using `Value::getName()` to get the string behind an `MDString`.
Switch to `StringMapEntry<MDString>` so that we can find the string by
its coallocation.
Tim Northover [Fri, 14 Nov 2014 00:34:59 +0000 (00:34 +0000)]
CodeGen: assert an instruction is being inserted with the correct iterator.
When "MBB->Insert(It, ...)" is called, we want It to be pointing inside the
correct basic block. No actual failures at the moment, but it's caused problems
before.
Hide the fact that `MDString`'s string is stored in `Value::Name` --
that's going to change soon. Update the only in-tree client that was
using it instead of `Value::getString()`.
Reed Kotler [Thu, 13 Nov 2014 23:37:45 +0000 (23:37 +0000)]
First stage of call lowering for Mips fast-isel
Summary:
This has most of what is needed for mips fast-isel call lowering for O32.
What is missing I will add on the next patch because this patch is already too large.
It should not be doing anything wrong but it will punt on some cases that it is basically
capable of doing.
The mechanism is there for parameters to be passed on the stack but I have not enabled it because it serves as a way for now to prevent some of the strange cases of O32 register passing that I have not fully checked yet and have some issues.
The Mips O32 abi rules are very complicated as far how data is passed in floating and integer registers.
However there is a way to think about this all very simply and this implementation reflects that.
Basically, the ABI rules are written as if everything is passed on the stack and aligned as such.
Once that is conceptually done, it is nearly trivial to reassign those locations to registers and
then all the complexity disappears.
So I have told tablegen that all the data is passed on the stack and during the lowering I fix
this by assigning to registers as per the ABI doc.
This has been my approach and you can line up what I did with the ABI document and see 1 to 1 what
is going on.
Reid Kleckner [Thu, 13 Nov 2014 23:32:52 +0000 (23:32 +0000)]
Fix symbol resolution of floating point libc builtins in MCJIT
Fix for LLI failure on Windows\X86: http://llvm.org/PR5053
LLI.exe crashes on Windows\X86 when single precession floating point
intrinsics like the following are used: acos, asin, atan, atan2, ceil,
copysign, cos, cosh, exp, floor, fmin, fmax, fmod, log, pow, sin, sinh,
sqrt, tan, tanh
The above intrinsics are defined as inline-expansions in math.h, and are
not exported by msvcr120.dll (Win32 API GetProcAddress returns null).
For an FREM instruction, the JIT compiler generates a call to a stub for
the fmodf() intrinsic, and adds a relocation to fixup at load time. The
loader searches the libraries for the function, but fails because the
symbol is not exported. So, the call target remains NULL and the
execution crashes.
Since the math functions are loaded at JIT/runtime, the JIT can patch
CALL instruction directly instead of the searching the libraries'
exported symbols. However, this fix caused build failures due to
unresolved symbols like _fmodf at link time.
Therefore, the current fix defines helper functions in the Runtime
link/load library to perform the above operations. The address of these
helper functions are used to patch up the CALL instruction at load time.
Reid Kleckner [Thu, 13 Nov 2014 22:55:19 +0000 (22:55 +0000)]
Use nullptr instead of NULL for variadic sentinels
Windows defines NULL to 0, which when used as an argument to a variadic
function, is not a null pointer constant. As a result, Clang's
-Wsentinel fires on this code. Using '0' would be wrong on most 64-bit
platforms, but both MSVC and Clang make it work on Windows. Sidestep the
issue with nullptr.
Chandler Carruth [Thu, 13 Nov 2014 22:49:44 +0000 (22:49 +0000)]
[x86] Add some tests for specific patterns of lane-flips combined with
in-lane shuffles that aren't always handled well by the current vector
shuffle lowering.
No functionality change yet, that will follow in a subsequent commit.
Juergen Ributzka [Thu, 13 Nov 2014 20:50:44 +0000 (20:50 +0000)]
[FastISel][AArch64] Don't bail during simple GEP instruction selection.
The generic FastISel code would bail, because it can't emit a sign-extend for
AArch64. This copies the code over and uses AArch64 specific emit functions.
This is not ideal and 'computeAddress' should handles this, so it can fold the
address computation into the memory operation.
I plan to clean up 'computeAddress' anyways, so I will add that in a future
commit.
Matt Arsenault [Thu, 13 Nov 2014 20:23:36 +0000 (20:23 +0000)]
R600/SI: Fix definition for s_cselect_b32
These were directly using the old base instruction
class, and specifying the wrong register classes
for operands. The operands can be the other special
inputs besides SGPRs. The op name was also being
directly used for the asm string, so this was printed
without any operands.
Matt Arsenault [Thu, 13 Nov 2014 19:26:47 +0000 (19:26 +0000)]
R600/SI: Allow commuting some 3 op instructions
e.g. v_mad_f32 a, b, c -> v_mad_f32 b, a, c
This simplifies matching v_madmk_f32.
This looks somewhat surprising, but it appears to be
OK to do this. We can commute src0 and src1 in all
of these instructions, and that's all that appears
to matter.
Tim Northover [Thu, 13 Nov 2014 17:58:53 +0000 (17:58 +0000)]
ARM: allow constpool entry to be moved to the user's block in all cases.
Normally entries can only move to a lower address, but when that wasn't viable,
the user's block was considered anyway. Unfortunately, it went via
createNewWater which wasn't designed to handle the case where there's already
an island after the block.
Unfortunately, the test we have is slow and fragile, and I couldn't reduce it
to anything sane even with the @llvm.arm.space intrinsic. The test change here
is recreating the previous one after the change.
Tim Northover [Thu, 13 Nov 2014 17:58:51 +0000 (17:58 +0000)]
ARM: avoid duplicating branches during constant islands.
We were using a naive heuristic to determine whether a basic block already had
an unconditional branch at the end. This mostly corresponded to reality
(assuming branches got optimised) because there's not much point in a branch to
the next block, but could go wrong.
Tim Northover [Thu, 13 Nov 2014 17:58:48 +0000 (17:58 +0000)]
ARM: add @llvm.arm.space intrinsic for testing ConstantIslands.
Creating tests for the ConstantIslands pass is very difficult, since it depends
on precise layout details. Having the ability to precisely inject a number of
bytes into the stream helps greatly.
Aaron Ballman [Thu, 13 Nov 2014 13:55:13 +0000 (13:55 +0000)]
Fixing -Wtype-limits warnings with the asserts (the expression would always evaluate to true). Also fixing a -Wcast-qual warning, where the cast expression isn't required.
Hal Finkel [Thu, 13 Nov 2014 09:16:54 +0000 (09:16 +0000)]
Revert r219432 - "Revert "[BasicAA] Revert "Revert r218714 - Make better use of zext and sign information."""
Let's try this again...
This reverts r219432, plus a bug fix.
Description of the bug in r219432 (by Nick):
The bug was using AllPositive to break out of the loop; if the loop break
condition i != e is changed to i != e && AllPositive then the
test_modulo_analysis_with_global test I've added will fail as the Modulo will
be calculated incorrectly (as the last loop iteration is skipped, so Modulo
isn't updated with its Scale).
Nick also adds this comment:
ComputeSignBit is safe to use in loops as it takes into account phi nodes, and
the == EK_ZeroEx check is safe in loops as, no matter how the variable changes
between iterations, zero-extensions will always guarantee a zero sign bit. The
isValueEqualInPotentialCycles check is therefore definitely not needed as all
the variable analysis holds no matter how the variables change between loop
iterations.
And this patch also adds another enhancement to GetLinearExpression - basically
to convert ConstantInts to Offsets (see test_const_eval and
test_const_eval_scaled for the situations this improves).
Original commit message:
This reverts r218944, which reverted r218714, plus a bug fix.
Description of the bug in r218714 (by Nick):
The original patch forgot to check if the Scale in VariableGEPIndex flipped the
sign of the variable. The BasicAA pass iterates over the instructions in the
order they appear in the function, and so BasicAliasAnalysis::aliasGEP is
called with the variable it first comes across as parameter GEP1. Adding a
%reorder label puts the definition of %a after %b so aliasGEP is called with %b
as the first parameter and %a as the second. aliasGEP later calculates that %a
== %b + 1 - %idxprom where %idxprom >= 0 (if %a was passed as the first
parameter it would calculate %b == %a - 1 + %idxprom where %idxprom >= 0) -
ignoring that %idxprom is scaled by -1 here lead the patch to incorrectly
conclude that %a > %b.
Revised patch by Nick White, thanks! Thanks to Lang to isolating the bug.
Slightly modified by me to add an early exit from the loop and avoid
unnecessary, but expensive, function calls.
Original commit message:
Two related things:
1. Fixes a bug when calculating the offset in GetLinearExpression. The code
previously used zext to extend the offset, so negative offsets were converted
to large positive ones.
2. Enhance aliasGEP to deduce that, if the difference between two GEP
allocations is positive and all the variables that govern the offset are also
positive (i.e. the offset is strictly after the higher base pointer), then
locations that fit in the gap between the two base pointers are NoAlias.
David Majnemer [Thu, 13 Nov 2014 08:46:37 +0000 (08:46 +0000)]
Object, COFF: Increase code reuse
Split getObject's smarts into checkOffset, use this to replace the
handwritten check in getSectionContents. Similarly, replace checks in
section_rel_begin/section_rel_end with getNumberOfRelocations.
lib/Object is supposed to be robust to malformed object files. Don't
assert if we don't have a symbol table. I'll try to come up with a test
case later.
David Majnemer [Thu, 13 Nov 2014 07:42:07 +0000 (07:42 +0000)]
Object, COFF: Fix some theoretical bugs
getObject didn't consider the case where a pointer came before the start
of the object file. No test is included, trying to come up with
something reasonable.
Chandler Carruth [Thu, 13 Nov 2014 04:06:10 +0000 (04:06 +0000)]
[x86] Teach the vector shuffle lowering to make a more nuanced decision
between splitting a vector into 128-bit lanes and recombining them vs.
decomposing things into single-input shuffles and a final blend.
This handles a large number of cases in AVX1 where the cross-lane
shuffles would be much more expensive to represent even though we end up
with a fast blend at the root. Instead, we can do a better job of
shuffling in a single lane and then inserting it into the other lanes.
This fixes the remaining bits of Halide's regression captured in PR21281
for AVX1. However, the bug persists in AVX2 because I've made this
change reasonably conservative. The cases where it makes sense in AVX2
to split into 128-bit lanes are much more rare because we can often do
full permutations across all elements of the 256-bit vector. However,
the particular test case in PR21281 is an example of one of the rare
cases where it is *always* better to work in a single 128-bit lane. I'm
going to try to teach the logic to detect and form the good code even in
AVX2 next, but it will need to use a separate heuristic.
Finally, there is one pesky regression here where we previously would
craftily use vpermilps in AVX1 to shuffle both high and low halves at
the same time. We no longer pull that off, and not for any really good
reason. Ultimately, I think this is just another missing nuance to the
selection heuristic that I'll try to add in afterward, but this change
already seems strictly worth doing considering the magnitude of the
improvements in common matrix math shuffle patterns.
As always, please let me know if this causes a surprising regression for
you.
Chandler Carruth [Thu, 13 Nov 2014 02:42:08 +0000 (02:42 +0000)]
[x86] Don't form overly fragmented blends when splitting and
re-combining shuffles because nothing was available in the wider vector
type.
The key observation (which I've put in the comments for future
maintainers) is that at this point, no further combining is really
possible. And so even though these shuffles trivially could be combined,
we need to actually do that as we produce them when producing them this
late in the lowering.
This fixes another (huge) part of the Halide vector shuffle regressions.
As it happens, this was already well covered by the tests, but I hadn't
noticed how bad some of these got. The specific patterns that turn
directly into unpckl/h patterns were occurring *many* times in common
vector processing code.
There are still more problems here sadly, but trying to incrementally
tease them apart and it looks like this is the core of the problem in
the splitting logic.
There is some chance of regression here, you can see it in the test
changes. Specifically, where we stop forming pshufb in some cases, it is
possible that pshufb was in fact faster. Intel "says" that pshufb is
slower than the instruction sequences replacing it.
Quentin Colombet [Thu, 13 Nov 2014 01:44:51 +0000 (01:44 +0000)]
[CodeGenPrepare] Handle zero extensions in the TypePromotionHelper.
Prior to this patch the TypePromotionHelper was promoting only sign extensions.
Supporting zero extensions changes:
- How constants are extended.
- How sign extensions, zero extensions, and truncate are composed together.
- How the type of the extended operation is recorded. Now we need to know the
kind of the extension as well as its type.
Each change is fairly small, unlike the diff.
Most of the diff are comments/variable renaming to say "extension" instead of
"sign extension".
The performance improvements on the test suite are within the noise.
Paul Robinson [Thu, 13 Nov 2014 00:12:14 +0000 (00:12 +0000)]
Improve long path name support on Windows.
Windows normally limits the length of an absolute path name to 260
characters; directories can have lower limits. These limits increase
to about 32K if you use absolute paths with the special '\\?\'
prefix. Teach Support\Windows\Path.inc to use that prefix as needed.
TODO: Other parts of Support could also learn to use this prefix.
Sanjoy Das [Thu, 13 Nov 2014 00:00:58 +0000 (00:00 +0000)]
Teach ScalarEvolution to sharpen range information.
If x is known to have the range [a, b), in a loop predicated by (icmp
ne x, a) its range can be sharpened to [a + 1, b). Get
ScalarEvolution and hence IndVars to exploit this fact.
This change triggers an optimization to widen-loop-comp.ll, so it had
to be edited to get it to pass.
This change was originally landed in r219834 but had a bug and broke
ASan. It was reverted in r219878, and is now being re-landed after
fixing the original bug.
Frederic Riss [Wed, 12 Nov 2014 23:48:14 +0000 (23:48 +0000)]
Fix emission of Dwarf accelerator table when there are multiple CUs.
The DIE offset in the accel tables is an offset relative to the start
of the debug_info section, but we were encoding the offset to the
start of the containing CU.
Frederic Riss [Wed, 12 Nov 2014 23:48:04 +0000 (23:48 +0000)]
Allow DWARFFormValue::extractValue to be called with a null CU.
Currently FormValues are only used for attributes of DIEs and thus
uers always have a CU lying around when calling into the FormValue
API.
Accelerator tables encode their information using the same Forms
as the attributes, thus it is natural to use DWARFFormValue to
extract/dump them. There is no CU in that case though. Allow the
API to be called with a null CU arguemnt by making the RelocMap
lookup conditional on the CU pointer validity. And document this
new behvior in the header. (Test coverage for this use of the API
comes in the DwarfAccelTable support patch)
Ahmed Bougacha [Wed, 12 Nov 2014 23:05:03 +0000 (23:05 +0000)]
[CodeGenPrepare] Replace other uses of EVT::getEVT with TL::getValueType.
r221820 fixed a problem (PR21548) where an iPTR was used in TLI legality checks,
which isn't valid and resulted in a failed assertion.
The solution was to lower pointer types into the correct target's VT, by
using TL::getValueType instead of EVT::getEVT.
This commit changes 3 other uses of EVT::getEVT, but without any tests:
- One of these non-lowered EVTs is passed to allowsMisalignedMemoryAccesses,
which goes into target's TL implementation and doesn't cause any problem (yet.)
- Two others are passed to TLI.isOperationLegalOrCustom:
- one only looks at extensions, so doesn't concern pointers.
- one only looks at binary operators, so also isn't a problem.
The latter might some day be exposed to pointers and cause the same assert as
the original PR, because there's a comment hinting at also supporting cast ops.
For consistency, update all of them and be done with it.
Sanjay Patel [Wed, 12 Nov 2014 21:39:01 +0000 (21:39 +0000)]
Expose the number of Newton-Raphson iterations applied to the hardware's reciprocal estimate as a parameter (x86).
This is a follow-on to r221706 and r221731 and discussed in more detail in PR21385.
This patch also loosens the testcase checking for btver2. We know that the "1.0" will be loaded, but
we can't tell exactly when, so replace the CHECK-NEXT specifiers with plain CHECKs. The CHECK-NEXT
sequence relied on a quirk of post-RA-scheduling that may change independently of anything in these tests.
Ahmed Bougacha [Wed, 12 Nov 2014 21:23:34 +0000 (21:23 +0000)]
Add fortified (__*_chk) library functions to TLI (NFC)
One of them (__memcpy_chk) was already there, the others were checked
by comparing function names.
Note that the fortified libfuncs are now part of TLI, but are always
available, because they aren't generated, only optimized into the
non-checking versions.
Sanjay Patel [Wed, 12 Nov 2014 18:25:47 +0000 (18:25 +0000)]
CGSCC should not treat intrinsic calls like function calls (PR21403)
Make the handling of calls to intrinsics in CGSCC consistent:
they are not treated like regular function calls because they
are never lowered to function calls.
Without this patch, we can get dangling pointer asserts from
the subsequent loop that processes callsites because it already
ignores intrinsics.
See http://llvm.org/bugs/show_bug.cgi?id=21403 for more details / discussion.
Jingyue Wu [Wed, 12 Nov 2014 18:09:15 +0000 (18:09 +0000)]
Disable indvar widening if arithmetics on the wider type are more expensive
Summary:
Reapply r221772. The old patch breaks the bot because the @indvar_32_bit test
was run whether NVPTX was enabled or not.
IndVarSimplify should not widen an indvar if arithmetics on the wider
indvar are more expensive than those on the narrower indvar. For
instance, although NVPTX64 treats i64 as a legal type, an ADD on i64 is
twice as expensive as that on i32, because the hardware needs to
simulate a 64-bit integer using two 32-bit integers.
Split from D6188, and based on D6195 which adds NVPTXTargetTransformInfo.
Fixes PR21148.
Test Plan:
Added @indvar_32_bit that verifies we do not widen an indvar if the arithmetics
on the wider type are more expensive. This test is run only when NVPTX is
enabled.
Justin Hibbits [Wed, 12 Nov 2014 15:16:30 +0000 (15:16 +0000)]
Add support for small-model PIC for PowerPC.
Summary:
Large-model was added first. With the addition of support for multiple PIC
models in LLVM, now add small-model PIC for 32-bit PowerPC, SysV4 ABI. This
generates more optimal code, for shared libraries with less than about 16380
data objects.