Matthias Braun [Mon, 16 Feb 2015 19:34:27 +0000 (19:34 +0000)]
RegisterCoalescer: Improve previous fix for wrong def after.
The previous fix in r225503 was needlessly complicated. The problem goes
away as well if the arguments to MergeValueNumberInto are supplied in the
correct order.
This was previously missed because the existing code already had the
wrong order but an additional later Merge was hiding the bug for the
main liverange VNI.
Bitcode: Fix major regression: large files w/ debug info
The metadata/value split introduced a major regression reading large
bitcode files that contain debug info (or other cyclic (non-self
reference) metadata graphs). For the first time in a while, I dropped
from libLTO.dylib down to `llvm-lto` with a non-trivial bitcode file
(~350MB), and I hit this when reading the result of ld64's `-save-temps`
in `llvm-lto`.
Here's pseudo-code for what was going on:
read-main-metadata-block:
for each md:
if has-fwd-ref: // Only true for cyclic graphs.
any-fwd-refs <- true
if any-fwd-refs:
foreach md:
resolve-cycles(md) // Handle cycles.
foreach function:
read-function-metadata-block: // Such as !alias, !loop
if any-fwd-refs:
foreach md: // (all metadata, not just this block)
resolve-cycles(md) // A no-op, but the loop is expensive!!
This commit resets the `AnyFwdRefs` flag to `false`. This on its own
was enough to change my Release+Asserts `llvm-lto` time for reading this
bitcode from over 20 minutes (I gave up on it) to 20 seconds. I've gone
further by tracking the min/max metadata forward-references in a
metadata block. This protects against a schema that has lots of
functions that each reference their own metadata cycle.
Unfortunately, this regression is in the 3.6 branch as well.
Andrew Trick [Mon, 16 Feb 2015 18:10:47 +0000 (18:10 +0000)]
AArch64: Safely handle the incoming sret call argument.
This adds a safe interface to the machine independent InputArg struct
for accessing the index of the original (IR-level) argument. When a
non-native return type is lowered, we generate the hidden
machine-level sret argument on-the-fly. Before this fix, we were
representing this argument as OrigArgIndex == 0, which is an outright
lie. In particular this crashed in the AArch64 backend where we
actually try to access the type of the original argument.
Now we use a sentinel value for machine arguments that have no
original argument index. AArch64, ARM, Mips, and PPC now check for this
case before accessing the original argument.
Fixes <rdar://19792160> Null pointer assertion in AArch64TargetLowering
James Molloy [Mon, 16 Feb 2015 17:02:00 +0000 (17:02 +0000)]
[LoopReroll] Relax some assumptions a little.
We won't find a root with index zero in any loop that we are able to reroll.
However, we may find one in a non-rerollable loop, so bail gracefully instead
of failing hard.
James Molloy [Mon, 16 Feb 2015 17:01:52 +0000 (17:01 +0000)]
[LoopReroll] Don't crash on dead code
If a PHI has no users, don't crash; bail gracefully. This shouldn't
happen often, but we can make no guarantees that previous passes didn't leave
dead code around.
Chandler Carruth [Mon, 16 Feb 2015 12:28:18 +0000 (12:28 +0000)]
[x86] Add a generic unpack-targeted lowering technique. This can be used
to generically lower blends and is particularly nice because it is
available frome SSE2 onward. This removes a lot of the remaining domain
crossing blends in SSE2 code.
I'm hoping to replace some of the "interleaved" lowering hacks with
something closer to this which should be more principled. First, this
needs to learn how to detect and use other interleavings besides that of
the natural type provided. That will be a follow-up patch though.
Fix quoting of #pragma comment for MS compat, LLVM part.
For #pragma comment(linker, ...) MSVC expects the comment string to be quoted, but for #pragma comment(lib, ...) the compiler itself quotes the library name.
Since this distinction disappears by the time the directive reaches the backend, move quoting for the "lib" version to the frontend.
Chandler Carruth [Mon, 16 Feb 2015 10:58:23 +0000 (10:58 +0000)]
[x86] Add initial basic support for forming blends of v16i8 vectors.
This blend instruction is ... really lame. The register usage is insane.
As a consequence this is probably only *barely* better than 2 pshufbs
followed by a por, and that mostly because it only has to read from
a single memory location.
However, this doesn't fix as much as I kind of expected, so more to go.
Pretty sure that the ordering and delegation of v16i8 is just really,
really bad.
Chandler Carruth [Mon, 16 Feb 2015 08:22:35 +0000 (08:22 +0000)]
Switch our index sequence away from template aliases and just use
classes. We can't use template aliases because on MSVC they don't appear
to work correctly in the common usage such as Format.h.
Many thanks to Zach for doing all the testing and debugging here. I just
slotted the fix into the code.
David Majnemer [Mon, 16 Feb 2015 04:02:09 +0000 (04:02 +0000)]
IR: Properly return nullptr when getAggregateElement is out-of-bounds
We didn't properly handle the out-of-bounds case for
ConstantAggregateZero and UndefValue. This would manifest as a crash
when the constant folder was asked to fold a load of a constant global
whose struct type has no operands.
Chandler Carruth [Mon, 16 Feb 2015 01:52:02 +0000 (01:52 +0000)]
[x86] Teach the 128-bit vector shuffle lowering routines to take
advantage of the existence of a reasonable blend instruction.
The 256-bit vector shuffle lowering has leveraged the general technique
of decomposed shuffles and blends for quite some time, but this never
made it back into the 128-bit code, and there are a large number of
patterns where this is substantially better. For example, this removes
almost all domain crossing in vector shuffles that involve some blend
and some permutation with SSE4.1 and later. See the massive reduction
in 'shufps' for integer test cases in this commit.
This isn't perfect yet for a few reasons:
1) The v8i16 shuffle lowering continues to plague me. We don't always
form an unpack-based blend when that would be better. But the wins
pretty drastically outstrip the losses here.
2) The v16i8 shuffle lowering is just a disaster here. I never went and
implemented blend support here for some terrible reason. I'll do
that next probably. I've not updated it for now.
More variations on this technique are coming as well -- we don't
shuffle-into-unpack or shuffle-into-palignr, both of which would also be
profitable.
Note that some test cases grow significantly in the number of
instructions, but I expect to actually be faster. We use
pshufd+pshufd+blendw instead of a single shufps, but the pshufd's are
very likely to pipeline well (two ports on most modern intel chips) and
the blend is a *very* fast instruction. The domain switch penalty will
essentially always be more than a blend instruction, which is the only
increase in tree height.
Benjamin Kramer [Sun, 15 Feb 2015 22:15:41 +0000 (22:15 +0000)]
Format: Modernize using variadic templates.
Introduces a subset of C++14 integer sequences in STLExtras. This is
just enough to support unpacking a std::tuple into the arguments of
snprintf, we can add more of it when it's actually needed.
Also removes an ancient macro hack that leaks a macro into the global
namespace. Clean up users that made use of the convenient hack.
Philip Reames [Sun, 15 Feb 2015 19:07:31 +0000 (19:07 +0000)]
Revert 229175
This change is a logical suspect in 22587 and 22590. Given it's of minimal importanance and I can't get clang to build on my home machine, I'm reverting so that I can deal with this next week.
Chandler Carruth [Sun, 15 Feb 2015 12:42:15 +0000 (12:42 +0000)]
[x86] Teach the decomposed shuffle/blend lowering to use an early blend
when that will allow it to lower with a single permute instead of
multiple permutes.
It tries to detect when it will only have to do a single permute in
either case to maximize folding of loads and such.
This cuts a *lot* of the avx2 shuffle permute counts in half. =]
Chandler Carruth [Sun, 15 Feb 2015 12:18:12 +0000 (12:18 +0000)]
[SDAG] Teach the SelectionDAG to canonicalize vector shuffles of splats
directly into blends of the splats.
These patterns show up even very late in the vector shuffle lowering
where we don't have any chance for DAG combining to kick in, and
blending is a tremendously simpler operation to model. By coercing the
shuffle into a blend we can much more easily match and lower shuffles of
splats.
Immediately with this change there are significantly more blends being
matched in the x86 vector shuffle lowering.
Chandler Carruth [Sun, 15 Feb 2015 12:07:55 +0000 (12:07 +0000)]
[x86] Teach the shuffle mask equivalence test to look through build
vectors and detect equivalent inputs.
This lets the code match unpck-style instructions when only one of the
inputs are lined up but the other input is a splat and so which lanes we
pull from doesn't matter. Today, this doesn't really happen, but just by
accident. I have a patch that normalizes how we shuffle splats, and with
that patch this will be necessary for a lot of the mask equivalence
tests to work.
I don't really know how to write a test case for this specific change
until the other change lands though.
Chandler Carruth [Sun, 15 Feb 2015 12:01:14 +0000 (12:01 +0000)]
[x86] Tweak the ordering of unpack matching vs. element insertion, and
don't try to do element insertion for non-zero-index floating point
vectors.
We don't have any useful patterns or lowering for element insertion into
high elements of a floating point vector, and the generic shuffle
lowering will end up being better -- namely it will fall back to unpck.
But we should try to handle other forms of element insertion before
matching unpck patterns.
While this doesn't matter much right now, I'm working on a patch that
makes unpck matching much more powerful, and that patch will break
without this re-ordering.
Chandler Carruth [Sun, 15 Feb 2015 10:34:52 +0000 (10:34 +0000)]
[x86] Stop shuffling zero vectors. =]
I was somewhat surprised this pattern really came up, but it does. It
seems better to just directly handle it than try to special case every
place where we end up forming a shuffle that devolves to a shuffle of
a zero vector.
Chandler Carruth [Sun, 15 Feb 2015 10:12:02 +0000 (10:12 +0000)]
[x86] When splitting 256-bit vectors into 128-bit vectors, don't extract
subvectors from buildvectors. That doesn't really make any sense and it
breaks all of the down-stream matching of buildvectors to cleverly lower
shuffles.
With this, we now get the shift-based lowering of 256-bit vector
shuffles with AVX1 when we split them into 128-bit vectors. We also do
much better on the zero-extension patterns, although there remains quite
a bit of room for improvement here.
Chandler Carruth [Sun, 15 Feb 2015 09:33:36 +0000 (09:33 +0000)]
[x86] Make computing the zeroable elements slightly more powerful, at
least in theory.
I don't actually have a test case that benefits from this, but
theoretically, it could come up, and I don't want to try to think about
whether this is the culprit or something else is, so I'd rather just
make this code powerful. =/ Makes me sad that I can't really test it
though.
gold-plugin: fix test to allow default visibility on local symbols
GNU ld sets default, not hidden, visibility on local symbols.
Having default or hidden visibility on local symbols makes no difference in run-time behavior.
Chandler Carruth [Sun, 15 Feb 2015 08:26:30 +0000 (08:26 +0000)]
[x86] Add a slight variation on some of the other generic shuffle
lowerings -- one which decomposes into an initial blend followed by
a permute.
Particularly on newer chips, blends are handled independently of
shuffles and so this is much less bottlenecked on the single port that
floating point shuffles are executed with on Intel.
I'll be adding this lowering to a bunch of other code paths in
subsequent commits to handle still more places where we can effectively
leverage blends when they're available in the ISA.
Chandler Carruth [Sun, 15 Feb 2015 07:05:50 +0000 (07:05 +0000)]
[x86] Add a test case for PR22390 which was a dup of PR22377 and fixed
by r229285. This is a nice different test case though, so I'd like to
have the extra testing of these kinds of patterns.
Chandler Carruth [Sun, 15 Feb 2015 07:01:10 +0000 (07:01 +0000)]
[x86] Fix PR22377, a regression with the new vector shuffle legality
test.
This was just a matter of the DAG combine for vector shuffles being too
aggressive. This is a bit of a grey area, but I think generally if we
can re-use intermediate shuffles, we should. Certainly, given the test
cases I have available, this seems like the right call.
Chandler Carruth [Sun, 15 Feb 2015 06:37:21 +0000 (06:37 +0000)]
[x86] Switch a collection of tests explicitly to the new vector shuffle
legality test (essentially, everything is legal).
I'm planning to make this the default shortly, but I'd like to fix
a collection of the bugs it exposes first, and this will let me easily
test them. It also showcases both the improvements and a few of the
regressions triggered by the change. The biggest improvements by far are
the significantly reduced shuffling and domain crossing in the combining
test case. The biggest regressions are missing some clever blending
patterns.
Craig Topper [Sun, 15 Feb 2015 04:54:55 +0000 (04:54 +0000)]
[X86] Add assembler predicates for the rest of the AVX512 feature flags. This makes the assembly matching consistent across all AVX512 instructions. Without this we were allowing some AVX512 instructions to be parsed always, but not the foundation instructions.
Chandler Carruth [Sun, 15 Feb 2015 00:08:01 +0000 (00:08 +0000)]
[x86] Teach my test updating script about another quirk of the printed
asm and port the mmx vector shuffle test to it.
Not thrilled with how it handles the stack manipulation logic, but I'm
much less bothered by that than I am by updating the test manually. =]
If anyone wants to teach the test checks management script about stack
adjustment patterns, that'd be cool too.
Simon Pilgrim [Sat, 14 Feb 2015 22:40:46 +0000 (22:40 +0000)]
[X86][XOP] Enable commutation for XOP instructions
Patch to allow XOP instructions (integer comparison and integer multiply-add) to be commuted. The comparison instructions sometimes require the compare mode to be flipped but the remaining instructions can use default commutation modes.
This patch also sets the SSE domains of all the XOP instructions.
Craig Topper [Sat, 14 Feb 2015 21:54:03 +0000 (21:54 +0000)]
[X86] Improve parsing support AVX/SSE floating point compare instruction mnemonic aliases. They'll now print with the alias the parser received instead of converting to the explicit immediate form.
InstCombine: propagate deref via new addDereferenceableAttr
The "dereferenceable" attribute cannot be added via .addAttribute(),
since it also expects a size in bytes. AttrBuilder#addAttribute or
AttributeSet#addAttribute is wrapped by classes Function, InvokeInst,
and CallInst. Add corresponding wrappers to
AttrBuilder#addDereferenceableAttr.
Having done this, propagate the dereferenceable attribute via
gc.relocate, adding a test to exercise it. Note that -datalayout is
required during execution over and above -instcombine, because
InstCombine only optionally requires DataLayoutPass.
Chandler Carruth [Sat, 14 Feb 2015 09:43:57 +0000 (09:43 +0000)]
[gold] Consolidate the gold plugin options and actually search for
a gold binary explicitly. Substitute this binary into the tests rather
than just directly executing the 'ld' binary.
This should allow folks to inject a cross compiling gold binary, or in
my case to use a gold binary built and installed somewhere other than
/usr/bin/ld. It should also allow the tests to find 'ld.gold' so that
things work even if gold isn't the default on the system.
I've only stubbed out support in the makefile to preserve the existing
behavior with none of the fancy logic. If someone else wants to add
logic here, they're welcome to do so.
Chandler Carruth [Sat, 14 Feb 2015 09:05:58 +0000 (09:05 +0000)]
Back out two accidental changes that snuck in with r229245. Sorry these
snuck in, they weren't ready for prime time and had *nothing* to do
with that commit.