Daniel Sanders [Thu, 14 Aug 2014 09:18:14 +0000 (09:18 +0000)]
[mips] Rename [gs]etCanHaveModuleDir to more natural names
Summary:
getCanHaveModuleDir() is renamed to isModuleDirectiveAllowed(), and
setCanHaveModuleDir() is renamed to forbidModuleDirective() since it is only
ever given a false argument.
Chandler Carruth [Thu, 14 Aug 2014 08:18:34 +0000 (08:18 +0000)]
[SDAG] Fix a bug in the DAG combiner where we would fail to return the
input node after manually adding it to the worklist and using CombineTo.
Once we use CombineTo the input node may have been deleted. Despite this
being *completely confusing* and somewhat broken, the only way to
"correctly" return from a DAG combine after potentially deleting the
input node is to return *that exact node*....
But really, this code should just never have used CombineTo. It won't do
what it wants (returning the node as mentioned above just causes the
combine to infloop). The correct way to combine away a casted load to
a load of the correct type is to RAUW the chain directly and then return
the loaded value to replace the actual value node.
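To make that concrete, here is a minimal sketch of the pattern, assuming a hypothetical combine of a bitcasted load (the function and plumbing are illustrative, not the committed DAGCombiner code):
#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Illustrative only: fold (bitcast (load ...)) into a load of the new type
// without CombineTo, by RAUW'ing the chain and returning the new value.
static SDValue combineCastedLoad(SDNode *N, SelectionDAG &DAG) {
  auto *LD = cast<LoadSDNode>(N->getOperand(0));
  EVT VT = N->getValueType(0);
  // Build a load of the desired type directly.
  SDValue NewLoad = DAG.getLoad(VT, SDLoc(N), LD->getChain(),
                                LD->getBasePtr(), LD->getMemOperand());
  // RAUW the old load's chain result...
  DAG.ReplaceAllUsesOfValueWith(SDValue(LD, 1), NewLoad.getValue(1));
  // ...and return the loaded value to replace the actual value node.
  return NewLoad;
}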
I managed to find this with the vector shuffle fuzzer even though it
clearly has nothing at all to do with vector shuffles; rather, those
happen to trigger a load from a constant pool that hits this combine *just
right*. I've included the test as it is small and a nice stress test
that the infrastructure isn't asserting.
Chandler Carruth [Thu, 14 Aug 2014 01:07:37 +0000 (01:07 +0000)]
[SDAG] Fix a case where we would iteratively legalize a node during
combining by replacing it with something else but not re-process the
node afterward to remove it.
In a truly remarkable stroke of bad luck, this would (in the test case
attached) end up getting some other node combined into it without ever
getting re-processed. By adding it back on to the worklist, in addition
to deleting the dead nodes more quickly we also ensure that if it
*stops* being dead for any reason it makes it back through the
legalizer. Without this, the test case will end up failing during
instruction selection due to an and node with a type we don't have an
instruction pattern for.
It took many million runs of the shuffle fuzz tester to find this.
Akira Hatanaka [Wed, 13 Aug 2014 23:23:58 +0000 (23:23 +0000)]
[AArch64, fast-isel] Fall back to SelectionDAG to select tail calls.
Certain functions such as objc_autoreleaseReturnValue have to be called as
tail-calls even at -O0. Since normal fast-isel doesn't emit calls as tail calls,
we have to fall back to SelectionDAG to select calls that are marked as tail.
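A sketch of the shape of that fallback (a hypothetical function, not the actual AArch64FastISel code); FastISel hooks signal "let SelectionDAG handle it" by returning false:
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Illustrative only: bail out of fast-isel call selection for tail calls.
static bool selectCallSketch(const CallInst *CI) {
  // Calls such as objc_autoreleaseReturnValue must stay tail calls even
  // at -O0, and FastISel cannot emit real tail calls, so punt.
  if (CI->isTailCall())
    return false;
  // ... normal fast-isel call lowering would go here ...
  return true;
}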
Juergen Ributzka [Wed, 13 Aug 2014 22:53:29 +0000 (22:53 +0000)]
[FastISel][AArch64] Add support for more addressing modes.
FastISel didn't take much advantage of the different addressing modes available
to it on AArch64. This commit allows the ComputeAddress method to recognize more
addressing modes, allowing shifts and sign-/zero-extensions to be folded into
the memory operation itself, as illustrated below.
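As a hedged illustration (not FastISel code): for a function like the one below, the sign-extend and shift of the index can now fold into the load's addressing mode, giving a single access along the lines of "ldr x0, [x0, w1, sxtw #3]" instead of a separate extend/shift/add sequence.
// The scaled, sign-extended index folds into the AArch64 addressing mode.
long load_elem(long *p, int i) { return p[i]; }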
Juergen Ributzka [Wed, 13 Aug 2014 22:25:35 +0000 (22:25 +0000)]
[FastISel][X86] Add large code model support for materializing floating-point constants.
In the large code model on X86, floating-point constants are placed in the
constant pool and materialized by loading from it. Since the constant pool
could be far away, a PC-relative load might not work. Therefore we first
materialize the address of the constant pool with a movabsq and then load
the floating-point value from there.
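A sketch of the emission order, using the usual FastISel/X86InstrBuilder idioms (CPI, Opc, and ResultReg are assumed context; this is illustrative, not the committed code):
// Illustrative only: materialize the constant pool address, then load.
unsigned AddrReg = createResultReg(&X86::GR64RegClass);
BuildMI(*FuncInfo.MBB, FuncInfo.InsertPt, DbgLoc,
        TII.get(X86::MOV64ri), AddrReg)
    .addConstantPoolIndex(CPI);                 // movabsq $CPI, %AddrReg
addDirectMem(BuildMI(*FuncInfo.MBB, FuncInfo.InsertPt, DbgLoc,
                     TII.get(Opc), ResultReg),  // e.g. Opc = X86::MOVSDrm
             AddrReg);                          // movsd (%AddrReg), %ResultReg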
Juergen Ributzka [Wed, 13 Aug 2014 22:18:11 +0000 (22:18 +0000)]
[FastISel][X86] Emit more efficient instructions for integer constant materialization.
This mostly affects the i64 value type, which always resulted in a 15-byte
movabsq instruction to materialize any constant. The custom code checks the
value of the immediate and tries to use a different, smaller mov
instruction when possible.
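A sketch of the value checks (a hypothetical helper; the opcode names are the standard X86 ones):
#include "llvm/Support/MathExtras.h"

// Illustrative only: pick the cheapest mov for a 64-bit immediate.
static unsigned chooseMov64Opcode(int64_t Imm) {
  if (isUInt<32>(Imm))
    return X86::MOV32ri;   // writing the 32-bit reg zero-extends to 64 bits
  if (isInt<32>(Imm))
    return X86::MOV64ri32; // sign-extended 32-bit immediate
  return X86::MOV64ri;     // full-width movabsq as the last resort
}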
Juergen Ributzka [Wed, 13 Aug 2014 22:13:14 +0000 (22:13 +0000)]
[FastISel][AArch64] Make use of the zero register when possible.
This change now materializes the value "0" from the zero register.
The zero register can be folded by several instructions, so no
materialization is needed at all.
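A sketch of the idea (a hypothetical helper, not the exact AArch64FastISel code):
// Illustrative only: "materialize" 0 as the architectural zero register.
static unsigned materializeZeroSketch(MVT VT) {
  // WZR/XZR always read as zero and can be used directly as an operand by
  // many instructions, so no mov-immediate needs to be emitted.
  return (VT == MVT::i64) ? AArch64::XZR : AArch64::WZR;
}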
Juergen Ributzka [Wed, 13 Aug 2014 22:08:02 +0000 (22:08 +0000)]
[FastISel] Let the target decide first if it wants to materialize a constant.
This changes the order in which FastISel tries to materialize a constant.
Originally it would try to use a simple target-independent approach, which
can lead to the generation of inefficient code.
On X86 this would result in the use of movabsq to materialize any 64-bit
integer constant - even for simple and small values such as 0 and 1. Some
very funny floating-point materialization could be observed too.
On AArch64 it would materialize the constant 0 in a register even though
the architecture has an actual "zero" register.
On ARM it would generate unnecessary mov instructions or fail to use mvn.
This change simply reverses the order and always asks the target first whether
it would like to materialize the constant. This doesn't fix all the issues
mentioned above, but it enables the targets to implement such
optimizations.
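The reordered flow, sketched (the target hook of this era was TargetMaterializeConstant; the generic fallback name here is hypothetical):
// Illustrative only: ask the target first, fall back to generic code.
unsigned Reg = TargetMaterializeConstant(C);
if (!Reg)
  Reg = materializeConstantGeneric(C, VT); // hypothetical generic path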
Gerolf Hoflehner [Wed, 13 Aug 2014 22:07:36 +0000 (22:07 +0000)]
[MachineCombiner] Removal of dangling DBG_VALUES after combining [20598]
This is a cleaner solution to the problem described in r215431.
When instructions are combined, a dangling DBG_VALUE is removed.
This resolves bug 20598.
Juergen Ributzka [Wed, 13 Aug 2014 21:42:19 +0000 (21:42 +0000)]
[FastISel][ARM] Use MOVT/MOVW if the subtarget requests it.
This change is also in preparation for a future change to make sure that
the constant materialization uses MOVT/MOVW when available and not a load
from the constant pool.
Juergen Ributzka [Wed, 13 Aug 2014 21:39:18 +0000 (21:39 +0000)]
[FastISel][ARM] Fix a bug in the integer materialization code.
getRegClassFor returns the incorrect register class when in Thumb2 mode.
This fix simply selects the register class manually, following the code just a
few lines above, as sketched below.
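A sketch of the manual selection (approximate):
// Illustrative only: pick the class by hand instead of getRegClassFor.
const TargetRegisterClass *RC = isThumb2 ? &ARM::rGPRRegClass
                                         : &ARM::GPRRegClass;
unsigned ResultReg = createResultReg(RC);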
There is no test case for this code, because the code is currently
unreachable. This will be changed in a future commit and existing test
cases will exercise this code.
Gerolf Hoflehner [Wed, 13 Aug 2014 21:15:23 +0000 (21:15 +0000)]
[Cleanup] Utility function to erase instruction and mark DBG_Values
New function to erase a machine instruction and mark DBG_VALUE
for removal. A DBG_VALUE is marked for removal when it references
an operand defined in the instruction.
Use the new function to clean up code in the dead machine instruction
removal pass.
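A sketch of such a helper, with hypothetical names (the real update logic may differ in detail):
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
using namespace llvm;

// Illustrative only: erase MI and invalidate DBG_VALUEs that reference a
// register it defines, by setting their register operand to undef (0).
static void eraseAndMarkDbgValues(MachineInstr *MI, MachineRegisterInfo &MRI) {
  SmallVector<MachineInstr *, 4> DbgUsers;
  for (unsigned i = 0, e = MI->getNumOperands(); i != e; ++i) {
    const MachineOperand &MO = MI->getOperand(i);
    if (!MO.isReg() || !MO.isDef())
      continue;
    for (MachineInstr &Use : MRI.use_instructions(MO.getReg()))
      if (Use.isDebugValue())
        DbgUsers.push_back(&Use);
  }
  for (MachineInstr *DV : DbgUsers)
    DV->getOperand(0).setReg(0); // a DBG_VALUE with reg 0 reads as "undef"
  MI->eraseFromParent();
}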
Quentin Colombet [Wed, 13 Aug 2014 21:00:07 +0000 (21:00 +0000)]
[MachineDominatorTree] Provide a method to inform a MachineDominatorTree that a
critical edge has been split. The MachineDominatorTree will then lazily update
the underlying dominance properties when required.
** Context **
This is a follow-up of r215410.
Each time a critical edge is split this invalidates the dominator tree
information. Thus, subsequent queries of that interface will be slow until the
underlying information is actually recomputed (costly).
** Problem **
Prior to this patch, splitting a critical edge needed to query the dominator
tree to update the dominator information.
Therefore, splitting a bunch of critical edges will likely produce poor
performance as each query to the dominator tree will use the slow query path.
This happens a lot in passes like MachineSink and PHIElimination.
** Proposed Solution **
Splitting a critical edge is a local modification of the CFG. Moreover, as soon
as a critical edge is split, it is no longer critical and thus cannot be a
candidate for critical edge splitting anymore. In other words, the edges into
and out of a basic block inserted on a critical edge cannot themselves be
candidates for critical edge splitting.
Using these observations, we can pile up the splits of critical edges and
apply them all at once before updating the DT information.
The core of this patch moves the update of the MachineDominatorTree information
from MachineBasicBlock::SplitCriticalEdge to a lazy MachineDominatorTree.
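A sketch of the lazy bookkeeping (member and method names approximate; the real dominator update is more careful than this):
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineDominators.h"
using namespace llvm;

// Illustrative only: record splits cheaply, apply them in one batch.
struct CriticalEdge {
  MachineBasicBlock *FromBB;
  MachineBasicBlock *ToBB;
  MachineBasicBlock *NewBB;
};
static SmallVector<CriticalEdge, 32> CriticalEdgesToSplit;

static void recordSplitCriticalEdge(MachineBasicBlock *From,
                                    MachineBasicBlock *To,
                                    MachineBasicBlock *New) {
  CriticalEdgesToSplit.push_back({From, To, New});
}

static void applySplitCriticalEdges(MachineDominatorTree &DT) {
  for (const CriticalEdge &E : CriticalEdgesToSplit)
    // The new block on the edge is dominated by the edge's source; the
    // full update must also fix ToBB's immediate dominator when needed.
    DT.addNewBlock(E.NewBB, E.FromBB);
  CriticalEdgesToSplit.clear();
}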
** Performance **
Thanks to this patch, the motivating example compiles in under 4 minutes
instead of 6+ minutes. No test case added, as the motivating example has
nothing special about it except being huge!
The binaries are strictly identical for all the llvm test-suite + SPECs with and
without this patch for both Os and O3.
Regarding compile time, I observed only noise, although on average I saw a
small improvement.
Matt Arsenault [Wed, 13 Aug 2014 18:14:11 +0000 (18:14 +0000)]
R600: Correctly set the src value offset for scalarized kernel args
This for some reason fixes v1i64 kernel arguments on pre-SI. This
currently breaks some other cases in the kernel-args.ll test for R600,
but I'm not particularly confident in the new output. VTX_READ_* are not
used for some of the scalarized cases, and the code reading from the
constant buffer doesn't make much sense to me.
Benjamin Kramer [Wed, 13 Aug 2014 16:26:38 +0000 (16:26 +0000)]
Canonicalize header guards into a common format.
Add header guards to files that were missing guards. Remove #endif comments,
as they don't seem common in LLVM (we can easily add them back if we decide
they're useful).
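The canonical format, shown on a hypothetical header; the guard name is derived from the file path, and the #endif carries no comment:
// include/llvm/Support/Foo.h (hypothetical)
#ifndef LLVM_SUPPORT_FOO_H
#define LLVM_SUPPORT_FOO_H
// ... declarations ...
#endif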
This patch improves the existing algorithm in DAGCombiner that
attempts to fold shuffles according to rule:
shuffle(shuffle(x, y, M1), undef, M2) -> shuffle(y, undef, M3)
Before this change, there were cases where the DAGCombiner conservatively
avoided folding shuffles even if the resulting mask would have been legal.
That is because the algorithm wrongly assumed that commuting
an illegal shuffle mask would always produce an illegal mask.
With this change, we now correctly compute the commuted shuffle mask before
calling method 'isShuffleMaskLegal' on it.
On X86, this improves for example the codegen for the following function:
define <4 x i32> @test(<4 x i32> %A, <4 x i32> %B) {
%1 = shufflevector <4 x i32> %B, <4 x i32> %A, <4 x i32> <i32 1, i32 2, i32 6, i32 7>
%2 = shufflevector <4 x i32> %1, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 2, i32 3>
ret <4 x i32> %2
}
Before this change the X86 backend (-mcpu=corei7) generated
the following assembly code for function @test:
shufps $-23, %xmm0, %xmm1 # xmm1 = xmm1[1,2],xmm0[2,3]
movhlps %xmm1, %xmm1 # xmm1 = xmm1[1,1]
movaps %xmm1, %xmm0
Now we produce:
movhlps %xmm0, %xmm0 # xmm0 = xmm0[1,1]
Added extra test cases in combine-vec-shuffle-2.ll to verify that we correctly
fold according to the above-mentioned rule.
Toma Tabacu [Wed, 13 Aug 2014 12:48:12 +0000 (12:48 +0000)]
[mips] Refactor calls to setCanHaveModuleDir.
Summary:
Moved some calls to setCanHaveModuleDir to the MipsTargetStreamer base class and removed the resulting empty functions from the MipsTargetELFStreamer class.
Also fixed a missing call to setCanHaveModuleDir in MipsTargetELFStreamer::emitDirectiveSetMicroMips.
Chandler Carruth [Wed, 13 Aug 2014 12:27:18 +0000 (12:27 +0000)]
[shuffle] Stand back! I'm about to (try to) do math!
Especially with blends and large tree heights there was a problem with
the fuzzer where it would end up with enough undef shuffle elements in
enough parts of the tree that in a birthday-attack kind of way we ended
up regularly having large numbers of undef elements in the result. I was
seeing reasonably frequent cases of *all* results being undef, which
prevented us from doing any correctness checking at all. While having
undef lanes is important, this was too much.
So I've tried to apply some math to the probabilities of having an undef
lane and balance them against the tree height. Please be gentle, I'm
really terrible at math. I probably made a bunch of amateur mistakes
here. Fixes, etc. are quite welcome. =D At least in running it some, it
seems to be producing more interesting (for correctness testing)
results.
Chandler Carruth [Wed, 13 Aug 2014 10:49:33 +0000 (10:49 +0000)]
[optnone] Make the optnone attribute effective at suppressing function
attribute and function argument attribute synthesizing and propagating.
As with the other uses of this attribute, the goal remains a best-effort
(no guarantees) attempt to not optimize the function or assume things
about the function when optimizing. This is particularly useful for
compiler testing, bisecting miscompiles, triaging things, etc. I was
hitting specific issues using optnone to isolate test code from a test
driver for my fuzz testing, and this is one step of fixing that.
Daniel Sanders [Wed, 13 Aug 2014 10:07:34 +0000 (10:07 +0000)]
Re-commit: [mips] Implement .ent, .end, .frame, .mask and .fmask.
Patch by Matheus Almeida and Toma Tabacu
The lld test failure on the previous attempt to commit was caused by the
addition of the .pdr section causing the offsets it was checking to change.
This has been fixed by removing the .ent/.end directives from that test since
they weren't really needed.
Chandler Carruth [Wed, 13 Aug 2014 09:05:59 +0000 (09:05 +0000)]
[shuffle] Teach the shuffle fuzzer to fuzz blends, including forming
a tree of inputs to blend iteratively together.
This required a pretty substantial rewrite of the innards. The number of
shuffle instructions is now bounded in terms of tree-height. There is
a flag to disable blends so that it's still possible to test single-input
shuffles. I've also improved various aspects of how the test program is
generated, primarily to simplify the test harness and allow some
optimizations to clean up how we actually check the results and build up
the inputs.
Again, apologies for my likely horrible use of Python... But hey, it
works! (Ish?)
As of r214452, isa<MemSDNode> will return true for nodes for which
isa<MemIntrinsicSDNode> will return true (classof now respects the actual class
hierarchy). So we no longer need to check for both MemIntrinsicSDNode and
MemSDNode separately.
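The cleanup this enables, sketched as a hypothetical helper:
// Illustrative only: the old "isa<MemIntrinsicSDNode>(N) ||
// isa<MemSDNode>(N)" check collapses to a single test.
static bool isMemNode(const SDNode *N) {
  return isa<MemSDNode>(N);
}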
Chandler Carruth [Wed, 13 Aug 2014 01:25:45 +0000 (01:25 +0000)]
[x86] Rewrite a core part of the new vector shuffle lowering to handle
one pesky test case correctly.
This test case caused the old code to infloop, oscillating between solving
the low half and the high half. The 'side balancing' part of
single-input v8 shuffle lowering didn't handle the one pattern that can
cause it to oscillate. Fortunately the fuzz testing found this case.
Unfortunately it was *terrible* to handle. I'm really sorry for the
amount and density of the code here, I'd love suggestions on how to
simplify it. I feel like there *must* be a simpler form here, but after
a lot of days I've not found it. This is the only one I've found that
even works. I've added the one pesky test case along with some nice
comments explaining the core problem that we have to solve here.
So far this has survived approximately 32k test cases. More strenuous
fuzzing commencing.
This implements PPCTargetLowering::getTgtMemIntrinsic for Altivec load/store
intrinsics. As with the construction of the MachineMemOperands for the
intrinsic calls used for unaligned load/store lowering, the only slight
complication is that we need to represent a larger memory range than the
loaded/stored value-type size (because the address is rounded down to an
aligned address, and we need to conservatively represent the entire possible
range of the actual access). This required adding an extra size field to
TargetLowering::IntrinsicInfo, and this was done in a way that required no
modifications to other targets (the size defaults to the store size of the
provided memory data type).
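A sketch of what such an implementation might look like for an Altivec load; the range arithmetic follows the description above, but the exact constants and fields are approximated rather than copied from the commit:
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Target/TargetLowering.h"
using namespace llvm;

// Illustrative only: describe an lvx-style intrinsic to the SelectionDAG.
static bool getTgtMemIntrinsicSketch(TargetLowering::IntrinsicInfo &Info,
                                     const CallInst &I, unsigned Intrinsic) {
  switch (Intrinsic) {
  case Intrinsic::ppc_altivec_lvx: {
    EVT VT = MVT::v4i32; // type of the loaded value
    Info.opc = ISD::INTRINSIC_W_CHAIN;
    Info.memVT = MVT::v4i32;
    Info.ptrVal = I.getArgOperand(0);
    // The hardware rounds the address down to a 16-byte boundary, so
    // conservatively cover every byte the access could touch.
    Info.offset = -((int)VT.getStoreSize() - 1); // [ptr - 15, ...
    Info.size = 2 * VT.getStoreSize() - 1;       // ... ptr + 15], new field
    Info.align = 1;
    Info.vol = false;
    Info.readMem = true;
    Info.writeMem = false;
    return true;
  }
  default:
    return false;
  }
}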
This fixes test/CodeGen/PowerPC/unal-altivec-wint.ll (so it can be un-XFAILed).
Hal Finkel [Wed, 13 Aug 2014 01:15:37 +0000 (01:15 +0000)]
Fix classof for ISD::INTRINSIC_W_CHAIN and INTRINSIC_VOID
Unfortunately, our use of the SDNode class hierarchy for INTRINSIC_W_CHAIN and
INTRINSIC_VOID nodes is somewhat broken right now. These nodes sometimes are
used for memory intrinsics (those with MachineMemOperands), and sometimes not.
When not, the nodes are not created as instances of MemIntrinsicSDNode, but
rather created as some other subclass of SDNode using DAG::getNode. When they
are memory intrinsics, they are created using DAG::getMemIntrinsicNode as
instances of MemIntrinsicSDNode. MemIntrinsicSDNode is a subclass of
MemSDNode, but prior to r214452, we had a non-self-consistent setup whereby
MemIntrinsicSDNode::classof on INTRINSIC_W_CHAIN and INTRINSIC_VOID would
return true but MemSDNode::classof on INTRINSIC_W_CHAIN and INTRINSIC_VOID
would return false. In r214452, MemSDNode::classof was changed to return true
for INTRINSIC_W_CHAIN and INTRINSIC_VOID, which is now self-consistent. The
problem is that neither the pre-r214452 logic nor the post-r214452 logic is
really right. The truth is that not all INTRINSIC_W_CHAIN and INTRINSIC_VOID
nodes are instances of MemIntrinsicSDNode (or MemSDNode for that matter), and
the return value from classof needs to reflect that. This was broken before
r214452 (because MemIntrinsicSDNode::classof always returned true), and was
broken afterward (because MemSDNode::classof also always returned true), and
will now be correct.
The minimal solution is to grab one of the SubclassData bits (there is one left
for MemIntrinsicSDNode nodes) and use it to store whether or not a particular
INTRINSIC_W_CHAIN or INTRINSIC_VOID is really an instance of
MemIntrinsicSDNode or not. Doing this allows both MemIntrinsicSDNode::classof
and MemSDNode::classof to return the correct answer for the underlying object
for both the memory-intrinsic and non-memory-intrinsic cases.
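The resulting check, sketched (the accessor for the SubclassData bit is approximated):
// Illustrative only: classof can now tell apart the two kinds of
// INTRINSIC_W_CHAIN / INTRINSIC_VOID nodes.
static bool isMemIntrinsicNode(const SDNode *N) {
  return (N->getOpcode() == ISD::INTRINSIC_W_CHAIN ||
          N->getOpcode() == ISD::INTRINSIC_VOID) &&
         N->isMemIntrinsic(); // SubclassData bit, set by getMemIntrinsicNode
}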
This fixes the problem that r214452 created in the SelectionDAGDumper (thanks
to Matt Arsenault for pointing it out).
Because PowerPC does not implement getTgtMemIntrinsic, this change breaks
test/CodeGen/PowerPC/unal-altivec-wint.ll. I've XFAILed it for now, and will
fix it in a follow-up commit.
Reid Kleckner [Tue, 12 Aug 2014 22:01:39 +0000 (22:01 +0000)]
APInt: Make self-move-assignment a no-op to fix stage3 clang-cl
It's not clear what the semantics of a self-move should be. The
consensus appears to be that a self-move should leave the object in a
moved-from state, which is what our existing move assignment operator
does.
However, the MSVC 2013 STL will perform self-moves in some cases. In
particular, when doing a std::stable_sort of an already sorted APSInt
vector of an appropriate size, one of the merge steps will self-move
half of the elements.
We don't notice this when building with MSVC, because MSVC will not
synthesize the move assignment operator for APSInt. Presumably MSVC
does this because APInt, the base class, has user-declared special
members that implicitly delete move special members. Instead, MSVC
selects the copy-assign operator, which defends against self-assignment.
Clang, on the other hand, selects the move-assign operator, and we get
garbage APInts.
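The fix, sketched against APInt's representation at the time (a 64-bit inline word VAL in a union with the heap pointer pVal):
#include "llvm/ADT/APInt.h"
using namespace llvm;

// Illustrative only: without the early return, a self-move would free
// pVal and then read the freed storage back through 'that'.
APInt &APInt::operator=(APInt &&that) {
  if (this == &that)
    return *this;      // self-move: leave the object untouched
  if (!isSingleWord())
    delete[] pVal;     // drop our own storage
  BitWidth = that.BitWidth;
  VAL = that.VAL;      // steals pVal too, through the union
  that.BitWidth = 0;   // leave 'that' in a valid moved-from state
  return *this;
}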
Adam Nemet [Tue, 12 Aug 2014 21:13:12 +0000 (21:13 +0000)]
[AVX512] Handle valign masking intrinsic via C++ lowering
I think that this will scale better in most cases than adding a Pat<> for each
mapping from the intrinsic DAG to the instruction (i.e. rri, rrik, rrikz). We
can just lower to the SDNode and have the resulting DAG be matched by the DAG
patterns.
Alternatively (long term), we could keep the Pat<>s but generate them via the
new AVX512_masking multiclass. The difficulty is that in order to formulate
that we would have to concatenate DAGs. Currently this is only supported if
the operators of the input DAGs are identical.
Don't upgrade global constructors when reading bitcode
An optional third field was added to `llvm.global_ctors` (and
`llvm.global_dtors`) in r209015. Most of the code has been changed to
deal with both versions of the variables. Users of the C API might
create either version, the helper functions in LLVM create the two-field
version, and clang now creates the three-field version.
However, the BitcodeReader was changed to always upgrade to the
three-field version. This created an unnecessary inconsistency in the
IR before/after serializing to bitcode.
This commit resolves the inconsistency by making the third field truly
optional (and not upgrading in the bitcode reader). Since `llvm-link`
was relying on this upgrade code, rather than deleting it I've moved it
into `ModuleLinker`, where it upgrades these arrays as necessary to
resolve inconsistencies between modules.
The ideal resolution would be to remove the 2-field version and make the
third field required. I filed PR20506 to track that.
I changed `test/Bitcode/upgrade-global-ctors.ll` to a negative test and
duplicated the `llvm-link` check in `test/Linker/global_ctors.ll` to
check both upgrade directions.
Since I came across this as part of PR5680 (serializing use-list order),
I've also added the missing `verify-uselistorder` RUN line to
`test/Bitcode/metadata-2.ll`.
Rafael Espindola [Tue, 12 Aug 2014 15:39:14 +0000 (15:39 +0000)]
Add a plugin testcase for merging weak variables.
I initially thought I could implement COMDATs with aliases by just
internalizing GVs instead of dropping them. This is a counterexample:
internalizing one of the @a would make @b and @c point
to different variables.
Eric Christopher [Tue, 12 Aug 2014 08:00:56 +0000 (08:00 +0000)]
Have MachineRegisterInfo take and store the MachineFunction it
was created for rather than the TargetMachine since we only
needed the TM for the subtarget and we can get that from the
MF.
Justin Bogner [Tue, 12 Aug 2014 03:24:59 +0000 (03:24 +0000)]
IR: Print a newline when dumping Types
Type::dump() doesn't print a newline, which makes for a poor
experience in a debugger. This looks like it was an omission,
considering Value::dump() two lines above, so I've changed Type to add
a newline as well.
Of the two in-tree callers, one added a newline anyway, and I've
updated the other one to use Type::print instead.
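The change, roughly (Type::print already existed; only the trailing newline is new):
#include "llvm/IR/Type.h"
#include "llvm/Support/Debug.h"
using namespace llvm;

void Type::dump() const {
  print(dbgs());
  dbgs() << '\n'; // match Value::dump() for a sane debugger experience
}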
Adrian Prantl [Tue, 12 Aug 2014 01:07:53 +0000 (01:07 +0000)]
DebugLocEntry: Restore the comparison predicate from before the
refactoring in r215384. This way it can unique multiple entries describing
the same piece even if they don't have the exact same location.
(The same piece may get merged in and be added from OpenRanges).
There ought to be a more elegant solution for this, though.
Reid Kleckner [Tue, 12 Aug 2014 00:12:43 +0000 (00:12 +0000)]
msan: Handle musttail calls
First, avoid calling setTailCall(false) on musttail calls. The function
prototypes should be "congruent", so the shadow layout should be exactly
the same.
Second, avoid inserting instrumentation after a musttail call to
propagate the return value shadow. We don't need to propagate the
result of a tail call; it should already be in the right place.
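Both guards, sketched (fragments from a hypothetical instrumentation visitor; the surrounding code is omitted):
// Illustrative only.
// 1) Keep the musttail flag intact; the prototypes are congruent, so the
//    shadow layout already matches.
if (!CI->isMustTailCall())
  CI->setTailCall(false);
// 2) Nothing may be inserted between a musttail call and the ret, so skip
//    emitting shadow propagation for its return value.
if (CI->isMustTailCall())
  return;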
Quentin Colombet [Mon, 11 Aug 2014 23:52:01 +0000 (23:52 +0000)]
[MachineSink] Improve the compile time by preserving the dominance information
as long as possible.
** Context **
Each time the dominance information is modified, the dominator tree analysis
switches into a slow query mode. After a few queries without any modification to
the dominator tree, it performs an expensive update of its internal structure to
provide fast queries again.
** Problem **
Prior to this patch, the MachineSink pass was splitting the critical edges on
demand while relying heavily on the dominator tree information. In some cases,
this leads to pathological behavior where:
- We end up in the slow query mode right after splitting an edge.
- We update the dominance information.
- We break the dominance information again, thus ending up in the slow query
mode and so on.
** Proposed Solution **
To mitigate this effect, this patch postpones all edge splitting to the end of
each iteration of the main loop.
The benefits are:
- The dominance information is valid for the life time of an iteration.
- This simplifies the code as we do not have to treat instructions that are
sunk on critical edges specially. Indeed, the related block will be available
through the next iteration.
The downside is that when edge splitting is required, this incurs an additional
iteration of the main loop compared to the previous scheme, as sketched below.
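A sketch of the restructured main loop (names and the container choice are approximate; processAllBasicBlocks is hypothetical):
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
using namespace llvm;

// Illustrative only: record edges during an iteration, split them at the end.
SmallVector<std::pair<MachineBasicBlock *, MachineBasicBlock *>, 8> ToSplit;
bool EverMadeChange = false;
while (true) {
  bool MadeChange = processAllBasicBlocks(ToSplit); // hypothetical: sinks
                                                    // and queues edge splits
  for (auto &Edge : ToSplit)
    if (Edge.first->SplitCriticalEdge(Edge.second, this)) // 'this' = the pass
      MadeChange = true; // the new block is processed on the next iteration
  ToSplit.clear();
  if (!MadeChange)
    break;
  EverMadeChange = true;
}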
** Performance **
Thanks to this patch, the motivating example compiles in 6+ minutes instead of
10+ minutes. No test case added, as the motivating example has nothing special
about it except being huge!
I have measured only noise for both the compile time and the runtime on the llvm
test-suite + SPECs with Os and O3.
Note: The current implementation of MachineBasicBlock::SplitCriticalEdge also
uses the dominance information and therefore, hits this problem. A subsequent
patch will address that.
Tom Stellard [Mon, 11 Aug 2014 22:18:11 +0000 (22:18 +0000)]
R600/SI: Add check for low 32 bits of encoding to mubuf tests
There are no variable values like registers encoded in the low 32 bits of MUBUF
instructions, so it is relatively easy to check these bits, and it will
help prevent us from introducing encoding bugs.
Quentin Colombet [Mon, 11 Aug 2014 22:17:14 +0000 (22:17 +0000)]
Add isRegSequence property.
This patch adds a new property, isRegSequence, and the related target hooks
TargetInstrInfo::getRegSequenceInputs and
TargetInstrInfo::getRegSequenceLikeInputs, to specify that a target-specific
instruction is a (kind of) REG_SEQUENCE.
Adrian Prantl [Mon, 11 Aug 2014 21:05:55 +0000 (21:05 +0000)]
Debug Info: Move the sorting and uniqueing of pieces from emitLocPieces()
into buildLocationList(). By keeping the list of Values sorted,
DebugLocEntry::Merge can also merge multi-piece entries.
ARM: try harder to detect non-IT eligible instructions
For many Thumb-1 register-register instructions, setting the CPSR is not
permitted inside an IT block. We would not correctly flag those instructions.
The previous change to identify this scenario was insufficient as it did not
actually catch all the instances. The current list is formed by manual
inspection of the ARMv6M ARM.
The change to the Thumb2 IT block test is due to the fact that the new more
stringent checking of the MIs results in the If Conversion pass being prevented
from executing (since not all the instructions in the BB are predicable). This
results in code gen changes.
Thanks to Tim Northover for pointing out that the previous patch was
insufficient, and for hinting that the v6M ARM would be much easier to use
than the v7 or v8 one!