Bill Wendling [Fri, 13 Apr 2012 01:06:27 +0000 (01:06 +0000)]
Code-gen may inject code into the IR before it emits the ASM. The linker
obviously cannot know that this code is present, let alone used. So prevent the
internalize pass from internalizing those global values which code-gen may
insert.
Dan Gohman [Fri, 13 Apr 2012 00:59:57 +0000 (00:59 +0000)]
Don't move objc_autorelease calls past autorelease pool boundaries when
optimizing autorelease calls on phi nodes with null operands.
This fixes rdar://11207070.
Dan Gohman [Thu, 12 Apr 2012 23:31:46 +0000 (23:31 +0000)]
Add forms of dominates and isReachableFromEntry that accept a Use
directly instead of a user Instruction. This allows them to test
whether a def dominates a particular operand if the user instruction
is a PHI.
There is an assert at line 558 in ScheduleDAGInstrs::buildSchedGraph(AliasAnalysis *AA).
This assert needs to addressed for post RA scheduler. Until that assert is addressed,
any passes that uses post ra scheduler will fail. So, I am temporarily disabling the
hexagon tests until that fix is in.
The assert is as follows:
assert(!MI->isTerminator() && !MI->isLabel() &&
"Cannot schedule terminators or labels!");
This patch improves the MCJIT runtime dynamic loader by adding new handling
of zero-initialized sections, virtual sections and common symbols
and preventing the loading of sections which are not required for
execution such as debug information.
Jim Grosbach [Wed, 11 Apr 2012 17:40:18 +0000 (17:40 +0000)]
ARM 'vuzp.32 Dd, Dm' is a pseudo-instruction.
While there is an encoding for it in VUZP, the result of that is undefined,
so we should avoid it. Define the instruction as a pseudo for VTRN.32
instead, as the ARM ARM indicates.
Jim Grosbach [Wed, 11 Apr 2012 16:53:25 +0000 (16:53 +0000)]
ARM 'vzip.32 Dd, Dm' is a pseudo-instruction.
While there is an encoding for it in VZIP, the result of that is undefined,
so we should avoid it. Define the instruction as a pseudo for VTRN.32
instead, as the ARM ARM indicates.
Benjamin Kramer [Wed, 11 Apr 2012 14:06:54 +0000 (14:06 +0000)]
Cache the hash value of the operands in the MDNode.
FoldingSet is implemented as a chained hash table. When there is a hash
collision during insertion, which is common as we fill the table until a
load factor of 2.0 is hit, we walk the chained elements, comparing every
operand with the new element's operands. This can be very expensive if the
MDNode has many operands.
We sacrifice a word of space in MDNode to cache the full hash value, reducing
compares on collision to a minimum. MDNode grows from 28 to 32 bytes + operands
on x86. On x86_64 the new bits fit nicely into existing padding, not growing
the struct at all.
The actual speedup depends a lot on the test case and is typically between
1% and 2% for C++ code with clang -c -O0 -g.
Add a C binding to the Target and TargetMachine classes to allow for emitting
binary and assembly. Patch by Carlo Kok. Emitting was inspired by but not based
on the D llvm bindings.
Fix a dagcombine optimization which assumes that the vsetcc result type is always
of the same size as the compared values. This is ture for SSE/AVX/NEON but not
for all targets.
Original message:
Modify the code that lowers shuffles to blends from using blendvXX to vblendXX.
blendV uses a register for the selection while Vblend uses an immediate.
On sandybridge they still have the same latency and execute on the same execution ports.
Clean up ARM fused multiply + add/sub support some more: rename some isel
predicates.
Also remove NEON2 since it's not really useful and it is confusing. If
NEON + VFP4 implies NEON2 but NEON2 doesn't imply NEON + VFP4, what does it
really mean?
Kevin Enderby [Wed, 11 Apr 2012 00:25:40 +0000 (00:25 +0000)]
Fix ARM disassembly of VLD instructions with writebacks. Â And add test a case
for all opcodes handed by DecodeVLDInstruction() in ARMDisassembler.cpp .
Fix a number of problems with ARM fused multiply add/subtract instructions.
1. The new instruction itinerary entries are not properly described.
2. The asm parser can't handle vfms and vfnms.
3. There were no assembler, disassembler test cases.
4. HasNEON2 has the wrong assembler predicate.
rdar://10139676
Tweak MachineLICM heuristics for cheap instructions.
Allow cheap instructions to be hoisted if they are register pressure
neutral or better. This happens if the instruction is the last loop use
of another virtual register.
Only expensive instructions are allowed to increase loop register
pressure.
Owen Anderson [Tue, 10 Apr 2012 22:46:53 +0000 (22:46 +0000)]
Move the constant-folding support for FP_ROUND in SelectionDAG from the one-operand version of getNode() to the two-operand version, since it became a two-operand node at sound point.
Zap a testcase that this allows us to completely fold away.
ConstantFP::get(Type*, double) is unreliably host-specific:
it can't handle a type like PPC128 on an x86 host. It even
has a comment to that effect: "This should only be used for
simple constant values like 2.0/1.0 etc, that are
known-valid both as host double and as the target format."
Instead, use APFloat. While we're at it, randomize the floating
point value more thoroughly; it was previously limited
to the range 0 to 2**19 - 1.
[tsan] two more compile-time optimizations:
- don't isntrument reads from constant globals.
Saves ~1.5% of instrumented instructions on CPU2006
(counting static instructions, not their execution).
- don't insrument reads from vtable (which is a global constant too).
Saves ~5%.
I did not measure the run-time impact of this,
but it is certainly non-negative.
Bill Wendling [Tue, 10 Apr 2012 20:12:16 +0000 (20:12 +0000)]
The MDString class stored a StringRef to the string which was already in a
StringMap. This was redundant and unnecessarily bloated the MDString class.
Because the MDString class is a "Value" and will never have a "name", and
because the Name field in the Value class is a pointer to a StringMap entry, we
repurpose the Name field for an MDString. It stores the StringMap entry in the
Name field, and uses the normal methods to get the string (name) back.
[tsan] compile-time instrumentation: do not instrument a read if
a write to the same temp follows in the same BB.
Also add stats printing.
On Spec CPU2006 this optimization saves roughly 4% of instrumented reads
(which is 3% of all instrumented accesses):
Writes : 161216
Reads : 446458
Reads-before-write: 18295
Eric Christopher [Tue, 10 Apr 2012 18:18:10 +0000 (18:18 +0000)]
To ensure that we have more accurate line information for a block
don't elide the branch instruction if it's the only one in the block,
otherwise it's ok.
Jim Grosbach [Tue, 10 Apr 2012 17:31:55 +0000 (17:31 +0000)]
ARM fix cc_out operand handling for t2SUBrr instructions.
We were incorrectly conflating some add variants which don't have a
cc_out operand with the mirroring sub encodings, which do. Part of the
awesome non-orthogonality legacy of thumb1. Similarly, handling of
add/sub of an immediate was sometimes incorrectly removing the cc_out
operand for add/sub register variants.
Fix a dagcombine optimization which assumes that the vsetcc result type is always
of the same size as the compared values. This is ture for SSE/AVX/NEON but not
for all targets.
Modify the code that lowers shuffles to blends from using blendvXX to vblendXX.
blendv uses a register for the selection while vblend uses an immediate.
On sandybridge they still have the same latency and execute on the same execution ports.
Make a somewhat subtle change in the logic of block placement. Sometimes
the loop header has a non-loop predecessor which has been pre-fused into
its chain due to unanalyzable branches. In this case, rotating the
header into the body of the loop in order to place a loop exit at the
bottom of the loop is a Very Bad Idea as it makes the loop
non-contiguous.
I'm working on a good test case for this, but it's a bit annoynig to
craft. I should get one shortly, but I'm submitting this now so I can
begin the (lengthy) performance analysis process. An initial run of LNT
looks really, really good, but there is too much noise there for me to
trust it much.
Andrew Trick [Tue, 10 Apr 2012 05:14:42 +0000 (05:14 +0000)]
Fix 12513: Loop unrolling breaks with indirect branches.
Take this opportunity to generalize the indirectbr bailout logic for
loop transformations. CFG transformations will never get indirectbr
right, and there's no point trying.
Andrew Trick [Tue, 10 Apr 2012 02:25:24 +0000 (02:25 +0000)]
Added register unit sets to the target description.
This is a new algorithm that finds sets of register units that can be
used to model registers pressure. This handles arbitrary, overlapping
register classes. Each register class is associated with a (small)
list of pressure sets. These are the dimensions of pressure affected
by the register class's liveness.
Andrew Trick [Tue, 10 Apr 2012 02:25:21 +0000 (02:25 +0000)]
Added register unit weights to the target description.
This is a new algorithm that associates registers with weighted
register units to accuretely model their effect on register
pressure. This handles registers with multiple overlapping
subregisters. It is possible, but almost inconceivable that the
algorithm fails to find an exact solution for a target description. If
an exact solution cannot be found, an inexact, but reasonable solution
will be chosen.
Fix a long standing tail call optimization bug. When a libcall is emitted
legalizer always use the DAG entry node. This is wrong when the libcall is
emitted as a tail call since it effectively folds the return node. If
the return node's input chain is not the entry (i.e. call, load, or store)
use that as the tail call input chain.
Chad Rosier [Mon, 9 Apr 2012 20:32:02 +0000 (20:32 +0000)]
When performing a truncating store, it's possible to rearrange the data
in-register, such that we can use a single vector store rather then a
series of scalar stores.
Lang Hames [Mon, 9 Apr 2012 20:17:30 +0000 (20:17 +0000)]
Patch r153892 for PR11861 apparently broke an external project (see PR12493).
This patch restores TwoAddressInstructionPass's pre-r153892 behaviour when
rescheduling instructions in TryInstructionTransform. Hopefully this will fix
PR12493. To refix PR11861, lowering of INSERT_SUBREGS is deferred until after
the copy that unties the operands is emitted (this seems to be a more
appropriate fix for that issue anyway).