xorl + setcc is generally the preferred sequence due to the partial register
stall setcc + movzbl suffers from. As a bonus, it also encodes one byte smaller.
On CPUs with the zero cycle zeroing feature enabled "movi v.2d" should
be used to zero a vector register. This was previously done at
instruction selection time, however the register coalescer sometimes
widened multiple vregs to the Q width because of that leading to extra
spills. This patch leaves the decision on how to zero a register to the
AsmPrinter phase where it doesn't affect register allocation anymore.
This patch also sets isAsCheapAsAMove=1 on FMOVS0, FMOVD0.
This fixes http://llvm.org/PR27454, rdar://25866262
AArch64: Replace a RegScavenger instance with LivePhysRegs
findScratchNonCalleeSaveRegister() just needs a simple liveness
analysis, use LivePhysRegs for that as it is simpler and does not depend
on the kill flags.
This commit adds a convenience function available() to LivePhysRegs:
This function returns true if the given register is not reserved and
neither the register nor any of its aliases are alive.
[InstCombine] use more specific pattern matchers; NFCI
Follow-up from r274465: we don't need to capture the value in these cases,
so just match the constant that we're looking for. m_One/m_Zero work with
vector splats as well as scalars.
Piotr Padlewski [Wed, 6 Jul 2016 20:26:25 +0000 (20:26 +0000)]
Add 'thinlto_src_module' metadata to imported function
Added metadata to be able to make statistics on how many functions
that have been imported have been removed. Also module name might
be helpfull when debugging.
Chad Rosier [Wed, 6 Jul 2016 19:48:52 +0000 (19:48 +0000)]
[DSE] Avoid iterator invalidation bugs.
The dse_with_dbg_value.ll test committed with r273141 is removed because this
we no longer performs any type of back tracking, which is what was causing the
codegen differences with and without debug information.
Paul Robinson [Wed, 6 Jul 2016 19:32:41 +0000 (19:32 +0000)]
[Conduct] Refine what "rare cases" means wrt violations outside our spaces.
Text suggested by Daniel Berlin. While it is likely to be exactly what
the advisory committee would do anyway, codifying it does no harm and
helps reassure people that rare does not mean arbitrary.
[x86] fix cost of SINT_TO_FP for i32 --> float (PR21356, PR28434)
This is "cvtdq2ps" which does not appear to be particularly slow on any CPU
according to Agner's tables. Choosing "5" as a cost here as suggested in:
https://llvm.org/bugs/show_bug.cgi?id=21356
...but it seems very conservative given that the instruction is fully pipelined,
and I think these costs are supposed to model throughput.
Note that related costs are also most likely too high, but this fixes PR21356
and partly fixes PR28434.
This logic was introduced in r157663 and does not make any sense to me.
The motivating example in rdar://11538365 looks like this:
This is the tail:
BB#16: derived from LLVM BB %if.end68
Live Ins: %R0 %R4 %R5
Predecessors according to CFG: BB#15 BB#5
tBLXi pred:14, pred:%noreg, <ga:@CFRelease>, %R0<kill>, <regmask>, %LR<imp-def,dead>, %SP<imp-use>, %SP<imp-def>
t2B <BB#20>, pred:14, pred:%noreg
Successors according to CFG: BB#20
This is the predBB:
BB#5:
Live Ins: %R5
Predecessors according to CFG: BB#4
%R4<def> = t2MOVi 0, pred:14, pred:%noreg, opt:%noreg
t2B <BB#16>, pred:14, pred:%noreg
Successors according to CFG: BB#16
However this is invalid machine code to begin with, if %R0 is live-in to
BB#16 then it must be live-in to BB#5 as well if BB#5 does not define
it. We should not need logic to retroactively fix broken machine code
and in fact the example from r157663 passes cleanly with the code
removed and I do not see any (newly) failing tests with the machine
verifier enabled.
[SystemZ] Remove AND mask of bottom 6 bits when result is used for shift/rotate
On SystemZ, shift and rotate instructions only use the bottom 6 bits of the shift/rotate amount.
Therefore, if the amount is ANDed with an immediate mask that has all of the bottom 6 bits set, we
can remove the AND operation entirely.
Simon Pilgrim [Wed, 6 Jul 2016 18:09:08 +0000 (18:09 +0000)]
[X86][SSE] Fixed typo in insertps lowering.
We were checking for 2 insertions (which is caught earlier in the pattern matching loop) instead of the case where we have no insertions.
Turns out this code never fires as we always try to lower to insertps after trying to lower to blendps, which would catch these cases - I'm about to make some changes to support combining to insertps which could cause this to fire so I don't want to remove it.
Ensure all uses of permute instructions feed vector stores
There is a problem in VSXSwapRemoval where it is incorrectly removing permute instructions.
In this case, the permute is feeding both a vector store and also a non-store instruction. In this case, the permute cannot be removed.
The fix is to simply look at all the uses of the vector register defined by the permute and ensure that all the uses are vector store instructions.
This problem was reported in PR 27735 (https://llvm.org/bugs/show_bug.cgi?id=27735).
Tim Shen [Wed, 6 Jul 2016 17:44:03 +0000 (17:44 +0000)]
[DAGCombiner] Fix visitSTORE to continue processing current SDNode, if findBetterNeighborChains doesn't actually CombineTo it.
Summary:
findBetterNeighborChains may or may not find a better chain for each node it finds, which include the node ("St") that visitSTORE is currently processing. If no better chain is found for St, visitSTORE should continue instead of return SDValue(St, 0), as if it's CombinedTo'ed.
This fixes bug 28130. There might be other ways to make the test pass (see D21409). I think both of the patches are fixing actual bugs revealed by the same testcase.
[TTI] The cost model should not assume vector casts get completely scalarized
The cost model should not assume vector casts get completely scalarized, since
on targets that have vector support, the common case is a partial split up to
the legal vector size. So, when a vector cast gets split, the resulting casts
end up legal and cheap.
Instead of pessimistically assuming scalarization, base TTI can use the costs
the concrete TTI provides for the split vector, plus a fudge factor to account
for the cost of the split itself. This fudge factor is currently 1 by default,
except on AMDGPU where inserts and extracts are considered free.
Matthew Simpson [Wed, 6 Jul 2016 14:26:59 +0000 (14:26 +0000)]
[LV] Don't widen trivial induction variables
We currently always vectorize induction variables. However, if an induction
variable is only used for counting loop iterations or computing addresses with
getelementptr instructions, we don't need to do this. Vectorizing these trivial
induction variables can create vector code that is difficult to simplify later
on. This is especially true when the unroll factor is greater than one, and we
create vector arithmetic when computing step vectors. With this patch, we check
if an induction variable is only used for counting iterations or computing
addresses, and if so, scalarize the arithmetic when computing step vectors
instead. This allows for greater simplification.
This patch addresses the suboptimal pointer arithmetic sequence seen in
PR27881.
[ARM] Do not test for CPUs, use SubtargetFeatures. Also remove 2 flags.
This is a follow-up for r273544.
The end goal is to get rid of the isSwift / isCortexXY / isWhatever methods.
This commit also removes two command-line flags that weren't used in any of the
tests: widen-vmovs and swift-partial-update-clearance. The former may be easily
replaced with the mattr mechanism, but the latter may not (as it is a subtarget
property, and not a proper feature).
AVX-512: Optimization for patterns with i1 scalar type
The patch removes redundant kmov instructions (not all, we still have a lot of work here) and redundant "and" instructions after "setcc".
I use "AssertZero" marker between X86ISD::SETCC node and "truncate" to eliminate extra "and $1" instruction.
I also changed zext, aext and trunc patterns in the .td file. It allows to remove extra "kmov" instruictions.
This patch fixes https://llvm.org/bugs/show_bug.cgi?id=28173.
Fast ISEL mode is not supported correctly for AVX-512. ICMP/FCMP scalar instruction should return result in k-reg. It will be fixed in one of the next patches. I redirected handling of "cmp" to the DAG builder mode. (The code looks worse in one specific test case, but without this fix the new patch fails).
Summary:
Since "AMDGPU: Fix verifier errors in SILowerControlFlow", the logic that
ensures that a non-void-returning shader falls off the end of the last
basic block was effectively disabled, since SI_RETURN is now used.
StratifiedSets (as implemented) is very fast, but its accuracy is also
limited. If we take a more aggressive andersens-like approach, we can be
way more accurate, but we'll also end up being slower.
So, we've decided to split CFLAA into CFLSteensAA and CFLAndersAA.
Long-term, we want to end up in a place where CFLSteens is queried
first; if it can provide an answer, great (since queries are basically
map lookups). Otherwise, we'll fall back to CFLAnders, BasicAA, etc.
This patch splits everything out so we can try to do something like
that when we get a reasonable CFLAnders implementation.
Tim Northover [Tue, 5 Jul 2016 23:15:58 +0000 (23:15 +0000)]
AArch64: try to fix optimized build failure.
I think the Ops filled out by Regex::match contain pointers into the temporary
std::string returned by StringRef::upper. Its lifetime is extended by the call
to match, but only until the end of that call (not to the uses of Ops later
on).
Tim Northover [Tue, 5 Jul 2016 21:23:04 +0000 (21:23 +0000)]
AArch64: TableGenerate system instruction operands.
The way the named arguments for various system instructions are handled at the
moment has a few problems:
- Large-scale duplication between AArch64BaseInfo.h and AArch64BaseInfo.cpp
- That weird Mapping class that I have no idea what I was on when I thought
it was a good idea.
- Searches are performed linearly through the entire list.
- We print absolutely all registers in upper-case, even though some are
canonically mixed case (SPSel for example).
- The ARM ARM specifies sysregs in terms of 5 fields, but those are relegated
to comments in our implementation, with a slightly opaque hex value
indicating the canonical encoding LLVM will use.
This adds a new TableGen backend to produce efficiently searchable tables, and
switches AArch64 over to using that infrastructure.
Not all code-paths set the relocation model to static for Windows. This
currently breaks on Windows ARM with `-mlong-calls` when built with clang.
Loosen the assertion to what it was previously. We would ideally ensure that
all the configuration sets Windows to static relocation model.
Tim Northover [Tue, 5 Jul 2016 18:02:57 +0000 (18:02 +0000)]
AArch64: use correct SDValue # when looking for bitfield placement.
The other use really does only care about the SDNode (it checks the
opcode against a whitelist), but bitFieldPlacement can be misled if
the node produces multiple results.
Tom Stellard [Tue, 5 Jul 2016 16:10:44 +0000 (16:10 +0000)]
AMDGPU/SI: Remove address space query functions from AMDGPUDAGToDAGISel
Summary:
These have been replaced with TableGen code (except for isConstantLoad,
which is still used for R600). The queries were broken for cases
where MemOperand was a PseudoSourceValue.
John Brawn [Tue, 5 Jul 2016 13:16:54 +0000 (13:16 +0000)]
[CMake] Adjust export_executable_symbols to cope with non-target link libraries
export_executable_symbols looks though the link libraries of the executable in
order to figure out transitive dependencies, but in doing so it assumes that
all link libraries are also targets. This is not true as of r273302, so adjust
it to check if they actually are targets.
James Molloy [Tue, 5 Jul 2016 12:37:13 +0000 (12:37 +0000)]
[Thumb] Reapply r272251 with a fix for PR28348 (mk 2)
The important thing I was missing was ensuring newly added constants were kept in topological order. Repositioning the node is correct if the constant is newly added (so it has no topological ordering) but wrong if it already existed - positioning it next in the worklist would break the topological ordering.
Original commit message:
[Thumb] Select a BIC instead of AND if the immediate can be encoded more optimally negated
If an immediate is only used in an AND node, it is possible that the immediate can be more optimally materialized when negated. If this is the case, we can negate the immediate and use a BIC instead;
int i(int a) {
return a & 0xfffffeec;
}
Used to produce:
ldr r1, [CONSTPOOL]
ands r0, r1
CONSTPOOL: 0xfffffeec
And now produces:
movs r1, #255
adds r1, #20 ; Less costly immediate generation
bics r0, r1
[PowerPC] - Legalize vector types by widening instead of integer promotion
This patch corresponds to review:
http://reviews.llvm.org/D20443
It changes the legalization strategy for illegal vector types from integer
promotion to widening. This only applies for vectors with elements of width
that is a multiple of a byte since we have hardware support for vectors with
1, 2, 3, 8 and 16 byte elements.
Integer promotion for vectors is quite expensive on PPC due to the sequence
of breaking apart the vector, extending the elements and reconstituting the
vector. Two of these operations are expensive.
This patch causes between minor and major improvements in performance on most
benchmarks. There are very few benchmarks whose performance regresses. These
regressions can be handled in a subsequent patch with a DAG combine (similar
to how this patch handles int -> fp conversions of illegal vector types).