Aaron Ballman [Mon, 24 Nov 2014 14:03:16 +0000 (14:03 +0000)]
Removing a variable that is initialized but never read. The original author has been alerted to the warning, in case this variable is meant to be used. Fixes -Werror builds in the meantime.
Jozef Kolek [Mon, 24 Nov 2014 13:29:59 +0000 (13:29 +0000)]
[mips][microMIPS] Implement disassembler support for 16-bit instructions
With the help of new method readInstruction16() two bytes are read and
decodeInstruction() is called with DecoderTableMicroMips16, if this fails
four bytes are read and decodeInstruction() is called with
DecoderTableMicroMips32.
Andrea Di Biagio [Mon, 24 Nov 2014 12:23:15 +0000 (12:23 +0000)]
[X86] Improved target specific combine on VSELECT dag nodes.
This patch teaches function 'transformVSELECTtoBlendVECTOR_SHUFFLE' how to
convert VSELECT dag nodes to shuffles on targets that do not have SSE4.1.
On pre-SSE4.1 targets, we can still perform blend operations using movss/movsd.
Also, removed a target specific combine that performed a premature lowering of
VSELECT nodes to target specific MOVSS/MOVSD nodes.
Fill in omission of `cast_or_null<>` and `dyn_cast_or_null<>` for types
that wrap pointers (e.g., smart pointers).
Type traits need to be slightly stricter than for `cast<>` and
`dyn_cast<>` to resolve ambiguities with simple types.
There didn't seem to be any unit tests for pointer wrappers, so I tested
`isa<>`, `cast<>`, and `dyn_cast<>` while I was in there.
This only supports pointer wrappers with a conversion to `bool` to check
for null. If in the future it's useful to support wrappers without such
a conversion, it should be a straightforward incremental step to use the
`simplify_type` machinery for the null check. In that case, the unit
tests should be updated to remove the `operator bool()` from the
`pointer_wrappers::PTy`.
r222375 made some improvements to build_vector lowering of v4x32 and v4xf32 into an insertps, but it missed a case where:
1. A single extracted element is used twice.
2. The lower of the two non-zero indexes should be preserved, and the higher should be used for the dest mask.
This caused a crash, since the source value for the insertps ends-up uninitialized.
Masked Vector Load and Store Intrinsics.
Introduced new target-independent intrinsics in order to support masked vector loads and stores. The loop vectorizer optimizes loops containing conditional memory accesses by generating these intrinsics for existing targets AVX2 and AVX-512. The vectorizer asks the target about availability of masked vector loads and stores.
Added SDNodes for masked operations and lowering patterns for X86 code generator.
Examples:
<16 x i32> @llvm.masked.load.v16i32(i8* %addr, <16 x i32> %passthru, i32 4 /* align */, <16 x i1> %mask)
declare void @llvm.masked.store.v8f64(i8* %addr, <8 x double> %value, i32 4, <8 x i1> %mask)
Scalarizer for other targets (not AVX2/AVX-512) will be done in a separate patch.
Craig Topper [Sat, 22 Nov 2014 18:30:18 +0000 (18:30 +0000)]
Reduce size of some tables in tablegen register info output.
Primarily done by using SequenceToOffsetTable to reduce the register pressure set tables and then sizing the indices into the tables appropriately. Size a few other table entries based on content as well. Reduces X86RegisterInfo.o by ~9k.
Chandler Carruth [Sat, 22 Nov 2014 05:44:43 +0000 (05:44 +0000)]
[x86] Add some tests for a common unpack pattern of vector shuffle that
has a remarkably unique and efficient lowering.
While we get this some of the time already, we miss a few cases and
there wasn't a principled reason we got it. We should at least test
this. v8 already has tests for this pattern.
Jozef Kolek [Fri, 21 Nov 2014 22:04:35 +0000 (22:04 +0000)]
[mips][microMIPS] This patch implements functionality in MIPS delay slot
filler such as if delay slot filler have to put NOP instruction into the
delay slot of microMIPS BEQ or BNE instruction which uses the register $0,
then instead of emitting NOP this instruction is replaced by the corresponding
microMIPS compact branch instruction, i.e. BEQZC or BNEZC.
Tim Northover [Fri, 21 Nov 2014 20:16:09 +0000 (20:16 +0000)]
Remove duplication of relocation names in lib/Object/ELFYAML.cpp
We can now use the ELF relocation .def files to create the mapping
of relocation numbers to names and avoid having to duplicate the
list of relocations.
Tim Northover [Fri, 21 Nov 2014 20:16:07 +0000 (20:16 +0000)]
Remove duplication of relocation names in lib/Object/ELF.cpp
We can now use the ELF relocation .def files to create the mapping
of relocation numbers to names and avoid having to duplicate the
list of relocations.
Tim Northover [Fri, 21 Nov 2014 20:16:02 +0000 (20:16 +0000)]
Split ELF relocation defintions into per-architecture .def files
This should allow the list of relocations for a particular
architecture to be kept in a single header rather than duplicated
whenever we need to enumerate all the relocations.
Sanjay Patel [Fri, 21 Nov 2014 17:40:04 +0000 (17:40 +0000)]
Add a feature flag for slow 32-byte unaligned memory accesses [x86].
This patch adds a feature flag to avoid unaligned 32-byte load/store AVX codegen
for Sandy Bridge and Ivy Bridge. There is no functionality change intended for
those chips. Previously, the absence of AVX2 was being used as a proxy to detect
this feature. But that hindered codegen for AVX-enabled AMD chips such as btver2
that do not have the 32-byte unaligned access slowdown.
Performance measurements are included in PR21541 ( http://llvm.org/bugs/show_bug.cgi?id=21541 ).
Revert "Allow FDE references outside the +/-2GB range supported by PC relative offsets for code models other than small/medium. For JIT application, memory layout is less controlled and can result in truncations otherwise."
This reverts commit r222538.
It's causing test failures for CFI, at least on Darwin:
Note that the previous incremental build was on r222537, and the CFI
tests weren't failing:
http://lab.llvm.org:8080/green/job/clang-stage1-cmake-RA-incremental/1188/
Chandler Carruth [Fri, 21 Nov 2014 14:53:03 +0000 (14:53 +0000)]
[x86] Restructure the checking patterns for v16 and v32 avx2 vector
shuffle lowering to allow much better blend matching.
Specifically, with the new structure the code seems clearer to me and we
correctly can hit the cases where merging two 128-bit lanes is a clear
win and can be shuffled cheaply afterward.
Allow FDE references outside the +/-2GB range supported by PC relative
offsets for code models other than small/medium. For JIT application,
memory layout is less controlled and can result in truncations
otherwise.
Chandler Carruth [Fri, 21 Nov 2014 14:33:24 +0000 (14:33 +0000)]
[x86] Make the previous logic significantly less conservative and get
a bunch more improvements.
Non-lane-crossing is fine, the key is that lane merging only makes sense
for single-input shuffles. Not sure why I got so turned around here. The
code all works, I was just using the wrong model for it.
This only updates v4 and v8 lowering. The v16 and v32 lowering requires
restructuring the entire check sequence.
Andrea Di Biagio [Fri, 21 Nov 2014 14:32:06 +0000 (14:32 +0000)]
[DAG] Teach how to turn a build_vector into a shuffle if some of the operands are zero.
Before this patch, the DAGCombiner only tried to convert build_vector dag nodes
into shuffles if all operands were either extract_vector_elt or undef.
This patch improves that logic and teaches the DAGCombiner how to deal with
build_vector dag nodes where one or more operands are zero. A build_vector
dag node with some zero operands is turned into a shuffle only if the resulting
shuffle mask is legal for the target.
Chandler Carruth [Fri, 21 Nov 2014 13:56:05 +0000 (13:56 +0000)]
[x86] Teach the x86 vector shuffle lowering to detect mergable 128-bit
lanes.
By special casing these we can often either reduce the total number of
shuffles significantly or reduce the number of (high latency on Haswell)
AVX2 shuffles that potentially cross 128-bit lanes. Even when these
don't actually cross lanes, they have much higher latency to support
that. Doing two of them and a blend is worse than doing a single insert
across the 128-bit lanes to blend and then doing a single interleaved
shuffle.
While this seems like a narrow case, it kept cropping up on me and the
difference is *huge* as you can see in many of the test cases. I first
hit this trying to perfectly fix the interleaving shuffle patterns used
by Halide for AVX2.
Chandler Carruth [Fri, 21 Nov 2014 12:17:50 +0000 (12:17 +0000)]
[x86] Add a bunch of test cases to 256-bit shuffles that exercise
merging 128-bit subvectors and also shuffling all the elements of those
subvectors. Currently we generate pretty bad code for many of these, but
I'm testing a patch that should dramatically improve this in addition to
making the shuffle lowering robust to other changes.
Andrea Di Biagio [Fri, 21 Nov 2014 11:33:07 +0000 (11:33 +0000)]
[DAG] Refactor the shuffle combining logic in DAGCombiner. NFC.
This patch simplifies the logic that combines a pair of shuffle nodes into
a single shuffle if there is a legal mask. Also added comments to better
describe the algorithm. No functional change intended.
Hao Liu [Fri, 21 Nov 2014 06:39:58 +0000 (06:39 +0000)]
DAGCombiner: Allow the DAGCombiner to combine multiple FDIVs with the same divisor info FMULs by the reciprocal.
E.g., ( a / D; b / D ) -> ( recip = 1.0 / D; a * recip; b * recip)
A hook is added to allow the target to control whether it needs to do such combine.
Hal Finkel [Fri, 21 Nov 2014 04:35:51 +0000 (04:35 +0000)]
[PPC] Use SeparateConstOffsetFromGEP
This mirrors r222331, which enabled SeparateConstOffsetFromGEP on AArch64, in
the PowerPC backend. Yields, on a POWER7 machine, a 30% speedup on
SingleSource/Benchmarks/Shootout/nestedloop (this might just be from LICM,
there is a store moved out of the inner loop) and a potential speedup on
MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode. Regardless, it
makes some code look cleaner, and synchronizing the backends in this regard
seems like a generally good thing.
Hal Finkel [Fri, 21 Nov 2014 02:22:46 +0000 (02:22 +0000)]
Clarify the description of the noalias attribute
The previous description of the noalias attribute did not accurately specify
the implemented semantics, and the terminology used differed unnecessarily
from that used by the C specification to define the semantics of restrict. For
the argument attribute, the semantics can be precisely specified in terms of
objects accessed through pointers based on the arguments, and this is now what
is done.
Saying that the semantics are 'slightly weaker' than that provided by C99
restrict is not really useful without further elaboration, so that has been
removed from the sentence.
noalias on a return value is really used to mean that the function is
malloc-like (and, in fact, we use this attribute to represent
__attribute__((malloc)) in Clang), and this is a stronger guarantee than that
provided by restrict (because it is a property of the pointed-to memory region,
not just a guarantee on object access). Clarifying this is relevant to fixing
(and was motivated by the discussion on) PR21556.
Adrian Prantl [Fri, 21 Nov 2014 00:39:43 +0000 (00:39 +0000)]
Verifier: Check that all instructions have their parent pointers set up
correctly. This helps with catching problems caused by IRBuilder abuse
such as the one fixed in CFE r222487.
Reid Kleckner [Thu, 20 Nov 2014 23:37:18 +0000 (23:37 +0000)]
Add out of line virtual destructors to all LLVMTargetMachine subclasses
These recently all grew a unique_ptr<TargetLoweringObjectFile> member in
r221878. When anyone calls a virtual method of a class, clang-cl
requires all virtual methods to be semantically valid. This includes the
implicit virtual destructor, which triggers instantiation of the
unique_ptr destructor, which fails because the type being deleted is
incomplete.
This is just part of the ongoing saga of PR20337, which is affecting
Blink as well. Because the MSVC ABI doesn't have key functions, we end
up referencing the vtable and implicit destructor on any virtual call
through a class. We don't actually end up emitting the dtor, so it'd be
good if we could avoid this unneeded type completion work.
Mehdi Amini [Thu, 20 Nov 2014 22:40:25 +0000 (22:40 +0000)]
SimplifyCFG: Refactor GatherConstantCompares() result in a struct
Code seems cleaner and easier to understand this way
This is basically r222416, after fixes for MSVC lack of standard
support, and a few cleaning (got rid of a warning).
Thanks Nakamura Takumi and Nico Weber for the MSVC fixes.