Eric Christopher [Tue, 10 Mar 2015 23:46:01 +0000 (23:46 +0000)]
Have TargetRegisterInfo::getLargestLegalSuperClass take a
MachineFunction argument so that it can look up the subtarget
rather than using a cached one in some Targets.
Philip Reames [Tue, 10 Mar 2015 22:52:37 +0000 (22:52 +0000)]
If a conditional branch jumps to the same target, remove the condition
Given that large parts of inst combine is restricted to instructions which have one use, getting rid of a use on the condition can help the effectiveness of the optimizer. Also, it allows the condition to potentially be deleted by instcombine rather than waiting for another pass.
I noticed this completely by accident in another test case. It's not anything that actually came from a real workload.
p.s. We should probably do the same thing for switch instructions.
Paul Robinson [Tue, 10 Mar 2015 22:44:45 +0000 (22:44 +0000)]
Emit correct linkage-name attribute based on DWARF version.
There are still 4 tests that check for DW_AT_MIPS_linkage_name,
because they specify DWARF 2 or 3 in the module metadata. So, I didn't
create an explicit version-based test for the attribute.
Philip Reames [Tue, 10 Mar 2015 22:43:20 +0000 (22:43 +0000)]
Infer known bits from dominating conditions
This patch adds limited support in ValueTracking for inferring known bits of a value from conditional expressions which must be true to reach the instruction we're trying to optimize. At this time, the feature is off by default. Once landed, I'm hoping for feedback from others on both profitability and compile time impact.
Forms of conditional value propagation have been tried in LLVM before and have failed due to compile time problems. In an attempt to side step that, this patch only considers conditions where the edge leaving the branch dominates the context instruction. It does not attempt full dataflow. Even with that restriction, it handles many interesting cases:
* Early exits from functions
* Early exits from loops (for context instructions in the loop and after the check)
* Conditions which control entry into loops, including multi-version loops (such as those produced during vectorization, IRCE, loop unswitch, etc..)
Possible applications include optimizing using information provided by constructs such as: preconditions, assumptions, null checks, & range checks.
This patch implements two approaches to the problem that need further benchmarking. Approach 1 is to directly walk the dominator tree looking for interesting conditions. Approach 2 is to inspect other uses of the value being queried for interesting comparisons. From initial benchmarking, it appears that Approach 2 is faster than Approach 1, but this needs to be further validated.
Eric Christopher [Tue, 10 Mar 2015 22:03:14 +0000 (22:03 +0000)]
Remove the use of the subtarget in MCCodeEmitter creation and
update all ports accordingly. Required a couple of small rewrites
in handling subtarget features during creation in PPC.
Eric Christopher [Tue, 10 Mar 2015 21:57:34 +0000 (21:57 +0000)]
Remove createAMDGPUMCCodeEmitter and instead just register the correct
MCCodeEmitter creation routine based on TargetMachine since the only
64-bit R600 gpus are part of the GCN target.
Quentin Colombet [Tue, 10 Mar 2015 21:48:15 +0000 (21:48 +0000)]
[CodeGenPrepare] Refine the cost model provided by the promotion helper.
- Use TargetLowering to check for the actual cost of each extension.
- Provide a factorized method to check for the cost of an extension:
TargetLowering::isExtFree.
- Provide a virtual method TargetLowering::isExtFreeImpl for targets to be able
to tune the cost of non-free extensions.
This refactoring offers a better granularity to model what really happens on
different targets.
No performance changes and very few code differences.
Chris Bieneman [Tue, 10 Mar 2015 20:48:02 +0000 (20:48 +0000)]
Add new LLVM_OPTIMIZED_TABLEGEN build setting which configures, builds and uses a release tablegen build when LLVM is configured with assertions enabled.
Summary: This change leverages the cross-compiling functionality in the build system to build a release tablegen executable for use during the build.
Ahmed Bougacha [Tue, 10 Mar 2015 20:45:38 +0000 (20:45 +0000)]
[AArch64] Avoid going through GPRs for across-vector instructions.
This adds new node types for each intrinsic.
For instance, for addv, we have AArch64ISD::UADDV, such that:
(v4i32 (uaddv ...))
is the same as
(v4i32 (scalar_to_vector (i32 (int_aarch64_neon_uaddv ...))))
that is,
(v4i32 (INSERT_SUBREG (v4i32 (IMPLICIT_DEF)),
(i32 (int_aarch64_neon_uaddv ...)), ssub)
In a combine, we transform all such across-vector-lanes intrinsics to:
(i32 (extract_vector_elt (uaddv ...), 0))
This has one big advantage: by making the extract_element explicit, we
enable the existing patterns for lane-aware instructions to fire.
This lets us avoid needlessly going through the GPRs. Consider:
uint32x4_t test_mul(uint32x4_t a, uint32x4_t b) {
return vmulq_n_u32(a, vaddvq_u32(b));
}
We now generate:
addv.4s s1, v1
mul.4s v0, v0, v1[0]
instead of the previous:
addv.4s s1, v1
fmov w8, s1
dup.4s v1, w8
mul.4s v0, v1, v0
The V128 integer patterns already exist in the INS multiclass.
The duplicates only fire when the vector index type isn't i64,
because they accept "imm" instead of an explicit "i64", as the
instruction definition patterns do.
TLI::getVectorIdxTy is i64 on AArch64, so this should never happen.
Also, one of them had a typo: for i64, INSvi32lane was used.
I noticed because I mistakenly used an explicit i32 as the idx type,
and got ins.s for an i64 vector_insert.
The V64 patterns also don't seem to ever fire, as V64 vector
extract/insert are legalized to V128.
The equivalent float patterns are unique and useful, so keep them.
No functional change intended; none exhibited on the LIT and LNT tests.
Fix the non-determinism by using a MapVector and reintroduce the AArch64
testcase. Defer deleting the got candidates up to the end and remove
them in a bulk, avoiding linear time removal of each element.
Thanks to Renato Golin for trying it out on other platforms.
Adam Nemet [Tue, 10 Mar 2015 18:54:26 +0000 (18:54 +0000)]
[LAA-memchecks 3/3] Introduce pointer partitions for memchecks
This is the final patch that actually introduces the new parameter of
partition mapping to RuntimePointerCheck::needsChecking.
Another API (LAI::getInstructionsForAccess) is also exposed that helps
to map pointers to instructions because ultimately we partition
instructions.
The WIP version of the Loop Distribution pass in D6930 has been adapted
to use all this. See for example, how
InstrPartitionContainer::computePartitionSetForPointers sets up the
partitions using the above API and then calls to LAI::addRuntimeCheck
with the pointer partitions.
Adam Nemet [Tue, 10 Mar 2015 18:54:23 +0000 (18:54 +0000)]
[LAA-memchecks 2/3] Move number of memcheck threshold checking to LV
Now the analysis won't "fail" if the memchecks exceed the threshold. It
is the transform pass' responsibility to perform the check.
This allows the transform pass to further analyze/eliminate the
memchecks. E.g. in Loop distribution we only need to check pointers
that end up in different partitions.
Note that there is a slight change of functionality here. The logic in
analyzeLoop is that if dependence checking fails due to non-constant
distance between the pointers, another attempt is made to prove safety
of the dependences purely using run-time checks.
Before this patch we could fail the loop due to exceeding the memcheck
threshold after the first step, now we only check the threshold in the
client after the full analysis. There is no measurable compile-time
effect but I wanted to record this here.
Adam Nemet [Tue, 10 Mar 2015 18:54:19 +0000 (18:54 +0000)]
[LAA-memchecks 1/3] Split out NumComparisons checks. NFC
The check for the number of memchecks will be moved to the client of
this analysis. Besides allowing for transform-specific thresholds, this
also lets Loop Distribution post-process the memchecks; Loop
Distribution only needs memchecks between pointers of different
partitions.
The motivation for this first patch is to untangle the CanDoRT check
from the NumComparison check before moving the NumComparison part.
CanDoRT means that we couldn't determine the bounds for the pointer.
Note that NumComparison is set independent of this flag.
Adam Nemet [Tue, 10 Mar 2015 17:40:37 +0000 (17:40 +0000)]
[LoopAccesses 2/3] Allow querying of interesting dependences
Gather an array of interesting dependences rather than just failing
after the first unsafe one and regarding the loop unsafe. Loop
Distribution needs to be able to collect all dependences in order to
isolate the dependence cycles into their own partition.
Since the dependence checking algorithm is quadratic in terms of
accesses sharing the same underlying pointer, I am applying a cut-off
threshold (MaxInterestingDependence). Exceeding that, the logic reverts
back to the original approach deeming the loop unsafe upon encountering
the first unsafe dependence.
The main idea of the patch is to split isDepedent from directly
answering the question whether the dep is safe for vectorization to
return a dependence type which then gets mapped to old boolean result
using Dependence::isSafeForVectorization.
Tested that this was compile-time neutral on SpecINT2006 LTO bitcode
inputs. No assembly change on the testsuite including external.
Adam Nemet [Tue, 10 Mar 2015 17:40:34 +0000 (17:40 +0000)]
[LoopAccesses 1/3] Expose MemoryDepChecker to LAA users
LoopDistribution needs to query various results of the dependence
analysis. This series will expose some more APIs and state of the
dependence checker.
This patch is a simple one to just expose the DepChecker instance. The
set is compile-time neutral measured with LTO bitcode files of
SpecINT2006. Also there is no assembly change on the testsuite.
Igor Laevsky [Tue, 10 Mar 2015 16:26:48 +0000 (16:26 +0000)]
Teach lowering to correctly handle invoke statepoint and gc results tied to them. Note that we still can not lower gc.relocates for invoke statepoints.
Also it extracts getCopyFromRegs helper function in SelectionDAGBuilder as we need to be able to customize type of the register exported from basic block during lowering of the gc.result.
(Resubmitting this change after not being able to reproduce buildbot failure)
Chad Rosier [Tue, 10 Mar 2015 16:22:52 +0000 (16:22 +0000)]
[BranchFolding] Remove MMOs during tail merge to preserve dependencies.
When tail merging it may be necessary to remove MMOs from memory operations to
ensures later passes (e.g., MI sched) conservatively compute dependencies.
Currently, we only remove the MMO from the common tail if the MMO doesn't match
with the relative instruction in the non-common tail(s).
A more robust solution would be to add multiple MMOs from the duplicate MIs to
the new MI. Currently ScheduleDAGInstrs.cpp ignores all MMOs on instructions
with multiple MMOs, so this solution is equivalent for the time being.
No test case included as this is incredibly difficult to reproduce.
Patch was a collaborative effort between Ana Pazos and myself.
Phabricator: http://reviews.llvm.org/D7769
Sanjay Patel [Tue, 10 Mar 2015 16:08:36 +0000 (16:08 +0000)]
[X86, AVX] replace vinsertf128 intrinsics with generic shuffles
We want to replace as much custom x86 shuffling via intrinsics
as possible because pushing the code down the generic shuffle
optimization path allows for better codegen and less complexity
in LLVM.
This is the sibling patch for the Clang half of this change:
http://reviews.llvm.org/D8088
Karthik Bhat [Tue, 10 Mar 2015 14:32:02 +0000 (14:32 +0000)]
Fix a memory corruption in Dependency Analysis.
This crash occurs due to memory corruption when trying to update dependency
direction based on Constraints.
This crash was observed during lnt regression of Polybench benchmark test case dynprog.
Karthik Bhat [Tue, 10 Mar 2015 13:31:03 +0000 (13:31 +0000)]
Fix a crash in Dependency Analysis.
This crash in Dependency analysis is because we assume here that in case of UsefulGEP
both source and destination have the same number of operands which may not be true.
This incorrect assumption results in crash while populating Pairs. Fix the same.
This crash was observed during lnt regression for code such as-
struct s{
int A[10][10];
int C[10][10][10];
} S;
void dep_constraint_crash_test(int k,int N) {
for( int i=0;i<N;i++)
for( int j=0;j<N;j++)
S.A[0][0] = S.C[0][0][k];
}
Review: http://reviews.llvm.org/D8162
Daniel Sanders [Tue, 10 Mar 2015 10:42:59 +0000 (10:42 +0000)]
The operand flag word used in ISD::INLINEASM is an i32 not a pointer. NFC.
Summary:
This is part of the work to support memory constraints that behave
differently to 'm'. The subsequent patches will expand on the existing
encoding (which is a 32-bit int) and as a result in some flag words will no
longer fit into an i16. This problem only affected the MSP430 target which
appears to have 16-bit pointers.
Yaron Keren [Tue, 10 Mar 2015 07:33:23 +0000 (07:33 +0000)]
Teach raw_ostream to accept SmallString.
Saves adding .str() call to any raw_ostream << SmallString usage
and a small step towards making .str() consistent in the ADTs by
removing one of the SmallString::str() use cases, discussion at
Owen Anderson [Tue, 10 Mar 2015 05:58:21 +0000 (05:58 +0000)]
Fix an issue in the verifier where we could try to read information out of a malformed statepoint intrinsic.
In this situation we would always have already flagged an error on the statepoint intrinsic,
but then we carry on to parse other, related GC intrinsics, and could end up crashing during that
verification when they try to access data from the malformed statepoint.
Owen Anderson [Tue, 10 Mar 2015 05:13:47 +0000 (05:13 +0000)]
Fix an infinite loop in InstCombine when an instruction with no users and side effects can be constant folded.
ReplaceInstUsesWith needs to return nullptr when the input has no users,
because in that case it does not mutate the program. Otherwise, we can
get stuck in an infinite loop of repeatedly attempting to constant fold
and instruction with no users.
Last commit fixed the handling of hash collisions, but it introdcuced
unneeded bucket terminators in some places. The generated table was
correct, it can just be a tiny bit smaller. As the previous table was
correct, the test doesn't need updating. If we really wanted to test
this, I could add the section size to the dwarf dump and test for a
precise value there. IMO the correctness test is sufficient.
CFLAA didn't know how to properly handle ConstantExprs; it would silently
ignore them. This was a problem if the ConstantExpr is, say, a GEP of a global,
because CFLAA wouldn't realize that there's a global there. :)
Mehdi Amini [Tue, 10 Mar 2015 02:37:25 +0000 (02:37 +0000)]
DataLayout is mandatory, update the API to reflect it with references.
Summary:
Now that the DataLayout is a mandatory part of the module, let's start
cleaning the codebase. This patch is a first attempt at doing that.
This patch is not exactly NFC as for instance some places were passing
a nullptr instead of the DataLayout, possibly just because there was a
default value on the DataLayout argument to many functions in the API.
Even though it is not purely NFC, there is no change in the
validation.
I turned as many pointer to DataLayout to references, this helped
figuring out all the places where a nullptr could come up.
I had initially a local version of this patch broken into over 30
independant, commits but some later commit were cleaning the API and
touching part of the code modified in the previous commits, so it
seemed cleaner without the intermediate state.
Frederic Riss [Tue, 10 Mar 2015 00:46:31 +0000 (00:46 +0000)]
DwarfAccelTable: Fix handling of hash collisions.
It turns out accelerator tables where totally broken if they contained
entries with colliding hashes. The failure mode is pretty bad, as it not
only impacted the colliding entries, but would basically make all the
entries after the first hash collision pointing in the wrong place.
The testcase uses the symbol names that where found to collide during a
clang build.
From a performance point of view, the patch adds a sort and a linear
walk over each bucket contents. While it has a measurable impact on the
accelerator table emission, it's not showing up significantly in clang
profiles (and I'd argue that correctness is priceless :-)).
Eric Christopher [Tue, 10 Mar 2015 00:33:27 +0000 (00:33 +0000)]
Temporarily revert r231726 and r231724 as they're breaking the build.:
Author: Lang Hames <lhames@gmail.com>
Date: Mon Mar 9 23:51:09 2015 +0000
[Orc][MCJIT][RuntimeDyld] Add header that was accidentally left out of r231724.
Author: Lang Hames <lhames@gmail.com>
Date: Mon Mar 9 23:44:13 2015 +0000
[Orc][MCJIT][RuntimeDyld] Add symbol flags to symbols in RuntimeDyld. Thread the
new types through MCJIT and Orc.
In particular, add a 'weak' flag. When plumbed through RTDyldMemoryManager, this
will allow us to distinguish between weak and strong definitions and find the
right ones during symbol resolution.
Lang Hames [Mon, 9 Mar 2015 23:44:13 +0000 (23:44 +0000)]
[Orc][MCJIT][RuntimeDyld] Add symbol flags to symbols in RuntimeDyld. Thread the
new types through MCJIT and Orc.
In particular, add a 'weak' flag. When plumbed through RTDyldMemoryManager, this
will allow us to distinguish between weak and strong definitions and find the
right ones during symbol resolution.
Ahmed Bougacha [Mon, 9 Mar 2015 22:51:05 +0000 (22:51 +0000)]
[CodeGen] Replace the reused stores' chain for extractelt expansion.
This fixes a subtle issue that was introduced in r205153.
When reusing a store for the extractelement expansion (to load directly
from it, inserting of going through the stack), later stores to the
same location might have overwritten the data we were expecting to
extract from.
To fix that, we need to explicitly replace the chain going out of the
reused store, so that later stores also have an explicit dependency on
the generated element-extracting loads, and can't clobber them.
Fix the double-deletion of AnalysisResolver when delegating through to
Dwarf EH preparation by creating one from scratch. Hopefully the new
pass manager simplifies this.