Adam Nemet [Wed, 13 Jul 2016 05:01:48 +0000 (05:01 +0000)]
[BFI] Add new LazyBFI analysis pass
Summary:
This is necessary for D21771. In order to add the hotness attribute to
optimization remarks we need BFI to be available in all passes that emit
optimization remarks.
However, we don't want to pay for computing BFI unless the hotness
attribute is requested.
This is achieved by making BFI lazy at a very high level through a new
analysis pass -- BFI is not calculated unless requested.
I am adding a test to check the laziness under D21771 where the first
user of the analysis is added.
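As a rough illustration of the "compute only when requested" idea described above, here is a minimal generic sketch; LazyAnalysis and its interface are hypothetical and are not the actual LazyBlockFrequencyInfoPass API:

  #include <functional>
  #include <memory>

  // Hypothetical helper: compute a result only the first time it is requested.
  template <typename Result>
  class LazyAnalysis {
    std::function<Result()> Compute;   // how to compute the analysis
    std::unique_ptr<Result> Cached;    // empty until first use
  public:
    explicit LazyAnalysis(std::function<Result()> C) : Compute(std::move(C)) {}
    Result &get() {
      if (!Cached)                              // the expensive computation runs
        Cached.reset(new Result(Compute()));    // here, and only if someone asks
      return *Cached;
    }
  };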
This achieves the same result as before by using line wrapping. This allows
us to have one keyword per line, which makes adding a new keyword significantly
easier, especially if keywords are kept in lexicographical order, as you no
longer need to reflow the content around the insertion.
This only covers the keywords, as that is the group that changes most often.
Teresa Johnson [Wed, 13 Jul 2016 03:42:41 +0000 (03:42 +0000)]
[ThinLTO/gold] ThinLTO internalization fixes
Internalization was missing cases where we originally had a local symbol
that was promoted eagerly but not actually exported. This is because we
were only internalizing the set of global (non-local) symbols that were
PREVAILING_DEF_IRONLY. Instead, collect the set of global symbols that
are referenced outside of a single IR file, and skip internalization for
those.
[X86] Remove some seemingly unnecessary patterns that supported vector zext/sext with 256-bit source types producing a 256-bit result.
These patterns just extracted the source down to 128 bits to use the instructions. AVX512 seems to have blindly copied them over for VLX, but did not create similar patterns for 512-bit sources. So I'm hoping the backend can't actually produce these cases.
The expandAddRecExprLiterally function incorrectly transforms
`[Start + Step * X]` into `Step * [Start + X]` instead of the correct
transform of `[Step * X] + Start`.
This caused https://github.com/JuliaLang/julia/issues/14704#issuecomment-174126219
due to what appeared to be sufficiently complicated loop interactions.
Patch by Jameson Nash (jameson@juliacomputing.com).
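To see why the two forms differ, here is a tiny numeric check with made-up values (Start = 5, Step = 3, X = 4); this is purely illustrative and not taken from the patch:

  #include <cstdio>

  int main() {
    int Start = 5, Step = 3, X = 4;
    int correct = Step * X + Start;    // [Step * X] + Start  == 17
    int buggy   = Step * (Start + X);  // Step * [Start + X]  == 27
    std::printf("correct=%d buggy=%d\n", correct, buggy);
    return 0;
  }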
Tim Northover [Tue, 12 Jul 2016 22:23:42 +0000 (22:23 +0000)]
GlobalISel: freeze reserved regs after IRTranslator.
We can freeze the registers after the MachineFrameInfo has been configured (by
telling it about calls, inline asm, ...). This doesn't happen at all yet, but
will be part of IR translation.
The LCSSA pass itself will not generate several redundant PHI nodes in a single
exit block. However, such redundant PHI nodes don't violate LCSSA form, and may
be introduced by passes that preserve LCSSA, and/or preserved by the LCSSA pass
itself. So, assuming a single PHI node per exit block is not safe.
Summary:
Refactored the profitability analysis out of the IC promotion pass and
into lib/Analysis so that it can be accessed by the summary index
builder in a follow-on patch to enable IC promotion in ThinLTO (D21932).
[Power9] Add codegen for VSX word insert/extract instructions
This patch corresponds to review:
http://reviews.llvm.org/D20239
It adds exploitation of XXINSERTW and XXEXTRACTUW instructions that
are useful in some cases for inserting and extracting vector elements of
v4[if]32 vectors.
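For context, source-level element accesses of the kind these instructions help with might look like the following (a hedged sketch using Clang/GCC vector extensions; not taken from the patch's tests):

  typedef float v4f32 __attribute__((vector_size(16)));

  float extract_lane(v4f32 V) { return V[2]; }                // vector extract
  v4f32 insert_lane(v4f32 V, float F) { V[1] = F; return V; } // vector insert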
X86FixupBWInsts: No need for forward liveness analysis.
With r274952 and r275201 in place there are no cases left where a
forward liveness analysis yields different results than a backward one.
So we can remove the forward stepping logic.
BranchFolding: Use LivePhysRegs to update live-in lists.
Use LivePhysRegs with a backward-walking algorithm to update live-in
lists; this way the results no longer depend on the presence of kill
flags.
This patch also reduces the number of registers added as live-in.
Previously all pristine registers, as well as all subregisters of a
super register, were added, resulting in unnecessarily large live-in
lists. This fixes https://llvm.org/PR25263.
[PowerPC] Canonicalize applicable vector shift immediates as swaps
This patch corresponds to review:
http://reviews.llvm.org/D21358
Vector shifts that have the same semantics as a vector swap are canonicalized
as such to provide additional opportunities for the swap removal optimization
to remove unnecessary swaps.
Added support for:
1. Multi-dimensional arrays.
2. Arrays of structure types, which were previously declared incompletely.
3. Dynamically sized arrays.
4. Arrays whose element type is a typedef, volatile, or constant (this should resolve PR28311).
Summary:
Previously, constant index insertelements would be turned into SI_INDIRECT_DST,
which is bound to prevent some optimization opportunities. Worse, it misled
the heuristic that decides whether immediates should be lowered to S_MOV_B32
or V_MOV_B32 in a way that resulted in unnecessary v_readfirstlanes.
Avoid implicit conversions from MachineInstrBundleIterator to
MachineInstr*, mainly by preferring MachineInstr& over MachineInstr* and
using range-based for loops.
Ivan Krasin [Tue, 12 Jul 2016 02:38:37 +0000 (02:38 +0000)]
Print remarks from WholeProgramDevirt pass for each call site.
Summary:
It's useful to have some visibility into which call sites are devirtualized,
especially for debugging purposes. Another use case is a regression test on
the application side (e.g., Chromium).
Avoid implicit iterator conversions from MachineInstrBundleIterator to
MachineInstr* in the Hexagon backend, mostly by preferring MachineInstr&
over MachineInstr* and switching to range-based for loops.
There's a long tail of API cleanup here, but I'm planning to leave the
rest to the Hexagon maintainers. HexagonInstrInfo defines many of its
own predicates, and most of them still take MachineInstr*. Some of
those actually check for nullptr, so I didn't feel comfortable changing
them to MachineInstr& en masse.
Avoid implicit conversions from MachineInstrBundleIterator to
MachineInstr* in the Mips backend, mainly by preferring MachineInstr&
over MachineInstr* when a pointer isn't nullable and using range-based
for loops.
Avoid implicit conversions from MachineInstrBundleIterator to
MachineInstr* in the SystemZ backend, mainly by preferring MachineInstr&
over MachineInstr* and using range-based for loops.
Mehdi Amini [Mon, 11 Jul 2016 23:10:18 +0000 (23:10 +0000)]
Add a libLTO API to query a memory buffer and check if it contains ObjC categories
The linker supports a feature to force load an object from a static
archive if it defines an Objective-C category.
This API supports that feature by looking at every section in the module to
determine whether the module defines an Objective-C category.
This patch simplifies the graph builder by encoding nodes as {Value,
Dereference Level} pairs. This lets us kill edge types, and allows us to
get rid of hacks in StratifiedSets (like addAttrsBelow/...). This
simplification also allows us to remove InstantiatedRelations and
InstantiatedAttrs.
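A rough sketch of the node encoding described above; the names GraphNode and DerefLevel are illustrative and are not the actual CFLGraph types:

  namespace llvm { class Value; }

  // Each node is a (Value, dereference level) pair; level 0 is the value
  // itself, level 1 is what it points to, and so on.
  struct GraphNode {
    const llvm::Value *Val;
    unsigned DerefLevel;
  };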
Summary:
This patch aims to correct the ordering of loads/stores. It changes the
insert point for loads to the position of the first load, and updates the
ordering method for loads to insert before, rather than after.
Before this patch, the sequence
"load a[1], store a[1], store a[0], load a[2]"
would incorrectly vectorize to "store a[0,1], load a[1,2]".
The correctness check was assuming the insertion point for loads is at
the position of the first load, when in practice it was at the last
load. An alternative fix would have been to invert the correctness check.
The current fix changes insert position but also requires reordering of
instructions before the vectorized load.
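A hedged source-level illustration of the problematic sequence (names and values are made up; the commit's own example is the quoted load/store sequence above):

  int f(int *a, int x, int y) {
    int t0 = a[1]; // load a[1]  -- must observe the old value of a[1]
    a[1] = x;      // store a[1]
    a[0] = y;      // store a[0]
    int t1 = a[2]; // load a[2]
    // If the two loads were merged into a single vector load placed after the
    // stores ("store a[0,1], load a[1,2]"), t0 would wrongly see the new a[1].
    return t0 + t1;
  }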
Tim Northover [Mon, 11 Jul 2016 22:29:37 +0000 (22:29 +0000)]
ARM: validate immediate branch targets in AsmParser.
Immediate branch targets aren't commonly used, but if they are, we should make
sure they can actually be encoded. This means they must be divisible by 2 when
targeting Thumb mode, and by 4 when targeting ARM mode.
Also do a little naming cleanup while I was changing everything around anyway.
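A minimal sketch of the alignment rule being enforced (a hypothetical helper, not the actual AsmParser code):

  #include <cstdint>

  // Thumb branch targets must be 2-byte aligned, ARM targets 4-byte aligned.
  bool isEncodableBranchTarget(int64_t Target, bool IsThumb) {
    return IsThumb ? (Target % 2 == 0) : (Target % 4 == 0);
  }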
AMDGPU: Treat texture gather instructions more like other MIMG instructions
Summary:
Setting MIMG to 0 has a bunch of unexpected side effects, including that
isVMEM returns false, which leads to incorrect treatment in the hazard
recognizer. The reason I noticed it is that it also leads to incorrect
treatment in VGPR-to-SGPR copies, which is one cause of the referenced bug.
The only reason MIMG was set to 0 was to signal the special handling of
dmasks, but that can be checked differently.
[pdb] Add a pdb2yaml option to not dump file headers.
This will be useful once we start adding the ability to dump type
records and symbol records, since it will allow us to generate
mergeable information instead of information that specifies an
entire file.
Summary:
The main bug fix here is using the 32-bit encoding of V_ADD_I32 in
materializeFrameBaseRegister and resolveFrameIndex, so that arbitrary
immediates work.
The second part is that we may now require the SegmentWaveByteOffset
even when there are initially no stack objects and VGPR spilling isn't
enabled, for stack slots that are allocated later. This means that some
bits become effectively dead and can be cleaned up.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=96602
Tested-by: Kai Wasserbäch <kai@dev.carbon-project.org>
Reviewers: arsenm, tstellarAMD
Blocks to be tail-merged may share more than one successor. Correct the
comment to state that they share a specific successor, SuccBB, rather
than a single successor, which is not true.
This bug (llvm.org/PR28124) was introduced by r237977, which refactored
the tail call sequence to be generated in two passes instead of one.
Unfortunately, the stack adjustment produced by the first pass was not
recognized by X86FrameLowering::mergeSPUpdates() in all cases, causing
code such as the following, which clobbers the return address, to be
generated:
Add TLI.allowsMisalignedMemoryAccesses to LoadStoreVectorizer
Summary: Extend TTI to access TLI.allowsMisalignedMemoryAccesses(). Check condition when vectorizing load and store chains.
Add additional parameters: AddressSpace, Alignment, Fast.
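A hedged sketch of how a vectorizer might gate a wide access on such a hook; TTIStub and its signature are stand-ins, and the real TargetTransformInfo interface may differ:

  // Toy stand-in for the target hook: misaligned accesses are legal but only
  // "fast" when the alignment covers the whole access.
  struct TTIStub {
    bool allowsMisalignedMemoryAccesses(unsigned BitWidth, unsigned AddrSpace,
                                        unsigned Alignment, bool *Fast) const {
      *Fast = Alignment >= BitWidth / 8;
      return true;
    }
  };

  bool okToVectorize(const TTIStub &TTI, unsigned BitWidth, unsigned AddrSpace,
                     unsigned Alignment) {
    bool Fast = false;
    return TTI.allowsMisalignedMemoryAccesses(BitWidth, AddrSpace, Alignment,
                                              &Fast) && Fast;
  }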
It is an optimization pass and should not run at -O0, especially since Fast RA
will not do the required register coalescing anyway, so it's a loss even from
the optimization standpoint.
This also works around (but doesn't quite fix) PR28489.
[SystemZ] Recognize Load On Condition Immediate (LOCHI/LOGHI) opportunities
Summary: Add support for the z13 instructions LOCHI and LOCGHI, which
conditionally load immediate values. Add target instruction info hooks so
that if-conversion will allow predication of LHI/LGHI.
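A source-level pattern (hypothetical) that such a conditional immediate load can implement without a branch once if-conversion predicates it:

  int clamp_tag(int X) {
    if (X > 7)
      X = 42;   // conditionally load the immediate 42 into X
    return X;
  }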
Davide Italiano [Mon, 11 Jul 2016 18:10:06 +0000 (18:10 +0000)]
[PM/IPO] Port LowerTypeTests to the new PassManager.
There's a little bit of churn in this patch because the initialization
mechanism is now shared between the old and the new PM. Other than
that, it's just a pretty mechanical translation.
Fix the assertion failure caused by http://reviews.llvm.org/D22118
Summary: http://reviews.llvm.org/D22118 uses metadata to store the call count, which makes it possible for branch weights to have only one element. Also fix the assertion failure in the inliner by extending the instruction type check to include "invoke" instructions.
Implement callsite-hotness based inline cost for Sample-based PGO
Summary:
For sample-based PGO, using BFI to calculate the callsite count is sometimes inaccurate. This is because, with a sampling-based approach, if a callsite resides in a hot loop deeply nested inside a bunch of cold branches, the callsite's BFI frequency will be calculated inaccurately due to the lack of samples in the cold branches.
E.g.
if (A1 && A2 && A3 && ..... && A10) {
  for (i = 0; i < 100000000; i++) {
    callsite();
  }
}
Assume that A1 to A10 are all 100% taken, and that callsite has 1000 samples
and is thus considered hot. Because the loop's trip count is huge, it is normal
that all branches outside the loop have no samples at all. As a result, we can
only use static branch probability to derive the frequency of the loop header.
Assuming the static heuristic thinks each branch is 50% taken, the count
calculated from BFI will be 1/(2^10) of the actual value.
In order to get a more accurate callsite count, we directly annotate the weight
on the call instruction and use it directly when checking callsite hotness.
Note that this mechanism can also be shared by instrumentation-based callsite
hotness analysis. A side benefit is that it breaks the dependency from the
Inliner on BFI, as the call count is embedded in the IR.
Tune the weight propagation algorithm for sample profile.
Summary: Handle the case where a visited basic block has only one incoming/outgoing edge: use the block weight to adjust the edge weight even when the edge has been visited before. This can help reduce inaccuracies introduced by an incorrect basic block profile, as shown in the updated unittest.
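Hedged pseudocode for the single-edge rule described above; the names are invented and this is not the actual SampleProfile implementation:

  #include <cstdint>

  // If a visited block has exactly one incoming (or outgoing) edge, the block
  // weight is authoritative for that edge, even if the edge was already set.
  void adjustSingleEdge(unsigned NumEdges, bool EdgeVisited,
                        uint64_t BlockWeight, uint64_t &EdgeWeight) {
    if (NumEdges == 1)
      EdgeWeight = BlockWeight; // override the previously propagated weight
    (void)EdgeVisited;          // applies even when the edge was already visited
  }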