Artyom Skrobov [Tue, 20 Oct 2015 13:14:52 +0000 (13:14 +0000)]
Adding support for TargetLoweringBase::LibCall
Summary:
TargetLoweringBase::Expand is defined as "Try to expand this to other ops,
otherwise use a libcall." For ISD::UDIV and ISD::SDIV, the choice between
the two possibilities was defined in a rather convoluted way:
- if DIVREM is legal, expand to DIVREM
- if DIVREM has a custom lowering, expand to DIVREM
- if DIVREM libcall is defined and a remainder from the same division is
computed elsewhere, expand to a DIVREM libcall
- else, expand to a DIV libcall
This had the undesirable effect that if both DIV and DIVREM are implemented
as libcalls, then ISD::UDIV and ISD::SDIV are expanded to the heavier DIVREM
libcall, even when the remainder isn't used.
The new code adds a new LegalizeAction, TargetLoweringBase::LibCall, so that
backends can directly control whether they prefer an expansion or a conversion
to a libcall. This makes the generic lowering code even more generic,
allowing its reuse in a wider range of target-specific configurations.
The useful effect is that the ARM backend will now generate a call
to __aeabi_{i,u}div rather than __aeabi_{i,u}divmod in cases where
it doesn't need the remainder. There's no functional change outside
the ARM backend.
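A minimal sketch of how a backend can opt in to the new action (illustrative; the exact call sites and value types are target-specific):

  // In a target's TargetLowering constructor (sketch); LibCall is the
  // new LegalizeAction added by this patch.
  setOperationAction(ISD::SDIV, MVT::i32, LibCall);
  setOperationAction(ISD::UDIV, MVT::i32, LibCall);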
Artyom Skrobov [Tue, 20 Oct 2015 13:06:02 +0000 (13:06 +0000)]
Combining DIV+REM->DIVREM doesn't belong in LegalizeDAG; move it over into DAGCombiner.
Summary:
In addition to moving the code over, this patch amends the DIV,REM -> DIVREM
combining to run on all affected nodes at once: if the nodes are converted
to DIVREM one at a time, then the resulting DIVREM may get legalized by the
backend into something target-specific that we won't be able to recognize
and correlate with the remaining nodes.
The motivation is to "prepare terrain" for D13862: once we set DIV and REM
to be legalized to libcalls instead of DIVREM, we would otherwise lose the
ability to combine them. To prevent this, the DIV,REM -> DIVREM combining
needs to move out of the lowering stage.
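The pattern the combine targets, in miniature (a C++-level illustration, not code from the patch):

  int divrem(int a, int b, int *rem) {
    int q = a / b; // ISD::SDIV
    *rem = a % b;  // ISD::SREM: same operands, so both nodes can be
    return q;      // combined into a single ISD::SDIVREM
  }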
The mask value type for maskload/maskstore GCC builtins is never a vector of
packed floats/doubles.
This patch fixes the following issues:
1. The mask argument for builtin_ia32_maskloadpd and builtin_ia32_maskstorepd
should be of type llvm_v2i64_ty and not llvm_v2f64_ty.
2. The mask argument for builtin_ia32_maskloadpd256 and
builtin_ia32_maskstorepd256 should be of type llvm_v4i64_ty and not
llvm_v4f64_ty.
3. The mask argument for builtin_ia32_maskloadps and builtin_ia32_maskstoreps
should be of type llvm_v4i32_ty and not llvm_v4f32_ty.
4. The mask argument for builtin_ia32_maskloadps256 and
builtin_ia32_maskstoreps256 should be of type llvm_v8i32_ty and not
llvm_v8f32_ty.
Keno Fischer [Tue, 20 Oct 2015 10:13:55 +0000 (10:13 +0000)]
Fix missing INITIALIZE_PASS_DEPENDENCY for AddressSanitizer
Summary: In r231241, TargetLibraryInfoWrapperPass was added to
`getAnalysisUsage` for `AddressSanitizer`, but the corresponding
`INITIALIZE_PASS_DEPENDENCY` was not added.
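A sketch of the kind of registration line the fix adds, which sits between the pass's INITIALIZE_PASS_BEGIN and INITIALIZE_PASS_END macros:

  INITIALIZE_PASS_DEPENDENCY(TargetLibraryInfoWrapperPass)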
Matt Arsenault [Tue, 20 Oct 2015 03:59:58 +0000 (03:59 +0000)]
AMDGPU: Stop reserving v[254:255]
Reserving these registers wasn't doing anything useful. They weren't
explicitly used anywhere, and the RegScavenger ignores reserved registers.
For some reason, this caused a random scheduling change in the test.
Getting the check lines to pass is too frustrating, and there's probably
not too much value in checking the vector case's operands N times.
`normalizeForInvokeSafepoint` in RewriteStatepointsForGC.cpp, as it is
written today, deals with `gc.relocate` and `gc.result` uses of a
statepoint equally well. This change documents this fact and adds a
test case.
There is no functional change here -- only documentation of existing
functionality.
There are two things out of the ordinary in this commit. First, I made
a loop obviously "infinite" in HexagonInstrInfo.cpp. After checking if
an instruction was at the beginning of a basic block (in which case,
`break`), the loop decremented and checked the iterator for `nullptr` as
the loop condition. This has never been possible (the prev pointers have
always been circular, so even with the weird ilist/iplist implementation,
a null iterator can't occur), so I removed the condition.
Second, in HexagonAsmPrinter.cpp there was another case of comparing a
`MachineBasicBlock::instr_iterator` against `MachineBasicBlock::end()`
(which returns `MachineBasicBlock::iterator`). While not incorrect,
it's fragile. I switched this to `::instr_end()`.
All that said, no functionality change intended here.
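A sketch of the second fix (variable names are illustrative):

  MachineBasicBlock::instr_iterator I = MBB->instr_begin();
  // Before (fragile): compared an instr_iterator against end(), which
  // returns a MachineBasicBlock::iterator:
  //   while (I != MBB->end()) ...
  // After: compare against the matching iterator kind.
  while (I != MBB->instr_end()) {
    // ...
    ++I;
  }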
Cong Hou [Mon, 19 Oct 2015 23:16:40 +0000 (23:16 +0000)]
Enhance loop rotation with existence of profile data in MachineBlockPlacement pass.
Currently, in the MachineBlockPlacement pass, the loop is rotated to make the best exit the last BB in the loop chain, to maximize the fall-through from the loop to outside. With profile data, we can determine the cost in terms of missed fall-through opportunities when rotating a loop chain, and select the best rotation. Basically, there are three kinds of cost to consider for each rotation:
1. The possibly missed fall-through edge (if it exists) from a BB outside the loop to the loop header.
2. The possibly missed fall-through edges (if they exist) from the loop exits to BBs outside the loop.
3. The missed fall-through edge (if it exists) from the last BB to the first BB in the loop chain.
Therefore, the cost for a given rotation is the sum of the costs listed above, and we select the rotation with the smallest cost (see the sketch below). This applies only in PGO mode, where we have more precise edge frequencies.
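A hedged sketch of the selection, with all names illustrative rather than the pass's actual identifiers:

  #include <cstdint>
  // Sum the three missed fall-through frequencies for one candidate
  // rotation; the pass keeps the rotation with the smallest total.
  uint64_t rotationCost(uint64_t MissedEntryFreq,   // cost 1
                        uint64_t MissedExitFreqSum, // cost 2
                        uint64_t MissedIntraFreq) { // cost 3
    return MissedEntryFreq + MissedExitFreqSum + MissedIntraFreq;
  }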
As usual, this is a polymorphic hierarchy without polymorphic ownership,
so simply make the dtor protected and non-virtual, make the default copy
ctor/assign protected, and make the derived classes final. The derived
classes then pick up correct default public copy ops (and dtor) implicitly.
(I wish I could add -Wdeprecated to the build, but the last time I tried,
it triggered on some system headers I still need to look into.)
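The pattern, as a minimal sketch (class names are illustrative):

  class Base {
  protected:
    ~Base() = default;                       // non-virtual, protected
    Base(const Base &) = default;            // protected copy ops
    Base &operator=(const Base &) = default;
  public:
    Base() = default;
  };
  // Derived is final and implicitly picks up a correct public copy
  // ctor/assign and dtor.
  class Derived final : public Base {};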
Besides the usual, I finally added an overload to
`BasicBlock::splitBasicBlock()` that accepts an `Instruction*` instead
of `BasicBlock::iterator`. Someone can go back and remove this overload
later (after updating the callers I'm going to skip going forward), but
the most common call seems to be
`BB->splitBasicBlock(BB->getTerminator(), ...)` and I'm not sure it's
better to add `->getIterator()` to every one than have the overload.
It's pretty hard to get the usage wrong.
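A sketch of the overload and the call it simplifies (the exact signature may differ):

  // New convenience overload (sketch):
  //   BasicBlock *splitBasicBlock(Instruction *I, const Twine &Name = "");
  // It lets the common pattern stay as-is:
  BB->splitBasicBlock(BB->getTerminator(), "split");
  // ...instead of requiring:
  BB->splitBasicBlock(BB->getTerminator()->getIterator(), "split");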
Sanjay Patel [Mon, 19 Oct 2015 21:59:12 +0000 (21:59 +0000)]
[CGP] transform select instructions into branches and sink expensive operands
This was originally checked in at r250527, but reverted at r250570 because of PR25222.
There were at least 2 problems:
1. The cost check was checking for an instruction with an exact cost of TCC_Expensive;
that should have been >=.
2. The cause of the clang stage 1 failures was illegally sinking 'call' instructions;
we can't sink instructions that may have side effects / are not safe to execute speculatively.
Fixed those conditions in sinkSelectOperand() and added test cases.
Original commit message:
This is a follow-up to the discussion in D12882.
Ideally, we would like SimplifyCFG to be able to form select instructions even when the operands
are expensive (as defined by the TTI cost model) because that may expose further optimizations.
However, we would then like a later pass like CodeGenPrepare to undo that transformation if the
target would likely benefit from not speculatively executing an expensive op (this patch).
Once we have this safety mechanism in place, we can adjust SimplifyCFG to restore its
select-formation behavior that changed with r248439.
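Conceptually, the transformation (a C++-level illustration; CGP operates on IR selects, not source code):

  int expensive1(int);
  int expensive2(int);
  // An IR select evaluates both operands up front, roughly:
  int asSelect(bool c, int a) {
    int x = expensive1(a); // always executed
    int y = expensive2(a); // always executed
    return c ? x : y;      // the select merely picks a value
  }
  // CGP turns the select into a branch and sinks the expensive operands
  // into the arms, so only one side executes:
  int asBranch(bool c, int a) {
    if (c)
      return expensive1(a);
    return expensive2(a);
  }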
Jun Bum Lim [Mon, 19 Oct 2015 18:34:53 +0000 (18:34 +0000)]
[AArch64]Merge halfword loads into a 32-bit load
Convert two halfword loads into a single 32-bit word load with bitfield extract
instructions. For example:
ldrh w0, [x2]
ldrh w1, [x2, #2]
becomes
ldr w0, [x2]
ubfx w1, w0, #16, #16
and w0, w0, #0xffff
- Isolate the check for the existence of a stack frame into hasFP.
- Implement getFrameIndexReference for DWARF address computation.
- Use getFrameIndexReference for offset computation in eliminateFrameIndex.
- Preserve debug information for dynamically allocated stack objects.
- Prefer FP to access local objects at -O0.
- Add experimental code to skip allocframe when not strictly necessary
(disabled by default).
Lang Hames [Mon, 19 Oct 2015 17:43:51 +0000 (17:43 +0000)]
[Orc] Add support for emitting indirect stubs directly into the JIT target's
memory, rather than representing the stubs in IR. Update the CompileOnDemand
layer to use this functionality.
Directly emitting stubs is much cheaper than building them in IR and codegen'ing
them (see below). It also plays well with remote JITing - stubs can be emitted
directly in the target process, rather than having to send them over the wire.
The downsides are:
(1) Care must be taken when resolving symbols, as stub symbols are held in a
separate symbol table. This is only a problem for layer writers and other
people using this API directly. The CompileOnDemand layer hides this detail.
(2) Aliases of function stubs can't be symbolic any more (since there's no
symbol definition in IR), but must be converted into a constant pointer
expression. This means that modules containing aliases of stubs cannot be
cached. In practice this is unlikely to be a problem: There's no benefit to
caching such a module anyway.
On balance I think the extra performance is more than worth the trade-offs: In a
simple stress test with 10000 dummy functions requiring stubs and a single
executed "hello world" main function, directly emitting stubs reduced user time
for JITing / executing by over 90% (1.5s for IR stubs vs 0.1s for direct
emission).
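Conceptually, an indirect stub is a tiny trampoline that calls through a writable pointer (a C++ analogy, not the Orc implementation):

  int currentImpl();                 // whatever the stub resolves to today
  int (*StubPtr)() = &currentImpl;   // writable pointer slot
  int stub() { return StubPtr(); }   // the "stub": one indirect call
  // Re-pointing StubPtr re-routes every caller of stub() without
  // recompiling or relinking any of them.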
Teresa Johnson [Mon, 19 Oct 2015 14:30:44 +0000 (14:30 +0000)]
llvm-lto support for generating combined function indexes
Summary:
This patch adds support to llvm-lto that mirrors the support added by
r249270 to the gold plugin. This enables better testing of combined
index generation for ThinLTO.
Added a new test, and this support will be used in the test in D13515.
Asiri Rathnayake [Mon, 19 Oct 2015 11:44:24 +0000 (11:44 +0000)]
Fix mapping of @llvm.arm.ssat/usat intrinsics to ssat/usat instructions
The mapping of these two intrinsics in ARMInstrInfo.td had a small
omission which led to their operands not being validated/transformed
before being lowered into usat and ssat instructions. This can cause
incorrect instructions to be emitted.
I've also added tests for the remaining two saturating arithmetic
intrinsics, @llvm.arm.qadd and @llvm.arm.qsub, as they were missing
codegen tests.
James Molloy [Mon, 19 Oct 2015 08:54:59 +0000 (08:54 +0000)]
[GlobalsAA] Fix a really horrible iterator invalidation bug
We were keeping a reference to an object in a DenseMap and then mutating the map. At the end of the function we were attempting to clone that reference into other keys in the DenseMap, but DenseMap may well decide to resize its hashtable, which would invalidate the reference!
It took an extremely complex testcase to catch this - many thanks to Zhendong Su for catching it in PR25225.
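The hazard in miniature (illustrative types, not the GlobalsAA code):

  #include "llvm/ADT/DenseMap.h"
  using namespace llvm;

  struct Info { int Data; };

  void hazard(DenseMap<int, Info> &M, int K1, int K2) {
    Info &Ref = M[K1]; // reference into the map's storage
    M[K2] = Ref;       // operator[] may grow/rehash: Ref can dangle
  }
  void fixed(DenseMap<int, Info> &M, int K1, int K2) {
    Info Copy = M[K1]; // copy the value out first
    M[K2] = Copy;      // safe: no live reference into the map
  }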
Removed parameter "Consecutive" from isLegalMaskedLoad() / isLegalMaskedStore().
Originally I planned to use the same interface for masked gather/scatter and set isConsecutive to "false" in this case.
Now I'm implementing masked gather/scatter and see that the interface is inconvenient. I want to add interfaces isLegalMaskedGather() / isLegalMaskedScatter() instead of using the "Consecutive" parameter in the existing interfaces.
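The interface change, sketched as declarations (the actual parameter lists in TargetTransformInfo may differ):

  // Before:
  //   bool isLegalMaskedLoad(Type *DataType, int Consecutive);
  //   bool isLegalMaskedStore(Type *DataType, int Consecutive);
  // After:
  bool isLegalMaskedLoad(Type *DataType);
  bool isLegalMaskedStore(Type *DataType);
  // Planned follow-ups for the non-consecutive cases:
  //   bool isLegalMaskedGather(Type *DataType);
  //   bool isLegalMaskedScatter(Type *DataType);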
1. Key constant values (version, magic) and data structures related to raw and
indexed profile format are moved into one centralized file: InstrProf.h.
2. Utility functions such as the MD5Hash computation are also moved to the common
header to allow sharing with other components in the future.
3. A header data structure is introduced for Indexed format so that the reader
and writer can always be in sync.
4. Added some comments to document the different places where multiple
definitions of the data structures must be kept in sync (reader/writer,
runtime, lowering, etc.). No functional change is intended.
Simon Pilgrim [Sat, 17 Oct 2015 11:40:05 +0000 (11:40 +0000)]
[InstCombine] SSE4A constant folding and conversion to shuffles.
This patch improves support for combining the SSE4A EXTRQ(I) and INSERTQ(I) intrinsics:
1 - Converts INSERTQ/EXTRQ calls to INSERTQI/EXTRQI if the 'bit index' and 'length' operands are constant
2 - Converts INSERTQI/EXTRQI calls to shufflevector if the bit index/length are both byte aligned (we can already lower shuffles to INSERTQI/EXTRQI if it's useful)
3 - Constant folding support
4 - Add zeroinitializer handling
Matthias Braun [Sat, 17 Oct 2015 00:46:57 +0000 (00:46 +0000)]
RegisterPressure: allocatable physreg uses are always kills
This property was already used in the code path when no liveness
intervals are present. Unfortunately, the code path that uses liveness
intervals tried to query a cached live interval for an allocatable
physreg; those are usually not computed, so a conservative default was
used.
This doesn't affect any of the lit testcases. It is groundwork for
upcoming changes that should be NFC, but which wouldn't be NFC without
this patch.
Matthias Braun [Sat, 17 Oct 2015 00:35:59 +0000 (00:35 +0000)]
RegisterPressure: Remove 0 entries from PressureChange
This should not change behaviour: as far as I can see, all code reading
the pressure changes has no effect when PressureInc is 0.
Removing these entries however does avoid unnecessary computation, and
results in a more stable debug output. I want the stable debug output to
check that some upcoming changes are indeed NFC and identical even at
the debug output level.
JF Bastien [Sat, 17 Oct 2015 00:25:38 +0000 (00:25 +0000)]
WebAssembly: don't omit dead vregs from locals
Summary:
This is a temporary hack until we get around to remapping the vreg
numbers to local numbers. Dead vregs cause bad numbering and make
consumers sad.
We could also just look at debug info and use named locals instead, but
vregs have to work properly anyway, so there!