Alexey Bataev [Mon, 31 Oct 2016 12:10:53 +0000 (12:10 +0000)]
Improved cost model for FDIV and FSQRT, by Andrew Tischenko
There is a bug describing poor cost model for floating point operations:
Bug 29083 - [X86][SSE] Improve costs for floating point operations. This
patch is the second one in series of patches dealing with cost model.
Dorit Nuzman [Sun, 30 Oct 2016 12:23:26 +0000 (12:23 +0000)]
[LoopVectorize] Make interleaved-accesses analysis less conservative about
possible pointer-wrap-around concerns, in some cases.
Before this patch, collectConstStridedAccesses (part of interleaved-accesses
analysis) called getPtrStride with [Assume=false, ShouldCheckWrap=true] when
examining all candidate pointers. This is too conservative. Instead, this
patch makes collectConstStridedAccesses use an optimistic approach, calling
getPtrStride with [Assume=true, ShouldCheckWrap=false], and then, once the
candidate interleave groups have been formed, revisits the pointer-wrapping
analysis but only where it matters: namely, in groups that have gaps, and where
the gaps are not at the very end of the group (in which case the loop is
peeled). This second time getPtrStride is called with [Assume=false,
ShouldCheckWrap=true], but this could further be improved to using Assume=true,
once we also add the logic to track that we are not going to meet the scev
runtime checks threshold.
Craig Topper [Sun, 30 Oct 2016 06:56:16 +0000 (06:56 +0000)]
[X86] Use intrinsics table for PMADDUBSW and PMADDWD so that we can use the legacy intrinsics to select EVEX encoded instructions when available.
This removes a couple tablegen classes that become unused after this change. Another class gained an additional parameter to allow PMADDUBSW to specify a different result type from its input type.
Teresa Johnson [Sun, 30 Oct 2016 05:40:44 +0000 (05:40 +0000)]
[ThinLTO] Use per-summary flag to prevent exporting locals used in inline asm
Summary:
Instead of using the workaround of suppressing the entire index for
modules that call inline asm that may reference locals, use the
NoRename flag on the summary for any locals in the llvm.used set, and
add a reference edge from any functions containing inline asm.
This avoids issues from having no summaries despite the module defining
global values, which was preventing more aggressive index-based
optimization. It will be followed by a subsequent patch to make a
similar fix for local references in module level asm (to fix PR30610).
Teresa Johnson [Sun, 30 Oct 2016 05:15:23 +0000 (05:15 +0000)]
[ThinLTO] Correctly resolve linkonce when importing aliasee
Summary:
When we have an aliasee that is linkonce, while we can't convert
the non-prevailing copies to available_externally, we still need to
convert the prevailing copy to weak. If a reference to the aliasee
is exported, not converting a copy to weak will result in undefined
references when the linkonce is removed in its original module.
Teresa Johnson [Sat, 29 Oct 2016 21:31:48 +0000 (21:31 +0000)]
[ThinLTO] Use NoPromote flag in summary during promotion
Summary:
Replace the check of whether a GV has a section with the flag check
in the summary. This is in preparation for using the NoPromote flag
to convey other situations when we can't promote (e.g. locals used in
inline asm).
Sanjay Patel [Sat, 29 Oct 2016 16:21:19 +0000 (16:21 +0000)]
[ValueTracking] recognize more variants of smin/smax
Try harder to detect obfuscated min/max patterns: the initial pattern was added with D9352 / rL236202.
There was a bug fix for PR27137 at rL264996, but I think we can do better by folding the corresponding
smax pattern and commuted variants.
The codegen tests demonstrate the effect of ValueTracking on the backend via SelectionDAGBuilder. We
can't expose these differences minimally in IR because we don't have smin/smax intrinsics for IR.
Simon Pilgrim [Sat, 29 Oct 2016 11:29:39 +0000 (11:29 +0000)]
[DAGCombiner] (REAPPLIED) Add vector demanded elements support to computeKnownBits
Currently computeKnownBits returns the common known zero/one bits for all elements of vector data, when we may only be interested in one/some of the elements.
This patch adds a DemandedElts argument that allows us to specify the elements we actually care about. The original computeKnownBits implementation calls with a DemandedElts demanding all elements to match current behaviour. Scalar types set this to 1.
The approach was found to be easier than trying to add a per-element known bits solution, for a similar usefulness given the combines where computeKnownBits is typically used.
I've only added support for a few opcodes so far (the ones that have proven straightforward to test), all others will default to demanding all elements but can be updated in due course.
DemandedElts support could similarly be added to computeKnownBitsForTargetNode in a future commit.
This looked like this had caused compile time regressions on some buildbots (and was reverted in rL285381), but appears to have just been a harmless bystander!
Matthias Braun [Sat, 29 Oct 2016 01:03:41 +0000 (01:03 +0000)]
AArch64DeadRegisterDefinitionsPass: Cleanup; NFC
- Fix doxygen file comment
- reduce indentation in loop
- Factor out some common subexpressions
- Move independent helper function out of class
- Fix Changed flag (this is not strictly NFC but a bugfix, but the flag
seems ignored anyway)
Matt Arsenault [Fri, 28 Oct 2016 21:55:08 +0000 (21:55 +0000)]
AMDGPU: Rename glc operand type
While trying to add the glc bit to SMEM instructions on VI
with the new refactoring I ran into some kind of shadowing
problem for the glc operand when using the pseudoinstruction
as a multiclass parameter.
Everywhere that currently uses it defines the operand to have the same
name as its type, i.e. glc:$glc which works. For some reason now it
conflicts, and its up evaluating to the wrong thing. For the
real encoding classes,
let Inst{16} = !if(ps.has_glc, glc, ?); was not being evaluated
and still visible in the Inst initializer in the expanded td file.
In other cases I got a a different error about an illegal operand
where this was using { 0 } initializer from the bits<1> glc initializer
instead of evaluating it as false in the if.
For consistency all of the operand types should probably
be captialized to avoid conflicting with the variable names
unless somebody has a better idea of how to fix this.
Justin Lebar [Fri, 28 Oct 2016 21:43:54 +0000 (21:43 +0000)]
Don't leave unused divs/rems sitting around in BypassSlowDivision.
Summary:
This "pass" eagerly creates div and rem instructions even when only one
is needed -- it relies on a later pass (machine DCE?) to clean them up.
This is problematic not just from a cleanliness perspective (this pass
is running during CodeGenPrepare, so should leave the IR in a better
state), but it also creates a problem for instruction selection. If we
always have a div+rem, isel will always select a divrem instruction (if
possible), even when a single div or rem would do.
Specifically, in NVPTX, we want to compute rem from the output of div,
if available. But if a div is not available, we want to leave the rem
alone. This transformation is overeager if div is always available.
Because this code runs as part of CodeGenPrepare, it's nontrivial to
write a test for this change. But this will effectively be tested by
a later patch which adds the aforementioned change to NVPTX isel.
Justin Lebar [Fri, 28 Oct 2016 21:43:51 +0000 (21:43 +0000)]
Don't claim the udiv created in BypassSlowDivision is exact.
Summary:
In BypassSlowDivision's short-dividend path, we would create e.g.
udiv exact i32 %a, %b
"exact" here means that we are asserting that %a is a multiple of %b.
But we have no reason to believe this must be true -- this is just a
bug, as far as I can tell.
Handle non-~0 lane masks on live-in registers in LivePhysRegs
When LivePhysRegs adds live-in registers, it recognizes ~0 as a special
lane mask indicating the entire register. If the lane mask is not ~0,
it will only add the subregisters that overlap the specified lane mask.
The problem is that if a live-in register does not have subregisters,
and the lane mask is not ~0, it will not be added to the live set.
(The given lane mask may simply be the lane mask of its register class.)
If a register does not have subregisters, add it to the live set if
the lane mask is non-zero.
Matt Arsenault [Fri, 28 Oct 2016 19:43:31 +0000 (19:43 +0000)]
AMDGPU: Fix using incorrect private resource with no allocation
It's possible to have a use of the private resource descriptor or
scratch wave offset registers even though there are no allocated
stack objects. This would result in continuing to use the maximum
number reserved registers. This could go over the number of SGPRs
available on VI, or violate the SGPR limit requested by
the function attributes.
Teresa Johnson [Fri, 28 Oct 2016 19:36:00 +0000 (19:36 +0000)]
[ThinLTO] Use flags from summary when writing variable summary (NFC)
We already read the flags out of the summary when writing the summary
records for functions and aliases, do the same for variables.
This is an NFC change for now since the flags computed on the fly from
the GlobalValue currently will always match those in the summary
already, but once I send a follow-on patch to set the NoRename flag for
locals in the llvm.used set this becomes a necessary change.
Lang Hames [Fri, 28 Oct 2016 18:24:15 +0000 (18:24 +0000)]
[Error] Unify +Asserts/-Asserts behavior for checked flags in Error/Expected<T>.
(1) Switches to raw pointer and bitmasking operations for Error payload.
(2) Always includes the 'unchecked' bitfield in Expected<T>, even in -Asserts.
(3) Always propagates checked bit status in move-ops for both classes, even in
-Asserts.
This should allow debug programs to link against release libraries without
encountering spurious 'unchecked error' terminations.
Error checks still aren't verified in release mode so this doesn't introduce
any new control flow, but it does require new bit-masking ops in release mode
to preserve the flag values during move ops. I expect the overhead to be
minimal, but if we discover any corner cases where it matters we could fix
this by making flag propagation conditional on a new build option.
Matthias Braun [Fri, 28 Oct 2016 18:05:05 +0000 (18:05 +0000)]
TargetPassConfig: Move addPass of IPRA RegUsageInfoProp down.
TargetPassConfig::addMachinePasses() does some housekeeping first:
Handling the -print-machineinstrs flag and doing an initial printing
"After Instruction Selection". There is no reason for RegUsageInfoProp
to run before those two steps.
Tom Stellard [Fri, 28 Oct 2016 15:32:28 +0000 (15:32 +0000)]
[Loads] Fix crash in is isDereferenceableAndAlignedPointer()
Summary:
We were trying to add APInt values with different bit sizes after
visiting an addrspacecast instruction which changed the bit width
of the pointer.
Teresa Johnson [Fri, 28 Oct 2016 15:30:27 +0000 (15:30 +0000)]
[cmake] Temporarily revert enforcement of minimum GCC version increase
Summary:
This is temporary, until bot that builds public facing LLVM
documentation is upgraded. It reverts only the cmake change in r284497,
but leaves the doc changes in place to preserve intent.
Davide Italiano [Fri, 28 Oct 2016 02:47:09 +0000 (02:47 +0000)]
[Reassociate] Removing instructions mutates the IR.
Fixes PR 30784. Discussed with Justin, who pointed out that
in the new PassManager infrastructure we can have more fine-grained
control on which analyses we want to preserve, but this is the
best we can do with the current infrastructure.
Teresa Johnson [Fri, 28 Oct 2016 02:39:38 +0000 (02:39 +0000)]
[ThinLTO] Create AliasSummary when building index
Summary:
Previously we were creating the alias summary on the fly while writing
the summary to bitcode. This moves the creation of these summaries to
the module summary index builder where we build the rest of the summary
index.
This is going to be necessary for setting the NoRename flag for values
possibly used in inline asm or module level asm.