Roman Lebedev [Wed, 3 Jul 2019 09:41:35 +0000 (09:41 +0000)]
[Codegen][X86][AArch64][ARM][PowerPC] Inc-of-add vs sub-of-not (PR42457)
Summary:
This is the backend part of [[ https://bugs.llvm.org/show_bug.cgi?id=42457 | PR42457 ]].
In the middle-end, we'd want to prefer the form with two adds - D63992 -
but as this diff shows, not every target prefers that pattern.
Out of the 4 targets for which I added tests, all seem to be OK with inc-of-add for scalars,
but only X86 prefers that same pattern for vectors.
Here I'm adding a new TLI hook, always defaulting to inc-of-add,
but adding AArch64, ARM and PowerPC overrides that prefer inc-of-add only for scalars.
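A minimal sketch of what such a target override can look like (the hook name follows the summary above; the exact signature and placement are assumptions):

    // Sketch: prefer (add (add x, y), 1) only for scalar types; vectors
    // keep the sub-of-not form on this target.
    bool AArch64TargetLowering::preferIncOfAddVsSubOfNot(EVT VT) const {
      return !VT.isVector();
    }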
[SCEV][LSR] Prevent using undefined value in binops
On some occasions ReuseOrCreateCast may convert a previously
expanded value to undef. That value may then be passed by
SCEVExpander as an argument to InsertBinop, making the IV chain
undefined.
Summary:
Handling callbr is very similar to handling an inline assembly call:
MSan must check the instruction's inputs.
callbr doesn't (yet) have outputs, so there's nothing to unpoison,
and conservative assembly handling doesn't apply either.
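A hedged sketch of what that handling can look like in an MSan-style visitor (names and structure are illustrative assumptions, not the actual MemorySanitizer code):

    // callbr inputs are checked like inline-asm inputs; there are no
    // outputs yet, so no result shadow needs to be stored.
    void visitCallBrInst(CallBrInst &I) {
      for (Value *Op : I.args())
        insertShadowCheck(Op, &I); // report if any input is poisoned
    }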
Teresa Johnson [Wed, 3 Jul 2019 02:14:47 +0000 (02:14 +0000)]
[ThinLTO] Reenable test with workaround for known failure
Reenable the testing disabled in r364978 with the same workaround used
for this failure in the cfi-devirt.ll test. The known issue is PR39436,
and the workaround is to add -verify-machineinstrs=0.
[AArch64][GlobalISel] Overhaul legalization & isel of shifts to select immediate forms.
There are two main issues preventing us from generating immediate form shifts:
1) We have partial SelectionDAG imported support for G_ASHR and G_LSHR shift
immediate forms, but they currently don't work because the amount type is
expected to be an s64 constant, while we only legalize them to have homogeneous
types.
To deal with this, first we introduce a custom legalizer to *only* custom legalize
s32 shifts which have a constant operand into an s64 (a sketch follows this list).
There is also an additional artifact combiner to fold zexts(g_constant) to a
larger G_CONSTANT if it's legal, a counterpart to the anyext version committed
in an earlier patch.
2) For G_SHL the importer can't cope with the pattern. For this I introduced an
early selection phase in the arm64 selector to select these forms manually
before the tablegen selector pessimizes it to a register-register variant.
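A rough sketch of the custom legalization from (1), assuming it widens a constant amount in place (helper names and details approximate the GlobalISel API and are not copied from the patch):

    // If the s32 shift amount is a G_CONSTANT, rebuild it as an s64
    // constant so the imported immediate patterns can match.
    bool legalizeShlAshrLshr(MachineInstr &MI, MachineRegisterInfo &MRI,
                             MachineIRBuilder &MIRBuilder) {
      Register AmtReg = MI.getOperand(2).getReg();
      MachineInstr *Def = MRI.getVRegDef(AmtReg);
      if (!Def || Def->getOpcode() != TargetOpcode::G_CONSTANT)
        return true; // non-constant amounts are left alone
      int64_t Amt = Def->getOperand(1).getCImm()->getSExtValue();
      auto NewAmt = MIRBuilder.buildConstant(LLT::scalar(64), Amt);
      MI.getOperand(2).setReg(NewAmt.getReg(0));
      return true;
    }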
Matt Arsenault [Wed, 3 Jul 2019 00:30:47 +0000 (00:30 +0000)]
CodeGen: Set hasSideEffects = 0 on BUNDLE
The BUNDLE itself should not have side effects; that is a property
of the instructions inside the bundle. The hasProperty check already
searches any member instructions, which was pointless since that
search was overridden by this bit.
This lets me distinguish bundles that have side effects from those that
do not in a future patch. It also fixes an unnecessary scheduling barrier
in the bundle AMDGPU uses to get PC-relative addresses.
Teresa Johnson [Tue, 2 Jul 2019 23:28:28 +0000 (23:28 +0000)]
[ThinLTO] Work around existing failure exposed by new test
When adding summary entries for index-based WPD (r364960), an added
test also included some additional testing of the existing hybrid
Thin/Regular LTO WPD (test/ThinLTO/X86/devirt.ll). That part of the
test is producing a failure on the llvm-clang-x86_64-expensive-checks-win
bot:
*** Bad machine code: Explicit definition marked as use ***
- function: __typeid__ZTS1A_0_branch_funnel
- basic block: %bb.0 (0x81d4c58)
- instruction: ICALL_BRANCH_FUNNEL %0:gr64, @0, 16, @_ZN1B1fEi, 48, @_ZN1C1fEi
- operand 0: %0:gr64
LLVM ERROR: Found 1 machine code errors.
This is functionality unrelated to the summary entries added with my
patch, so I am disabling this part of the new test until it is
addressed. I'll continue to investigate the failure.
Eli Friedman [Tue, 2 Jul 2019 21:35:15 +0000 (21:35 +0000)]
[ARM] Fix unwind info for Thumb1 functions that save high registers.
There were two issues here: one, some of the relevant instructions were
missing the expected "FrameSetup" flag, and two,
ARMAsmPrinter::EmitUnwindingInstruction wasn't expecting "mov"
instructions in the prologue.
I'm sticking the additional state into ARMFunctionInfo so it's obvious
it only applies to the current function.
I considered a few alternative approaches where we would compute the
correct unwind information as part of the prologue/epilogue lowering,
but it seems like a lot of work to introduce pseudo-instructions, and
the current code seems to be reliable enough.
Teresa Johnson [Tue, 2 Jul 2019 20:24:00 +0000 (20:24 +0000)]
[gold] Fix test after BitStream reader error changes
The recent change to the BitStream reader error handling in r364464
changed the error message format (from "LLVM ERROR:" to just "error"),
leading to a failure in this test which is only executed for very recent
versions of gold. Fix this by removing that part of the error message
check, leaving only the interesting part of the message to be checked.
Summary:
This patch introduces a new heuristic for guiding operand reordering. The new
"look-ahead" heuristic can look beyond the immediate predecessors. This helps
break ties when the immediate predecessors have identical opcodes (see the lit
test for an example).
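A hedged C++ illustration of the kind of tie the look-ahead breaks (written in the spirit of the lit test, not copied from it):

    // Both lanes add two subtractions, so comparing only the immediate
    // predecessor opcodes ties; looking one level deeper, at the loads,
    // finds the operand order that pairs A-B with A-B and C-D with C-D.
    void lookahead(double *A, double *B, double *C, double *D, double *S) {
      S[0] = (A[0] - B[0]) + (C[0] - D[0]);
      S[1] = (C[1] - D[1]) + (A[1] - B[1]);
    }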
[AArch64][GlobalISel] Teach tryOptSelect to handle G_ICMP
This teaches `tryOptSelect` to handle folding G_ICMP, and removes the
requirement that the G_SELECT we're dealing with is floating point.
Some refactoring to make this work nicely as well:
- Factor out the scalar case from the selection code for G_ICMP into
`emitIntegerCompare`.
- Make `tryOptCMN` return a MachineInstr* instead of a bool.
- Make `tryOptCMN` not modify the instruction being selected.
- Factor out the CMN emission into `emitCMN` for readability.
By doing things this way, we can get all of the compare selection optimizations
in select emission.
Teresa Johnson [Tue, 2 Jul 2019 19:38:02 +0000 (19:38 +0000)]
[ThinLTO] Add summary entries for index-based WPD
Summary:
If LTOUnit splitting is disabled, the module summary analysis computes
the summary information necessary to perform single implementation
devirtualization during the thin link with the index and no IR. The
information collected from the regular LTO IR in the current hybrid WPD
algorithm is summarized, including:
1) For vtable definitions, record the function pointers and their offset
within the vtable initializer (subsumes the information collected from
IR by tryFindVirtualCallTargets).
2) A record for each type metadata summarizing the vtable definitions
decorated with that metadata (subsumes the TypeIdentifierMap collected
from IR).
Also added are the necessary bitcode records, and the corresponding
assembly support.
Matt Arsenault [Tue, 2 Jul 2019 19:15:45 +0000 (19:15 +0000)]
AMDGPU: Custom lower vector_shuffle for v4i16/v4f16
Ordinarily it is lowered as a build_vector of each extract_vector_elt,
which in turn get lowered to bitcasts and bit shifts. Very little code
understands the lowered extract pattern, resulting in much worse
code. We treat concat_vectors of v2i16 as legal, so prefer that.
Teresa Johnson [Tue, 2 Jul 2019 18:54:03 +0000 (18:54 +0000)]
[RA] Fix spelling of Greedy register allocator internal option
The internal option added with r323870 has a typo. It isn't being used
by any tests, but I decided to fix the spelling and leave it in for use
in debugging the changes added in that patch.
Erik Pilkington [Tue, 2 Jul 2019 18:28:13 +0000 (18:28 +0000)]
[C++2a] Add __builtin_bit_cast, used to implement std::bit_cast
This commit adds a new builtin, __builtin_bit_cast(T, v), which performs a
bit_cast from a value v to a type T. This expression can be evaluated at
compile time under specific circumstances.
The compile time evaluation currently doesn't support bit-fields, but I'm
planning on fixing this in a follow up (some of the logic for figuring this out
is in CodeGen). I'm also planning follow-ups for supporting some more esoteric
types that the constexpr evaluator supports, as well as extending
__builtin_memcpy constexpr evaluation to use the same infrastructure.
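A small usage sketch of the new builtin (the bit pattern below is standard IEEE-754, stated from general knowledge rather than the patch; assumes 32-bit unsigned):

    // Constant-evaluated bit cast: reinterpret a float's object
    // representation as a same-sized integer at compile time.
    constexpr unsigned U = __builtin_bit_cast(unsigned, 1.0f);
    static_assert(U == 0x3f800000, "1.0f is 0x3f800000 in IEEE-754");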
[X86] Add patterns to select (scalar_to_vector (loadf32)) as (V)MOVSSrm instead of COPY_TO_REGCLASS + (V)MOVSSrm_alt.
Similar for (V)MOVSD. Ultimately, I'd like to see about folding
scalar_to_vector+load to vzload, which would select as (V)MOVSSrm,
so this is closer to that.
Roman Lebedev [Tue, 2 Jul 2019 16:48:49 +0000 (16:48 +0000)]
[NFC][Codegen][X86][AArch64][ARM][PowerPC] Recommit: Add test coverage for "add-of-inc" vs "sub-of-not"
I initially committed it with --check-prefix instead of --check-prefixes
(again, shame on me, and on utils/update_*.py for not complaining!)
and did not have a moment to understand the failure,
so I initially reverted it in rL64939.
Roman Lebedev [Tue, 2 Jul 2019 14:48:52 +0000 (14:48 +0000)]
[NFC][Codegen][X86][AArch64][ARM][PowerPC] Add test coverage for "add-of-inc" vs "sub-of-not"
As it is pointed out in https://reviews.llvm.org/D63992,
before we get to pick canonical variant in middle-end
we should ensure best codegen in backend.
Matt Arsenault [Tue, 2 Jul 2019 14:40:22 +0000 (14:40 +0000)]
AMDGPU/GlobalISel: Fix G_GEP with mixed SGPR/VGPR operands
The register bank for the destination of the sample argument copy was
wrong. We shouldn't be constraining each source to the result register
bank. Allow constraining the original register to the right size.
Simon Pilgrim [Tue, 2 Jul 2019 13:30:04 +0000 (13:30 +0000)]
[X86][AVX] combineX86ShuffleChain - pull out CombineShuffleWithExtract lambda. NFCI.
Pull out the CombineShuffleWithExtract lambda into a new combineX86ShuffleChainWithExtract wrapper and refactor it to handle more than 2 shuffle inputs - this will allow combineX86ShufflesRecursively to call it in a future patch.
I was actually wondering if there was some nicer way than m_Value()+cast,
but apparently what I was really "subconsciously" thinking about
was a correctness issue.
hasNoUnsignedWrap()/hasNoSignedWrap() exist for Instruction,
not for BinaryOperator, so let's just use m_Instruction(),
thus avoiding both a cast and a crash.
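A hedged illustration of the difference (generic PatternMatch usage, not the exact code from the patch):

    using namespace llvm::PatternMatch;
    // m_Instruction() both checks that the value is an Instruction and
    // binds it, so non-instruction values such as constant expressions
    // fail the match instead of crashing a cast<BinaryOperator>.
    Instruction *I;
    if (match(V, m_CombineAnd(m_Shl(m_Value(), m_Value()),
                              m_Instruction(I))) &&
        I->hasNoUnsignedWrap()) // declared on Instruction, safe to query
      ;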
Michal Gorny [Tue, 2 Jul 2019 11:32:03 +0000 (11:32 +0000)]
[llvm] [Support] Clean PrintStackTrace() ptr arithmetic up
Use '%tu' modifier for pointer arithmetic since we are using C++11
already. Prefer static_cast<> over C-style cast. Remove unnecessary
conversion of result, and add const qualifier to converted pointers,
to silence the following warning:
In file included from /home/mgorny/llvm-project/llvm/lib/Support/Signals.cpp:220:0:
/home/mgorny/llvm-project/llvm/lib/Support/Unix/Signals.inc: In function ‘void llvm::sys::PrintStackTrace(llvm::raw_ostream&)’:
/home/mgorny/llvm-project/llvm/lib/Support/Unix/Signals.inc:546:53: warning: cast from type ‘const void*’ to type ‘char*’ casts away qualifiers [-Wcast-qual]
(char*)dlinfo.dli_saddr));
^~~~~~~~~
[IDF] Generalize IDFCalculator to be used with Clang's CFG
I'm currently working on a GSoC project that aims to improve the the bug reports
of the analyzer. The main heuristic I plan to use is to explain values that are
a control dependency of the bug location better.
01 bool b = messyComputation();
02 int i = 0;
03 if (b) // control dependency of the bug site, let's explain why we assume val
04 // to be true
05 10 / i; // warn: division by zero
Because of this, I'd like to generalize IDFCalculator so that I could use it for
Clang's CFG: D62883.
In detail:
* Rename IDFCalculator to IDFCalculatorBase, make it take a general CFG node
type as a template argument rather than strictly BasicBlock (but preserve
ForwardIDFCalculator and ReverseIDFCalculator) - see the sketch after this list
* Move IDFCalculatorBase from llvm/include/llvm/Analysis to
llvm/include/llvm/Support (but leave the BasicBlock variants in
llvm/include/llvm/Analysis)
* clang-format the file since this patch messes up git blame anyways
* Change typedef to using
* Add the new type ChildrenGetterTy, and store an instance of it in
IDFCalculatorBase. This is important because I'll have to specialize it for
Clang's CFG to filter out nullpointer successors, similarly to D62507.
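A rough sketch of the resulting shape, assembled from the bullet points above (member details are assumptions):

    // Generalized base: NodeTy is BasicBlock for LLVM IR, and can be a
    // Clang CFGBlock once specialized; children are obtained through
    // ChildrenGetterTy so a client can filter successors (e.g. drop
    // null successors in Clang's CFG).
    template <class NodeTy, bool IsPostDom>
    class IDFCalculatorBase {
    public:
      using ChildrenGetterTy =
          std::function<SmallVector<NodeTy *, 8>(NodeTy *)>;
      // ...
    private:
      ChildrenGetterTy ChildrenGetter;
    };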
Simon Tatham [Tue, 2 Jul 2019 11:26:11 +0000 (11:26 +0000)]
[ARM] MVE: allow soft-float ABI to pass vector types.
Passing a vector type over the soft-float ABI involves it being split
into four GPRs, so the first thing that has to happen at the start of
the function is to recombine those into a vector register. The ABI
types all vectors as v2f64, so we need to support BUILD_VECTOR for
that type, which I do in this patch by allowing it to be expanded in
terms of INSERT_VECTOR_ELT, and writing an ISel pattern for that in
turn. Similarly, I provide a rule for EXTRACT_VECTOR_ELT so that a
returned vector can be marshalled back into GPRs.
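A hedged sketch of the kind of ISelLowering setup this describes (the exact operation actions in the patch may differ):

    // Let a v2f64 BUILD_VECTOR be expanded in terms of
    // INSERT_VECTOR_ELT, and keep element insert/extract selectable so
    // the ABI marshalling to/from GPRs works in both directions.
    setOperationAction(ISD::BUILD_VECTOR, MVT::v2f64, Expand);
    setOperationAction(ISD::INSERT_VECTOR_ELT, MVT::v2f64, Legal);
    setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v2f64, Legal);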
While I'm here, I've also added ISD::UNDEF to the list of operations
we turn back on in `setAllExpand`, because I noticed that otherwise it
gets expanded into a BUILD_VECTOR with explicit zero inputs, leading
to pointless machine instructions to zero out a vector register that's
about to have every lane overwritten anyway.
Simon Tatham [Tue, 2 Jul 2019 11:26:00 +0000 (11:26 +0000)]
[ARM] Stop using scalar FP instructions in integer-only MVE mode.
If you compile with `-mattr=+mve` (enabling integer MVE instructions
but not floating-point ones), then the scalar FP //registers// exist
and it's legal to move things in and out of them, load and store them,
but it's not legal to do arithmetic on them.
In D60708, the calls to `addRegisterClass` in ARMISelLowering that
enable use of the scalar FP registers became conditionalised on
`Subtarget->hasFPRegs()` instead of `Subtarget->hasVFP2Base()`, so
that loads, stores and moves of those registers would work. But I
didn't realise that that would also enable all the operations on those
types by default.
Now, if the target doesn't have basic VFP, we follow up those
`addRegisterClass` calls by turning back off all the nontrivial
operations you can perform on f32 and f64. That causes several
knock-on failures, which are fixed by allowing the `VMOVDcc` and
`VMOVScc` instructions to be selected even if all you have is
`HasFPRegs`, and adjusting several checks for 'is this a double in a
single-precision-only world?' to the more general 'is this any FP type
we can't do arithmetic on?'. Between those, the whole of the
`float-ops.ll` and `fp16-instructions.ll` tests can now run in
MVE-without-FP mode and generate correct-looking code.
One odd side effect is that I had to relax the check lines in that
test so that they permit test functions like `add_f` to be generated
as tailcalls to software FP library functions, instead of ordinary
calls. Doing that is entirely legal, but the mystery is why this is
the first RUN line that's needed the relaxation: on the usual kind of
non-FP target, no tailcalls ever seem to be generated. Going by the
llc messages, I think `SoftenFloatResult` must be perturbing the code
generation in some way, but that's as much as I can guess.
Simon Pilgrim [Tue, 2 Jul 2019 10:53:17 +0000 (10:53 +0000)]
[X86] resolveTargetShuffleInputsAndMask - add repeated input handling.
We were relying on combineX86ShufflesRecursively to handle this - this patch gets it done earlier, which should make it easier for other code to use resolveTargetShuffleInputsAndMask.
[TailDuplicator] Fix copy instruction emitting into the wrong block.
The code for duplicating instructions could sometimes try to emit copies
intended to deal with unconstrainable register classes to the tail block of the
original instruction, rather than before the newly cloned instruction in the
predecessor block.
[PowerPC] Implement the areMemAccessesTriviallyDisjoint hook
After implementing this hook, we model the memory dependencies in the scheduling
dependency graph more precisely, and have more opportunities to reorder loads/stores,
since under some conditions they have no real dependency on each other.
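A hedged sketch of the usual shape of this hook, modeled on other targets' implementations rather than the actual PPC code:

    // Two fixed-offset accesses off the same base register are
    // trivially disjoint if their address ranges cannot overlap.
    bool PPCInstrInfo::areMemAccessesTriviallyDisjoint(
        const MachineInstr &MIa, const MachineInstr &MIb) const {
      const MachineOperand *BaseA = nullptr, *BaseB = nullptr;
      int64_t OffA = 0, OffB = 0;
      const TargetRegisterInfo *TRI = &getRegisterInfo();
      if (!MIa.hasOneMemOperand() || !MIb.hasOneMemOperand())
        return false;
      if (!getMemOperandWithOffset(MIa, BaseA, OffA, TRI) ||
          !getMemOperandWithOffset(MIb, BaseB, OffB, TRI))
        return false;
      if (!BaseA->isIdenticalTo(*BaseB))
        return false;
      int64_t WidthA = MIa.memoperands().front()->getSize();
      int64_t WidthB = MIb.memoperands().front()->getSize();
      int64_t Lo = std::min(OffA, OffB);
      int64_t LoWidth = (Lo == OffA) ? WidthA : WidthB;
      return Lo + LoWidth <= std::max(OffA, OffB);
    }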
Zi Xuan Wu [Tue, 2 Jul 2019 02:54:52 +0000 (02:54 +0000)]
[DAGCombiner] Exploit more opportunities in the TransformFPLoadStorePair transformation
For a given floating point load / store pair, if the loaded value isn't used by any other operations,
then consider transforming the pair to integer load / store operations if the target deems the transformation profitable.
We can exploit many more cases when there are other operation nodes with a chain operand between the load/store pair,
so long as we keep the original chain ordering. We only replace the register used by the load/store, from float to integer.
I only add a test case for ARM because the TLI.isDesirableToTransformToIntegerOp hook is only enabled for the ARM target.
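An illustrative C/C++ source pattern for this transformation (a hypothetical example, not taken from the patch):

    // The float loaded from 'src' is only stored, never used in FP
    // arithmetic, so the load/store pair can be done with integer
    // registers instead of moving the value through FP registers.
    void copy_one(float *dst, const float *src) { *dst = *src; }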
[cmake] With utils disabled, don't build tblgen in cross mode
Summary:
In cross mode, we build a separate NATIVE tblgen that runs on the
host and is used during the build. Separately, we have a flag that
disables building all executables in utils/. Of course generally,
this doesn't turn off tblgen, since we need that during the build.
In cross mode, however, that tblgen is useless since we never
actually use it. Furthermore, it can be actively problematic if the
cross toolchain doesn't like building executables for whatever reason.
And even if building executables works fine, we can at least save
compile time by omitting it from the target build. There are two changes
needed to make this happen:
- Stop creating a dependency from the native tool to the target tool.
No such dependency is required for a correct build, so I'm not entirely
sure why it was there in the first place.
- If utils were disabled on the CMake command line and we're in cross mode,
respect that by excluding it from the install target (using EXCLUDE_FROM_ALL).
[InstCombine] reduce more checks for power-of-2-or-zero using ctpop
Extends the transform from:
rL364341
...to include another (more common?) pattern that tests whether a
value is a power-of-2 (including or excluding zero).
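Illustrative C/C++ idioms for the kind of check involved (which exact variant this patch adds versus rL364341 is not restated here):

    // Classic "power of 2 or zero" tests that can be canonicalized to a
    // population-count comparison, conceptually ctpop(x) < 2:
    bool pow2_or_zero_a(unsigned x) { return (x & (x - 1)) == 0; }
    bool pow2_or_zero_b(unsigned x) { return (x & -x) == x; }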
Matt Arsenault [Mon, 1 Jul 2019 19:36:10 +0000 (19:36 +0000)]
GlobalISel: Try to widen merges with other merges
If the requested source type can be used as a merge source type, create
a merge of merges. This avoids creating large, illegal extensions and
bit-ops directly to the result type.
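A hedged before/after illustration, in the same generic-MIR style as the verifier dump quoted earlier (hand-written, not output from the patch):

    before: %d(s64) = G_MERGE_VALUES %a(s16), %b(s16), %c(s16), %e(s16)
    after, widening the source type to s32 via a merge of merges:
      %lo(s32) = G_MERGE_VALUES %a(s16), %b(s16)
      %hi(s32) = G_MERGE_VALUES %c(s16), %e(s16)
      %d(s64)  = G_MERGE_VALUES %lo(s32), %hi(s32)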