granicus.if.org Git

Revert "[DebugInfo] Introduce DW_OP_LLVM_convert"

This reverts commit 1cf4b593a7ebd666fc6775f3bd38196e8e65fafe.

Build bots found failing tests not detected locally.

Failing Tests (3):
  LLVM :: DebugInfo/Generic/convert-debugloc.ll
  LLVM :: DebugInfo/Generic/convert-inlined.ll
  LLVM :: DebugInfo/Generic/convert-linked.ll

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356444 91177308-0d34-0410-b5e6-96231b3b80d8

Use response file when generating LLVM-C.dll

As discovered in D56774 the command line gets to long, so use a response file
to give the script the libs. This change has been tested and is confirmed
working for me.

Commited on behalf of Jakob Bornecrantz.
Differential Revision: https://reviews.llvm.org/D56781

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356443 91177308-0d34-0410-b5e6-96231b3b80d8

[DebugInfo] Introduce DW_OP_LLVM_convert

Introduce a DW_OP_LLVM_convert Dwarf expression pseudo op that allows
for a convenient way to perform type conversions on the Dwarf expression
stack. As an additional bonus it paves the way for using other Dwarf
v5 ops that need to reference a base_type.

The new DW_OP_LLVM_convert is used from lib/Transforms/Utils/Local.cpp
to perform sext/zext on debug values but mainly the patch is about
preparing terrain for adding other Dwarf v5 ops that need to reference a
base_type.

For Dwarf v5 the op maps to DW_OP_convert and for earlier versions a
complex shift & mask pattern is generated to emulate sext/zext.

Differential Revision: https://reviews.llvm.org/D56587

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356442 91177308-0d34-0410-b5e6-96231b3b80d8

[WebAssembly] Small improvements in FixIrreducibleControlFlow (NFC)

Summary:
- Make some class member methods const
- Delete unnecessary includes
- Use a simpler form of `BuildMI`

Reviewers: kripken

Subscribers: dschuff, sbc100, jgravelle-google, sunfish, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59454

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356440 91177308-0d34-0410-b5e6-96231b3b80d8

[WebAssembly] Improve readability of irreducibility tests

Summary:
This adds `preds` comment lines to BB names for readability, while also
fixes some of existing incorrect comment lines. Also deletes a few
unnecessary attributes. Autogenerated by `opt`.

Reviewers: kripken

Subscribers: dschuff, sbc100, jgravelle-google, sunfish, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59456

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356439 91177308-0d34-0410-b5e6-96231b3b80d8

[WebAssembly] Rename methods according to instruction name changes (NFC)

Reviewers: tlively, sbc100

Subscribers: dschuff, jgravelle-google, sunfish, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59469

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356438 91177308-0d34-0410-b5e6-96231b3b80d8

[WebAssembly] Add immarg attribute to intrinsics

Summary:
After r355981, intrinsic arguments that are immediate values should be
marked as `ImmArg`.

Reviewers: dschuff, tlively

Subscribers: sbc100, jgravelle-google, sunfish, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59447

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356437 91177308-0d34-0410-b5e6-96231b3b80d8

[WebAssembly] Lower SIMD nnan setcc nodes

Summary:
Adds patterns to lower all the remaining setcc modes: lt, gt,
le, and ge. Fixes PR40912.

Reviewers: aheejin, sbc100, dschuff

Reviewed By: dschuff

Subscribers: jgravelle-google, hiraditya, sunfish, jdoerfert, llvm-commits, srj

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59519

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356431 91177308-0d34-0410-b5e6-96231b3b80d8

Revert "[ValueTracking][InstSimplify] Support min/max selects in computeConstantRange()"

This reverts commit 106f0cdefb02afc3064268dc7a71419b409ed2f3.

This change impacts the AMDGPU smed3.ll and umed3.ll codegen tests.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356424 91177308-0d34-0410-b5e6-96231b3b80d8

[libFuzzer] document -len_control

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356422 91177308-0d34-0410-b5e6-96231b3b80d8

[X86] Add coverage for 16-bit and 64-bit versions of bsf/bsr/bt/btc/btr/bts in the assembly tests that are supposed to provide full coverage. Add coverage for cwtl/cltq/cwtd/cqto as well.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356420 91177308-0d34-0410-b5e6-96231b3b80d8

[X86] Disable CQTO and CLTQ instructions in the assembly parser outside 64-bit mode.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356419 91177308-0d34-0410-b5e6-96231b3b80d8

[ValueTracking][InstSimplify] Support min/max selects in computeConstantRange()

Add support for min/max flavor selects in computeConstantRange(),
which allows us to fold comparisons of a min/max against a constant
in InstSimplify. This was suggested by spatel as an alternative
approach to D59378. I've also added the infinite looping test from
that revision here.

Differential Revision: https://reviews.llvm.org/D59506

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356415 91177308-0d34-0410-b5e6-96231b3b80d8

[InstCombine] Add tests for add nuw + uaddo; NFC

Baseline tests for D59471 (InstCombine of `add nuw` and `uaddo` with
constants).

Patch by Dan Robertson.

Differential Revision: https://reviews.llvm.org/D59472

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356414 91177308-0d34-0410-b5e6-96231b3b80d8

[X86] Allow any 8-bit immediate to be used with BT/BTC/BTR/BTS not just sign extended 8-bit immediates.

We need to allow [128,255] in addition to [-128, 127] to match gas.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356413 91177308-0d34-0410-b5e6-96231b3b80d8

[GlobalISel] Include missing change from r356396

Forgot to add a change to relax some asserts in r356396.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356411 91177308-0d34-0410-b5e6-96231b3b80d8

[WebAssembly] Don't override default implementation of isOffsetFoldingLegal. NFC.

The default implementation does we want and is going to more compatible
with dynamic linking (-fPIC) support that is planned.

This is NFC because currently we only build wasm with
`-relocation-model=static` which in turn means that the default
`isOffsetFoldingLegal` always returns true today.

Differential Revision: https://reviews.llvm.org/D54661

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356410 91177308-0d34-0410-b5e6-96231b3b80d8

[ValueTracking][InstSimplify] Move abs handling into computeConstantRange(); NFC

This is preparation for D59506. The InstructionSimplify abs handling
is moved into computeConstantRange(), which is the general place for
such calculations. This is NFC and doesn't affect the existing tests
in test/Transforms/InstSimplify/icmp-abs-nabs.ll.

Differential Revision: https://reviews.llvm.org/D59511

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356409 91177308-0d34-0410-b5e6-96231b3b80d8

[InstSimplify] Add additional icmp of min/max tests; NFC

These are baseline tests for D59506.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356408 91177308-0d34-0410-b5e6-96231b3b80d8

[X86] Use relocImm in the ROL8ri/ROL16ri/ROL32ri/ROL64ri patterns to be consistent with the ROR patterns.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356407 91177308-0d34-0410-b5e6-96231b3b80d8

[X86] Replace uses of i64immSExt32_su with i64relocImmSExt32_su.

For the i8, i16, and i32 instructions we were using a relocImm. Presumably we should for i64 as well.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356406 91177308-0d34-0410-b5e6-96231b3b80d8

[AMDGPU] Enable code selection using `s_mul_hi_u32`/`s_mul_hi_i32`.

Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59501

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356405 91177308-0d34-0410-b5e6-96231b3b80d8

[llvm-objcopy] Make .build-id linking atomic

This change makes linking into .build-id atomic and safe to use.
Some users under particular workflows are reporting that this races
more than half the time under particular conditions.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356404 91177308-0d34-0410-b5e6-96231b3b80d8

[InstCombine] Improve with.overflow intrinsic tests; NFC

- Do not use unnamed values in saddo tests
- Add tests for canonicalization of a constant arg0

Patch by Dan Robertson.

Differential Revision: https://reviews.llvm.org/D59476

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356403 91177308-0d34-0410-b5e6-96231b3b80d8

Restore comment regarding why Reloc::PIC_ can't be PIC

The original change back in rL29307 explained this but it was
lost somewhere along the way.

Differential Revision: https://reviews.llvm.org/D59445

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356402 91177308-0d34-0410-b5e6-96231b3b80d8

Fix flat-error-unsupported-gpu-hsa test

Differential Revision: https://reviews.llvm.org/D59505

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356400 91177308-0d34-0410-b5e6-96231b3b80d8

[AMDGPU] Asm/disasm clamp modifier on vop3 int arithmetic

Allow the clamp modifier on vop3 int arithmetic instructions in assembly
and disassembly.

This involved adding a clamp operand to the affected instructions in MIR
and MC, and thus having to fix up several places in codegen and MIR
tests.

Differential Revision: https://reviews.llvm.org/D59267

Change-Id: Ic7775105f02a985b668fa658a0cd7837846a534e

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356399 91177308-0d34-0410-b5e6-96231b3b80d8

[AMDGPU] Asm/disasm v_cndmask_b32_e64 with abs/neg source modifiers

This commit allows v_cndmask_b32_e64 with abs, neg source
modifiers on src0, src1 to be assembled and disassembled.

This does appear to be allowed, even though they are floating point
modifiers and the operand type is b32.

To do this, I added src0_modifiers and src1_modifiers to the
MachineInstr, which involved fixing up several places in codegen and mir
tests.

Differential Revision: https://reviews.llvm.org/D59191

Change-Id: I69bf4a8c73ebc65744f6110bb8fc4e937d79fbea

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356398 91177308-0d34-0410-b5e6-96231b3b80d8

Revert r356304: remove subreg parameter from MachineIRBuilder::buildCopy()

After review comments, it was preferred to not teach MachineIRBuilder about
non-generic instructions beyond using buildInstr().

For AArch64 I've changed the buildCopy() calls to buildInstr() + a
separate addReg() call.

This also relaxes the MachineIRBuilder's COPY checking more because it may
not always have a SrcOp given to it.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356396 91177308-0d34-0410-b5e6-96231b3b80d8

[DebugInfo][PDB] Don't write empty debug streams

Before, empty debug streams were written as 8 bytes (4 bytes signature + 4 bytes for the GlobalRefs count).

With this patch, unused empty streams aren't emitted anymore. Modules now encode 65535 as an 'unused stream' value, by convention.
Also fix the * Linker * contrib section which wasn't correctly emitted previously.

Differential Revision: https://reviews.llvm.org/D59502

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356395 91177308-0d34-0410-b5e6-96231b3b80d8

[MsgPack][AMDGPU] Fix unflushed raw_string_ostream bugs on windows expensive checks bot

This fixes a couple of unflushed raw_string_ostream bugs in recent
commits that only show up on a bot building on windows with expensive
checks.

Differential Revision: https://reviews.llvm.org/D59396

Change-Id: I9c6208325503b3ee0786b4b688e13fc24a15babf

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356394 91177308-0d34-0410-b5e6-96231b3b80d8

[X86] Rename imm8_su/imm16_su/imm32_su to relocImm8_su/relocImm16_su/relocImm32_su/ to accurately reflect what they are.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356393 91177308-0d34-0410-b5e6-96231b3b80d8

[SCEV] Guard movement of insertion point for loop-invariants

This reinstates r347934, along with a tweak to address a problem with
PHI node ordering that that commit created (or exposed). (That commit
was reverted at r348426, due to the PHI node issue.)

Original commit message:

r320789 suppressed moving the insertion point of SCEV expressions with
dev/rem operations to the loop header in non-loop-invariant situations.
This, and similar, hoisting is also unsafe in the loop-invariant case,
since there may be a guard against a zero denominator. This is an
adjustment to the fix of r320789 to suppress the movement even in the
loop-invariant case.

This fixes PR30806.

Differential Revision: https://reviews.llvm.org/D57428

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356392 91177308-0d34-0410-b5e6-96231b3b80d8

[AArch64] Small fix for getIntImmCost

It uses the generic AArch64_IMM::expandMOVImm to get the correct
number of instruction used in immediate materialization.

Reviewers: efriedma

Differential Revision: https://reviews.llvm.org/D58461

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356391 91177308-0d34-0410-b5e6-96231b3b80d8

[AArch64] Optimize floating point materialization

This patch follows some ideas from r352866 to optimize the floating
point materialization even further. It changes isFPImmLegal to
considere up to 2 mov instruction or up to 5 in case subtarget has
fused literals.

The rationale is the cost is the same for mov+fmov vs. adrp+ldr; but
the mov+fmov sequence is always better because of the reduced d-cache
pressure. The timings are still the same if you consider movw+movk+fmov
vs. adrp+ldr will be fused (although one instruction longer).

Reviewers: efriedma

Differential Revision: https://reviews.llvm.org/D58460

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356390 91177308-0d34-0410-b5e6-96231b3b80d8

[TargetLowering] Add code size information on isFPImmLegal. NFC

This allows better code size for aarch64 floating point materialization
in a future patch.

Reviewers: evandro

Differential Revision: https://reviews.llvm.org/D58690

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356389 91177308-0d34-0410-b5e6-96231b3b80d8

[AArch64] Refactor floating point materialization. NFC

It splits the login of actual instruction emission away from the logic
that figures out the appropriate sequence on AArch64ExpandPseudo::expandMOVImm.
The new function AArch64_IMM::expandMOVImm, which return the list of the
instructions to materialize the immediate constant, is implemented on a
separated unit because it will be used in a subsequent patch to optimize
floating point materialization.

Reviewers: efriedma

Differential Revision: https://reviews.llvm.org/D58915

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356387 91177308-0d34-0410-b5e6-96231b3b80d8

[X86] Remove the _alt forms of (V)CMP instructions. Use a combination of custom printing and custom parsing to achieve the same result and more

Similar to previous change done for VPCOM and VPCMP

Differential Revision: https://reviews.llvm.org/D59468

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356384 91177308-0d34-0410-b5e6-96231b3b80d8

[InstCombine] add/adjust test for NaN checks; NFC

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356383 91177308-0d34-0410-b5e6-96231b3b80d8

[DAG] Cleanup unused node in SimplifySelectCC.

Delete temporarily constructed node uses for analysis after it's use,
holding onto original input nodes. Ideally this would be rewritten
without making nodes, but this appears relatively complex.

Reviewers: spatel, RKSimon, craig.topper

Subscribers: jdoerfert, hiraditya, deadalnix, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D57921

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356382 91177308-0d34-0410-b5e6-96231b3b80d8

[MVT] Fix typos in comment. NFC.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356381 91177308-0d34-0410-b5e6-96231b3b80d8

[AMDGPU] Add an experimental buffer fat pointer address space.

Add an experimental buffer fat pointer address space that is currently
unhandled in the backend. This commit reserves address space 7 as a
non-integral pointer repsenting the 160-bit fat pointer (128-bit buffer
descriptor + 32-bit offset) that is heavily used in graphics workloads
using the AMDGPU backend.

Differential Revision: https://reviews.llvm.org/D58957

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356373 91177308-0d34-0410-b5e6-96231b3b80d8

[InstCombine] allow general vector constants for funnel shift to shift transforms

Follow-up to:
rL356338
rL356369

We can calculate an arbitrary vector constant minus the bitwidth, so there's
no need to limit this transform to scalars and splats.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356372 91177308-0d34-0410-b5e6-96231b3b80d8

[llvm-objcopy] - Calculate the string table section sizes correctly.

This fixes the https://bugs.llvm.org/show_bug.cgi?id=40980.

Previously if string optimization occurred as a result of
StringTableBuilder's finalize() method, the size wasn't updated.

This hopefully also makes the interaction between sections during finalization
processes a bit more clear.

Differential revision: https://reviews.llvm.org/D59488

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356371 91177308-0d34-0410-b5e6-96231b3b80d8

[InstCombine] extend rotate-left-by-constant canonicalization to funnel shift

Follow-up to:
rL356338

Rotates are a special case of funnel shift where the 2 input operands
are the same value, but that does not need to be a restriction for the
canonicalization when the shift amount is a constant.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356369 91177308-0d34-0410-b5e6-96231b3b80d8

[SystemZ] Remove icmp undef from reduced tests

Pre-commit for D59363 (Add icmp UNDEF handling to SelectionDAG::FoldSetCC)

Approved by @uweigand (Ulrich Weigand)

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356368 91177308-0d34-0410-b5e6-96231b3b80d8

[InstCombine] add funnel shift tests with arbitrary constants; NFC

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356367 91177308-0d34-0410-b5e6-96231b3b80d8

[llvm-exegesis] Separate tool options into three categories.

Results in much nicer -help output:
```
$ ./bin/llvm-exegesis -help
USAGE: llvm-exegesis [options]

OPTIONS:

Color Options:

  -color                                         - Use colors in output (default=autodetect)

General options:

  -enable-cse-in-irtranslator                    - Should enable CSE in irtranslator
  -enable-cse-in-legalizer                       - Should enable CSE in Legalizer

Generic Options:

  -help                                          - Display available options (-help-hidden for more)
  -help-list                                     - Display list of available options (-help-list-hidden for more)
  -version                                       - Display the version of this program

llvm-exegesis analysis options:

  -analysis-clustering-epsilon=<number>          - dbscan epsilon for benchmark point clustering
  -analysis-clusters-output-file=<string>        -
  -analysis-display-unstable-clusters            - if there is more than one benchmark for an opcode, said benchmarks may end up not being clustered into the same cluster if the measured performance characteristics are different. by default all such opcodes are filtered out. this flag will instead show only such unstable opcodes
  -analysis-inconsistencies-output-file=<string> -
  -analysis-inconsistency-epsilon=<number>       - epsilon for detection of when the cluster is different from the LLVM schedule profile values
  -analysis-numpoints=<uint>                     - minimum number of points in an analysis cluster

llvm-exegesis benchmark options:

  -ignore-invalid-sched-class                    - ignore instructions that do not define a sched class
  -mode=<value>                                  - the mode to run
    =latency                                     -   Instruction Latency
    =inverse_throughput                          -   Instruction Inverse Throughput
    =uops                                        -   Uop Decomposition
    =analysis                                    -   Analysis
  -num-repetitions=<uint>                        - number of time to repeat the asm snippet
  -opcode-index=<int>                            - opcode to measure, by index
  -opcode-name=<string>                          - comma-separated list of opcodes to measure, by name
  -snippets-file=<string>                        - code snippets to measure

llvm-exegesis options:

  -benchmarks-file=<string>                      - File to read (analysis mode) or write (latency/uops/inverse_throughput modes) benchmark results. “-” uses stdin/stdout.
  -mcpu=<string>                                 - cpu name to use for pfm counters, leave empty to autodetect
```

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356364 91177308-0d34-0410-b5e6-96231b3b80d8

[DebugInfo] Ignore bitcasts when lowering stack arg dbg.values

Summary:
Look past bitcasts when looking for parameter debug values that are
described by frame-index loads in `EmitFuncArgumentDbgValue()`.

In the attached test case we would be left with an undef `DBG_VALUE`
for the parameter without this patch.

A similar fix was done for parameters passed in registers in D13005.

This fixes PR40777.

Reviewers: aprantl, vsk, jmorse

Reviewed By: aprantl

Subscribers: bjope, javed.absar, jdoerfert, llvm-commits

Tags: #debug-info, #llvm

Differential Revision: https://reviews.llvm.org/D58831

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356363 91177308-0d34-0410-b5e6-96231b3b80d8

[AArch64] Fix bug 35094 atomicrmw on Armv8.1-A+lse

Fixes https://bugs.llvm.org/show_bug.cgi?id=35094

The Dead register definition pass should leave alone the atomicrmw
instructions on AArch64 (LTE extension). The reason is the following
statement in the Arm ARM:

"The ST<OP> instructions, and LD<OP> instructions where the destination
register is WZR or XZR, are not regarded as doing a read for the purpose
of a DMB LD barrier."

A good example was given in the gcc thread by Will Deacon (linked in the
bugzilla ticket 35094):

    P0 (atomic_int* y,atomic_int* x) {
      atomic_store_explicit(x,1,memory_order_relaxed);
      atomic_thread_fence(memory_order_release);
      atomic_store_explicit(y,1,memory_order_relaxed);
    }

    P1 (atomic_int* y,atomic_int* x) {
      atomic_fetch_add_explicit(y,1,memory_order_relaxed);  // STADD
      atomic_thread_fence(memory_order_acquire);
      int r0 = atomic_load_explicit(x,memory_order_relaxed);
    }

    P2 (atomic_int* y) {
      int r1 = atomic_load_explicit(y,memory_order_relaxed);
    }

    My understanding is that it is forbidden for r0 == 0 and r1 == 2 after
    this test has executed. However, if the relaxed add in P1 compiles to
    STADD and the subsequent acquire fence is compiled as DMB LD, then we
    don't have any ordering guarantees in P1 and the forbidden result could
    be observed.

Change-Id: I419f9f9df947716932038e1100c18d10a96408d0

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356360 91177308-0d34-0410-b5e6-96231b3b80d8

[X86] Hopefully fix a tautological compare warning in printVecCompareInstr.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356359 91177308-0d34-0410-b5e6-96231b3b80d8

[RISCV] Add ImmArg to intrinsics

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356358 91177308-0d34-0410-b5e6-96231b3b80d8

[X86] Add ADD8ri_DB and ADD8rr_DB to the autogenerated load folding table.

These were added in r355423.

We only use the autogenerated table to assist with the maintenance of the
manual table. These entries are alreayd in the manual table.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356357 91177308-0d34-0410-b5e6-96231b3b80d8

[X86] Make ADD*_DB post-RA pseudos and expand them in expandPostRAPseudo.

These are used to help convert OR->LEA when needed to avoid avoid a copy. They
aren't need after register allocation.

Happens to remove an ugly goto from X86MCCodeEmitter.cpp

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356356 91177308-0d34-0410-b5e6-96231b3b80d8

[X86] Add tab character to the custom printing of VPCMP and VPCOM instructions.

All the other instructions are printed with a preceeding tab.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356355 91177308-0d34-0410-b5e6-96231b3b80d8

Remove immarg from llvm.expect

The LangRef claimed this was required to be a constant, but this
appears to be wrong.

Fixes bug 41079.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356353 91177308-0d34-0410-b5e6-96231b3b80d8

[X86] Merge printf32mem/printi32mem into a single printdwordmem. Do the same for all other printing functions.

The only thing the print methods currently need to know is the string to print for the memory size in intel syntax.

This patch merges the functions based on this string. If we ever need something else in the future, its easy to split them back out.

This reduces the number of cases in the assembly printers. It shrinks the intel printer to only use 7 bytes per instruction instead of 8.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356352 91177308-0d34-0410-b5e6-96231b3b80d8

[CodeGen] Defined MVTs v3i32, v3f32, v5i32, v5f32

AMDGPU would like to use these MVTs.

Differential Revision: https://reviews.llvm.org/D58901

Change-Id: I6125fea810d7cc62a4b4de3d9904255a1233ae4e

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356351 91177308-0d34-0410-b5e6-96231b3b80d8

[CodeGen] Prepare for introduction of v3 and v5 MVTs

AMDGPU would like to have MVTs for v3i32, v3f32, v5i32, v5f32. This
commit does not add them, but makes preparatory changes:

* Exclude non-legal non-power-of-2 vector types from ComputeRegisterProp
mechanism in TargetLoweringBase::getTypeConversion.

* Cope with SETCC and VSELECT for odd-width i1 vector when the other
vectors are legal type.

Some of this patch is from Matt Arsenault, also of AMD.

Differential Revision: https://reviews.llvm.org/D58899

Change-Id: Ib5f23377dbef511be3a936211a0b9f94e46331f8

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356350 91177308-0d34-0410-b5e6-96231b3b80d8

[ARM] Check that CPSR does not have other uses

Fix up rL356335 by checking that CPSR is not read between
the compare and the branch.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356349 91177308-0d34-0410-b5e6-96231b3b80d8

RegAllocFast: Add hint to debug printing

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356348 91177308-0d34-0410-b5e6-96231b3b80d8

AMDGPU: Partially fix default device for HSA

There are a few different issues, mostly stemming from using
generation based checks for anything instead of subtarget
features. Stop adding flat-address-space as a feature for HSA, as it
should only be a device property. This was incorrectly allowing flat
instructions to select for SI.

Increase the default generation for HSA to avoid the encoding error
when emitting objects. This has some other side effects from various
checks which probably should be separate subtarget features (in the
cost model and for dealing with the DS offset folding issue).

Partial fix for bug 41070. It should probably be an error to try using
amdhsa without flat support.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356347 91177308-0d34-0410-b5e6-96231b3b80d8

[ConstantRange] Add assertion for KnownBits validity; NFC

Following the suggestion in D59475.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356346 91177308-0d34-0410-b5e6-96231b3b80d8

[ValueTracking] Use ConstantRange overflow check for signed add; NFC

This is the same change as rL356290, but for signed add. It replaces
the existing ripple logic with the overflow logic in ConstantRange.

This is NFC in that it should return NeverOverflow in exactly the
same cases as the previous implementation. However, it does make
computeOverflowForSignedAdd() more powerful by now also determining
AlwaysOverflows conditions. As none of its consumers handle this yet,
this has no impact on optimization. Making use of AlwaysOverflows
in with.overflow folding will be handled as a followup.

Differential Revision: https://reviews.llvm.org/D59450

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356345 91177308-0d34-0410-b5e6-96231b3b80d8

[X86] Remove the _alt forms of AVX512 VPCMP instructions. Use a combination of custom printing and custom parsing to achieve the same result and more

Similar to the previous patch for VPCOM.

Differential Revision: https://reviews.llvm.org/D59398

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356344 91177308-0d34-0410-b5e6-96231b3b80d8

[X86] Remove the _alt forms of XOP VPCOM instructions. Use a combination of custom printing and custom parsing to achieve the same result and more

Previously we had a regular form of the instruction used when the immediate was 0-7. And _alt form that allowed the full 8 bit immediate. Codegen would always use the 0-7 form since the immediate was always checked to be in range. Assembly parsing would use the 0-7 form when a mnemonic like vpcomtrueb was used. If the immediate was specified directly the _alt form was used. The disassembler would prefer to use the 0-7 form instruction when the immediate was in range and the _alt form otherwise. This way disassembly would print the most readable form when possible.

The assembly parsing for things like vpcomtrueb relied on splitting the mnemonic into 3 pieces. A "vpcom" prefix, an immediate representing the "true", and a suffix of "b". The tablegenerated printing code would similarly print a "vpcom" prefix, decode the immediate into a string, and then print "b".

The _alt form on the other hand parsed and printed like any other instruction with no specialness.

With this patch we drop to one form and solve the disassembly printing issue by doing custom printing when the immediate is 0-7. The parsing code has been tweaked to turn "vpcomtrueb" into "vpcomb" and then the immediate for the "true" is inserted either before or after the other operands depending on at&t or intel syntax.

I'd rather not do the custom printing, but I tried using an InstAlias for each possible mnemonic for all 8 immediates for all 16 combinations of element size, signedness, and memory/register. The code emitted into printAliasInstr ended up checking the number of operands, the register class of each operand, and the immediate for all 256 aliases. This was repeated for both the at&t and intel printer. Despite a lot of common checks between all of the aliases, when compiled with clang at least this commonality was not well optimized. Nor do all the checks seem necessary. Since I want to do a similar thing for vcmpps/pd/ss/sd which have 32 immediate values and 3 encoding flavors, 3 register sizes, etc. This didn't seem to scale well for clang binary size. So custom printing seemed a better trade off.

I also considered just using the InstAlias for the matching and not the printing. But that seemed like it would add a lot of extra rows to the matcher table. Especially given that the 32 immediates for vpcmpps have 46 strings associated with them.

Differential Revision: https://reviews.llvm.org/D59398

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356343 91177308-0d34-0410-b5e6-96231b3b80d8

[AMDGPU] Prepare for introduction of v3 and v5 MVTs

AMDGPU would like to have MVTs for v3i32, v3f32, v5i32, v5f32. This
commit does not add them, but makes preparatory changes:

* Fixed assumptions of power-of-2 vector type in kernel arg handling,
and added v5 kernel arg tests and v3/v5 shader arg tests.

* Added v5 tests for cost analysis.

* Added vec3/vec5 arg test cases.

Some of this patch is from Matt Arsenault, also of AMD.

Differential Revision: https://reviews.llvm.org/D58928

Change-Id: I7279d6b4841464d2080eb255ef3c589e268eabcd

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356342 91177308-0d34-0410-b5e6-96231b3b80d8

[ARM] Fixed an assumption of power-of-2 vector MVT

I am about to introduce some non-power-of-2 width vector MVTs. This
commit fixes a power-of-2 assumption that my forthcoming change would
otherwise break, as shown by test/CodeGen/ARM/vcvt_combine.ll and
vdiv_combine.ll.

Differential Revision: https://reviews.llvm.org/D58927

Change-Id: I56a282e365d3874ab0621e5bdef98a612f702317

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356341 91177308-0d34-0410-b5e6-96231b3b80d8

[AMDGPU] Regenerate some f16/i16 tests.

Prep work for D51589

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356340 91177308-0d34-0410-b5e6-96231b3b80d8

[ConstantRange] Add fromKnownBits() method

Following the suggestion in D59450, I'm moving the code for constructing
a ConstantRange from KnownBits out of ValueTracking, which also allows us
to test this code independently.

I'm adding this method to ConstantRange rather than KnownBits (which
would have been a bit nicer API wise) to avoid creating a dependency
from Support to IR, where ConstantRange lives.

Differential Revision: https://reviews.llvm.org/D59475

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356339 91177308-0d34-0410-b5e6-96231b3b80d8

[InstCombine] canonicalize rotate right by constant to rotate left

This was noted as a backend problem:
https://bugs.llvm.org/show_bug.cgi?id=41057
...and subsequently fixed for x86:
rL356121
But we should canonicalize these in IR for the benefit of all targets
and improve IR analysis such as CSE.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356338 91177308-0d34-0410-b5e6-96231b3b80d8

[InstCombine] add tests for rotate by constant using funnel intrinsics; NFC

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356337 91177308-0d34-0410-b5e6-96231b3b80d8

[ARM] Search backwards for CMP when combining into CBZ

The constant island pass currently only looks at the instruction immediately
before a branch for a CMP to fold into a CBZ/CBNZ. This extends it to search
backwards for the instruction that defines CPSR. We need to ensure that the
register is not overridden between the CMP and the branch.

Differential Revision: https://reviews.llvm.org/D59317

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356336 91177308-0d34-0410-b5e6-96231b3b80d8

[ARM] Add some CBZ constant island tests. NFC

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356335 91177308-0d34-0410-b5e6-96231b3b80d8

[DAGCombine] Fold (x & ~y) | y patterns

Fold (x & ~y) | y and it's four commuted variants to x | y. This pattern
can in particular appear when a vselect c, x, -1 is expanded to
(x & ~c) | (-1 & c) and combined to (x & ~c) | c.

This change has some overlap with D59066, which avoids creating a
vselect of this form in the first place during uaddsat expansion.

Differential Revision: https://reviews.llvm.org/D59174

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356333 91177308-0d34-0410-b5e6-96231b3b80d8

[TargetLowering] improve the default expansion of uaddsat/usubsat

This is a subset of what was proposed in:
D59006
...and may overlap with test changes from:
D59174
...but it seems like a good general optimization to turn selects
into bitwise-logic when possible because we never know exactly
what can happen at this stage of DAG combining depending on how
the target has defined things.

Differential Revision: https://reviews.llvm.org/D59066

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356332 91177308-0d34-0410-b5e6-96231b3b80d8

[RISCV][NFC] Factor out matchRegisterNameHelper in RISCVAsmParser.cpp

Contains common logic to match a string to a register name.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356330 91177308-0d34-0410-b5e6-96231b3b80d8

[RISCV] Fix RISCVAsmParser::ParseRegister and add tests

RISCVAsmParser::ParseRegister is called from AsmParser::parseRegisterOrNumber,
which in turn is called when processing CFI directives. The RISC-V
implementation wasn't setting RegNo, and so was incorrect. This patch address
that and adds cfi directive tests that demonstrate the fix. A follow-up patch
will factor out the register parsing logic shared between ParseRegister and
parseRegister.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356329 91177308-0d34-0410-b5e6-96231b3b80d8

[DAGCombine] combineShuffleOfScalars - handle non-zero SCALAR_TO_VECTOR indices (PR41097)

rL356292 reduces the size of scalar_to_vector if we know the upper bits are undef - which means that shuffles may find they are suddenly referencing scalar_to_vector elements other than zero - so make sure we handle this as undef.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356327 91177308-0d34-0410-b5e6-96231b3b80d8

[BPF] Add BTF Var and DataSec Support

Two new kinds, BTF_KIND_VAR and BTF_KIND_DATASEC, are added.

BTF_KIND_VAR has the following specification:
   btf_type.name: var name
   btf_type.info: type kind
   btf_type.type: var type
   // btf_type is followed by one u32
   u32: varinfo (currently, only 0 - static, 1 - global allocated in elf sections)

Not all globals are supported in this patch. The following globals are supported:
  . static variables with or without section attributes
  . global variables with section attributes

The inclusion of globals with section attributes
is for future potential extraction of key/value
type id's from map definition.

BTF_KIND_DATASEC has the following specification:
  btf_type.name: section name associated with variable or
                 one of .data/.bss/.readonly
  btf_type.info: type kind and vlen for # of variables
  btf_type.size: 0
  #vlen number of the following:
    u32: id of corresponding BTF_KIND_VAR
    u32: in-session offset of the var
    u32: the size of memory var occupied

At the time of debug info emission, the data section
size is unknown, so the btf_type.size = 0 for
BTF_KIND_DATASEC. The loader can patch it during
loading time.

The in-session offseet of the var is only available
for static variables. For global variables, the
loader neeeds to assign the global variable symbol value in
symbol table to in-section offset.

The size of memory is used to specify the amount of the
memory a variable occupies. Typically, it equals to
the type size, but for certain structures, e.g.,
  struct tt {
    int a;
    int b;
    char c[];
   };
   static volatile struct tt s2 = {3, 4, "abcdefghi"};
The static variable s2 has size of 20.

Note that for BTF_KIND_DATASEC name, the section name
does not contain object name. The compiler does have
input module name. For example, two cases below:
   . clang -target bpf -O2 -g -c test.c
     The compiler knows the input file (module) is test.c
     and can generate sec name like test.data/test.bss etc.
   . clang -target bpf -O2 -g -emit-llvm -c test.c -o - |
     llc -march=bpf -filetype=obj -o test.o
     The llc compiler has the input file as stdin, and
     would generate something like stdin.data/stdin.bss etc.
     which does not really make sense.

For any user specificed section name, e.g.,
  static volatile int a __attribute__((section("id1")));
  static volatile const int b __attribute__((section("id2")));
The DataSec with name "id1" and "id2" does not contain
information whether the section is readonly or not.
The loader needs to check the corresponding elf section
flags for such information.

A simple example:
  -bash-4.4$ cat t.c
  int g1;
  int g2 = 3;
  const int g3 = 4;
  static volatile int s1;
  struct tt {
   int a;
   int b;
   char c[];
  };
  static volatile struct tt s2 = {3, 4, "abcdefghi"};
  static volatile const int s3 = 4;
  int m __attribute__((section("maps"), used)) = 4;
  int test() { return g1 + g2 + g3 + s1 + s2.a + s3 + m; }
  -bash-4.4$ clang -target bpf -O2 -g -S t.c
Checking t.s, 4 BTF_KIND_VAR's are generated (s1, s2, s3 and m).
4 BTF_KIND_DATASEC's are generated with names
".data", ".bss", ".rodata" and "maps".

Signed-off-by: Yonghong Song <yhs@fb.com>
Differential Revision: https://reviews.llvm.org/D59441

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356326 91177308-0d34-0410-b5e6-96231b3b80d8

[X86][SSE] Constant fold PEXTRB/PEXTRW/EXTRACT_VECTOR_ELT nodes.

Replaces existing i1-only fold.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356325 91177308-0d34-0410-b5e6-96231b3b80d8

[X86] Add SimplifyDemandedBitsForTargetNode support for PEXTRB/PEXTRW

Improved constant folding for PEXTRB/PEXTRW will be added in a future commit

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356324 91177308-0d34-0410-b5e6-96231b3b80d8

[WebAssembly] Make rethrow take an except_ref type argument

Summary:
In the new wasm EH proposal, `rethrow` takes an `except_ref` argument.
This change was missing in r352598.

This patch adds `llvm.wasm.rethrow.in.catch` intrinsic. This is an
intrinsic that's gonna eventually be lowered to wasm `rethrow`
instruction, but this intrinsic can appear only within a catchpad or a
cleanuppad scope. Also this intrinsic needs to be invokable - otherwise
EH pad successor for it will not be correctly generated in clang.

This also adds lowering logic for this intrinsic in
`SelectionDAGBuilder::visitInvoke`. This routine is basically a
specialized and simplified version of
`SelectionDAGBuilder::visitTargetIntrinsic`, but we can't use it
because if is only for `CallInst`s.

This deletes the previous `llvm.wasm.rethrow` intrinsic and related
tests, which was meant to be used within a `__cxa_rethrow` library
function. Turned out this needs some more logic, so the intrinsic for
this purpose will be added later.

LateEHPrepare takes a result value of `catch` and inserts it into
matching `rethrow` as an argument.

`RETHROW_IN_CATCH` is a pseudo instruction that serves as a link between
`llvm.wasm.rethrow.in.catch` and the real wasm `rethrow` instruction. To
generate a `rethrow` instruction, we need an `except_ref` argument,
which is generated from `catch` instruction. But `catch` instrutions are
added in LateEHPrepare pass, so we use `RETHROW_IN_CATCH`, which takes
no argument, until we are able to correctly lower it to `rethrow` in
LateEHPrepare.

Reviewers: dschuff

Subscribers: sbc100, jgravelle-google, sunfish, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59352

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356316 91177308-0d34-0410-b5e6-96231b3b80d8

[WebAssembly] Method order change in LateEHPrepare (NFC)

Summary:
Currently the order of these methods does not matter, but the following
CL needs to have this order changed. Merging the order change and the
semantics change within a CL complicates the diff, so submitting the
order change first.

Reviewers: dschuff

Subscribers: sbc100, jgravelle-google, sunfish, jdoerfert, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59342

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356315 91177308-0d34-0410-b5e6-96231b3b80d8

gn build: Merge r356305.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356314 91177308-0d34-0410-b5e6-96231b3b80d8

[WebAssembly] Irreducible control flow rewrite

Summary:
Rewrite WebAssemblyFixIrreducibleControlFlow to a simpler and cleaner
design, which directly computes reachability and other properties
itself. This avoids previous complexity and bugs. (The new graph
analyses are very similar to how the Relooper algorithm would find loop
entries and so forth.)

This fixes a few bugs, including where we had a false positive and
thought fannkuch was irreducible when it was not, which made us much
larger and slower there, and a reverse bug where we missed
irreducibility. On fannkuch, we used to be 44% slower than asm2wasm and
are now 4% faster.

Reviewers: aheejin

Subscribers: jdoerfert, mgrang, dschuff, sbc100, jgravelle-google, sunfish, llvm-commits

Differential Revision: https://reviews.llvm.org/D58919

Patch by Alon Zakai (kripken)

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356313 91177308-0d34-0410-b5e6-96231b3b80d8

[ADT] Make SmallVector emplace_back return a reference

This follows the C++17 std::vector change and can simplify immediate
back() calls.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356312 91177308-0d34-0410-b5e6-96231b3b80d8

[GlobalISel] Make isel verification checks of vregs run under NDEBUG only.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356309 91177308-0d34-0410-b5e6-96231b3b80d8

gn build: Add missing dependency to check-clang target.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356306 91177308-0d34-0410-b5e6-96231b3b80d8

[TimePasses] allow -time-passes reporting into a custom stream

TimePassesHandler object (implementation of time-passes for new pass manager)
gains ability to report into a stream customizable per-instance (per pipeline).

Intended use is to specify separate time-passes output stream per each compilation,
setting up TimePasses member of StandardInstrumentation during PassBuilder setup.
That allows to get independent non-overlapping pass-times reports for parallel
independent compilations (in JIT-like setups).

By default it still puts timing reports into the info-output-file stream
(created by CreateInfoOutputFile every time report is requested).

Unit-test added for non-default case, and it also allowed to discover that print() does not work
as declared - it did not reset the timers, leading to yet another report being printed into the default stream.
Fixed print() to actually reset timers according to what was declared in print's comments before.

Reviewed By: philip.pfaffe
Differential Revision: https://reviews.llvm.org/D59366

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356305 91177308-0d34-0410-b5e6-96231b3b80d8

[GlobalISel] Allow MachineIRBuilder to build subregister copies.

This relaxes some asserts about sizes, and adds an optional subreg parameter
to buildCopy().

Also update AArch64 instruction selector to use this in places where we
previously used MachineInstrBuilder manually.

Differential Revision: https://reviews.llvm.org/D59434

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356304 91177308-0d34-0410-b5e6-96231b3b80d8

[ARM] Add MachineVerifier logic for some Thumb1 instructions.

tMOVr and tPUSH/tPOP/tPOP_RET have register constraints which can't be
expressed in TableGen, so check them explicitly. I've unfortunately run
into issues with both of these recently; hopefully this saves some time
for someone else in the future.

Differential Revision: https://reviews.llvm.org/D59383

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356303 91177308-0d34-0410-b5e6-96231b3b80d8

[X86] X86ISelLowering::combineSextInRegCmov(): also handle i8 CMOV's

Summary:
As noted by @andreadb in https://reviews.llvm.org/D59035#inline-525780

If we have `sext (trunc (cmov C0, C1) to i8)`,
we can instead do `cmov (sext (trunc C0 to i8)), (sext (trunc C1 to i8))`

Reviewers: craig.topper, andreadb, RKSimon

Reviewed By: craig.topper

Subscribers: llvm-commits, andreadb

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59412

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356301 91177308-0d34-0410-b5e6-96231b3b80d8

[X86] Promote i8 CMOV's (PR40965)

Summary:
@mclow.lists brought up this issue up in IRC, it came up during
implementation of libc++ `std::midpoint()` implementation (D59099)
https://godbolt.org/z/oLrHBP

Currently LLVM X86 backend only promotes i8 CMOV if it came from 2x`trunc`.
This differential proposes to always promote i8 CMOV.

There are several concerns here:
* Is this actually more performant, or is it just the ASM that looks cuter?
* Does this result in partial register stalls?
* What about branch predictor?

# Indeed, performance should be the main point here.
Let's look at a simple microbenchmark: {F8412076}
```
#include "benchmark/benchmark.h"

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iterator>
#include <limits>
#include <random>
#include <type_traits>
#include <utility>
#include <vector>

// Future preliminary libc++ code, from Marshall Clow.
namespace std {
template <class _Tp>
__inline _Tp midpoint(_Tp __a, _Tp __b) noexcept {
  using _Up = typename std::make_unsigned<typename remove_cv<_Tp>::type>::type;

  int __sign = 1;
  _Up __m = __a;
  _Up __M = __b;
  if (__a > __b) {
    __sign = -1;
    __m = __b;
    __M = __a;
  }
  return __a + __sign * _Tp(_Up(__M - __m) >> 1);
}
}  // namespace std

template <typename T>
std::vector<T> getVectorOfRandomNumbers(size_t count) {
  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_int_distribution<T> dis(std::numeric_limits<T>::min(),
                                       std::numeric_limits<T>::max());
  std::vector<T> v;
  v.reserve(count);
  std::generate_n(std::back_inserter(v), count,
                  [&dis, &gen]() { return dis(gen); });
  assert(v.size() == count);
  return v;
}

struct RandRand {
  template <typename T>
  static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
    return std::make_pair(getVectorOfRandomNumbers<T>(count),
                          getVectorOfRandomNumbers<T>(count));
  }
};
struct ZeroRand {
  template <typename T>
  static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
    return std::make_pair(std::vector<T>(count, T(0)),
                          getVectorOfRandomNumbers<T>(count));
  }
};

template <class T, class Gen>
void BM_StdMidpoint(benchmark::State& state) {
  const size_t Length = state.range(0);

  const std::pair<std::vector<T>, std::vector<T>> Data =
      Gen::template Gen<T>(Length);
  const std::vector<T>& a = Data.first;
  const std::vector<T>& b = Data.second;
  assert(a.size() == Length && b.size() == a.size());

  benchmark::ClobberMemory();
  benchmark::DoNotOptimize(a);
  benchmark::DoNotOptimize(a.data());
  benchmark::DoNotOptimize(b);
  benchmark::DoNotOptimize(b.data());

  for (auto _ : state) {
    for (size_t i = 0; i < Length; i++) {
      const auto calculated = std::midpoint(a[i], b[i]);
      benchmark::DoNotOptimize(calculated);
    }
  }
  state.SetComplexityN(Length);
  state.counters["midpoints"] =
      benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariant);
  state.counters["midpoints/sec"] =
      benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariantRate);
  const size_t BytesRead = 2 * sizeof(T) * Length;
  state.counters["bytes_read/iteration"] =
      benchmark::Counter(BytesRead, benchmark::Counter::kDefaults,
                         benchmark::Counter::OneK::kIs1024);
  state.counters["bytes_read/sec"] = benchmark::Counter(
      BytesRead, benchmark::Counter::kIsIterationInvariantRate,
      benchmark::Counter::OneK::kIs1024);
}

template <typename T>
static void CustomArguments(benchmark::internal::Benchmark* b) {
  const size_t L2SizeBytes = 2 * 1024 * 1024;
  // What is the largest range we can check to always fit within given L2 cache?
  const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 /
                        /*maximal elt size*/ sizeof(T) / /*safety margin*/ 2;
  b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN);
}

// Both of the values are random.
// The comparison is unpredictable.
BENCHMARK_TEMPLATE(BM_StdMidpoint, int32_t, RandRand)
    ->Apply(CustomArguments<int32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, RandRand)
    ->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int64_t, RandRand)
    ->Apply(CustomArguments<int64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, RandRand)
    ->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int16_t, RandRand)
    ->Apply(CustomArguments<int16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, RandRand)
    ->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int8_t, RandRand)
    ->Apply(CustomArguments<int8_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, RandRand)
    ->Apply(CustomArguments<uint8_t>);

// One value is always zero, and another is bigger or equal than zero.
// The comparison is predictable.
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, ZeroRand)
    ->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, ZeroRand)
    ->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, ZeroRand)
    ->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, ZeroRand)
    ->Apply(CustomArguments<uint8_t>);
```

```
$ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks ./llvm-cmov-bench-OLD ./llvm-cmov-bench-NEW
RUNNING: ./llvm-cmov-bench-OLD --benchmark_out=/tmp/tmp5a5qjm
2019-03-06 21:53:31
Running ./llvm-cmov-bench-OLD
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 1.78, 1.81, 1.36
----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters<...>
----------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072      300398 ns       300404 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.25083G/s midpoints=305.398M midpoints/sec=436.319M/s
BM_StdMidpoint<int32_t, RandRand>_BigO          2.29 N          2.29 N
BM_StdMidpoint<int32_t, RandRand>_RMS              2 %             2 %
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072     300433 ns       300433 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.25052G/s midpoints=305.398M midpoints/sec=436.278M/s
BM_StdMidpoint<uint32_t, RandRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, RandRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<int64_t, RandRand>/65536       169857 ns       169858 ns         4121 bytes_read/iteration=1024k bytes_read/sec=5.74929G/s midpoints=270.074M midpoints/sec=385.828M/s
BM_StdMidpoint<int64_t, RandRand>_BigO          2.59 N          2.59 N
BM_StdMidpoint<int64_t, RandRand>_RMS              3 %             3 %
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536      169770 ns       169771 ns         4125 bytes_read/iteration=1024k bytes_read/sec=5.75223G/s midpoints=270.336M midpoints/sec=386.026M/s
BM_StdMidpoint<uint64_t, RandRand>_BigO         2.59 N          2.59 N
BM_StdMidpoint<uint64_t, RandRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<int16_t, RandRand>/262144      591169 ns       591179 ns         1182 bytes_read/iteration=1024k bytes_read/sec=1.65189G/s midpoints=309.854M midpoints/sec=443.426M/s
BM_StdMidpoint<int16_t, RandRand>_BigO          2.25 N          2.25 N
BM_StdMidpoint<int16_t, RandRand>_RMS              1 %             1 %
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144     591264 ns       591274 ns         1184 bytes_read/iteration=1024k bytes_read/sec=1.65162G/s midpoints=310.378M midpoints/sec=443.354M/s
BM_StdMidpoint<uint16_t, RandRand>_BigO         2.25 N          2.25 N
BM_StdMidpoint<uint16_t, RandRand>_RMS             1 %             1 %
<...>
BM_StdMidpoint<int8_t, RandRand>/524288      2983669 ns      2983689 ns          235 bytes_read/iteration=1024k bytes_read/sec=335.156M/s midpoints=123.208M midpoints/sec=175.718M/s
BM_StdMidpoint<int8_t, RandRand>_BigO           5.69 N          5.69 N
BM_StdMidpoint<int8_t, RandRand>_RMS               0 %             0 %
<...>
BM_StdMidpoint<uint8_t, RandRand>/524288     2668398 ns      2668419 ns          262 bytes_read/iteration=1024k bytes_read/sec=374.754M/s midpoints=137.363M midpoints/sec=196.479M/s
BM_StdMidpoint<uint8_t, RandRand>_BigO          5.09 N          5.09 N
BM_StdMidpoint<uint8_t, RandRand>_RMS              0 %             0 %
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072     300887 ns       300887 ns         2331 bytes_read/iteration=1024k bytes_read/sec=3.24561G/s midpoints=305.529M midpoints/sec=435.619M/s
BM_StdMidpoint<uint32_t, ZeroRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, ZeroRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536      169634 ns       169634 ns         4102 bytes_read/iteration=1024k bytes_read/sec=5.75688G/s midpoints=268.829M midpoints/sec=386.338M/s
BM_StdMidpoint<uint64_t, ZeroRand>_BigO         2.59 N          2.59 N
BM_StdMidpoint<uint64_t, ZeroRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144     592252 ns       592255 ns         1182 bytes_read/iteration=1024k bytes_read/sec=1.64889G/s midpoints=309.854M midpoints/sec=442.62M/s
BM_StdMidpoint<uint16_t, ZeroRand>_BigO         2.26 N          2.26 N
BM_StdMidpoint<uint16_t, ZeroRand>_RMS             1 %             1 %
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288      987295 ns       987309 ns          711 bytes_read/iteration=1024k bytes_read/sec=1012.85M/s midpoints=372.769M midpoints/sec=531.028M/s
BM_StdMidpoint<uint8_t, ZeroRand>_BigO          1.88 N          1.88 N
BM_StdMidpoint<uint8_t, ZeroRand>_RMS              1 %             1 %
RUNNING: ./llvm-cmov-bench-NEW --benchmark_out=/tmp/tmpPvwpfW
2019-03-06 21:56:58
Running ./llvm-cmov-bench-NEW
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 1.17, 1.46, 1.30
----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters<...>
----------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072      300878 ns       300880 ns         2324 bytes_read/iteration=1024k bytes_read/sec=3.24569G/s midpoints=304.611M midpoints/sec=435.629M/s
BM_StdMidpoint<int32_t, RandRand>_BigO          2.29 N          2.29 N
BM_StdMidpoint<int32_t, RandRand>_RMS              2 %             2 %
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072     300231 ns       300226 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.25276G/s midpoints=305.398M midpoints/sec=436.578M/s
BM_StdMidpoint<uint32_t, RandRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, RandRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<int64_t, RandRand>/65536       170819 ns       170777 ns         4115 bytes_read/iteration=1024k bytes_read/sec=5.71835G/s midpoints=269.681M midpoints/sec=383.752M/s
BM_StdMidpoint<int64_t, RandRand>_BigO          2.60 N          2.60 N
BM_StdMidpoint<int64_t, RandRand>_RMS              3 %             3 %
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536      171705 ns       171708 ns         4106 bytes_read/iteration=1024k bytes_read/sec=5.68733G/s midpoints=269.091M midpoints/sec=381.671M/s
BM_StdMidpoint<uint64_t, RandRand>_BigO         2.62 N          2.62 N
BM_StdMidpoint<uint64_t, RandRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<int16_t, RandRand>/262144      592510 ns       592516 ns         1182 bytes_read/iteration=1024k bytes_read/sec=1.64816G/s midpoints=309.854M midpoints/sec=442.425M/s
BM_StdMidpoint<int16_t, RandRand>_BigO          2.26 N          2.26 N
BM_StdMidpoint<int16_t, RandRand>_RMS              1 %             1 %
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144     614823 ns       614823 ns         1180 bytes_read/iteration=1024k bytes_read/sec=1.58836G/s midpoints=309.33M midpoints/sec=426.373M/s
BM_StdMidpoint<uint16_t, RandRand>_BigO         2.33 N          2.33 N
BM_StdMidpoint<uint16_t, RandRand>_RMS             4 %             4 %
<...>
BM_StdMidpoint<int8_t, RandRand>/524288      1073181 ns      1073201 ns          650 bytes_read/iteration=1024k bytes_read/sec=931.791M/s midpoints=340.787M midpoints/sec=488.527M/s
BM_StdMidpoint<int8_t, RandRand>_BigO           2.05 N          2.05 N
BM_StdMidpoint<int8_t, RandRand>_RMS               1 %             1 %
BM_StdMidpoint<uint8_t, RandRand>/524288     1071010 ns      1071020 ns          653 bytes_read/iteration=1024k bytes_read/sec=933.689M/s midpoints=342.36M midpoints/sec=489.522M/s
BM_StdMidpoint<uint8_t, RandRand>_BigO          2.05 N          2.05 N
BM_StdMidpoint<uint8_t, RandRand>_RMS              1 %             1 %
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072     300413 ns       300416 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.2507G/s midpoints=305.398M midpoints/sec=436.302M/s
BM_StdMidpoint<uint32_t, ZeroRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, ZeroRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536      169667 ns       169669 ns         4123 bytes_read/iteration=1024k bytes_read/sec=5.75568G/s midpoints=270.205M midpoints/sec=386.257M/s
BM_StdMidpoint<uint64_t, ZeroRand>_BigO         2.59 N          2.59 N
BM_StdMidpoint<uint64_t, ZeroRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144     591396 ns       591404 ns         1184 bytes_read/iteration=1024k bytes_read/sec=1.65126G/s midpoints=310.378M midpoints/sec=443.257M/s
BM_StdMidpoint<uint16_t, ZeroRand>_BigO         2.26 N          2.26 N
BM_StdMidpoint<uint16_t, ZeroRand>_RMS             1 %             1 %
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288     1069421 ns      1069413 ns          655 bytes_read/iteration=1024k bytes_read/sec=935.092M/s midpoints=343.409M midpoints/sec=490.258M/s
BM_StdMidpoint<uint8_t, ZeroRand>_BigO          2.04 N          2.04 N
BM_StdMidpoint<uint8_t, ZeroRand>_RMS              0 %             0 %
Comparing ./llvm-cmov-bench-OLD to ./llvm-cmov-bench-NEW
Benchmark                                                   Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072                 +0.0016         +0.0016        300398        300878        300404        300880
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072                -0.0007         -0.0007        300433        300231        300433        300226
<...>
BM_StdMidpoint<int64_t, RandRand>/65536                  +0.0057         +0.0054        169857        170819        169858        170777
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536                 +0.0114         +0.0114        169770        171705        169771        171708
<...>
BM_StdMidpoint<int16_t, RandRand>/262144                 +0.0023         +0.0023        591169        592510        591179        592516
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144                +0.0398         +0.0398        591264        614823        591274        614823
<...>
BM_StdMidpoint<int8_t, RandRand>/524288                  -0.6403         -0.6403       2983669       1073181       2983689       1073201
<...>
BM_StdMidpoint<uint8_t, RandRand>/524288                 -0.5986         -0.5986       2668398       1071010       2668419       1071020
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072                -0.0016         -0.0016        300887        300413        300887        300416
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536                 +0.0002         +0.0002        169634        169667        169634        169669
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144                -0.0014         -0.0014        592252        591396        592255        591404
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288                 +0.0832         +0.0832        987295       1069421        987309       1069413
```

What can we tell from the benchmark?
* `BM_StdMidpoint<[u]int8_t, RandRand>` indeed has the worst performance.
* All `BM_StdMidpoint<uint{8,16,32}_t, ZeroRand>` are all performant, even the 8-bit case.
  That is because there we are computing mid point between zero and some random number,
  thus if the branch predictor is in use, it is in optimal situation.
* Promoting 8-bit CMOV did improve performance of `BM_StdMidpoint<[u]int8_t, RandRand>`, by -59%..-64%.

# What about branch predictor?
* `BM_StdMidpoint<uint8_t, ZeroRand>` was faster than `BM_StdMidpoint<uint{16,32,64}_t, ZeroRand>`,
  which may mean that well-predicted branch is better than `cmov`.
* Promoting 8-bit CMOV degraded performance of `BM_StdMidpoint<uint8_t, ZeroRand>`,
  `cmov` is up to +10% worse than well-predicted branch.
* However, i do not believe this is a concern. If the branch is well predicted,  then the PGO
  will also say that it is well predicted, and LLVM will happily expand cmov back into branch:
  https://godbolt.org/z/P5ufig

# What about partial register stalls?
I'm not really able to answer that.
What i can say is that if the branch is unpredictable (if it is predictable, then use PGO and you'll have branch)
in ~50% of cases you will have to pay branch misprediction penalty.
```
$ grep -i MispredictPenalty X86Sched*.td
X86SchedBroadwell.td:  let MispredictPenalty = 16;
X86SchedHaswell.td:  let MispredictPenalty = 16;
X86SchedSandyBridge.td:  let MispredictPenalty = 16;
X86SchedSkylakeClient.td:  let MispredictPenalty = 14;
X86SchedSkylakeServer.td:  let MispredictPenalty = 14;
X86ScheduleBdVer2.td:  let MispredictPenalty = 20; // Minimum branch misdirection penalty.
X86ScheduleBtVer2.td:  let MispredictPenalty = 14; // Minimum branch misdirection penalty
X86ScheduleSLM.td:  let MispredictPenalty = 10;
X86ScheduleZnver1.td:  let MispredictPenalty = 17;
```
.. which it can be as small as 10 cycles and as large as 20 cycles.
Partial register stalls do not seem to be an issue for AMD CPU's.
For intel CPU's, they should be around ~5 cycles?
Is that actually an issue here? I'm not sure.

In short, i'd say this is an improvement, at least on this microbenchmark.

Fixes [[ https://bugs.llvm.org/show_bug.cgi?id=40965 | PR40965 ]].

Reviewers: craig.topper, RKSimon, spatel, andreadb, nikic

Reviewed By: craig.topper, andreadb

Subscribers: jfb, jdoerfert, llvm-commits, mclow.lists

Tags: #llvm, #libc

Differential Revision: https://reviews.llvm.org/D59035

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356300 91177308-0d34-0410-b5e6-96231b3b80d8

[AArch64] Turn BIC immediate creation into a DAG combine

Switch BIC immediate creation for vector ANDs from custom lowering
to a DAG combine, which gives generic DAG combines a change to
apply first. In particular this avoids (and x, -1) being turned into
a (bic x, 0) instead of being eliminated entirely.

Differential Revision: https://reviews.llvm.org/D59187

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356299 91177308-0d34-0410-b5e6-96231b3b80d8

AMDGPU: Fix a SIAnnotateControlFlow issue when there are multiple backedges.

Summary:
At the exit of the loop, the compiler uses a register to remember and accumulate
the number of threads that have already exited. When all active threads exit the
loop, this register is used to restore the exec mask, and the execution continues
for the post loop code.

When there is a "continue" in the loop, the compiler made a mistake to reset the
register to 0 when the "continue" backedge is taken. This will result in some
threads not executing the post loop code as they are supposed to.

This patch fixed the issue.

Reviewers:
nhaehnle, arsenm

Differential Revision:
https://reviews.llvm.org/D59312

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356298 91177308-0d34-0410-b5e6-96231b3b80d8

[CMake] Correct CMake message mode

Summary:
This wasn't actually printing out a CMake warning, it was prepending
"WARN" to the message.

Reviewers: zturner

Subscribers: mgorny, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59432

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356297 91177308-0d34-0410-b5e6-96231b3b80d8

[X86] Strip the SAE bit from the rounding mode passed to the _RND opcodes. Use TargetConstant to save a conversion in the isel table.

The asm parser generates the immediate without the SAE bit. So for consistency we should generate the MCInst the same way from CodeGen.

Since they are now both the same, remove the masking from the printer and replace with an llvm_unreachable.

Use a target constant since we're rebuilding the node anyway. Then we don't have to have isel convert it. Saves about 500 bytes from the isel table.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356294 91177308-0d34-0410-b5e6-96231b3b80d8

[SimplifyDemandedVec] Strengthen handling all undef lanes (particularly GEPs)

A change of two parts:
1) A generic enhancement for all callers of SDVE to exploit the fact that if all lanes are undef, the result is undef.
2) A GEP specific piece to strengthen/fix the vector index undef element handling, and call into the generic infrastructure when visiting the GEP.

The result is that we replace a vector gep with at least one undef in each lane with a undef. We can also do the same for vector intrinsics. Once the masked.load patch (D57372) has landed, I'll update to include call tests as well.

Differential Revision: https://reviews.llvm.org/D57468

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356293 91177308-0d34-0410-b5e6-96231b3b80d8

[X86][SSE] Fold scalar_to_vector(i64 anyext(x)) -> bitcast(scalar_to_vector(i32 anyext(x)))

Reduce the size of an any-extended i64 scalar_to_vector source to i32 - the any_extend nodes are often introduced by SimplifyDemandedBits.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356292 91177308-0d34-0410-b5e6-96231b3b80d8