Eli Friedman [Fri, 15 Mar 2019 21:44:49 +0000 (21:44 +0000)]
[ARM] Add MachineVerifier logic for some Thumb1 instructions.
tMOVr and tPUSH/tPOP/tPOP_RET have register constraints which can't be
expressed in TableGen, so check them explicitly. I've unfortunately run
into issues with both of these recently; hopefully this saves some time
for someone else in the future.
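For illustration only (a minimal sketch under assumed names, not the actual patch): a check of this kind can be plugged into the target's verifyInstruction hook, e.g. rejecting a tPUSH whose register list strays outside the Thumb1 low registers plus LR, which TableGen cannot express for this opcode.
```
// Sketch only; ARM backend internal headers and register class names assumed.
#include "ARMBaseInstrInfo.h"

using namespace llvm;

static bool isValidThumb1PushList(const MachineInstr &MI, StringRef &ErrInfo) {
  for (const MachineOperand &MO : MI.operands()) {
    // Skip non-register operands and the predicate's optional CPSR/noreg.
    if (!MO.isReg() || !MO.getReg() || MO.getReg() == ARM::CPSR)
      continue;
    unsigned Reg = MO.getReg();
    if (Reg != ARM::LR && !ARM::tGPRRegClass.contains(Reg)) {
      ErrInfo = "tPUSH register list may only contain R0-R7 and LR";
      return false;
    }
  }
  return true;
}
```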
Roman Lebedev [Fri, 15 Mar 2019 21:17:53 +0000 (21:17 +0000)]
[X86] Promote i8 CMOV's (PR40965)
Summary:
@mclow.lists brought this issue up in IRC; it came up during the
implementation of libc++'s `std::midpoint()` (D59099).
https://godbolt.org/z/oLrHBP
Currently the LLVM X86 backend only promotes an i8 CMOV if it came from 2x`trunc`.
This differential proposes to always promote i8 CMOV.
There are several concerns here:
* Is this actually more performant, or is it just the ASM that looks cuter?
* Does this result in partial register stalls?
* What about the branch predictor?
# Indeed, performance should be the main point here.
Let's look at a simple microbenchmark: {F8412076}
```
#include "benchmark/benchmark.h"
template <typename T>
static void CustomArguments(benchmark::internal::Benchmark* b) {
const size_t L2SizeBytes = 2 * 1024 * 1024;
// What is the largest range we can check to always fit within given L2 cache?
const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 /
/*maximal elt size*/ sizeof(T) / /*safety margin*/ 2;
b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN);
}
// Both of the values are random.
// The comparison is unpredictable.
BENCHMARK_TEMPLATE(BM_StdMidpoint, int32_t, RandRand)
->Apply(CustomArguments<int32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, RandRand)
->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int64_t, RandRand)
->Apply(CustomArguments<int64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, RandRand)
->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int16_t, RandRand)
->Apply(CustomArguments<int16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, RandRand)
->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int8_t, RandRand)
->Apply(CustomArguments<int8_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, RandRand)
->Apply(CustomArguments<uint8_t>);
// One value is always zero, and the other is greater than or equal to zero.
// The comparison is predictable.
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, ZeroRand)
->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, ZeroRand)
->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, ZeroRand)
->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, ZeroRand)
->Apply(CustomArguments<uint8_t>);
```
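(The `BM_StdMidpoint` fixture and the `RandRand`/`ZeroRand` policies live in the attached file {F8412076}; purely as a hypothetical reconstruction, assuming a C++20 standard library with `std::midpoint`, they could look roughly like this:)
```
#include <cstdint>
#include <limits>
#include <numeric>
#include <random>
#include <vector>
#include "benchmark/benchmark.h"

// Hypothetical stand-ins for the fill policies used above.
struct RandRand { // both operands random -> unpredictable comparison
  template <typename T> static void fill(std::vector<T> &A, std::vector<T> &B) {
    std::mt19937_64 Rng(42);
    for (auto &V : A) V = static_cast<T>(Rng());
    for (auto &V : B) V = static_cast<T>(Rng());
  }
};
struct ZeroRand { // first operand zero, second >= 0 -> predictable comparison
  template <typename T> static void fill(std::vector<T> &A, std::vector<T> &B) {
    std::mt19937_64 Rng(42);
    for (auto &V : A) V = T(0);
    for (auto &V : B)
      V = static_cast<T>(Rng() & uint64_t(std::numeric_limits<T>::max()));
  }
};

template <typename T, typename Gen>
static void BM_StdMidpoint(benchmark::State &State) {
  const size_t Len = State.range(0);
  std::vector<T> A(Len), B(Len);
  Gen::fill(A, B);
  for (auto _ : State)
    for (size_t I = 0; I != Len; ++I)
      benchmark::DoNotOptimize(std::midpoint(A[I], B[I]));
  State.SetComplexityN(State.range(0));
}
```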
What can we tell from the benchmark?
* `BM_StdMidpoint<[u]int8_t, RandRand>` indeed has the worst performance.
* All `BM_StdMidpoint<uint{8,16,32}_t, ZeroRand>` variants are performant, even the 8-bit case.
That is because we are computing the midpoint between zero and some random number,
so if the branch predictor is in use, it is in an optimal situation.
* Promoting 8-bit CMOV did improve performance of `BM_StdMidpoint<[u]int8_t, RandRand>`, by -59%..-64%.
# What about the branch predictor?
* `BM_StdMidpoint<uint8_t, ZeroRand>` was faster than `BM_StdMidpoint<uint{16,32,64}_t, ZeroRand>`,
which may mean that a well-predicted branch is better than `cmov`.
* Promoting 8-bit CMOV degraded performance of `BM_StdMidpoint<uint8_t, ZeroRand>`;
`cmov` is up to +10% worse than a well-predicted branch.
* However, I do not believe this is a concern. If the branch is well predicted, then PGO
will also say that it is well predicted, and LLVM will happily expand the cmov back into a branch:
https://godbolt.org/z/P5ufig
# What about partial register stalls?
I'm not really able to answer that.
What I can say is that if the branch is unpredictable (if it is predictable, then use PGO and you'll get a branch),
then in ~50% of cases you will have to pay the branch misprediction penalty.
```
$ grep -i MispredictPenalty X86Sched*.td
X86SchedBroadwell.td: let MispredictPenalty = 16;
X86SchedHaswell.td: let MispredictPenalty = 16;
X86SchedSandyBridge.td: let MispredictPenalty = 16;
X86SchedSkylakeClient.td: let MispredictPenalty = 14;
X86SchedSkylakeServer.td: let MispredictPenalty = 14;
X86ScheduleBdVer2.td: let MispredictPenalty = 20; // Minimum branch misdirection penalty.
X86ScheduleBtVer2.td: let MispredictPenalty = 14; // Minimum branch misdirection penalty
X86ScheduleSLM.td: let MispredictPenalty = 10;
X86ScheduleZnver1.td: let MispredictPenalty = 17;
```
... which can be as small as 10 cycles and as large as 20 cycles.
Partial register stalls do not seem to be an issue for AMD CPUs.
For Intel CPUs, they should be around ~5 cycles?
Is that actually an issue here? I'm not sure.
In short, I'd say this is an improvement, at least on this microbenchmark.
Nikita Popov [Fri, 15 Mar 2019 21:04:34 +0000 (21:04 +0000)]
[AArch64] Turn BIC immediate creation into a DAG combine
Switch BIC immediate creation for vector ANDs from custom lowering
to a DAG combine, which gives generic DAG combines a chance to
apply first. In particular this avoids (and x, -1) being turned into
a (bic x, 0) instead of being eliminated entirely.
Changpeng Fang [Fri, 15 Mar 2019 21:02:48 +0000 (21:02 +0000)]
AMDGPU: Fix a SIAnnotateControlFlow issue when there are multiple backedges.
Summary:
At the exit of the loop, the compiler uses a register to remember and accumulate
the number of threads that have already exited. When all active threads exit the
loop, this register is used to restore the exec mask, and execution continues
with the post-loop code.
When there is a "continue" in the loop, the compiler mistakenly resets the
register to 0 when the "continue" backedge is taken. This results in some
threads not executing the post-loop code as they are supposed to.
Philip Reames [Fri, 15 Mar 2019 19:54:06 +0000 (19:54 +0000)]
[SimplifyDemandedVec] Strengthen handling all undef lanes (particularly GEPs)
A change of two parts:
1) A generic enhancement for all callers of SDVE to exploit the fact that if all lanes are undef, the result is undef.
2) A GEP specific piece to strengthen/fix the vector index undef element handling, and call into the generic infrastructure when visiting the GEP.
The result is that we replace a vector gep that has at least one undef in each lane with an undef. We can also do the same for vector intrinsics. Once the masked.load patch (D57372) has landed, I'll update to include call tests as well.
Nikita Popov [Fri, 15 Mar 2019 18:37:45 +0000 (18:37 +0000)]
[ValueTracking] Use ConstantRange overflow checks for unsigned add/sub; NFC
Use the methods introduced in rL356276 to implement the
computeOverflowForUnsigned(Add|Sub) functions in ValueTracking, by
converting the KnownBits into a ConstantRange.
This is NFC: the existing KnownBits-based implementation uses the same
logic as the ConstantRange-based one. This is not the case for the
signed equivalents, so I'm only changing unsigned here.
This is in preparation for D59386, which will also intersect the
computeConstantRange() result into the range determined from KnownBits.
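For illustration, the unsigned-add case then reduces to something like the following (a minimal sketch assuming the ConstantRange helpers from rL356276; the real ValueTracking code additionally maps the result onto its own OverflowResult enum):
```
#include "llvm/IR/ConstantRange.h"
#include "llvm/Support/KnownBits.h"
using namespace llvm;

// Build unsigned ranges from the known bits and let ConstantRange decide.
static ConstantRange::OverflowResult
unsignedAddOverflow(const KnownBits &LHS, const KnownBits &RHS) {
  ConstantRange LHSRange = ConstantRange::fromKnownBits(LHS, /*IsSigned=*/false);
  ConstantRange RHSRange = ConstantRange::fromKnownBits(RHS, /*IsSigned=*/false);
  return LHSRange.unsignedAddMayOverflow(RHSRange);
}
```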
Philip Reames [Fri, 15 Mar 2019 17:50:30 +0000 (17:50 +0000)]
[X86][GlobalISEL] Support lowering aligned unordered atomics
The existing lowering code is accidentally correct for unordered atomics as far as I can tell. An unordered atomic has no memory ordering requirement, and simply requires the actual load or store to be done as a single well-aligned instruction. As such, relax the restriction while adding tests to ensure the lowering remains correct in the future.
Yonghong Song [Fri, 15 Mar 2019 17:39:10 +0000 (17:39 +0000)]
[BPF] handle external global properly
Previous commit 6bc58e6d3dbd ("[BPF] do not generate unused local/global types")
tried to exclude global variables from type generation. The condition was:
if (Global.hasExternalLinkage())
continue;
This is not right. It also excluded initialized globals.
The correct condition (from AssemblyWriter::printGlobal()) is:
if (!GV->hasInitializer() && GV->hasExternalLinkage())
Out << "external ";
Let us do the same in BTF type generation. Also added a test for it.
Signed-off-by: Yonghong Song <yhs@fb.com>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356279 91177308-0d34-0410-b5e6-96231b3b80d8
Nikita Popov [Fri, 15 Mar 2019 17:29:05 +0000 (17:29 +0000)]
[ConstantRange] Add overflow check helpers
Add functions to ConstantRange that determine whether the
unsigned/signed addition/subtraction of two ConstantRanges
may/always/never overflows. This will allow checking overflow
conditions based on known constant ranges in addition to known bits.
I'm implementing these methods on ConstantRange to allow them to be
unit tested independently of any ValueTracking machinery. The tests
include exhaustive testing on 4-bit ranges, to make sure the result
is both conservatively correct and maximally precise.
The OverflowResult enum is redeclared on ConstantRange, because
I wanted to avoid a dependency in either direction between
ValueTracking.h and ConstantRange.h.
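The exhaustive 4-bit check can be pictured roughly like this (illustrative only; the iteration helper and test shape here are made up, not the actual unit test):
```
#include <cassert>
#include "llvm/ADT/APInt.h"
#include "llvm/IR/ConstantRange.h"
using namespace llvm;

// Apply F to every concrete element of CR, including wrapping ranges.
template <typename Fn> static void forEachElement(const ConstantRange &CR, Fn F) {
  if (CR.isEmptySet())
    return;
  APInt N = CR.getLower();
  do {
    F(N);
    ++N;
  } while (N != CR.getUpper());
}

static void checkUnsignedAdd(const ConstantRange &A, const ConstantRange &B) {
  bool AnyOverflow = false;
  forEachElement(A, [&](const APInt &X) {
    forEachElement(B, [&](const APInt &Y) {
      bool Ov = false;
      (void)X.uadd_ov(Y, Ov);
      AnyOverflow |= Ov;
    });
  });
  // Conservative correctness: a "never overflows" answer must really hold.
  if (A.unsignedAddMayOverflow(B) == ConstantRange::OverflowResult::NeverOverflows)
    assert(!AnyOverflow && "claimed no overflow, but some value pair overflows");
}
```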
Pavel Labath [Fri, 15 Mar 2019 15:34:10 +0000 (15:34 +0000)]
YAMLIO: Improve endian type support
Summary:
Now that endian types support enumerations (D59141), the existing yaml
support for them is somewhat insufficient. The current solution was to
define the ScalarTraits class for these types, which always forwards to
the ScalarTraits of the underlying type. However, the enum types will
usually have ScalarEnumerationTraits or ScalarBitsetTraits.
In this patch I add the two extra Traits types to the endian types. In
order to properly SFINAE-ize them, I've also added an extra "Enable"
template argument to the Traits template classes.
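The general shape of the change is the usual SFINAE pattern (illustrative sketch only, not the YAMLTraits.h code itself):
```
#include <type_traits>

// The traits templates gain a defaulted "Enable" parameter...
template <typename T, typename Enable = void> struct ScalarEnumerationTraits;

// ...so that partial specializations can be constrained, e.g. enabled only
// for enum types (the predicate here is just an example).
template <typename T>
struct ScalarEnumerationTraits<T, std::enable_if_t<std::is_enum<T>::value>> {
  // static void enumeration(IO &io, T &value) would be provided here.
};
```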
Teresa Johnson [Fri, 15 Mar 2019 15:11:38 +0000 (15:11 +0000)]
[ThinLTO] Restructure AliasSummary to contain ValueInfo of Aliasee
Summary:
The AliasSummary previously contained the AliaseeGUID, which was only
populated when reading the summary from bitcode. This patch changes it
to instead hold the ValueInfo of the aliasee, and always populates it.
This enables more efficient access to the ValueInfo (specifically in the
recent patch r352438 which needed to perform an index hash table lookup
using the aliasee GUID).
As noted in the comments in AliasSummary, we no longer technically need
to keep a pointer to the corresponding aliasee summary, since it could
be obtained by walking the list of summaries on the ValueInfo looking
for the summary in the same module. However, I am concerned that this
would be inefficient when walking through the index during the thin
link for various analyses. That can be reevaluated in the future.
By always populating this new field, we can remove the guard and special
handling for a 0 aliasee GUID when dumping the dot graph of the summary.
An additional improvement in this patch is when reading the summaries
from LLVM assembly we now set the AliaseeSummary field to the aliasee
summary in that same module, which makes it consistent with the behavior
when reading the summary from bitcode.
Mikael Holmen [Fri, 15 Mar 2019 13:51:05 +0000 (13:51 +0000)]
[CodeGenPrepare] avoid crashing from replacing a phi twice
Summary:
This is a fix to bug 41052:
https://bugs.llvm.org/show_bug.cgi?id=41052
While trying to optimize a memory instruction in a dead basic block, we end up registering the same phi for replacement twice. This patch avoids registering more than the first replacement candidate for a phi.
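Conceptually, the guard looks like this (names made up for illustration, not the actual CodeGenPrepare code):
```
#include <utility>
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Record a replacement for PN only once; later candidates are ignored.
static void registerReplacement(
    PHINode *PN, Value *Replacement, SmallPtrSetImpl<PHINode *> &Registered,
    SmallVectorImpl<std::pair<PHINode *, Value *>> &Worklist) {
  if (!Registered.insert(PN).second)
    return; // this phi already has a replacement queued
  Worklist.emplace_back(PN, Replacement);
}
```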
Michael Liao [Fri, 15 Mar 2019 12:42:21 +0000 (12:42 +0000)]
[AMDGPU] Fix SGPR fixing through SCC chaining
Summary:
- During the fixing of SGPR copying from VGPR, ensure users of SCC are
properly propagated, i.e.
* only propagate through live def of SCC,
* skip the SCC-def inst itself, and
* stop the propagation on the other SCC-def inst after checking its
SCC-use first.
Florian Hahn [Fri, 15 Mar 2019 12:17:36 +0000 (12:17 +0000)]
[LSR] Check for signed overflow in NarrowSearchSpaceByDetectingSupersets.
We are adding a sign extended IR value to an int64_t, which can cause
signed overflows, as in the attached test case, where we have a formula
with BaseOffset = -1 and a constant with numeric_limits<int64_t>::min().
If the addition would overflow, skip the simplification for this
formula. Note that the target triple is required to trigger the failure.
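A minimal sketch of the kind of guard this adds (illustrative, not the actual LSR code):
```
#include <cstdint>
#include <limits>

// Return true when BaseOffset + Offset would overflow int64_t, in which case
// the formula simplification is skipped.
static bool addWouldOverflow(int64_t BaseOffset, int64_t Offset) {
  if (Offset > 0)
    return BaseOffset > std::numeric_limits<int64_t>::max() - Offset;
  return BaseOffset < std::numeric_limits<int64_t>::min() - Offset;
}
```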
James Henderson [Fri, 15 Mar 2019 10:35:27 +0000 (10:35 +0000)]
[yaml2obj]Allow explicit setting of p_filesz, p_memsz, and p_offset
yaml2obj currently derives the p_filesz, p_memsz, and p_offset values of
program headers from their sections. This makes writing tests for
certain formats more complex, and sometimes impossible. This patch
allows setting these fields explicitly, overriding the default value,
when relevant.
Yonghong Song [Fri, 15 Mar 2019 05:51:25 +0000 (05:51 +0000)]
[BPF] do not generate unused local/global types
The kernel currently has a limit of 64KB for the number of types and
64KB for the size of the string subsection. A simple bcc tool,
runqlat.py, generates:
. the size of ~33KB type section, roughly ~10K types
. the size of ~17KB string section
The majority of the types come from types referenced by local
variables in the bpf program. For example, the kernel "task_struct"
itself recursively brings in ~900 other types.
This patch did the following optimization to avoid generating
unused types:
. do not generate types for local variables unless they are
function arguments.
. do not generate types for external globals.
If an external global is not used in the program, llvm
already removes it from the IR, so the saving from global variables is
typically small. For runqlat.py, only one variable, "llvm.used",
is an external global.
The types for locals and external globals can be added back
once there is a usage for them.
After the above optimization, runqlat.py generates:
. the size of ~1.5KB type section, roughly 500 types
. the size of ~0.7KB string section
UPDATE:
resubmitted the patch after previous revert with
the following fix:
use Global.hasExternalLinkage() to test "external"
linkage instead of using Global.getInitializer(),
which will assert on external variables.
Signed-off-by: Yonghong Song <yhs@fb.com>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356234 91177308-0d34-0410-b5e6-96231b3b80d8
Yonghong Song [Fri, 15 Mar 2019 04:42:01 +0000 (04:42 +0000)]
[BPF] do not generate unused local/global types
The kernel currently has a limit of 64KB for the number of types and
64KB for the size of the string subsection. A simple bcc tool,
runqlat.py, generates:
. the size of ~33KB type section, roughly ~10K types
. the size of ~17KB string section
The majority of the types come from types referenced by local
variables in the bpf program. For example, the kernel "task_struct"
itself recursively brings in ~900 other types.
This patch did the following optimization to avoid generating
unused types:
. do not generate types for local variables unless they are
function arguments.
. do not generate types for external globals.
If an external global is not used in the program, llvm
already removes it from the IR, so the saving from global variables is
typically small. For runqlat.py, only one variable, "llvm.used",
is an external global.
The types for locals and external globals can be added back
once there is a usage for them.
After the above optimization, runqlat.py generates:
. the size of ~1.5KB type section, roughly 500 types
. the size of ~0.7KB string section
Signed-off-by: Yonghong Song <yhs@fb.com>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@356232 91177308-0d34-0410-b5e6-96231b3b80d8
Matt Arsenault [Thu, 14 Mar 2019 23:45:09 +0000 (23:45 +0000)]
AMDGPU: Remove intrinsic operand assert
Before r355981, this was under LLVM_DEBUG. I don't think the assert is
quite right, but this really should be a verifier check. Instcombine
should not be asserting on this sort of thing.
Sanjay Patel [Thu, 14 Mar 2019 23:14:31 +0000 (23:14 +0000)]
[CGP] add another bailout for degenerate code (PR41064)
This is almost the same as:
rL355345
...and should prevent any potential crashing from examples like:
https://bugs.llvm.org/show_bug.cgi?id=41064
...although the bug was masked by:
rL355823
...and I'm not sure how to repro the problem after that change.
Paul Robinson [Thu, 14 Mar 2019 23:09:17 +0000 (23:09 +0000)]
Tighten up tests that use -debugify as a shortcut. NFC
These now verify that a given instruction has a specific source
location, rather than any old location. We want to make sure we
propagate the correct locations from one instruction to another.
Eli Friedman [Thu, 14 Mar 2019 23:08:19 +0000 (23:08 +0000)]
[MC] Sort FDEs by the associated CIE before emitting them.
This isn't necessary according to the DWARF standard, but it matches the
.eh_frame sections emitted by other tools in practice, and the Android
libunwindstack rejects .eh_frame sections where an FDE refers to a CIE
other than the closest previous CIE. So match the other tools and also
sort accordingly.
I consider this a bug in libunwindstack, but it's easy enough to emit
a compatible .eh_frame section for compatibility with installed
operating systems.
Matt Arsenault [Thu, 14 Mar 2019 22:54:43 +0000 (22:54 +0000)]
MIR: Allow targets to serialize MachineFunctionInfo
This has been a very painful missing feature that has made producing
reduced testcases difficult. In particular the various registers
determined for stack access during function lowering were necessary to
avoid undefined register errors in a large percentage of
cases. Implement a subset of the important fields that need to be
preserved for AMDGPU.
Most of the changes are to support targets parsing register fields and
properly reporting errors. The biggest remaining sort-of bug is that
fields which can be initialized from the IR section will be overwritten
by a default-initialized machineFunctionInfo section. Another
remaining bug is that the machineFunctionInfo section is still printed
even if empty.
Pete Couperus [Thu, 14 Mar 2019 20:50:54 +0000 (20:50 +0000)]
[ARC] Add more load/store variants.
On the ARC ISA, the general format of a load instruction is:
LD<zz><.x><.aa><.di> a, [b,c]
And the general format of a store is:
ST<zz><.aa><.di> c, [b,s9]
Where:
<zz> is data size field and can be one of
<empty> (bits 00) - Word (32-bit), default behavior
B (bits 01) - Byte
H (bits 10) - Half-word (16-bit)
<.x> is data extend mode:
<empty> (bit 0) - If size is not Word(32-bit), then data is zero extended
X (bit 1) - If size is not Word(32-bit), then data is sign extended
<.aa> is address write-back mode:
<empty> (bits 00) - no write-back
.AW (bits 01) - Preincrement, base register updated pre memory transaction
.AB (bits 10) - Postincrement, base register updated post memory transaction
Sanjay Patel [Thu, 14 Mar 2019 19:22:08 +0000 (19:22 +0000)]
[InstCombine] canonicalize funnel shift constant shift amount to be modulo bitwidth
The shift argument is defined to be modulo the bitwidth, so if that argument
is a constant, we can always reduce the constant to its minimal form to allow
better CSE and other follow-on transforms.
We need to be careful to ignore constant expressions here, or we will likely
infinite loop. I'm adding a general vector constant query for that case.
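The reduction itself is just an unsigned remainder on the constant (minimal sketch, not the InstCombine code):
```
#include "llvm/ADT/APInt.h"
using namespace llvm;

// E.g. for i8, a funnel-shift amount of 11 canonicalizes to 3.
static APInt canonicalizeFunnelShiftAmount(const APInt &ShAmt) {
  unsigned BitWidth = ShAmt.getBitWidth();
  return ShAmt.urem(APInt(BitWidth, BitWidth));
}
```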
Pete Couperus [Thu, 14 Mar 2019 17:50:46 +0000 (17:50 +0000)]
[ARC] Better classify add/sub immediate instructions in frame lowering.
Summary:
Some operations have multiple ARC instructions that are applicable.
For instance, "add r0, r0, 123" can be encoded as a "LImm" instruction
with a 32-bit immediate (8-bytes), or as a signed 12-bit immediate instruction
for the case where the source and destination register are the same (4-bytes).
The ARC assembler will choose the shortest encoding, but we should track
the correct instruction in the compiler.
This patch fixes the instruction used in some cases from ARCFrameLowering.
Max Moroz [Thu, 14 Mar 2019 17:49:27 +0000 (17:49 +0000)]
Speeding up llvm-cov export with multithreaded renderFiles implementation.
Summary:
CoverageExporterJson::renderFiles accounts for most of the execution time given a large profdata file with multiple binaries.
The proposed solution is to generate the JSON for each file in parallel and sort at the end to preserve deterministic output. Also added flags to skip generating parts of the output in order to trim the output size.
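A rough sketch of the render-in-parallel-then-sort idea (renderFileToJson here is a made-up stand-in for the per-file rendering, not the llvm-cov API):
```
#include <algorithm>
#include <atomic>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical per-file renderer (stub body).
static std::string renderFileToJson(const std::string &SourceFile) {
  return "{\"filename\":\"" + SourceFile + "\"}";
}

static std::vector<std::string>
renderFilesParallel(const std::vector<std::string> &SourceFiles,
                    unsigned NumThreads) {
  std::vector<std::string> Rendered;
  std::mutex Lock;
  std::atomic<size_t> Next{0};
  std::vector<std::thread> Workers;
  for (unsigned T = 0; T != NumThreads; ++T)
    Workers.emplace_back([&] {
      for (size_t I = Next++; I < SourceFiles.size(); I = Next++) {
        std::string Json = renderFileToJson(SourceFiles[I]);
        std::lock_guard<std::mutex> Guard(Lock);
        Rendered.push_back(std::move(Json));
      }
    });
  for (std::thread &W : Workers)
    W.join();
  // Completion order is nondeterministic, so sort for deterministic output.
  std::sort(Rendered.begin(), Rendered.end());
  return Rendered;
}
```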
Building on the work done in D57601, now that we can distinguish between atomic and volatile memory accesses, go ahead and allow code motion of unordered atomics. As seen in the diffs, this allows much better folding of memory operations into using instructions. (Mostly done by the PeepholeOpt pass.)
Note: I have not reviewed all callers of hasOrderedMemoryRef since one of them - isSafeToMove - is very widely used. I'm relying on the documented semantics of each method to judge correctness.
Craig Topper [Thu, 14 Mar 2019 16:53:24 +0000 (16:53 +0000)]
[X86] Fix the pattern changes from r356121 so that the ROR*r1/ROR*m1 pattern use the rotr opcode.
These instructions used to use rotl with a bitwidth-1 immediate. I changed the immediate to 1,
but failed to change the opcode.
Thankfully this seems not to have caused a functional issue, because we now had two rotl-by-1 patterns,
but the correct ones came earlier and took priority. So we just missed some optimization.
Sanjay Patel [Thu, 14 Mar 2019 15:32:34 +0000 (15:32 +0000)]
[x86] prevent infinite looping from vselect commutation (PR41066)
This is an immediate fix for:
https://bugs.llvm.org/show_bug.cgi?id=41066
...but as noted there and the code comments, we should do better
by stubbing this out sooner.
Pavel Labath [Thu, 14 Mar 2019 15:23:40 +0000 (15:23 +0000)]
YAMLIO: Improve template arg deduction for mapOptional
Summary:
The way C++ template argument deduction works, both arguments are used
to deduce the template type in the three-argument overload of
mapOptional. This is a problem if the types are slightly different, even
if they are implicitly convertible. This is fairly easy to trigger with
integral types, as the default type of most integral constants is int,
which then requires casting the constant to the type of the other
argument.
This patch fixes that by using a separate template type for the default
value, which is then cast to the type of the first argument. To avoid
this conversion triggering conversions marked as explicit, we use a
static_assert to check that the types are implicitly convertible.
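In shape, the fix is something like this (a simplified standalone sketch, not the real IO::mapOptional signature or body):
```
#include <type_traits>

// Deduce the default's type separately, then convert it to T explicitly,
// while insisting that the conversion would also have been implicit.
template <typename T, typename DefaultT>
void mapOptional(const char *Key, T &Val, const DefaultT &Default,
                 bool KeyWasPresent) {
  static_assert(std::is_convertible<DefaultT, T>::value,
                "default value must be implicitly convertible to T");
  (void)Key; // the actual key lookup is elided in this sketch
  if (!KeyWasPresent)
    Val = static_cast<T>(Default); // e.g. the int literal 0 for a uint64_t field
}
```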
Matt Arsenault [Thu, 14 Mar 2019 14:18:56 +0000 (14:18 +0000)]
GlobalISel: Use multiple returns for intrinsic structs
This is consistent with what SelectionDAG does and is much easier to
work with than the extract sequence with an artificial wide register.
For the AMDGPU control flow intrinsics, this was producing an s128 for
the i64, i1 tuple return. Any legalization that should apply to a real
s128 value would badly obscure the direct values that need to be seen.
Than McIntosh [Thu, 14 Mar 2019 13:56:49 +0000 (13:56 +0000)]
[SampleFDO] add suffix elision control for fcn names
Summary:
Add hooks for determining the policy used to decide whether/how
to chop off symbol 'suffixes' when locating a given function
in a sample profile.
Prior to this change, any function symbols of the form "X.Y" were
elided/truncated into just "X" when looking up things in a sample
profile data file.
With this change, the policy on suffixes can be changed by adding a
new attribute "sample-profile-suffix-elision-policy" to the function:
this attribute can have the value "all" (the default), "selected", or
"none". A value of "all" preserves the previous behavior (chop off
everything after the first "." character, then treat that as the
symbol name). A value of "selected" chops off only the rightmost
".llvm.XXXX" suffix (where "XXX" is any string not containing a "."
char). A value of "none" indicates that names should be left as is.
Matt Arsenault [Thu, 14 Mar 2019 13:46:14 +0000 (13:46 +0000)]
ARM: Add ImmArg to intrinsics
I found these by asserting in clang for any GCCBuiltin that doesn't
require mangling and requires a constant for the builtin. This means
that intrinsics are missing which don't use GCCBuiltin, don't have
builtins defined in clang, or were missing the constant annotation in
the builtin definition.
James Henderson [Thu, 14 Mar 2019 11:47:41 +0000 (11:47 +0000)]
[llvm-objcopy]Don't implicitly strip sections in segments
This patch changes llvm-objcopy's behaviour to not strip sections that
are in segments, if they otherwise would be due to a stripping operation
(--strip-all, --strip-sections, --strip-non-alloc). This preserves the
segment contents. It does not change the behaviour of --strip-all-gnu
(although we could choose to do so), because GNU objcopy's behaviour in
this case seems to be to strip the section. Nor does it prevent the removal
of sections in segments with --remove-section (if a user REALLY wants to
remove a section, we should probably let them, although I could be
persuaded that a warning might be appropriate). Tests have been added to
show this latter behaviour.
This fixes https://bugs.llvm.org/show_bug.cgi?id=41006.
This is a reland of r356129, attempting to fix greendragon failures
due to a suspected compatibility issue with od on the greendragon bots
versus other versions.
Sam Parker [Thu, 14 Mar 2019 11:14:13 +0000 (11:14 +0000)]
[ARM][ParallelDSP] Enable multiple uses of loads
When choosing whether a pair of loads can be combined into a single
wide load, we check that the load only has a sext user and that the sext
also only has one user. But this can prevent the transformation in
cases where parallel MACs use the same loaded data multiple times.
To enable this, we need to fix up any other uses after creating the
wide load: generating a trunc and a shift + trunc pair to recreate
the narrow values. We also need to keep a record of which loads have
already been widened.
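The fix-up for the extra uses amounts to rebuilding the two narrow values from the wide load, roughly like so (illustrative, not the ParallelDSP code itself):
```
#include "llvm/IR/IRBuilder.h"
using namespace llvm;

// Recreate the low and high narrow values from the newly created wide load so
// that any remaining users of the original narrow loads can be rewritten.
static void recreateNarrowValues(IRBuilder<> &Builder, Value *WideLoad,
                                 IntegerType *NarrowTy, Value *&Bottom,
                                 Value *&Top) {
  // Low half: a plain trunc of the wide value.
  Bottom = Builder.CreateTrunc(WideLoad, NarrowTy);
  // High half: shift down by the narrow width, then trunc.
  Value *Shifted = Builder.CreateLShr(WideLoad, NarrowTy->getBitWidth());
  Top = Builder.CreateTrunc(Shifted, NarrowTy);
}
```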
James Henderson [Thu, 14 Mar 2019 10:20:27 +0000 (10:20 +0000)]
[llvm-objcopy]Don't implicitly strip sections in segments
This patch changes llvm-objcopy's behaviour to not strip sections that
are in segments, if they otherwise would be due to a stripping operation
(--strip-all, --strip-sections, --strip-non-alloc). This preserves the
segment contents. It does not change the behaviour of --strip-all-gnu
(although we could choose to do so), because GNU objcopy's behaviour in
this case seems to be to strip the section. Nor does it prevent the removal
of sections in segments with --remove-section (if a user REALLY wants to
remove a section, we should probably let them, although I could be
persuaded that a warning might be appropriate). Tests have been added to
show this latter behaviour.
This fixes https://bugs.llvm.org/show_bug.cgi?id=41006.
Alex Bradbury [Thu, 14 Mar 2019 08:28:48 +0000 (08:28 +0000)]
[RISCV][NFC] Rename callee saved regs 'CSR' to CSR_ILP32_LP64 and minor RISCVRegisterInfo refactoring
The CSR renaming further prepares the way for an upcoming patch adding support for more
RISC-V ABIs.
Modify RISCVRegisterInfo::getCalleeSavedRegs and
RISCVRegisterInfo::getReservedRegs to do MF->getSubtarget<RISCVSubtarget>()
once rather than multiple times.