Puyan Lotfi [Fri, 30 Aug 2019 19:54:46 +0000 (19:54 +0000)]
[IFS][NFC] llvm-ifs: Fixing build errors for bots using GCC.
gcc produces the error:
error: specialization of
‘template<class T, class Enable> struct llvm::yaml::ScalarTraits’ in
different namespace
For all specializations outside of llvm::yaml. So I added llvm::yaml to these
specializations to fix the errors on the bots building with gcc (/usr/bin/c++).
James Molloy [Fri, 30 Aug 2019 19:50:49 +0000 (19:50 +0000)]
[DFAPacketizer] Allow namespacing of automata per-itinerary
The Hexagon itineraries are cunningly crafted such that functional units between
itineraries do not clash. Because all itineraries are bundled into the same DFA,
a functional unit index clash would cause an incorrect DFA to be generated.
A workaround for this is to ensure all itineraries declare the universe of all
possible functional units, but this isn't ideal for three reasons:
1) We only have a limited number of FUs we can encode in the packetizer, and
using the universe causes us to hit the limit without care.
2) Silent codegen faults are bad, and careful triage of the FU list shouldn't
be required.
3) Smooshing all itineraries into the same automaton allows combinations of
instruction classes that cannot exist, which bloats the table.
A simple solution is to allow "namespacing" packetizers.
Craig Topper [Fri, 30 Aug 2019 19:42:48 +0000 (19:42 +0000)]
[X86] Regenerate the test cases added in r370506.
Something weird happened with the v2i64/v2f64 test cases which
don't use broadcast. So they should already be hoisted, but
weren't in the version I submitted in r370506. This fixes that.
Not sure if something changed or I screwed up.
Craig Topper [Fri, 30 Aug 2019 19:26:06 +0000 (19:26 +0000)]
[X86] Add test caes for opportunities for machine LICM to unfold broadcasted constant pool loads.
MachineLICM is able to unfold loads to move an invariant load out
a loop, but X86 infrastructure currently lacks the ability to do
this when avx512 embedded broadcasting is used.
This test adds examples for the basic float point operations,
add, mul, and, or, and xor.
James Molloy [Fri, 30 Aug 2019 18:49:50 +0000 (18:49 +0000)]
[MachinePipeliner] Separate schedule emission, NFC
This is the first stage in refactoring the pipeliner and making it more
accessible for backends to override and control. This separates the logic and
state required to *emit* a scheudule from the logic that *computes* and
validates a schedule.
This will enable (a) new schedule emitters and (b) new modulo scheduling
implementations to coexist.
This tool merges interface stub files to produce a merged interface stub file
or a stub library. Currently it for stub library generation it can produce an
ELF .so stub file, or a TBD file (experimental). It will be used by the clang
-emit-interface-stubs compilation pipeline to merge and assemble the per-CU
stub files into a stub library.
Simon Pilgrim [Fri, 30 Aug 2019 17:58:55 +0000 (17:58 +0000)]
[TargetLowering] SimplifyDemandedBits ADD/SUB/MUL - correctly inherit SDNodeFlags from the original node.
Just disable NSW/NUW flags. This matches what we're already doing for the other situations for these nodes, it was just missed for the demanded constant case.
Noticed by inspection - confirmed in offline discussion with @spatel. I've checked we have test coverage in the x86 extract-bits.ll and extract-lowbits.ll tests
Craig Topper [Fri, 30 Aug 2019 17:35:08 +0000 (17:35 +0000)]
[X86] Pass v32i16/v64i8 in zmm registers on KNL target.
gcc and icc pass these types in zmm registers in zmm registers.
This patch implements a quick hack to override the register
type before calling convention handling to one that is legal.
Longer term we might want to do something similar to 256-bit
integer registers on AVX1 where we just split all the operations.
Evgeniy Stepanov [Fri, 30 Aug 2019 17:23:02 +0000 (17:23 +0000)]
MemTag: unchecked load/store optimization.
Summary:
MTE allows memory access to bypass tag check iff the address argument
is [SP, #imm]. This change takes advantage of this to demote uses of
tagged addresses to regular FrameIndex operands, reducing register
pressure in large functions.
MO_TAGGED target flag is used to signal that the FrameIndex operand
refers to memory that might be tagged, and needs to be handled with
care. Such operand must be lowered to [SP, #imm] directly, without a
scratch register.
The transformation pass attempts to predict when the offset will be
out of range and disable the optimization.
AArch64RegisterInfo::eliminateFrameIndex has an escape hatch in case
this prediction has been wrong, but it is quite inefficient and should
be avoided.
Craig Topper [Fri, 30 Aug 2019 16:05:57 +0000 (16:05 +0000)]
[X86] Merge X86InstrInfo::loadRegFromAddr/storeRegToAddr into their only call site.
I'm looking at unfolding broadcast loads on AVX512 which will
require refactoring this code to select broadcast opcodes instead
of regular load/stores in some cases. Merging them to avoid
further complicating their interfaces.
[Attributor] Use existing function information for the call site
Summary:
Instead of recomputing information for call sites we now use the
function information directly. This is always valid and once we have
call site specific information we can improve here.
This patch also bootstraps attributes that are created on-demand through
an initial update call. Information that is known will then directly be
available in the new attribute without causing an iteration delay.
The tests show how this improves the iteration count.
George Rimar [Fri, 30 Aug 2019 13:39:22 +0000 (13:39 +0000)]
[yaml2obj][obj2yaml] - Use a single "Other" field instead of "Other", "Visibility" and "StOther".
Currenly we can encode the 'st_other' field of symbol using 3 fields.
'Visibility' is used to encode STV_* values.
'Other' is used to encode everything except the visibility, but it can't handle arbitrary values.
'StOther' is used to encode arbitrary values when 'Visibility'/'Other' are not helpfull enough.
'st_other' field is used to encode symbol visibility and platform-dependent
flags and values. Problem to encode it is that it consists of Visibility part (STV_* values)
which are enumeration values and the Other part, which is different and inconsistent.
For MIPS the Other part contains flags for all STO_MIPS_* values except STO_MIPS_MIPS16.
(Like comment in ELFDumper says: "Someones in their infinite wisdom decided to make
STO_MIPS_MIPS16 flag overlapped with other ST_MIPS_xxx flags."...)
And for PPC64 the Other part might actually encode any value.
This patch implements custom logic for handling the st_other and removes
'Visibility' and 'StOther' fields.
Here is an example of a new YAML style this patch allows:
Simon Pilgrim [Fri, 30 Aug 2019 12:22:06 +0000 (12:22 +0000)]
[DAGCombine] visitMULHS - use getScalarValueSizeInBits() to make safe for vector types.
This is hidden behind a (scalar-only) isOneConstant(N1) check at the moment, but once we get around to adding vector support we need to ensure we're dealing with the scalar bitwidth, not the total.
Bjorn Pettersson [Fri, 30 Aug 2019 11:23:10 +0000 (11:23 +0000)]
[CodeGen] Introduce MachineBasicBlock::replacePhiUsesWith helper and use it. NFC
Summary:
Found a couple of places in the code where all the PHI nodes
of a MBB is updated, replacing references to one MBB by
reference to another MBB instead.
This patch simply refactors the code to use a common helper
(MachineBasicBlock::replacePhiUsesWith) for such PHI node
updates.
Roman Lebedev [Fri, 30 Aug 2019 09:51:23 +0000 (09:51 +0000)]
[LoopIdiomRecognize] BCmp loop idiom recognition
Summary:
@mclow.lists brought up this issue up in IRC.
It is a reasonably common problem to compare some two values for equality.
Those may be just some integers, strings or arrays of integers.
In C, there is `memcmp()`, `bcmp()` functions.
In C++, there exists `std::equal()` algorithm.
One can also write that function manually.
libstdc++'s `std::equal()` is specialized to directly call `memcmp()` for
various types, but not `std::byte` from C++2a. https://godbolt.org/z/mx2ejJ
libc++ does not do anything like that, it simply relies on simple C++'s
`operator==()`. https://godbolt.org/z/er0Zwf (GOOD!)
So likely, there exists a certain performance opportunities.
Let's compare performance of naive `std::equal()` (no `memcmp()`) with one that
is using `memcmp()` (in this case, compiled with modified compiler). {F8768213}
template <typename T>
static void CustomArguments(benchmark::internal::Benchmark* b) {
const size_t L2SizeBytes = []() {
for (const benchmark::CPUInfo::CacheInfo& I :
benchmark::CPUInfo::Get().caches) {
if (I.level == 2) return I.size;
}
return 0;
}();
// What is the largest range we can check to always fit within given L2 cache?
const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 /
/*maximal elt size*/ sizeof(T) / /*safety margin*/ 2;
b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN);
}
BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, InequalHalfway)
->Apply(CustomArguments<uint8_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint16_t, InequalHalfway)
->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint32_t, InequalHalfway)
->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint64_t, InequalHalfway)
->Apply(CustomArguments<uint64_t>);
```
{F8768210}
```
$ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks build-{old,new}/test/llvm-bcmp-bench
RUNNING: build-old/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpb6PEUx
2019-04-25 21:17:11
Running build-old/test/llvm-bcmp-bench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
L1 Data 16K (x8)
L1 Instruction 64K (x4)
L2 Unified 2048K (x4)
L3 Unified 8192K (x1)
Load Average: 0.65, 3.90, 4.14
---------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------
<...>
BM_bcmp<uint8_t, Identical>/512000 432131 ns 432101 ns 1613 bytes_read/iteration=1000k bytes_read/sec=2.20706G/s eltcnt=825.856M eltcnt/sec=1.18491G/s
BM_bcmp<uint8_t, Identical>_BigO 0.86 N 0.86 N
BM_bcmp<uint8_t, Identical>_RMS 8 % 8 %
<...>
BM_bcmp<uint16_t, Identical>/256000 161408 ns 161409 ns 4027 bytes_read/iteration=1000k bytes_read/sec=5.90843G/s eltcnt=1030.91M eltcnt/sec=1.58603G/s
BM_bcmp<uint16_t, Identical>_BigO 0.67 N 0.67 N
BM_bcmp<uint16_t, Identical>_RMS 25 % 25 %
<...>
BM_bcmp<uint32_t, Identical>/128000 81497 ns 81488 ns 8415 bytes_read/iteration=1000k bytes_read/sec=11.7032G/s eltcnt=1077.12M eltcnt/sec=1.57078G/s
BM_bcmp<uint32_t, Identical>_BigO 0.71 N 0.71 N
BM_bcmp<uint32_t, Identical>_RMS 42 % 42 %
<...>
BM_bcmp<uint64_t, Identical>/64000 50138 ns 50138 ns 10909 bytes_read/iteration=1000k bytes_read/sec=19.0209G/s eltcnt=698.176M eltcnt/sec=1.27647G/s
BM_bcmp<uint64_t, Identical>_BigO 0.84 N 0.84 N
BM_bcmp<uint64_t, Identical>_RMS 27 % 27 %
<...>
BM_bcmp<uint8_t, InequalHalfway>/512000 192405 ns 192392 ns 3638 bytes_read/iteration=1000k bytes_read/sec=4.95694G/s eltcnt=1.86266G eltcnt/sec=2.66124G/s
BM_bcmp<uint8_t, InequalHalfway>_BigO 0.38 N 0.38 N
BM_bcmp<uint8_t, InequalHalfway>_RMS 3 % 3 %
<...>
BM_bcmp<uint16_t, InequalHalfway>/256000 127858 ns 127860 ns 5477 bytes_read/iteration=1000k bytes_read/sec=7.45873G/s eltcnt=1.40211G eltcnt/sec=2.00219G/s
BM_bcmp<uint16_t, InequalHalfway>_BigO 0.50 N 0.50 N
BM_bcmp<uint16_t, InequalHalfway>_RMS 0 % 0 %
<...>
BM_bcmp<uint32_t, InequalHalfway>/128000 49140 ns 49140 ns 14281 bytes_read/iteration=1000k bytes_read/sec=19.4072G/s eltcnt=1.82797G eltcnt/sec=2.60478G/s
BM_bcmp<uint32_t, InequalHalfway>_BigO 0.40 N 0.40 N
BM_bcmp<uint32_t, InequalHalfway>_RMS 18 % 18 %
<...>
BM_bcmp<uint64_t, InequalHalfway>/64000 32101 ns 32099 ns 21786 bytes_read/iteration=1000k bytes_read/sec=29.7101G/s eltcnt=1.3943G eltcnt/sec=1.99381G/s
BM_bcmp<uint64_t, InequalHalfway>_BigO 0.50 N 0.50 N
BM_bcmp<uint64_t, InequalHalfway>_RMS 1 % 1 %
RUNNING: build-new/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpQ46PP0
2019-04-25 21:19:29
Running build-new/test/llvm-bcmp-bench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
L1 Data 16K (x8)
L1 Instruction 64K (x4)
L2 Unified 2048K (x4)
L3 Unified 8192K (x1)
Load Average: 1.01, 2.85, 3.71
---------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------
<...>
BM_bcmp<uint8_t, Identical>/512000 18593 ns 18590 ns 37565 bytes_read/iteration=1000k bytes_read/sec=51.2991G/s eltcnt=19.2333G eltcnt/sec=27.541G/s
BM_bcmp<uint8_t, Identical>_BigO 0.04 N 0.04 N
BM_bcmp<uint8_t, Identical>_RMS 37 % 37 %
<...>
BM_bcmp<uint16_t, Identical>/256000 18950 ns 18948 ns 37223 bytes_read/iteration=1000k bytes_read/sec=50.3324G/s eltcnt=9.52909G eltcnt/sec=13.511G/s
BM_bcmp<uint16_t, Identical>_BigO 0.08 N 0.08 N
BM_bcmp<uint16_t, Identical>_RMS 34 % 34 %
<...>
BM_bcmp<uint32_t, Identical>/128000 18627 ns 18627 ns 37895 bytes_read/iteration=1000k bytes_read/sec=51.198G/s eltcnt=4.85056G eltcnt/sec=6.87168G/s
BM_bcmp<uint32_t, Identical>_BigO 0.16 N 0.16 N
BM_bcmp<uint32_t, Identical>_RMS 35 % 35 %
<...>
BM_bcmp<uint64_t, Identical>/64000 18855 ns 18855 ns 37458 bytes_read/iteration=1000k bytes_read/sec=50.5791G/s eltcnt=2.39731G eltcnt/sec=3.3943G/s
BM_bcmp<uint64_t, Identical>_BigO 0.32 N 0.32 N
BM_bcmp<uint64_t, Identical>_RMS 33 % 33 %
<...>
BM_bcmp<uint8_t, InequalHalfway>/512000 9570 ns 9569 ns 73500 bytes_read/iteration=1000k bytes_read/sec=99.6601G/s eltcnt=37.632G eltcnt/sec=53.5046G/s
BM_bcmp<uint8_t, InequalHalfway>_BigO 0.02 N 0.02 N
BM_bcmp<uint8_t, InequalHalfway>_RMS 29 % 29 %
<...>
BM_bcmp<uint16_t, InequalHalfway>/256000 9547 ns 9547 ns 74343 bytes_read/iteration=1000k bytes_read/sec=99.8971G/s eltcnt=19.0318G eltcnt/sec=26.8159G/s
BM_bcmp<uint16_t, InequalHalfway>_BigO 0.04 N 0.04 N
BM_bcmp<uint16_t, InequalHalfway>_RMS 29 % 29 %
<...>
BM_bcmp<uint32_t, InequalHalfway>/128000 9396 ns 9394 ns 73521 bytes_read/iteration=1000k bytes_read/sec=101.518G/s eltcnt=9.41069G eltcnt/sec=13.6255G/s
BM_bcmp<uint32_t, InequalHalfway>_BigO 0.08 N 0.08 N
BM_bcmp<uint32_t, InequalHalfway>_RMS 30 % 30 %
<...>
BM_bcmp<uint64_t, InequalHalfway>/64000 9499 ns 9498 ns 73802 bytes_read/iteration=1000k bytes_read/sec=100.405G/s eltcnt=4.72333G eltcnt/sec=6.73808G/s
BM_bcmp<uint64_t, InequalHalfway>_BigO 0.16 N 0.16 N
BM_bcmp<uint64_t, InequalHalfway>_RMS 28 % 28 %
Comparing build-old/test/llvm-bcmp-bench to build-new/test/llvm-bcmp-bench
Benchmark Time CPU Time Old Time New CPU Old CPU New
---------------------------------------------------------------------------------------------------------------------------------------
<...>
BM_bcmp<uint8_t, Identical>/512000 -0.9570 -0.9570 432131 18593 432101 18590
<...>
BM_bcmp<uint16_t, Identical>/256000 -0.8826 -0.8826 161408 18950 161409 18948
<...>
BM_bcmp<uint32_t, Identical>/128000 -0.7714 -0.7714 81497 18627 81488 18627
<...>
BM_bcmp<uint64_t, Identical>/64000 -0.6239 -0.6239 50138 18855 50138 18855
<...>
BM_bcmp<uint8_t, InequalHalfway>/512000 -0.9503 -0.9503 192405 9570 192392 9569
<...>
BM_bcmp<uint16_t, InequalHalfway>/256000 -0.9253 -0.9253 127858 9547 127860 9547
<...>
BM_bcmp<uint32_t, InequalHalfway>/128000 -0.8088 -0.8088 49140 9396 49140 9394
<...>
BM_bcmp<uint64_t, InequalHalfway>/64000 -0.7041 -0.7041 32101 9499 32099 9498
```
What can we tell from the benchmark?
* Performance of naive equality check somewhat improves with element size,
maxing out at eltcnt/sec=1.58603G/s for uint16_t, or bytes_read/sec=19.0209G/s
for uint64_t. I think, that instability implies performance problems.
* Performance of `memcmp()`-aware benchmark always maxes out at around
bytes_read/sec=51.2991G/s for every type. That is 2.6x the throughput of the
naive variant!
* eltcnt/sec metric for the `memcmp()`-aware benchmark maxes out at
eltcnt/sec=27.541G/s for uint8_t (was: eltcnt/sec=1.18491G/s, so 24x) and
linearly decreases with element size.
For uint64_t, it's ~4x+ the elements/second.
* The call obvious is more pricey than the loop, with small element count.
As it can be seen from the full output {F8768210}, the `memcmp()` is almost
universally worse, independent of the element size (and thus buffer size) when
element count is less than 8.
So all in all, bcmp idiom does indeed pose untapped performance headroom.
This diff does implement said idiom recognition. I think a reasonable test
coverage is present, but do tell if there is anything obvious missing.
Now, quality. This does succeed to build and pass the test-suite, at least
without any non-bundled elements. {F8768216} {F8768217}
This transform fires 91 times:
```
$ /build/test-suite/utils/compare.py -m loop-idiom.NumBCmp result-new.json
Tests: 1149
Metric: loop-idiom.NumBCmp
Program result-new
MultiSourc...Benchmarks/7zip/7zip-benchmark 79.00
MultiSource/Applications/d/make_dparser 3.00
SingleSource/UnitTests/vla 2.00
MultiSource/Applications/Burg/burg 1.00
MultiSourc.../Applications/JM/lencod/lencod 1.00
MultiSource/Applications/lemon/lemon 1.00
MultiSource/Benchmarks/Bullet/bullet 1.00
MultiSourc...e/Benchmarks/MallocBench/gs/gs 1.00
MultiSourc...gs-C/TimberWolfMC/timberwolfmc 1.00
MultiSourc...Prolangs-C/simulator/simulator 1.00
```
The size changes are:
I'm not sure what's going on with SingleSource/UnitTests/vla.test yet, did not look.
```
$ /build/test-suite/utils/compare.py -m size..text result-{old,new}.json --filter-hash
Tests: 1149
Same hash: 907 (filtered out)
Remaining: 242
Metric: size..text
Program result-old result-new diff
test-suite...ingleSource/UnitTests/vla.test 753.00 833.00 10.6%
test-suite...marks/7zip/7zip-benchmark.test 1001697.00 966657.00 -3.5%
test-suite...ngs-C/simulator/simulator.test 32369.00 32321.00 -0.1%
test-suite...plications/d/make_dparser.test 89585.00 89505.00 -0.1%
test-suite...ce/Applications/Burg/burg.test 40817.00 40785.00 -0.1%
test-suite.../Applications/lemon/lemon.test 47281.00 47249.00 -0.1%
test-suite...TimberWolfMC/timberwolfmc.test 250065.00 250113.00 0.0%
test-suite...chmarks/MallocBench/gs/gs.test 149889.00 149873.00 -0.0%
test-suite...ications/JM/lencod/lencod.test 769585.00 769569.00 -0.0%
test-suite.../Benchmarks/Bullet/bullet.test 770049.00 770049.00 0.0%
test-suite...HMARK_ANISTROPIC_DIFFUSION/128 NaN NaN nan%
test-suite...HMARK_ANISTROPIC_DIFFUSION/256 NaN NaN nan%
test-suite...CHMARK_ANISTROPIC_DIFFUSION/64 NaN NaN nan%
test-suite...CHMARK_ANISTROPIC_DIFFUSION/32 NaN NaN nan%
test-suite...ENCHMARK_BILATERAL_FILTER/64/4 NaN NaN nan%
Geomean difference nan%
result-old result-new diff
count 1.000000e+01 10.00000 10.000000
mean 3.152090e+05 311695.40000 0.006749
std 3.790398e+05 372091.42232 0.036605
min 7.530000e+02 833.00000 -0.034981
25% 4.243300e+04 42401.00000 -0.000866
50% 1.197370e+05 119689.00000 -0.000392
75% 6.397050e+05 639705.00000 -0.000005
max 1.001697e+06 966657.00000 0.106242
```
I don't have timings though.
And now to the code. The basic idea is to completely replace the whole loop.
If we can't fully kill it, don't transform.
I have left one or two comments in the code, so hopefully it can be understood.
Also, there is a few TODO's that i have left for follow-ups:
* widening of `memcmp()`/`bcmp()`
* step smaller than the comparison size
* Metadata propagation
* more than two blocks as long as there is still a single backedge?
* ???
Summary:
The internal `Builder` is private, which means there is
currently no way to set the debuginfo locations for `SCEVExpander`.
This only adds the wrappers, but does not use them anywhere.
David Stenberg [Fri, 30 Aug 2019 09:06:50 +0000 (09:06 +0000)]
[LiveDebugValues] Insert entry values after bundles
Summary:
Change LiveDebugValues so that it inserts entry values after the bundle
which contains the clobbering instruction. Previously it would insert
the debug value after the bundle head using insertAfter(), breaking the
bundle.
Dmitri Gribenko [Fri, 30 Aug 2019 08:21:55 +0000 (08:21 +0000)]
[ADT] Removed VariadicFunction
Summary:
It is not used. It uses macro-based unrolling instead of variadic
templates, so it is not idiomatic anymore, and therefore it is a
questionable API to keep "just in case".
Martin Storsjo [Fri, 30 Aug 2019 06:56:33 +0000 (06:56 +0000)]
[LLD] [COFF] Support merging resource object files
Extend WindowsResourceParser to support using a ResourceSectionRef for
loading resources from an object file.
Only allow merging resource object files in mingw mode; keep the
existing error on multiple resource objects in link mode.
If there only is one resource object file and no .res resources,
don't parse and recreate the .rsrc section, but just link it in without
inspecting it. This allows users to produce any .rsrc section (outside
of what the parser supports), just like before. (I don't have a specific
need for this, but it reduces the risk of this new feature.)
Separate out the .rsrc section chunks in InputFiles.cpp, and only include
them in the list of section chunks to link if we've determined that there
only was one single resource object. (We need to keep other chunks from
those object files, as they can legitimately contain other sections as
well, in addition to .rsrc section chunks.)
Martin Storsjo [Fri, 30 Aug 2019 06:56:02 +0000 (06:56 +0000)]
[WindowsResource] Remove use of global variables in WindowsResourceParser
Instead of updating a global variable counter for the next index of
strings and data blobs, pass along a reference to actual data/string
vectors and let the TreeNode insertion methods add their data/strings to
the vectors when a new entry is needed.
Additionally, if the resource tree had duplicates, that were ignored
with -force:multipleres in lld, we no longer store all versions of the
duplicated resource data, now we only keep the one that actually ends
up referenced.
Martin Storsjo [Fri, 30 Aug 2019 06:55:49 +0000 (06:55 +0000)]
[COFF] Add a ResourceSectionRef method for getting resource contents
This allows llvm-readobj to print the contents of each resource
when printing resources from an object file or executable, like it
already does for plain .res files.
This requires providing the whole COFFObjectFile to ResourceSectionRef.
This supports both object files and executables. For executables,
the DataRVA field is used as is to look up the right section.
For object files, ideally we would need to complete linking of them
and fix up all relocations to know what the DataRVA field would end up
being. In practice, the only thing that makes sense for an RVA field
is an ADDR32NB relocation. Thus, find a relocation pointing at this
field, verify that it has the expected type, locate the symbol it
points at, look up the section the symbol points at, and read from the
right offset in that section.
This works both for GNU windres object files (which use one single
.rsrc section, with all relocations against the base of the .rsrc
section, with the original value of the DataRVA field being the
offset of the data from the beginning of the .rsrc section) and
cvtres object files (with two separate .rsrc$01 and .rsrc$02 sections,
and one symbol per data entry, with the original pre-relocated DataRVA
field being set to zero).
Dan Gohman [Fri, 30 Aug 2019 04:33:22 +0000 (04:33 +0000)]
[CodeGen] Fix lowering for returning the result of an extractvalue
When the number of return values exceeds the number of registers available,
SelectionDAGBuilder::visitRet transforms a function's return to use a
pointer to a buffer to hold return values. When the returned value is an
operator such as extractvalue, the value may have a non-zero result number.
Add that number to the indexing when obtaining the values to store.
This fixes https://bugs.llvm.org/show_bug.cgi?id=43132.
Unlike ppc64, which has ADDISgotTprelHA+LDgotTprelL pairs,
ppc32 just uses LDgotTprelL32, so it does not make lots of sense to use
_LO without a paired _HA.
Emit R_PPC_GOT_TPREL16 instead R_PPC_GOT_TPREL16_LO to match GCC, and
get better linker relocation check. Note, R_PPC_GOT_TPREL16_{HA,LO}
don't have good linker support:
(a) lld does not support R_PPC_GOT_TPREL16_{HA,LO}.
(b) Top of tree ld.bfd does not support R_PPC_GOT_REL16_HA Initial-Exec -> Local-Exec relaxation:
DebugInfo: add CodeView register mapping for ARM NT
Add the core registers and NEON registers mapping to the CodeView
register ID. This is sufficient to compile a basic C program with debug
info using CodeView debug info.
Dan Gohman [Thu, 29 Aug 2019 22:40:00 +0000 (22:40 +0000)]
[WebAssembly] Make __attribute__((used)) not imply export.
Add an WASM_SYMBOL_NO_STRIP flag, so that __attribute__((used)) doesn't
need to imply exporting. When targeting Emscripten, have
WASM_SYMBOL_NO_STRIP imply exporting.
`getExtendTypeForInst` includes the checks for loads and stores. This will be
used for WRO addressing modes in loads + stores.
Teach selectCopy to properly handle subregister copies on the same bank in
order to support `narrowExtendRegIfNeeded`. The extended register must be a
GPR32, so we need to support same-bank subregister copies.
Fix a bug in getSubRegForClass which would cause registers on things like
GPR32common to end up getting ssub. Just change the check to look for FPR32
rather than GPR32.
For tests:
- Add select-arith-extended-reg.mir
- Update addsub_ext.ll to include GlobalISel checks
Reid Kleckner [Thu, 29 Aug 2019 21:15:02 +0000 (21:15 +0000)]
Allow '@' to appear in x86 mingw symbols
Summary:
There is no reason to differ in assembler behavior here between -msvc
and -gnu targets. Without this setting, the text after the '@' is
interpreted as a symbol variable, like foo@IMGREL.
Simon Pilgrim [Thu, 29 Aug 2019 20:22:08 +0000 (20:22 +0000)]
[X86][SSE] combinePMULDQ - pmuldq(x, 0) -> zero vector (PR43159)
ISD::isBuildVectorAllZeros permits undef elements to be present, which means we can't return it as a zero vector. PMULDQ/PMULUDQ is an extending multiply so a multiply by zero of the lower 32-bits should result in a zero 64-bit element.
Craig Topper [Thu, 29 Aug 2019 18:09:02 +0000 (18:09 +0000)]
[X86] Remove what little support we had for MPX
-Deprecate -mmpx and -mno-mpx command line options
-Remove CPUID detection of mpx for -march=native
-Remove MPX from all CPUs
-Remove MPX preprocessor define
I've left the "mpx" string in the backend so we don't fail on old IR, but its not connected to anything.
gcc has also deprecated these command line options. https://www.phoronix.com/scan.php?page=news_item&px=GCC-Patch-To-Drop-MPX
Matt Arsenault [Thu, 29 Aug 2019 17:24:32 +0000 (17:24 +0000)]
GlobalISel: Add known bits to InstructionSelector
AMDGPU uses this for some addressing mode selection patterns. The
analysis run itself doesn't do anything so it seems easier to just
always require this than adding a way to opt in.
Alina Sbirlea [Thu, 29 Aug 2019 17:08:13 +0000 (17:08 +0000)]
[MemorySSA & LoopPassManager] Enable MemorySSA as loop dependency. Update tests.
Summary:
I'm not planning to check this in at the moment, but feedback is very welcome, in particular how this affects performance.
The feedback obtains here will guide the next steps towards enabling this.
This patch enables the use of MemorySSA in the loop pass manager.
Passes that currently use MemorySSA:
- EarlyCSE
Passes that use MemorySSA after this patch:
- EarlyCSE
- LICM
- SimpleLoopUnswitch
Loop passes that update MemorySSA (and do not use it yet, but could use it after this patch):
- LoopInstSimplify
- LoopSimplifyCFG
- LoopUnswitch
- LoopRotate
- LoopSimplify
- LCSSA
Loop passes that do *not* update MemorySSA:
- IndVarSimplify
- LoopDelete
- LoopIdiom
- LoopSink
- LoopUnroll
- LoopInterchange
- LoopUnrollAndJam
- LoopVectorize
- LoopReroll
- IRCE
Michael Liao [Thu, 29 Aug 2019 16:12:05 +0000 (16:12 +0000)]
[SimplifyCFG] Skip sinking common lifetime markers of `alloca`.
Summary:
- Similar to the workaround in fix of PR30188, skip sinking common
lifetime markers of `alloca`. They are mostly left there after
inlining functions in branches.
Roman Lebedev [Thu, 29 Aug 2019 14:46:49 +0000 (14:46 +0000)]
[NFC][SimplifyCFG] 'Safely extract low bits' pattern will also benefit from -phi-node-folding-threshold=3
This is the naive implementation of x86 BZHI/BEXTR instruction:
it takes input and bit count, and extracts low nbits up to bit width.
I.e. unlike shift it does not have any UB when nbits >= bitwidth.
Which means we don't need a while PHI here, simple select will do.
And if it's a select, it should then be trivial to fix codegen
to select it to BEXTR/BZHI.
Pavel Labath [Thu, 29 Aug 2019 14:26:05 +0000 (14:26 +0000)]
DWARFDebugLoc: Make parsing and error reporting more robust
Summary:
While examining this class for possible use in lldb, I noticed two
things:
- it spits out parsing errors directly to stderr
- the loclists parser can incorrectly return valid location lists when
parsing malformed (truncated) data
I improve the stderr situation by making the parseOneLocationList
functions return Expected<T>s. The errors are still dumped to stderr by
their callers, so this is only a partial fix, but it is enough for my
use case, as I intend to parse the locations lists one by one.
I fix the behavior in the truncated scenario by using the newly
introduced DataExtractor Cursor API.
I also add tests for handling the error cases, as they currently have no
coverage.
Simon Atanasyan [Thu, 29 Aug 2019 13:19:50 +0000 (13:19 +0000)]
[mips] Inline emitStoreWithSymOffset and emitLoadWithSymOffset methods. NFC
Both methods `MipsTargetStreamer::emitStoreWithSymOffset` and
`MipsTargetStreamer::emitLoadWithSymOffset` are almost the same and
differ argument names only. These methods are used in the single place
so it's better to inline their code and remove original methods.
When a "base" in the `lw/sw $reg1, symbol($reg2)` instruction is
a register and generated code is position independent, backend
does not add the "base" value to the symbol address.
```
lw $reg1, %got(symbol)($gp)
lw/sw $reg1, 0($reg1)
```
This patch fixes the bug and adds the missed `addu` instruction by
passing `BaseReg` into the `loadAndAddSymbolAddress` routine and handles
the case when the `BaseReg` is the zero register to escape redundant
`move reg, reg` instruction:
```
lw $reg1, %got(symbol)($gp)
addu $reg1, $reg1, $reg2
lw/sw $reg1, 0($reg1)
```
Roman Lebedev [Thu, 29 Aug 2019 12:48:04 +0000 (12:48 +0000)]
[InstSimplify] Drop leftover "division-by-zero guard" around `@llvm.umul.with.overflow` inverted overflow bit
Summary:
Now that with D65143/D65144 we've produce `@llvm.umul.with.overflow`,
and with D65147 we've flattened the CFG, we now can see that
the guard may have been there to prevent division by zero is redundant.
We can simply drop it:
```
----------------------------------------
Name: no overflow or zero
%iszero = icmp eq i4 %y, 0
%umul = smul_overflow i4 %x, %y
%umul.ov = extractvalue {i4, i1} %umul, 1
%umul.ov.not = xor %umul.ov, -1
%retval.0 = or i1 %iszero, %umul.ov.not
ret i1 %retval.0
=>
%iszero = icmp eq i4 %y, 0
%umul = smul_overflow i4 %x, %y
%umul.ov = extractvalue {i4, i1} %umul, 1
%umul.ov.not = xor %umul.ov, -1
%retval.0 = or i1 %iszero, %umul.ov.not
ret i1 %umul.ov.not
Done: 1
Optimization is correct!
```
Note that this is inverted from what we have in a previous patch,
here we are looking for the inverted overflow bit.
And that inversion is kinda problematic - given this particular
pattern we neither hoist that `not` closer to `ret` (then the pattern
would have been identical to the one without inversion,
and would have been handled by the previous patch), neither
do the opposite transform. But regardless, we should handle this too.
I've filled [[ https://bugs.llvm.org/show_bug.cgi?id=42720 | PR42720 ]].
Roman Lebedev [Thu, 29 Aug 2019 12:47:50 +0000 (12:47 +0000)]
[InstSimplify] Drop leftover "division-by-zero guard" around `@llvm.umul.with.overflow` overflow bit
Summary:
Now that with D65143/D65144 we've produce `@llvm.umul.with.overflow`,
and with D65147 we've flattened the CFG, we now can see that
the guard may have been there to prevent division by zero is redundant.
We can simply drop it:
```
----------------------------------------
Name: no overflow and not zero
%iszero = icmp ne i4 %y, 0
%umul = umul_overflow i4 %x, %y
%umul.ov = extractvalue {i4, i1} %umul, 1
%retval.0 = and i1 %iszero, %umul.ov
ret i1 %retval.0
=>
%iszero = icmp ne i4 %y, 0
%umul = umul_overflow i4 %x, %y
%umul.ov = extractvalue {i4, i1} %umul, 1
%retval.0 = and i1 %iszero, %umul.ov
ret %umul.ov
Roman Lebedev [Thu, 29 Aug 2019 12:47:34 +0000 (12:47 +0000)]
[SimplifyCFG] FoldTwoEntryPHINode(): don't bailout on i1 PHI's if we can hoist a 'not' from incoming values
Summary:
As it can be seen in the tests in D65143/D65144, even though we have formed an '@llvm.umul.with.overflow'
and got rid of potential for division-by-zero, the control flow remains, we still have that branch.
We have this condition:
```
// Don't fold i1 branches on PHIs which contain binary operators
// These can often be turned into switches and other things.
if (PN->getType()->isIntegerTy(1) &&
(isa<BinaryOperator>(PN->getIncomingValue(0)) ||
isa<BinaryOperator>(PN->getIncomingValue(1)) ||
isa<BinaryOperator>(IfCond)))
return false;
```
which was added back in rL121764 to help with `select` formation i think?
That check prevents us to flatten the CFG here, even though we know
we no longer need that guard and will be able to drop everything
but the '@llvm.umul.with.overflow' + `not`.
As it can be seen from tests, we end here because the `not` is being
sinked into the PHI's incoming values by InstCombine,
so we can't workaround this by hoisting it to after PHI.
Thus i suggest that we relax that check to not bailout if we'd get to hoist the `not`.
Roman Lebedev [Thu, 29 Aug 2019 12:47:20 +0000 (12:47 +0000)]
[InstCombine] Fold '((%x * %y) u/ %x) != %y' to '@llvm.umul.with.overflow' + overflow bit extraction
Summary:
`((%x * %y) u/ %x) != %y` is one of (3?) common ways to check that
some unsigned multiplication (will not) overflow.
Currently, we don't catch it. We could:
```
$ /repositories/alive2/build-Clang-unknown/alive -root-only ~/llvm-patch1.ll
Processing /home/lebedevri/llvm-patch1.ll..
As it can be observed from tests, while simply forming the `@llvm.umul.with.overflow`
is easy, if we were looking for the inverted answer, then more work needs to be done
to cleanup the now-pointless control-flow that was guarding against division-by-zero.
This is being addressed in follow-up patches.
Jeremy Morse [Thu, 29 Aug 2019 11:20:54 +0000 (11:20 +0000)]
[DebugInfo] LiveDebugValues: correctly discriminate kinds of variable locations
The missing line added by this patch ensures that only spilt variable
locations are candidates for being restored from the stack. Otherwise,
register or constant-value information can be interpreted as a spill
location, through a union.
The added regression test replicates a scenario where this occurs: the
stack load from [rsp] causes the register-location DBG_VALUE to be
"restored" to rsi, when it should be left alone. See PR43058 for details.
Un x-fail a test that was suffering from this from a previous patch.