Matthew Simpson [Fri, 7 Apr 2017 14:15:34 +0000 (14:15 +0000)]
Reapply r298620: [LV] Vectorize GEPs
This patch reapplies r298620. The original patch was reverted because of two
issues. First, the patch exposed a bug in InstCombine that caused the Chromium
builds to fail (PR32414). This issue was fixed in r299017. Second, the patch
introduced a bug in the vectorizer's scalars analysis that caused test suite
builds to fail on SystemZ. The scalars analysis was too aggressive and marked a
memory instruction scalar, even though it was going to be vectorized. This
issue has been fixed in the current patch and several new test cases for the
scalars analysis have been added.
Petar Jovanovic [Fri, 7 Apr 2017 13:31:36 +0000 (13:31 +0000)]
[mips][msa] Fix generation of bm(n)zi and bins[lr]i instructions
We have two cases here, the first one being the following instruction
selection from the builtin function:
bm(n)zi builtin -> vselect node -> bins[lr]i machine instruction
In case of bm(n)zi having an immediate which has either its high or low bits
set, a bins[lr] instruction can be selected through the selectVSplatMask[LR]
function. The function counts the number of bits set, and that value is
being passed to the bins[lr]i instruction as its immediate, which in turn
copies immediate modulo the size of the element in bits plus 1 as per specs,
where we get the off-by-one-error.
The other case is:
bins[lr]i -> vselect node -> bsel.v
In this case, a bsel.v instruction gets selected with a mask having one bit
less set than required.
- corrected DS_GWS_* opcodes (see VI_Shader_Programming#16.pdf for detailed description)
- address operand is not used
- several opcodes have data operand
- all opcodes have offset modifier
- DS_AND_SRC2_B32: corrected typo in mnemo
- DS_WRAP_RTN_F32 replaced with DS_WRAP_RTN_B32
- added CI/VI opcodes:
- DS_CONDXCHG32_RTN_B64
- DS_GWS_SEMA_RELEASE_ALL
- added VI opcodes:
- DS_CONSUME
- DS_APPEND
- DS_ORDERED_COUNT
Simon Dardis [Fri, 7 Apr 2017 13:03:52 +0000 (13:03 +0000)]
[SelectionDAG] Enable target specific vector scalarization of calls and returns
By target hookifying getRegisterType, getNumRegisters, getVectorBreakdown,
backends can request that LLVM to scalarize vector types for calls
and returns.
The MIPS vector ABI requires that vector arguments and returns are passed in
integer registers. With SelectionDAG's new hooks, the MIPS backend can now
handle LLVM-IR with vector types in calls and returns. E.g.
'call @foo(<4 x i32> %4)'.
Previously these cases would be scalarized for the MIPS O32/N32/N64 ABI for
calls and returns if vector types were not legal. If vector types were legal,
a single 128bit vector argument would be assigned to a single 32 bit / 64 bit
integer register.
By teaching the MIPS backend to inspect the original types, it can now
implement the MIPS vector ABI which requires a particular method of
scalarizing vectors.
Previously, the MIPS backend relied on clang to scalarize types such as "call
@foo(<4 x float> %a) into "call @foo(i32 inreg %1, i32 inreg %2, i32 inreg %3,
i32 inreg %4)".
This patch enables the MIPS backend to take either form for vector types.
Sam Kolton [Fri, 7 Apr 2017 10:53:12 +0000 (10:53 +0000)]
[AMDGPU] Move SiShrinkInstruction and SDWAPeephole to SSAOptimization passes
Summary:
Difference beetween PreRegAlloc() and MachineSSAOptimization() are that the former is run despite of -O0 optimization level. In my undestanding SiShrinkInstructions and SDWAPeephole shouldn't run when optimizations are disabled.
With this change order of passes will not change.
Legalize to a libcall.
On this occasion, also start allowing soft float subtargets. For the
moment G_FREM is the only legal floating point operation for them.
Daniel Berlin [Fri, 7 Apr 2017 01:28:36 +0000 (01:28 +0000)]
AliasAnalysis: Be less conservative about volatile than atomic.
Summary:
getModRefInfo is meant to answer the question "what impact does this
instruction have on a given memory location" (not even another
instruction).
Long debate on this on IRC comes to the conclusion the answer should be "nothing special".
That is, a noalias volatile store does not affect a memory location
just by being volatile. Note: DSE and GVN and memdep currently
believe this, because memdep just goes behind AA's back after it says
"modref" right now.
see line 635 of memdep. Prior to this patch we would get modref there, then check aliasing,
and if it said noalias, we would continue.
getModRefInfo *already* has this same AA check, it just wasn't being used because volatile was
lumped in with ordering.
(I am separately testing whether this code in memdep is now dead except for the invariant load case)
Allow specification of what kinds of class members to dump.
Previously when dumping class definitions, there were only
two modes - on or off. But it's useful to sometimes get a
little more fine-grained. For example, you might only want
to see the record layout (for example to look for extraneous
padding). This patch adds a third mode, layout mode, which
does exactly that. Only this-relative data members are
displayed in this mode.
[llvm-pdbdump] Allow pretty to only dump specific types of types.
Previously we just had the -types option, which would dump all
classes, typedefs, and enums. But this produces a lot of output
if you only want to view classes, for example. This patch breaks
this down into 3 additional options, -classes, -enums, and
-typedefs, and keeps the -types option around which implies all
3 more specific options.
This adjusts header file includes for headers and source files
in Core. In doing so, one dependency cycle is eliminated
because all the includes from Core to that project were dead
includes anyway. In places where some files in other projects
were only compiling due to a transitive include from another
header, fixups have been made so that those files also include
the header they need. Tested on Windows and Linux, and plan
to address failures on OSX and FreeBSD after watching the
bots.
[InstCombine] When checking to see if we can turn subtracts of 2^n - 1 into xor, we only need to call computeKnownBits on the RHS not the whole subtract. While there use isMask instead of isPowerOf2(C+1)
Calling computeKnownBits on the RHS should allows us to recurse one step further. isMask is equivalent to the isPowerOf2(C+1) except in the case where C is all ones. But that was already handled earlier by creating a not which is an Xor with all ones. So this should be fine.
[llvm-extract] Add option for recursive extraction
Summary:
Particularly, with --delete, this can be very useful for testing
new optimizations on some hotspots, without having to run it on the whole
application. E.g. as such:
```
llvm-extract app.bc --recursive --rfunc .*hotspot.* > hotspot.bc
llvm-extract app.bc --recursive --delete --rfunc .*hotspot.* > residual.bc
llc -filetype=obj residual.bc > residual.o
llc -filetype=obj hotspot.bc > hotspot.o
cc -o app residual.o hotspot.o
```
[InstCombine] Remove redundant combine from visitAnd
This combine is fully handled by SimplifyDemandedInstructionBits as of r299658 where I fixed this code to ensure the Add/Sub had only a single user. Otherwise it would fire and create additional instructions. That fix resulted in an improvement to code generated for tsan which is why I committed it before deleting.
[SelectionDAG] [ARM CodeGen] Fix chain information of LowerMUL
In LowerMUL, the chain information is not preserved for the new
created Load SDNode.
For example, if a Store alias with one of the operand of Mul.
The Load for that operand need to be scheduled before the Store.
The dependence is recorded in the chain of Store, in TokenFactor.
However, when lowering MUL, the SDNodes for the new Loads for
VMULL are not updated in the TokenFactor for the Store. Thus the
chain is not preserved for the lowered VMULL.
Mehdi Amini [Thu, 6 Apr 2017 20:09:31 +0000 (20:09 +0000)]
Turn some C-style vararg into variadic templates
Module::getOrInsertFunction is using C-style vararg instead of
variadic templates.
From a user prospective, it forces the use of an annoying nullptr
to mark the end of the vararg, and there's not type checking on the
arguments. The variadic template is an obvious solution to both
issues.
Use a combination of !associated, comdat, @llvm.compiler.used and
custom sections to allow dead stripping of globals and their asan
metadata. Sometimes.
Currently this works on LLD, which supports SHF_LINK_ORDER with
sh_link pointing to the associated section.
This also works on BFD, which seems to treat comdats as
all-or-nothing with respect to linker GC. There is a weird quirk
where the "first" global in each link is never GC-ed because of the
section symbols.
At this moment it does not work on Gold (as in the globals are never
stripped).
This is a re-land of r298158 rebased on D31358. This time,
asan.module_ctor is put in a comdat as well to avoid quadratic
behavior in Gold.
The only reason not to is global registration, which can be
TU-specific. This is not the case when there are no instrumented
globals. This is also limited to ELF targets, because MachO does
not have comdat, and COFF linkers may GC comdat constructors.
The benefit of this is a lot less __asan_init() calls: one per DSO
instead of one per TU. It's also necessary for the upcoming
gc-sections-for-globals change on Linux, where multiple references to
section start symbols trigger quadratic behaviour in gold linker.
Create the constructor in the module pass.
This in needed for the GC-friendly globals change, where the constructor can be
put in a comdat in some cases, but we don't know about that in the function
pass.
This is a rebase of r298731 which was reverted due to a false alarm.
Summary:
Prior to this while it would delete the dead DIGlobalVariables, it would
leave dead DICompileUnits and everything referenced therefrom. For a bit
bitcode file with thousands of compile units those dead nodes easily
outnumbered the real ones. Clean that up.
Yaxun Liu [Thu, 6 Apr 2017 19:17:32 +0000 (19:17 +0000)]
[AMDGPU] Temporarily change constant address space from 4 to 2
Our final address space mapping is to let constant address space to be 4 to match nvptx.
However for now we will make it 2 to avoid unnecessary work in FE/BE/devlib
about intrinsics returning constant pointers.
[InstSimplify] Remove unreachable default from SimplifyBinOp.
We have dedicated handlers for every opcode so nothing can get here anymore. The switch doesn't get detected as fully covered because Opcode is an unsigned. Casting to Instruction::BinaryOps still doesn't detect it because BinaryOpsEnd is in the enum and 1 past the last opcode.
Daniel Berlin [Thu, 6 Apr 2017 18:52:50 +0000 (18:52 +0000)]
NewGVN: This patch makes memory congruence work for all types of
memorydefs, not just stores. Along the way, we audit and fixup issues
about how we were tracking memory leaders, and improve the verifier
to notice more memory congruency issues.
[AMDGPU] Eliminate barrier if workgroup size is not greater than wavefront size
If a workgroup size is known to be not greater than wavefront size
the s_barrier instruction is not needed since all threads are guarantied
to come to the same point at the same time.
Jonas Paulsson [Thu, 6 Apr 2017 13:00:37 +0000 (13:00 +0000)]
[SelectionDAG] NFC patch removing a redundant check.
Since the BUILD_VECTOR has already been checked by
isBuildVectorOfConstantSDNodes() in SelectionDAG::getNode() for a
SIGN_EXTEND_INREG, it can be assumed that Op is always either undef or a
ConstantSDNode, and Ops.size() will always equal VT.getVectorNumElements().
[lit] Implement timeouts and max_time for process pool testing
This is necessary to pass the lit test suite at llvm/utils/lit/tests.
There are some pre-existing failures here, but now switching to pools
doesn't regress any tests.
I had to change test-data/lit.cfg to import DummyConfig from a module to
fix pickling problems, but I think it'll be OK if we require test
formats to be written in real .py modules outside lit.cfg files.
I also discovered that in some circumstances AsyncResult.wait() will not
raise KeyboardInterrupt in a timely manner, but you can pass a non-zero
timeout to work around this. This makes threading.Condition.wait use a
polling loop that runs through the interpreter, so it's capable of
asynchronously raising KeyboardInterrupt.
Bryant Wong [Wed, 5 Apr 2017 22:23:48 +0000 (22:23 +0000)]
[Bugpoint] Use `unique_ptr` correctly.
Moving Modules into `testMergedProgram` is incorrect (and causes segmentation
faults) since all callers expect to retain ownership. This is evidenced by the
later calls to `unique_ptr<Module>::get` in the same function.
Summary:
LSV wants to know the maximum size that can be loaded to a vector register.
On X86, this always matches the maximum register width. Implement this
accordingly and add a test to make sure that LSV can vectorize up to the
maximum permissible width on X86.
Daniel Berlin [Wed, 5 Apr 2017 19:01:58 +0000 (19:01 +0000)]
MemorySSA: Remove MemorySSA walker caching.
Summary:
Remove all the caching the clobber walker does, and that the
caching walker does. With the patch to enable storing clobbering
access results for stores, i can find no improvement with the cache
turned on (and a number of degradations, both time and memory, from
the cost of caching. For a large program i have, we do millions of
lookups and inserts with zero hits).
I haven't tried to rename or simplify the walker otherwise yet.
(Appreciate some perf testing on this past my own testing)
Petr Hosek [Wed, 5 Apr 2017 18:55:50 +0000 (18:55 +0000)]
[llvm-readobj] Only print the real size of the note
Note payloads are padded to a multiple of 4 bytes in size, but the size
of the string that should be print can be smaller e.g. the n_descsz
field in gold's version note is 9, so that's the whole size of the
string that should be printed. The padding is part of the format of a
SHT_NOTE section or PT_NOTE segment, but it's not part of the note
itself.
Printing the extra null bytes may confuse some tools, e.g. when the
llvm-readobj is sent to grep, it treats the output as binary because
it contains a null byte.
[InstCombine] fix formatting and variable names; NFCI
There must be some opportunity to refactor big chunks of nearly duplicated code in FoldOrOfICmps / FoldAndOfICmps.
Also, none of this works with vectors, but it should.
Added support of the following instructions:
- s_cbranch_cdbgsys
- s_cbranch_cdbgsys_and_user
- s_cbranch_cdbgsys_or_user
- s_cbranch_cdbguser
- s_setkill
[lit] Use process pools for test execution by default
Summary:
This drastically reduces lit test execution startup time on Windows. Our
previous strategy was to manually create one Process per job and manage
the worker pool ourselves. Instead, let's use the worker pool provided
by multiprocessing. multiprocessing.Pool(jobs) returns almost
immediately, and initializes the appropriate number of workers, so they
can all start executing tests immediately. This avoids the ramp-up
period that the old implementation suffers from. This appears to speed
up small test runs.
Here are some timings of the llvm-readobj tests on Windows using the
various execution strategies:
# multiprocessing.Pool:
$ for i in `seq 1 3`; do tim python ./bin/llvm-lit.py -sv ../llvm/test/tools/llvm-readobj/ --use-process-pool |& grep real: ; done
real: 0m1.156s
real: 0m1.078s
real: 0m1.094s
# multiprocessing.Process:
$ for i in `seq 1 3`; do tim python ./bin/llvm-lit.py -sv ../llvm/test/tools/llvm-readobj/ --use-processes |& grep real: ; done
real: 0m6.062s
real: 0m5.860s
real: 0m5.984s
# threading.Thread:
$ for i in `seq 1 3`; do tim python ./bin/llvm-lit.py -sv ../llvm/test/tools/llvm-readobj/ --use-threads |& grep real: ; done
real: 0m9.438s
real: 0m10.765s
real: 0m11.079s
I kept the old code to launch processes in case this change doesn't work
on all platforms that LLVM supports, but at some point I would like to
remove both the threading and old multiprocessing execution strategies.
[ARM] Try to re-enable MachineBranchProb.ll for ARM/AArch64
Commit r298799 changed code that made the XFAIL on MachineBranchProb.ll
irrelevant, but some configurations still failed. I can't reproduce it
locally, so I'm hoping that enabling this will tell me if some
configurations will really fail or if they were just too slow.