Alex Bradbury [Sat, 30 Mar 2019 17:59:30 +0000 (17:59 +0000)]
[RISCV] Add codegen support for ilp32f, ilp32d, lp64f, and lp64d ("hard float") ABIs
This patch adds support for the RISC-V hard float ABIs, building on top of
rL355771, which added basic target-abi parsing and MC layer support. It also
builds on some re-organisations and expansion of the upstream ABI and calling
convention tests which were recently committed directly upstream.
A number of aspects of the RISC-V float hard float ABIs require frontend
support (e.g. flattening of structs and passing int+fp for fp+fp structs in a
pair of registers), and will be addressed in a Clang patch.
As can be seen from the tests, it would be worthwhile extending
RISCVMergeBaseOffsets to handle constant pool as well as global accesses.
Alex Bradbury [Sat, 30 Mar 2019 15:53:38 +0000 (15:53 +0000)]
[RISCV] Add RV64 CHECK lines to test/CodeGen/RISCV/vararg.ll and prepare for hard float tests
vararg.ll previously missed RV64 tests. This patch also prepares for using
vararg.ll to test handling of varargs for the ilp32f/ilp32d/lp64f/lp64d hard
float ABIs. In these ABIs, varargs are passed as in either the ilp32 or lp64
ABI. Due to some slight codegen differences, different check lines are needed
for when RV32D is enabled.
Fangrui Song [Sat, 30 Mar 2019 14:08:59 +0000 (14:08 +0000)]
[llvm-objcopy] Replace the size() helper with SectionTableRef::size
Summary:
BTW, STLExtras.h provides llvm::size() which is similar to std::size()
for random access iterators. However, if we prefer qualified
llvm::size(), the member function .size() will be more convenient.
Heejin Ahn [Sat, 30 Mar 2019 11:04:48 +0000 (11:04 +0000)]
[WebAssembly] Fix unwind destination mismatches in CFG stackify
Summary:
Linearing the control flow by placing `try`/`end_try` markers can create
mismatches in unwind destinations. This patch resolves these mismatches
by wrapping those instructions with an incorrect unwind destination with
a nested `try`/`catch`/`end_try` and branching to the right destination
within the new catch block.
Heejin Ahn [Sat, 30 Mar 2019 09:29:57 +0000 (09:29 +0000)]
[WebAssembly] Run ExplicitLocals pass after CFGStackify
Summary:
While this does not change any final output, this will greatly simplify
ixing unwind destination mismatches in CFGStackify (D48345), because we
have to create some new registers there.
Alex Bradbury [Sat, 30 Mar 2019 09:15:47 +0000 (09:15 +0000)]
[RISCV] Add DAGCombine for (SplitF64 (ConstantFP x))
The SplitF64 node is used on RV32D to convert an f64 directly to a pair of i32
(necessary as bitcasting to i64 isn't legal). When performed on a ConstantFP,
this will result in a FP load from the constant pool followed by a store to
the stack and two integer loads from the stack (necessary as there is no way
to directly move between f64 FPRs and i32 GPRs on RV32D). It's always cheaper
to just materialise integers for the lo and hi parts of the FP constant, so do
that instead.
Anton Afanasyev [Sat, 30 Mar 2019 08:42:48 +0000 (08:42 +0000)]
Adds `-ftime-trace` option to clang that produces Chrome `chrome://tracing` compatible JSON profiling output dumps.
This change adds hierarchical "time trace" profiling blocks that can be visualized in Chrome, in a "flame chart" style. Each profiling block can have a "detail" string that for example indicates the file being processed, template name being instantiated, function being optimized etc.
This is taken from GitHub PR: https://github.com/aras-p/llvm-project-20170507/pull/2
Alex Bradbury [Sat, 30 Mar 2019 05:24:42 +0000 (05:24 +0000)]
[RISCV][NFC] Remove floating point operations from test/CodeGen/RISCV/vararg.ll
This minimises differences in output when compiling with hardware floating
point support, which will be done in a future patch (to demonstrate the same
vararg calling convention is used).
Heejin Ahn [Sat, 30 Mar 2019 01:31:11 +0000 (01:31 +0000)]
[WebAssembly] Optimize the number of routing blocks in FixIrreducibleCFG
Summary:
Currently we create a routing block to the dispatch block for every
predecessor of every entry. So the total number of routing blocks
created will be (# of preds) * (# of entries). But we don't need to do
this: we need at most 2 routing blocks per loop entry, one for when the
predecessor is inside the loop and one for it is outside the loop. (We
can't merge these into one because this will creates another loop cycle
between blocks inside and blocks outside) This patch fixes this and
creates at most 2 routing blocks per entry.
This also renames variable `Split` to `Routing`, which I think is a bit
clearer.
Hubert Tong [Fri, 29 Mar 2019 23:32:47 +0000 (23:32 +0000)]
[Support] Implement is_local_impl with AIX mntctl
Summary:
On AIX, we can determine whether a filesystem is remote using `mntctl`.
If the information is not found, then claim that the file is remote
(since that is the more restrictive case). Testing for the associated
interface is restored with a modified version of the unit test from
rL295768.
Thomas Lively [Fri, 29 Mar 2019 22:00:18 +0000 (22:00 +0000)]
[WebAssembly] Add mutable globals feature
Summary:
This feature is not actually used for anything in the WebAssembly
backend, but adding it allows users to get it into the target features
sections of their objects, which makes these objects
future-compatible.
Amara Emerson [Fri, 29 Mar 2019 21:30:51 +0000 (21:30 +0000)]
[X86] When using Win64 ABI, exit with error if SSE is disabled for varargs
We need XMM registers to handle varargs with the Win64 ABI. Before we would
silently generate bad code resulting in an assertion failure elsewhere in the
backend.
Matt Arsenault [Fri, 29 Mar 2019 19:14:54 +0000 (19:14 +0000)]
AMDGPU: Remove dx10-clamp from subtarget features
Since this can be set with s_setreg*, it should not be a subtarget
property. Set a default based on the calling convention, and Introduce
a new amdgpu-dx10-clamp attribute to override this if desired.
Also introduce a new amdgpu-ieee attribute to match.
The values need to match to allow inlining. I think it is OK for the
caller's dx10-clamp attribute to override the callee, but there
doesn't appear to be the infrastructure to do this currently without
definining the attribute in the generic Attributes.td.
Eventually the calling convention lowering will need to insert a mode
switch somewhere for these.
Nirav Dave [Fri, 29 Mar 2019 17:35:56 +0000 (17:35 +0000)]
[DAGCombine] Prune unnused nodes.
Summary:
Nodes that have no uses are eventually pruned when they are selected
from the worklist. Record nodes newly added to the worklist or DAG and
perform pruning after every combine attempt.
Nirav Dave [Fri, 29 Mar 2019 17:26:40 +0000 (17:26 +0000)]
[DAG] Set up infrastructure to avoid smart constructor-based dangling nodes
Summary:
Various SelectionDAG non-combine operations (e.g. the getNode smart
constructor and legalization) may leave dangling nodes by applying
optimizations without fully pruning unused result values. This results
in nodes that are never added to the worklist and therefore can not be
pruned.
Add a node inserter for the combiner to make sure such nodes have the
chance of being pruned. This allows a number of additional peephole
optimizations.
Sanjay Patel [Fri, 29 Mar 2019 16:49:38 +0000 (16:49 +0000)]
[InstCombine] move shuffle canonicalizations before other transforms
This may not be NFC, but I'm not sure how to expose any diffs in
tests. In theory, it should be slightly more efficient and possibly
more profitable to do the canonicalizations (which can increase the
undef elements in the mask) ahead of SimplifyDemandedVectorElts().
Simon Pilgrim [Fri, 29 Mar 2019 15:28:25 +0000 (15:28 +0000)]
[SLP] Add support for commutative icmp/fcmp predicates
For the cases where the icmp/fcmp predicate is commutative, use reorderInputsAccordingToOpcode to collect and commute the operands.
This requires a helper to recognise commutativity in both general Instruction and CmpInstr types - the CmpInst::isCommutative doesn't overload the Instruction::isCommutative method for reasons I'm not clear on (maybe because its based on predicate not opcode?!?).
Simon Atanasyan [Fri, 29 Mar 2019 15:15:22 +0000 (15:15 +0000)]
[mips] Fix lowering a signed immediate for *.d MSA instructions
The `lowerMSASplatImm` function zero-extends `i32` immediates while
building constant. If target type is `i64`, negative immediate loses
the sign. As a result, for example `__builtin_msa_ldi_d(-1)` lowered
to series of instruction loads incorrect value 0xffffffff to the `$w0`
register instead of single `ldi.d $w0, -1` instruction.
The fix zero-extends unsigned immediates and signed-extend signed
immediates.
Summary:
`ResolvedSchedClass` will need to be used outside of `Analysis`
(before `InstructionBenchmarkClustering` even), therefore promote
it into a non-private top-level class, and while there also
move all of the functions that are only called by `ResolvedSchedClass`
into that same new file.
Sanjay Patel [Fri, 29 Mar 2019 14:20:38 +0000 (14:20 +0000)]
[DAGCombiner] simplify shuffle of shuffle
After investigating the examples from D59777 targeting an SSE4.1 machine,
it looks like a very different problem due to how we map illegal types (256-bit in these cases).
We're missing a shuffle simplification that maps elements of a vector back to a shuffled operand.
We have a more general version of this transform in DAGCombiner::visitVECTOR_SHUFFLE(), but that
generality means it is limited to patterns with a one-use constraint, and the examples here have
2 uses. We don't need any uses or legality limitations for a simplification (no new value is
created).
It looks like we miss this pattern in IR too.
In one of the zext examples here, we have shuffle masks like this:
Hans Wennborg [Fri, 29 Mar 2019 13:40:05 +0000 (13:40 +0000)]
Switch lowering: exploit unreachable fall-through when lowering case range cluster
In the example below, we would previously emit two range checks, one for cases
1--3 and one for 4--6. This patch makes us exploit the fact that the
fall-through is unreachable and only one range check is necessary.
Andrea Di Biagio [Fri, 29 Mar 2019 12:15:37 +0000 (12:15 +0000)]
[MCA] Add an experimental MicroOpQueue stage.
This patch adds an experimental stage named MicroOpQueueStage.
MicroOpQueueStage can be used to simulate a hardware micro-op queue (basically,
a decoupling queue between 'decode' and 'dispatch'). Users can specify a queue
size, as well as a optional MaxIPC (which - in the absence of a "Decoders" stage
- can be used to simulate a different throughput from the decoders).
This stage is added to the default pipeline between the EntryStage and the
DispatchStage only if PipelineOption::MicroOpQueue is different than zero. By
default, llvm-mca sets PipelineOption::MicroOpQueue to the value of hidden flag
-micro-op-queue-size.
Throughput from the decoder can be simulated via another hidden flag named
-decoder-throughput. That flag allows us to quickly experiment with different
frontend throughputs. For targets that declare a loop buffer, flag
-decoder-throughput allows users to do multiple runs, each time simulating a
different throughput from the decoders.
This stage can/will be extended in future. For example, we could add a "buffer
full" event to notify bottlenecks caused by backpressure. flag
-decoder-throughput would probably go away if in future we delegate to another
stage (DecoderStage?) the simulation of a (potentially variable) throughput from
the decoders. For now, flag -decoder-throughput is "good enough" to run some
simple experiments.
James Henderson [Fri, 29 Mar 2019 11:47:19 +0000 (11:47 +0000)]
[llvm-readelf]Merge dynamic and static relocation printing to avoid code duplication
The majority of the printRelocation and printDynamicRelocation functions
were identical. This patch factors this all out into a new function.
There are a couple of minor differences to do with printing of symbols
without names, but I think these are harmless, and in some cases a small
improvement.
Summary:
The diff looks scary but it really isn't:
1. I moved the check for the number of measurements into `SchedClassClusterCentroid::validate()`
2. While there, added a check that we can only have a single inverse throughput measurement. I missed that when adding it initially.
3. In `Analysis::SchedClassCluster::measurementsMatch()` is called with the current LLVM values from schedule class and the values from Centroid.
3.1. The values from centroid we can already get from `SchedClassClusterCentroid::getAsPoint()`.
This isn't 100% a NFC, because previously for inverse throughput we used `min()`. I have asked whether i have done that correctly in
https://reviews.llvm.org/D57647?id=184939#inline-510384 but did not hear back. I think `avg()` should be used too, thus it is a fix.
3.2. Finally, refactor the computation of the LLVM-specified values into `Analysis::SchedClassCluster::getSchedClassPoint()`
I will need that function for [[ https://bugs.llvm.org/show_bug.cgi?id=41275 | PR41275 ]]
Kang Zhang [Fri, 29 Mar 2019 08:45:24 +0000 (08:45 +0000)]
[PowerPC] Add the support for __builtin_setrnd()
Summary:
PowerPC64/PowerPC64le supports the builtin function __builtin_setrnd to set the floating point rounding mode. This function will use the least significant two bits of integer argument to set the floating point rounding mode.
double __builtin_setrnd(int mode);
The effective values for mode are:
0 - round to nearest
1 - round to zero
2 - round to +infinity
3 - round to -infinity
Note that the mode argument will modulo 4, so if the int argument is greater than 3, it will only use the least significant two bits of the mode. Namely, builtin_setrnd(102)) is equal to builtin_setrnd(2).
Matt Arsenault [Fri, 29 Mar 2019 03:54:56 +0000 (03:54 +0000)]
AMDGPU/GlobalISel: Insert waterfall loop for vector indexing
The register index can only really be an SGPR. Lie that a VGPR index
is legal, and then rewrite the instruction in a waterfall loop to
handle the index.
Zi Xuan Wu [Fri, 29 Mar 2019 03:08:39 +0000 (03:08 +0000)]
[PowerPC] Strength reduction of multiply by a constant by shift and add/sub in place
A shift and add/sub sequence combination is faster in place of a multiply by constant.
Because the cycle or latency of multiply is not huge, we only consider such following
worthy patterns.
And the cycles or latency is subtarget-dependent so that we need consider the
subtarget to determine to do or not do such transformation.
Also data type is considered for different cycles or latency to do multiply.
Thomas Lively [Fri, 29 Mar 2019 00:14:01 +0000 (00:14 +0000)]
[WebAssembly] Merge used feature sets, update atomics linkage policy
Summary:
It does not currently make sense to use WebAssembly features in some functions
but not others, so this CL adds an IR pass that takes the union of all used
feature sets and applies it to each function in the module. This allows us to
prevent atomics from being lowered away if some function has opted in to using
them. When atomics is not enabled anywhere, we detect whether there exists any
atomic operations or thread local storage that would be stripped and disallow
linking with objects that contain atomics if and only if atomics or tls are
stripped. When atomics is enabled, mark it as used but do not require it of
other objects in the link. These changes allow libraries that do not use atomics
to be built once and linked into both single-threaded and multithreaded
binaries.
Puyan Lotfi [Thu, 28 Mar 2019 22:55:08 +0000 (22:55 +0000)]
[yaml2obj] Fixing opening empty yaml files.
Essentially echo "" | yaml2obj crashes. This patch attempts to trim whitespace
and determine if the yaml string in the file is empty or not. If the input is
empty then it will not properly print out an error message and return an error
code.
Rumeet Dhindsa [Thu, 28 Mar 2019 22:26:51 +0000 (22:26 +0000)]
Update lit config for ld.lld command to match "ld\.lld" instead of trying to match respective regex. (It was able to work with ld-lld and ld1lld as well)
Yonghong Song [Thu, 28 Mar 2019 21:59:49 +0000 (21:59 +0000)]
[BPF] add proper multi-dimensional array support
For multi-dimensional array like below
int a[2][3];
the previous implementation generates BTF_KIND_ARRAY type
like below:
. element_type: int
. index_type: unsigned int
. number of elements: 6
This is not the best way to represent arrays, esp.,
when converting BTF back to headers and users will see
int a[6];
instead.
This patch generates proper support for multi-dimensional arrays.
For "int a[2][3]", the two BTF_KIND_ARRAY types will be
generated:
Type #n:
. element_type: int
. index_type: unsigned int
. number of elements: 3
Type #(n+1):
. element_type: #n
. index_type: unsigned int
. number of elements: 2
The linux kernel already supports such a multi-dimensional
array representation properly.
Signed-off-by: Yonghong Song <yhs@fb.com>
Differential Revision: https://reviews.llvm.org/D59943
Eli Friedman [Thu, 28 Mar 2019 21:12:28 +0000 (21:12 +0000)]
[MC] Fix floating-point literal lexing.
This patch has three related fixes to improve float literal lexing:
1. Make AsmLexer::LexDigit handle floats without a decimal point more
consistently.
2. Make AsmLexer::LexFloatLiteral print an error for floats which are
apparently missing an "e".
3. Make APFloat::convertFromString use binutils-compatible exponent
parsing.
Together, this fixes some cases where a float would be incorrectly
rejected, fixes some cases where the compiler would crash, and improves
diagnostics in some cases.
Eli Friedman [Thu, 28 Mar 2019 20:44:50 +0000 (20:44 +0000)]
[InterleavedAccessPass] Don't increase the number of bytes loaded.
Even if the interleaving transform would otherwise be legal, we shouldn't
introduce an interleaved load that is wider than the original load: it might
have undefined behavior.
It might be possible to perform some sort of mask-narrowing transform in
some cases (using a narrower interleaved load, then extending the
results using shufflevectors). But I haven't tried to implement that,
at least for now.
Florian Hahn [Thu, 28 Mar 2019 20:02:33 +0000 (20:02 +0000)]
[DSE] Preserve basic block ordering using OrderedBasicBlock.
By extending OrderedBB to allow removing and replacing cached
instructions, we can preserve OrderedBBs in DSE easily. This eliminates
one source of quadratic compile time in DSE.