[Sparc] Fix incorrect MI insertion position for spilling f128.
Summary:
Obviously, the newly built MIs (sethi+add or sethi+xor+add) that construct the large offset
should be inserted before the newly created MI that stores the even register into memory.
So the insertion position should be *StMI instead of II.
Before the fix:
std %f0, [%g1+80]
sethi 4, %g1 <<<
add %g1, %sp, %g1 <<< these two instructions should be placed before "std %f0, [%g1+80]".
sethi 4, %g1
add %g1, %sp, %g1
std %f2, [%g1+88]
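For comparison, the intended ordering after the fix (reconstructed from the listing above, not verified compiler output):
sethi 4, %g1
add %g1, %sp, %g1
std %f0, [%g1+80]
sethi 4, %g1
add %g1, %sp, %g1
std %f2, [%g1+88]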
On PowerPC64 ELFv2 ABI, functions may have 2 entry points: global and local.
The local entry point location of a function is stored in the st_other field of the symbol, as an offset relative to the global entry point.
In order to make symbol assignments (e.g. .equ/.set) work properly with this, PPCTargetELFStreamer already copies the local entry bits from the source symbol to the destination one in emitAssignment(). The problem is that this copy is performed only at the assignment location, where the source symbol may not yet have processed the .localentry directive that sets its local entry. This may cause the destination symbol to end up with wrong local entry information. Other symbol info is not affected, because in this case the destination symbol value is actually a symbol reference.
This change keeps track of these assignments and updates all needed st_other fields when finish() is called.
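A minimal sketch of the approach (the UpdateOther map and copyLocalEntry helper are assumed names, not necessarily the patch's):
```cpp
// Sketch: defer the st_other copy until all directives have been parsed.
void PPCTargetELFStreamer::emitAssignment(MCSymbol *S, const MCExpr *Value) {
  auto *Ref = dyn_cast<MCSymbolRefExpr>(Value);
  if (!Ref)
    return;
  // The source's .localentry may not have been seen yet, so only record
  // the pair here instead of copying the local entry bits immediately.
  UpdateOther[S] = cast<MCSymbolELF>(&Ref->getSymbol());
}

void PPCTargetELFStreamer::finish() {
  // By now every .localentry has been processed; propagate st_other.
  for (auto &P : UpdateOther)
    copyLocalEntry(cast<MCSymbolELF>(P.first), P.second);
}
```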
[MC][ELF] Copy top 3 bits of st_other to .symver aliases
On PowerPC64 ELFv2 ABI, the top 3 bits of st_other encode the local
entry offset. A versioned symbol alias created by .symver should copy
the bits from the source symbol.
This partly fixes PR41048. A full fix needs tracking of .set assignments
and updating st_other fields when finish() is called, see D56586.
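Roughly, the copy amounts to the following sketch (STO_PPC64_LOCAL_MASK covers the top 3 bits of st_other; the MCSymbolELF accessors are assumed to be usable at this point):
```cpp
// Sketch: give the .symver alias the aliasee's local entry bits.
unsigned Other = Alias->getOther() & ~ELF::STO_PPC64_LOCAL_MASK;
Other |= Aliasee->getOther() & ELF::STO_PPC64_LOCAL_MASK;
Alias->setOther(Other);
```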
The VOP3 form should always be the preferred selection, to be shrunk
later. This should only be an optimization issue, but this partially
works around a problem from clobbering VCC when SIFixSGPRCopies
rewrites an SCC defining operation directly to VCC.
3 of the testcases are regressions from failing to fold the immediate
in cases where it should. These can be avoided by improving the VCC liveness
handling in SIFoldOperands. Simply increasing the threshold to
computeRegisterLiveness works, although this is common enough that VCC
liveness should probably be tracked throughout the pass. The hack of
leaving behind an implicit_def instruction to avoid breaking iterators
wastes instruction count, which inhibits finding the VCC def in long
chains of adds. Doing this, however, exposes different, worse-looking
regressions from poor scheduling behavior. These could probably be
worked around by forcing the shrink of the addc here, but the
scheduler should probably be fixed.
The r600 add test needs to be split out because it asserts on the
arguments in the new test during the calling convention lowering.
------------------------------------------------------------------------
The VOP3 form should always be the preferred selection form to be
shrunk later.
The r600 sub test needs to be split out because it asserts on the
arguments in the new test during the calling convention lowering.
------------------------------------------------------------------------
AMDGPU: Replace shrunk instruction with dummy implicit_def
This was broken if the original operand was killed. The kill flag
would appear on both instructions, and fail the verifier. Keep the
kill flag, but remove the operands from the old instruction. This has
an added benefit of really reducing the use count for future folds.
Ideally the pass would be structured more like what PeepholeOptimizer
does, making this hack to avoid breaking instruction iterators unnecessary.
------------------------------------------------------------------------
AMDGPU: Fix incorrect commute with sub when folding immediates
When a fold of an immediate into a sub/subrev required shrinking the
instruction, the wrong VOP2 opcode was used. This was using the VOP2
equivalent of the original instruction, not the commuted instruction
with the inverted opcode.
------------------------------------------------------------------------
[PPC64] Update tests to reflect change in printing of call operand. [NFC]
The printing of branch operands for call instructions was changed to properly
handle negative offsets. Updating the tests to reflect that.
------------------------------------------------------------------------
Fix undefined behaviour in PPCInstPrinter::printBranchOperand.
Fix the undefined behaviour introduced by my previous patch r353865 (left
shifting a potentially negative value), which was caught by the bots that run
UBSan.
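The generic shape of such a fix (an illustration, not the actual patch) is to shift on the unsigned representation:
```cpp
#include <cstdint>

// UB: left-shifting a negative signed value is undefined in C++.
int64_t scaleOffsetUB(int64_t Imm) { return Imm << 2; }

// Well-defined: shift in the unsigned domain, then convert back.
int64_t scaleOffset(int64_t Imm) {
  return static_cast<int64_t>(static_cast<uint64_t>(Imm) << 2);
}
```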
------------------------------------------------------------------------
[AVR] Expand 8/16-bit multiplication to libcalls on MCUs that don't have hardware MUL
This change modifies the LLVM ISel lowering settings so that
8-bit/16-bit multiplication is expanded to calls into the compiler
runtime library if the MCU being targeted does not support
multiplication in hardware.
Before this, MUL instructions would be generated on CPUs like the
ATtiny85, triggering a CPU reset due to an illegal instruction at
runtime.
First raised in https://github.com/avr-rust/rust/issues/124.
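Conceptually the change looks like this (a sketch; the subtarget hook is assumed, and the libcall names are the usual avr-gcc ones):
```cpp
// In the AVR lowering setup: with no hardware multiplier, let
// legalization emit libcalls (__mulqi3/__mulhi3) instead of ISD::MUL.
if (!Subtarget.supportsMultiplication()) {
  setOperationAction(ISD::MUL, MVT::i8, LibCall);
  setOperationAction(ISD::MUL, MVT::i16, LibCall);
}
```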
------------------------------------------------------------------------
[WebAssembly] Properly align fp128 arguments in outgoing varargs arguments
For outgoing varargs arguments, it's necessary to check the OrigAlign field
of the corresponding OutputArg entry to determine argument alignment, rather
than just computing an alignment from the argument value type. This is
because types like fp128 are split into multiple argument values, with
narrower types that don't reflect the ABI alignment of the full fp128.
This fixes the printf("printfL: %4.*Lf\n", 2, lval); testcase.
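In outline, the fix consults the flags recorded when the argument was split (a sketch of the relevant loop, with assumed surrounding code):
```cpp
// Sketch: choose the alignment for an outgoing vararg slot.
for (const ISD::OutputArg &Out : Outs) {
  // An alignment derived from Out.VT would be wrong here: fp128 is
  // split into narrower pieces whose types under-report the ABI
  // alignment. OrigAlign preserves the original argument's alignment.
  unsigned Alignment = Out.Flags.getOrigAlign();
  (void)Alignment; // used when laying out the outgoing stack slot
}
```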
This is required for using PGO on Windows but isn't in the Windows
release packages. The Windows packages are built with
LLVM_INSTALL_TOOLCHAIN_ONLY, so they include only the LLVM "tools" listed
here.
This generally follows what other targets do. I don't completely
understand why the special case for tail calls existed in the first
place; even when the code was committed in r105413, call lowering didn't
work in the way described in the comments.
Stack protector lowering breaks if the register copies are not glued to
a tail call: we have to insert the stack protector check before the tail
call, and we choose the location based on the assumption that all
physical register dependencies of a tail call are adjacent to the tail
call. (See FindSplitPointForStackProtector.) This is sort of fragile,
but I don't see any reason to break that assumption.
I'm guessing nobody has seen this before just because it's hard to
convince the scheduler to actually schedule the code in a way that
breaks; even without the glue, the only computation that could actually
be scheduled after the register copies is the computation of the call
address, and the scheduler usually prefers to schedule that before the
copies anyway.
[llvm-dlltool] Set a proper machine type for weak symbol object files
This makes GNU binutils not reject the libraries outright.
GNU ld handles weak externals slightly differently though, so it
can't use them for aliases in import libraries, but this makes GNU
ld able to use the rest of the import libraries.
LLD, by contrast, accepted object files with machine type 0, aka
IMAGE_FILE_MACHINE_UNKNOWN.
[llvm] [cmake] Add additional headers only if they exist
Modify the add_header_files_for_glob() function to only add files
that do exist, rather than all matches of the glob. This fixes a CMake
error when one of the include directories (which happen to include
/usr/include) contains broken symlinks.
Change the format type of *Personality and *LSDAAddress to PRIx64 since
they are of type uint64_t.
The problem was detected on mips builds, where it was printing junk values
and causing test failures.
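For reference, the portable printf idiom for uint64_t (a generic example, not the patched code):
```cpp
#include <cinttypes>
#include <cstdio>

int main() {
  uint64_t LSDAAddress = 0xdeadbeefULL;
  // "%x" expects unsigned int and prints junk for a 64-bit argument on
  // targets like mips; PRIx64 expands to the right conversion specifier.
  std::printf("LSDAAddress: 0x%016" PRIx64 "\n", LSDAAddress);
  return 0;
}
```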
[MIPS][microMIPS] Fix PseudoMTLOHI_MM matching and expansion
On microMIPS, MipsMTLOHI is always matched to PseudoMTLOHI_DSP regardless
of the +dsp argument. This patch checks whether the HasDSP predicate is
present for PseudoMTLOHI_DSP so that PseudoMTLOHI_MM can be matched when
appropriate.
Add expansion of PseudoMTLOHI_MM instruction into a mtlo/mthi pair.
This change fixes a crash on an assertion when using the
`soft float` ABI for the mips32r6 target.
------------------------------------------------------------------------
[Mips] Fix missing masking in fast-isel of br (PR40325)
Fixes https://bugs.llvm.org/show_bug.cgi?id=40325 by zero-extending the
condition (and x, 1) before branching on it.
To avoid regressing trivial cases, I'm combining emission of cmp+br
sequences for the single-use + same block case (similar to what we
do in x86). icmpbr1.ll still regresses due to the cross-bb usage
of the condition.
[mips][micromips] fix filling delay slots for PseudoIndirectBranch_MM
Filling a delay slot of a 32-bit jump instruction with a 16-bit instruction
can cause issues. According to the documentation, such an operation is
unpredictable.
This patch adds the opcode Mips::PseudoIndirectBranch_MM alongside
Mips::PseudoIndirectBranch and other instructions that are expanded to the jr
instruction and do not allow a 16-bit instruction in their delay slots.
[RegAlloc] Avoid compile time regression with multiple copy hints.
As a fix for https://bugs.llvm.org/show_bug.cgi?id=40986 ("excessive compile
time building opencollada"), this patch makes sure that no phys reg is hinted
more than once from getRegAllocationHints().
This handles the case where many virtual registers are assigned to the same
physreg. The previous compile-time fix (r343686) in weightCalcHelper() only
made sure that physical/virtual registers are passed no more than once to
addRegAllocationHint().
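The dedup itself is simple (a sketch with assumed container and variable names):
```cpp
// Sketch: emit each candidate physreg hint at most once.
SmallSet<unsigned, 4> HintedRegs;
for (unsigned PhysReg : CopyHints)
  if (HintedRegs.insert(PhysReg).second) // first time seeing this physreg
    Hints.push_back(PhysReg);
```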
This demonstrates dead store elimination removing a store that may alias a gather that uses null as its base.
------------------------------------------------------------------------
[X86] Remove IntrArgMemOnly from target specific gather/scatter intrinsics
IntrArgMemOnly implies that only memory pointed to by pointer-typed arguments will be accessed. But these intrinsics allow you to pass null to the pointer argument and put the full address into the index argument. Other passes won't be able to understand this.
A colleague found that ISPC was creating gathers like this and then dead store elimination removed some stores because it didn't understand what the gather was doing since the pointer argument was null.
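The problematic pattern looks roughly like this at the source level (a hypothetical reconstruction using an AVX2 gather intrinsic):
```cpp
#include <immintrin.h>

// Full 64-bit addresses travel in the index vector; the pointer argument
// is null, so reasoning based on IntrArgMemOnly sees no memory access.
__m256d gather_absolute(__m256i Addrs) {
  return _mm256_i64gather_pd(nullptr, Addrs, /*scale=*/1);
}
```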
In certain cases, the first non-frame-setup instruction in a function is
a branch. For example, it could be a cbz on an argument. Make sure we
correctly allocate the UnwindHelp, and find an appropriate register to
use to initialize it.
[COFF, ARM64] Don't put jump table into a separate COFF section for EK_LabelDifference32
Windows ARM64 uses the PIC relocation model and jump table kind
EK_LabelDifference32. This produces a jump table entry such as
".word LBB123 - LJTI1_2", which represents the distance between the block
and the jump table.
A new relocation type (IMAGE_REL_ARM64_REL32) would be needed to do the
fixup correctly if they are in different COFF sections.
This change places the jump table in the same COFF section as the
associated code. An ideal fix would be to use the IMAGE_REL_ARM64_REL32
relocation type.
[X86] Don't peek through bitcasts before checking ISD::isBuildVectorOfConstantSDNodes in combineTruncatedArithmetic
We don't have any combines that can look through a bitcast to truncate a build vector of constants. So the truncate will stick around, giving us a pattern like (binop (trunc X), (trunc (bitcast (build_vector)))), which has two truncates in it. This is then reversed by hoistLogicOpWithSameOpcodeHands in the generic DAG combiner, causing an infinite loop.
Even if we had a combine for (truncate (bitcast (build_vector))), I think it would need to be implemented in getNode; otherwise DAG combiner visit ordering would probably still visit the binop first and reverse it. Or combineTruncatedArithmetic would need to do its own constant folding.
[bindings/go] Fix building on 32-bit systems (ARM etc.)
Summary:
The patch in https://reviews.llvm.org/D53883 (by me) fails to build on 32-bit systems like ARM. Fix the array size to be less ridiculously large. 2<<20 should still be enough for all practical purposes.
[XRay][tools] Revert "Use Support/JSON.h in llvm-xray convert"
Summary:
This reverts D50129 / rL338834: [XRay][tools] Use Support/JSON.h in llvm-xray convert
Abstractions are great.
Readable code is great.
JSON support library is a *good* idea.
However unfortunately, there is an internal detail that one needs
to be aware of in `llvm::json::Object` - it uses `llvm::DenseMap`.
So for **every** `llvm::json::Object`, even if you only store a single `int`
entry there, you pay the whole price of `llvm::DenseMap`.
Unfortunately, it matters for `llvm-xray`.
I was trying to analyse the `llvm-exegesis` analysis mode performance,
and for that I wanted to view the LLVM X-Ray log visualization in the
Chrome trace viewer. And `llvm-xray convert` is sluggish, and sometimes
even ended up being killed by the OOM killer.
`xray-log.llvm-exegesis.lwZ0sT` was acquired from `llvm-exegesis`
(compiled with `-fxray-instruction-threshold=128`)
analysis mode over `-benchmarks-file` with 10099 points (one full
latency measurement set), with normal runtime of 0.387s.
Timings:
Old: (copied from D58580)
```
$ perf stat -r 5 ./bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace.yml xray-log.llvm-exegesis.lwZ0sT
```
So using `llvm::json` nearly triples run-time on that test case.
(+3x is times, not percent.)
Memory:
Old:
```
total runtime: 39.88s.
bytes allocated in total (ignoring deallocations): 79.07GB (1.98GB/s)
calls to allocation functions: 33267816 (834135/s)
temporary memory allocations: 5832298 (146235/s)
peak heap memory consumption: 9.21GB
peak RSS (including heaptrack overhead): 147.98GB
total memory leaked: 1.09MB
```
New:
```
total runtime: 17.42s.
bytes allocated in total (ignoring deallocations): 5.12GB (293.86MB/s)
calls to allocation functions: 21382982 (1227284/s)
temporary memory allocations: 232858 (13364/s)
peak heap memory consumption: 350.69MB
peak RSS (including heaptrack overhead): 2.55GB
total memory leaked: 79.95KB
```
Diff:
```
total runtime: -22.46s.
bytes allocated in total (ignoring deallocations): -73.95GB (3.29GB/s)
calls to allocation functions: -11884834 (529155/s)
temporary memory allocations: -5599440 (249307/s)
peak heap memory consumption: -8.86GB
peak RSS (including heaptrack overhead): 0B
total memory leaked: -1.01MB
```
So using `llvm::json` increases *peak* memory consumption on *this* testcase ~+27x.
And total allocation count +1.5x. Both of these numbers are times, *not* percent.
And note that memory usage is clearly unbounded with `llvm::json`: it directly
depends on the length of the log, so peak memory consumption is always
increasing. This isn't so with the dumb code: there is no accumulating memory
consumption, and peak memory consumption is fixed. Naturally, that means it
will handle *much* larger logs without OOM'ing.
Readability is good, but the price is simply unacceptable here.
Too bad none of this analysis was done as part of the development/review of
D50129 itself.
Reviewers: dberris, kpw, sammccall
Reviewed By: dberris
Subscribers: riccibruno, hans, courbet, jdoerfert, llvm-commits
AArch64/test: Add check for function name to machine-outliner-bad-adrp.mir
Summary:
This test was failing in one of our setups because the generated ModuleID
had the full path of the test file and that path contained the string
BL.
[X86][AVX] lowerShuffleAsLanePermuteAndPermute - fully populate the lane shuffle mask (PR40730)
As detailed on PR40730, we are not correctly filling in the lane shuffle mask (D53148/rL344446) - we fill in for the correct src lane but don't add it to the correct mask element, so any reference to the correct element is likely to see an UNDEF mask index.
This allows constant folding to propagate UNDEFs prior to the lane mask being (correctly) lowered to vperm2f128.
This patch fixes the issue by fully populating the lane shuffle mask - this is more than is necessary (if we only filled in the required mask elements we might be able to match other shuffle instructions - broadcasts etc.), but it's the most cautious approach as this needs to be cherry-picked into the 8.0.0 release branch.
[MergeICmps] Make base ordering really deterministic.
Summary:
The idea is that we now manipulate bases through an `unsigned BaseID` based on
order of appearance in the comparison chain rather than through the `Value*`.
Fixes PR40714.
Reviewers: gchatelet
Subscribers: mgrang, jfb, jdoerfert, llvm-commits, hans
Hans Wennborg [Tue, 12 Feb 2019 12:27:08 +0000 (12:27 +0000)]
[WebAssembly] Backport custom import name changes for LLVM to 8.0.
Specifically, this backports r352479, r352931, r353474, and r353476
to the 8.0 branch. The trunk patches don't apply cleanly to 8.0 due to
some contemporaneous mass-rename and mass-clang-tidy patches, so
this merges them to simplify rebasing.
r352479 [WebAssembly] Re-enable main-function signature rewriting
r352931 [WebAssembly] Add codegen support for the import_field attribute
r353474 [WebAssembly] Fix imported function symbol names that differ from their import names in the .o format
r353476 [WebAssembly] Update test output after rL353474. NFC.
[mips][micromips] Fix how values in .gcc_except_table are calculated
When a landing pad is calculated in a program that is compiled for microMIPS
with the -fPIC flag, it will point to an even address.
Such an error will cause a segmentation fault, as microMIPS instructions
are aligned on odd addresses. This patch sets the last bit of the landing
pad offset to 1, which effectively makes it an odd address pointing
exactly at the instruction.
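The essence of the change (a sketch; the variable names are illustrative):
```cpp
// microMIPS instruction addresses have bit 0 set; force the landing-pad
// offset odd so the computed address points at the instruction itself.
if (IsMicroMips)
  LandingPadOffset |= 1;
```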
AArch64: enforce even/odd register pairs for CASP instructions.
ARMv8.1a CASP instructions need the first of the pair to be an even register
(otherwise the encoding is unallocated). We enforced this during assembly, but
not during CodeGen before.
------------------------------------------------------------------------
When doing 128-bit atomics using CASP we might need to copy a GPRPair to a
different register, but that was unimplemented up to now.
------------------------------------------------------------------------
[X86] Remove a couple places where we unnecessarily pass 0 to the EmitPriority of some FP instruction aliases. NFC
As far as I can tell we already won't emit these aliases due to an operand count check in the tablegen code. Removing these because I couldn't make sense of the inconsistency between fadd and fmul from reading the code.
I checked the AsmMatcher and AsmWriter files before and after this change and there were no differences.
------------------------------------------------------------------------
[X86] Print %st(0) as %st when it's implicit to the instruction. Continue printing it as %st(0) when it's encoded in the instruction.
This is a step back from the change I made in r352985. This appears to be more consistent with gcc and objdump behavior.
------------------------------------------------------------------------
[X86] Print all register forms of x87 fadd/fsub/fdiv/fmul as having two arguments where one is %st.
All of these instructions consume one encoded register, and the other register is %st. They write the result either to %st or to the encoded register. Previously we printed both arguments when the encoded register was written, but only one argument when the result was written to %st. For the stack-popping forms the encoded register is always the destination, and we didn't print both operands. This was inconsistent with gcc and objdump and just made the output assembly code harder to read.
This patch changes things to always print both operands making us consistent with gcc and objdump. The parser should still be able to handle the single register forms just as it did before. This also matches the GNU assembler behavior.
------------------------------------------------------------------------
------------------------------------------------------------------------
r353138 | ctopper | 2019-02-05 05:48:23 +0100 (Tue, 05 Feb 2019) | 1 line
[X86] Add test case from PR40529. NFC
------------------------------------------------------------------------
[X86] Connect the default fpsr and dirflag clobbers in inline assembly to the registers we have defined for them.
Summary:
We don't currently map these constraints to physical register numbers, so they don't make it into the MachineIR representation of inline assembly.
This could cause problems for proper dependency tracking in the machine schedulers, though I don't have a test case that shows that.
------------------------------------------------------------------------
r353334 | ctopper | 2019-02-06 20:50:59 +0100 (Wed, 06 Feb 2019) | 1 line
[X86] Change the CPU on the test case for pr40529.ll to really show the bug. NFC
------------------------------------------------------------------------
[X86] Add FPCW as a register and start using it as an implicit use on floating point instructions.
Summary:
FPCW contains the rounding mode control, which we manipulate to implement fp-to-integer conversion by changing the rounding mode, storing the value to the stack, and then changing the rounding mode back. Because we didn't model FPCW and its dependency chain, other instructions could be scheduled into the middle of the sequence.
This patch introduces the register and adds it as an implicit def of FLDCW and an implicit use of the FP binary arithmetic instructions and store instructions. There are more instructions that need to be updated, but this is a good start. I believe this fixes at least the reduced test case from PR40529.
[cmake] Pass LLVM_TEMPORARILY_ALLOW_OLD_TOOLCHAIN to NATIVE configure
We should propagate this down to host builds so that e.g. people using
an optimized tablegen can do the sub-configure successfully.
------------------------------------------------------------------------
Summary: This change factors out compiler checking / warning, and documents LLVM_FORCE_USE_OLD_TOOLCHAIN. It doesn't introduce any functional or policy changes; those will come later.
Summary:
Capture the current agreed-upon toolchain update policy based on the following
discussions:
- LLVM dev meeting 2018 BoF "Migrating to C++14, and beyond!"
llvm.org/devmtg/2018-10/talk-abstracts.html#bof3
- A Short Policy Proposal Regarding Host Compilers
lists.llvm.org/pipermail/llvm-dev/2018-May/123238.html
- Using C++14 code in LLVM (2018)
lists.llvm.org/pipermail/llvm-dev/2018-May/123182.html
- Using C++14 code in LLVM (2017)
lists.llvm.org/pipermail/llvm-dev/2017-October/118673.html
- Using C++14 code in LLVM (2016)
lists.llvm.org/pipermail/llvm-dev/2016-October/105483.html
- Document and Enforce new Host Compiler Policy
llvm.org/D47073
- Require GCC 5.1 and LLVM 3.5 at a minimum
llvm.org/D46723
[SystemZ] Do not return INT_MIN from strcmp/memcmp
The IPM sequence currently generated to compute the strcmp/memcmp
result will return INT_MIN for the "less than zero" case. While
this is in compliance with the standard, strictly speaking, it
turns out that common applications cannot handle this, e.g. because
they negate a comparison result in order to implement reverse
compares.
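For illustration (generic code, not from the patch), a reverse compare built by negation breaks when the inner compare returns INT_MIN:
```cpp
#include <cstring>

// Common reverse-compare idiom: negate the forward comparison. If
// memcmp returns INT_MIN, the negation overflows (UB) and in practice
// wraps back to INT_MIN, so "less than" is not inverted to "greater
// than" and the resulting order is corrupted.
int reverse_cmp(const void *A, const void *B, std::size_t N) {
  return -std::memcmp(A, B, N);
}
```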
This patch changes code to use a different sequence that will result
in -2 for the "less than zero" case (same as GCC). However, this
requires that the two source operands of the compare instructions
are inverted, which breaks the optimization in removeIPMBasedCompare.
Therefore, I've removed this (and all of optimizeCompareInstr), and
replaced it with a mostly equivalent optimization in combineCCMask
at the DAGcombine level.
Hans Wennborg [Wed, 6 Feb 2019 08:31:22 +0000 (08:31 +0000)]
Merging r353155:
------------------------------------------------------------------------
r353155 | hans | 2019-02-05 12:01:54 +0100 (Tue, 05 Feb 2019) | 11 lines
Fix format string in bindings/go/llvm/ir_test.go (PR40561)
The test started failing for me recently. I don't see any changes around
this code, so maybe it's my local go version that changed or something.
The error seems real to me: we're trying to print an Attribute with %d.
The test talks about "attribute masks"; I'm not sure what that refers to,
but I suppose we could print the raw pointer value, since that's
what the test seems to be comparing.
[MC] Don't error on numberless .file directives on MachO
Summary:
Before r349976, MC ignored such directives when producing an object file
and asserted when re-producing textual assembly output. I turned this
assertion into a hard error in both cases in r349976, but this makes it
unnecessarily difficult to write a single assembly file that supports
both MachO and other object formats that support .file. A user reported
this as PR40578, and we decided to go back to ignoring the directive.
Check bool attribute value in getOptionalBoolLoopAttribute.
Summary:
Check the bool value of the attribute in getOptionalBoolLoopAttribute,
not just its existence.
This eliminates the warning noise generated when vectorization is explicitly disabled.
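A sketch of the intended behaviour (the metadata shape is real; the helper name and exact signature are assumed):
```cpp
// Loop attribute node: !{!"llvm.loop.vectorize.enable", i1 false}
// The i1 operand, not the node's mere presence, decides the answer.
static Optional<bool> boolValueOf(const MDNode *AttrMD) {
  if (!AttrMD || AttrMD->getNumOperands() != 2)
    return None;
  if (auto *IntMD =
          mdconst::dyn_extract_or_null<ConstantInt>(AttrMD->getOperand(1)))
    return IntMD->getZExtValue() != 0;
  return None;
}
```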
[WarnMissedTransforms] Do not warn about already vectorized loops.
LoopVectorize adds llvm.loop.isvectorized but leaves
llvm.loop.vectorize.enable in place. Do not consider such a loop for
user-forced vectorization, since vectorization already happened -- by
prioritizing llvm.loop.isvectorized over everything except
TM_SuppressedByUser.
Summary:
Currently, if an instruction with a memory operand has no debug information,
X86DiscriminateMemOps will generate one based on the first line of the
enclosing function, or the last seen debug info.
This may cause confusion in certain debugging scenarios. The long-term
approach would be to use the line number '0' in such cases; however, that
brings in challenges: the base discriminator value range is limited
(4096 values).
For the short term, this patch adds an opt-in flag for this feature.
See bug 40319 (https://bugs.llvm.org/show_bug.cgi?id=40319)
Summary:
`CodeViewDebug::lowerTypeMemberFunction` used to default to a `Void`
return type if the function's type array was empty. After D54667, it
started blindly indexing the 0th item for the return type, which fails
in `getOperand` for empty arrays if assertions are enabled.
This patch restores the `Void` return type for empty type arrays, and
adds a test generated by Rust in line-only debuginfo mode.
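The restored behaviour, roughly (a sketch; names approximate the CodeView emitter):
```cpp
// Default to Void when the function's type array is empty, instead of
// blindly indexing operand 0.
TypeIndex ReturnTypeIndex = TypeIndex::Void();
ArrayRef<TypeIndex> ArgTypeIndices;
if (!ReturnAndArgTypeIndices.empty()) {
  auto ReturnAndArgTypesRef = makeArrayRef(ReturnAndArgTypeIndices);
  ReturnTypeIndex = ReturnAndArgTypesRef.front();
  ArgTypeIndices = ReturnAndArgTypesRef.drop_front();
}
```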
[cmake] Fix get_llvm_lit_path() to respect LLVM_EXTERNAL_LIT always
Refactor the get_llvm_lit_path() logic to respect LLVM_EXTERNAL_LIT,
and require the fallback to be defined explicitly
as LLVM_DEFAULT_EXTERNAL_LIT. This fixes building libcxx standalone
after r346888.
The old logic was using LLVM_EXTERNAL_LIT both as a user-defined cache
variable and as an optional pre-definition of the default value from the
caller (e.g. libcxx). It included a hack to make this work by assigning
the value back and forth, but it was fragile and stopped working in
libcxx.
The new logic is simpler and more transparent. The default value is
provided in a separate variable and used only when the user-specified
variable is empty (i.e. not overridden).
Reapply: [mips] Handle MipsMCExpr sub-expression for the MEK_DTPREL tag
This reapplies commit r351987 with a fix for the failing test. Now the test
accepts both the DW_OP_GNU_push_tls_address and DW_OP_form_tls_address
opcodes.
Original commit message:
```
This is a fix for a regression introduced by the rL348194 commit. In
that change a new type (MEK_DTPREL) of MipsMCExpr expression was added,
but in some places in the code this type of expression was considered
unexpected.
This change fixes the bug. The MEK_DTPREL type of expression is used for
marking TLS DIEExpr only and contains a regular sub-expression. Where we
need to handle the expression, we retrieve the sub-expression and
handle it in a common way.
```
------------------------------------------------------------------------
[mips] Emit .reloc R_{MICRO}MIPS_JALR along with j(al)r(c) $25
The callee address is added as an optional operand (MCSymbol) in
AdjustInstrPostInstrSelection() and then used by asm printer to insert:
'.reloc tmplabel, R_MIPS_JALR, symbol
tmplabel:'.
Controlled with '-mips-jalr-reloc', default is true.
Hans Wennborg [Thu, 24 Jan 2019 22:07:01 +0000 (22:07 +0000)]
Merging r351930 and r351932:
------------------------------------------------------------------------
r351930 | kbeyls | 2019-01-23 09:18:39 +0100 (Wed, 23 Jan 2019) | 30 lines
[SLH] AArch64: correctly pick temporary register to mask SP
As part of speculation hardening, the stack pointer gets masked with the
taint register (X16) before a function call or before a function return.
Since there are no instructions that can directly mask writing to the
stack pointer, the stack pointer must first be transferred to another
register, where it can be masked, before that value is transferred back
to the stack pointer.
Before, that temporary register was always picked to be x17, since the
ABI allows clobbering x17 on any function call, resulting in the
following instruction pattern being inserted before function calls and
returns/tail calls:
mov x17, sp
and x17, x17, x16
mov sp, x17
However, x17 can be live in those locations, for example when the call
is an indirect call, using x17 as the target address (blr x17).
To fix this, this patch looks for an available register just before the
call or terminator instruction and uses that.
In the rare case when no register turns out to be available (this
situation is only encountered twice across the whole test-suite), just
insert a full speculation barrier at the start of the basic block where
this occurs.
[AArch64] add more tests for buildvec to shuffle transform; NFC
These are copied from the sibling x86 file. I'm not sure which
of the current outputs (if any) is considered optimal, but
someone more familiar with AArch may want to take a look.
[DAGCombiner] fix crash when converting build vector to shuffle
The regression test is reduced from the example shown in D56281.
This does raise a question as noted in the test file: do we want
to handle this pattern? I don't have a motivating example for
that on x86 yet, but it seems like we could have that pattern
there too, so we could avoid the back-and-forth using a shuffle.