[PowerPC] ELFv2 MC support for .localentry directive
A second binutils feature needed to support ELFv2 is the .localentry
directive. In the ELFv2 ABI, functions may have two entry points:
one for calling the routine locally via "bl", and one for calling the
function via function pointer (either at the source level, or implicitly
via a PLT stub for global calls). The two entry points share a single
ELF symbol, where the ELF symbol address identifies the global entry
point address, while the local entry point is found by adding a delta
offset to the symbol address. That offset is encoded into three
platform-specific bits of the ELF symbol st_other field.
The .localentry directive instructs the assembler to set those fields
to encode a particular offset. This is typically used by a function
prologue sequence like this:
Note that according to the ABI, when calling the global entry point,
r12 must be set to point the global entry point address itself; while
when calling the local entry point, r2 must be set to point to the TOC
base. The two instructions between the global and local entry point in
the above example translate the first requirement into the second.
This patch implements support in the PowerPC MC streamers to emit the
.localentry directive (both into assembler and ELF object output), as
well as support in the assembler parser to parse that directive.
In addition, there is another change required in MC fixup/relocation
handling to properly deal with relocations targeting function symbols
with two entry points: When the target function is known local, the MC
layer would immediately handle the fixup by inserting the target
address -- this is wrong, since the call may need to go to the local
entry point instead. The GNU assembler handles this case by *not*
directly resolving fixups targeting functions with two entry points,
but always emits the relocation and relies on the linker to handle
this case correctly. This patch changes LLVM MC to do the same (this
is done via the processFixupValue routine).
Similarly, there are cases where the assembler would normally emit a
relocation, but "simplify" it to a relocation targeting a *section*
instead of the actual symbol. For the same reason as above, this
may be wrong when the target symbol has two entry points. The GNU
assembler again handles this case by not performing this simplification
in that case, but leaving the relocation targeting the full symbol,
which is then resolved by the linker. This patch changes LLVM MC
to do the same (via the needsRelocateWithSymbol routine).
NOTE: The method used in this patch is overly pessimistic, since the
needsRelocateWithSymbol routine currently does not have access to the
actual target symbol, and thus must always assume that it might have
two entry points. This will be improved upon by a follow-on patch
that modifies common code to pass the target symbol when calling
needsRelocateWithSymbol.
[PowerPC] ELFv2 MC support for .abiversion directive
ELFv2 binaries are marked by a bit in the ELF header e_flags field.
A new assembler directive .abiversion can be used to set that flag.
This patch implements support in the PowerPC MC streamers to emit the
.abiversion directive (both into assembler and ELF binary output),
as well as support in the assembler parser to parse the .abiversion
directive.
[PowerPC] Refactor byval handling in LowerFormalArguments_64SVR4
When handling an incoming byval argument, we need to possibly write
incoming registers to the stack in order to create an on-stack image
of the parameter, so we can return its address to common code.
This currently uses CreateFixedObject to access the parts of the
parameter save area where the argument is (or needs to be) stored.
However, sometimes we need to access multiple parts of that area,
e.g. to write multiple registers. The code currently uses a new
CreateFixedObject call for each of these accesses, resulting in
a patchwork of overlapping (fixed) stack objects.
This doesn't really matter in the case of fixed objects, since
any access to those turns into a fixed stackpointer + offset
address anyway. However, with the upcoming ELFv2 patches, we
may actually need to place an incoming argument into our *own*
stack frame instead of the caller's. This means we need to use
CreateStackObject instead, and we cannot have multiple overlapping
instances of those.
To make the rest of the argument handling code work equally in
both situations, this patch refactors it to always use just a
single call to CreateFixedObject, and access parts of that object
as required using address arithmetic. This way, we can in a future
patch substitute CreateStackObject without further changes.
[PowerPC] Fix FrameIndex handling in SelectAddressRegImm
The PPCTargetLowering::SelectAddressRegImm routine needs to handle
FrameIndex nodes in a special manner, by tranlating them into a
TargetFrameIndex node. This was done in most cases, but seems to
have been neglected in one path: when the input tree has an OR of
the FrameIndex with an immediate. This can happen if the FrameIndex
can be proven to be sufficiently aligned that an OR of that immediate
is equivalent to an ADD.
The missing handling of FrameIndex in that case caused the SelectionDAG
instruction selection to miss opportunities to merge the OR back into
the FrameIndex node, leading to superfluous addi/ori instructions in
the final assembler output.
MC: permit emitting a symbol value as section relative
This adds an optional parameter to the EmitSymbolValue method in MCStreamer to
permit emitting a symbol value as a section relative value. This is to cover
the use in MCDwarf which should not really know about how to emit a section
relative value for a given target.
This addresses post-review comments from Eric Christopher in SVN r213275.
Matt Arsenault [Sat, 19 Jul 2014 18:44:39 +0000 (18:44 +0000)]
R600/SI: implement range reduction for sin/cos
These instructions can only take a limited input range, and return
the constant value 1 out of range. We should do range reduction to
be able to process arbitrary values. Use a FRACT instruction after
normalization to achieve this. Also add a test for constant folding
with the lowered code with unsafe-fp-math enabled.
v2: use DAG lowering instead of intrinsic, adapt test
v3: calculate constant, fold pattern into instruction definition
v4: misc style fixes, add sin-fold testcase, cosmetics
Hal Finkel [Sat, 19 Jul 2014 13:33:16 +0000 (13:33 +0000)]
[LoopVectorize] Propagate known metadata to vectorized instructions
There are some kinds of metadata that are safe to propagate from the scalar
instructions to the vector instructions (fpmath and tbaa currently).
Regarding TBAA, one might worry about propagating it on if-converted loads and
stores, because the metadata might have had a control dependency on the
condition, and thus actually aliased with some other non-speculated memory
access when the condition was false. However, this would be caught by the
runtime overlap checks.
Andrea Di Biagio [Sat, 19 Jul 2014 07:52:58 +0000 (07:52 +0000)]
[x86] Fix wrong shuffle mask in test 'combine-vec-shuffle-3.ll'. No functional change.
Function @test3c should check that the DAGCombiner is able to fold a pair of
shuffles into a new shuffle with a permute mask of <6,7,2,3>. However, one of
the shuffles in @test3c had a wrong permute mask; this prevented the DAGCombiner
from folding the shuffles into the expected result.
Now that the shuffle mask is fixed, the backend correctly folds the two shuffles
in function @test3c into a single movhlps instruction.
Hal Finkel [Sat, 19 Jul 2014 03:32:02 +0000 (03:32 +0000)]
Handle AddrSpaceCast in stripAndAccumulateInBoundsConstantOffsets
All of the other similar functions in that part of the file look through
addrspacecast in addition to bitcast, and I see no reason why
stripAndAccumulateInBoundsConstantOffsets shouldn't do so also.
Hal Finkel [Sat, 19 Jul 2014 03:25:16 +0000 (03:25 +0000)]
Make Value::isDereferenceablePointer handle offsets to pointer types with dereferenceable attributes
When we have a parameter (or call site return) with a dereferenceable
attribute, it can specify the size of an array pointed to by that parameter. If
we have a value for which we can accumulate a constant offset to such a
parameter, then we can use that offset in a direct comparison with the size
specified by the dereferenceable attribute.
This enables us to handle cases like this:
int foo(int a[static 3]) {
return a[2]; /* this is always dereferenceable */
}
When performing a dynamic stack adjustment without optimisations, we would mark
SP as def and R4 as kill. This occurred as part of the expansion of a
WIN__CHKSTK SDNode which indicated the proper handling of SP and R4. The result
would be that we would double define SP as part of an operation, which is
obviously incorrect.
Furthermore, the VTList for the chain had an incorrect parameter type of i32
instead of Other.
Correct these to permit proper lowering of __builtin_alloca at -O0.
David Blaikie [Sat, 19 Jul 2014 01:05:11 +0000 (01:05 +0000)]
Remove uses of the redundant ".reset(nullptr)" of unique_ptr, in favor of ".reset()"
It's also possible to just write "= nullptr", but there's some question
of whether that's as readable, so I leave it up to authors to pick which
they prefer for now. If we want to discuss standardizing on one or the
other, we can do that at some point in the future.
Lang Hames [Sat, 19 Jul 2014 00:19:17 +0000 (00:19 +0000)]
[MCJIT] Add a 'decodeAddend' method to RuntimeDyldMachO and teach
getBasicRelocationEntry to use this rather than 'memcpy' to get the
relocation addend. Targets with non-trivial addend encodings (E.g. AArch64) can
override decodeAddend to handle immediates with interesting encodings.
Eric Christopher [Fri, 18 Jul 2014 23:41:32 +0000 (23:41 +0000)]
Fundamentally change the MipsSubtarget replacement machinery:
a) Move the replacement level decision to the target machine.
b) Create additional subtargets at the TargetMachine level to
cache and make replacement easy.
c) Make the mips16 features obvious.
d) Remove the override logic as it no longer does anything.
e) Have MipsModuleDAGToDAGISel take only the target machine.
f) Have the constant islands pass grab the current subtarget
from the MachineFunction (via the TargetMachine) instead
of caching it.
g) Unconditionally initialize TLOF.
h) Remove the old complicated subtarget based resetting and
replace it with simple conditionals.
Hal Finkel [Fri, 18 Jul 2014 23:29:49 +0000 (23:29 +0000)]
[PowerPC] 32-bit ELF PIC support
This adds initial support for PPC32 ELF PIC (Position Independent Code; the
-fPIC variety), thus rectifying a long-standing deficiency in the PowerPC
backend.
Eric Christopher [Fri, 18 Jul 2014 22:34:20 +0000 (22:34 +0000)]
Avoid caching the relocation model on the subtarget, this is for
two reasons:
a) we're already caching the target machine which contains it,
b) which relocation model you get is dependent upon whether or
not you ask before MCCodeGenInfo is constructed on the target
machine, so avoid any latent issues there.
Lang Hames [Fri, 18 Jul 2014 20:29:36 +0000 (20:29 +0000)]
[MCJIT] [AArch64] Make sure to propegate ARM64_RELOC_ADDEND values into the
RelocationEntry.
No test case yet, as this primarily hits GOT entries, which RuntimeDyldChecker
can't examine yet. I'm actively working on features that will enable us to
test this.
Eric Christopher [Fri, 18 Jul 2014 20:29:02 +0000 (20:29 +0000)]
Make non-module passes unconditionally added in the pass
manager for mips, and early exit if we don't want to do
anything because of the current subtarget.
Merges equivalent loads on both sides of a hammock/diamond
and hoists into into the header.
Merges equivalent stores on both sides of a hammock/diamond
and sinks it to the footer.
Can enable if conversion and tolerate better load misses
and store operand latencies.
David Blaikie [Fri, 18 Jul 2014 17:49:10 +0000 (17:49 +0000)]
Reapply "DebugInfo: Ensure that all debug location scope chains from instructions within a function, lead to the function itself."""
Recommits 212776 which was reverted in r212793. This has been committed
and recommitted a few times as I try to test it harder and find/fix more
issues. The most recent revert was due to an asan bot failure which I
can't seem to reproduce locally, though I believe I'm following all the
steps the buildbot does.
So I'm going to recommit this in the hopes of investigating the failure
on the buildbot itself... apologies in advance for the bot noise. If
anyone sees failures with this /please/ provide me with any
reproductions, etc.
David Peixotto [Fri, 18 Jul 2014 16:05:14 +0000 (16:05 +0000)]
MC: support different sized constants in constant pools
On AArch64 the pseudo instruction ldr <reg>, =... supports both
32-bit and 64-bit constants. Add support for 64 bit constants for
the pools to support the pseudo instruction fully.
Changes the AArch64 ldr-pseudo tests to use 32-bit registers and
adds tests with 64-bit registers.
Hal Finkel [Fri, 18 Jul 2014 15:51:28 +0000 (15:51 +0000)]
Add a dereferenceable attribute
This attribute indicates that the parameter or return pointer is
dereferenceable. Practically speaking, loads from such a pointer within the
associated byte range are safe to speculatively execute. Such pointer
parameters are common in source languages (C++ references, for example).
Daniel Sanders [Fri, 18 Jul 2014 14:28:19 +0000 (14:28 +0000)]
Add MIPS Technologies to the vendors in llvm::Triple.
This is a prerequisite for checking for 'mti' and 'img' in a consistent way in
clang. Previously 'img' could use Triple::getVendor() but 'mti' could only use
Triple::getVendorName().
Tim Northover [Fri, 18 Jul 2014 13:07:05 +0000 (13:07 +0000)]
AArch64: implement efficient f16 bitcasts
Because i16 is illegal, there's no native DAG method to
represent a bitcast to or from an f16 type. This meant LLVM was
inserting a stack store/load pair which is really not ideal.
Tim Northover [Fri, 18 Jul 2014 12:41:46 +0000 (12:41 +0000)]
CodeGen: soften f16 type by default instead of marking legal.
Actual support for softening f16 operations is still limited, and can be added
when it's needed. But Soften is much closer to being a useful thing to try
than keeping it Legal when no registers can actually hold such values.
Longer term, we probably want something between Soften and Promote semantics
for most targets, it'll be more efficient to promote the 4 basic operations to
f32 than libcall them.
[ARM] Add earlyclobber constraint to pre/post-indexed ARM STR instructions.
The post-indexed instructions were missing the constraint, causing unpredictable STR instructions to be emitted.
The earlyclobber constraint on the pre-indexed STR instructions is not strictly necessary, as the instruction selection for pre-indexed STR instructions goes through an additional layer of pseudo instructions which have the constraint defined, however it doesn't hurt to specify the constraint directly on the pre-indexed instructions as well, since at some point someone might create instances of them programmatically and then the constraint is definitely needed.
Tim Northover [Fri, 18 Jul 2014 08:30:10 +0000 (08:30 +0000)]
NVPTX: support direct f16 <-> f64 conversions via intrinsics.
Clang may well start emitting these soon, and while it may not be
directly relevant for OpenCL or GLSL, the instructions were just
sitting there waiting to be used.
Hal Finkel [Fri, 18 Jul 2014 06:51:55 +0000 (06:51 +0000)]
Rename AlignAttribute to IntAttribute
Currently the only kind of integer IR attributes that we have are alignment
attributes, and so the attribute kind that takes an integer parameter is called
AlignAttr, but that will change (we'll soon be adding a dereferenceable
attribute that also takes an integer value). Accordingly, rename AlignAttribute
to IntAttribute (class names, enums, etc.).
Jim Grosbach [Fri, 18 Jul 2014 00:40:56 +0000 (00:40 +0000)]
X86: Constant fold converting vector setcc results to float.
Since the result of a SETCC for X86 is 0 or -1 in each lane, we can
move unary operations, in this case [su]int_to_fp through the mask
operation and constant fold the operation away. Generally speaking:
UNARYOP(AND(VECTOR_CMP(x,y), constant))
--> AND(VECTOR_CMP(x,y), constant2)
where constant2 is UNARYOP(constant).
This implements the transform where UNARYOP is [su]int_to_fp.
For example, consider the simple function:
define <4 x float> @foo(<4 x float> %val, <4 x float> %test) nounwind {
%cmp = fcmp oeq <4 x float> %val, %test
%ext = zext <4 x i1> %cmp to <4 x i32>
%result = sitofp <4 x i32> %ext to <4 x float>
ret <4 x float> %result
}
Jim Grosbach [Fri, 18 Jul 2014 00:40:52 +0000 (00:40 +0000)]
AArch64: Constant fold converting vector setcc results to float.
Since the result of a SETCC for AArch64 is 0 or -1 in each lane, we can
move unary operations, in this case [su]int_to_fp through the mask
operation and constant fold the operation away. Generally speaking:
UNARYOP(AND(VECTOR_CMP(x,y), constant))
--> AND(VECTOR_CMP(x,y), constant2)
where constant2 is UNARYOP(constant).
This implements the transform where UNARYOP is [su]int_to_fp.
For example, consider the simple function:
define <4 x float> @foo(<4 x float> %val, <4 x float> %test) nounwind {
%cmp = fcmp oeq <4 x float> %val, %test
%ext = zext <4 x i1> %cmp to <4 x i32>
%result = sitofp <4 x i32> %ext to <4 x float>
ret <4 x float> %result
}
Before this change, the code is generated as:
fcmeq.4s v0, v0, v1
movi.4s v1, #0x1 // Integer splat value.
and.16b v0, v0, v1 // Mask lanes based on the comparison.
scvtf.4s v0, v0 // Convert each lane to f32.
ret
After, the code is improved to:
fcmeq.4s v0, v0, v1
fmov.4s v1, #1.00000000 // f32 splat value.
and.16b v0, v0, v1 // Mask lanes based on the comparison.
ret
The svvtf.4s has been constant folded away and the floating point 1.0f
vector lanes are materialized directly via fmov.4s.
Rather than do the folding manually in the target code, teach getNode()
in the generic SelectionDAG to handle folding constant operands of
vector [su]int_to_fp nodes. It is reasonable (as noted in a FIXME) to do
additional constant folding there as well, but I don't have test cases
for those operations, so leaving them for another time when it becomes
appropriate.
Eric Christopher [Fri, 18 Jul 2014 00:08:53 +0000 (00:08 +0000)]
Reset the Subtarget in the AsmPrinter for each machine function
and add explanatory comment about dual initialization. Fix
use of the Subtarget to grab the information off of the target machine.
Eric Christopher [Fri, 18 Jul 2014 00:08:50 +0000 (00:08 +0000)]
Avoid resetting the UseSoftFloat and FloatABIType on the TargetMachine
Options struct and move the comment to inMips16HardFloat. Use the
fact that we now know whether or not we cared about soft float to
set the libcalls.
Accordingly rename mipsSEUsesSoftFloat to abiUsesSoftFloat and
propagate since it's no longer CPU specific.
Lang Hames [Thu, 17 Jul 2014 23:11:30 +0000 (23:11 +0000)]
[MCJIT] Fix the alignment requirements for ARM and AArch64 which were mistakenly
relaxed in the big RuntimeDyldMachO cleanup of r213293.
No test case yet - this was found via inspection and there's no easy way to test
GOT alignment in RuntimeDyldChecker at the moment. I'm working on adding support
for this now, and hope to have a test case for this soon.
Alp Toker [Thu, 17 Jul 2014 20:05:29 +0000 (20:05 +0000)]
Drop the udis86 wrapper from llvm::sys
This optional dependency on the udis86 library was added some time back to aid
JIT development, but doesn't make much sense to link into LLVM binaries these
days.
TableGen: Add 'static' to a large array to avoid a huge stack allocation
Speculative fix for a -Wframe-larger-than warning from gcc. Clang will
implicitly promote such constant arrays to globals, so in theory it
won't hit this.
Rectify r213231. Use proper version of 'ComputeNumSignBits'.
Earlier when the code was in InstCombine, we were calling the version of ComputeNumSignBits in InstCombine.h
that automatically added the DataLayout* before calling into ValueTracking.
When the code moved to InstSimplify, we are calling into ValueTracking directly without passing in the DataLayout*.
This patch rectifies the same by passing DataLayout in ComputeNumSignBits.
Lang Hames [Thu, 17 Jul 2014 18:54:50 +0000 (18:54 +0000)]
[MCJIT] Significantly refactor the RuntimeDyldMachO class.
The previous implementation of RuntimeDyldMachO mixed logic for all targets
within a single class, creating problems for readability, maintainability, and
performance. To address these issues, this patch strips the RuntimeDyldMachO
class down to just target-independent functionality, and moves all
target-specific functionality into target-specific subclasses RuntimeDyldMachO.
The new class hierarchy is as follows:
class RuntimeDyldMachO
Implemented in RuntimeDyldMachO.{h,cpp}
Contains logic that is completely independent of the target. This consists
mostly of MachO helper utilities which the derived classes use to get their
work done.
template <typename Impl>
class RuntimeDyldMachOCRTPBase<Impl> : public RuntimeDyldMachO
Implemented in RuntimeDyldMachO.h
Contains generic MachO algorithms/data structures that defer to the Impl class
for target-specific behaviors.
RuntimeDyldMachOARM : public RuntimeDyldMachOCRTPBase<RuntimeDyldMachOARM>
RuntimeDyldMachOARM64 : public RuntimeDyldMachOCRTPBase<RuntimeDyldMachOARM64>
RuntimeDyldMachOI386 : public RuntimeDyldMachOCRTPBase<RuntimeDyldMachOI386>
RuntimeDyldMachOX86_64 : public RuntimeDyldMachOCRTPBase<RuntimeDyldMachOX86_64>
Implemented in their respective *.h files in lib/ExecutionEngine/RuntimeDyld/MachOTargets
Each of these contains the relocation logic specific to their target architecture.
We now consider the FPOpFusion flag when determining whether
to fuse ops. We also explicitly emit add.rn when fusion is
disabled to prevent ptxas from fusing the operations on its
own.
Adam Nemet [Thu, 17 Jul 2014 17:04:34 +0000 (17:04 +0000)]
[X86] AVX512: Move compressed displacement logic to TD
This does not actually move the logic yet but reimplements it in the Tablegen
language. Then asserts that the new implementation results in the same value.
The next patch will remove the assert and the temporary use of the TSFlags and
remove the C++ implementation.
The formula requires a limited form of the logical left and right operators.
I implemented these with the bit-extract/insert operator (i.e. blah{bits}).
Adam Nemet [Thu, 17 Jul 2014 17:04:27 +0000 (17:04 +0000)]
[TableGen] Allow shift operators to take bits<n>
Convert the operand to int if possible, i.e. if the value is properly
initialized. (I suppose there is further room for improvement here to also
peform the shift if the uninitialized bits are shifted out.)
With this little change we can now compute the scaling factor for compressed
displacement with pure tablegen code in the X86 backend. This is useful
because both the X86-disassembler-specific part of tablegen and the assembler
need this and TD is the natural sharing place.
The patch also adds the missing documentation for the shift and add operator.