Adam Nemet [Fri, 7 Oct 2016 17:06:34 +0000 (17:06 +0000)]
New utility to visualize optimization records
This is a new tool built on top of the new YAML ouput generated from
optimization remarks. It produces HTML for easy navigation and
visualization.
The tool assumes that hotness information for the remarks is available
(the YAML file was produced with PGO). It uses hotness to list the
remarks prioritized by the hotness on the index page. Clicking the
source location of the remark in the list takes you the source where the
remarks are rendedered inline in the source.
For now, the tool is meant as prototype.
It's written in Python. It uses PyYAML to parse the input.
Dehao Chen [Fri, 7 Oct 2016 15:21:31 +0000 (15:21 +0000)]
Invoke add-discriminator at -g0 -fsample-profile
Summary: -fsample-profile needs discriminator, which will not be added if built with -g0. This patch makes sure the discriminator is added for sample-profile at -g0. A followup patch will be send out to update clang tests.
Matthew Simpson [Fri, 7 Oct 2016 15:20:13 +0000 (15:20 +0000)]
[LV] Don't mark multi-use branch conditions uniform
Previously, we marked the branch conditions of latch blocks uniform after
vectorization if they were instructions contained in the loop. However, if a
condition instruction has users other than the branch, it may not remain
uniform. This patch ensures the conditions we mark uniform are only used by the
branch. This should fix PR30627.
Tom Stellard [Fri, 7 Oct 2016 14:23:29 +0000 (14:23 +0000)]
[ValueTracking] Fix crash in GetPointerBaseWithConstantOffset()
Summary:
While walking defs of pointer operands we were assuming that the pointer
size would remain constant. This is not true, because addresspacecast
instructions may cast the pointer to an address space with a different
pointer width.
This partial reverts r282612, which was a more conservative solution
to this problem.
Martin Storsjo [Fri, 7 Oct 2016 13:28:53 +0000 (13:28 +0000)]
[ARM] Reapply: Use __rt_div functions for divrem on Windows
Reapplying r283383 after revert in r283442. The additional fix
is a getting rid of a stray space in a function name, in the
refactoring part of the commit.
This avoids falling back to calling out to the GCC rem functions
(__moddi3, __umoddi3) when targeting Windows.
The __rt_div functions have flipped the two arguments compared
to the __aeabi_divmod functions. To match MSVC, we emit a
check for division by zero before actually calling the library
function (even if the library function itself also might do
the same check).
Not all calls to __rt_div functions for division are currently
merged with calls to the same function with the same parameters
for the remainder. This is more wasteful than a div + mls as before,
but avoids calls to __moddi3.
Simon Pilgrim [Fri, 7 Oct 2016 11:18:38 +0000 (11:18 +0000)]
[X86][SSE] Update register class during MOVSD/MOVSS - BLENDPD/BLENDPS commutation
MOVSD/MOVSS take a 128-bit register and a FR32/FR64 register input, the commutation code wasn't taking this into account leading to verification errors.
This patch inserts a vreg copy mi to ensure that the registers are correct.
Alexey Bataev [Fri, 7 Oct 2016 09:39:22 +0000 (09:39 +0000)]
[SLPVectorizer] Fix for PR25748: reduction vectorization after loop
unrolling.
The next code is not vectorized by the SLPVectorizer:
```
int test(unsigned int *p) {
int sum = 0;
for (int i = 0; i < 8; i++)
sum += p[i];
return sum;
}
```
During optimization this loop is fully unrolled and SLPVectorizer is
unable to vectorize it. Patch tries to fix this problem.
Oliver Stannard [Fri, 7 Oct 2016 08:48:24 +0000 (08:48 +0000)]
[ARM] Don't convert switches to lookup tables of pointers with ROPI/RWPI
With the ROPI and RWPI relocation models we can't always have pointers
to global data or functions in constant data, so don't try to convert switches
into lookup tables if any value in the lookup table would require a relocation.
We can still safely emit lookup tables of other values, such as simple
constants.
Hal Finkel [Fri, 7 Oct 2016 02:01:03 +0000 (02:01 +0000)]
[llvm-opt-report] Left justify unrolling counts, etc.
In the left part of the reports, we have things like U<number>; if some of
these numbers use more digits than others, we don't want a space in between the
U and the start of the number. Instead, the space should come afterward. This
way it is clear that the number goes with the U and not any other optimization
indicator that might come later on the line.
Hal Finkel [Fri, 7 Oct 2016 01:57:06 +0000 (01:57 +0000)]
[llvm-opt-report] Left justify unrolling counts, etc.
In the left part of the reports, we have things like U<number>; if some of
these numbers use more digits than others, we don't want a space in between the
U and the start of the number. Instead, the space should come afterward. This
way it is clear that the number goes with the U and not any other optimization
indicator that might come later on the line.
[LV] Remove triples from target-independent vectorizer tests. NFC.
Vectorizer tests in the target-independent directory should not have a target
triple. If a test really needs to query a specific backend, it belongs in the
right target subdirectory (which "REQUIRES" the right backend). Otherwise, it
should not specify a triple.
Mehdi Amini [Thu, 6 Oct 2016 23:26:29 +0000 (23:26 +0000)]
Add a static_assert to enforce that parameters to llvm::format() are not totally unsafe
Summary:
I had for the second time today a bug where llvm::format("%s", Str)
was called with Str being a StringRef. The Linux and MacOS bots were
fine, but windows having different calling convention, it printed
garbage.
Instead we can catch this at compile-time: it is never expected to
call a C vararg printf-like function with non scalar type I believe.
Rong Xu [Thu, 6 Oct 2016 20:38:13 +0000 (20:38 +0000)]
[PGO] Create weak alias for the renamed Comdat function
Add a weak alias to the renamed Comdat function in IR level instrumentation,
using it's original name. This ensures the same behavior w/ and w/o IR
instrumentation, even for non standard conforming code.
When replacing FrameIndex with BasePtr, we must preserve BasePtr for
LEA64_32r since BasePtr is used later for stack adjustment if it is
the same as StackPtr.
[DAG] Generalize build_vector -> vector_shuffle combine for more than 2 inputs
This generalizes the build_vector -> vector_shuffle combine to support any
number of inputs. The idea is to create a binary tree of shuffles, where
the first layer performs pairwise shuffles of the input vectors placing each
input element into the correct lane, and the rest of the tree blends these
shuffles together.
This doesn't try to be smart and create any sort of "optimal" shuffles.
The assumption is that even a "poor" shuffle sequence is better than extracting
and inserting the elements one by one.
Michael Ilseman [Thu, 6 Oct 2016 17:58:38 +0000 (17:58 +0000)]
Add -strip-nonlinetable-debuginfo capability
This adds a new function to DebugInfo.cpp that takes an llvm::Module
as input and removes all debug info metadata that is not directly
needed for line tables, thus effectively stripping all type and
variable information from the module.
The primary motivation for this feature was the bitcode work flow
(cf. http://lists.llvm.org/pipermail/llvm-dev/2016-June/100643.html
for more background). This is not wired up yet, but will be in
subsequent patches. For testing, the new functionality is exposed to
opt with a -strip-nonlinetable-debuginfo option.
The secondary use-case (and one that works right now!) is as a
reduction pass in bugpoint. I added two new bugpoint options
(-disable-strip-debuginfo and -disable-strip-debug-types) to control
the new features. By default it will first attempt to remove all debug
information, then only the type info, and then proceed to hack at any
remaining MDNodes.
Matt Arsenault [Thu, 6 Oct 2016 17:54:30 +0000 (17:54 +0000)]
AMDGPU: Remove leftover implicit operands when folding immediates
When constant folding an operation to a copy or an immediate
mov, the implicit uses/defs of the old instruction were left behind,
e.g. replacing v_or_b32 left the implicit exec use on the new copy.
Hal Finkel [Thu, 6 Oct 2016 11:58:52 +0000 (11:58 +0000)]
[llvm-opt-report] Record VF, etc. correctly for multiple opts on one line
When there are multiple optimizations on one line, record the vectorization
factors, etc. correctly (instead of incorrectly substituting default values).
Reid Kleckner [Wed, 5 Oct 2016 22:36:07 +0000 (22:36 +0000)]
[codeview] Truncate records to maximum record size near 64KB
If we don't truncate, LLVM asserts when the label difference doesn't fit
in a 16 bit field. This patch truncates two kinds of data: trailing null
terminated names in symbol records, and inline line tables. The inline
line table test that I have is too large (many MB), so I'm not checking
it in.
Hal Finkel [Wed, 5 Oct 2016 22:25:33 +0000 (22:25 +0000)]
[llvm-opt-report] Distinguish inlined contexts when optimizations differ
How code is optimized sometimes, perhaps often, depends on the context into
which it was inlined. This change allows llvm-opt-report to track the
differences between the optimizations performed, or not, in different contexts,
and when these differ, display those differences.
For example, this code:
$ cat /tmp/q.cpp
void bar();
void foo(int n) {
for (int i = 0; i < n; ++i)
bar();
}
void quack() {
foo(4);
}
void quack2() {
foo(4);
}
will now produce this report:
< /home/hfinkel/src/llvm/test/tools/llvm-opt-report/Inputs/q.cpp
2 | void bar();
3 | void foo(int n) {
[[
> foo(int):
4 | for (int i = 0; i < n; ++i)
> quack(), quack2():
4 U4 | for (int i = 0; i < n; ++i)
]]
5 | bar();
6 | }
7 |
8 | void quack() {
9 I | foo(4);
10 | }
11 |
12 | void quack2() {
13 I | foo(4);
14 | }
15 |
Note that the tool has demangled the function names, and grouped the reports
associated with line 4. This shows that the loop on line 4 was unrolled by a
factor of 4 when inlined into the functions quack() and quack2(), but not in
the function foo(int) itself.
Hal Finkel [Wed, 5 Oct 2016 22:10:35 +0000 (22:10 +0000)]
Add an llvm-opt-report tool to generate basic source-annotated optimization summaries
LLVM now has the ability to record information from optimization remarks in a
machine-consumable YAML file for later analysis. This can be enabled in opt
(see r282539), and D25225 adds a Clang flag to do the same. This patch adds
llvm-opt-report, a tool to generate basic optimization "listing" files
(annotated sources with information about what optimizations were performed)
from one of these YAML inputs.
D19678 proposed to add this capability directly to Clang, but this more-general
YAML-based infrastructure was the direction we decided upon in that review
thread.
For this optimization report, I focused on making the output as succinct as
possible while providing information on inlining and loop transformations. The
goal here is that the source code should still be easily readable in the
report. My primary inspiration here is the reports generated by Cray's tools
(http://docs.cray.com/books/S-2496-4101/html-S-2496-4101/z1112823641oswald.html).
These reports are highly regarded within the HPC community. Intel's compiler,
for example, also has an optimization-report capability
(https://software.intel.com/sites/default/files/managed/55/b1/new-compiler-optimization-reports.pdf).
$ cat /tmp/v.c
void bar();
void foo() { bar(); }
void Test(int *res, int *c, int *d, int *p, int n) {
int i;
#pragma clang loop vectorize(assume_safety)
for (i = 0; i < 1600; i++) {
res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
}
for (i = 0; i < 16; i++) {
res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
}
foo();
foo(); bar(); foo();
}
D25225 adds -fsave-optimization-record (and
-fsave-optimization-record=filename), and this would be used as follows:
< /tmp/v.c
2 | void bar();
3 | void foo() { bar(); }
4 |
5 | void Test(int *res, int *c, int *d, int *p, int n) {
6 | int i;
7 |
8 | #pragma clang loop vectorize(assume_safety)
9 V4,2 | for (i = 0; i < 1600; i++) {
10 | res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
11 | }
12 |
13 U16 | for (i = 0; i < 16; i++) {
14 | res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
15 | }
16 |
17 I | foo();
18 |
19 | foo(); bar(); foo();
I | ^
I | ^
20 | }
Each source line gets a prefix giving the line number, and a few columns for
important optimizations: inlining, loop unrolling and loop vectorization. An
'I' is printed next to a line where a function was inlined, a 'U' next to an
unrolled loop, and 'V' next to a vectorized loop. These are printed on the
relevant code line when that seems unambiguous, or on subsequent lines when
multiple potential options exist (messages, both positive and negative, from
the same optimization with different column numbers are taken to indicate
potential ambiguity). When on subsequent lines, a '^' is output in the relevant
column.
Annotated source for all relevant input files are put into the listing file
(each starting with '<' and then the file name).
You can disable having the unrolling/vectorization factors appear by using the
-s flag.