**IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
Throughput).
+
+Field *DispatchWidth* is the maximum number of micro opcodes that are dispatched
+to the out-of-order backend every simulated cycle.
+
-IPC is computed dividing the total number of simulated instructions by the total
-number of cycles. In the absence of loop-carried data dependencies, the
-observed IPC tends to a theoretical maximum which can be computed by dividing
-the number of instructions of a single iteration by the *Block RThroughput*.
+IPC is computed by dividing the total number of simulated instructions by the
+total number of cycles.
+
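+For example (the totals below are hypothetical, chosen only to illustrate the
+division):
+
+.. code-block:: python
+
+  total_instructions = 900   # assumed: 300 iterations x 3 instructions each
+  total_cycles = 610         # assumed simulated cycle count
+
+  ipc = total_instructions / total_cycles   # ~1.48
+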
+Field *Block RThroughput* is the reciprocal of the block throughput. Block
+throughput is a theoretical quantity computed as the maximum number of blocks
+(i.e., iterations) that can be executed per simulated clock cycle in the absence
+of loop-carried dependencies. Block throughput is bounded from above by the
+dispatch rate, and by the availability of hardware resources.
+
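+That bound can be made concrete with a small sketch. This is a minimal model,
+not how :program:`llvm-mca` computes the value; the dispatch width, uOp count,
+and per-resource cycles below are assumptions loosely based on the dot-product
+example used throughout this document:
+
+.. code-block:: python
+
+  # Assumed inputs: 2-wide dispatch, 3 uOps per block, and the number of
+  # resource cycles one block consumes on each (single-unit) resource.
+  dispatch_width = 2
+  uops_per_block = 3
+  resource_cycles = {"JFPA": 2.0, "JFPU0": 2.0, "JFPM": 1.0}
+  resource_units = {"JFPA": 1, "JFPU0": 1, "JFPM": 1}
+
+  # Upper bounds on block throughput (in blocks per cycle): one bound from
+  # the dispatch rate, and one from every hardware resource.
+  bounds = [dispatch_width / uops_per_block]
+  bounds += [resource_units[r] / resource_cycles[r] for r in resource_cycles]
+
+  block_throughput = min(bounds)                # 0.5 blocks per cycle
+  block_rthroughput = 1.0 / block_throughput    # 2.0 cycles per block
+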
+In the absence of loop-carried data dependencies, the observed IPC tends to a
+theoretical maximum which can be computed by dividing the number of instructions
+of a single iteration by the `Block RThroughput`.
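+
+For the dot-product example, that division works out as follows (the instruction
+count and the Block RThroughput value are taken from the figures quoted later in
+this section):
+
+.. code-block:: python
+
+  instructions_per_iteration = 3   # vmulps plus two vhaddps
+  block_rthroughput = 2.0          # cycles per iteration
+
+  max_ipc = instructions_per_iteration / block_rthroughput   # 1.5
+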
-Field 'uOps Per Cycle' is computed dividing the total number of simulated micro
+Field 'uOps Per Cycle' is computed by dividing the total number of simulated micro
opcodes by the total number of cycles. A delta between Dispatch Width and this
field is an indicator of a performance issue. In the absence of loop-carried
data dependencies, the observed 'uOps Per Cycle' should tend to a theoretical
maximum throughput which can be computed by dividing the number of uOps of a
-single iteration by the *Block RThroughput*.
+single iteration by the `Block RThroughput`.
+
Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
because the dispatch width limits the maximum size of a dispatch group. Both IPC
-and it limits the number of instructions that can be executed in parallel every
+and 'uOps Per Cycle' are limited by the amount of hardware parallelism. The
+availability of hardware resources affects the resource pressure distribution,
+and it limits the number of instructions that can be executed in parallel every
cycle. A delta between Dispatch Width and the theoretical maximum uOps per
Cycle (computed by dividing the number of uOps of a single iteration by the
-*Block RTrhoughput*) is an indicator of a performance bottleneck caused by the
+`Block RThroughput`) is an indicator of a performance bottleneck caused by the
lack of hardware resources.
In general, the lower the Block RThroughput, the better.
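+
+The same division, applied to uOps, also gives the delta discussed in the next
+paragraph. A minimal sketch, reusing the assumed dot-product counts from above:
+
+.. code-block:: python
+
+  uops_per_iteration = 3
+  block_rthroughput = 2.0
+  dispatch_width = 2.0
+
+  max_uops_per_cycle = uops_per_iteration / block_rthroughput   # 1.5
+  # A positive delta means that dispatch is not the limiting factor: the
+  # bottleneck is the availability of hardware resources.
+  delta = dispatch_width - max_uops_per_cycle                   # 0.5
+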
In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
-are no loop-carried dependencies, the observed *uOps Per Cycle* is expected to
+are no loop-carried dependencies, the observed `uOps Per Cycle` is expected to
approach 1.50 when the number of iterations tends to infinity. The delta between
the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
an indicator of a performance bottleneck caused by the lack of hardware
extra information related to the number of micro opcodes, and opcode properties
(i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').
+
+Field *RThroughput* is the reciprocal of the instruction throughput. Throughput
+is computed as the maximum number of instructions of the same type that can be
+executed per clock cycle in the absence of operand dependencies. In this
+example, the reciprocal throughput of a vector float multiply is 1
+cycle/instruction. That is because the FP multiplier JFPM is only available
+from pipeline JFPU1.
+
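+As a worked example of that definition, here is a minimal sketch of the
+arithmetic, where the unit counts and resource cycles are assumptions consistent
+with the JFPM/JFPU1 description above:
+
+.. code-block:: python
+
+  # RThroughput of an instruction = resource cycles / number of units,
+  # maximized over the resources that the instruction consumes.
+  consumed = {"JFPU1": (1.0, 1), "JFPM": (1.0, 1)}   # (cycles, units)
+  rthroughput = max(cycles / units for cycles, units in consumed.values())
+  # 1.0 cycles/instruction: only one vector float multiply can start per cycle.
+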
The third section is the *Resource pressure view*. This view reports
the average number of resource cycles consumed every iteration by instructions
for every processor resource unit available on the target. Information is
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.
+
+Bottleneck Analysis
+^^^^^^^^^^^^^^^^^^^
+The ``-bottleneck-analysis`` command line option enables the analysis of
+performance bottlenecks.
+
+This analysis is potentially expensive. It attempts to correlate increases in
+backend pressure (caused by pipeline resource pressure and data dependencies) with
+dynamic dispatch stalls.
+
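+A typical invocation might look like the following (the file name
+``dot-product.s`` is an assumption; substitute the path to your own assembly
+kernel):
+
+.. code-block:: bash
+
+  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=500 -bottleneck-analysis dot-product.s
+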
+Below is an example of ``-bottleneck-analysis`` output generated by
+:program:`llvm-mca` for 500 iterations of the dot-product example on btver2.
+
+.. code-block:: none
+
+ Cycles with backend pressure increase [ 48.07% ]
+ Throughput Bottlenecks:
+ Resource Pressure [ 47.77% ]
+ - JFPA [ 47.77% ]
+ - JFPU0 [ 47.77% ]
+ Data Dependencies: [ 0.30% ]
+ - Register Dependencies [ 0.30% ]
+ - Memory Dependencies [ 0.00% ]
+
+ Critical sequence based on the simulation:
+
+ Instruction Dependency Information
+ +----< 2. vhaddps %xmm3, %xmm3, %xmm4
+ |
+ | < loop carried >
+ |
+ | 0. vmulps %xmm0, %xmm1, %xmm2
+ +----> 1. vhaddps %xmm2, %xmm2, %xmm3 ## RESOURCE interference: JFPA [ probability: 74% ]
+ +----> 2. vhaddps %xmm3, %xmm3, %xmm4 ## REGISTER dependency: %xmm3
+ |
+ | < loop carried >
+ |
+ +----> 1. vhaddps %xmm2, %xmm2, %xmm3 ## RESOURCE interference: JFPA [ probability: 74% ]
+
+
+According to the analysis, throughput is limited by resource pressure and not by
+data dependencies. The analysis observed increases in backend pressure during
+48.07% of the simulated run. Almost all of those pressure increase events were
+caused by contention on processor resources JFPA/JFPU0.
+
+The `critical sequence` is the most expensive sequence of instructions according
+to the simulation. It is annotated to provide extra information about critical
+register dependencies and resource interferences between instructions.
+
+Instructions from the critical sequence are expected to significantly impact
+performance. By construction, the accuracy of this analysis is strongly
+dependent on the simulation and (as always) on the quality of the processor
+model in LLVM.
+
+
Extra Statistics to Further Diagnose Performance Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``-all-stats`` command line option enables extra statistics and performance