**IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
Throughput).
+
+Field *DispatchWidth* is the maximum number of micro opcodes that are dispatched
+to the out-of-order backend every simulated cycle.
+
-IPC is computed dividing the total number of simulated instructions by the total
-number of cycles. In the absence of loop-carried data dependencies, the
-observed IPC tends to a theoretical maximum which can be computed by dividing
-the number of instructions of a single iteration by the *Block RThroughput*.
+IPC is computed by dividing the total number of simulated instructions by the
+total number of cycles.
+
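+For example (the totals below are hypothetical, chosen only to illustrate the
+division):
+
+.. code-block:: python
+
+  total_instructions = 900   # assumed: 300 iterations x 3 instructions each
+  total_cycles = 610         # assumed simulated cycle count
+
+  ipc = total_instructions / total_cycles   # ~1.48
+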
+Field *Block RThroughput* is the reciprocal of the block throughput. Block
+throughput is a theoretical quantity computed as the maximum number of blocks
+(i.e., iterations) that can be executed per simulated clock cycle in the absence
+of loop-carried dependencies. Block throughput is bounded from above by the
+dispatch rate, and by the availability of hardware resources.
+
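+That bound can be made concrete with a small sketch. This is a minimal model,
+not how :program:`llvm-mca` computes the value; the dispatch width, uOp count,
+and per-resource cycles below are assumptions loosely based on the dot-product
+example used throughout this document:
+
+.. code-block:: python
+
+  # Assumed inputs: 2-wide dispatch, 3 uOps per block, and the number of
+  # resource cycles one block consumes on each (single-unit) resource.
+  dispatch_width = 2
+  uops_per_block = 3
+  resource_cycles = {"JFPA": 2.0, "JFPU0": 2.0, "JFPM": 1.0}
+  resource_units = {"JFPA": 1, "JFPU0": 1, "JFPM": 1}
+
+  # Upper bounds on block throughput (in blocks per cycle): one bound from
+  # the dispatch rate, and one from every hardware resource.
+  bounds = [dispatch_width / uops_per_block]
+  bounds += [resource_units[r] / resource_cycles[r] for r in resource_cycles]
+
+  block_throughput = min(bounds)                # 0.5 blocks per cycle
+  block_rthroughput = 1.0 / block_throughput    # 2.0 cycles per block
+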
+In the absence of loop-carried data dependencies, the observed IPC tends to a
+theoretical maximum which can be computed by dividing the number of instructions
+of a single iteration by the `Block RThroughput`.
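+
+For the dot-product example, that division works out as follows (the instruction
+count and the Block RThroughput value are taken from the figures quoted later in
+this section):
+
+.. code-block:: python
+
+  instructions_per_iteration = 3   # vmulps plus two vhaddps
+  block_rthroughput = 2.0          # cycles per iteration
+
+  max_ipc = instructions_per_iteration / block_rthroughput   # 1.5
+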
-Field 'uOps Per Cycle' is computed dividing the total number of simulated micro
+Field 'uOps Per Cycle' is computed by dividing the total number of simulated micro
opcodes by the total number of cycles. A delta between Dispatch Width and this
field is an indicator of a performance issue. In the absence of loop-carried
data dependencies, the observed 'uOps Per Cycle' should tend to a theoretical
maximum throughput which can be computed by dividing the number of uOps of a
-single iteration by the *Block RThroughput*.
+single iteration by the `Block RThroughput`.
+
Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
because the dispatch width limits the maximum size of a dispatch group. Both IPC
-and it limits the number of instructions that can be executed in parallel every
+and 'uOps Per Cycle' are limited by the amount of hardware parallelism. The
+availability of hardware resources affects the resource pressure distribution,
+and it limits the number of instructions that can be executed in parallel every
cycle. A delta between Dispatch Width and the theoretical maximum uOps per
Cycle (computed by dividing the number of uOps of a single iteration by the
-*Block RTrhoughput*) is an indicator of a performance bottleneck caused by the
+`Block RThroughput`) is an indicator of a performance bottleneck caused by the
lack of hardware resources.
In general, the lower the Block RThroughput, the better.
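+
+The same division, applied to uOps, also gives the delta discussed in the next
+paragraph. A minimal sketch, reusing the assumed dot-product counts from above:
+
+.. code-block:: python
+
+  uops_per_iteration = 3
+  block_rthroughput = 2.0
+  dispatch_width = 2.0
+
+  max_uops_per_cycle = uops_per_iteration / block_rthroughput   # 1.5
+  # A positive delta means that dispatch is not the limiting factor: the
+  # bottleneck is the availability of hardware resources.
+  delta = dispatch_width - max_uops_per_cycle                   # 0.5
+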
In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
-are no loop-carried dependencies, the observed *uOps Per Cycle* is expected to
+are no loop-carried dependencies, the observed `uOps Per Cycle` is expected to
approach 1.50 when the number of iterations tends to infinity. The delta between
the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
an indicator of a performance bottleneck caused by the lack of hardware
extra information related to the number of micro opcodes, and opcode properties
(i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').
+
+Field *RThroughput* is the reciprocal of the instruction throughput. Throughput
+is computed as the maximum number of instructions of the same type that can be
+executed per clock cycle in the absence of operand dependencies. In this
+example, the reciprocal throughput of a vector float multiply is 1
+cycle/instruction. That is because the FP multiplier JFPM is only available
+from pipeline JFPU1.
+
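+As a worked example of that definition, here is a minimal sketch of the
+arithmetic, where the unit counts and resource cycles are assumptions consistent
+with the JFPM/JFPU1 description above:
+
+.. code-block:: python
+
+  # RThroughput of an instruction = resource cycles / number of units,
+  # maximized over the resources that the instruction consumes.
+  consumed = {"JFPU1": (1.0, 1), "JFPM": (1.0, 1)}   # (cycles, units)
+  rthroughput = max(cycles / units for cycles, units in consumed.values())
+  # 1.0 cycles/instruction: only one vector float multiply can start per cycle.
+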
The third section is the *Resource pressure view*. This view reports
the average number of resource cycles consumed every iteration by instructions
for every processor resource unit available on the target. Information is
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.
+
+Bottleneck Analysis
+^^^^^^^^^^^^^^^^^^^
+The ``-bottleneck-analysis`` command line option enables the analysis of
+performance bottlenecks.
+
+This analysis is potentially expensive. It attempts to correlate increases in
+backend pressure (caused by pipeline resource pressure and data dependencies) with
+dynamic dispatch stalls.
+
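+A typical invocation might look like the following (the file name
+``dot-product.s`` is an assumption; substitute the path to your own assembly
+kernel):
+
+.. code-block:: bash
+
+  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=500 -bottleneck-analysis dot-product.s
+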
+Below is an example of ``-bottleneck-analysis`` output generated by
+:program:`llvm-mca` for 500 iterations of the dot-product example on btver2.
+
+.. code-block:: none
+
+ Cycles with backend pressure increase [ 48.07% ]
+ Throughput Bottlenecks:
+ Resource Pressure [ 47.77% ]
+ - JFPA [ 47.77% ]
+ - JFPU0 [ 47.77% ]
+ Data Dependencies: [ 0.30% ]
+ - Register Dependencies [ 0.30% ]
+ - Memory Dependencies [ 0.00% ]
+
+ Critical sequence based on the simulation:
+
+ Instruction Dependency Information
+ +----< 2. vhaddps %xmm3, %xmm3, %xmm4
+ |
+ | < loop carried >
+ |
+ | 0. vmulps %xmm0, %xmm1, %xmm2
+ +----> 1. vhaddps %xmm2, %xmm2, %xmm3 ## RESOURCE interference: JFPA [ probability: 74% ]
+ +----> 2. vhaddps %xmm3, %xmm3, %xmm4 ## REGISTER dependency: %xmm3
+ |
+ | < loop carried >
+ |
+ +----> 1. vhaddps %xmm2, %xmm2, %xmm3 ## RESOURCE interference: JFPA [ probability: 74% ]
+
+
+According to the analysis, throughput is limited by resource pressure and not by
+data dependencies. The analysis observed increases in backend pressure during
+48.07% of the simulated run. Almost all of those pressure increase events were
+caused by contention on processor resources JFPA/JFPU0.
+
+The `critical sequence` is the most expensive sequence of instructions according
+to the simulation. It is annotated to provide extra information about critical
+register dependencies and resource interferences between instructions.
+
+Instructions from the critical sequence are expected to significantly impact
+performance. By construction, the accuracy of this analysis is strongly
+dependent on the simulation and (as always) on the quality of the processor
+model in LLVM.
+
+
Extra Statistics to Further Diagnose Performance Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``-all-stats`` command line option enables extra statistics and performance