<secondary>of the planner</secondary>
</indexterm>
+ <sect2>
+ <title>Single-Column Statistics</title>
<para>
As we saw in the previous section, the query planner needs to estimate
the number of rows retrieved by a query in order to make good choices
of query plans.
Further details about the planner's use of statistics can be found in
<xref linkend="planner-stats-details">.
</para>
+ </sect2>
+
+ <sect2 id="planner-stats-extended">
+ <title>Extended Statistics</title>
+
+ <indexterm zone="planner-stats-extended">
+ <primary>statistics</primary>
+ <secondary>of the planner</secondary>
+ </indexterm>
+ <indexterm>
+ <primary>correlation</primary>
+ <secondary>in the query planner</secondary>
+ </indexterm>
+
+ <indexterm>
+ <primary>pg_statistic_ext</primary>
+ </indexterm>
+
+ <para>
+   It is common to see slow queries running with bad execution plans
+   because multiple columns used in the query clauses are correlated.
+   The planner normally assumes that multiple conditions
+   are independent of each other,
+   an assumption that does not hold when column values are correlated.
+   Regular statistics, because of their per-individual-column nature,
+   cannot capture any knowledge about cross-column correlation.
+   <firstterm>Multivariate statistics</firstterm> can be used to instruct
+   the server to collect statistics across such a set of columns,
+   which are later used by the query optimizer
+   to determine cardinality and selectivity
+   of clauses involving those columns.
+   Multivariate statistics are currently the only use of
+   <firstterm>extended statistics</firstterm>.
+ </para>
+
+ <para>
+   Extended statistics are created using
+   <xref linkend="sql-createstatistics">; see that command's reference
+   page for more details.
+ Data collection is deferred until the next <command>ANALYZE</command>
+ on the table, after which the stored values can be examined in the
+ <link linkend="catalog-pg-statistic-ext"><structname>pg_statistic_ext</structname></link>
+ catalog.
+ </para>
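+
+  <para>
+   For instance, all extended statistics objects defined in the current
+   database can be listed with a query like this (a minimal sketch; the
+   catalog contains further columns as well):
+<programlisting>
+SELECT stxname, stxrelid::regclass AS table_name, stxkeys
+  FROM pg_statistic_ext;
+</programlisting>
+  </para>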
+
+ <para>
+ The following subsections describe the types of extended statistics
+ that are currently supported.
+ </para>
+
+ <sect3>
+ <title>Functional Dependencies</title>
+
+ <para>
+    The simplest type of extended statistics tracks functional
+    dependencies, a concept used in definitions of database normal forms.
+    Put simply, it is said that column <literal>b</> is functionally
+    dependent on column <literal>a</> if knowledge of the value of
+    <literal>a</> is sufficient to determine the value of <literal>b</>.
+    In fully normalized databases, functional dependencies should exist
+    only on primary keys and superkeys. However, many data sets are in
+    practice not fully normalized for various reasons; intentional
+    denormalization for performance reasons is a common example.
+ </para>
+
+ <para>
+    The existence of functional dependencies directly affects the accuracy
+ of estimates in certain queries.
+    If a query contains conditions on both the determining and the
+    dependent column(s), the conditions on the dependent columns do not
+    further restrict the result set; but the query planner, lacking
+    knowledge of the functional dependency, treats those conditions as
+    independent, resulting in underestimates.
+    To inform the planner about functional dependencies,
+    <command>ANALYZE</> collects measurements of cross-column dependency.
+    Assessing the degree of dependency between all sets of columns would
+    be prohibitively expensive, so the search is limited to potential
+    dependencies defined using the <literal>dependencies</> option of
+    extended statistics. It is advisable to create
+    <literal>dependencies</> statistics only for column groups in which
+    functional dependencies actually exist, to avoid unnecessary overhead
+    on both <command>ANALYZE</> and query planning.
+ </para>
+
+ <para>
+    To inspect the functional dependencies collected for a statistics
+    object <literal>stts</literal>, you may do this:
+<programlisting>
+CREATE STATISTICS stts WITH (dependencies)
+ ON (zip, city) FROM zipcodes;
+ANALYZE zipcodes;
+SELECT stxname, stxkeys, stxdependencies
+ FROM pg_statistic_ext
+ WHERE stxname = 'stts';
+ stxname | stxkeys | stxdependencies
+---------+---------+--------------------------------------------
+ stts | 1 5 | [{1 => 5 : 1.000000}, {5 => 1 : 0.423130}]
+(1 row)
+</programlisting>
+    where it can be seen that column 1 (the ZIP code) fully determines
+    column 5 (the city), so its coefficient is 1.0, while the city
+    determines the ZIP code only about 42% of the time, meaning that many
+    cities (58% of them) are represented by more than a single ZIP code.
+ </para>
+
+ <para>
+    When computing selectivity, the planner inspects all conditions and
+    attempts to identify which of them are already implied by other
+    conditions. Selectivity estimates for such redundant conditions are
+    then ignored. For instance, in a query placing equality conditions on
+    both <structfield>zip</structfield> and <structfield>city</structfield>,
+    the estimate for one of the two conditions can be dropped, thus
+    improving the overall row estimate.
+ </para>
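+
+  <para>
+   As an illustration, consider a query that constrains both columns of
+   the hypothetical <structname>zipcodes</structname> table used above
+   (the specific values shown here are made up):
+<programlisting>
+EXPLAIN SELECT * FROM zipcodes WHERE zip = '90210' AND city = 'BEVERLY HILLS';
+</programlisting>
+   With the <literal>stts</> statistics in place, the planner can treat
+   the <structfield>city</structfield> condition as implied by the
+   <structfield>zip</structfield> condition, so the row estimate for the
+   combined clauses should be close to the estimate for the ZIP-code
+   condition alone, rather than the product of the two individual
+   selectivities.
+  </para>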
+
+ <sect4>
+ <title>Limitations of Functional Dependencies</title>
+
+ <para>
+     Functional dependencies are a very simple type of statistics, and
+     as such have several limitations. The first limitation is that they
+     only work with simple equality conditions that compare a column to a
+     constant value. They cannot be used to improve estimates for equality
+     conditions comparing two columns or comparing a column to an
+     expression, nor for range clauses, <literal>LIKE</>, or any other
+     type of condition.
+ </para>
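+
+   <para>
+    For example, assuming a <literal>dependencies</> statistics object
+    covering columns <structfield>a</structfield> and
+    <structfield>b</structfield> of a table <structname>t</structname>
+    (such as the one used in the example below), none of the following
+    clauses can benefit from it, because they are not simple
+    column-equals-constant conditions:
+<programlisting>
+SELECT * FROM t WHERE a = b;               -- comparison of two columns
+SELECT * FROM t WHERE a = 1 AND b &lt; 10;    -- range clause on b
+SELECT * FROM t WHERE a = 1 AND b + 1 = 2; -- expression instead of a plain column
+</programlisting>
+   </para>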
+
+ <para>
+     When eliminating implied conditions, the planner also assumes that
+     the conditions are compatible, that is, that the constant values
+     involved actually occur together in the data. Consider the following
+     example (run against the table <structname>t</structname> defined in
+     <xref linkend="multivariate-statistics-examples">), where this
+     assumption does not hold:
+
+<programlisting>
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 10;
+ QUERY PLAN
+-----------------------------------------------------------------------------
+ Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=0 loops=1)
+ Filter: ((a = 1) AND (b = 10))
+ Rows Removed by Filter: 10000
+</programlisting>
+
+     While there are no rows with such a combination of values, the
+     planner has no way to detect that these two constants never occur
+     together; it only knows that the columns are functionally dependent.
+ </para>
+
+ <para>
+     Whether this assumption holds depends on the queries run against the
+     database; in many cases it is in fact satisfied (e.g., when the
+     application only allows selecting compatible values). But if that is
+     not the case, functional dependencies may not be a viable option.
+ </para>
+ </sect4>
+ </sect3>
+
+ <sect3>
+ <title>Multivariate N-Distinct Coefficients</title>
+
+ <para>
+    Single-column statistics store the number of distinct values in each
+    column. When only single-column statistics are available, however,
+    estimates of the number of distinct values across more than one column
+    (for example, for <literal>GROUP BY a, b</literal>) are frequently
+    wrong, causing the planner to select bad plans.
+    To improve n-distinct estimation when multiple columns are grouped
+    together, the <literal>ndistinct</> option of extended statistics can
+    be used. It instructs <command>ANALYZE</> to collect n-distinct
+    estimates for all possible combinations of two or more columns from
+    the set of columns in the statistics object (the per-column estimates
+    are already available in <structname>pg_statistic</>).
+ </para>
+
+ <para>
+ Continuing the above example, the n-distinct coefficients in a ZIP
+ code table may look like the following:
+<programlisting>
+CREATE STATISTICS stts2 WITH (ndistinct)
+ ON (zip, state, city) FROM zipcodes;
+ANALYZE zipcodes;
+SELECT stxkeys AS k, stxndistinct AS nd
+ FROM pg_statistic_ext
+ WHERE stxname = 'stts2';
+-[ RECORD 1 ]---------------------------------------------
+k | 1 2 5
+nd | [{(b 1 2), 33178.000000}, {(b 1 5), 33178.000000},
+ {(b 2 5), 27435.000000}, {(b 1 2 5), 33178.000000}]
+</programlisting>
+ which indicates that there are three combinations of columns that
+ have 33178 distinct values: ZIP code and state; ZIP code and city;
+ and ZIP code, city and state (the fact that they are all equal is
+ expected given the nature of ZIP-code data). On the other hand,
+ the combination of city and state only has 27435 distinct values.
+ </para>
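+
+   <para>
+    These multivariate n-distinct values are consulted, for example, when
+    estimating the number of groups produced by grouping on the covered
+    columns. With <literal>stts2</> in place, the group-count estimate
+    for a query such as the following (still using the hypothetical
+    <structname>zipcodes</structname> table) should be based on the
+    27435 figure for the city/state combination rather than on the
+    product of the per-column distinct counts:
+<programlisting>
+EXPLAIN SELECT city, state, COUNT(*) FROM zipcodes GROUP BY city, state;
+</programlisting>
+   </para>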
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="explicit-joins">
operator-specific selectivity functions are mostly found
in <filename>src/backend/utils/adt/selfuncs.c</filename>.
</para>
+ </sect1>
- <sect2 id="functional-dependencies">
- <title>Functional Dependencies</title>
+ <sect1 id="multivariate-statistics-examples">
+ <title>Multivariate Statistics Examples</title>
- <para>
- The simplest type of extended statistics are functional dependencies,
- used in definitions of database normal forms. When simplified, saying that
- <literal>b</> is functionally dependent on <literal>a</> means that
- knowledge of value of <literal>a</> is sufficient to determine value of
- <literal>b</>.
- </para>
+ <indexterm>
+ <primary>row estimation</primary>
+ <secondary>multivariate</secondary>
+ </indexterm>
+ <sect2>
+  <title>Functional Dependencies</title>
<para>
- In normalized databases, only functional dependencies on primary keys
- and superkeys are allowed. However, in practice, many data sets are not
- fully normalized, for example, due to intentional denormalization for
- performance reasons.
- </para>
+ Multivariate correlation can be seen with a very simple data set — a
+ table with two columns, both containing the same values:
- <para>
- Functional dependencies directly affect accuracy of the estimates, as
- conditions on the dependent column(s) do not restrict the result set,
- resulting in underestimates.
+<programlisting>
+CREATE TABLE t (a INT, b INT);
+INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
+ANALYZE t;
+</programlisting>
+
+   As explained in <xref linkend="planner-stats">, the planner can
+   determine the cardinality of <structname>t</structname> using the
+   number of pages and rows obtained from <structname>pg_class</structname>:
+
+<programlisting>
+SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
+
+ relpages | reltuples
+----------+-----------
+ 45 | 10000
+</programlisting>
+
+ The data distribution is very simple; there are only 100 distinct values
+ in each column, uniformly distributed.
</para>
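+
+  <para>
+   This can be verified directly, if desired (a quick sanity check, not
+   something the planner itself needs to do):
+<programlisting>
+SELECT COUNT(DISTINCT a) AS distinct_a, COUNT(DISTINCT b) AS distinct_b FROM t;
+
+ distinct_a | distinct_b
+------------+------------
+        100 |        100
+</programlisting>
+  </para>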
<para>
- To inform the planner about the functional dependencies, we collect
- measurements of dependency during <command>ANALYZE</>. Assessing
- dependency between all sets of columns would be prohibitively
- expensive, so we limit our search to potential dependencies defined
- using the <command>CREATE STATISTICS</> command.
+ The following example shows the result of estimating a <literal>WHERE</>
+ condition on the <structfield>a</> column:
<programlisting>
-CREATE TABLE t (a INT, b INT);
-INSERT INTO t SELECT i/100, i/100 FROM generate_series(1,10000) s(i);
-CREATE STATISTICS s1 WITH (dependencies) ON (a, b) FROM t;
-ANALYZE t;
-EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
- QUERY PLAN
--------------------------------------------------------------------------------------------------
- Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual time=0.095..3.118 rows=100 loops=1)
- Filter: ((a = 1) AND (b = 1))
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1;
+ QUERY PLAN
+-------------------------------------------------------------------------------
+ Seq Scan on t (cost=0.00..170.00 rows=100 width=8) (actual rows=100 loops=1)
+ Filter: (a = 1)
Rows Removed by Filter: 9900
- Planning time: 0.367 ms
- Execution time: 3.380 ms
-(5 rows)
</programlisting>
- The planner is now aware of the functional dependencies and considers
- them when computing the selectivity of the second condition. Running
- the query without the statistics would lead to quite different estimates.
+   The planner examines the condition and determines the selectivity
+   of this clause to be 1%. By comparing this estimate to the actual
+   number of rows, we see that the estimate is very accurate
+   (in fact exact, as the table is very small). If the
+   <literal>WHERE</> condition uses the <structfield>b</> column instead,
+   an identical plan is generated. But observe what happens if the same
+   condition is applied to both columns, combined with <literal>AND</>:
<programlisting>
-DROP STATISTICS s1;
-EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
- QUERY PLAN
------------------------------------------------------------------------------------------------
- Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual time=0.000..6.379 rows=100 loops=1)
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
+ QUERY PLAN
+-----------------------------------------------------------------------------
+ Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=100 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 9900
- Planning time: 0.000 ms
- Execution time: 6.379 ms
-(5 rows)
</programlisting>
- </para>
- <para>
- If no dependency exists, the collected statistics do not influence the
- query plan. The only effect is to slow down <command>ANALYZE</>. Should
- partial dependencies exist these will also be stored and applied
- during planning.
+   The planner estimates the selectivity of each condition individually,
+   arriving at the same 1% estimates as above. It then assumes that the
+   conditions are independent and multiplies their selectivities,
+   producing a final estimate of just 0.01%, or 1 row out of the 10,000.
+   The <quote>actual</quote> figures, however, show that this is a
+   significant underestimate: the actual number of rows matching the
+   conditions (100) is two orders of magnitude higher than the estimated
+   value.
</para>
<para>
- Similarly to per-column statistics, extended statistics are stored in
- a system catalog called <structname>pg_statistic_ext</structname>.
- To inspect the statistics <literal>s1</literal> defined above,
- you may do this:
+   This problem can be fixed by creating functional-dependency
+   multivariate statistics on the two columns:
<programlisting>
-SELECT stxname,stxdependencies FROM pg_statistic_ext WHERE stxname = 's1';
- stxname | stxdependencies
----------+--------------------------------------------
- s1 | [{1 => 2 : 1.000000}, {2 => 1 : 1.000000}]
-(1 row)
+CREATE STATISTICS stts WITH (dependencies) ON (a, b) FROM t;
+ANALYZE t;
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
+ QUERY PLAN
+-------------------------------------------------------------------------------
+ Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100 loops=1)
+ Filter: ((a = 1) AND (b = 1))
+ Rows Removed by Filter: 9900
</programlisting>
-
- This shows that the statistics are defined on table <structname>t</>,
- <structfield>attnums</structfield> lists attribute numbers of columns
- (references <structname>pg_attribute</structname>). It also shows
- the length in bytes of the functional dependencies, as found by
- <command>ANALYZE</> when serialized into a <literal>bytea</> column.
</para>
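+
+  <para>
+   The dependency coefficients collected for <literal>stts</> can be
+   inspected in <structname>pg_statistic_ext</structname>. For this data
+   set, where the two columns always contain identical values, both
+   directions of the dependency should be reported as perfect (degree
+   1.0), similar to:
+<programlisting>
+SELECT stxname, stxdependencies FROM pg_statistic_ext WHERE stxname = 'stts';
+ stxname |              stxdependencies
+---------+---------------------------------------------
+ stts    | [{1 => 2 : 1.000000}, {2 => 1 : 1.000000}]
+(1 row)
+</programlisting>
+  </para>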
+ </sect2>
+ <sect2>
+  <title>Multivariate N-Distinct Coefficients</title>
<para>
- When computing the selectivity, the planner inspects all conditions and
- attempts to identify which conditions are already implied by other
- conditions. The selectivity estimates from any redundant conditions are
- ignored from a selectivity point of view. In the example query above,
- the selectivity estimates for either of the conditions may be eliminated,
- thus improving the overall estimate.
+    A similar problem occurs with estimation of the number of distinct
+    values in a set of columns, which is used to determine the number of
+    groups that would be generated by a <literal>GROUP BY</literal> clause.
+    When <literal>GROUP BY</literal> lists a single column, the n-distinct
+    estimate (visible as the estimated number of rows returned by the
+    aggregate execution node) is very accurate:
+<programlisting>
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------
+ HashAggregate (cost=195.00..196.00 rows=100 width=12) (actual rows=100 loops=1)
+ Group Key: a
+ -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000 loops=1)
+</programlisting>
+    But without multivariate statistics, the estimate for the number of
+    groups in a query with two columns in <literal>GROUP BY</literal>, as
+    in the following example, is off by an order of magnitude:
+<programlisting>
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
+ QUERY PLAN
+--------------------------------------------------------------------------------------------
+ HashAggregate (cost=220.00..230.00 rows=1000 width=16) (actual rows=100 loops=1)
+ Group Key: a, b
+ -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
+</programlisting>
+    By dropping the existing statistics object and re-creating it to also
+    include the n-distinct calculation, the estimate is much improved:
+<programlisting>
+DROP STATISTICS stts;
+CREATE STATISTICS stts WITH (dependencies, ndistinct) ON (a, b) FROM t;
+ANALYZE t;
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
+ QUERY PLAN
+--------------------------------------------------------------------------------------------
+ HashAggregate (cost=220.00..221.00 rows=100 width=16) (actual rows=100 loops=1)
+ Group Key: a, b
+ -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
+</programlisting>
</para>
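+
+  <para>
+   If desired, the n-distinct values collected for the re-created
+   <literal>stts</> object can be examined in
+   <structname>pg_statistic_ext</structname> as well; for this table the
+   stored estimate for the combination of the two columns should come out
+   close to 100, the true number of distinct <literal>(a, b)</literal>
+   pairs:
+<programlisting>
+SELECT stxname, stxndistinct FROM pg_statistic_ext WHERE stxname = 'stts';
+</programlisting>
+  </para>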
- <sect3 id="functional-dependencies-limitations">
- <title>Limitations of functional dependencies</title>
-
- <para>
- Functional dependencies are a very simple type of statistics, and
- as such have several limitations. The first limitation is that they
- only work with simple equality conditions, comparing columns and constant
- values. It's not possible to use them to eliminate equality conditions
- comparing two columns or a column to an expression, range clauses,
- <literal>LIKE</> or any other type of conditions.
- </para>
-
- <para>
- When eliminating the implied conditions, the planner assumes that the
- conditions are compatible. Consider the following example, violating
- this assumption:
-
-<programlisting>
-EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 10;
- QUERY PLAN
------------------------------------------------------------------------------------------------
- Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual time=2.992..2.992 rows=0 loops=1)
- Filter: ((a = 1) AND (b = 10))
- Rows Removed by Filter: 10000
- Planning time: 0.232 ms
- Execution time: 3.033 ms
-(5 rows)
-</programlisting>
-
- While there are no rows with such combination of values, the planner
- is unable to verify whether the values match - it only knows that
- the columns are functionally dependent.
- </para>
-
- <para>
- This assumption is more about queries executed on the database - in many
- cases, it's actually satisfied (e.g. when the GUI only allows selecting
- compatible values). But if that's not the case, functional dependencies
- may not be a viable option.
- </para>
-
- <para>
- For additional information about functional dependencies, see
- <filename>src/backend/statistics/README.dependencies</>.
- </para>
-
- </sect3>
-
</sect2>
-
</sect1>
-
</chapter>