granicus.if.org Git - postgresql/blob - doc/src/sgml/perform.sgml

   1 <!-- $PostgreSQL: pgsql/doc/src/sgml/perform.sgml,v 1.72 2009/08/07 20:54:31 alvherre Exp $ -->
   2
   3  <chapter id="performance-tips">
   4   <title>Performance Tips</title>
   5
   6   <indexterm zone="performance-tips">
   7    <primary>performance</primary>
   8   </indexterm>
   9
  10   <para>
  11    Query performance can be affected by many things. Some of these can
  12    be controlled by the user, while others are fundamental to the underlying
  13    design of the system.  This chapter provides some hints about understanding
  14    and tuning <productname>PostgreSQL</productname> performance.
  15   </para>
  16
  17   <sect1 id="using-explain">
  18    <title>Using <command>EXPLAIN</command></title>
  19
  20    <indexterm zone="using-explain">
  21     <primary>EXPLAIN</primary>
  22    </indexterm>
  23
  24    <indexterm zone="using-explain">
  25     <primary>query plan</primary>
  26    </indexterm>
  27
  28    <para>
  29     <productname>PostgreSQL</productname> devises a <firstterm>query
  30     plan</firstterm> for each query it receives.  Choosing the right
  31     plan to match the query structure and the properties of the data
  32     is absolutely critical for good performance, so the system includes
  33     a complex <firstterm>planner</> that tries to choose good plans.
  34     You can use the
  35     <xref linkend="sql-explain" endterm="sql-explain-title"> command
  36     to see what query plan the planner creates for any query.
  37     Plan-reading is an art that deserves an extensive tutorial, which
  38     this is not; but here is some basic information.
  39    </para>
  40
  41    <para>
  42     The structure of a query plan is a tree of <firstterm>plan nodes</>.
  43     Nodes at the bottom level of the tree are table scan nodes: they return raw rows
  44     from a table.  There are different types of scan nodes for different
  45     table access methods: sequential scans, index scans, and bitmap index
  46     scans.  If the query requires joining, aggregation, sorting, or other
  47     operations on the raw rows, then there will be additional nodes
  48     above the scan nodes to perform these operations.  Again,
  49     there is usually more than one possible way to do these operations,
  50     so different node types can appear here too.  The output
  51     of <command>EXPLAIN</command> has one line for each node in the plan
  52     tree, showing the basic node type plus the cost estimates that the planner
  53     made for the execution of that plan node.  The first line (topmost node)
  54     has the estimated total execution cost for the plan; it is this number
  55     that the planner seeks to minimize.
  56    </para>
  57
  58    <para>
  59     Here is a trivial example, just to show what the output looks like:
  60     <footnote>
  61      <para>
  62       Examples in this section are drawn from the regression test database
  63       after doing a <command>VACUUM ANALYZE</>, using 8.2 development sources.
  64       You should be able to get similar results if you try the examples yourself,
  65       but your estimated costs and row counts might vary slightly
  66       because <command>ANALYZE</>'s statistics are random samples rather
  67       than exact.
  68      </para>
  69     </footnote>
  70
  71 <programlisting>
  72 EXPLAIN SELECT * FROM tenk1;
  73
  74                          QUERY PLAN
  75 -------------------------------------------------------------
  76  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244)
  77 </programlisting>
  78    </para>
  79
  80    <para>
  81     The numbers that are quoted by <command>EXPLAIN</command> are (left
  82     to right):
  83
  84     <itemizedlist>
  85      <listitem>
  86       <para>
  87        Estimated start-up cost (time expended before the output scan can start,
  88        e.g., time to do the sorting in a sort node)
  89       </para>
  90      </listitem>
  91
  92      <listitem>
  93       <para>
  94        Estimated total cost (if all rows are retrieved, though they might
  95        not be; e.g., a query with a <literal>LIMIT</> clause will stop
  96        short of paying the total cost of the <literal>Limit</> plan node's
  97        input node)
  98       </para>
  99      </listitem>
 100
 101      <listitem>
 102       <para>
 103        Estimated number of rows output by this plan node (again, only if
 104        executed to completion)
 105       </para>
 106      </listitem>
 107
 108      <listitem>
 109       <para>
 110        Estimated average width (in bytes) of rows output by this plan
 111        node
 112       </para>
 113      </listitem>
 114     </itemizedlist>
 115    </para>
 116
 117    <para>
 118     The costs are measured in arbitrary units determined by the planner's
 119     cost parameters (see <xref linkend="runtime-config-query-constants">).
 120     Traditional practice is to measure the costs in units of disk page
 121     fetches; that is, <xref linkend="guc-seq-page-cost"> is conventionally
 122     set to <literal>1.0</> and the other cost parameters are set relative
 123     to that.  (The examples in this section are run with the default cost
 124     parameters.)
 125    </para>
 126
 127    <para>
 128     It's important to note that the cost of an upper-level node includes
 129     the cost of all its child nodes.  It's also important to realize that
 130     the cost only reflects things that the planner cares about.
 131     In particular, the cost does not consider the time spent transmitting
 132     result rows to the client, which could be an important
 133     factor in the real elapsed time; but the planner ignores it because
 134     it cannot change it by altering the plan.  (Every correct plan will
 135     output the same row set, we trust.)
 136    </para>
 137
 138    <para>
 139     The <literal>rows</> value is a little tricky
 140     because it is <emphasis>not</emphasis> the
 141     number of rows processed or scanned by the plan node.  It is usually less,
 142     reflecting the estimated selectivity of any <literal>WHERE</>-clause
 143     conditions that are being
 144     applied at the node.  Ideally the top-level rows estimate will
 145     approximate the number of rows actually returned, updated, or deleted
 146     by the query.
 147    </para>
 148
 149    <para>
 150     Returning to our example:
 151
 152 <programlisting>
 153 EXPLAIN SELECT * FROM tenk1;
 154
 155                          QUERY PLAN
 156 -------------------------------------------------------------
 157  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244)
 158 </programlisting>
 159    </para>
 160
 161    <para>
 162     This is about as straightforward as it gets.  If you do:
 163
 164 <programlisting>
 165 SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
 166 </programlisting>
 167
 168     you will find that <classname>tenk1</classname> has 358 disk
 169     pages and 10000 rows.  The estimated cost is computed as (disk pages read *
 170     <xref linkend="guc-seq-page-cost">) + (rows scanned *
 171     <xref linkend="guc-cpu-tuple-cost">).  By default,
 172     <varname>seq_page_cost</> is 1.0 and <varname>cpu_tuple_cost</> is 0.01,
 173     so the estimated cost is (358 * 1.0) + (10000 * 0.01) = 458.
 174    </para>
 175
 176    <para>
 177     Now let's modify the original query to add a <literal>WHERE</> condition:
 178
 179 <programlisting>
 180 EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 7000;
 181
 182                          QUERY PLAN
 183 ------------------------------------------------------------
 184  Seq Scan on tenk1  (cost=0.00..483.00 rows=7033 width=244)
 185    Filter: (unique1 &lt; 7000)
 186 </programlisting>
 187
 188     Notice that the <command>EXPLAIN</> output shows the <literal>WHERE</>
 189     clause being applied as a <quote>filter</> condition; this means that
 190     the plan node checks the condition for each row it scans, and outputs
 191     only the ones that pass the condition.
 192     The estimate of output rows has been reduced because of the <literal>WHERE</>
 193     clause.
 194     However, the scan will still have to visit all 10000 rows, so the cost
 195     hasn't decreased; in fact it has gone up a bit (by 10000 * <xref
 196     linkend="guc-cpu-operator-cost">, to be exact) to reflect the extra CPU
 197     time spent checking the <literal>WHERE</> condition.
 198    </para>
 199
 200    <para>
 201     The actual number of rows this query would select is 7000, but the <literal>rows</>
 202     estimate is only approximate.  If you try to duplicate this experiment,
 203     you will probably get a slightly different estimate; moreover, it will
 204     change after each <command>ANALYZE</command> command, because the
 205     statistics produced by <command>ANALYZE</command> are taken from a
 206     randomized sample of the table.
 207    </para>
 208
 209    <para>
 210     Now, let's make the condition more restrictive:
 211
 212 <programlisting>
 213 EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 100;
 214
 215                                   QUERY PLAN
 216 ------------------------------------------------------------------------------
 217  Bitmap Heap Scan on tenk1  (cost=2.37..232.35 rows=106 width=244)
 218    Recheck Cond: (unique1 &lt; 100)
 219    ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..2.37 rows=106 width=0)
 220          Index Cond: (unique1 &lt; 100)
 221 </programlisting>
 222
 223     Here the planner has decided to use a two-step plan: the bottom plan
 224     node visits an index to find the locations of rows matching the index
 225     condition, and then the upper plan node actually fetches those rows
 226     from the table itself.  Fetching the rows separately is much more
 227     expensive than sequentially reading them, but because not all the pages
 228     of the table have to be visited, this is still cheaper than a sequential
 229     scan.  (The reason for using two plan levels is that the upper plan
 230     node sorts the row locations identified by the index into physical order
 231     before reading them, to minimize the cost of separate fetches.
 232     The <quote>bitmap</> mentioned in the node names is the mechanism that
 233     does the sorting.)
 234    </para>
 235
 236    <para>
 237     If the <literal>WHERE</> condition is selective enough, the planner might
 238     switch to a <quote>simple</> index scan plan:
 239
 240 <programlisting>
 241 EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 3;
 242
 243                                   QUERY PLAN
 244 ------------------------------------------------------------------------------
 245  Index Scan using tenk1_unique1 on tenk1  (cost=0.00..10.00 rows=2 width=244)
 246    Index Cond: (unique1 &lt; 3)
 247 </programlisting>
 248
 249     In this case the table rows are fetched in index order, which makes them
 250     even more expensive to read, but there are so few that the extra cost
 251     of sorting the row locations is not worth it.  You'll most often see
 252     this plan type for queries that fetch just a single row, and for queries
 253     that have an <literal>ORDER BY</> condition that matches the index
 254     order.
 255    </para>
 256
 257    <para>
 258     Add another condition to the <literal>WHERE</> clause:
 259
 260 <programlisting>
 261 EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 3 AND stringu1 = 'xxx';
 262
 263                                   QUERY PLAN
 264 ------------------------------------------------------------------------------
 265  Index Scan using tenk1_unique1 on tenk1  (cost=0.00..10.01 rows=1 width=244)
 266    Index Cond: (unique1 &lt; 3)
 267    Filter: (stringu1 = 'xxx'::name)
 268 </programlisting>
 269
 270     The added condition <literal>stringu1 = 'xxx'</literal> reduces the
 271     output-rows estimate, but not the cost because we still have to visit the
 272     same set of rows.  Notice that the <literal>stringu1</> clause
 273     cannot be applied as an index condition (since this index is only on
 274     the <literal>unique1</> column).  Instead it is applied as a filter on
 275     the rows retrieved by the index.  Thus the cost has actually gone up
 276     slightly to reflect this extra checking.
 277    </para>
 278
 279    <para>
 280     If there are indexes on several columns referenced in <literal>WHERE</>, the
 281     planner might choose to use an AND or OR combination of the indexes:
 282
 283 <programlisting>
 284 EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 100 AND unique2 &gt; 9000;
 285
 286                                      QUERY PLAN
 287 -------------------------------------------------------------------------------------
 288  Bitmap Heap Scan on tenk1  (cost=11.27..49.11 rows=11 width=244)
 289    Recheck Cond: ((unique1 &lt; 100) AND (unique2 &gt; 9000))
 290    -&gt;  BitmapAnd  (cost=11.27..11.27 rows=11 width=0)
 291          -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..2.37 rows=106 width=0)
 292                Index Cond: (unique1 &lt; 100)
 293          -&gt;  Bitmap Index Scan on tenk1_unique2  (cost=0.00..8.65 rows=1042 width=0)
 294                Index Cond: (unique2 &gt; 9000)
 295 </programlisting>
 296
 297     But this requires visiting both indexes, so it's not necessarily a win
 298     compared to using just one index and treating the other condition as
 299     a filter.  If you vary the ranges involved you'll see the plan change
 300     accordingly.
 301    </para>
 302
 303    <para>
 304     Let's try joining two tables, using the columns we have been discussing:
 305
 306 <programlisting>
 307 EXPLAIN SELECT *
 308 FROM tenk1 t1, tenk2 t2
 309 WHERE t1.unique1 &lt; 100 AND t1.unique2 = t2.unique2;
 310
 311                                       QUERY PLAN
 312 --------------------------------------------------------------------------------------
 313  Nested Loop  (cost=2.37..553.11 rows=106 width=488)
 314    -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=2.37..232.35 rows=106 width=244)
 315          Recheck Cond: (unique1 &lt; 100)
 316          -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..2.37 rows=106 width=0)
 317                Index Cond: (unique1 &lt; 100)
 318    -&gt;  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.00..3.01 rows=1 width=244)
 319          Index Cond: (t2.unique2 = t1.unique2)
 320 </programlisting>
 321    </para>
 322
 323    <para>
 324     In this nested-loop join, the outer (upper) scan is the same bitmap index scan we
 325     saw earlier, and so its cost and row count are the same because we are
 326     applying the <literal>WHERE</> clause <literal>unique1 &lt; 100</literal>
 327     at that node.
 328     The <literal>t1.unique2 = t2.unique2</literal> clause is not relevant yet,
 329     so it doesn't affect the row count of the outer scan.  For the inner (lower) scan, the
 330     <literal>unique2</> value of the current outer-scan row is plugged into
 331     the inner index scan to produce an index condition like
 332     <literal>t2.unique2 = <replaceable>constant</replaceable></literal>.
 333     So we get the same inner-scan plan and costs that we'd get from, say,
 334     <literal>EXPLAIN SELECT * FROM tenk2 WHERE unique2 = 42</literal>.  The
 335     costs of the loop node are then set on the basis of the cost of the outer
 336     scan, plus one repetition of the inner scan for each outer row (106 * 3.01,
 337     here), plus a little CPU time for join processing.
 338    </para>
 339
 340    <para>
 341     In this example the join's output row count is the same as the product
 342     of the two scans' row counts, but that's not true in all cases because
 343     you can have <literal>WHERE</> clauses that mention both tables
 344     and so can only be applied at the join point, not to either input scan.
 345     For example, if we added
 346     <literal>WHERE ... AND t1.hundred &lt; t2.hundred</literal>,
 347     that would decrease the output row count of the join node, but not change
 348     either input scan.
 349    </para>
 350
 351    <para>
 352     One way to look at variant plans is to force the planner to disregard
 353     whatever strategy it thought was the cheapest, using the enable/disable
 354     flags described in <xref linkend="runtime-config-query-enable">.
 355     (This is a crude tool, but useful.  See
 356     also <xref linkend="explicit-joins">.)
 357
 358 <programlisting>
 359 SET enable_nestloop = off;
 360 EXPLAIN SELECT *
 361 FROM tenk1 t1, tenk2 t2
 362 WHERE t1.unique1 &lt; 100 AND t1.unique2 = t2.unique2;
 363
 364                                         QUERY PLAN
 365 ------------------------------------------------------------------------------------------
 366  Hash Join  (cost=232.61..741.67 rows=106 width=488)
 367    Hash Cond: (t2.unique2 = t1.unique2)
 368    -&gt;  Seq Scan on tenk2 t2  (cost=0.00..458.00 rows=10000 width=244)
 369    -&gt;  Hash  (cost=232.35..232.35 rows=106 width=244)
 370          -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=2.37..232.35 rows=106 width=244)
 371                Recheck Cond: (unique1 &lt; 100)
 372                -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..2.37 rows=106 width=0)
 373                      Index Cond: (unique1 &lt; 100)
 374 </programlisting>
 375
 376     This plan proposes to extract the 100 interesting rows of <classname>tenk1</classname>
 377     using that same old index scan, stash them into an in-memory hash table,
 378     and then do a sequential scan of <classname>tenk2</classname>, probing into the hash table
 379     for possible matches of <literal>t1.unique2 = t2.unique2</literal> for each <classname>tenk2</classname> row.
 380     The cost to read <classname>tenk1</classname> and set up the hash table is a start-up
 381     cost for the hash join, since there will be no output until we can
 382     start reading <classname>tenk2</classname>.  The total time estimate for the join also
 383     includes a hefty charge for the CPU time to probe the hash table
 384     10000 times.  Note, however, that we are <emphasis>not</emphasis> charging 10000 times 232.35;
 385     the hash table setup is only done once in this plan type.
 386    </para>
 387
 388    <para>
 389     It is possible to check the accuracy of the planner's estimated costs
 390     by using <command>EXPLAIN ANALYZE</>.  This command actually executes the query,
 391     and then displays the true run time accumulated within each plan node
 392     along with the same estimated costs that a plain <command>EXPLAIN</command> shows.
 393     For example, we might get a result like this:
 394
 395 <screen>
 396 EXPLAIN ANALYZE SELECT *
 397 FROM tenk1 t1, tenk2 t2
 398 WHERE t1.unique1 &lt; 100 AND t1.unique2 = t2.unique2;
 399
 400                                                             QUERY PLAN
 401 ----------------------------------------------------------------------------------------------------------------------------------
 402  Nested Loop  (cost=2.37..553.11 rows=106 width=488) (actual time=1.392..12.700 rows=100 loops=1)
 403    -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=2.37..232.35 rows=106 width=244) (actual time=0.878..2.367 rows=100 loops=1)
 404          Recheck Cond: (unique1 &lt; 100)
 405          -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..2.37 rows=106 width=0) (actual time=0.546..0.546 rows=100 loops=1)
 406                Index Cond: (unique1 &lt; 100)
 407    -&gt;  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.00..3.01 rows=1 width=244) (actual time=0.067..0.078 rows=1 loops=100)
 408          Index Cond: (t2.unique2 = t1.unique2)
 409  Total runtime: 14.452 ms
 410 </screen>
 411
 412     Note that the <quote>actual time</quote> values are in milliseconds of
 413     real time, whereas the <literal>cost</> estimates are expressed in
 414     arbitrary units; so they are unlikely to match up.
 415     The thing to pay attention to is whether the ratios of actual time and
 416     estimated costs are consistent.
 417    </para>
 418
 419    <para>
 420     In some query plans, it is possible for a subplan node to be executed more
 421     than once.  For example, the inner index scan is executed once per outer
 422     row in the above nested-loop plan.  In such cases, the
 423     <literal>loops</> value reports the
 424     total number of executions of the node, and the actual time and rows
 425     values shown are averages per-execution.  This is done to make the numbers
 426     comparable with the way that the cost estimates are shown.  Multiply by
 427     the <literal>loops</> value to get the total time actually spent in
 428     the node.
 429    </para>
 430
 431    <para>
 432     The <literal>Total runtime</literal> shown by <command>EXPLAIN
 433     ANALYZE</command> includes executor start-up and shut-down time, as well
 434     as time spent processing the result rows.  It does not include parsing,
 435     rewriting, or planning time.  For a <command>SELECT</> query, the total
 436     run time will normally be just a little larger than the total time
 437     reported for the top-level plan node.  For <command>INSERT</>,
 438     <command>UPDATE</>, and <command>DELETE</> commands, the total run time
 439     might be considerably larger, because it includes the time spent processing
 440     the result rows.  For these commands, the time for the top plan node is
 441     essentially the time spent locating the old rows and/or computing
 442     the new ones, but it doesn't include the time spent applying the changes.
 443     Time spent firing triggers, if any, is also outside the top plan node,
 444     and is shown separately for each trigger.
 445    </para>
 446
 447    <para>
 448     It is worth noting that <command>EXPLAIN</> results should not be extrapolated
 449     to situations other than the one you are actually testing; for example,
 450     results on a toy-sized table cannot be assumed to apply to large tables.
 451     The planner's cost estimates are not linear and so it might choose
 452     a different plan for a larger or smaller table.  An extreme example
 453     is that on a table that only occupies one disk page, you'll nearly
 454     always get a sequential scan plan whether indexes are available or not.
 455     The planner realizes that it's going to take one disk page read to
 456     process the table in any case, so there's no value in expending additional
 457     page reads to look at an index.
 458    </para>
 459   </sect1>
 460
 461  <sect1 id="planner-stats">
 462   <title>Statistics Used by the Planner</title>
 463
 464   <indexterm zone="planner-stats">
 465    <primary>statistics</primary>
 466    <secondary>of the planner</secondary>
 467   </indexterm>
 468
 469   <para>
 470    As we saw in the previous section, the query planner needs to estimate
 471    the number of rows retrieved by a query in order to make good choices
 472    of query plans.  This section provides a quick look at the statistics
 473    that the system uses for these estimates.
 474   </para>
 475
 476   <para>
 477    One component of the statistics is the total number of entries in
 478    each table and index, as well as the number of disk blocks occupied
 479    by each table and index.  This information is kept in the table
 480    <link linkend="catalog-pg-class"><structname>pg_class</structname></link>,
 481    in the columns <structfield>reltuples</structfield> and
 482    <structfield>relpages</structfield>.  We can look at it with
 483    queries similar to this one:
 484
 485 <screen>
 486 SELECT relname, relkind, reltuples, relpages
 487 FROM pg_class
 488 WHERE relname LIKE 'tenk1%';
 489
 490        relname        | relkind | reltuples | relpages
 491 ----------------------+---------+-----------+----------
 492  tenk1                | r       |     10000 |      358
 493  tenk1_hundred        | i       |     10000 |       30
 494  tenk1_thous_tenthous | i       |     10000 |       30
 495  tenk1_unique1        | i       |     10000 |       30
 496  tenk1_unique2        | i       |     10000 |       30
 497 (5 rows)
 498 </screen>
 499
 500    Here we can see that <structname>tenk1</structname> contains 10000
 501    rows, as do its indexes, but the indexes are (unsurprisingly) much
 502    smaller than the table.
 503   </para>
 504
 505   <para>
 506    For efficiency reasons, <structfield>reltuples</structfield>
 507    and <structfield>relpages</structfield> are not updated on-the-fly,
 508    and so they usually contain somewhat out-of-date values.
 509    They are updated by <command>VACUUM</>, <command>ANALYZE</>, and a
 510    few DDL commands such as <command>CREATE INDEX</>.  A stand-alone
 511    <command>ANALYZE</>, that is one not part of <command>VACUUM</>,
 512    generates an approximate <structfield>reltuples</structfield> value
 513    since it does not read every row of the table.  The planner
 514    will scale the values it finds in <structname>pg_class</structname>
 515    to match the current physical table size, thus obtaining a closer
 516    approximation.
 517   </para>
 518
 519   <indexterm>
 520    <primary>pg_statistic</primary>
 521   </indexterm>
 522
 523   <para>
 524    Most queries retrieve only a fraction of the rows in a table, due
 525    to <literal>WHERE</> clauses that restrict the rows to be
 526    examined.  The planner thus needs to make an estimate of the
 527    <firstterm>selectivity</> of <literal>WHERE</> clauses, that is,
 528    the fraction of rows that match each condition in the
 529    <literal>WHERE</> clause.  The information used for this task is
 530    stored in the
 531    <link linkend="catalog-pg-statistic"><structname>pg_statistic</structname></link>
 532    system catalog.  Entries in <structname>pg_statistic</structname>
 533    are updated by the <command>ANALYZE</> and <command>VACUUM
 534    ANALYZE</> commands, and are always approximate even when freshly
 535    updated.
 536   </para>
 537
 538   <indexterm>
 539    <primary>pg_stats</primary>
 540   </indexterm>
 541
 542   <para>
 543    Rather than look at <structname>pg_statistic</structname> directly,
 544    it's better to look at its view
 545    <link linkend="view-pg-stats"><structname>pg_stats</structname></link>
 546    when examining the statistics manually.  <structname>pg_stats</structname>
 547    is designed to be more easily readable.  Furthermore,
 548    <structname>pg_stats</structname> is readable by all, whereas
 549    <structname>pg_statistic</structname> is only readable by a superuser.
 550    (This prevents unprivileged users from learning something about
 551    the contents of other people's tables from the statistics.  The
 552    <structname>pg_stats</structname> view is restricted to show only
 553    rows about tables that the current user can read.)
 554    For example, we might do:
 555
 556 <screen>
 557 SELECT attname, n_distinct, most_common_vals
 558 FROM pg_stats
 559 WHERE tablename = 'road';
 560
 561  attname | n_distinct |                                                                                                                                                                                  most_common_vals
 562 ---------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 563  name    |  -0.467008 | {"I- 580                        Ramp","I- 880                        Ramp","Sp Railroad                       ","I- 580                            ","I- 680                        Ramp","I- 80                         Ramp","14th                          St  ","5th                           St  ","Mission                       Blvd","I- 880                            "}
 564  thepath |         20 | {"[(-122.089,37.71),(-122.0886,37.711)]"}
 565 (2 rows)
 566 </screen>
 567   </para>
 568
 569   <para>
 570    The amount of information stored in <structname>pg_statistic</structname>
 571    by <command>ANALYZE</>, in particular the maximum number of entries in the
 572    <structfield>most_common_vals</> and <structfield>histogram_bounds</>
 573    arrays for each column, can be set on a
 574    column-by-column basis using the <command>ALTER TABLE SET STATISTICS</>
 575    command, or globally by setting the
 576    <xref linkend="guc-default-statistics-target"> configuration variable.
 577    The default limit is presently 100 entries.  Raising the limit
 578    might allow more accurate planner estimates to be made, particularly for
 579    columns with irregular data distributions, at the price of consuming
 580    more space in <structname>pg_statistic</structname> and slightly more
 581    time to compute the estimates.  Conversely, a lower limit might be
 582    sufficient for columns with simple data distributions.
 583   </para>
 584
 585   <para>
 586    Further details about the planner's use of statistics can be found in
 587    <xref linkend="planner-stats-details">.
 588   </para>
 589
 590  </sect1>
 591
 592  <sect1 id="explicit-joins">
 593   <title>Controlling the Planner with Explicit <literal>JOIN</> Clauses</title>
 594
 595   <indexterm zone="explicit-joins">
 596    <primary>join</primary>
 597    <secondary>controlling the order</secondary>
 598   </indexterm>
 599
 600   <para>
 601    It is possible
 602    to control the query planner to some extent by using the explicit <literal>JOIN</>
 603    syntax.  To see why this matters, we first need some background.
 604   </para>
 605
 606   <para>
 607    In a simple join query, such as:
 608 <programlisting>
 609 SELECT * FROM a, b, c WHERE a.id = b.id AND b.ref = c.id;
 610 </programlisting>
 611    the planner is free to join the given tables in any order.  For
 612    example, it could generate a query plan that joins A to B, using
 613    the <literal>WHERE</> condition <literal>a.id = b.id</>, and then
 614    joins C to this joined table, using the other <literal>WHERE</>
 615    condition.  Or it could join B to C and then join A to that result.
 616    Or it could join A to C and then join them with B &mdash; but that
 617    would be inefficient, since the full Cartesian product of A and C
 618    would have to be formed, there being no applicable condition in the
 619    <literal>WHERE</> clause to allow optimization of the join.  (All
 620    joins in the <productname>PostgreSQL</productname> executor happen
 621    between two input tables, so it's necessary to build up the result
 622    in one or another of these fashions.)  The important point is that
 623    these different join possibilities give semantically equivalent
 624    results but might have hugely different execution costs.  Therefore,
 625    the planner will explore all of them to try to find the most
 626    efficient query plan.
 627   </para>
 628
 629   <para>
 630    When a query only involves two or three tables, there aren't many join
 631    orders to worry about.  But the number of possible join orders grows
 632    exponentially as the number of tables expands.  Beyond ten or so input
 633    tables it's no longer practical to do an exhaustive search of all the
 634    possibilities, and even for six or seven tables planning might take an
 635    annoyingly long time.  When there are too many input tables, the
 636    <productname>PostgreSQL</productname> planner will switch from exhaustive
 637    search to a <firstterm>genetic</firstterm> probabilistic search
 638    through a limited number of possibilities.  (The switch-over threshold is
 639    set by the <xref linkend="guc-geqo-threshold"> run-time
 640    parameter.)
 641    The genetic search takes less time, but it won't
 642    necessarily find the best possible plan.
 643   </para>
 644
 645   <para>
 646    When the query involves outer joins, the planner has less freedom
 647    than it does for plain (inner) joins. For example, consider:
 648 <programlisting>
 649 SELECT * FROM a LEFT JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
 650 </programlisting>
 651    Although this query's restrictions are superficially similar to the
 652    previous example, the semantics are different because a row must be
 653    emitted for each row of A that has no matching row in the join of B and C.
 654    Therefore the planner has no choice of join order here: it must join
 655    B to C and then join A to that result.  Accordingly, this query takes
 656    less time to plan than the previous query.  In other cases, the planner
 657    might be able to determine that more than one join order is safe.
 658    For example, given:
 659 <programlisting>
 660 SELECT * FROM a LEFT JOIN b ON (a.bid = b.id) LEFT JOIN c ON (a.cid = c.id);
 661 </programlisting>
 662    it is valid to join A to either B or C first.  Currently, only
 663    <literal>FULL JOIN</> completely constrains the join order.  Most
 664    practical cases involving <literal>LEFT JOIN</> or <literal>RIGHT JOIN</>
 665    can be rearranged to some extent.
 666   </para>
 667
 668   <para>
 669    Explicit inner join syntax (<literal>INNER JOIN</>, <literal>CROSS
 670    JOIN</>, or unadorned <literal>JOIN</>) is semantically the same as
 671    listing the input relations in <literal>FROM</>, so it does not
 672    constrain the join order.
 673   </para>
 674
 675   <para>
 676    Even though most kinds of <literal>JOIN</> don't completely constrain
 677    the join order, it is possible to instruct the
 678    <productname>PostgreSQL</productname> query planner to treat all
 679    <literal>JOIN</> clauses as constraining the join order anyway.
 680    For example, these three queries are logically equivalent:
 681 <programlisting>
 682 SELECT * FROM a, b, c WHERE a.id = b.id AND b.ref = c.id;
 683 SELECT * FROM a CROSS JOIN b CROSS JOIN c WHERE a.id = b.id AND b.ref = c.id;
 684 SELECT * FROM a JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
 685 </programlisting>
 686    But if we tell the planner to honor the <literal>JOIN</> order,
 687    the second and third take less time to plan than the first.  This effect
 688    is not worth worrying about for only three tables, but it can be a
 689    lifesaver with many tables.
 690   </para>
 691
 692   <para>
 693    To force the planner to follow the join order laid out by explicit
 694    <literal>JOIN</>s,
 695    set the <xref linkend="guc-join-collapse-limit"> run-time parameter to 1.
 696    (Other possible values are discussed below.)
 697   </para>
 698
 699   <para>
 700    You do not need to constrain the join order completely in order to
 701    cut search time, because it's OK to use <literal>JOIN</> operators
 702    within items of a plain <literal>FROM</> list.  For example, consider:
 703 <programlisting>
 704 SELECT * FROM a CROSS JOIN b, c, d, e WHERE ...;
 705 </programlisting>
 706    With <varname>join_collapse_limit</> = 1, this
 707    forces the planner to join A to B before joining them to other tables,
 708    but doesn't constrain its choices otherwise.  In this example, the
 709    number of possible join orders is reduced by a factor of 5.
 710   </para>
 711
 712   <para>
 713    Constraining the planner's search in this way is a useful technique
 714    both for reducing planning time and for directing the planner to a
 715    good query plan.  If the planner chooses a bad join order by default,
 716    you can force it to choose a better order via <literal>JOIN</> syntax
 717    &mdash; assuming that you know of a better order, that is.  Experimentation
 718    is recommended.
 719   </para>
 720
 721   <para>
 722    A closely related issue that affects planning time is collapsing of
 723    subqueries into their parent query.  For example, consider:
 724 <programlisting>
 725 SELECT *
 726 FROM x, y,
 727     (SELECT * FROM a, b, c WHERE something) AS ss
 728 WHERE somethingelse;
 729 </programlisting>
 730    This situation might arise from use of a view that contains a join;
 731    the view's <literal>SELECT</> rule will be inserted in place of the view
 732    reference, yielding a query much like the above.  Normally, the planner
 733    will try to collapse the subquery into the parent, yielding:
 734 <programlisting>
 735 SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
 736 </programlisting>
 737    This usually results in a better plan than planning the subquery
 738    separately.  (For example, the outer <literal>WHERE</> conditions might be such that
 739    joining X to A first eliminates many rows of A, thus avoiding the need to
 740    form the full logical output of the subquery.)  But at the same time,
 741    we have increased the planning time; here, we have a five-way join
 742    problem replacing two separate three-way join problems.  Because of the
 743    exponential growth of the number of possibilities, this makes a big
 744    difference.  The planner tries to avoid getting stuck in huge join search
 745    problems by not collapsing a subquery if more than <varname>from_collapse_limit</>
 746    <literal>FROM</> items would result in the parent
 747    query.  You can trade off planning time against quality of plan by
 748    adjusting this run-time parameter up or down.
 749   </para>
 750
 751   <para>
 752    <xref linkend="guc-from-collapse-limit"> and <xref
 753    linkend="guc-join-collapse-limit">
 754    are similarly named because they do almost the same thing: one controls
 755    when the planner will <quote>flatten out</> subqueries, and the
 756    other controls when it will flatten out explicit joins.  Typically
 757    you would either set <varname>join_collapse_limit</> equal to
 758    <varname>from_collapse_limit</> (so that explicit joins and subqueries
 759    act similarly) or set <varname>join_collapse_limit</> to 1 (if you want
 760    to control join order with explicit joins).  But you might set them
 761    differently if you are trying to fine-tune the trade-off between planning
 762    time and run time.
 763   </para>
 764  </sect1>
 765
 766  <sect1 id="populate">
 767   <title>Populating a Database</title>
 768
 769   <para>
 770    One might need to insert a large amount of data when first populating
 771    a database. This section contains some suggestions on how to make
 772    this process as efficient as possible.
 773   </para>
 774
 775   <sect2 id="disable-autocommit">
 776    <title>Disable Autocommit</title>
 777
 778    <indexterm>
 779     <primary>autocommit</primary>
 780     <secondary>bulk-loading data</secondary>
 781    </indexterm>
 782
 783    <para>
 784     When using multiple <command>INSERT</>s, turn off autocommit and just do
 785     one commit at the end.  (In plain
 786     SQL, this means issuing <command>BEGIN</command> at the start and
 787     <command>COMMIT</command> at the end.  Some client libraries might
 788     do this behind your back, in which case you need to make sure the
 789     library does it when you want it done.)  If you allow each
 790     insertion to be committed separately,
 791     <productname>PostgreSQL</productname> is doing a lot of work for
 792     each row that is added.  An additional benefit of doing all
 793     insertions in one transaction is that if the insertion of one row
 794     were to fail then the insertion of all rows inserted up to that
 795     point would be rolled back, so you won't be stuck with partially
 796     loaded data.
 797    </para>
 798   </sect2>
 799
 800   <sect2 id="populate-copy-from">
 801    <title>Use <command>COPY</command></title>
 802
 803    <para>
 804     Use <xref linkend="sql-copy" endterm="sql-copy-title"> to load
 805     all the rows in one command, instead of using a series of
 806     <command>INSERT</command> commands.  The <command>COPY</command>
 807     command is optimized for loading large numbers of rows; it is less
 808     flexible than <command>INSERT</command>, but incurs significantly
 809     less overhead for large data loads. Since <command>COPY</command>
 810     is a single command, there is no need to disable autocommit if you
 811     use this method to populate a table.
 812    </para>
 813
 814    <para>
 815     If you cannot use <command>COPY</command>, it might help to use <xref
 816     linkend="sql-prepare" endterm="sql-prepare-title"> to create a
 817     prepared <command>INSERT</command> statement, and then use
 818     <command>EXECUTE</command> as many times as required.  This avoids
 819     some of the overhead of repeatedly parsing and planning
 820     <command>INSERT</command>. Different interfaces provide this facility
 821     in different ways; look for <quote>prepared statements</> in the interface
 822     documentation.
 823    </para>
 824
 825    <para>
 826     Note that loading a large number of rows using
 827     <command>COPY</command> is almost always faster than using
 828     <command>INSERT</command>, even if <command>PREPARE</> is used and
 829     multiple insertions are batched into a single transaction.
 830    </para>
 831
 832    <para>
 833     <command>COPY</command> is fastest when used within the same
 834     transaction as an earlier <command>CREATE TABLE</command> or
 835     <command>TRUNCATE</command> command. In such cases no WAL
 836     needs to be written, because in case of an error, the files
 837     containing the newly loaded data will be removed anyway.
 838     However, this consideration does not apply when
 839     <xref linkend="guc-archive-mode"> is on, as all commands
 840     must write WAL in that case.
 841    </para>
 842
 843   </sect2>
 844
 845   <sect2 id="populate-rm-indexes">
 846    <title>Remove Indexes</title>
 847
 848    <para>
 849     If you are loading a freshly created table, the fastest method is to
 850     create the table, bulk load the table's data using
 851     <command>COPY</command>, then create any indexes needed for the
 852     table.  Creating an index on pre-existing data is quicker than
 853     updating it incrementally as each row is loaded.
 854    </para>
 855
 856    <para>
 857     If you are adding large amounts of data to an existing table,
 858     it might be a win to drop the index,
 859     load the table, and then recreate the index.  Of course, the
 860     database performance for other users might suffer
 861     during the time the index is missing.  One should also think
 862     twice before dropping unique indexes, since the error checking
 863     afforded by the unique constraint will be lost while the index is
 864     missing.
 865    </para>
 866   </sect2>
 867
 868   <sect2 id="populate-rm-fkeys">
 869    <title>Remove Foreign Key Constraints</title>
 870
 871    <para>
 872     Just as with indexes, a foreign key constraint can be checked
 873     <quote>in bulk</> more efficiently than row-by-row.  So it might be
 874     useful to drop foreign key constraints, load data, and re-create
 875     the constraints.  Again, there is a trade-off between data load
 876     speed and loss of error checking while the constraint is missing.
 877    </para>
 878   </sect2>
 879
 880   <sect2 id="populate-work-mem">
 881    <title>Increase <varname>maintenance_work_mem</varname></title>
 882
 883    <para>
 884     Temporarily increasing the <xref linkend="guc-maintenance-work-mem">
 885     configuration variable when loading large amounts of data can
 886     lead to improved performance.  This will help to speed up <command>CREATE
 887     INDEX</> commands and <command>ALTER TABLE ADD FOREIGN KEY</> commands.
 888     It won't do much for <command>COPY</> itself, so this advice is
 889     only useful when you are using one or both of the above techniques.
 890    </para>
 891   </sect2>
 892
 893   <sect2 id="populate-checkpoint-segments">
 894    <title>Increase <varname>checkpoint_segments</varname></title>
 895
 896    <para>
 897     Temporarily increasing the <xref
 898     linkend="guc-checkpoint-segments"> configuration variable can also
 899     make large data loads faster.  This is because loading a large
 900     amount of data into <productname>PostgreSQL</productname> will
 901     cause checkpoints to occur more often than the normal checkpoint
 902     frequency (specified by the <varname>checkpoint_timeout</varname>
 903     configuration variable). Whenever a checkpoint occurs, all dirty
 904     pages must be flushed to disk. By increasing
 905     <varname>checkpoint_segments</varname> temporarily during bulk
 906     data loads, the number of checkpoints that are required can be
 907     reduced.
 908    </para>
 909   </sect2>
 910
 911   <sect2 id="populate-pitr">
 912    <title>Turn off <varname>archive_mode</varname></title>
 913
 914    <para>
 915     When loading large amounts of data into an installation that uses
 916     WAL archiving, you might want to disable archiving (turn off the
 917     <xref linkend="guc-archive-mode"> configuration variable)
 918     while loading.  It might be
 919     faster to take a new base backup after the load has completed
 920     than to process a large amount of incremental WAL data.
 921     But note that turning <varname>archive_mode</varname> on or off
 922     requires a server restart.
 923    </para>
 924
 925    <para>
 926     Aside from avoiding the time for the archiver to process the WAL data,
 927     doing this will actually make certain commands faster, because they
 928     are designed not to write WAL at all if <varname>archive_mode</varname>
 929     is off.  (They can guarantee crash safety more cheaply by doing an
 930     <function>fsync</> at the end than by writing WAL.)
 931     This applies to the following commands:
 932     <itemizedlist>
 933      <listitem>
 934       <para>
 935        <command>CREATE TABLE AS SELECT</command>
 936       </para>
 937      </listitem>
 938      <listitem>
 939       <para>
 940        <command>CREATE INDEX</command> (and variants such as
 941        <command>ALTER TABLE ADD PRIMARY KEY</command>)
 942       </para>
 943      </listitem>
 944      <listitem>
 945       <para>
 946        <command>ALTER TABLE SET TABLESPACE</command>
 947       </para>
 948      </listitem>
 949      <listitem>
 950       <para>
 951        <command>CLUSTER</command>
 952       </para>
 953      </listitem>
 954      <listitem>
 955       <para>
 956        <command>COPY FROM</command>, when the target table has been
 957        created or truncated earlier in the same transaction
 958       </para>
 959      </listitem>
 960     </itemizedlist>
 961    </para>
 962   </sect2>
 963
 964   <sect2 id="populate-analyze">
 965    <title>Run <command>ANALYZE</command> Afterwards</title>
 966
 967    <para>
 968     Whenever you have significantly altered the distribution of data
 969     within a table, running <xref linkend="sql-analyze"
 970     endterm="sql-analyze-title"> is strongly recommended. This
 971     includes bulk loading large amounts of data into the table.  Running
 972     <command>ANALYZE</command> (or <command>VACUUM ANALYZE</command>)
 973     ensures that the planner has up-to-date statistics about the
 974     table.  With no statistics or obsolete statistics, the planner might
 975     make poor decisions during query planning, leading to poor
 976     performance on any tables with inaccurate or nonexistent
 977     statistics.  Note that if the autovacuum daemon is enabled, it might
 978     run <command>ANALYZE</command> automatically; see
 979     <xref linkend="vacuum-for-statistics" endterm="vacuum-for-statistics-title">
 980     and <xref linkend="autovacuum" endterm="autovacuum-title"> for more information.
 981    </para>
 982   </sect2>
 983
 984   <sect2 id="populate-pg-dump">
 985    <title>Some Notes About <application>pg_dump</></title>
 986
 987    <para>
 988     Dump scripts generated by <application>pg_dump</> automatically apply
 989     several, but not all, of the above guidelines.  To reload a
 990     <application>pg_dump</> dump as quickly as possible, you need to
 991     do a few extra things manually.  (Note that these points apply while
 992     <emphasis>restoring</> a dump, not while <emphasis>creating</> it.
 993     The same points apply when using <application>pg_restore</> to load
 994     from a <application>pg_dump</> archive file.)
 995    </para>
 996
 997    <para>
 998     By default, <application>pg_dump</> uses <command>COPY</>, and when
 999     it is generating a complete schema-and-data dump, it is careful to
1000     load data before creating indexes and foreign keys.  So in this case
1001     several guidelines are handled automatically.  What is left
1002     for you to do is to:
1003     <itemizedlist>
1004      <listitem>
1005       <para>
1006        Set appropriate (i.e., larger than normal) values for
1007        <varname>maintenance_work_mem</varname> and
1008        <varname>checkpoint_segments</varname>.
1009       </para>
1010      </listitem>
1011      <listitem>
1012       <para>
1013        If using WAL archiving, consider disabling it during the restore.
1014        To do that, turn off <varname>archive_mode</varname> before loading the
1015        dump script, and afterwards turn it back on
1016        and take a fresh base backup.
1017       </para>
1018      </listitem>
1019      <listitem>
1020       <para>
1021        Consider whether the whole dump should be restored as a single
1022        transaction.  To do that, pass the <option>-1</> or
1023        <option>--single-transaction</> command-line option to
1024        <application>psql</> or <application>pg_restore</>. When using this
1025        mode, even the smallest of errors will rollback the entire restore,
1026        possibly discarding many hours of processing.  Depending on how
1027        interrelated the data is, that might seem preferable to manual cleanup,
1028        or not.  <command>COPY</> commands will run fastest if you use a single
1029        transaction and have WAL archiving turned off.
1030       </para>
1031      </listitem>
1032      <listitem>
1033       <para>
1034        Run <command>ANALYZE</> afterwards.
1035       </para>
1036      </listitem>
1037     </itemizedlist>
1038    </para>
1039
1040    <para>
1041     A data-only dump will still use <command>COPY</>, but it does not
1042     drop or recreate indexes, and it does not normally touch foreign
1043     keys.
1044
1045      <footnote>
1046       <para>
1047        You can get the effect of disabling foreign keys by using
1048        the <option>--disable-triggers</> option &mdash; but realize that
1049        that eliminates, rather than just postponing, foreign key
1050        validation, and so it is possible to insert bad data if you use it.
1051       </para>
1052      </footnote>
1053
1054     So when loading a data-only dump, it is up to you to drop and recreate
1055     indexes and foreign keys if you wish to use those techniques.
1056     It's still useful to increase <varname>checkpoint_segments</varname>
1057     while loading the data, but don't bother increasing
1058     <varname>maintenance_work_mem</varname>; rather, you'd do that while
1059     manually recreating indexes and foreign keys afterwards.
1060     And don't forget to <command>ANALYZE</> when you're done; see
1061     <xref linkend="vacuum-for-statistics" endterm="vacuum-for-statistics-title">
1062     and <xref linkend="autovacuum" endterm="autovacuum-title"> for more information.
1063    </para>
1064   </sect2>
1065   </sect1>
1066
1067  </chapter>