Provide a bit more high-level documentation for the GEQO planner.

author Tom Lane <tgl@sss.pgh.pa.us>

Sat, 21 Jul 2007 04:02:41 +0000 (04:02 +0000)

committer Tom Lane <tgl@sss.pgh.pa.us>

Sat, 21 Jul 2007 04:02:41 +0000 (04:02 +0000)
diff --git a/doc/src/sgml/arch-dev.sgml b/doc/src/sgml/arch-dev.sgml

index c861a656e904fcbf5dcbf53a4929613e85598a46..7ee1ba357f09ffc47930393de5b2fe76c27dc4c2 100644
--- a/doc/src/sgml/arch-dev.sgml
+++ b/doc/src/sgml/arch-dev.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/arch-dev.sgml,v 2.29 2007/01/31 20:56:16 momjian Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/arch-dev.sgml,v 2.30 2007/07/21 04:02:41 tgl Exp $ -->
  
   <chapter id="overview">
    <title>Overview of PostgreSQL Internals</title>
@@ -345,9 +345,10 @@
       can be executed would take an excessive amount of time and memory
       space. In particular, this occurs when executing queries
       involving large numbers of join operations. In order to determine
-     a reasonable (not optimal) query plan in a reasonable amount of
-     time, <productname>PostgreSQL</productname> uses a <xref
-     linkend="geqo" endterm="geqo-title">.
+     a reasonable (not necessarily optimal) query plan in a reasonable amount
+     of time, <productname>PostgreSQL</productname> uses a <xref
+     linkend="geqo" endterm="geqo-title"> when the number of joins
+     exceeds a threshold (see <xref linkend="guc-geqo-threshold">).
      </para>
     </note>
  
@@ -380,20 +381,17 @@
       the index's <firstterm>operator class</>, another plan is created using
       the B-tree index to scan the relation. If there are further indexes
       present and the restrictions in the query happen to match a key of an
-     index further plans will be considered.
+     index, further plans will be considered.  Index scan plans are also
+     generated for indexes that have a sort ordering that can match the
+     query's <literal>ORDER BY</> clause (if any), or a sort ordering that
+     might be useful for merge joining (see below).
      </para>
  
      <para>
-     After all feasible plans have been found for scanning single relations,
-     plans for joining relations are created. The planner/optimizer
-     preferentially considers joins between any two relations for which there
-     exist a corresponding join clause in the <literal>WHERE</literal> qualification (i.e. for
-     which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
-     exists). Join pairs with no join clause are considered only when there
-     is no other choice, that is, a particular relation has no available
-     join clauses to any other relation. All possible plans are generated for
-     every join pair considered
-     by the planner/optimizer. The three possible join strategies are:
+     If the query requires joining two or more relations,
+     plans for joining relations are considered
+     after all feasible plans have been found for scanning single relations.
+     The three available join strategies are:
  
       <itemizedlist>
        <listitem>
@@ -439,6 +437,26 @@
       cheapest one.
      </para>
  
+    <para>
+     If the query uses fewer than <xref linkend="guc-geqo-threshold">
+     relations, a near-exhaustive search is conducted to find the best
+     join sequence.  The planner preferentially considers joins between any
+     two relations for which there exist a corresponding join clause in the
+     <literal>WHERE</literal> qualification (i.e. for
+     which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
+     exists). Join pairs with no join clause are considered only when there
+     is no other choice, that is, a particular relation has no available
+     join clauses to any other relation. All possible plans are generated for
+     every join pair considered by the planner, and the one that is
+     (estimated to be) the cheapest is chosen.
+    </para>
+
+    <para>
+     When <varname>geqo_threshold</varname> is exceeded, the join
+     sequences considered are determined by heuristics, as described
+     in <xref linkend="geqo">.  Otherwise the process is the same.
+    </para>
+
      <para>
       The finished plan tree consists of sequential or index scans of
       the base relations, plus nested-loop, merge, or hash join nodes as
diff --git a/doc/src/sgml/geqo.sgml b/doc/src/sgml/geqo.sgml

index 6225dc4c3219ac87cbc8443bff86b44e11f42336..2f680762c13bb45c3b85bbba5c43011de112eb4b 100644
--- a/doc/src/sgml/geqo.sgml
+++ b/doc/src/sgml/geqo.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/geqo.sgml,v 1.39 2007/02/16 03:50:29 momjian Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/geqo.sgml,v 1.40 2007/07/21 04:02:41 tgl Exp $ -->
  
   <chapter id="geqo">
    <chapterinfo>
@@ -186,11 +186,6 @@
      <productname>PostgreSQL</productname> optimizer.
     </para>
  
-   <para>
-    Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's Genitor
-    algorithm.
-   </para>
-
     <para>
      Specific characteristics of the <acronym>GEQO</acronym>
      implementation in <productname>PostgreSQL</productname>
@@ -224,6 +219,11 @@
      </itemizedlist>
     </para>
  
+   <para>
+    Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's
+    Genitor algorithm.
+   </para>
+
     <para>
      The <acronym>GEQO</acronym> module allows
      the <productname>PostgreSQL</productname> query optimizer to
@@ -231,6 +231,42 @@
      non-exhaustive search.
     </para>
  
+  <sect2>
+   <title>Generating Possible Plans with <acronym>GEQO</acronym></title>
+
+   <para>
+    The <acronym>GEQO</acronym> planning process uses the standard planner
+    code to generate plans for scans of individual relations.  Then join
+    plans are developed using the genetic approach.  As shown above, each
+    candidate join plan is represented by a sequence in which to join
+    the base relations.  In the initial stage, the <acronym>GEQO</acronym>
+    code simply generates some possible join sequences at random.  For each
+    join sequence considered, the standard planner code is invoked to
+    estimate the cost of performing the query using that join sequence.
+    (For each step of the join sequence, all three possible join strategies
+    are considered; and all the initially-determined relation scan plans
+    are available.  The estimated cost is the cheapest of these
+    possibilities.)  Join sequences with lower estimated cost are considered
+    <quote>more fit</> than those with higher cost.  The genetic algorithm
+    discards the least fit candidates.  Then new candidates are generated
+    by combining genes of more-fit candidates &mdash; that is, by using
+    randomly-chosen portions of known low-cost join sequences to create
+    new sequences for consideration.  This process is repeated until a
+    preset number of join sequences have been considered; then the best
+    one found at any time during the search is used to generate the finished
+    plan.
+   </para>
+
+   <para>
+    This process is inherently nondeterministic, because of the randomized
+    choices made during both the initial population selection and subsequent
+    <quote>mutation</> of the best candidates.  Hence different plans may
+    be selected from one run to the next, resulting in varying run time
+    and varying output row order.
+   </para>
+
+  </sect2>
+
    <sect2 id="geqo-future">
     <title>Future Implementation Tasks for
      <productname>PostgreSQL</> <acronym>GEQO</acronym></title>
@@ -257,6 +293,16 @@
        </itemizedlist>
       </para>
  
+     <para>
+      In the current implementation, the fitness of each candidate join
+      sequence is estimated by running the standard planner's join selection
+      and cost estimation code from scratch.  To the extent that different
+      candidates use similar sub-sequences of joins, a great deal of work
+      will be repeated.  This could be made significantly faster by retaining
+      cost estimates for sub-joins.  The problem is to avoid expending
+      unreasonable amounts of memory on retaining that state.
+     </para>
+
       <para>
        At a more basic level, it is not clear that solving query optimization
        with a GA algorithm designed for TSP is appropriate.  In the TSP case,
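
Illustration (not part of the commit): the search loop described in the new "Generating Possible Plans with GEQO" section can be sketched in a few lines of Python. This is a toy model under loud assumptions, not the PostgreSQL implementation: `plan_cost`, `crossover`, and `genetic_join_search` are hypothetical names, the cost model is a made-up sum of intermediate result sizes rather than the standard planner's estimate, and the recombination step is a simple slice-and-fill rather than the operator GEQO actually uses.

```python
import random


def plan_cost(order, rows, selectivity):
    """Toy cost of joining relations in the given order.

    Assumption: cost is the sum of intermediate result sizes. The real
    planner instead tries all join strategies and scan plans per step.
    """
    joined = [order[0]]
    size = rows[order[0]]
    total = 0.0
    for rel in order[1:]:
        sel = 1.0
        for prev in joined:
            # Pairs without a join clause default to selectivity 1.0.
            sel *= selectivity.get(frozenset((prev, rel)), 1.0)
        size = size * rows[rel] * sel
        total += size
        joined.append(rel)
    return total


def crossover(a, b, rng):
    """Keep a random slice of parent a; fill the rest in parent b's order."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    kept = a[i:j]
    rest = [r for r in b if r not in kept]
    return rest[:i] + kept + rest[i:]


def genetic_join_search(rels, rows, selectivity,
                        pool_size=8, generations=50, seed=0):
    """Return (best_order, best_cost) after a fixed number of generations."""
    rng = random.Random(seed)

    def cost(order):
        return plan_cost(order, rows, selectivity)

    # Initial stage: random join sequences, sorted so the fittest is first.
    pool = sorted((rng.sample(rels, len(rels)) for _ in range(pool_size)),
                  key=cost)
    for _ in range(generations):
        # Combine two of the more-fit candidates into a new sequence.
        a, b = rng.sample(pool[:pool_size // 2], 2)
        pool.append(crossover(a, b, rng))
        pool.sort(key=cost)
        pool.pop()  # discard the least fit candidate
    return pool[0], cost(pool[0])
```

Because only the least fit candidate is ever discarded, the best sequence seen so far always stays in the pool, which mirrors "the best one found at any time during the search is used". Fixing `seed` makes this sketch repeatable; with an unseeded generator, different runs can return different orders, which is the nondeterminism the new paragraph warns about.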