Trim trailing whitespace

[postgresql] / doc / src / sgml / geqo.sgml
diff --git a/doc/src/sgml/geqo.sgml b/doc/src/sgml/geqo.sgml

index 546a864901b8d5f2010ed755ecfbe929f3ad2e35..e0f8adcd6ed6901653f9fb01196b221af92ee05d 100644 (file)
--- a/doc/src/sgml/geqo.sgml
+++ b/doc/src/sgml/geqo.sgml
@@ -1,40 +1,13 @@
-<!--
-$Header: /cvsroot/pgsql/doc/src/sgml/geqo.sgml,v 1.14 2000/12/16 22:44:47 tgl Exp $
-Genetic Optimizer
--->
+<!-- doc/src/sgml/geqo.sgml -->
  
   <chapter id="geqo">
-  <docinfo>
-   <author>
-    <firstname>Martin</firstname>
-    <surname>Utesch</surname>
-    <affiliation>
-     <orgname>
-      University of Mining and Technology
-     </orgname>
-     <orgdiv>
-      Institute of Automatic Control
-     </orgdiv>
-     <address>
-      <city>
-       Freiberg
-      </city>
-      <country>
-       Germany
-      </country>
-     </address>
-    </affiliation>
-   </author>
-   <date>1997-10-02</date>
-  </docinfo>
-
-  <title>Genetic Query Optimization in Database Systems</title>
+  <title>Genetic Query Optimizer</title>
  
    <para>
     <note>
      <title>Author</title>
      <para>
-     Written by <ulink url="mailto:utesch@aut.tu-freiberg.de">Martin Utesch</ulink>
+     Written by Martin Utesch (<email>utesch@aut.tu-freiberg.de</email>)
       for the Institute of Automatic Control at the University of Mining and Technology in Freiberg, Germany.
      </para>
     </note>
@@ -44,56 +17,57 @@ Genetic Optimizer
     <title>Query Handling as a Complex Optimization Problem</title>
  
     <para>
-    Among all relational operators the most difficult one to process and
-    optimize is the <firstterm>join</firstterm>. The number of alternative plans to answer a query
-    grows exponentially with the number of <command>join</command>s included in it. Further
-    optimization effort is caused by the support of a variety of
-    <firstterm>join methods</firstterm>
-    (e.g., nested loop, hash join, merge join in <productname>Postgres</productname>) to
-    process individual <command>join</command>s and a diversity of
-    <firstterm>indices</firstterm> (e.g., r-tree,
-    b-tree, hash in <productname>Postgres</productname>) as access paths for relations.
+    Among all relational operators the most difficult one to process
+    and optimize is the <firstterm>join</firstterm>. The number of
+    possible query plans grows exponentially with the
+    number of joins in the query. Further optimization effort is
+    caused by the support of a variety of <firstterm>join
+    methods</firstterm> (e.g., nested loop, hash join, merge join in
+    <productname>PostgreSQL</productname>) to process individual joins
+    and a diversity of <firstterm>indexes</firstterm> (e.g.,
+    B-tree, hash, GiST and GIN in <productname>PostgreSQL</productname>) as
+    access paths for relations.
     </para>
  
     <para>
-    The current <productname>Postgres</productname> optimizer
-    implementation performs a <firstterm>near-exhaustive search</firstterm>
-    over the space of alternative strategies. This query 
-    optimization technique is inadequate to support database application
-    domains that involve the need for extensive queries, such as artificial
-    intelligence.
+    The normal <productname>PostgreSQL</productname> query optimizer
+    performs a <firstterm>near-exhaustive search</firstterm> over the
+    space of alternative strategies. This algorithm, first introduced
+    in IBM's System R database, produces a near-optimal join order,
+    but can take an enormous amount of time and memory space when the
+    number of joins in the query grows large. This makes the ordinary
+    <productname>PostgreSQL</productname> query optimizer
+    inappropriate for queries that join a large number of tables.
     </para>
  
     <para>
      The Institute of Automatic Control at the University of Mining and
-    Technology, in Freiberg, Germany, encountered the described problems as its
-    folks wanted to take the <productname>Postgres</productname> DBMS as the backend for a decision
-    support knowledge based system for the maintenance of an electrical
-    power grid. The DBMS needed to handle large <command>join</command> queries for the
-    inference machine of the knowledge based system.
+    Technology, in Freiberg, Germany, encountered some problems when
+    it wanted to use <productname>PostgreSQL</productname> as the
+    backend for a decision support knowledge based system for the
+    maintenance of an electrical power grid. The DBMS needed to handle
+    large join queries for the inference machine of the knowledge
+    based system. The number of joins in these queries made using the
+    normal query optimizer infeasible.
     </para>
  
     <para>
-    Performance difficulties in exploring the space of possible query
-    plans created the demand for a new optimization technique being developed.
-   </para>
-
-   <para>
-    In the following we propose the implementation of a <firstterm>Genetic Algorithm</firstterm>
-    as an option for the database query optimization problem.
+    In the following we describe the implementation of a
+    <firstterm>genetic algorithm</firstterm> to solve the join
+    ordering problem in a manner that is efficient for queries
+    involving large numbers of joins.
     </para>
    </sect1>
  
    <sect1 id="geqo-intro2">
-   <title>Genetic Algorithms (<acronym>GA</acronym>)</title>
+   <title>Genetic Algorithms</title>
  
     <para>
-    The <acronym>GA</acronym> is a heuristic optimization method which
-    operates through
-    determined, randomized search. The set of possible solutions for the
+    The genetic algorithm (<acronym>GA</acronym>) is a heuristic optimization method which
+    operates through randomized search. The set of possible solutions for the
      optimization problem is considered as a
      <firstterm>population</firstterm> of <firstterm>individuals</firstterm>.
-    The degree of adaption of an individual to its environment is specified
+    The degree of adaptation of an individual to its environment is specified
      by its <firstterm>fitness</firstterm>.
     </para>
  
@@ -114,26 +88,40 @@ Genetic Optimizer
     </para>
  
     <para>
-    According to the "comp.ai.genetic" <acronym>FAQ</acronym> it cannot be stressed too
+    According to the <systemitem class="resource">comp.ai.genetic</> <acronym>FAQ</acronym> it cannot be stressed too
      strongly that a <acronym>GA</acronym> is not a pure random search for a solution to a
      problem. A <acronym>GA</acronym> uses stochastic processes, but the result is distinctly
-    non-random (better than random). 
-
-    <programlisting>
-Structured Diagram of a <acronym>GA</acronym>:
----------------------------
-
-P(t)    generation of ancestors at a time t
-P''(t)  generation of descendants at a time t
+    non-random (better than random).
+   </para>
  
+   <figure id="geqo-diagram">
+    <title>Structured Diagram of a Genetic Algorithm</title>
+
+    <informaltable frame="none">
+     <tgroup cols="2">
+      <tbody>
+       <row>
+        <entry>P(t)</entry>
+        <entry>generation of ancestors at a time t</entry>
+       </row>
+
+       <row>
+        <entry>P''(t)</entry>
+        <entry>generation of descendants at a time t</entry>
+       </row>
+      </tbody>
+     </tgroup>
+    </informaltable>
+
+<literallayout class="monospaced">
  +=========================================+
-|>>>>>>>>>>>  Algorithm GA  <<<<<<<<<<<<<<|
+|&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;  Algorithm GA  &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;|
  +=========================================+
  | INITIALIZE t := 0                       |
  +=========================================+
  | INITIALIZE P(t)                         |
  +=========================================+
-| evalute FITNESS of P(t)                 |
+| evaluate FITNESS of P(t)                |
  +=========================================+
  | while not STOPPING CRITERION do         |
  |   +-------------------------------------+
@@ -143,43 +131,39 @@ P''(t)  generation of descendants at a time t
  |   +-------------------------------------+
  |   | P(t+1) := SELECTION{P''(t) + P(t)}  |
  |   +-------------------------------------+
-|   | evalute FITNESS of P''(t)           |
+|   | evaluate FITNESS of P''(t)          |
  |   +-------------------------------------+
  |   | t := t + 1                          |
  +===+=====================================+
-    </programlisting>
-   </para>
+</literallayout>
+   </figure>
    </sect1>
  
    <sect1 id="geqo-pg-intro">
-   <title>Genetic Query Optimization (<acronym>GEQO</acronym>) in Postgres</title>
+   <title>Genetic Query Optimization (<acronym>GEQO</acronym>) in PostgreSQL</title>
  
     <para>
-    The <acronym>GEQO</acronym> module is intended for the solution of the query
-    optimization problem similar to a traveling salesman problem (<acronym>TSP</acronym>).
+    The <acronym>GEQO</acronym> module approaches the query
+    optimization problem as though it were the well-known traveling salesman
+    problem (<acronym>TSP</acronym>).
      Possible query plans are encoded as integer strings. Each string
-    represents the <command>join</command> order from one relation of the query to the next.
-    E. g., the query tree
-    <programlisting>
+    represents the join order from one relation of the query to the next.
+    For example, the join tree
+<literallayout class="monospaced">
     /\
    /\ 2
   /\ 3
  4  1
-    </programlisting>
+</literallayout>
      is encoded by the integer string '4-1-3-2',
      which means, first join relation '4' and '1', then '3', and
-    then '2', where 1, 2, 3, 4 are relids within the
-    <productname>Postgres</productname> optimizer.
-   </para>
-
-   <para>
-    Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's Genitor
-    algorithm.
+    then '2', where 1, 2, 3, 4 are relation IDs within the
+    <productname>PostgreSQL</productname> optimizer.
     </para>
  
     <para>
      Specific characteristics of the <acronym>GEQO</acronym>
-    implementation in <productname>Postgres</productname>
+    implementation in <productname>PostgreSQL</productname>
      are:
  
      <itemizedlist spacing="compact" mark="bullet">
@@ -194,10 +178,10 @@ P''(t)  generation of descendants at a time t
  
       <listitem>
        <para>
-       Usage of <firstterm>edge recombination crossover</firstterm> which is
-       especially suited
-       to keep edge losses low for the solution of the
-       <acronym>TSP</acronym> by means of a <acronym>GA</acronym>;
+       Usage of <firstterm>edge recombination crossover</firstterm>
+       which is especially suited to keep edge losses low for the
+       solution of the <acronym>TSP</acronym> by means of a
+       <acronym>GA</acronym>;
        </para>
       </listitem>
  
@@ -210,13 +194,58 @@ P''(t)  generation of descendants at a time t
      </itemizedlist>
     </para>
  
+   <para>
+    Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's
+    Genitor algorithm.
+   </para>
+
     <para>
      The <acronym>GEQO</acronym> module allows
-    the <productname>Postgres</productname> query optimizer to
-    support large <command>join</command> queries effectively through
+    the <productname>PostgreSQL</productname> query optimizer to
+    support large join queries effectively through
      non-exhaustive search.
     </para>
  
+  <sect2>
+   <title>Generating Possible Plans with <acronym>GEQO</acronym></title>
+
+   <para>
+    The <acronym>GEQO</acronym> planning process uses the standard planner
+    code to generate plans for scans of individual relations.  Then join
+    plans are developed using the genetic approach.  As shown above, each
+    candidate join plan is represented by a sequence in which to join
+    the base relations.  In the initial stage, the <acronym>GEQO</acronym>
+    code simply generates some possible join sequences at random.  For each
+    join sequence considered, the standard planner code is invoked to
+    estimate the cost of performing the query using that join sequence.
+    (For each step of the join sequence, all three possible join strategies
+    are considered; and all the initially-determined relation scan plans
+    are available.  The estimated cost is the cheapest of these
+    possibilities.)  Join sequences with lower estimated cost are considered
+    <quote>more fit</> than those with higher cost.  The genetic algorithm
+    discards the least fit candidates.  Then new candidates are generated
+    by combining genes of more-fit candidates &mdash; that is, by using
+    randomly-chosen portions of known low-cost join sequences to create
+    new sequences for consideration.  This process is repeated until a
+    preset number of join sequences have been considered; then the best
+    one found at any time during the search is used to generate the finished
+    plan.
+   </para>
+
+   <para>
+    This process is inherently nondeterministic, because of the randomized
+    choices made during both the initial population selection and subsequent
+    <quote>mutation</> of the best candidates.  To avoid surprising changes
+    of the selected plan, each run of the GEQO algorithm restarts its
+    random number generator with the current <xref linkend="guc-geqo-seed">
+    parameter setting.  As long as <varname>geqo_seed</> and the other
+    GEQO parameters are kept fixed, the same plan will be generated for a
+    given query (and other planner inputs such as statistics).  To experiment
+    with different search paths, try changing <varname>geqo_seed</>.
+   </para>
+
+  </sect2>
+
    <sect2 id="geqo-future">
     <title>Future Implementation Tasks for
      <productname>PostgreSQL</> <acronym>GEQO</acronym></title>
@@ -224,124 +253,84 @@ P''(t)  generation of descendants at a time t
       <para>
        Work is still needed to improve the genetic algorithm parameter
        settings.
-      In file <filename>backend/optimizer/geqo/geqo_params.c</filename>, routines
+      In file <filename>src/backend/optimizer/geqo/geqo_main.c</filename>,
+      routines
        <function>gimme_pool_size</function> and <function>gimme_number_generations</function>,
        we have to find a compromise for the parameter settings
        to satisfy two competing demands:
        <itemizedlist spacing="compact">
         <listitem>
-       <para>
-        Optimality of the query plan
-       </para>
+        <para>
+         Optimality of the query plan
+        </para>
         </listitem>
         <listitem>
-       <para>
-        Computing time
-       </para>
+        <para>
+         Computing time
+        </para>
         </listitem>
        </itemizedlist>
       </para>
  
-   </sect2>
+     <para>
+      In the current implementation, the fitness of each candidate join
+      sequence is estimated by running the standard planner's join selection
+      and cost estimation code from scratch.  To the extent that different
+      candidates use similar sub-sequences of joins, a great deal of work
+      will be repeated.  This could be made significantly faster by retaining
+      cost estimates for sub-joins.  The problem is to avoid expending
+      unreasonable amounts of memory on retaining that state.
+     </para>
  
-   <bibliography id="geqo-biblio">
-    <title>
-     References
-    </title>
-    <para>Reference information for <acronym>GEQ</acronym> algorithms.
-    </para>
-    <biblioentry>
-
-     <bookbiblio>
-      <title>
-       The Hitch-Hiker's Guide to Evolutionary Computation
-      </title>
-      <authorgroup>
-       <author>
-       <firstname>J&ouml;rg</firstname>
-       <surname>Heitk&ouml;tter</surname>
-       </author>
-       <author>
-       <firstname>David</firstname>
-       <surname>Beasley</surname>
-       </author>
-      </authorgroup>
-      <publisher>
-       <publishername>
-       InterNet resource
-       </publishername>
-      </publisher>
-      <abstract>
-       <para>
-       FAQ in <ulink url="news://comp.ai.genetic">comp.ai.genetic</ulink>
-       is available at <ulink
-        url="ftp://ftp.Germany.EU.net/pub/research/softcomp/EC/Welcome.html">Encore</ulink>.
-       </para>
-      </abstract>
-     </bookbiblio>
-
-     <bookbiblio>
-      <title>
-       The Design and Implementation of the Postgres Query Optimizer
-      </title>
-      <authorgroup>
-       <author>
-       <firstname>Z.</firstname>
-       <surname>Fong</surname>
-       </author>
-      </authorgroup>
-      <publisher>
-       <publishername>
-       University of California, Berkeley Computer Science Department
-       </publishername>
-      </publisher>
-      <abstract>
-       <para>
-       File <filename>planner/Report.ps</filename> in the 'postgres-papers' distribution.
-       </para>
-      </abstract>
-     </bookbiblio>
-
-     <bookbiblio>
-      <title>
-       Fundamentals of Database Systems
-      </title>
-      <authorgroup>
-       <author>
-       <firstname>R.</firstname>
-       <surname>Elmasri</surname>
-       </author>
-       <author>
-       <firstname>S.</firstname>
-       <surname>Navathe</surname>
-       </author>
-      </authorgroup>
-      <publisher>
-       <publishername>
-       The Benjamin/Cummings Pub., Inc.
-       </publishername>
-      </publisher>
-     </bookbiblio>
-
-    </biblioentry>
-   </bibliography>
+     <para>
+      At a more basic level, it is not clear that solving query optimization
+      with a GA algorithm designed for TSP is appropriate.  In the TSP case,
+      the cost associated with any substring (partial tour) is independent
+      of the rest of the tour, but this is certainly not true for query
+      optimization.  Thus it is questionable whether edge recombination
+      crossover is the most effective mutation procedure.
+     </para>
  
+   </sect2>
    </sect1>
- </chapter>
-
-<!-- Keep this comment at the end of the file
-Local variables:
-mode:sgml
-sgml-omittag:nil
-sgml-shorttag:t
-sgml-minimize-attributes:nil
-sgml-always-quote-attributes:t
-sgml-indent-step:1
-sgml-indent-data:t
-sgml-parent-document:nil
-sgml-default-dtd-file:"./reference.ced"
-sgml-exposed-tags:nil
-sgml-local-catalogs:("/usr/lib/sgml/catalog")
-sgml-local-ecat-files:nil
-End:
--->
+
+ <sect1 id="geqo-biblio">
+  <title>Further Reading</title>
+
+  <para>
+   The following resources contain additional information about
+   genetic algorithms:
+
+   <itemizedlist>
+    <listitem>
+     <para>
+      <ulink url="http://www.aip.de/~ast/EvolCompFAQ/">
+      The Hitch-Hiker's Guide to Evolutionary Computation</ulink>, (FAQ for <ulink
+      url="news://comp.ai.genetic"></ulink>)
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      <ulink url="http://www.red3d.com/cwr/evolve.html">
+      Evolutionary Computation and its application to art and design</ulink>, by
+      Craig Reynolds
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      <xref linkend="ELMA04">
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      <xref linkend="FONG">
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+ </sect1>
+</chapter>