-<!--
-$Header: /cvsroot/pgsql/doc/src/sgml/geqo.sgml,v 1.14 2000/12/16 22:44:47 tgl Exp $
-Genetic Optimizer
--->
+<!-- doc/src/sgml/geqo.sgml -->
<chapter id="geqo">
- <docinfo>
- <author>
- <firstname>Martin</firstname>
- <surname>Utesch</surname>
- <affiliation>
- <orgname>
- University of Mining and Technology
- </orgname>
- <orgdiv>
- Institute of Automatic Control
- </orgdiv>
- <address>
- <city>
- Freiberg
- </city>
- <country>
- Germany
- </country>
- </address>
- </affiliation>
- </author>
- <date>1997-10-02</date>
- </docinfo>
-
- <title>Genetic Query Optimization in Database Systems</title>
+ <title>Genetic Query Optimizer</title>
<para>
<note>
<title>Author</title>
<para>
- Written by <ulink url="mailto:utesch@aut.tu-freiberg.de">Martin Utesch</ulink>
+ Written by Martin Utesch (<email>utesch@aut.tu-freiberg.de</email>)
for the Institute of Automatic Control at the University of Mining and Technology in Freiberg, Germany.
</para>
</note>
<title>Query Handling as a Complex Optimization Problem</title>
<para>
- Among all relational operators the most difficult one to process and
- optimize is the <firstterm>join</firstterm>. The number of alternative plans to answer a query
- grows exponentially with the number of <command>join</command>s included in it. Further
- optimization effort is caused by the support of a variety of
- <firstterm>join methods</firstterm>
- (e.g., nested loop, hash join, merge join in <productname>Postgres</productname>) to
- process individual <command>join</command>s and a diversity of
- <firstterm>indices</firstterm> (e.g., r-tree,
- b-tree, hash in <productname>Postgres</productname>) as access paths for relations.
+ Among all relational operators the most difficult one to process
+ and optimize is the <firstterm>join</firstterm>. The number of
+ possible query plans grows exponentially with the
+ number of joins in the query. Further optimization effort is
+ caused by the support of a variety of <firstterm>join
+ methods</firstterm> (e.g., nested loop, hash join, merge join in
+ <productname>PostgreSQL</productname>) to process individual joins
+ and a diversity of <firstterm>indexes</firstterm> (e.g.,
+ B-tree, hash, GiST and GIN in <productname>PostgreSQL</productname>) as
+ access paths for relations.
</para>
<para>
- The current <productname>Postgres</productname> optimizer
- implementation performs a <firstterm>near-exhaustive search</firstterm>
- over the space of alternative strategies. This query
- optimization technique is inadequate to support database application
- domains that involve the need for extensive queries, such as artificial
- intelligence.
+ The normal <productname>PostgreSQL</productname> query optimizer
+ performs a <firstterm>near-exhaustive search</firstterm> over the
+ space of alternative strategies. This algorithm, first introduced
+ in IBM's System R database, produces a near-optimal join order,
+ but can take an enormous amount of time and memory space when the
+ number of joins in the query grows large. This makes the ordinary
+ <productname>PostgreSQL</productname> query optimizer
+ inappropriate for queries that join a large number of tables.
</para>
<para>
The Institute of Automatic Control at the University of Mining and
- Technology, in Freiberg, Germany, encountered the described problems as its
- folks wanted to take the <productname>Postgres</productname> DBMS as the backend for a decision
- support knowledge based system for the maintenance of an electrical
- power grid. The DBMS needed to handle large <command>join</command> queries for the
- inference machine of the knowledge based system.
+ Technology, in Freiberg, Germany, encountered some problems when
+ it wanted to use <productname>PostgreSQL</productname> as the
+ backend for a decision support knowledge based system for the
+ maintenance of an electrical power grid. The DBMS needed to handle
+ large join queries for the inference machine of the knowledge
+ based system. The number of joins in these queries made using the
+ normal query optimizer infeasible.
</para>
<para>
- Performance difficulties in exploring the space of possible query
- plans created the demand for a new optimization technique being developed.
- </para>
-
- <para>
- In the following we propose the implementation of a <firstterm>Genetic Algorithm</firstterm>
- as an option for the database query optimization problem.
+ In the following we describe the implementation of a
+ <firstterm>genetic algorithm</firstterm> to solve the join
+ ordering problem in a manner that is efficient for queries
+ involving large numbers of joins.
</para>
</sect1>
<sect1 id="geqo-intro2">
- <title>Genetic Algorithms (<acronym>GA</acronym>)</title>
+ <title>Genetic Algorithms</title>
<para>
- The <acronym>GA</acronym> is a heuristic optimization method which
- operates through
- determined, randomized search. The set of possible solutions for the
+ The genetic algorithm (<acronym>GA</acronym>) is a heuristic optimization method which
+ operates through randomized search. The set of possible solutions for the
optimization problem is considered as a
<firstterm>population</firstterm> of <firstterm>individuals</firstterm>.
- The degree of adaption of an individual to its environment is specified
+ The degree of adaptation of an individual to its environment is specified
by its <firstterm>fitness</firstterm>.
</para>
</para>
<para>
- According to the "comp.ai.genetic" <acronym>FAQ</acronym> it cannot be stressed too
+ According to the <systemitem class="resource">comp.ai.genetic</> <acronym>FAQ</acronym> it cannot be stressed too
strongly that a <acronym>GA</acronym> is not a pure random search for a solution to a
problem. A <acronym>GA</acronym> uses stochastic processes, but the result is distinctly
- non-random (better than random).
-
- <programlisting>
-Structured Diagram of a <acronym>GA</acronym>:
----------------------------
-
-P(t) generation of ancestors at a time t
-P''(t) generation of descendants at a time t
+ non-random (better than random).
+ </para>
+ <figure id="geqo-diagram">
+ <title>Structured Diagram of a Genetic Algorithm</title>
+
+ <informaltable frame="none">
+ <tgroup cols="2">
+ <tbody>
+ <row>
+ <entry>P(t)</entry>
+ <entry>generation of ancestors at a time t</entry>
+ </row>
+
+ <row>
+ <entry>P''(t)</entry>
+ <entry>generation of descendants at a time t</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>
+
+<literallayout class="monospaced">
+=========================================+
-|>>>>>>>>>>> Algorithm GA <<<<<<<<<<<<<<|
+|>>>>>>>>>>> Algorithm GA <<<<<<<<<<<<<<|
+=========================================+
| INITIALIZE t := 0 |
+=========================================+
| INITIALIZE P(t) |
+=========================================+
-| evalute FITNESS of P(t) |
+| evaluate FITNESS of P(t) |
+=========================================+
| while not STOPPING CRITERION do |
| +-------------------------------------+
| +-------------------------------------+
| | P(t+1) := SELECTION{P''(t) + P(t)} |
| +-------------------------------------+
-| | evalute FITNESS of P''(t) |
+| | evaluate FITNESS of P''(t) |
| +-------------------------------------+
| | t := t + 1 |
+===+=====================================+
- </programlisting>
- </para>
+</literallayout>
+ </figure>
</sect1>
<sect1 id="geqo-pg-intro">
- <title>Genetic Query Optimization (<acronym>GEQO</acronym>) in Postgres</title>
+ <title>Genetic Query Optimization (<acronym>GEQO</acronym>) in PostgreSQL</title>
<para>
- The <acronym>GEQO</acronym> module is intended for the solution of the query
- optimization problem similar to a traveling salesman problem (<acronym>TSP</acronym>).
+ The <acronym>GEQO</acronym> module approaches the query
+ optimization problem as though it were the well-known traveling salesman
+ problem (<acronym>TSP</acronym>).
Possible query plans are encoded as integer strings. Each string
- represents the <command>join</command> order from one relation of the query to the next.
- E. g., the query tree
- <programlisting>
+ represents the join order from one relation of the query to the next.
+ For example, the join tree
+<literallayout class="monospaced">
/\
/\ 2
/\ 3
4 1
- </programlisting>
+</literallayout>
is encoded by the integer string '4-1-3-2',
which means, first join relation '4' and '1', then '3', and
- then '2', where 1, 2, 3, 4 are relids within the
- <productname>Postgres</productname> optimizer.
- </para>
-
- <para>
- Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's Genitor
- algorithm.
+ then '2', where 1, 2, 3, 4 are relation IDs within the
+ <productname>PostgreSQL</productname> optimizer.
</para>
<para>
Specific characteristics of the <acronym>GEQO</acronym>
- implementation in <productname>Postgres</productname>
+ implementation in <productname>PostgreSQL</productname>
are:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
- Usage of <firstterm>edge recombination crossover</firstterm> which is
- especially suited
- to keep edge losses low for the solution of the
- <acronym>TSP</acronym> by means of a <acronym>GA</acronym>;
+ Usage of <firstterm>edge recombination crossover</firstterm>
+ which is especially suited to keep edge losses low for the
+ solution of the <acronym>TSP</acronym> by means of a
+ <acronym>GA</acronym>;
</para>
</listitem>
</itemizedlist>
</para>
+ <para>
+ Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's
+ Genitor algorithm.
+ </para>
+
<para>
The <acronym>GEQO</acronym> module allows
- the <productname>Postgres</productname> query optimizer to
- support large <command>join</command> queries effectively through
+ the <productname>PostgreSQL</productname> query optimizer to
+ support large join queries effectively through
non-exhaustive search.
</para>
+ <sect2>
+ <title>Generating Possible Plans with <acronym>GEQO</acronym></title>
+
+ <para>
+ The <acronym>GEQO</acronym> planning process uses the standard planner
+ code to generate plans for scans of individual relations. Then join
+ plans are developed using the genetic approach. As shown above, each
+ candidate join plan is represented by a sequence in which to join
+ the base relations. In the initial stage, the <acronym>GEQO</acronym>
+ code simply generates some possible join sequences at random. For each
+ join sequence considered, the standard planner code is invoked to
+ estimate the cost of performing the query using that join sequence.
+ (For each step of the join sequence, all three possible join strategies
+ are considered; and all the initially-determined relation scan plans
+ are available. The estimated cost is the cheapest of these
+ possibilities.) Join sequences with lower estimated cost are considered
+ <quote>more fit</> than those with higher cost. The genetic algorithm
+ discards the least fit candidates. Then new candidates are generated
+ by combining genes of more-fit candidates — that is, by using
+ randomly-chosen portions of known low-cost join sequences to create
+ new sequences for consideration. This process is repeated until a
+ preset number of join sequences have been considered; then the best
+ one found at any time during the search is used to generate the finished
+ plan.
+ </para>
+
+ <para>
+ This process is inherently nondeterministic, because of the randomized
+ choices made during both the initial population selection and subsequent
+ <quote>mutation</> of the best candidates. To avoid surprising changes
+ of the selected plan, each run of the GEQO algorithm restarts its
+ random number generator with the current <xref linkend="guc-geqo-seed">
+ parameter setting. As long as <varname>geqo_seed</> and the other
+ GEQO parameters are kept fixed, the same plan will be generated for a
+ given query (and other planner inputs such as statistics). To experiment
+ with different search paths, try changing <varname>geqo_seed</>.
+ </para>
+
+ </sect2>
+
<sect2 id="geqo-future">
<title>Future Implementation Tasks for
<productname>PostgreSQL</> <acronym>GEQO</acronym></title>
<para>
Work is still needed to improve the genetic algorithm parameter
settings.
- In file <filename>backend/optimizer/geqo/geqo_params.c</filename>, routines
+ In file <filename>src/backend/optimizer/geqo/geqo_main.c</filename>,
+ routines
<function>gimme_pool_size</function> and <function>gimme_number_generations</function>,
we have to find a compromise for the parameter settings
to satisfy two competing demands:
<itemizedlist spacing="compact">
<listitem>
- <para>
- Optimality of the query plan
- </para>
+ <para>
+ Optimality of the query plan
+ </para>
</listitem>
<listitem>
- <para>
- Computing time
- </para>
+ <para>
+ Computing time
+ </para>
</listitem>
</itemizedlist>
</para>
- </sect2>
+ <para>
+ In the current implementation, the fitness of each candidate join
+ sequence is estimated by running the standard planner's join selection
+ and cost estimation code from scratch. To the extent that different
+ candidates use similar sub-sequences of joins, a great deal of work
+ will be repeated. This could be made significantly faster by retaining
+ cost estimates for sub-joins. The problem is to avoid expending
+ unreasonable amounts of memory on retaining that state.
+ </para>
- <bibliography id="geqo-biblio">
- <title>
- References
- </title>
- <para>Reference information for <acronym>GEQ</acronym> algorithms.
- </para>
- <biblioentry>
-
- <bookbiblio>
- <title>
- The Hitch-Hiker's Guide to Evolutionary Computation
- </title>
- <authorgroup>
- <author>
- <firstname>Jörg</firstname>
- <surname>Heitkötter</surname>
- </author>
- <author>
- <firstname>David</firstname>
- <surname>Beasley</surname>
- </author>
- </authorgroup>
- <publisher>
- <publishername>
- InterNet resource
- </publishername>
- </publisher>
- <abstract>
- <para>
- FAQ in <ulink url="news://comp.ai.genetic">comp.ai.genetic</ulink>
- is available at <ulink
- url="ftp://ftp.Germany.EU.net/pub/research/softcomp/EC/Welcome.html">Encore</ulink>.
- </para>
- </abstract>
- </bookbiblio>
-
- <bookbiblio>
- <title>
- The Design and Implementation of the Postgres Query Optimizer
- </title>
- <authorgroup>
- <author>
- <firstname>Z.</firstname>
- <surname>Fong</surname>
- </author>
- </authorgroup>
- <publisher>
- <publishername>
- University of California, Berkeley Computer Science Department
- </publishername>
- </publisher>
- <abstract>
- <para>
- File <filename>planner/Report.ps</filename> in the 'postgres-papers' distribution.
- </para>
- </abstract>
- </bookbiblio>
-
- <bookbiblio>
- <title>
- Fundamentals of Database Systems
- </title>
- <authorgroup>
- <author>
- <firstname>R.</firstname>
- <surname>Elmasri</surname>
- </author>
- <author>
- <firstname>S.</firstname>
- <surname>Navathe</surname>
- </author>
- </authorgroup>
- <publisher>
- <publishername>
- The Benjamin/Cummings Pub., Inc.
- </publishername>
- </publisher>
- </bookbiblio>
-
- </biblioentry>
- </bibliography>
+ <para>
+ At a more basic level, it is not clear that solving query optimization
+ with a GA algorithm designed for TSP is appropriate. In the TSP case,
+ the cost associated with any substring (partial tour) is independent
+ of the rest of the tour, but this is certainly not true for query
+ optimization. Thus it is questionable whether edge recombination
+ crossover is the most effective mutation procedure.
+ </para>
+ </sect2>
</sect1>
- </chapter>
-
-<!-- Keep this comment at the end of the file
-Local variables:
-mode:sgml
-sgml-omittag:nil
-sgml-shorttag:t
-sgml-minimize-attributes:nil
-sgml-always-quote-attributes:t
-sgml-indent-step:1
-sgml-indent-data:t
-sgml-parent-document:nil
-sgml-default-dtd-file:"./reference.ced"
-sgml-exposed-tags:nil
-sgml-local-catalogs:("/usr/lib/sgml/catalog")
-sgml-local-ecat-files:nil
-End:
--->
+
+ <sect1 id="geqo-biblio">
+ <title>Further Reading</title>
+
+ <para>
+ The following resources contain additional information about
+ genetic algorithms:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ <ulink url="http://www.aip.de/~ast/EvolCompFAQ/">
+ The Hitch-Hiker's Guide to Evolutionary Computation</ulink>, (FAQ for <ulink
+ url="news://comp.ai.genetic"></ulink>)
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <ulink url="http://www.red3d.com/cwr/evolve.html">
+ Evolutionary Computation and its application to art and design</ulink>, by
+ Craig Reynolds
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <xref linkend="ELMA04">
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <xref linkend="FONG">
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
+ </sect1>
+</chapter>