granicus.if.org Git - postgresql/blob - doc/src/sgml/arch-dev.sgml

   1 <!--
   2 $Header: /cvsroot/pgsql/doc/src/sgml/arch-dev.sgml,v 2.21 2003/06/22 16:16:44 tgl Exp $
   3 -->
   4
   5  <chapter id="overview">
   6   <title>Overview of PostgreSQL Internals</title>
   7
   8   <note>
   9    <title>Author</title>
  10    <para>
  11     This chapter originated as part of
  12     <xref linkend="SIM98">, Stefan Simkovics'
  13     Master's Thesis prepared at Vienna University of Technology under the direction
  14     of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr.
  15    </para>
  16   </note>
  17
  18   <para>
  19    This chapter gives an overview of the internal structure of the
  20    backend of <productname>PostgreSQL</productname>.  After having
  21    read the following sections you should have an idea of how a query
  22    is processed. This chapter does not aim to provide a detailed
  23    description of the internal operation of
  24    <productname>PostgreSQL</productname>, as such a document would be
  25    very extensive. Rather, this chapter is intended to help the reader
  26    understand the general sequence of operations that occur within the
  27    backend from the point at which a query is received, to the point
  28    when the results are returned to the client.
  29   </para>
  30
  31   <sect1 id="query-path">
  32    <title>The Path of a Query</title>
  33
  34    <para>
  35     Here we give a short overview of the stages a query has to pass in
  36     order to obtain a result.
  37    </para>
  38
  39    <procedure>
  40     <step>
  41      <para>
  42       A connection from an application program to the <productname>PostgreSQL</productname>
  43       server has to be established. The application program transmits a
  44       query to the server and waits to receive the results sent back by the
  45       server.
  46      </para>
  47     </step>
  48
  49     <step>
  50      <para>
  51       The <firstterm>parser stage</firstterm> checks the query
  52       transmitted by the application
  53       program for correct syntax and creates
  54       a <firstterm>query tree</firstterm>.
  55      </para>
  56     </step>
  57
  58     <step>
  59      <para>
  60       The <firstterm>rewrite system</firstterm> takes
  61       the query tree created by the parser stage and looks for
  62       any <firstterm>rules</firstterm> (stored in the
  63       <firstterm>system catalogs</firstterm>) to apply to
  64       the query tree.  It performs the
  65       transformations given in the <firstterm>rule bodies</firstterm>.
  66       One application of the rewrite system is in the realization of
  67       <firstterm>views</firstterm>.
  68      </para>
  69
  70      <para>
  71       Whenever a query against a view
  72       (i.e. a <firstterm>virtual table</firstterm>) is made,
  73       the rewrite system rewrites the user's query to
  74       a query that accesses the <firstterm>base tables</firstterm> given in
  75       the <firstterm>view definition</firstterm> instead.
  76      </para>
  77     </step>
  78
  79     <step>
  80      <para>
  81       The <firstterm>planner/optimizer</firstterm> takes
  82       the (rewritten) querytree and creates a
  83       <firstterm>query plan</firstterm> that will be the input to the
  84       <firstterm>executor</firstterm>.
  85      </para>
  86
  87      <para>
  88       It does so by first creating all possible <firstterm>paths</firstterm>
  89       leading to the same result. For example if there is an index on a
  90       relation to be scanned, there are two paths for the
  91       scan. One possibility is a simple sequential scan and the other
  92       possibility is to use the index. Next the cost for the execution of
  93       each plan is estimated and the
  94       cheapest plan is chosen and handed back.
  95      </para>
  96     </step>
  97
  98     <step>
  99      <para>
 100       The executor recursively steps through
 101       the <firstterm>plan tree</firstterm> and
 102       retrieves tuples in the way represented by the plan.
 103       The executor makes use of the
 104       <firstterm>storage system</firstterm> while scanning
 105       relations, performs <firstterm>sorts</firstterm> and <firstterm>joins</firstterm>,
 106       evaluates <firstterm>qualifications</firstterm> and finally hands back the tuples derived.
 107      </para>
 108     </step>
 109    </procedure>
 110
 111    <para>
 112     In the following sections we will cover each of the above listed items
 113     in more detail to give a better understanding of <productname>PostgreSQL</productname>'s internal
 114     control and data structures.
 115    </para>
 116   </sect1>
 117
 118   <sect1 id="connect-estab">
 119    <title>How Connections are Established</title>
 120
 121    <para>
 122     <productname>PostgreSQL</productname> is implemented using a
 123     simple <quote>process per user</> client/server model.  In this model
 124     there is one <firstterm>client process</firstterm> connected to
 125     exactly one <firstterm>server process</firstterm>.  As we do not
 126     know ahead of time how many connections will be made, we have to
 127     use a <firstterm>master process</firstterm> that spawns a new
 128     server process every time a connection is requested. This master
 129     process is called <literal>postmaster</literal> and listens at a
 130     specified TCP/IP port for incoming connections. Whenever a request
 131     for a connection is detected the <literal>postmaster</literal>
 132     process spawns a new server process called
 133     <literal>postgres</literal>. The server tasks
 134     (<literal>postgres</literal> processes) communicate with each
 135     other using <firstterm>semaphores</firstterm> and
 136     <firstterm>shared memory</firstterm> to ensure data integrity
 137     throughout concurrent data access.
 138    </para>
 139
 140    <para>
 141     The client process can be any program that understands the
 142     <productname>PostgreSQL</productname> protocol described in
 143     <xref linkend="protocol">.  Many clients are based on the
 144     C-language library <application>libpq</>, but several independent
 145     implementations exist, such as the Java <application>JDBC</> driver.
 146    </para>
 147
 148    <para>
 149     Once a connection is established the client process can send a query
 150     to the <firstterm>backend</firstterm> (server). The query is transmitted using plain text,
 151     i.e. there is no parsing done in the <firstterm>frontend</firstterm> (client). The
 152     server parses the query, creates an <firstterm>execution plan</firstterm>,
 153     executes the plan and returns the retrieved tuples to the client
 154     by transmitting them over the established connection.
 155    </para>
 156   </sect1>
 157
 158   <sect1 id="parser-stage">
 159    <title>The Parser Stage</title>
 160
 161    <para>
 162     The <firstterm>parser stage</firstterm> consists of two parts:
 163
 164     <itemizedlist>
 165      <listitem>
 166       <para>
 167        The <firstterm>parser</firstterm> defined in
 168        <filename>gram.y</filename> and <filename>scan.l</filename> is
 169        built using the Unix tools <application>yacc</application>
 170        and <application>lex</application>.
 171       </para>
 172      </listitem>
 173      <listitem>
 174       <para>
 175        The <firstterm>transformation process</firstterm> does
 176        modifications and augmentations to the data structures returned by the parser.
 177       </para>
 178      </listitem>
 179     </itemizedlist>
 180    </para>
 181
 182    <sect2>
 183     <title>Parser</title>
 184
 185     <para>
 186      The parser has to check the query string (which arrives as
 187      plain ASCII text) for valid syntax. If the syntax is correct a
 188      <firstterm>parse tree</firstterm> is built up and handed back otherwise an error is
 189      returned. For the implementation the well known Unix
 190      tools <application>lex</application> and <application>yacc</application>
 191      are used.
 192     </para>
 193
 194     <para>
 195      The <firstterm>lexer</firstterm> is defined in the file
 196      <filename>scan.l</filename> and is responsible
 197      for recognizing <firstterm>identifiers</firstterm>,
 198      the <firstterm>SQL keywords</firstterm> etc. For
 199      every keyword or identifier that is found, a <firstterm>token</firstterm>
 200      is generated and handed to the parser.
 201     </para>
 202
 203     <para>
 204      The parser is defined in the file <filename>gram.y</filename> and consists of a
 205      set of <firstterm>grammar rules</firstterm> and <firstterm>actions</firstterm>
 206      that are executed
 207      whenever a rule is fired. The code of the actions (which
 208      is actually C-code) is used to build up the parse tree.
 209     </para>
 210
 211     <para>
 212      The file <filename>scan.l</filename> is transformed to
 213      the C-source file <filename>scan.c</filename>
 214      using the program <application>lex</application>
 215      and <filename>gram.y</filename> is transformed to
 216      <filename>gram.c</filename> using <application>yacc</application>.
 217      After these transformations have taken
 218      place a normal C-compiler can be used to create the
 219      parser. Never make any changes to the generated C-files as they will
 220      be overwritten the next time <application>lex</application>
 221      or <application>yacc</application> is called.
 222
 223      <note>
 224       <para>
 225        The mentioned transformations and compilations are normally done
 226        automatically using the <firstterm>makefiles</firstterm>
 227        shipped with the <productname>PostgreSQL</productname>
 228        source distribution.
 229       </para>
 230      </note>
 231     </para>
 232
 233     <para>
 234      A detailed description of <application>yacc</application> or
 235      the grammar rules given in <filename>gram.y</filename> would be
 236      beyond the scope of this paper. There are many books and
 237      documents dealing with <application>lex</application> and
 238      <application>yacc</application>. You should be familiar with
 239      <application>yacc</application> before you start to study the
 240      grammar given in <filename>gram.y</filename> otherwise you won't
 241      understand what happens there.
 242     </para>
 243
 244    </sect2>
 245
 246    <sect2>
 247      <title>Transformation Process</title>
 248
 249     <para>
 250      The parser stage creates a parse tree using only fixed rules about
 251      the syntactic structure of SQL.  It does not make any lookups in the
 252      system catalogs, so there is no possibility to understand the detailed
 253      semantics of the requested operations.  After the parser completes,
 254      the <firstterm>transformation process</firstterm> takes the tree handed
 255      back by the parser as input and does the semantic interpretation needed
 256      to understand which tables, functions, and operators are referenced by
 257      the query.  The data structure that is built to represent this
 258      information is called the <firstterm>query tree</>.
 259     </para>
 260
 261     <para>
 262      The reason for separating raw parsing from semantic analysis is that
 263      system catalog lookups can only be done within a transaction, and we
 264      do not wish to start a transaction immediately upon receiving a query
 265      string.  The raw parsing stage is sufficient to identify the transaction
 266      control commands (<command>BEGIN</>, <command>ROLLBACK</>, etc), and
 267      these can then be correctly executed without any further analysis.
 268      Once we know that we are dealing with an actual query (such as
 269      <command>SELECT</> or <command>UPDATE</>), it is okay to
 270      start a transaction if we're not already in one.  Only then can the
 271      transformation process be invoked.
 272     </para>
 273
 274     <para>
 275      The query tree created by the transformation process is structurally
 276      similar to the raw parse tree in most places, but it has many differences
 277      in detail.  For example, a <structname>FuncCall</> node in the
 278      parse tree represents something that looks syntactically like a function
 279      call.  This may be transformed to either a <structname>FuncExpr</>
 280      or <structname>Aggref</> node depending on whether the referenced
 281      name turns out to be an ordinary function or an aggregate function.
 282      Also, information about the actual datatypes of columns and expression
 283      results is added to the query tree.
 284     </para>
 285    </sect2>
 286   </sect1>
 287
 288   <sect1 id="rule-system">
 289    <title>The <productname>PostgreSQL</productname> Rule System</title>
 290
 291    <para>
 292     <productname>PostgreSQL</productname> supports a powerful
 293     <firstterm>rule system</firstterm> for the specification
 294     of <firstterm>views</firstterm> and ambiguous <firstterm>view updates</firstterm>.
 295     Originally the <productname>PostgreSQL</productname>
 296     rule system consisted of two implementations:
 297
 298     <itemizedlist>
 299      <listitem>
 300       <para>
 301        The first one worked using <firstterm>tuple level</firstterm> processing and was
 302        implemented deep in the <firstterm>executor</firstterm>. The rule system was
 303        called whenever an individual tuple had been accessed. This
 304        implementation was removed in 1995 when the last official release
 305        of the <productname>Berkeley Postgres</productname> project was
 306        transformed into <productname>Postgres95</productname>.
 307       </para>
 308      </listitem>
 309
 310      <listitem>
 311       <para>
 312        The second implementation of the rule system is a technique
 313        called <firstterm>query rewriting</firstterm>.
 314        The <firstterm>rewrite system</firstterm> is a module
 315        that exists between the <firstterm>parser stage</firstterm> and the
 316        <firstterm>planner/optimizer</firstterm>. This technique is still implemented.
 317       </para>
 318      </listitem>
 319     </itemizedlist>
 320    </para>
 321
 322    <para>
 323     The query rewriter is discussed in some detail in
 324     <xref linkend="rules">, so there is no need to cover it here.
 325     We will only point out that both the input and the output of the
 326     rewriter are query trees, that is, there is no change in the
 327     representation or level of semantic detail in the trees.  Rewriting
 328     can be thought of as a form of macro expansion.
 329    </para>
 330
 331   </sect1>
 332
 333   <sect1 id="planner-optimizer">
 334    <title>Planner/Optimizer</title>
 335
 336    <para>
 337     The task of the <firstterm>planner/optimizer</firstterm> is to create an optimal
 338     execution plan. It first considers all possible ways of
 339     <firstterm>scanning</firstterm> and <firstterm>joining</firstterm>
 340     the relations that appear in a
 341     query. All the created paths lead to the same result and it's the
 342     task of the optimizer to estimate the cost of executing each path and
 343     find out which one is the cheapest.
 344    </para>
 345
 346    <para>
 347     After the cheapest path is determined, a <firstterm>plan tree</>
 348     is built to pass to the executor.  This represents the desired
 349     execution plan in sufficient detail for the executor to run it.
 350    </para>
 351
 352    <sect2>
 353     <title>Generating Possible Plans</title>
 354
 355     <para>
 356      The planner/optimizer decides which plans should be generated
 357      based upon the types of indexes defined on the relations appearing in
 358      a query. There is always the possibility of performing a
 359      sequential scan on a relation, so a plan using only
 360      sequential scans is always created. Assume an index is defined on a
 361      relation (for example a B-tree index) and a query contains the
 362      restriction
 363      <literal>relation.attribute OPR constant</literal>. If
 364      <literal>relation.attribute</literal> happens to match the key of the B-tree
 365      index and <literal>OPR</literal> is one of the operators listed in
 366      the index's <firstterm>operator class</>, another plan is created using
 367      the B-tree index to scan the relation. If there are further indexes
 368      present and the restrictions in the query happen to match a key of an
 369      index further plans will be considered.
 370     </para>
 371
 372     <para>
 373      After all feasible plans have been found for scanning single relations,
 374      plans for joining relations are created. The planner/optimizer
 375      preferentially considers joins between any two relations for which there
 376      exist a corresponding join clause in the WHERE qualification (i.e. for
 377      which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
 378      exists). Join pairs with no join clause are considered only when there
 379      is no other choice, that is, a particular relation has no available
 380      join clauses to any other relation. All possible plans are generated for
 381      every join pair considered
 382      by the planner/optimizer. The three possible join strategies are:
 383
 384      <itemizedlist>
 385       <listitem>
 386        <para>
 387         <firstterm>nested loop join</firstterm>: The right relation is scanned
 388         once for every tuple found in the left relation. This strategy
 389         is easy to implement but can be very time consuming.  (However,
 390         if the right relation can be scanned with an indexscan, this can
 391         be a good strategy.  It is possible to use values from the current
 392         row of the left relation as keys for the indexscan of the right.)
 393        </para>
 394       </listitem>
 395
 396       <listitem>
 397        <para>
 398         <firstterm>merge sort join</firstterm>: Each relation is sorted on the join
 399         attributes before the join starts. Then the two relations are
 400         merged together taking into account that both relations are
 401         ordered on the join attributes. This kind of join is more
 402         attractive because each relation has to be scanned only once.
 403        </para>
 404       </listitem>
 405
 406       <listitem>
 407        <para>
 408         <firstterm>hash join</firstterm>: the right relation is first scanned
 409         and loaded into a hash table, using its join attributes as hash keys.
 410         Next the left relation is scanned and the
 411         appropriate values of every tuple found are used as hash keys to
 412         locate the matching tuples in the table.
 413        </para>
 414       </listitem>
 415      </itemizedlist>
 416     </para>
 417
 418     <para>
 419      The finished plan tree consists of sequential or index scans of the
 420      base relations, plus nestloop, merge, or hash join nodes as needed,
 421      plus any auxiliary steps needed, such as sort nodes or aggregate-function
 422      calculation nodes.  Most of these plan node types have the additional
 423      ability to do <firstterm>selection</> (discarding rows that do
 424      not meet a specified boolean condition) and <firstterm>projection</>
 425      (computation of a derived column set based on given column values,
 426      that is, evaluation of scalar expressions where needed).  One of
 427      the responsibilities of the planner is to attach selection conditions
 428      from the WHERE clause and computation of required output expressions
 429      to the most appropriate nodes of the plan tree.
 430     </para>
 431    </sect2>
 432   </sect1>
 433
 434   <sect1 id="executor">
 435    <title>Executor</title>
 436
 437    <para>
 438     The <firstterm>executor</firstterm> takes the plan handed back by the
 439     planner/optimizer and recursively processes it to extract the required set
 440     of rows.  This is essentially a demand-pull pipeline mechanism.
 441     Each time a plan node is called, it must deliver one more tuple, or
 442     report that it is done delivering tuples.
 443    </para>
 444
 445    <para>
 446     To provide a concrete example, assume that the top
 447     node is a <literal>MergeJoin</literal> node.
 448     Before any merge can be done two tuples have to be fetched (one from
 449     each subplan). So the executor recursively calls itself to
 450     process the subplans (it starts with the subplan attached to
 451     <literal>lefttree</literal>). The new top node (the top node of the left
 452     subplan) is, let's say, a
 453     <literal>Sort</literal> node and again recursion is needed to obtain
 454     an input tuple.  The child node of the <literal>Sort</literal> might
 455     be a <literal>SeqScan</> node, representing actual reading of a table.
 456     Execution of this node causes the executor to fetch a row from the
 457     table and return it up to the calling node.  The <literal>Sort</literal>
 458     node will repeatedly call its child to obtain all the rows to be sorted.
 459     When the input is exhausted (as indicated by the child node returning
 460     a NULL instead of a tuple), the <literal>Sort</literal> code performs
 461     the sort, and finally is able to return its first output row, namely
 462     the first one in sorted order.  It keeps the remaining rows stored so
 463     that it can deliver them in sorted order in response to later demands.
 464    </para>
 465
 466    <para>
 467     The <literal>MergeJoin</literal> node similarly demands the first row
 468     from its right subplan.  Then it compares the two rows to see if they
 469     can be joined; if so, it returns a join row to its caller.  On the next
 470     call, or immediately if it cannot join the current pair of inputs,
 471     it advances to the next row of one table
 472     or the other (depending on how the comparison came out), and again
 473     checks for a match.  Eventually, one subplan or the other is exhausted,
 474     and the <literal>MergeJoin</literal> node returns NULL to indicate that
 475     no more join rows can be formed.
 476    </para>
 477
 478    <para>
 479     Complex queries may involve many levels of plan nodes, but the general
 480     approach is the same: each node computes and returns its next output
 481     row each time it is called.  Each node is also responsible for applying
 482     any selection or projection expressions that were assigned to it by
 483     the planner.
 484    </para>
 485
 486    <para>
 487     The executor mechanism is used to evaluate all four basic SQL query types:
 488     <command>SELECT</>, <command>INSERT</>, <command>UPDATE</>, and
 489     <command>DELETE</>.  For <command>SELECT</>, the top-level executor
 490     code only needs to send each row returned by the query plan tree off
 491     to the client.  For <command>INSERT</>, each returned row is inserted
 492     into the target table specified for the <command>INSERT</>.  (A simple
 493     <command>INSERT ... VALUES</> command creates a trivial plan tree
 494     consisting of a single <literal>Result</> node, which computes just one
 495     result row.  But <command>INSERT ... SELECT</> may demand the full power
 496     of the executor mechanism.)  For <command>UPDATE</>, the planner arranges
 497     that each computed row includes all the updated column values, plus
 498     the <firstterm>TID</> (tuple ID, or location) of the original target row;
 499     the executor top level uses this information to create a new updated row
 500     and mark the old row deleted.  For <command>DELETE</>, the only column
 501     that is actually returned by the plan is the TID, and the executor top
 502     level simply uses the TID to visit the target rows and mark them deleted.
 503    </para>
 504
 505   </sect1>
 506
 507  </chapter>
 508
 509 <!-- Keep this comment at the end of the file
 510 Local variables:
 511 mode:sgml
 512 sgml-omittag:nil
 513 sgml-shorttag:t
 514 sgml-minimize-attributes:nil
 515 sgml-always-quote-attributes:t
 516 sgml-indent-step:1
 517 sgml-indent-data:t
 518 sgml-parent-document:nil
 519 sgml-default-dtd-file:"./reference.ced"
 520 sgml-exposed-tags:nil
 521 sgml-local-catalogs:("/usr/lib/sgml/catalog")
 522 sgml-local-ecat-files:nil
 523 End:
 524 -->