<!-- $Header: /cvsroot/pgsql/doc/src/sgml/arch-dev.sgml,v 2.21 2003/06/22 16:16:44 tgl Exp $ -->
<chapter id="overview">
<title>Overview of PostgreSQL Internals</title>
This chapter originated as part of
<xref linkend="SIM98">, Stefan Simkovics'
Master's Thesis prepared at Vienna University of Technology under the direction
of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr.
This chapter gives an overview of the internal structure of the
backend of <productname>PostgreSQL</productname>. After having
read the following sections you should have an idea of how a query
is processed. This chapter does not aim to provide a detailed
description of the internal operation of
<productname>PostgreSQL</productname>, as such a document would be
very extensive. Rather, this chapter is intended to help the reader
understand the general sequence of operations that occur within the
backend from the point at which a query is received to the point
at which the results are returned to the client.
<sect1 id="query-path">
<title>The Path of a Query</title>

Here we give a short overview of the stages a query has to pass
through in order to obtain a result.

A connection from an application program to the <productname>PostgreSQL</productname>
server has to be established. The application program transmits a
query to the server and waits to receive the results sent back by the
server.
The <firstterm>parser stage</firstterm> checks the query
transmitted by the application
program for correct syntax and creates
a <firstterm>query tree</firstterm>.
The <firstterm>rewrite system</firstterm> takes
the query tree created by the parser stage and looks for
any <firstterm>rules</firstterm> (stored in the
<firstterm>system catalogs</firstterm>) to apply to
the query tree. It performs the
transformations given in the <firstterm>rule bodies</firstterm>.
One application of the rewrite system is in the realization of
<firstterm>views</firstterm>.

Whenever a query against a view
(i.e., a <firstterm>virtual table</firstterm>) is made,
the rewrite system rewrites the user's query to
a query that accesses the <firstterm>base tables</firstterm> given in
the <firstterm>view definition</firstterm> instead.
The <firstterm>planner/optimizer</firstterm> takes
the (rewritten) query tree and creates a
<firstterm>query plan</firstterm> that will be the input to the
<firstterm>executor</firstterm>.

It does so by first creating all possible <firstterm>paths</firstterm>
leading to the same result. For example, if there is an index on a
relation to be scanned, there are two paths for the
scan. One possibility is a simple sequential scan and the other
possibility is to use the index. Next, the cost of executing
each path is estimated and the
cheapest path is chosen and handed back.
The executor recursively steps through
the <firstterm>plan tree</firstterm> and
retrieves tuples in the way represented by the plan.
The executor makes use of the
<firstterm>storage system</firstterm> while scanning
relations, performs <firstterm>sorts</firstterm> and <firstterm>joins</firstterm>,
evaluates <firstterm>qualifications</firstterm>, and finally hands back the tuples derived.

In the following sections we will cover each of the items listed above
in more detail to give a better understanding of <productname>PostgreSQL</productname>'s internal
control and data structures.
<sect1 id="connect-estab">
<title>How Connections are Established</title>

<productname>PostgreSQL</productname> is implemented using a
simple <quote>process per user</> client/server model. In this model
there is one <firstterm>client process</firstterm> connected to
exactly one <firstterm>server process</firstterm>. As we do not
know ahead of time how many connections will be made, we have to
use a <firstterm>master process</firstterm> that spawns a new
server process every time a connection is requested. This master
process is called <literal>postmaster</literal> and listens at a
specified TCP/IP port for incoming connections. Whenever a request
for a connection is detected, the <literal>postmaster</literal>
process spawns a new server process called
<literal>postgres</literal>. The server tasks
(<literal>postgres</literal> processes) communicate with each
other using <firstterm>semaphores</firstterm> and
<firstterm>shared memory</firstterm> to ensure data integrity
throughout concurrent data access.
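As an illustration only (this is not PostgreSQL code), the process-per-connection model can be sketched as follows. The <literal>Postmaster</literal> and <literal>Backend</literal> classes and all of their behavior are invented for this sketch; the real postmaster forks operating-system processes rather than creating Python objects:

```python
# Toy model of the process-per-connection architecture: a "postmaster"
# accepts each incoming connection request and hands it to a freshly
# created "backend", so every client talks to exactly one server.

class Backend:
    """Stands in for one forked postgres server process."""
    def __init__(self, client_id):
        self.client_id = client_id

    def handle(self, query):
        # The real backend would parse, plan, and execute the query.
        return f"backend for client {self.client_id} ran: {query}"

class Postmaster:
    """Stands in for the master process listening on a TCP/IP port."""
    def __init__(self):
        self.backends = []

    def accept(self, client_id):
        # One new backend per connection request.
        backend = Backend(client_id)
        self.backends.append(backend)
        return backend

pm = Postmaster()
b1 = pm.accept(1)
b2 = pm.accept(2)
print(b1.handle("SELECT 1"))
print(len(pm.backends))  # one backend per accepted connection
```

A real implementation of this pattern in Python would use something like <literal>socketserver.ForkingTCPServer</literal>, which forks a child process per accepted connection, just as the postmaster does.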
The client process can be any program that understands the
<productname>PostgreSQL</productname> protocol described in
<xref linkend="protocol">. Many clients are based on the
C-language library <application>libpq</>, but several independent
implementations exist, such as the Java <application>JDBC</> driver.

Once a connection is established the client process can send a query
to the <firstterm>backend</firstterm> (server). The query is transmitted using plain text,
i.e., there is no parsing done in the <firstterm>frontend</firstterm> (client). The
server parses the query, creates an <firstterm>execution plan</firstterm>,
executes the plan, and returns the retrieved tuples to the client
by transmitting them over the established connection.
<sect1 id="parser-stage">
<title>The Parser Stage</title>

The <firstterm>parser stage</firstterm> consists of two parts:

The <firstterm>parser</firstterm> defined in
<filename>gram.y</filename> and <filename>scan.l</filename> is
built using the Unix tools <application>yacc</application>
and <application>lex</application>.

The <firstterm>transformation process</firstterm> modifies
and augments the data structures returned by the parser.
<title>Parser</title>

The parser has to check the query string (which arrives as
plain ASCII text) for valid syntax. If the syntax is correct a
<firstterm>parse tree</firstterm> is built up and handed back;
otherwise an error is returned. The well-known Unix
tools <application>lex</application> and <application>yacc</application>
are used for the implementation.

The <firstterm>lexer</firstterm> is defined in the file
<filename>scan.l</filename> and is responsible
for recognizing <firstterm>identifiers</firstterm>,
the <firstterm>SQL keywords</firstterm>, etc. For
every keyword or identifier that is found, a <firstterm>token</firstterm>
is generated and handed to the parser.
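The idea of a lexer handing tokens to a parser can be sketched as follows. This is a toy tokenizer written for illustration, not the real <filename>scan.l</filename>; the keyword set and token names are invented:

```python
import re

# Toy lexer in the spirit of scan.l: split a query string into
# (kind, text) tokens that a parser could consume one at a time.

KEYWORDS = {"select", "from", "where"}  # tiny illustrative subset
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_][A-Za-z_0-9]*)|(=|<|>|\*|,))")

def tokenize(query):
    tokens, pos = [], 0
    while pos < len(query):
        m = TOKEN_RE.match(query, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        num, word, op = m.groups()
        if num:
            tokens.append(("CONST", num))
        elif word:
            # Keywords and identifiers look alike until checked.
            kind = "KEYWORD" if word.lower() in KEYWORDS else "IDENT"
            tokens.append((kind, word.lower()))
        else:
            tokens.append(("OP", op))
        pos = m.end()
    return tokens

print(tokenize("SELECT id FROM t WHERE id = 42"))
```

The real lexer additionally handles quoted strings, comments, operators of several characters, and the full SQL keyword list, but the token-stream interface to the parser is the same in spirit.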
The parser is defined in the file <filename>gram.y</filename> and consists of a
set of <firstterm>grammar rules</firstterm> and <firstterm>actions</firstterm>
that are executed whenever a rule is fired. The code of the actions (which
is actually C code) is used to build up the parse tree.

The file <filename>scan.l</filename> is transformed to
the C source file <filename>scan.c</filename>
using the program <application>lex</application>,
and <filename>gram.y</filename> is transformed to
<filename>gram.c</filename> using <application>yacc</application>.
After these transformations have taken
place a normal C compiler can be used to create the
parser. Never make any changes to the generated C files, as they will
be overwritten the next time <application>lex</application>
or <application>yacc</application> is called.
The mentioned transformations and compilations are normally done
automatically using the <firstterm>makefiles</firstterm>
shipped with the <productname>PostgreSQL</productname>
source distribution.

A detailed description of <application>yacc</application> or
the grammar rules given in <filename>gram.y</filename> would be
beyond the scope of this paper. There are many books and
documents dealing with <application>lex</application> and
<application>yacc</application>. You should be familiar with
<application>yacc</application> before you start to study the
grammar given in <filename>gram.y</filename>; otherwise you won't
understand what happens there.
<title>Transformation Process</title>

The parser stage creates a parse tree using only fixed rules about
the syntactic structure of SQL. It does not make any lookups in the
system catalogs, so it has no way to understand the detailed
semantics of the requested operations. After the parser completes,
the <firstterm>transformation process</firstterm> takes the tree handed
back by the parser as input and does the semantic interpretation needed
to understand which tables, functions, and operators are referenced by
the query. The data structure that is built to represent this
information is called the <firstterm>query tree</>.
The reason for separating raw parsing from semantic analysis is that
system catalog lookups can only be done within a transaction, and we
do not wish to start a transaction immediately upon receiving a query
string. The raw parsing stage is sufficient to identify the transaction
control commands (<command>BEGIN</>, <command>ROLLBACK</>, etc.), and
these can then be correctly executed without any further analysis.
Once we know that we are dealing with an actual query (such as
<command>SELECT</> or <command>UPDATE</>), it is okay to
start a transaction if we're not already in one. Only then can the
transformation process be invoked.

The query tree created by the transformation process is structurally
similar to the raw parse tree in most places, but it has many differences
in detail. For example, a <structname>FuncCall</> node in the
parse tree represents something that looks syntactically like a function
call. This may be transformed to either a <structname>FuncExpr</>
or <structname>Aggref</> node depending on whether the referenced
name turns out to be an ordinary function or an aggregate function.
Also, information about the actual data types of columns and expression
results is added to the query tree.
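The <structname>FuncCall</> resolution just described can be sketched as a catalog lookup. The catalog contents, node representation, and function below are invented for illustration; only the node tag names (<structname>FuncExpr</>, <structname>Aggref</>) come from the text above:

```python
# Toy sketch of the transformation step: a syntactic FuncCall is resolved
# against a (hypothetical) catalog to decide whether it becomes an
# ordinary function node or an aggregate node, with a result type attached.

CATALOG = {
    "length": {"kind": "function",  "result_type": "int4"},
    "sum":    {"kind": "aggregate", "result_type": "int8"},
}

def transform_funccall(name, args):
    entry = CATALOG.get(name)
    if entry is None:
        # Semantic analysis is where "no such function" errors appear.
        raise NameError(f"function {name} does not exist")
    node_tag = "FuncExpr" if entry["kind"] == "function" else "Aggref"
    return {"tag": node_tag, "name": name, "args": args,
            "result_type": entry["result_type"]}

print(transform_funccall("sum", ["x"]))     # resolved as an aggregate
print(transform_funccall("length", ["s"]))  # resolved as a plain function
```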
<sect1 id="rule-system">
<title>The <productname>PostgreSQL</productname> Rule System</title>

<productname>PostgreSQL</productname> supports a powerful
<firstterm>rule system</firstterm> for the specification
of <firstterm>views</firstterm> and ambiguous <firstterm>view updates</firstterm>.
Originally the <productname>PostgreSQL</productname>
rule system consisted of two implementations:

The first one worked using <firstterm>tuple level</firstterm> processing and was
implemented deep in the <firstterm>executor</firstterm>. The rule system was
called whenever an individual tuple had been accessed. This
implementation was removed in 1995 when the last official release
of the <productname>Berkeley Postgres</productname> project was
transformed into <productname>Postgres95</productname>.

The second implementation of the rule system is a technique
called <firstterm>query rewriting</firstterm>.
The <firstterm>rewrite system</firstterm> is a module
that exists between the <firstterm>parser stage</firstterm> and the
<firstterm>planner/optimizer</firstterm>. This technique is still implemented.

The query rewriter is discussed in some detail in
<xref linkend="rules">, so there is no need to cover it here.
We will only point out that both the input and the output of the
rewriter are query trees, that is, there is no change in the
representation or level of semantic detail in the trees. Rewriting
can be thought of as a form of macro expansion.
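The macro-expansion view of rewriting can be made concrete with a toy model of view expansion. The tree shape (a dictionary with <literal>from</literal> and <literal>where</literal> entries) and the <literal>rewrite</literal> function are invented for the sketch; real query trees are far richer:

```python
# Toy model of view expansion: the rewriter replaces a reference to a
# view with the view's defining query, so the rewritten "tree" touches
# only base tables, exactly as described in the text.

VIEWS = {
    # active_users is defined as: SELECT * FROM users WHERE active = true
    "active_users": {"from": "users", "where": "active = true"},
}

def rewrite(query):
    view = VIEWS.get(query["from"])
    if view is None:
        return query  # already references a base table; nothing to do
    # Merge the view's qualification with the user's qualification.
    merged = [w for w in (view["where"], query.get("where")) if w]
    return {"from": view["from"], "where": " and ".join(merged)}

q = {"from": "active_users", "where": "age > 30"}
print(rewrite(q))  # {'from': 'users', 'where': 'active = true and age > 30'}
```

Note that both the input and the output are the same kind of structure, as the text points out: rewriting changes the content of the query tree, not its representation.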
<sect1 id="planner-optimizer">
<title>Planner/Optimizer</title>

The task of the <firstterm>planner/optimizer</firstterm> is to create an optimal
execution plan. It first considers all possible ways of
<firstterm>scanning</firstterm> and <firstterm>joining</firstterm>
the relations that appear in a
query. All the created paths lead to the same result, and it is the
task of the optimizer to estimate the cost of executing each path and
find out which one is the cheapest.

After the cheapest path is determined, a <firstterm>plan tree</>
is built to pass to the executor. This represents the desired
execution plan in sufficient detail for the executor to run it.
<title>Generating Possible Plans</title>

The planner/optimizer decides which plans should be generated
based upon the types of indexes defined on the relations appearing in
a query. There is always the possibility of performing a
sequential scan on a relation, so a plan using only
sequential scans is always created. Assume an index is defined on a
relation (for example a B-tree index) and a query contains the
restriction <literal>relation.attribute OPR constant</literal>. If
<literal>relation.attribute</literal> happens to match the key of the B-tree
index and <literal>OPR</literal> is one of the operators listed in
the index's <firstterm>operator class</>, another plan is created using
the B-tree index to scan the relation. If there are further indexes
present and the restrictions in the query happen to match a key of an
index, further plans will be considered.
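The choice between the scan paths just described comes down to cost estimation. The sketch below is a toy cost model with made-up constants, not PostgreSQL's actual cost formulas; it only illustrates that a sequential-scan path always exists, an index path is added when an index matches the restriction, and the cheapest path wins:

```python
# Toy cost comparison between a sequential-scan path and an index-scan
# path for one relation. All cost constants are invented for illustration.

def scan_paths(n_pages, n_rows, selectivity, has_matching_index):
    # A sequential scan is always possible: read every page, test every row.
    paths = [("seqscan", n_pages + 0.01 * n_rows)]
    if has_matching_index:
        # An index scan pays per matched row but skips non-matching ones.
        matched = n_rows * selectivity
        paths.append(("indexscan", 2.0 + 4.0 * matched))
    # The planner keeps the cheapest path.
    return min(paths, key=lambda p: p[1])

# Highly selective restriction with a matching index: the index scan wins.
print(scan_paths(1000, 100_000, 0.001, True))
# No usable index: only the sequential-scan path exists.
print(scan_paths(1000, 100_000, 0.001, False))
```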
After all feasible plans have been found for scanning single relations,
plans for joining relations are created. The planner/optimizer
preferentially considers joins between any two relations for which there
exists a corresponding join clause in the WHERE qualification (i.e., for
which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
exists). Join pairs with no join clause are considered only when there
is no other choice, that is, when a particular relation has no available
join clauses to any other relation. All possible plans are generated for
every join pair considered
by the planner/optimizer. The three possible join strategies are:
<firstterm>nested loop join</firstterm>: The right relation is scanned
once for every tuple found in the left relation. This strategy
is easy to implement but can be very time consuming. (However,
if the right relation can be scanned with an indexscan, this can
be a good strategy. It is possible to use values from the current
row of the left relation as keys for the indexscan of the right.)

<firstterm>merge sort join</firstterm>: Each relation is sorted on the join
attributes before the join starts. Then the two relations are
merged together, taking into account that both relations are
ordered on the join attributes. This kind of join is more
attractive because each relation has to be scanned only once.

<firstterm>hash join</firstterm>: The right relation is first scanned
and loaded into a hash table, using its join attributes as hash keys.
Next the left relation is scanned and the
appropriate values of every tuple found are used as hash keys to
locate the matching tuples in the table.
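Two of these strategies can be sketched over plain Python lists of <literal>(key, value)</literal> rows. This is an illustration of the algorithms only; the real executor operates on tuples pulled from plan nodes, not in-memory lists:

```python
# Toy nested loop join and hash join over lists of (key, value) rows.

def nested_loop_join(left, right):
    # Rescan the entire right input once per left row.
    return [(lk, lv, rv)
            for lk, lv in left
            for rk, rv in right
            if lk == rk]

def hash_join(left, right):
    # Build a hash table on the right input's join key...
    table = {}
    for rk, rv in right:
        table.setdefault(rk, []).append(rv)
    # ...then probe it once per left row.
    return [(lk, lv, rv)
            for lk, lv in left
            for rv in table.get(lk, [])]

left = [(1, "a"), (2, "b"), (3, "c")]
right = [(2, "x"), (3, "y"), (3, "z")]
print(nested_loop_join(left, right))
print(hash_join(left, right))
```

Both strategies produce the same join result; they differ only in how much work is done per row, which is exactly the difference the cost estimator weighs.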
The finished plan tree consists of sequential or index scans of the
base relations, plus nestloop, merge, or hash join nodes as needed,
plus any auxiliary steps needed, such as sort nodes or aggregate-function
calculation nodes. Most of these plan node types have the additional
ability to do <firstterm>selection</> (discarding rows that do
not meet a specified boolean condition) and <firstterm>projection</>
(computation of a derived column set based on given column values,
that is, evaluation of scalar expressions where needed). One of
the responsibilities of the planner is to attach selection conditions
from the WHERE clause and computation of required output expressions
to the most appropriate nodes of the plan tree.
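Selection and projection attached to a scan node can be sketched as follows; the function and row representation are invented for illustration:

```python
# Toy illustration of selection and projection on a scan node: the qual
# discards rows that fail the condition, and the projection computes the
# derived output columns from the surviving rows.

rows = [{"a": 1, "b": 10}, {"a": 5, "b": 2}, {"a": 7, "b": 3}]

def scan_with_qual_and_projection(rows, qual, project):
    for row in rows:
        if qual(row):           # selection: keep only qualifying rows
            yield project(row)  # projection: compute the output columns

out = list(scan_with_qual_and_projection(
    rows,
    qual=lambda r: r["a"] > 2,                  # like: WHERE a > 2
    project=lambda r: {"s": r["a"] + r["b"]}))  # like: SELECT a + b AS s
print(out)  # [{'s': 7}, {'s': 10}]
```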
<sect1 id="executor">
<title>Executor</title>

The <firstterm>executor</firstterm> takes the plan handed back by the
planner/optimizer and recursively processes it to extract the required set
of rows. This is essentially a demand-pull pipeline mechanism.
Each time a plan node is called, it must deliver one more tuple, or
report that it is done delivering tuples.
To provide a concrete example, assume that the top
node is a <literal>MergeJoin</literal> node.
Before any merge can be done two tuples have to be fetched (one from
each subplan). So the executor recursively calls itself to
process the subplans (it starts with the subplan attached to
<literal>lefttree</literal>). The new top node (the top node of the left
subplan) is, let's say, a
<literal>Sort</literal> node and again recursion is needed to obtain
an input tuple. The child node of the <literal>Sort</literal> might
be a <literal>SeqScan</> node, representing actual reading of a table.
Execution of this node causes the executor to fetch a row from the
table and return it up to the calling node. The <literal>Sort</literal>
node will repeatedly call its child to obtain all the rows to be sorted.
When the input is exhausted (as indicated by the child node returning
a NULL instead of a tuple), the <literal>Sort</literal> code performs
the sort, and finally is able to return its first output row, namely
the first one in sorted order. It keeps the remaining rows stored so
that it can deliver them in sorted order in response to later demands.

The <literal>MergeJoin</literal> node similarly demands the first row
from its right subplan. Then it compares the two rows to see if they
can be joined; if so, it returns a join row to its caller. On the next
call, or immediately if it cannot join the current pair of inputs,
it advances to the next row of one table
or the other (depending on how the comparison came out), and again
checks for a match. Eventually, one subplan or the other is exhausted,
and the <literal>MergeJoin</literal> node returns NULL to indicate that
no more join rows can be formed.
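The demand-pull pipeline just walked through maps naturally onto Python generators, where each node yields one tuple per request and <literal>Sort</literal> must drain its child before producing its first row. This is a deliberately simplified sketch: the merge step below advances both sides on a match and so assumes unique join keys, unlike the real <literal>MergeJoin</literal>:

```python
# Toy demand-pull executor: SeqScan -> Sort -> MergeJoin, as generators.

def seq_scan(table):
    # Bottom of the pipeline: read rows from the "table" one at a time.
    for row in table:
        yield row

def sort_node(child, key):
    # Sort must exhaust its child before it can return its first row...
    rows = list(child)
    rows.sort(key=key)
    # ...then it delivers the stored rows in sorted order on demand.
    yield from rows

def merge_join(left, right, key):
    # Both inputs arrive sorted on the join key; advance whichever side
    # is behind. (Simplified: assumes unique keys on each side.)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if key(l) == key(r):
            yield (l, r)
            l, r = next(left, None), next(right, None)
        elif key(l) < key(r):
            l = next(left, None)
        else:
            r = next(right, None)

t1 = [(3, "c"), (1, "a"), (2, "b")]
t2 = [(2, "x"), (1, "y")]
plan = merge_join(sort_node(seq_scan(t1), key=lambda r: r[0]),
                  sort_node(seq_scan(t2), key=lambda r: r[0]),
                  key=lambda r: r[0])
print(list(plan))
```

The generator's <literal>StopIteration</literal> (surfacing here as <literal>None</literal> from <literal>next(..., None)</literal>) plays the role of the NULL that a real plan node returns when it is done delivering tuples.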
Complex queries may involve many levels of plan nodes, but the general
approach is the same: each node computes and returns its next output
row each time it is called. Each node is also responsible for applying
any selection or projection expressions that were assigned to it by
the planner.
The executor mechanism is used to evaluate all four basic SQL query types:
<command>SELECT</>, <command>INSERT</>, <command>UPDATE</>, and
<command>DELETE</>. For <command>SELECT</>, the top-level executor
code only needs to send each row returned by the query plan tree off
to the client. For <command>INSERT</>, each returned row is inserted
into the target table specified for the <command>INSERT</>. (A simple
<command>INSERT ... VALUES</> command creates a trivial plan tree
consisting of a single <literal>Result</> node, which computes just one
result row. But <command>INSERT ... SELECT</> may demand the full power
of the executor mechanism.) For <command>UPDATE</>, the planner arranges
that each computed row includes all the updated column values, plus
the <firstterm>TID</> (tuple ID, or location) of the original target row;
the executor top level uses this information to create a new updated row
and mark the old row deleted. For <command>DELETE</>, the only column
that is actually returned by the plan is the TID, and the executor top
level simply uses the TID to visit the target rows and mark them deleted.