granicus.if.org Git - postgresql/blob - doc/src/sgml/indexam.sgml

   1 <!--
   2 $PostgreSQL: pgsql/doc/src/sgml/indexam.sgml,v 2.4 2005/04/20 22:19:58 tgl Exp $
   3 -->
   4
   5 <chapter id="indexam">
   6  <title>Index Access Method Interface Definition</title>
   7
   8   <para>
   9    This chapter defines the interface between the core
  10    <productname>PostgreSQL</productname> system and <firstterm>index access
  11    methods</>, which manage individual index types.  The core system
  12    knows nothing about indexes beyond what is specified here, so it is
  13    possible to develop entirely new index types by writing add-on code.
  14   </para>
  15
  16   <para>
  17    All indexes in <productname>PostgreSQL</productname> are what are known
  18    technically as <firstterm>secondary indexes</>; that is, the index is
  19    physically separate from the table file that it describes.  Each index
  20    is stored as its own physical <firstterm>relation</> and so is described
  21    by an entry in the <structname>pg_class</> catalog.  The contents of an
  22    index are entirely under the control of its index access method.  In
  23    practice, all index access methods divide indexes into standard-size
  24    pages so that they can use the regular storage manager and buffer manager
  25    to access the index contents.  (All the existing index access methods
  26    furthermore use the standard page layout described in <xref
  27    linkend="storage-page-layout">, and they all use the same format for index
  28    tuple headers; but these decisions are not forced on an access method.)
  29   </para>
  30
  31   <para>
  32    An index is effectively a mapping from some data key values to
  33    <firstterm>tuple identifiers</>, or <acronym>TIDs</>, of row versions
  34    (tuples) in the index's parent table.  A TID consists of a
  35    block number and an item number within that block (see <xref
  36    linkend="storage-page-layout">).  This is sufficient
  37    information to fetch a particular row version from the table.
  38    Indexes are not directly aware that under MVCC, there may be multiple
  39    extant versions of the same logical row; to an index, each tuple is
  40    an independent object that needs its own index entry.  Thus, an
  41    update of a row always creates all-new index entries for the row, even if
  42    the key values did not change.  Index entries for dead tuples are
  43    reclaimed (by vacuuming) when the dead tuples themselves are reclaimed.
  44   </para>
  45
  46  <sect1 id="index-catalog">
  47   <title>Catalog Entries for Indexes</title>
  48
  49   <para>
  50    Each index access method is described by a row in the
  51    <structname>pg_am</structname> system catalog (see
  52    <xref linkend="catalog-pg-am">).  The principal contents of a
  53    <structname>pg_am</structname> row are references to
  54    <link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>
  55    entries that identify the index access
  56    functions supplied by the access method.  The APIs for these functions
  57    are defined later in this chapter.  In addition, the
  58    <structname>pg_am</structname> row specifies a few fixed properties of
  59    the access method, such as whether it can support multi-column indexes.
  60    There is not currently any special support
  61    for creating or deleting <structname>pg_am</structname> entries;
  62    anyone able to write a new access method is expected to be competent
  63    to insert an appropriate row for themselves.
  64   </para>
  65
  66   <para>
  67    To be useful, an index access method must also have one or more
  68    <firstterm>operator classes</> defined in
  69    <link linkend="catalog-pg-opclass"><structname>pg_opclass</structname></link>,
  70    <link linkend="catalog-pg-amop"><structname>pg_amop</structname></link>, and
  71    <link linkend="catalog-pg-amproc"><structname>pg_amproc</structname></link>.
  72    These entries allow the planner
  73    to determine what kinds of query qualifications can be used with
  74    indexes of this access method.  Operator classes are described
  75    in <xref linkend="xindex">, which is prerequisite material for reading
  76    this chapter.
  77   </para>
  78
  79   <para>
  80    An individual index is defined by a
  81    <link linkend="catalog-pg-class"><structname>pg_class</structname></link>
  82    entry that describes it as a physical relation, plus a
  83    <link linkend="catalog-pg-index"><structname>pg_index</structname></link>
  84    entry that shows the logical content of the index &mdash; that is, the set
  85    of index columns it has and the semantics of those columns, as captured by
  86    the associated operator classes.  The index columns (key values) can be
  87    either simple columns of the underlying table or expressions over the table
  88    rows.  The index access method normally has no interest in where the index
  89    key values come from (it is always handed precomputed key values) but it
  90    will be very interested in the operator class information in
  91    <structname>pg_index</structname>.  Both of these catalog entries can be
  92    accessed as part of the <structname>Relation</> data structure that is
  93    passed to all operations on the index.
  94   </para>
  95
  96   <para>
  97    Some of the flag columns of <structname>pg_am</structname> have nonobvious
  98    implications.  The requirements of <structfield>amcanunique</structfield>
  99    are discussed in <xref linkend="index-unique-checks">, and those of
 100    <structfield>amconcurrent</structfield> in <xref linkend="index-locking">.
 101    The <structfield>amcanmulticol</structfield> flag asserts that the
 102    access method supports multi-column indexes, while
 103    <structfield>amindexnulls</structfield> asserts that index entries are
 104    created for NULL key values.  Since most indexable operators are
 105    strict and hence cannot return TRUE for NULL inputs,
 106    it is at first sight attractive to not store index entries for NULLs:
 107    they could never be returned by an index scan anyway.  However, this
 108    argument fails for a full-table index scan (one with no scan keys);
 109    such a scan should include null rows.  In practice this means that
 110    indexes that support ordered scans (have <structfield>amorderstrategy</>
 111    nonzero) must index nulls, since the planner might decide to use such a
 112    scan as a substitute for sorting.  Such indexes must also be willing to
 113    run a scan with no scan keys at all.  Another restriction is that an index
 114    access method that supports multiple index columns <emphasis>must</>
 115    support indexing null values in columns after the first, because the planner
 116    will assume the index can be used for queries on just the first
 117    column(s).  For example, consider an index on (a,b) and a query with
 118    <literal>WHERE a = 4</literal>.  The system will assume the index can be
 119    used to scan for rows with <literal>a = 4</literal>, which is wrong if the
 120    index omits rows where <literal>b</> is null.
 121    It is, however, OK to omit rows where the first indexed column is null.
 122    (GiST currently does so.)  Thus,
 123    <structfield>amindexnulls</structfield> should be set true only if the
 124    index access method indexes all rows, including arbitrary combinations of
 125    null values.
 126   </para>
 127
 128  </sect1>
 129
 130  <sect1 id="index-functions">
 131   <title>Index Access Method Functions</title>
 132
 133   <para>
 134    The index construction and maintenance functions that an index access
 135    method must provide are:
 136   </para>
 137
 138   <para>
 139 <programlisting>
 140 void
 141 ambuild (Relation heapRelation,
 142          Relation indexRelation,
 143          IndexInfo *indexInfo);
 144 </programlisting>
 145    Build a new index.  The index relation has been physically created,
 146    but is empty.  It must be filled in with whatever fixed data the
 147    access method requires, plus entries for all tuples already existing
 148    in the table.  Ordinarily the <function>ambuild</> function will call
 149    <function>IndexBuildHeapScan()</> to scan the table for existing tuples
 150    and compute the keys that need to be inserted into the index.
 151   </para>
 152
 153   <para>
 154 <programlisting>
 155 bool
 156 aminsert (Relation indexRelation,
 157           Datum *values,
 158           bool *isnull,
 159           ItemPointer heap_tid,
 160           Relation heapRelation,
 161           bool check_uniqueness);
 162 </programlisting>
 163    Insert a new tuple into an existing index.  The <literal>values</> and
 164    <literal>isnull</> arrays give the key values to be indexed, and
 165    <literal>heap_tid</> is the TID to be indexed.
 166    If the access method supports unique indexes (its
 167    <structname>pg_am</>.<structfield>amcanunique</> flag is true) then
 168    <literal>check_uniqueness</> may be true, in which case the access method
 169    must verify that there is no conflicting row; this is the only situation in
 170    which the access method normally needs the <literal>heapRelation</>
 171    parameter.  See <xref linkend="index-unique-checks"> for details.
 172    The result is TRUE if an index entry was inserted, FALSE if not. (A FALSE
 173    result does not denote an error condition, but is used for cases such
 174    as an index AM refusing to index a NULL.)
 175   </para>
 176
 177   <para>
 178 <programlisting>
 179 IndexBulkDeleteResult *
 180 ambulkdelete (Relation indexRelation,
 181               IndexBulkDeleteCallback callback,
 182               void *callback_state);
 183 </programlisting>
 184    Delete tuple(s) from the index.  This is a <quote>bulk delete</> operation
 185    that is intended to be implemented by scanning the whole index and checking
 186    each entry to see if it should be deleted.
 187    The passed-in <literal>callback</> function may be called, in the style
 188    <literal>callback(<replaceable>TID</>, callback_state) returns bool</literal>,
 189    to determine whether any particular index entry, as identified by its
 190    referenced TID, is to be deleted.  Must return either NULL or a palloc'd
 191    struct containing statistics about the effects of the deletion operation.
 192   </para>
 193
 194   <para>
 195 <programlisting>
 196 IndexBulkDeleteResult *
 197 amvacuumcleanup (Relation indexRelation,
 198                  IndexVacuumCleanupInfo *info,
 199                  IndexBulkDeleteResult *stats);
 200 </programlisting>
 201    Clean up after a <command>VACUUM</command> operation (one or more
 202    <function>ambulkdelete</> calls).  An index access method does not have
 203    to provide this function (if so, the entry in <structname>pg_am</> must
 204    be zero).  If it is provided, it is typically used for bulk cleanup
 205    such as reclaiming empty index pages.  <literal>info</>
 206    provides some additional arguments such as a message level for statistical
 207    reports, and <literal>stats</> is whatever the last
 208    <function>ambulkdelete</> call returned.  <function>amvacuumcleanup</>
 209    may replace or modify this struct before returning it.  If the result
 210    is not NULL it must be a palloc'd struct.  The statistics it contains
 211    will be reported by <command>VACUUM</> if <literal>VERBOSE</> is given.
 212   </para>
 213
 214   <para>
 215    The purpose of an index, of course, is to support scans for tuples matching
 216    an indexable <literal>WHERE</> condition, often called a
 217    <firstterm>qualifier</> or <firstterm>scan key</>.  The semantics of
 218    index scanning are described more fully in <xref linkend="index-scanning">,
 219    below.  The scan-related functions that an index access method must provide
 220    are:
 221   </para>
 222
 223   <para>
 224 <programlisting>
 225 IndexScanDesc
 226 ambeginscan (Relation indexRelation,
 227              int nkeys,
 228              ScanKey key);
 229 </programlisting>
 230    Begin a new scan.  The <literal>key</> array (of length <literal>nkeys</>)
 231    describes the scan key(s) for the index scan.  The result must be a
 232    palloc'd struct. For implementation reasons the index access method
 233    <emphasis>must</> create this struct by calling
 234    <function>RelationGetIndexScan()</>.  In most cases
 235    <function>ambeginscan</> itself does little beyond making that call;
 236    the interesting parts of indexscan startup are in <function>amrescan</>.
 237   </para>
 238
 239   <para>
 240 <programlisting>
 241 boolean
 242 amgettuple (IndexScanDesc scan,
 243             ScanDirection direction);
 244 </programlisting>
 245    Fetch the next tuple in the given scan, moving in the given
 246    direction (forward or backward in the index).  Returns TRUE if a tuple was
 247    obtained, FALSE if no matching tuples remain.  In the TRUE case the tuple
 248    TID is stored into the <literal>scan</> structure.  Note that
 249    <quote>success</> means only that the index contains an entry that matches
 250    the scan keys, not that the tuple necessarily still exists in the heap or
 251    will pass the caller's snapshot test.
 252   </para>
 253
 254   <para>
 255 <programlisting>
 256 boolean
 257 amgetmulti (IndexScanDesc scan,
 258             ItemPointer tids,
 259             int32 max_tids,
 260             int32 *returned_tids);
 261 </programlisting>
 262    Fetch multiple tuples in the given scan.  Returns TRUE if the scan should
 263    continue, FALSE if no matching tuples remain.  <literal>tids</> points to
 264    a caller-supplied array of <literal>max_tids</>
 265    <structname>ItemPointerData</> records, which the call fills with TIDs of
 266    matching tuples.  <literal>*returned_tids</> is set to the number of TIDs
 267    actually returned.  This can be less than <literal>max_tids</>, or even
 268    zero, even when the return value is TRUE.  (This provision allows the
 269    access method to choose the most efficient stopping points in its scan,
 270    for example index page boundaries.)  <function>amgetmulti</> and
 271    <function>amgettuple</> cannot be used in the same index scan; there
 272    are other restrictions too when using <function>amgetmulti</>, as explained
 273    in <xref linkend="index-scanning">.
 274   </para>
 275
 276   <para>
 277 <programlisting>
 278 void
 279 amrescan (IndexScanDesc scan,
 280           ScanKey key);
 281 </programlisting>
 282    Restart the given scan, possibly with new scan keys (to continue using
 283    the old keys, NULL is passed for <literal>key</>).  Note that it is not
 284    possible for the number of keys to be changed.  In practice the restart
 285    feature is used when a new outer tuple is selected by a nestloop join
 286    and so a new key comparison value is needed, but the scan key structure
 287    remains the same.  This function is also called by
 288    <function>RelationGetIndexScan()</>, so it is used for initial setup
 289    of an indexscan as well as rescanning.
 290   </para>
 291
 292   <para>
 293 <programlisting>
 294 void
 295 amendscan (IndexScanDesc scan);
 296 </programlisting>
 297    End a scan and release resources.  The <literal>scan</> struct itself
 298    should not be freed, but any locks or pins taken internally by the
 299    access method must be released.
 300   </para>
 301
 302   <para>
 303 <programlisting>
 304 void
 305 ammarkpos (IndexScanDesc scan);
 306 </programlisting>
 307    Mark current scan position.  The access method need only support one
 308    remembered scan position per scan.
 309   </para>
 310
 311   <para>
 312 <programlisting>
 313 void
 314 amrestrpos (IndexScanDesc scan);
 315 </programlisting>
 316    Restore the scan to the most recently marked position.
 317   </para>
 318
 319   <para>
 320 <programlisting>
 321 void
 322 amcostestimate (Query *root,
 323                 IndexOptInfo *index,
 324                 List *indexQuals,
 325                 Cost *indexStartupCost,
 326                 Cost *indexTotalCost,
 327                 Selectivity *indexSelectivity,
 328                 double *indexCorrelation);
 329 </programlisting>
 330    Estimate the costs of an index scan.  This function is described fully
 331    in <xref linkend="index-cost-estimation">, below.
 332   </para>
 333
 334   <para>
 335    By convention, the <literal>pg_proc</literal> entry for any index
 336    access method function should show the correct number of arguments,
 337    but declare them all as type <type>internal</> (since most of the arguments
 338    have types that are not known to SQL, and we don't want users calling
 339    the functions directly anyway).  The return type is declared as
 340    <type>void</>, <type>internal</>, or <type>boolean</> as appropriate.
 341   </para>
 342
 343  </sect1>
 344
 345  <sect1 id="index-scanning">
 346   <title>Index Scanning</title>
 347
 348   <para>
 349    In an index scan, the index access method is responsible for regurgitating
 350    the TIDs of all the tuples it has been told about that match the
 351    <firstterm>scan keys</>.  The access method is <emphasis>not</> involved in
 352    actually fetching those tuples from the index's parent table, nor in
 353    determining whether they pass the scan's time qualification test or other
 354    conditions.
 355   </para>
 356
 357   <para>
 358    A scan key is the internal representation of a <literal>WHERE</> clause of
 359    the form <replaceable>index_key</> <replaceable>operator</>
 360    <replaceable>constant</>, where the index key is one of the columns of the
 361    index and the operator is one of the members of the operator class
 362    associated with that index column.  An index scan has zero or more scan
 363    keys, which are implicitly ANDed &mdash; the returned tuples are expected
 364    to satisfy all the indicated conditions.
 365   </para>
 366
 367   <para>
 368    The operator class may indicate that the index is <firstterm>lossy</> for a
 369    particular operator; this implies that the index scan will return all the
 370    entries that pass the scan key, plus possibly additional entries that do
 371    not.  The core system's indexscan machinery will then apply that operator
 372    again to the heap tuple to verify whether or not it really should be
 373    selected.  For non-lossy operators, the index scan must return exactly the
 374    set of matching entries, as there is no recheck.
 375   </para>
 376
 377   <para>
 378    Note that it is entirely up to the access method to ensure that it
 379    correctly finds all and only the entries passing all the given scan keys.
 380    Also, the core system will simply hand off all the <literal>WHERE</>
 381    clauses that match the index keys and operator classes, without any
 382    semantic analysis to determine whether they are redundant or
 383    contradictory.  As an example, given
 384    <literal>WHERE x &gt; 4 AND x &gt; 14</> where <literal>x</> is a b-tree
 385    indexed column, it is left to the b-tree <function>amrescan</> function
 386    to realize that the first scan key is redundant and can be discarded.
 387    The extent of preprocessing needed during <function>amrescan</> will
 388    depend on the extent to which the index access method needs to reduce
 389    the scan keys to a <quote>normalized</> form.
 390   </para>
 391
 392   <para>
 393    The <function>amgettuple</> function has a <literal>direction</> argument,
 394    which can be either <literal>ForwardScanDirection</> (the normal case)
 395    or  <literal>BackwardScanDirection</>.  If the first call after
 396    <function>amrescan</> specifies <literal>BackwardScanDirection</>, then the
 397    set of matching index entries is to be scanned back-to-front rather than in
 398    the normal front-to-back direction, so <function>amgettuple</> must return
 399    the last matching tuple in the index, rather than the first one as it
 400    normally would.  (This will only occur for access
 401    methods that advertise they support ordered scans by setting
 402    <structname>pg_am</>.<structfield>amorderstrategy</> nonzero.)  After the
 403    first call, <function>amgettuple</> must be prepared to advance the scan in
 404    either direction from the most recently returned entry.
 405   </para>
 406
 407   <para>
 408    The access method must support <quote>marking</> a position in a scan
 409    and later returning to the marked position.  The same position may be
 410    restored multiple times.  However, only one position need be remembered
 411    per scan; a new <function>ammarkpos</> call overrides the previously
 412    marked position.
 413   </para>
 414
 415   <para>
 416    Both the scan position and the mark position (if any) must be maintained
 417    consistently in the face of concurrent insertions or deletions in the
 418    index.  It is OK if a freshly-inserted entry is not returned by a scan that
 419    would have found the entry if it had existed when the scan started, or for
 420    the scan to return such an entry upon rescanning or backing
 421    up even though it had not been returned the first time through.  Similarly,
 422    a concurrent delete may or may not be reflected in the results of a scan.
 423    What is important is that insertions or deletions not cause the scan to
 424    miss or multiply return entries that were not themselves being inserted or
 425    deleted.  (For an index type that does not set
 426    <structname>pg_am</>.<structfield>amconcurrent</>, it is sufficient to
 427    handle these cases for insertions or deletions performed by the same
 428    backend that's doing the scan.  But when <structfield>amconcurrent</> is
 429    true, insertions or deletions from other backends must be handled as well.)
 430   </para>
 431
 432   <para>
 433    Instead of using <function>amgettuple</>, an index scan can be done with
 434    <function>amgetmulti</> to fetch multiple tuples per call.  This can be
 435    noticeably more efficient than <function>amgettuple</> because it allows
 436    avoiding lock/unlock cycles within the access method.  In principle
 437    <function>amgetmulti</> should have the same effects as repeated
 438    <function>amgettuple</> calls, but we impose several restrictions to
 439    simplify matters.  In the first place, <function>amgetmulti</> does not
 440    take a <literal>direction</> argument, and therefore it does not support
 441    backwards scan nor intrascan reversal of direction.  The access method
 442    need not support marking or restoring scan positions during an
 443    <function>amgetmulti</> scan, either.  (These restrictions cost little
 444    since it would be difficult to use these features in an
 445    <function>amgetmulti</> scan anyway: adjusting the caller's buffered
 446    list of TIDs would be complex.)  Finally, <function>amgetmulti</> does
 447    not guarantee any locking of the returned tuples, with implications
 448    spelled out in <xref linkend="index-locking">.
 449   </para>
 450
 451  </sect1>
 452
 453  <sect1 id="index-locking">
 454   <title>Index Locking Considerations</title>
 455
 456   <para>
 457    An index access method can choose whether it supports concurrent updates
 458    of the index by multiple processes.  If the method's
 459    <structname>pg_am</>.<structfield>amconcurrent</> flag is true, then
 460    the core <productname>PostgreSQL</productname> system obtains
 461    <literal>AccessShareLock</> on the index during an index scan, and
 462    <literal>RowExclusiveLock</> when updating the index.  Since these lock
 463    types do not conflict, the access method is responsible for handling any
 464    fine-grained locking it may need.  An exclusive lock on the index as a whole
 465    will be taken only during index creation, destruction, or
 466    <literal>REINDEX</>.  When <structfield>amconcurrent</> is false,
 467    <productname>PostgreSQL</productname> still obtains
 468    <literal>AccessShareLock</> during index scans, but it obtains
 469    <literal>AccessExclusiveLock</> during any update.  This ensures that
 470    updaters have sole use of the index.  Note that this implicitly assumes
 471    that index scans are read-only; an access method that might modify the
 472    index during a scan will still have to do its own locking to handle the
 473    case of concurrent scans.
 474   </para>
 475
 476   <para>
 477    Recall that a backend's own locks never conflict; therefore, even a
 478    non-concurrent index type must be prepared to handle the case where
 479    a backend is inserting or deleting entries in an index that it is itself
 480    scanning.  (This is of course necessary to support an <command>UPDATE</>
 481    that uses the index to find the rows to be updated.)
 482   </para>
 483
 484   <para>
 485    Building an index type that supports concurrent updates usually requires
 486    extensive and subtle analysis of the required behavior.  For the b-tree
 487    and hash index types, you can read about the design decisions involved in
 488    <filename>src/backend/access/nbtree/README</> and
 489    <filename>src/backend/access/hash/README</>.
 490   </para>
 491
 492   <para>
 493    Aside from the index's own internal consistency requirements, concurrent
 494    updates create issues about consistency between the parent table (the
 495    <firstterm>heap</>) and the index.  Because
 496    <productname>PostgreSQL</productname> separates accesses
 497    and updates of the heap from those of the index, there are windows in
 498    which the index may be inconsistent with the heap.  We handle this problem
 499    with the following rules:
 500
 501     <itemizedlist>
 502      <listitem>
 503       <para>
 504        A new heap entry is made before making its index entries.  (Therefore
 505        a concurrent index scan is likely to fail to see the heap entry.
 506        This is okay because the index reader would be uninterested in an
 507        uncommitted row anyway.  But see <xref linkend="index-unique-checks">.)
 508       </para>
 509      </listitem>
 510      <listitem>
 511       <para>
 512        When a heap entry is to be deleted (by <command>VACUUM</>), all its
 513        index entries must be removed first.
 514       </para>
 515      </listitem>
 516      <listitem>
 517       <para>
 518        For concurrent index types, an indexscan must maintain a pin
 519        on the index page holding the item last returned by
 520        <function>amgettuple</>, and <function>ambulkdelete</> cannot delete
 521        entries from pages that are pinned by other backends.  The need
 522        for this rule is explained below.
 523       </para>
 524      </listitem>
 525     </itemizedlist>
 526
 527    If an index is concurrent then it is possible for an index reader to
 528    see an index entry just before it is removed by <command>VACUUM</>, and
 529    then to arrive at the corresponding heap entry after that was removed by
 530    <command>VACUUM</>.  (With a nonconcurrent index, this is not possible
 531    because of the conflicting index-level locks that will be taken out.)
 532    This creates no serious problems if that item
 533    number is still unused when the reader reaches it, since an empty
 534    item slot will be ignored by <function>heap_fetch()</>.  But what if a
 535    third backend has already re-used the item slot for something else?
 536    When using an MVCC-compliant snapshot, there is no problem because
 537    the new occupant of the slot is certain to be too new to pass the
 538    snapshot test.  However, with a non-MVCC-compliant snapshot (such as
 539    <literal>SnapshotNow</>), it would be possible to accept and return
 540    a row that does not in fact match the scan keys.  We could defend
 541    against this scenario by requiring the scan keys to be rechecked
 542    against the heap row in all cases, but that is too expensive.  Instead,
 543    we use a pin on an index page as a proxy to indicate that the reader
 544    may still be <quote>in flight</> from the index entry to the matching
 545    heap entry.  Making <function>ambulkdelete</> block on such a pin ensures
 546    that <command>VACUUM</> cannot delete the heap entry before the reader
 547    is done with it.  This solution costs little in runtime, and adds blocking
 548    overhead only in the rare cases where there actually is a conflict.
 549   </para>
 550
 551   <para>
 552    This solution requires that index scans be <quote>synchronous</>: we have
 553    to fetch each heap tuple immediately after scanning the corresponding index
 554    entry.  This is expensive for a number of reasons.  An
 555    <quote>asynchronous</> scan in which we collect many TIDs from the index,
 556    and only visit the heap tuples sometime later, requires much less index
 557    locking overhead and may allow a more efficient heap access pattern.
 558    Per the above analysis, we must use the synchronous approach for
 559    non-MVCC-compliant snapshots, but an asynchronous scan is workable
 560    for a query using an MVCC snapshot.
 561   </para>
 562
 563   <para>
 564    In an <function>amgetmulti</> index scan, the access method need not
 565    guarantee to keep an index pin on any of the returned tuples.  (It would be
 566    impractical to pin more than the last one anyway.)  Therefore
 567    it is only safe to use such scans with MVCC-compliant snapshots.
 568   </para>
 569
 570  </sect1>
 571
 572  <sect1 id="index-unique-checks">
 573   <title>Index Uniqueness Checks</title>
 574
 575   <para>
 576    <productname>PostgreSQL</productname> enforces SQL uniqueness constraints
 577    using <firstterm>unique indexes</>, which are indexes that disallow
 578    multiple entries with identical keys.  An access method that supports this
 579    feature sets <structname>pg_am</>.<structfield>amcanunique</> true.
 580    (At present, only b-tree supports it.)
 581   </para>
 582
 583   <para>
 584    Because of MVCC, it is always necessary to allow duplicate entries to
 585    exist physically in an index: the entries might refer to successive
 586    versions of a single logical row.  The behavior we actually want to
 587    enforce is that no MVCC snapshot could include two rows with equal
 588    index keys.  This breaks down into the following cases that must be
 589    checked when inserting a new row into a unique index:
 590
 591     <itemizedlist>
 592      <listitem>
 593       <para>
 594        If a conflicting valid row has been deleted by the current transaction,
 595        it's okay.  (In particular, since an UPDATE always deletes the old row
 596        version before inserting the new version, this will allow an UPDATE on
 597        a row without changing the key.)
 598       </para>
 599      </listitem>
 600      <listitem>
 601       <para>
 602        If a conflicting row has been inserted by an as-yet-uncommitted
 603        transaction, the would-be inserter must wait to see if that transaction
 604        commits.  If it rolls back then there is no conflict.  If it commits
 605        without deleting the conflicting row again, there is a uniqueness
 606        violation.  (In practice we just wait for the other transaction to
 607        end and then redo the visibility check in toto.)
 608       </para>
 609      </listitem>
 610      <listitem>
 611       <para>
 612        Similarly, if a conflicting valid row has been deleted by an
 613        as-yet-uncommitted transaction, the would-be inserter must wait
 614        for that transaction to commit or abort, and then repeat the test.
 615       </para>
 616      </listitem>
 617     </itemizedlist>
 618   </para>
 619
 620   <para>
 621    We require the index access method to apply these tests itself, which
 622    means that it must reach into the heap to check the commit status of
 623    any row that is shown to have a duplicate key according to the index
 624    contents.  This is without a doubt ugly and non-modular, but it saves
 625    redundant work: if we did a separate probe then the index lookup for
 626    a conflicting row would be essentially repeated while finding the place to
 627    insert the new row's index entry.  What's more, there is no obvious way
 628    to avoid race conditions unless the conflict check is an integral part
 629    of insertion of the new index entry.
 630   </para>
 631
 632   <para>
 633    The main limitation of this scheme is that it has no convenient way
 634    to support deferred uniqueness checks.
 635   </para>
 636
 637  </sect1>
 638
 639  <sect1 id="index-cost-estimation">
 640   <title>Index Cost Estimation Functions</title>
 641
 642   <para>
 643    The amcostestimate function is given a list of WHERE clauses that have
 644    been determined to be usable with the index.  It must return estimates
 645    of the cost of accessing the index and the selectivity of the WHERE
 646    clauses (that is, the fraction of parent-table rows that will be
 647    retrieved during the index scan).  For simple cases, nearly all the
 648    work of the cost estimator can be done by calling standard routines
 649    in the optimizer; the point of having an amcostestimate function is
 650    to allow index access methods to provide index-type-specific knowledge,
 651    in case it is possible to improve on the standard estimates.
 652   </para>
 653
 654   <para>
 655    Each amcostestimate function must have the signature:
 656
 657 <programlisting>
 658 void
 659 amcostestimate (Query *root,
 660                 IndexOptInfo *index,
 661                 List *indexQuals,
 662                 Cost *indexStartupCost,
 663                 Cost *indexTotalCost,
 664                 Selectivity *indexSelectivity,
 665                 double *indexCorrelation);
 666 </programlisting>
 667
 668    The first four parameters are inputs:
 669
 670    <variablelist>
 671     <varlistentry>
 672      <term>root</term>
 673      <listitem>
 674       <para>
 675        The query being processed.
 676       </para>
 677      </listitem>
 678     </varlistentry>
 679
 680     <varlistentry>
 681      <term>index</term>
 682      <listitem>
 683       <para>
 684        The index being considered.
 685       </para>
 686      </listitem>
 687     </varlistentry>
 688
 689     <varlistentry>
 690      <term>indexQuals</term>
 691      <listitem>
 692       <para>
 693        List of index qual clauses (implicitly ANDed);
 694        a NIL list indicates no qualifiers are available.
 695        Note that the list contains expression trees, not ScanKeys.
 696       </para>
 697      </listitem>
 698     </varlistentry>
 699    </variablelist>
 700   </para>
 701
 702   <para>
 703    The last four parameters are pass-by-reference outputs:
 704
 705    <variablelist>
 706     <varlistentry>
 707      <term>*indexStartupCost</term>
 708      <listitem>
 709       <para>
 710        Set to cost of index start-up processing
 711       </para>
 712      </listitem>
 713     </varlistentry>
 714
 715     <varlistentry>
 716      <term>*indexTotalCost</term>
 717      <listitem>
 718       <para>
 719        Set to total cost of index processing
 720       </para>
 721      </listitem>
 722     </varlistentry>
 723
 724     <varlistentry>
 725      <term>*indexSelectivity</term>
 726      <listitem>
 727       <para>
 728        Set to index selectivity
 729       </para>
 730      </listitem>
 731     </varlistentry>
 732
 733     <varlistentry>
 734      <term>*indexCorrelation</term>
 735      <listitem>
 736       <para>
 737        Set to correlation coefficient between index scan order and
 738        underlying table's order
 739       </para>
 740      </listitem>
 741     </varlistentry>
 742    </variablelist>
 743   </para>
 744
 745   <para>
 746    Note that cost estimate functions must be written in C, not in SQL or
 747    any available procedural language, because they must access internal
 748    data structures of the planner/optimizer.
 749   </para>
 750
 751   <para>
 752    The index access costs should be computed in the units used by
 753    <filename>src/backend/optimizer/path/costsize.c</filename>: a sequential
 754    disk block fetch has cost 1.0, a nonsequential fetch has cost
 755    <varname>random_page_cost</>, and the cost of processing one index row
 756    should usually be taken as <varname>cpu_index_tuple_cost</>.  In addition,
 757    an appropriate multiple of <varname>cpu_operator_cost</> should be charged
 758    for any comparison operators invoked during index processing (especially
 759    evaluation of the indexQuals themselves).
 760   </para>
 761
 762   <para>
 763    The access costs should include all disk and CPU costs associated with
 764    scanning the index itself, but <emphasis>not</> the costs of retrieving or
 765    processing the parent-table rows that are identified by the index.
 766   </para>
 767
 768   <para>
 769    The <quote>start-up cost</quote> is the part of the total scan cost that must be expended
 770    before we can begin to fetch the first row.  For most indexes this can
 771    be taken as zero, but an index type with a high start-up cost might want
 772    to set it nonzero.
 773   </para>
 774
 775   <para>
 776    The indexSelectivity should be set to the estimated fraction of the parent
 777    table rows that will be retrieved during the index scan.  In the case
 778    of a lossy index, this will typically be higher than the fraction of
 779    rows that actually pass the given qual conditions.
 780   </para>
 781
 782   <para>
 783    The indexCorrelation should be set to the correlation (ranging between
 784    -1.0 and 1.0) between the index order and the table order.  This is used
 785    to adjust the estimate for the cost of fetching rows from the parent
 786    table.
 787   </para>
 788
 789   <procedure>
 790    <title>Cost Estimation</title>
 791    <para>
 792     A typical cost estimator will proceed as follows:
 793    </para>
 794
 795    <step>
 796     <para>
 797      Estimate and return the fraction of parent-table rows that will be visited
 798      based on the given qual conditions.  In the absence of any index-type-specific
 799      knowledge, use the standard optimizer function <function>clauselist_selectivity()</function>:
 800
 801 <programlisting>
 802 *indexSelectivity = clauselist_selectivity(root, indexQuals,
 803                                            index-&gt;rel-&gt;relid, JOIN_INNER);
 804 </programlisting>
 805     </para>
 806    </step>
 807
 808    <step>
 809     <para>
 810      Estimate the number of index rows that will be visited during the
 811      scan.  For many index types this is the same as indexSelectivity times
 812      the number of rows in the index, but it might be more.  (Note that the
 813      index's size in pages and rows is available from the IndexOptInfo struct.)
 814     </para>
 815    </step>
 816
 817    <step>
 818     <para>
 819      Estimate the number of index pages that will be retrieved during the scan.
 820      This might be just indexSelectivity times the index's size in pages.
 821     </para>
 822    </step>
 823
 824    <step>
 825     <para>
 826      Compute the index access cost.  A generic estimator might do this:
 827
 828 <programlisting>
 829     /*
 830      * Our generic assumption is that the index pages will be read
 831      * sequentially, so they have cost 1.0 each, not random_page_cost.
 832      * Also, we charge for evaluation of the indexquals at each index row.
 833      * All the costs are assumed to be paid incrementally during the scan.
 834      */
 835     cost_qual_eval(&amp;index_qual_cost, indexQuals);
 836     *indexStartupCost = index_qual_cost.startup;
 837     *indexTotalCost = numIndexPages +
 838         (cpu_index_tuple_cost + index_qual_cost.per_tuple) * numIndexTuples;
 839 </programlisting>
 840     </para>
 841    </step>
 842
 843    <step>
 844     <para>
 845      Estimate the index correlation.  For a simple ordered index on a single
 846      field, this can be retrieved from pg_statistic.  If the correlation
 847      is not known, the conservative estimate is zero (no correlation).
 848     </para>
 849    </step>
 850   </procedure>
 851
 852   <para>
 853    Examples of cost estimator functions can be found in
 854    <filename>src/backend/utils/adt/selfuncs.c</filename>.
 855   </para>
 856  </sect1>
 857 </chapter>
 858
 859 <!-- Keep this comment at the end of the file
 860 Local variables:
 861 mode:sgml
 862 sgml-omittag:nil
 863 sgml-shorttag:t
 864 sgml-minimize-attributes:nil
 865 sgml-always-quote-attributes:t
 866 sgml-indent-step:1
 867 sgml-indent-data:t
 868 sgml-parent-document:nil
 869 sgml-default-dtd-file:"./reference.ced"
 870 sgml-exposed-tags:nil
 871 sgml-local-catalogs:("/usr/lib/sgml/catalog")
 872 sgml-local-ecat-files:nil
 873 End:
 874 -->