granicus.if.org Git - postgresql/blob - doc/src/sgml/indexam.sgml

   1 <!-- $PostgreSQL: pgsql/doc/src/sgml/indexam.sgml,v 2.26 2008/04/14 17:05:32 tgl Exp $ -->
   2
   3 <chapter id="indexam">
   4  <title>Index Access Method Interface Definition</title>
   5
   6   <para>
   7    This chapter defines the interface between the core
   8    <productname>PostgreSQL</productname> system and <firstterm>index access
   9    methods</>, which manage individual index types.  The core system
  10    knows nothing about indexes beyond what is specified here, so it is
  11    possible to develop entirely new index types by writing add-on code.
  12   </para>
  13
  14   <para>
  15    All indexes in <productname>PostgreSQL</productname> are what are known
  16    technically as <firstterm>secondary indexes</>; that is, the index is
  17    physically separate from the table file that it describes.  Each index
  18    is stored as its own physical <firstterm>relation</> and so is described
  19    by an entry in the <structname>pg_class</> catalog.  The contents of an
  20    index are entirely under the control of its index access method.  In
  21    practice, all index access methods divide indexes into standard-size
  22    pages so that they can use the regular storage manager and buffer manager
  23    to access the index contents.  (All the existing index access methods
  24    furthermore use the standard page layout described in <xref
  25    linkend="storage-page-layout">, and they all use the same format for index
  26    tuple headers; but these decisions are not forced on an access method.)
  27   </para>
  28
  29   <para>
  30    An index is effectively a mapping from some data key values to
  31    <firstterm>tuple identifiers</>, or <acronym>TIDs</>, of row versions
  32    (tuples) in the index's parent table.  A TID consists of a
  33    block number and an item number within that block (see <xref
  34    linkend="storage-page-layout">).  This is sufficient
  35    information to fetch a particular row version from the table.
  36    Indexes are not directly aware that under MVCC, there might be multiple
  37    extant versions of the same logical row; to an index, each tuple is
  38    an independent object that needs its own index entry.  Thus, an
  39    update of a row always creates all-new index entries for the row, even if
  40    the key values did not change.  Index entries for dead tuples are
  41    reclaimed (by vacuuming) when the dead tuples themselves are reclaimed.
  42   </para>
  43
  44  <sect1 id="index-catalog">
  45   <title>Catalog Entries for Indexes</title>
  46
  47   <para>
  48    Each index access method is described by a row in the
  49    <structname>pg_am</structname> system catalog (see
  50    <xref linkend="catalog-pg-am">).  The principal contents of a
  51    <structname>pg_am</structname> row are references to
  52    <link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>
  53    entries that identify the index access
  54    functions supplied by the access method.  The APIs for these functions
  55    are defined later in this chapter.  In addition, the
  56    <structname>pg_am</structname> row specifies a few fixed properties of
  57    the access method, such as whether it can support multicolumn indexes.
  58    There is not currently any special support
  59    for creating or deleting <structname>pg_am</structname> entries;
  60    anyone able to write a new access method is expected to be competent
  61    to insert an appropriate row for themselves.
  62   </para>
  63
  64   <para>
  65    To be useful, an index access method must also have one or more
  66    <firstterm>operator families</> and
  67    <firstterm>operator classes</> defined in
  68    <link linkend="catalog-pg-opfamily"><structname>pg_opfamily</structname></link>,
  69    <link linkend="catalog-pg-opclass"><structname>pg_opclass</structname></link>,
  70    <link linkend="catalog-pg-amop"><structname>pg_amop</structname></link>, and
  71    <link linkend="catalog-pg-amproc"><structname>pg_amproc</structname></link>.
  72    These entries allow the planner
  73    to determine what kinds of query qualifications can be used with
  74    indexes of this access method.  Operator families and classes are described
  75    in <xref linkend="xindex">, which is prerequisite material for reading
  76    this chapter.
  77   </para>
  78
  79   <para>
  80    An individual index is defined by a
  81    <link linkend="catalog-pg-class"><structname>pg_class</structname></link>
  82    entry that describes it as a physical relation, plus a
  83    <link linkend="catalog-pg-index"><structname>pg_index</structname></link>
  84    entry that shows the logical content of the index &mdash; that is, the set
  85    of index columns it has and the semantics of those columns, as captured by
  86    the associated operator classes.  The index columns (key values) can be
  87    either simple columns of the underlying table or expressions over the table
  88    rows.  The index access method normally has no interest in where the index
  89    key values come from (it is always handed precomputed key values) but it
  90    will be very interested in the operator class information in
  91    <structname>pg_index</structname>.  Both of these catalog entries can be
  92    accessed as part of the <structname>Relation</> data structure that is
  93    passed to all operations on the index.
  94   </para>
  95
  96   <para>
  97    Some of the flag columns of <structname>pg_am</structname> have nonobvious
  98    implications.  The requirements of <structfield>amcanunique</structfield>
  99    are discussed in <xref linkend="index-unique-checks">.
 100    The <structfield>amcanmulticol</structfield> flag asserts that the
 101    access method supports multicolumn indexes, while
 102    <structfield>amoptionalkey</structfield> asserts that it allows scans
 103    where no indexable restriction clause is given for the first index column.
 104    When <structfield>amcanmulticol</structfield> is false,
 105    <structfield>amoptionalkey</structfield> essentially says whether the
 106    access method allows full-index scans without any restriction clause.
 107    Access methods that support multiple index columns <emphasis>must</>
 108    support scans that omit restrictions on any or all of the columns after
 109    the first; however they are permitted to require some restriction to
 110    appear for the first index column, and this is signaled by setting
 111    <structfield>amoptionalkey</structfield> false.
 112    <structfield>amindexnulls</structfield> asserts that index entries are
 113    created for NULL key values.  Since most indexable operators are
 114    strict and hence cannot return TRUE for NULL inputs,
 115    it is at first sight attractive to not store index entries for null values:
 116    they could never be returned by an index scan anyway.  However, this
 117    argument fails when an index scan has no restriction clause for a given
 118    index column.  In practice this means that
 119    indexes that have <structfield>amoptionalkey</structfield> true must
 120    index nulls, since the planner might decide to use such an index
 121    with no scan keys at all.  A related restriction is that an index
 122    access method that supports multiple index columns <emphasis>must</>
 123    support indexing null values in columns after the first, because the planner
 124    will assume the index can be used for queries that do not restrict
 125    these columns.  For example, consider an index on (a,b) and a query with
 126    <literal>WHERE a = 4</literal>.  The system will assume the index can be
 127    used to scan for rows with <literal>a = 4</literal>, which is wrong if the
 128    index omits rows where <literal>b</> is null.
 129    It is, however, OK to omit rows where the first indexed column is null.
 130    Thus, <structfield>amindexnulls</structfield> should be set true only if the
 131    index access method indexes all rows, including arbitrary combinations of
 132    null values.  An index access method that sets
 133    <structfield>amindexnulls</structfield> may also set
 134    <structfield>amsearchnulls</structfield>, indicating that it supports
 135    <literal>IS NULL</> clauses as search conditions.
 136   </para>
 137
 138  </sect1>
 139
 140  <sect1 id="index-functions">
 141   <title>Index Access Method Functions</title>
 142
 143   <para>
 144    The index construction and maintenance functions that an index access
 145    method must provide are:
 146   </para>
 147
 148   <para>
 149 <programlisting>
 150 IndexBuildResult *
 151 ambuild (Relation heapRelation,
 152          Relation indexRelation,
 153          IndexInfo *indexInfo);
 154 </programlisting>
 155    Build a new index.  The index relation has been physically created,
 156    but is empty.  It must be filled in with whatever fixed data the
 157    access method requires, plus entries for all tuples already existing
 158    in the table.  Ordinarily the <function>ambuild</> function will call
 159    <function>IndexBuildHeapScan()</> to scan the table for existing tuples
 160    and compute the keys that need to be inserted into the index.
 161    The function must return a palloc'd struct containing statistics about
 162    the new index.
 163   </para>
 164
 165   <para>
 166 <programlisting>
 167 bool
 168 aminsert (Relation indexRelation,
 169           Datum *values,
 170           bool *isnull,
 171           ItemPointer heap_tid,
 172           Relation heapRelation,
 173           bool check_uniqueness);
 174 </programlisting>
 175    Insert a new tuple into an existing index.  The <literal>values</> and
 176    <literal>isnull</> arrays give the key values to be indexed, and
 177    <literal>heap_tid</> is the TID to be indexed.
 178    If the access method supports unique indexes (its
 179    <structname>pg_am</>.<structfield>amcanunique</> flag is true) then
 180    <literal>check_uniqueness</> might be true, in which case the access method
 181    must verify that there is no conflicting row; this is the only situation in
 182    which the access method normally needs the <literal>heapRelation</>
 183    parameter.  See <xref linkend="index-unique-checks"> for details.
 184    The result is TRUE if an index entry was inserted, FALSE if not. (A FALSE
 185    result does not denote an error condition, but is used for cases such
 186    as an index method refusing to index a NULL.)
 187   </para>
 188
 189   <para>
 190 <programlisting>
 191 IndexBulkDeleteResult *
 192 ambulkdelete (IndexVacuumInfo *info,
 193               IndexBulkDeleteResult *stats,
 194               IndexBulkDeleteCallback callback,
 195               void *callback_state);
 196 </programlisting>
 197    Delete tuple(s) from the index.  This is a <quote>bulk delete</> operation
 198    that is intended to be implemented by scanning the whole index and checking
 199    each entry to see if it should be deleted.
 200    The passed-in <literal>callback</> function must be called, in the style
 201    <literal>callback(<replaceable>TID</>, callback_state) returns bool</literal>,
 202    to determine whether any particular index entry, as identified by its
 203    referenced TID, is to be deleted.  Must return either NULL or a palloc'd
 204    struct containing statistics about the effects of the deletion operation.
 205    It is OK to return NULL if no information needs to be passed on to
 206    <function>amvacuumcleanup</>.
 207   </para>
 208
 209   <para>
 210    Because of limited <varname>maintenance_work_mem</>,
 211    <function>ambulkdelete</> might need to be called more than once when many
 212    tuples are to be deleted.  The <literal>stats</> argument is the result
 213    of the previous call for this index (it is NULL for the first call within a
 214    <command>VACUUM</> operation).  This allows the AM to accumulate statistics
 215    across the whole operation.  Typically, <function>ambulkdelete</> will
 216    modify and return the same struct if the passed <literal>stats</> is not
 217    null.
 218   </para>
 219
 220   <para>
 221 <programlisting>
 222 IndexBulkDeleteResult *
 223 amvacuumcleanup (IndexVacuumInfo *info,
 224                  IndexBulkDeleteResult *stats);
 225 </programlisting>
 226    Clean up after a <command>VACUUM</command> operation (zero or more
 227    <function>ambulkdelete</> calls).  This does not have to do anything
 228    beyond returning index statistics, but it might perform bulk cleanup
 229    such as reclaiming empty index pages.  <literal>stats</> is whatever the
 230    last <function>ambulkdelete</> call returned, or NULL if
 231    <function>ambulkdelete</> was not called because no tuples needed to be
 232    deleted.  If the result is not NULL it must be a palloc'd struct.
 233    The statistics it contains will be used to update <structname>pg_class</>,
 234    and will be reported by <command>VACUUM</> if <literal>VERBOSE</> is given.
 235    It is OK to return NULL if the index was not changed at all during the
 236    <command>VACUUM</command> operation, but otherwise correct stats should
 237    be returned.
 238   </para>
 239
 240   <para>
 241 <programlisting>
 242 void
 243 amcostestimate (PlannerInfo *root,
 244                 IndexOptInfo *index,
 245                 List *indexQuals,
 246                 RelOptInfo *outer_rel,
 247                 Cost *indexStartupCost,
 248                 Cost *indexTotalCost,
 249                 Selectivity *indexSelectivity,
 250                 double *indexCorrelation);
 251 </programlisting>
 252    Estimate the costs of an index scan.  This function is described fully
 253    in <xref linkend="index-cost-estimation">, below.
 254   </para>
 255
 256   <para>
 257 <programlisting>
 258 bytea *
 259 amoptions (ArrayType *reloptions,
 260            bool validate);
 261 </programlisting>
 262    Parse and validate the reloptions array for an index.  This is called only
 263    when a non-null reloptions array exists for the index.
 264    <parameter>reloptions</> is a <type>text</> array containing entries of the
 265    form <replaceable>name</><literal>=</><replaceable>value</>.
 266    The function should construct a <type>bytea</> value, which will be copied
 267    into the <structfield>rd_options</> field of the index's relcache entry.
 268    The data contents of the <type>bytea</> value are open for the access
 269    method to define, but the standard access methods currently all use struct
 270    <structname>StdRdOptions</>.
 271    When <parameter>validate</> is true, the function should report a suitable
 272    error message if any of the options are unrecognized or have invalid
 273    values; when <parameter>validate</> is false, invalid entries should be
 274    silently ignored.  (<parameter>validate</> is false when loading options
 275    already stored in <structname>pg_catalog</>; an invalid entry could only
 276    be found if the access method has changed its rules for options, and in
 277    that case ignoring obsolete entries is appropriate.)
 278    It is OK to return NULL if default behavior is wanted.
 279   </para>
 280
 281   <para>
 282    The purpose of an index, of course, is to support scans for tuples matching
 283    an indexable <literal>WHERE</> condition, often called a
 284    <firstterm>qualifier</> or <firstterm>scan key</>.  The semantics of
 285    index scanning are described more fully in <xref linkend="index-scanning">,
 286    below.  The scan-related functions that an index access method must provide
 287    are:
 288   </para>
 289
 290   <para>
 291 <programlisting>
 292 IndexScanDesc
 293 ambeginscan (Relation indexRelation,
 294              int nkeys,
 295              ScanKey key);
 296 </programlisting>
 297    Begin a new scan.  The <literal>key</> array (of length <literal>nkeys</>)
 298    describes the scan key(s) for the index scan.  The result must be a
 299    palloc'd struct. For implementation reasons the index access method
 300    <emphasis>must</> create this struct by calling
 301    <function>RelationGetIndexScan()</>.  In most cases
 302    <function>ambeginscan</> itself does little beyond making that call;
 303    the interesting parts of index-scan startup are in <function>amrescan</>.
 304   </para>
 305
 306   <para>
 307 <programlisting>
 308 boolean
 309 amgettuple (IndexScanDesc scan,
 310             ScanDirection direction);
 311 </programlisting>
 312    Fetch the next tuple in the given scan, moving in the given
 313    direction (forward or backward in the index).  Returns TRUE if a tuple was
 314    obtained, FALSE if no matching tuples remain.  In the TRUE case the tuple
 315    TID is stored into the <literal>scan</> structure.  Note that
 316    <quote>success</> means only that the index contains an entry that matches
 317    the scan keys, not that the tuple necessarily still exists in the heap or
 318    will pass the caller's snapshot test.  On success, <function>amgettuple</>
 319    must also set <literal>scan-&gt;xs_recheck</> to TRUE or FALSE.
 320    FALSE means it is certain that the index entry matches the scan keys.
 321    TRUE means this is not certain, and the conditions represented by the
 322    scan keys must be rechecked against the heap tuple after fetching it.
 323    This provision supports <quote>lossy</> index operators.
 324    Note that rechecking will extend only to the scan conditions; a partial
 325    index predicate (if any) is never rechecked by <function>amgettuple</>
 326    callers.
 327   </para>
 328
 329   <para>
 330 <programlisting>
 331 int64
 332 amgetbitmap (IndexScanDesc scan,
 333              TIDBitmap *tbm);
 334 </programlisting>
 335    Fetch all tuples in the given scan and add them to the caller-supplied
 336    TIDBitmap (that is, OR the set of tuple IDs into whatever set is already
 337    in the bitmap).  The number of tuples fetched is returned.
 338    While inserting tuple IDs into the bitmap, <function>amgetbitmap</> can
 339    indicate that rechecking of the scan conditions is required for specific
 340    tuple IDs.  This is analogous to the <literal>xs_recheck</> output parameter
 341    of <function>amgettuple</>.  Note: in the current implementation, support
 342    for this feature is conflated with support for lossy storage of the bitmap
 343    itself, and therefore callers recheck both the scan conditions and the
 344    partial index predicate (if any) for recheckable tuples.  That might not
 345    always be true, however.
 346    <function>amgetbitmap</> and
 347    <function>amgettuple</> cannot be used in the same index scan; there
 348    are other restrictions too when using <function>amgetbitmap</>, as explained
 349    in <xref linkend="index-scanning">.
 350   </para>
 351
 352   <para>
 353 <programlisting>
 354 void
 355 amrescan (IndexScanDesc scan,
 356           ScanKey key);
 357 </programlisting>
 358    Restart the given scan, possibly with new scan keys (to continue using
 359    the old keys, NULL is passed for <literal>key</>).  Note that it is not
 360    possible for the number of keys to be changed.  In practice the restart
 361    feature is used when a new outer tuple is selected by a nested-loop join
 362    and so a new key comparison value is needed, but the scan key structure
 363    remains the same.  This function is also called by
 364    <function>RelationGetIndexScan()</>, so it is used for initial setup
 365    of an index scan as well as rescanning.
 366   </para>
 367
 368   <para>
 369 <programlisting>
 370 void
 371 amendscan (IndexScanDesc scan);
 372 </programlisting>
 373    End a scan and release resources.  The <literal>scan</> struct itself
 374    should not be freed, but any locks or pins taken internally by the
 375    access method must be released.
 376   </para>
 377
 378   <para>
 379 <programlisting>
 380 void
 381 ammarkpos (IndexScanDesc scan);
 382 </programlisting>
 383    Mark current scan position.  The access method need only support one
 384    remembered scan position per scan.
 385   </para>
 386
 387   <para>
 388 <programlisting>
 389 void
 390 amrestrpos (IndexScanDesc scan);
 391 </programlisting>
 392    Restore the scan to the most recently marked position.
 393   </para>
 394
 395   <para>
 396    By convention, the <literal>pg_proc</literal> entry for an index
 397    access method function should show the correct number of arguments,
 398    but declare them all as type <type>internal</> (since most of the arguments
 399    have types that are not known to SQL, and we don't want users calling
 400    the functions directly anyway).  The return type is declared as
 401    <type>void</>, <type>internal</>, or <type>boolean</> as appropriate.
 402    The only exception is <function>amoptions</>, which should be correctly
 403    declared as taking <type>text[]</> and <type>bool</> and returning
 404    <type>bytea</>.  This provision allows client code to execute
 405    <function>amoptions</> to test validity of options settings.
 406   </para>
 407
 408  </sect1>
 409
 410  <sect1 id="index-scanning">
 411   <title>Index Scanning</title>
 412
 413   <para>
 414    In an index scan, the index access method is responsible for regurgitating
 415    the TIDs of all the tuples it has been told about that match the
 416    <firstterm>scan keys</>.  The access method is <emphasis>not</> involved in
 417    actually fetching those tuples from the index's parent table, nor in
 418    determining whether they pass the scan's time qualification test or other
 419    conditions.
 420   </para>
 421
 422   <para>
 423    A scan key is the internal representation of a <literal>WHERE</> clause of
 424    the form <replaceable>index_key</> <replaceable>operator</>
 425    <replaceable>constant</>, where the index key is one of the columns of the
 426    index and the operator is one of the members of the operator family
 427    associated with that index column.  An index scan has zero or more scan
 428    keys, which are implicitly ANDed &mdash; the returned tuples are expected
 429    to satisfy all the indicated conditions.
 430   </para>
 431
 432   <para>
 433    The access method can report that the index is <firstterm>lossy</>, or
 434    requires rechecks, for a particular query.  This implies that the index
 435    scan will return all the entries that pass the scan key, plus possibly
 436    additional entries that do not.  The core system's index-scan machinery
 437    will then apply the index conditions again to the heap tuple to verify
 438    whether or not it really should be selected.  If the recheck option is not
 439    specified, the index scan must return exactly the set of matching entries.
 440   </para>
 441
 442   <para>
 443    Note that it is entirely up to the access method to ensure that it
 444    correctly finds all and only the entries passing all the given scan keys.
 445    Also, the core system will simply hand off all the <literal>WHERE</>
 446    clauses that match the index keys and operator families, without any
 447    semantic analysis to determine whether they are redundant or
 448    contradictory.  As an example, given
 449    <literal>WHERE x &gt; 4 AND x &gt; 14</> where <literal>x</> is a b-tree
 450    indexed column, it is left to the b-tree <function>amrescan</> function
 451    to realize that the first scan key is redundant and can be discarded.
 452    The extent of preprocessing needed during <function>amrescan</> will
 453    depend on the extent to which the index access method needs to reduce
 454    the scan keys to a <quote>normalized</> form.
 455   </para>
 456
 457   <para>
 458    Some access methods return index entries in a well-defined order, others
 459    do not.  If entries are returned in sorted order, the access method should
 460    set <structname>pg_am</>.<structfield>amcanorder</> true to indicate that
 461    it supports ordered scans.
 462    All such access methods must use btree-compatible strategy numbers for
 463    their equality and ordering operators.
 464   </para>
 465
 466   <para>
 467    The <function>amgettuple</> function has a <literal>direction</> argument,
 468    which can be either <literal>ForwardScanDirection</> (the normal case)
 469    or  <literal>BackwardScanDirection</>.  If the first call after
 470    <function>amrescan</> specifies <literal>BackwardScanDirection</>, then the
 471    set of matching index entries is to be scanned back-to-front rather than in
 472    the normal front-to-back direction, so <function>amgettuple</> must return
 473    the last matching tuple in the index, rather than the first one as it
 474    normally would.  (This will only occur for access
 475    methods that advertise they support ordered scans.)  After the
 476    first call, <function>amgettuple</> must be prepared to advance the scan in
 477    either direction from the most recently returned entry.
 478   </para>
 479
 480   <para>
 481    The access method must support <quote>marking</> a position in a scan
 482    and later returning to the marked position.  The same position might be
 483    restored multiple times.  However, only one position need be remembered
 484    per scan; a new <function>ammarkpos</> call overrides the previously
 485    marked position.
 486   </para>
 487
 488   <para>
 489    Both the scan position and the mark position (if any) must be maintained
 490    consistently in the face of concurrent insertions or deletions in the
 491    index.  It is OK if a freshly-inserted entry is not returned by a scan that
 492    would have found the entry if it had existed when the scan started, or for
 493    the scan to return such an entry upon rescanning or backing
 494    up even though it had not been returned the first time through.  Similarly,
 495    a concurrent delete might or might not be reflected in the results of a scan.
 496    What is important is that insertions or deletions not cause the scan to
 497    miss or multiply return entries that were not themselves being inserted or
 498    deleted.
 499   </para>
 500
 501   <para>
 502    Instead of using <function>amgettuple</>, an index scan can be done with
 503    <function>amgetbitmap</> to fetch all tuples in one call.  This can be
 504    noticeably more efficient than <function>amgettuple</> because it allows
 505    avoiding lock/unlock cycles within the access method.  In principle
 506    <function>amgetbitmap</> should have the same effects as repeated
 507    <function>amgettuple</> calls, but we impose several restrictions to
 508    simplify matters.  First of all, <function>amgetbitmap</> returns all
 509    tuples at once and marking or restoring scan positions isn't
 510    supported. Secondly, the tuples are returned in a bitmap which doesn't
 511    have any specific ordering, which is why <function>amgetbitmap</> doesn't
 512    take a <literal>direction</> argument.  Finally, <function>amgetbitmap</>
 513    does not guarantee any locking of the returned tuples, with implications
 514    spelled out in <xref linkend="index-locking">.
 515   </para>
 516
 517  </sect1>
 518
 519  <sect1 id="index-locking">
 520   <title>Index Locking Considerations</title>
 521
 522   <para>
 523    Index access methods must handle concurrent updates
 524    of the index by multiple processes.
 525    The core <productname>PostgreSQL</productname> system obtains
 526    <literal>AccessShareLock</> on the index during an index scan, and
 527    <literal>RowExclusiveLock</> when updating the index (including plain
 528    <command>VACUUM</>).  Since these lock
 529    types do not conflict, the access method is responsible for handling any
 530    fine-grained locking it might need.  An exclusive lock on the index as a whole
 531    will be taken only during index creation, destruction,
 532    <command>REINDEX</>, or <command>VACUUM FULL</>.
 533   </para>
 534
 535   <para>
 536    Building an index type that supports concurrent updates usually requires
 537    extensive and subtle analysis of the required behavior.  For the b-tree
 538    and hash index types, you can read about the design decisions involved in
 539    <filename>src/backend/access/nbtree/README</> and
 540    <filename>src/backend/access/hash/README</>.
 541   </para>
 542
 543   <para>
 544    Aside from the index's own internal consistency requirements, concurrent
 545    updates create issues about consistency between the parent table (the
 546    <firstterm>heap</>) and the index.  Because
 547    <productname>PostgreSQL</productname> separates accesses
 548    and updates of the heap from those of the index, there are windows in
 549    which the index might be inconsistent with the heap.  We handle this problem
 550    with the following rules:
 551
 552     <itemizedlist>
 553      <listitem>
 554       <para>
 555        A new heap entry is made before making its index entries.  (Therefore
 556        a concurrent index scan is likely to fail to see the heap entry.
 557        This is okay because the index reader would be uninterested in an
 558        uncommitted row anyway.  But see <xref linkend="index-unique-checks">.)
 559       </para>
 560      </listitem>
 561      <listitem>
 562       <para>
 563        When a heap entry is to be deleted (by <command>VACUUM</>), all its
 564        index entries must be removed first.
 565       </para>
 566      </listitem>
 567      <listitem>
 568       <para>
 569        An index scan must maintain a pin
 570        on the index page holding the item last returned by
 571        <function>amgettuple</>, and <function>ambulkdelete</> cannot delete
 572        entries from pages that are pinned by other backends.  The need
 573        for this rule is explained below.
 574       </para>
 575      </listitem>
 576     </itemizedlist>
 577
 578    Without the third rule, it is possible for an index reader to
 579    see an index entry just before it is removed by <command>VACUUM</>, and
 580    then to arrive at the corresponding heap entry after that was removed by
 581    <command>VACUUM</>.
 582    This creates no serious problems if that item
 583    number is still unused when the reader reaches it, since an empty
 584    item slot will be ignored by <function>heap_fetch()</>.  But what if a
 585    third backend has already re-used the item slot for something else?
 586    When using an MVCC-compliant snapshot, there is no problem because
 587    the new occupant of the slot is certain to be too new to pass the
 588    snapshot test.  However, with a non-MVCC-compliant snapshot (such as
 589    <literal>SnapshotNow</>), it would be possible to accept and return
 590    a row that does not in fact match the scan keys.  We could defend
 591    against this scenario by requiring the scan keys to be rechecked
 592    against the heap row in all cases, but that is too expensive.  Instead,
 593    we use a pin on an index page as a proxy to indicate that the reader
 594    might still be <quote>in flight</> from the index entry to the matching
 595    heap entry.  Making <function>ambulkdelete</> block on such a pin ensures
 596    that <command>VACUUM</> cannot delete the heap entry before the reader
 597    is done with it.  This solution costs little in run time, and adds blocking
 598    overhead only in the rare cases where there actually is a conflict.
 599   </para>
 600
 601   <para>
 602    This solution requires that index scans be <quote>synchronous</>: we have
 603    to fetch each heap tuple immediately after scanning the corresponding index
 604    entry.  This is expensive for a number of reasons.  An
 605    <quote>asynchronous</> scan in which we collect many TIDs from the index,
 606    and only visit the heap tuples sometime later, requires much less index
 607    locking overhead and can allow a more efficient heap access pattern.
 608    Per the above analysis, we must use the synchronous approach for
 609    non-MVCC-compliant snapshots, but an asynchronous scan is workable
 610    for a query using an MVCC snapshot.
 611   </para>
 612
 613   <para>
 614    In an <function>amgetbitmap</> index scan, the access method does not
 615    keep an index pin on any of the returned tuples.  Therefore
 616    it is only safe to use such scans with MVCC-compliant snapshots.
 617   </para>
 618
 619  </sect1>
 620
 621  <sect1 id="index-unique-checks">
 622   <title>Index Uniqueness Checks</title>
 623
 624   <para>
 625    <productname>PostgreSQL</productname> enforces SQL uniqueness constraints
 626    using <firstterm>unique indexes</>, which are indexes that disallow
 627    multiple entries with identical keys.  An access method that supports this
 628    feature sets <structname>pg_am</>.<structfield>amcanunique</> true.
 629    (At present, only b-tree supports it.)
 630   </para>
 631
 632   <para>
 633    Because of MVCC, it is always necessary to allow duplicate entries to
 634    exist physically in an index: the entries might refer to successive
 635    versions of a single logical row.  The behavior we actually want to
 636    enforce is that no MVCC snapshot could include two rows with equal
 637    index keys.  This breaks down into the following cases that must be
 638    checked when inserting a new row into a unique index:
 639
 640     <itemizedlist>
 641      <listitem>
 642       <para>
 643        If a conflicting valid row has been deleted by the current transaction,
 644        it's okay.  (In particular, since an UPDATE always deletes the old row
 645        version before inserting the new version, this will allow an UPDATE on
 646        a row without changing the key.)
 647       </para>
 648      </listitem>
 649      <listitem>
 650       <para>
 651        If a conflicting row has been inserted by an as-yet-uncommitted
 652        transaction, the would-be inserter must wait to see if that transaction
 653        commits.  If it rolls back then there is no conflict.  If it commits
 654        without deleting the conflicting row again, there is a uniqueness
 655        violation.  (In practice we just wait for the other transaction to
 656        end and then redo the visibility check in toto.)
 657       </para>
 658      </listitem>
 659      <listitem>
 660       <para>
 661        Similarly, if a conflicting valid row has been deleted by an
 662        as-yet-uncommitted transaction, the would-be inserter must wait
 663        for that transaction to commit or abort, and then repeat the test.
 664       </para>
 665      </listitem>
 666     </itemizedlist>
 667   </para>
 668
 669   <para>
 670    Furthermore, immediately before raising a uniqueness violation
 671    according to the above rules, the access method must recheck the
 672    liveness of the row being inserted.  If it is committed dead then
 673    no error should be raised.  (This case cannot occur during the
 674    ordinary scenario of inserting a row that's just been created by
 675    the current transaction.  It can happen during
 676    <command>CREATE UNIQUE INDEX CONCURRENTLY</>, however.)
 677   </para>
 678
 679   <para>
 680    We require the index access method to apply these tests itself, which
 681    means that it must reach into the heap to check the commit status of
 682    any row that is shown to have a duplicate key according to the index
 683    contents.  This is without a doubt ugly and non-modular, but it saves
 684    redundant work: if we did a separate probe then the index lookup for
 685    a conflicting row would be essentially repeated while finding the place to
 686    insert the new row's index entry.  What's more, there is no obvious way
 687    to avoid race conditions unless the conflict check is an integral part
 688    of insertion of the new index entry.
 689   </para>
 690
 691   <para>
 692    The main limitation of this scheme is that it has no convenient way
 693    to support deferred uniqueness checks.
 694   </para>
 695
 696  </sect1>
 697
 698  <sect1 id="index-cost-estimation">
 699   <title>Index Cost Estimation Functions</title>
 700
 701   <para>
 702    The amcostestimate function is given a list of WHERE clauses that have
 703    been determined to be usable with the index.  It must return estimates
 704    of the cost of accessing the index and the selectivity of the WHERE
 705    clauses (that is, the fraction of parent-table rows that will be
 706    retrieved during the index scan).  For simple cases, nearly all the
 707    work of the cost estimator can be done by calling standard routines
 708    in the optimizer; the point of having an amcostestimate function is
 709    to allow index access methods to provide index-type-specific knowledge,
 710    in case it is possible to improve on the standard estimates.
 711   </para>
 712
 713   <para>
 714    Each amcostestimate function must have the signature:
 715
 716 <programlisting>
 717 void
 718 amcostestimate (PlannerInfo *root,
 719                 IndexOptInfo *index,
 720                 List *indexQuals,
 721                 RelOptInfo *outer_rel,
 722                 Cost *indexStartupCost,
 723                 Cost *indexTotalCost,
 724                 Selectivity *indexSelectivity,
 725                 double *indexCorrelation);
 726 </programlisting>
 727
 728    The first four parameters are inputs:
 729
 730    <variablelist>
 731     <varlistentry>
 732      <term>root</term>
 733      <listitem>
 734       <para>
 735        The planner's information about the query being processed.
 736       </para>
 737      </listitem>
 738     </varlistentry>
 739
 740     <varlistentry>
 741      <term>index</term>
 742      <listitem>
 743       <para>
 744        The index being considered.
 745       </para>
 746      </listitem>
 747     </varlistentry>
 748
 749     <varlistentry>
 750      <term>indexQuals</term>
 751      <listitem>
 752       <para>
 753        List of index qual clauses (implicitly ANDed);
 754        a NIL list indicates no qualifiers are available.
 755        Note that the list contains expression trees, not ScanKeys.
 756       </para>
 757      </listitem>
 758     </varlistentry>
 759
 760     <varlistentry>
 761      <term>outer_rel</term>
 762      <listitem>
 763       <para>
 764        If the index is being considered for use in a join inner indexscan,
 765        the planner's information about the outer side of the join.  Otherwise
 766        NULL.  When non-NULL, some of the qual clauses will be join clauses
 767        with this rel rather than being simple restriction clauses.  Also,
 768        the cost estimator should expect that the index scan will be repeated
 769        for each row of the outer rel.
 770       </para>
 771      </listitem>
 772     </varlistentry>
 773    </variablelist>
 774   </para>
 775
 776   <para>
 777    The last four parameters are pass-by-reference outputs:
 778
 779    <variablelist>
 780     <varlistentry>
 781      <term>*indexStartupCost</term>
 782      <listitem>
 783       <para>
 784        Set to cost of index start-up processing
 785       </para>
 786      </listitem>
 787     </varlistentry>
 788
 789     <varlistentry>
 790      <term>*indexTotalCost</term>
 791      <listitem>
 792       <para>
 793        Set to total cost of index processing
 794       </para>
 795      </listitem>
 796     </varlistentry>
 797
 798     <varlistentry>
 799      <term>*indexSelectivity</term>
 800      <listitem>
 801       <para>
 802        Set to index selectivity
 803       </para>
 804      </listitem>
 805     </varlistentry>
 806
 807     <varlistentry>
 808      <term>*indexCorrelation</term>
 809      <listitem>
 810       <para>
 811        Set to correlation coefficient between index scan order and
 812        underlying table's order
 813       </para>
 814      </listitem>
 815     </varlistentry>
 816    </variablelist>
 817   </para>
 818
 819   <para>
 820    Note that cost estimate functions must be written in C, not in SQL or
 821    any available procedural language, because they must access internal
 822    data structures of the planner/optimizer.
 823   </para>
 824
 825   <para>
 826    The index access costs should be computed using the parameters used by
 827    <filename>src/backend/optimizer/path/costsize.c</filename>: a sequential
 828    disk block fetch has cost <varname>seq_page_cost</>, a nonsequential fetch
 829    has cost <varname>random_page_cost</>, and the cost of processing one index
 830    row should usually be taken as <varname>cpu_index_tuple_cost</>.  In
 831    addition, an appropriate multiple of <varname>cpu_operator_cost</> should
 832    be charged for any comparison operators invoked during index processing
 833    (especially evaluation of the indexQuals themselves).
 834   </para>
 835
 836   <para>
 837    The access costs should include all disk and CPU costs associated with
 838    scanning the index itself, but <emphasis>not</> the costs of retrieving or
 839    processing the parent-table rows that are identified by the index.
 840   </para>
 841
 842   <para>
 843    The <quote>start-up cost</quote> is the part of the total scan cost that
 844    must be expended before we can begin to fetch the first row.  For most
 845    indexes this can be taken as zero, but an index type with a high start-up
 846    cost might want to set it nonzero.
 847   </para>
 848
 849   <para>
 850    The indexSelectivity should be set to the estimated fraction of the parent
 851    table rows that will be retrieved during the index scan.  In the case
 852    of a lossy query, this will typically be higher than the fraction of
 853    rows that actually pass the given qual conditions.
 854   </para>
 855
 856   <para>
 857    The indexCorrelation should be set to the correlation (ranging between
 858    -1.0 and 1.0) between the index order and the table order.  This is used
 859    to adjust the estimate for the cost of fetching rows from the parent
 860    table.
 861   </para>
 862
 863   <para>
 864    In the join case, the returned numbers should be averages expected for
 865    any one scan of the index.
 866   </para>
 867
 868   <procedure>
 869    <title>Cost Estimation</title>
 870    <para>
 871     A typical cost estimator will proceed as follows:
 872    </para>
 873
 874    <step>
 875     <para>
 876      Estimate and return the fraction of parent-table rows that will be visited
 877      based on the given qual conditions.  In the absence of any index-type-specific
 878      knowledge, use the standard optimizer function <function>clauselist_selectivity()</function>:
 879
 880 <programlisting>
 881 *indexSelectivity = clauselist_selectivity(root, indexQuals,
 882                                            index-&gt;rel-&gt;relid, JOIN_INNER);
 883 </programlisting>
 884     </para>
 885    </step>
 886
 887    <step>
 888     <para>
 889      Estimate the number of index rows that will be visited during the
 890      scan.  For many index types this is the same as indexSelectivity times
 891      the number of rows in the index, but it might be more.  (Note that the
 892      index's size in pages and rows is available from the IndexOptInfo struct.)
 893     </para>
 894    </step>
 895
 896    <step>
 897     <para>
 898      Estimate the number of index pages that will be retrieved during the scan.
 899      This might be just indexSelectivity times the index's size in pages.
 900     </para>
 901    </step>
 902
 903    <step>
 904     <para>
 905      Compute the index access cost.  A generic estimator might do this:
 906
 907 <programlisting>
 908     /*
 909      * Our generic assumption is that the index pages will be read
 910      * sequentially, so they cost seq_page_cost each, not random_page_cost.
 911      * Also, we charge for evaluation of the indexquals at each index row.
 912      * All the costs are assumed to be paid incrementally during the scan.
 913      */
 914     cost_qual_eval(&amp;index_qual_cost, indexQuals, root);
 915     *indexStartupCost = index_qual_cost.startup;
 916     *indexTotalCost = seq_page_cost * numIndexPages +
 917         (cpu_index_tuple_cost + index_qual_cost.per_tuple) * numIndexTuples;
 918 </programlisting>
 919
 920      However, the above does not account for amortization of index reads
 921      across repeated index scans in the join case.
 922     </para>
 923    </step>
 924
 925    <step>
 926     <para>
 927      Estimate the index correlation.  For a simple ordered index on a single
 928      field, this can be retrieved from pg_statistic.  If the correlation
 929      is not known, the conservative estimate is zero (no correlation).
 930     </para>
 931    </step>
 932   </procedure>
 933
 934   <para>
 935    Examples of cost estimator functions can be found in
 936    <filename>src/backend/utils/adt/selfuncs.c</filename>.
 937   </para>
 938  </sect1>
 939 </chapter>