granicus.if.org Git - postgresql/blob - doc/src/sgml/indexam.sgml

   1 <!-- doc/src/sgml/indexam.sgml -->
   2
   3 <chapter id="indexam">
   4  <title>Index Access Method Interface Definition</title>
   5
   6   <para>
   7    This chapter defines the interface between the core
   8    <productname>PostgreSQL</productname> system and <firstterm>index access
   9    methods</>, which manage individual index types.  The core system
  10    knows nothing about indexes beyond what is specified here, so it is
  11    possible to develop entirely new index types by writing add-on code.
  12   </para>
  13
  14   <para>
  15    All indexes in <productname>PostgreSQL</productname> are what are known
  16    technically as <firstterm>secondary indexes</>; that is, the index is
  17    physically separate from the table file that it describes.  Each index
  18    is stored as its own physical <firstterm>relation</> and so is described
  19    by an entry in the <structname>pg_class</> catalog.  The contents of an
  20    index are entirely under the control of its index access method.  In
  21    practice, all index access methods divide indexes into standard-size
  22    pages so that they can use the regular storage manager and buffer manager
  23    to access the index contents.  (All the existing index access methods
  24    furthermore use the standard page layout described in <xref
  25    linkend="storage-page-layout">, and most use the same format for index
  26    tuple headers; but these decisions are not forced on an access method.)
  27   </para>
  28
  29   <para>
  30    An index is effectively a mapping from some data key values to
  31    <firstterm>tuple identifiers</>, or <acronym>TIDs</>, of row versions
  32    (tuples) in the index's parent table.  A TID consists of a
  33    block number and an item number within that block (see <xref
  34    linkend="storage-page-layout">).  This is sufficient
  35    information to fetch a particular row version from the table.
  36    Indexes are not directly aware that under MVCC, there might be multiple
  37    extant versions of the same logical row; to an index, each tuple is
  38    an independent object that needs its own index entry.  Thus, an
  39    update of a row always creates all-new index entries for the row, even if
  40    the key values did not change.  (HOT tuples are an exception to this
  41    statement; but indexes do not deal with those, either.)  Index entries for
  42    dead tuples are reclaimed (by vacuuming) when the dead tuples themselves
  43    are reclaimed.
  44   </para>
  45
  46  <sect1 id="index-api">
  47   <title>Basic API Structure for Indexes</title>
  48
  49   <para>
  50    Each index access method is described by a row in the
  51    <link linkend="catalog-pg-am"><structname>pg_am</structname></link>
  52    system catalog.  The <structname>pg_am</structname> entry
  53    specifies a name and a <firstterm>handler function</> for the access
  54    method.  These entries can be created and deleted using the
  55    <xref linkend="sql-create-access-method"> and
  56    <xref linkend="sql-drop-access-method"> SQL commands.
  57   </para>
  58
  59   <para>
  60    An index access method handler function must be declared to accept a
  61    single argument of type <type>internal</> and to return the
  62    pseudo-type <type>index_am_handler</>.  The argument is a dummy value that
  63    simply serves to prevent handler functions from being called directly from
  64    SQL commands.  The result of the function must be a palloc'd struct of
  65    type <structname>IndexAmRoutine</structname>, which contains everything
  66    that the core code needs to know to make use of the index access method.
  67    The <structname>IndexAmRoutine</structname> struct, also called the access
  68    method's <firstterm>API struct</>, includes fields specifying assorted
  69    fixed properties of the access method, such as whether it can support
  70    multicolumn indexes.  More importantly, it contains pointers to support
  71    functions for the access method, which do all of the real work to access
  72    indexes.  These support functions are plain C functions and are not
  73    visible or callable at the SQL level.  The support functions are described
  74    in <xref linkend="index-functions">.
  75   </para>
  76
  77   <para>
  78    The structure <structname>IndexAmRoutine</structname> is defined thus:
  79 <programlisting>
  80 typedef struct IndexAmRoutine
  81 {
  82     NodeTag     type;
  83
  84     /*
  85      * Total number of strategies (operators) by which we can traverse/search
  86      * this AM.  Zero if AM does not have a fixed set of strategy assignments.
  87      */
  88     uint16      amstrategies;
  89     /* total number of support functions that this AM uses */
  90     uint16      amsupport;
  91     /* does AM support ORDER BY indexed column's value? */
  92     bool        amcanorder;
  93     /* does AM support ORDER BY result of an operator on indexed column? */
  94     bool        amcanorderbyop;
  95     /* does AM support backward scanning? */
  96     bool        amcanbackward;
  97     /* does AM support UNIQUE indexes? */
  98     bool        amcanunique;
  99     /* does AM support multi-column indexes? */
 100     bool        amcanmulticol;
 101     /* does AM require scans to have a constraint on the first index column? */
 102     bool        amoptionalkey;
 103     /* does AM handle ScalarArrayOpExpr quals? */
 104     bool        amsearcharray;
 105     /* does AM handle IS NULL/IS NOT NULL quals? */
 106     bool        amsearchnulls;
 107     /* can index storage data type differ from column data type? */
 108     bool        amstorage;
 109     /* can an index of this type be clustered on? */
 110     bool        amclusterable;
 111     /* does AM handle predicate locks? */
 112     bool        ampredlocks;
 113     /* type of data stored in index, or InvalidOid if variable */
 114     Oid         amkeytype;
 115
 116     /* interface functions */
 117     ambuild_function ambuild;
 118     ambuildempty_function ambuildempty;
 119     aminsert_function aminsert;
 120     ambulkdelete_function ambulkdelete;
 121     amvacuumcleanup_function amvacuumcleanup;
 122     amcanreturn_function amcanreturn;   /* can be NULL */
 123     amcostestimate_function amcostestimate;
 124     amoptions_function amoptions;
 125     amproperty_function amproperty;     /* can be NULL */
 126     amvalidate_function amvalidate;
 127     ambeginscan_function ambeginscan;
 128     amrescan_function amrescan;
 129     amgettuple_function amgettuple;     /* can be NULL */
 130     amgetbitmap_function amgetbitmap;   /* can be NULL */
 131     amendscan_function amendscan;
 132     ammarkpos_function ammarkpos;       /* can be NULL */
 133     amrestrpos_function amrestrpos;     /* can be NULL */
 134
 135     /* interface functions to support parallel index scans */
 136     amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
 137     aminitparallelscan_function aminitparallelscan;    /* can be NULL */
 138     amparallelrescan_function amparallelrescan;    /* can be NULL */
 139 } IndexAmRoutine;
 140 </programlisting>
 141   </para>
 142
 143   <para>
 144    To be useful, an index access method must also have one or more
 145    <firstterm>operator families</> and
 146    <firstterm>operator classes</> defined in
 147    <link linkend="catalog-pg-opfamily"><structname>pg_opfamily</structname></link>,
 148    <link linkend="catalog-pg-opclass"><structname>pg_opclass</structname></link>,
 149    <link linkend="catalog-pg-amop"><structname>pg_amop</structname></link>, and
 150    <link linkend="catalog-pg-amproc"><structname>pg_amproc</structname></link>.
 151    These entries allow the planner
 152    to determine what kinds of query qualifications can be used with
 153    indexes of this access method.  Operator families and classes are described
 154    in <xref linkend="xindex">, which is prerequisite material for reading
 155    this chapter.
 156   </para>
 157
 158   <para>
 159    An individual index is defined by a
 160    <link linkend="catalog-pg-class"><structname>pg_class</structname></link>
 161    entry that describes it as a physical relation, plus a
 162    <link linkend="catalog-pg-index"><structname>pg_index</structname></link>
 163    entry that shows the logical content of the index &mdash; that is, the set
 164    of index columns it has and the semantics of those columns, as captured by
 165    the associated operator classes.  The index columns (key values) can be
 166    either simple columns of the underlying table or expressions over the table
 167    rows.  The index access method normally has no interest in where the index
 168    key values come from (it is always handed precomputed key values) but it
 169    will be very interested in the operator class information in
 170    <structname>pg_index</structname>.  Both of these catalog entries can be
 171    accessed as part of the <structname>Relation</> data structure that is
 172    passed to all operations on the index.
 173   </para>
 174
 175   <para>
 176    Some of the flag fields of <structname>IndexAmRoutine</> have nonobvious
 177    implications.  The requirements of <structfield>amcanunique</structfield>
 178    are discussed in <xref linkend="index-unique-checks">.
 179    The <structfield>amcanmulticol</structfield> flag asserts that the
 180    access method supports multicolumn indexes, while
 181    <structfield>amoptionalkey</structfield> asserts that it allows scans
 182    where no indexable restriction clause is given for the first index column.
 183    When <structfield>amcanmulticol</structfield> is false,
 184    <structfield>amoptionalkey</structfield> essentially says whether the
 185    access method supports full-index scans without any restriction clause.
 186    Access methods that support multiple index columns <emphasis>must</>
 187    support scans that omit restrictions on any or all of the columns after
 188    the first; however they are permitted to require some restriction to
 189    appear for the first index column, and this is signaled by setting
 190    <structfield>amoptionalkey</structfield> false.
 191    One reason that an index AM might set
 192    <structfield>amoptionalkey</structfield> false is if it doesn't index
 193    null values.  Since most indexable operators are
 194    strict and hence cannot return true for null inputs,
 195    it is at first sight attractive to not store index entries for null values:
 196    they could never be returned by an index scan anyway.  However, this
 197    argument fails when an index scan has no restriction clause for a given
 198    index column.  In practice this means that
 199    indexes that have <structfield>amoptionalkey</structfield> true must
 200    index nulls, since the planner might decide to use such an index
 201    with no scan keys at all.  A related restriction is that an index
 202    access method that supports multiple index columns <emphasis>must</>
 203    support indexing null values in columns after the first, because the planner
 204    will assume the index can be used for queries that do not restrict
 205    these columns.  For example, consider an index on (a,b) and a query with
 206    <literal>WHERE a = 4</literal>.  The system will assume the index can be
 207    used to scan for rows with <literal>a = 4</literal>, which is wrong if the
 208    index omits rows where <literal>b</> is null.
 209    It is, however, OK to omit rows where the first indexed column is null.
 210    An index access method that does index nulls may also set
 211    <structfield>amsearchnulls</structfield>, indicating that it supports
 212    <literal>IS NULL</> and <literal>IS NOT NULL</> clauses as search
 213    conditions.
 214   </para>
 215
 216  </sect1>
 217
 218  <sect1 id="index-functions">
 219   <title>Index Access Method Functions</title>
 220
 221   <para>
 222    The index construction and maintenance functions that an index access
 223    method must provide in <structname>IndexAmRoutine</structname> are:
 224   </para>
 225
 226   <para>
 227 <programlisting>
 228 IndexBuildResult *
 229 ambuild (Relation heapRelation,
 230          Relation indexRelation,
 231          IndexInfo *indexInfo);
 232 </programlisting>
 233    Build a new index.  The index relation has been physically created,
 234    but is empty.  It must be filled in with whatever fixed data the
 235    access method requires, plus entries for all tuples already existing
 236    in the table.  Ordinarily the <function>ambuild</> function will call
 237    <function>IndexBuildHeapScan()</> to scan the table for existing tuples
 238    and compute the keys that need to be inserted into the index.
 239    The function must return a palloc'd struct containing statistics about
 240    the new index.
 241   </para>
 242
 243   <para>
 244 <programlisting>
 245 void
 246 ambuildempty (Relation indexRelation);
 247 </programlisting>
 248    Build an empty index, and write it to the initialization fork (<symbol>INIT_FORKNUM</symbol>)
 249    of the given relation.  This method is called only for unlogged indexes; the
 250    empty index written to the initialization fork will be copied over the main
 251    relation fork on each server restart.
 252   </para>
 253
 254   <para>
 255 <programlisting>
 256 bool
 257 aminsert (Relation indexRelation,
 258           Datum *values,
 259           bool *isnull,
 260           ItemPointer heap_tid,
 261           Relation heapRelation,
 262           IndexUniqueCheck checkUnique);
 263 </programlisting>
 264    Insert a new tuple into an existing index.  The <literal>values</> and
 265    <literal>isnull</> arrays give the key values to be indexed, and
 266    <literal>heap_tid</> is the TID to be indexed.
 267    If the access method supports unique indexes (its
 268    <structfield>amcanunique</> flag is true) then
 269    <literal>checkUnique</> indicates the type of uniqueness check to
 270    perform.  This varies depending on whether the unique constraint is
 271    deferrable; see <xref linkend="index-unique-checks"> for details.
 272    Normally the access method only needs the <literal>heapRelation</>
 273    parameter when performing uniqueness checking (since then it will have to
 274    look into the heap to verify tuple liveness).
 275   </para>
 276
 277   <para>
 278    The function's Boolean result value is significant only when
 279    <literal>checkUnique</> is <literal>UNIQUE_CHECK_PARTIAL</>.
 280    In this case a TRUE result means the new entry is known unique, whereas
 281    FALSE means it might be non-unique (and a deferred uniqueness check must
 282    be scheduled).  For other cases a constant FALSE result is recommended.
 283   </para>
 284
 285   <para>
 286    Some indexes might not index all tuples.  If the tuple is not to be
 287    indexed, <function>aminsert</> should just return without doing anything.
 288   </para>
 289
 290   <para>
 291 <programlisting>
 292 IndexBulkDeleteResult *
 293 ambulkdelete (IndexVacuumInfo *info,
 294               IndexBulkDeleteResult *stats,
 295               IndexBulkDeleteCallback callback,
 296               void *callback_state);
 297 </programlisting>
 298    Delete tuple(s) from the index.  This is a <quote>bulk delete</> operation
 299    that is intended to be implemented by scanning the whole index and checking
 300    each entry to see if it should be deleted.
 301    The passed-in <literal>callback</> function must be called, in the style
 302    <literal>callback(<replaceable>TID</>, callback_state) returns bool</literal>,
 303    to determine whether any particular index entry, as identified by its
 304    referenced TID, is to be deleted.  Must return either NULL or a palloc'd
 305    struct containing statistics about the effects of the deletion operation.
 306    It is OK to return NULL if no information needs to be passed on to
 307    <function>amvacuumcleanup</>.
 308   </para>
 309
 310   <para>
 311    Because of limited <varname>maintenance_work_mem</>,
 312    <function>ambulkdelete</> might need to be called more than once when many
 313    tuples are to be deleted.  The <literal>stats</> argument is the result
 314    of the previous call for this index (it is NULL for the first call within a
 315    <command>VACUUM</> operation).  This allows the AM to accumulate statistics
 316    across the whole operation.  Typically, <function>ambulkdelete</> will
 317    modify and return the same struct if the passed <literal>stats</> is not
 318    null.
 319   </para>
 320
 321   <para>
 322 <programlisting>
 323 IndexBulkDeleteResult *
 324 amvacuumcleanup (IndexVacuumInfo *info,
 325                  IndexBulkDeleteResult *stats);
 326 </programlisting>
 327    Clean up after a <command>VACUUM</command> operation (zero or more
 328    <function>ambulkdelete</> calls).  This does not have to do anything
 329    beyond returning index statistics, but it might perform bulk cleanup
 330    such as reclaiming empty index pages.  <literal>stats</> is whatever the
 331    last <function>ambulkdelete</> call returned, or NULL if
 332    <function>ambulkdelete</> was not called because no tuples needed to be
 333    deleted.  If the result is not NULL it must be a palloc'd struct.
 334    The statistics it contains will be used to update <structname>pg_class</>,
 335    and will be reported by <command>VACUUM</> if <literal>VERBOSE</> is given.
 336    It is OK to return NULL if the index was not changed at all during the
 337    <command>VACUUM</command> operation, but otherwise correct stats should
 338    be returned.
 339   </para>
 340
 341   <para>
 342    As of <productname>PostgreSQL</productname> 8.4,
 343    <function>amvacuumcleanup</> will also be called at completion of an
 344    <command>ANALYZE</> operation.  In this case <literal>stats</> is always
 345    NULL and any return value will be ignored.  This case can be distinguished
 346    by checking <literal>info-&gt;analyze_only</literal>.  It is recommended
 347    that the access method do nothing except post-insert cleanup in such a
 348    call, and that only in an autovacuum worker process.
 349   </para>
 350
 351   <para>
 352 <programlisting>
 353 bool
 354 amcanreturn (Relation indexRelation, int attno);
 355 </programlisting>
 356    Check whether the index can support <link
 357    linkend="indexes-index-only-scans"><firstterm>index-only scans</></link> on
 358    the given column, by returning the indexed column values for an index entry
 359    in the form of an <structname>IndexTuple</structname>.  The attribute number
 360    is 1-based, i.e. the first column's attno is 1. Returns TRUE if supported,
 361    else FALSE.  If the access method does not support index-only scans at all,
 362    the <structfield>amcanreturn</> field in its <structname>IndexAmRoutine</>
 363    struct can be set to NULL.
 364   </para>
 365
 366   <para>
 367 <programlisting>
 368 void
 369 amcostestimate (PlannerInfo *root,
 370                 IndexPath *path,
 371                 double loop_count,
 372                 Cost *indexStartupCost,
 373                 Cost *indexTotalCost,
 374                 Selectivity *indexSelectivity,
 375                 double *indexCorrelation);
 376 </programlisting>
 377    Estimate the costs of an index scan.  This function is described fully
 378    in <xref linkend="index-cost-estimation">, below.
 379   </para>
 380
 381   <para>
 382 <programlisting>
 383 bytea *
 384 amoptions (ArrayType *reloptions,
 385            bool validate);
 386 </programlisting>
 387    Parse and validate the reloptions array for an index.  This is called only
 388    when a non-null reloptions array exists for the index.
 389    <parameter>reloptions</> is a <type>text</> array containing entries of the
 390    form <replaceable>name</><literal>=</><replaceable>value</>.
 391    The function should construct a <type>bytea</> value, which will be copied
 392    into the <structfield>rd_options</> field of the index's relcache entry.
 393    The data contents of the <type>bytea</> value are open for the access
 394    method to define; most of the standard access methods use struct
 395    <structname>StdRdOptions</>.
 396    When <parameter>validate</> is true, the function should report a suitable
 397    error message if any of the options are unrecognized or have invalid
 398    values; when <parameter>validate</> is false, invalid entries should be
 399    silently ignored.  (<parameter>validate</> is false when loading options
 400    already stored in <structname>pg_catalog</>; an invalid entry could only
 401    be found if the access method has changed its rules for options, and in
 402    that case ignoring obsolete entries is appropriate.)
 403    It is OK to return NULL if default behavior is wanted.
 404   </para>
 405
 406   <para>
 407 <programlisting>
 408 bool
 409 amproperty (Oid index_oid, int attno,
 410             IndexAMProperty prop, const char *propname,
 411             bool *res, bool *isnull);
 412 </programlisting>
 413    The <function>amproperty</> method allows index access methods to override
 414    the default behavior of <function>pg_index_column_has_property</function>
 415    and related functions.
 416    If the access method does not have any special behavior for index property
 417    inquiries, the <structfield>amproperty</> field in
 418    its <structname>IndexAmRoutine</> struct can be set to NULL.
 419    Otherwise, the <function>amproperty</> method will be called with
 420    <parameter>index_oid</> and <parameter>attno</> both zero for
 421    <function>pg_indexam_has_property</function> calls,
 422    or with <parameter>index_oid</> valid and <parameter>attno</> zero for
 423    <function>pg_index_has_property</function> calls,
 424    or with <parameter>index_oid</> valid and <parameter>attno</> greater than
 425    zero for <function>pg_index_column_has_property</function> calls.
 426    <parameter>prop</> is an enum value identifying the property being tested,
 427    while <parameter>propname</> is the original property name string.
 428    If the core code does not recognize the property name
 429    then <parameter>prop</> is <literal>AMPROP_UNKNOWN</>.
 430    Access methods can define custom property names by
 431    checking <parameter>propname</> for a match (use <function>pg_strcasecmp</>
 432    to match, for consistency with the core code); for names known to the core
 433    code, it's better to inspect <parameter>prop</>.
 434    If the <structfield>amproperty</> method returns <literal>true</> then
 435    it has determined the property test result: it must set <literal>*res</>
 436    to the boolean value to return, or set <literal>*isnull</>
 437    to <literal>true</> to return a NULL.  (Both of the referenced variables
 438    are initialized to <literal>false</> before the call.)
 439    If the <structfield>amproperty</> method returns <literal>false</> then
 440    the core code will proceed with its normal logic for determining the
 441    property test result.
 442   </para>
 443
 444   <para>
 445    Access methods that support ordering operators should
 446    implement <literal>AMPROP_DISTANCE_ORDERABLE</> property testing, as the
 447    core code does not know how to do that and will return NULL.  It may
 448    also be advantageous to implement <literal>AMPROP_RETURNABLE</> testing,
 449    if that can be done more cheaply than by opening the index and calling
 450    <structfield>amcanreturn</>, which is the core code's default behavior.
 451    The default behavior should be satisfactory for all other standard
 452    properties.
 453   </para>
 454
 455   <para>
 456 <programlisting>
 457 bool
 458 amvalidate (Oid opclassoid);
 459 </programlisting>
 460    Validate the catalog entries for the specified operator class, so far as
 461    the access method can reasonably do that.  For example, this might include
 462    testing that all required support functions are provided.
 463    The <function>amvalidate</> function must return false if the opclass is
 464    invalid.  Problems should be reported with <function>ereport</> messages.
 465   </para>
 466
 467
 468   <para>
 469    The purpose of an index, of course, is to support scans for tuples matching
 470    an indexable <literal>WHERE</> condition, often called a
 471    <firstterm>qualifier</> or <firstterm>scan key</>.  The semantics of
 472    index scanning are described more fully in <xref linkend="index-scanning">,
 473    below.  An index access method can support <quote>plain</> index scans,
 474    <quote>bitmap</> index scans, or both.  The scan-related functions that an
 475    index access method must or may provide are:
 476   </para>
 477
 478   <para>
 479 <programlisting>
 480 IndexScanDesc
 481 ambeginscan (Relation indexRelation,
 482              int nkeys,
 483              int norderbys);
 484 </programlisting>
 485    Prepare for an index scan.  The <literal>nkeys</> and <literal>norderbys</>
 486    parameters indicate the number of quals and ordering operators that will be
 487    used in the scan; these may be useful for space allocation purposes.
 488    Note that the actual values of the scan keys aren't provided yet.
 489    The result must be a palloc'd struct.
 490    For implementation reasons the index access method
 491    <emphasis>must</> create this struct by calling
 492    <function>RelationGetIndexScan()</>.  In most cases
 493    <function>ambeginscan</> does little beyond making that call and perhaps
 494    acquiring locks;
 495    the interesting parts of index-scan startup are in <function>amrescan</>.
 496   </para>
 497
 498   <para>
 499 <programlisting>
 500 void
 501 amrescan (IndexScanDesc scan,
 502           ScanKey keys,
 503           int nkeys,
 504           ScanKey orderbys,
 505           int norderbys);
 506 </programlisting>
 507    Start or restart an index scan, possibly with new scan keys.  (To restart
 508    using previously-passed keys, NULL is passed for <literal>keys</> and/or
 509    <literal>orderbys</>.)  Note that it is not allowed for
 510    the number of keys or order-by operators to be larger than
 511    what was passed to <function>ambeginscan</>.  In practice the restart
 512    feature is used when a new outer tuple is selected by a nested-loop join
 513    and so a new key comparison value is needed, but the scan key structure
 514    remains the same.
 515   </para>
 516
 517   <para>
 518 <programlisting>
 519 boolean
 520 amgettuple (IndexScanDesc scan,
 521             ScanDirection direction);
 522 </programlisting>
 523    Fetch the next tuple in the given scan, moving in the given
 524    direction (forward or backward in the index).  Returns TRUE if a tuple was
 525    obtained, FALSE if no matching tuples remain.  In the TRUE case the tuple
 526    TID is stored into the <literal>scan</> structure.  Note that
 527    <quote>success</> means only that the index contains an entry that matches
 528    the scan keys, not that the tuple necessarily still exists in the heap or
 529    will pass the caller's snapshot test.  On success, <function>amgettuple</>
 530    must also set <literal>scan-&gt;xs_recheck</> to TRUE or FALSE.
 531    FALSE means it is certain that the index entry matches the scan keys.
 532    TRUE means this is not certain, and the conditions represented by the
 533    scan keys must be rechecked against the heap tuple after fetching it.
 534    This provision supports <quote>lossy</> index operators.
 535    Note that rechecking will extend only to the scan conditions; a partial
 536    index predicate (if any) is never rechecked by <function>amgettuple</>
 537    callers.
 538   </para>
 539
 540   <para>
 541    If the index supports <link linkend="indexes-index-only-scans">index-only
 542    scans</link> (i.e., <function>amcanreturn</function> returns TRUE for it),
 543    then on success the AM must also check
 544    <literal>scan-&gt;xs_want_itup</>, and if that is true it must return
 545    the original indexed data for the index entry, in the form of an
 546    <structname>IndexTuple</> pointer stored at <literal>scan-&gt;xs_itup</>,
 547    with tuple descriptor <literal>scan-&gt;xs_itupdesc</>.
 548    (Management of the data referenced by the pointer is the access method's
 549    responsibility.  The data must remain good at least until the next
 550    <function>amgettuple</>, <function>amrescan</>, or <function>amendscan</>
 551    call for the scan.)
 552   </para>
 553
 554   <para>
 555    The <function>amgettuple</> function need only be provided if the access
 556    method supports <quote>plain</> index scans.  If it doesn't, the
 557    <structfield>amgettuple</> field in its <structname>IndexAmRoutine</>
 558    struct must be set to NULL.
 559   </para>
 560
 561   <para>
 562 <programlisting>
 563 int64
 564 amgetbitmap (IndexScanDesc scan,
 565              TIDBitmap *tbm);
 566 </programlisting>
 567    Fetch all tuples in the given scan and add them to the caller-supplied
 568    <type>TIDBitmap</type> (that is, OR the set of tuple IDs into whatever set is already
 569    in the bitmap).  The number of tuples fetched is returned (this might be
 570    just an approximate count, for instance some AMs do not detect duplicates).
 571    While inserting tuple IDs into the bitmap, <function>amgetbitmap</> can
 572    indicate that rechecking of the scan conditions is required for specific
 573    tuple IDs.  This is analogous to the <literal>xs_recheck</> output parameter
 574    of <function>amgettuple</>.  Note: in the current implementation, support
 575    for this feature is conflated with support for lossy storage of the bitmap
 576    itself, and therefore callers recheck both the scan conditions and the
 577    partial index predicate (if any) for recheckable tuples.  That might not
 578    always be true, however.
 579    <function>amgetbitmap</> and
 580    <function>amgettuple</> cannot be used in the same index scan; there
 581    are other restrictions too when using <function>amgetbitmap</>, as explained
 582    in <xref linkend="index-scanning">.
 583   </para>
 584
 585   <para>
 586    The <function>amgetbitmap</> function need only be provided if the access
 587    method supports <quote>bitmap</> index scans.  If it doesn't, the
 588    <structfield>amgetbitmap</> field in its <structname>IndexAmRoutine</>
 589    struct must be set to NULL.
 590   </para>
 591
 592   <para>
 593 <programlisting>
 594 void
 595 amendscan (IndexScanDesc scan);
 596 </programlisting>
 597    End a scan and release resources.  The <literal>scan</> struct itself
 598    should not be freed, but any locks or pins taken internally by the
 599    access method must be released.
 600   </para>
 601
 602   <para>
 603 <programlisting>
 604 void
 605 ammarkpos (IndexScanDesc scan);
 606 </programlisting>
 607    Mark current scan position.  The access method need only support one
 608    remembered scan position per scan.
 609   </para>
 610
 611   <para>
 612    The <function>ammarkpos</> function need only be provided if the access
 613    method supports ordered scans.  If it doesn't,
 614    the <structfield>ammarkpos</> field in its <structname>IndexAmRoutine</>
 615    struct may be set to NULL.
 616   </para>
 617
 618   <para>
 619 <programlisting>
 620 void
 621 amrestrpos (IndexScanDesc scan);
 622 </programlisting>
 623    Restore the scan to the most recently marked position.
 624   </para>
 625
 626   <para>
 627    The <function>amrestrpos</> function need only be provided if the access
 628    method supports ordered scans.  If it doesn't,
 629    the <structfield>amrestrpos</> field in its <structname>IndexAmRoutine</>
 630    struct may be set to NULL.
 631   </para>
 632
 633   <para>
 634    In addition to supporting ordinary index scans, some types of index
 635    may wish to support <firstterm>parallel index scans</>, which allow
 636    multiple backends to cooperate in performing an index scan.  The
 637    index access method should arrange things so that each cooperating
 638    process returns a subset of the tuples that would be performed by
 639    an ordinary, non-parallel index scan, but in such a way that the
 640    union of those subsets is equal to the set of tuples that would be
 641    returned by an ordinary, non-parallel index scan.  Furthermore, while
 642    there need not be any global ordering of tuples returned by a parallel
 643    scan, the ordering of that subset of tuples returned within each
 644    cooperating backend must match the requested ordering.  The following
 645    functions may be implemented to support parallel index scans:
 646   </para>
 647
 648   <para>
 649 <programlisting>
 650 Size
 651 amestimateparallelscan (void);
 652 </programlisting>
 653    Estimate and return the number of bytes of dynamic shared memory which
 654    the access method will be needed to perform a parallel scan.  (This number
 655    is in addition to, not in lieu of, the amount of space needed for
 656    AM-independent data in <structname>ParallelIndexScanDescData</>.)
 657   </para>
 658
 659   <para>
 660    It is not necessary to implement this function for access methods which
 661    do not support parallel scans or for which the number of additional bytes
 662    of storage required is zero.
 663   </para>
 664
 665   <para>
 666 <programlisting>
 667 void
 668 aminitparallelscan (void *target);
 669 </programlisting>
 670    This function will be called to initialize dynamic shared memory at the
 671    beginning of a parallel scan.  <parameter>target</> will point to at least
 672    the number of bytes previously returned by
 673    <function>amestimateparallelscan</>, and this function may use that
 674    amount of space to store whatever data it wishes.
 675   </para>
 676
 677   <para>
 678    It is not necessary to implement this function for access methods which
 679    do not support parallel scans or in cases where the shared memory space
 680    required needs no initialization.
 681   </para>
 682
 683   <para>
 684 <programlisting>
 685 void
 686 amparallelrescan (IndexScanDesc scan);
 687 </programlisting>
 688    This function, if implemented, will be called when a parallel index scan
 689    must be restarted.  It should reset any shared state set up by
 690    <function>aminitparallelscan</> such that the scan will be restarted from
 691    the beginning.
 692   </para>
 693
 694  </sect1>
 695
 696  <sect1 id="index-scanning">
 697   <title>Index Scanning</title>
 698
 699   <para>
 700    In an index scan, the index access method is responsible for regurgitating
 701    the TIDs of all the tuples it has been told about that match the
 702    <firstterm>scan keys</>.  The access method is <emphasis>not</> involved in
 703    actually fetching those tuples from the index's parent table, nor in
 704    determining whether they pass the scan's time qualification test or other
 705    conditions.
 706   </para>
 707
 708   <para>
 709    A scan key is the internal representation of a <literal>WHERE</> clause of
 710    the form <replaceable>index_key</> <replaceable>operator</>
 711    <replaceable>constant</>, where the index key is one of the columns of the
 712    index and the operator is one of the members of the operator family
 713    associated with that index column.  An index scan has zero or more scan
 714    keys, which are implicitly ANDed &mdash; the returned tuples are expected
 715    to satisfy all the indicated conditions.
 716   </para>
 717
 718   <para>
 719    The access method can report that the index is <firstterm>lossy</>, or
 720    requires rechecks, for a particular query.  This implies that the index
 721    scan will return all the entries that pass the scan key, plus possibly
 722    additional entries that do not.  The core system's index-scan machinery
 723    will then apply the index conditions again to the heap tuple to verify
 724    whether or not it really should be selected.  If the recheck option is not
 725    specified, the index scan must return exactly the set of matching entries.
 726   </para>
 727
 728   <para>
 729    Note that it is entirely up to the access method to ensure that it
 730    correctly finds all and only the entries passing all the given scan keys.
 731    Also, the core system will simply hand off all the <literal>WHERE</>
 732    clauses that match the index keys and operator families, without any
 733    semantic analysis to determine whether they are redundant or
 734    contradictory.  As an example, given
 735    <literal>WHERE x &gt; 4 AND x &gt; 14</> where <literal>x</> is a b-tree
 736    indexed column, it is left to the b-tree <function>amrescan</> function
 737    to realize that the first scan key is redundant and can be discarded.
 738    The extent of preprocessing needed during <function>amrescan</> will
 739    depend on the extent to which the index access method needs to reduce
 740    the scan keys to a <quote>normalized</> form.
 741   </para>
 742
 743   <para>
 744    Some access methods return index entries in a well-defined order, others
 745    do not.  There are actually two different ways that an access method can
 746    support sorted output:
 747
 748     <itemizedlist>
 749      <listitem>
 750       <para>
 751        Access methods that always return entries in the natural ordering
 752        of their data (such as btree) should set
 753        <structfield>amcanorder</> to true.
 754        Currently, such access methods must use btree-compatible strategy
 755        numbers for their equality and ordering operators.
 756       </para>
 757      </listitem>
 758      <listitem>
 759       <para>
 760        Access methods that support ordering operators should set
 761        <structfield>amcanorderbyop</> to true.
 762        This indicates that the index is capable of returning entries in
 763        an order satisfying <literal>ORDER BY</> <replaceable>index_key</>
 764        <replaceable>operator</> <replaceable>constant</>.  Scan modifiers
 765        of that form can be passed to <function>amrescan</> as described
 766        previously.
 767       </para>
 768      </listitem>
 769     </itemizedlist>
 770   </para>
 771
 772   <para>
 773    The <function>amgettuple</> function has a <literal>direction</> argument,
 774    which can be either <literal>ForwardScanDirection</> (the normal case)
 775    or  <literal>BackwardScanDirection</>.  If the first call after
 776    <function>amrescan</> specifies <literal>BackwardScanDirection</>, then the
 777    set of matching index entries is to be scanned back-to-front rather than in
 778    the normal front-to-back direction, so <function>amgettuple</> must return
 779    the last matching tuple in the index, rather than the first one as it
 780    normally would.  (This will only occur for access
 781    methods that set <structfield>amcanorder</> to true.)  After the
 782    first call, <function>amgettuple</> must be prepared to advance the scan in
 783    either direction from the most recently returned entry.  (But if
 784    <structfield>amcanbackward</> is false, all subsequent
 785    calls will have the same direction as the first one.)
 786   </para>
 787
 788   <para>
 789    Access methods that support ordered scans must support <quote>marking</> a
 790    position in a scan and later returning to the marked position.  The same
 791    position might be restored multiple times.  However, only one position need
 792    be remembered per scan; a new <function>ammarkpos</> call overrides the
 793    previously marked position.  An access method that does not support ordered
 794    scans need not provide <function>ammarkpos</> and <function>amrestrpos</>
 795    functions in <structname>IndexAmRoutine</>; set those pointers to NULL
 796    instead.
 797   </para>
 798
 799   <para>
 800    Both the scan position and the mark position (if any) must be maintained
 801    consistently in the face of concurrent insertions or deletions in the
 802    index.  It is OK if a freshly-inserted entry is not returned by a scan that
 803    would have found the entry if it had existed when the scan started, or for
 804    the scan to return such an entry upon rescanning or backing
 805    up even though it had not been returned the first time through.  Similarly,
 806    a concurrent delete might or might not be reflected in the results of a scan.
 807    What is important is that insertions or deletions not cause the scan to
 808    miss or multiply return entries that were not themselves being inserted or
 809    deleted.
 810   </para>
 811
 812   <para>
 813    If the index stores the original indexed data values (and not some lossy
 814    representation of them), it is useful to
 815    support <link linkend="indexes-index-only-scans">index-only scans</link>, in
 816    which the index returns the actual data not just the TID of the heap tuple.
 817    This will only avoid I/O if the visibility map shows that the TID is on an
 818    all-visible page; else the heap tuple must be visited anyway to check
 819    MVCC visibility.  But that is no concern of the access method's.
 820   </para>
 821
 822   <para>
 823    Instead of using <function>amgettuple</>, an index scan can be done with
 824    <function>amgetbitmap</> to fetch all tuples in one call.  This can be
 825    noticeably more efficient than <function>amgettuple</> because it allows
 826    avoiding lock/unlock cycles within the access method.  In principle
 827    <function>amgetbitmap</> should have the same effects as repeated
 828    <function>amgettuple</> calls, but we impose several restrictions to
 829    simplify matters.  First of all, <function>amgetbitmap</> returns all
 830    tuples at once and marking or restoring scan positions isn't
 831    supported. Secondly, the tuples are returned in a bitmap which doesn't
 832    have any specific ordering, which is why <function>amgetbitmap</> doesn't
 833    take a <literal>direction</> argument.  (Ordering operators will never be
 834    supplied for such a scan, either.)
 835    Also, there is no provision for index-only scans with
 836    <function>amgetbitmap</>, since there is no way to return the contents of
 837    index tuples.
 838    Finally, <function>amgetbitmap</>
 839    does not guarantee any locking of the returned tuples, with implications
 840    spelled out in <xref linkend="index-locking">.
 841   </para>
 842
 843   <para>
 844    Note that it is permitted for an access method to implement only
 845    <function>amgetbitmap</> and not <function>amgettuple</>, or vice versa,
 846    if its internal implementation is unsuited to one API or the other.
 847   </para>
 848
 849  </sect1>
 850
 851  <sect1 id="index-locking">
 852   <title>Index Locking Considerations</title>
 853
 854   <para>
 855    Index access methods must handle concurrent updates
 856    of the index by multiple processes.
 857    The core <productname>PostgreSQL</productname> system obtains
 858    <literal>AccessShareLock</> on the index during an index scan, and
 859    <literal>RowExclusiveLock</> when updating the index (including plain
 860    <command>VACUUM</>).  Since these lock types do not conflict, the access
 861    method is responsible for handling any fine-grained locking it might need.
 862    An exclusive lock on the index as a whole will be taken only during index
 863    creation, destruction, or <command>REINDEX</>.
 864   </para>
 865
 866   <para>
 867    Building an index type that supports concurrent updates usually requires
 868    extensive and subtle analysis of the required behavior.  For the b-tree
 869    and hash index types, you can read about the design decisions involved in
 870    <filename>src/backend/access/nbtree/README</> and
 871    <filename>src/backend/access/hash/README</>.
 872   </para>
 873
 874   <para>
 875    Aside from the index's own internal consistency requirements, concurrent
 876    updates create issues about consistency between the parent table (the
 877    <firstterm>heap</>) and the index.  Because
 878    <productname>PostgreSQL</productname> separates accesses
 879    and updates of the heap from those of the index, there are windows in
 880    which the index might be inconsistent with the heap.  We handle this problem
 881    with the following rules:
 882
 883     <itemizedlist>
 884      <listitem>
 885       <para>
 886        A new heap entry is made before making its index entries.  (Therefore
 887        a concurrent index scan is likely to fail to see the heap entry.
 888        This is okay because the index reader would be uninterested in an
 889        uncommitted row anyway.  But see <xref linkend="index-unique-checks">.)
 890       </para>
 891      </listitem>
 892      <listitem>
 893       <para>
 894        When a heap entry is to be deleted (by <command>VACUUM</>), all its
 895        index entries must be removed first.
 896       </para>
 897      </listitem>
 898      <listitem>
 899       <para>
 900        An index scan must maintain a pin
 901        on the index page holding the item last returned by
 902        <function>amgettuple</>, and <function>ambulkdelete</> cannot delete
 903        entries from pages that are pinned by other backends.  The need
 904        for this rule is explained below.
 905       </para>
 906      </listitem>
 907     </itemizedlist>
 908
 909    Without the third rule, it is possible for an index reader to
 910    see an index entry just before it is removed by <command>VACUUM</>, and
 911    then to arrive at the corresponding heap entry after that was removed by
 912    <command>VACUUM</>.
 913    This creates no serious problems if that item
 914    number is still unused when the reader reaches it, since an empty
 915    item slot will be ignored by <function>heap_fetch()</>.  But what if a
 916    third backend has already re-used the item slot for something else?
 917    When using an MVCC-compliant snapshot, there is no problem because
 918    the new occupant of the slot is certain to be too new to pass the
 919    snapshot test.  However, with a non-MVCC-compliant snapshot (such as
 920    <literal>SnapshotAny</>), it would be possible to accept and return
 921    a row that does not in fact match the scan keys.  We could defend
 922    against this scenario by requiring the scan keys to be rechecked
 923    against the heap row in all cases, but that is too expensive.  Instead,
 924    we use a pin on an index page as a proxy to indicate that the reader
 925    might still be <quote>in flight</> from the index entry to the matching
 926    heap entry.  Making <function>ambulkdelete</> block on such a pin ensures
 927    that <command>VACUUM</> cannot delete the heap entry before the reader
 928    is done with it.  This solution costs little in run time, and adds blocking
 929    overhead only in the rare cases where there actually is a conflict.
 930   </para>
 931
 932   <para>
 933    This solution requires that index scans be <quote>synchronous</>: we have
 934    to fetch each heap tuple immediately after scanning the corresponding index
 935    entry.  This is expensive for a number of reasons.  An
 936    <quote>asynchronous</> scan in which we collect many TIDs from the index,
 937    and only visit the heap tuples sometime later, requires much less index
 938    locking overhead and can allow a more efficient heap access pattern.
 939    Per the above analysis, we must use the synchronous approach for
 940    non-MVCC-compliant snapshots, but an asynchronous scan is workable
 941    for a query using an MVCC snapshot.
 942   </para>
 943
 944   <para>
 945    In an <function>amgetbitmap</> index scan, the access method does not
 946    keep an index pin on any of the returned tuples.  Therefore
 947    it is only safe to use such scans with MVCC-compliant snapshots.
 948   </para>
 949
 950   <para>
 951    When the <structfield>ampredlocks</> flag is not set, any scan using that
 952    index access method within a serializable transaction will acquire a
 953    nonblocking predicate lock on the full index.  This will generate a
 954    read-write conflict with the insert of any tuple into that index by a
 955    concurrent serializable transaction.  If certain patterns of read-write
 956    conflicts are detected among a set of concurrent serializable
 957    transactions, one of those transactions may be canceled to protect data
 958    integrity.  When the flag is set, it indicates that the index access
 959    method implements finer-grained predicate locking, which will tend to
 960    reduce the frequency of such transaction cancellations.
 961   </para>
 962
 963  </sect1>
 964
 965  <sect1 id="index-unique-checks">
 966   <title>Index Uniqueness Checks</title>
 967
 968   <para>
 969    <productname>PostgreSQL</productname> enforces SQL uniqueness constraints
 970    using <firstterm>unique indexes</>, which are indexes that disallow
 971    multiple entries with identical keys.  An access method that supports this
 972    feature sets <structfield>amcanunique</> true.
 973    (At present, only b-tree supports it.)
 974   </para>
 975
 976   <para>
 977    Because of MVCC, it is always necessary to allow duplicate entries to
 978    exist physically in an index: the entries might refer to successive
 979    versions of a single logical row.  The behavior we actually want to
 980    enforce is that no MVCC snapshot could include two rows with equal
 981    index keys.  This breaks down into the following cases that must be
 982    checked when inserting a new row into a unique index:
 983
 984     <itemizedlist>
 985      <listitem>
 986       <para>
 987        If a conflicting valid row has been deleted by the current transaction,
 988        it's okay.  (In particular, since an UPDATE always deletes the old row
 989        version before inserting the new version, this will allow an UPDATE on
 990        a row without changing the key.)
 991       </para>
 992      </listitem>
 993      <listitem>
 994       <para>
 995        If a conflicting row has been inserted by an as-yet-uncommitted
 996        transaction, the would-be inserter must wait to see if that transaction
 997        commits.  If it rolls back then there is no conflict.  If it commits
 998        without deleting the conflicting row again, there is a uniqueness
 999        violation.  (In practice we just wait for the other transaction to
1000        end and then redo the visibility check in toto.)
1001       </para>
1002      </listitem>
1003      <listitem>
1004       <para>
1005        Similarly, if a conflicting valid row has been deleted by an
1006        as-yet-uncommitted transaction, the would-be inserter must wait
1007        for that transaction to commit or abort, and then repeat the test.
1008       </para>
1009      </listitem>
1010     </itemizedlist>
1011   </para>
1012
1013   <para>
1014    Furthermore, immediately before reporting a uniqueness violation
1015    according to the above rules, the access method must recheck the
1016    liveness of the row being inserted.  If it is committed dead then
1017    no violation should be reported.  (This case cannot occur during the
1018    ordinary scenario of inserting a row that's just been created by
1019    the current transaction.  It can happen during
1020    <command>CREATE UNIQUE INDEX CONCURRENTLY</>, however.)
1021   </para>
1022
1023   <para>
1024    We require the index access method to apply these tests itself, which
1025    means that it must reach into the heap to check the commit status of
1026    any row that is shown to have a duplicate key according to the index
1027    contents.  This is without a doubt ugly and non-modular, but it saves
1028    redundant work: if we did a separate probe then the index lookup for
1029    a conflicting row would be essentially repeated while finding the place to
1030    insert the new row's index entry.  What's more, there is no obvious way
1031    to avoid race conditions unless the conflict check is an integral part
1032    of insertion of the new index entry.
1033   </para>
1034
1035   <para>
1036    If the unique constraint is deferrable, there is additional complexity:
1037    we need to be able to insert an index entry for a new row, but defer any
1038    uniqueness-violation error until end of statement or even later.  To
1039    avoid unnecessary repeat searches of the index, the index access method
1040    should do a preliminary uniqueness check during the initial insertion.
1041    If this shows that there is definitely no conflicting live tuple, we
1042    are done.  Otherwise, we schedule a recheck to occur when it is time to
1043    enforce the constraint.  If, at the time of the recheck, both the inserted
1044    tuple and some other tuple with the same key are live, then the error
1045    must be reported.  (Note that for this purpose, <quote>live</> actually
1046    means <quote>any tuple in the index entry's HOT chain is live</>.)
1047    To implement this, the <function>aminsert</> function is passed a
1048    <literal>checkUnique</> parameter having one of the following values:
1049
1050     <itemizedlist>
1051      <listitem>
1052       <para>
1053        <literal>UNIQUE_CHECK_NO</> indicates that no uniqueness checking
1054        should be done (this is not a unique index).
1055       </para>
1056      </listitem>
1057      <listitem>
1058       <para>
1059        <literal>UNIQUE_CHECK_YES</> indicates that this is a non-deferrable
1060        unique index, and the uniqueness check must be done immediately, as
1061        described above.
1062       </para>
1063      </listitem>
1064      <listitem>
1065       <para>
1066        <literal>UNIQUE_CHECK_PARTIAL</> indicates that the unique
1067        constraint is deferrable. <productname>PostgreSQL</productname>
1068        will use this mode to insert each row's index entry.  The access
1069        method must allow duplicate entries into the index, and report any
1070        potential duplicates by returning FALSE from <function>aminsert</>.
1071        For each row for which FALSE is returned, a deferred recheck will
1072        be scheduled.
1073       </para>
1074
1075       <para>
1076        The access method must identify any rows which might violate the
1077        unique constraint, but it is not an error for it to report false
1078        positives. This allows the check to be done without waiting for other
1079        transactions to finish; conflicts reported here are not treated as
1080        errors and will be rechecked later, by which time they may no longer
1081        be conflicts.
1082       </para>
1083      </listitem>
1084      <listitem>
1085       <para>
1086        <literal>UNIQUE_CHECK_EXISTING</> indicates that this is a deferred
1087        recheck of a row that was reported as a potential uniqueness violation.
1088        Although this is implemented by calling <function>aminsert</>, the
1089        access method must <emphasis>not</> insert a new index entry in this
1090        case.  The index entry is already present.  Rather, the access method
1091        must check to see if there is another live index entry.  If so, and
1092        if the target row is also still live, report error.
1093       </para>
1094
1095       <para>
1096        It is recommended that in a <literal>UNIQUE_CHECK_EXISTING</> call,
1097        the access method further verify that the target row actually does
1098        have an existing entry in the index, and report error if not.  This
1099        is a good idea because the index tuple values passed to
1100        <function>aminsert</> will have been recomputed.  If the index
1101        definition involves functions that are not really immutable, we
1102        might be checking the wrong area of the index.  Checking that the
1103        target row is found in the recheck verifies that we are scanning
1104        for the same tuple values as were used in the original insertion.
1105       </para>
1106      </listitem>
1107     </itemizedlist>
1108   </para>
1109
1110  </sect1>
1111
1112  <sect1 id="index-cost-estimation">
1113   <title>Index Cost Estimation Functions</title>
1114
1115   <para>
1116    The <function>amcostestimate</> function is given information describing
1117    a possible index scan, including lists of WHERE and ORDER BY clauses that
1118    have been determined to be usable with the index.  It must return estimates
1119    of the cost of accessing the index and the selectivity of the WHERE
1120    clauses (that is, the fraction of parent-table rows that will be
1121    retrieved during the index scan).  For simple cases, nearly all the
1122    work of the cost estimator can be done by calling standard routines
1123    in the optimizer; the point of having an <function>amcostestimate</> function is
1124    to allow index access methods to provide index-type-specific knowledge,
1125    in case it is possible to improve on the standard estimates.
1126   </para>
1127
1128   <para>
1129    Each <function>amcostestimate</> function must have the signature:
1130
1131 <programlisting>
1132 void
1133 amcostestimate (PlannerInfo *root,
1134                 IndexPath *path,
1135                 double loop_count,
1136                 Cost *indexStartupCost,
1137                 Cost *indexTotalCost,
1138                 Selectivity *indexSelectivity,
1139                 double *indexCorrelation);
1140 </programlisting>
1141
1142    The first three parameters are inputs:
1143
1144    <variablelist>
1145     <varlistentry>
1146      <term><parameter>root</></term>
1147      <listitem>
1148       <para>
1149        The planner's information about the query being processed.
1150       </para>
1151      </listitem>
1152     </varlistentry>
1153
1154     <varlistentry>
1155      <term><parameter>path</></term>
1156      <listitem>
1157       <para>
1158        The index access path being considered.  All fields except cost and
1159        selectivity values are valid.
1160       </para>
1161      </listitem>
1162     </varlistentry>
1163
1164     <varlistentry>
1165      <term><parameter>loop_count</></term>
1166      <listitem>
1167       <para>
1168        The number of repetitions of the index scan that should be factored
1169        into the cost estimates.  This will typically be greater than one when
1170        considering a parameterized scan for use in the inside of a nestloop
1171        join.  Note that the cost estimates should still be for just one scan;
1172        a larger <parameter>loop_count</> means that it may be appropriate
1173        to allow for some caching effects across multiple scans.
1174       </para>
1175      </listitem>
1176     </varlistentry>
1177    </variablelist>
1178   </para>
1179
1180   <para>
1181    The last four parameters are pass-by-reference outputs:
1182
1183    <variablelist>
1184     <varlistentry>
1185      <term><parameter>*indexStartupCost</></term>
1186      <listitem>
1187       <para>
1188        Set to cost of index start-up processing
1189       </para>
1190      </listitem>
1191     </varlistentry>
1192
1193     <varlistentry>
1194      <term><parameter>*indexTotalCost</></term>
1195      <listitem>
1196       <para>
1197        Set to total cost of index processing
1198       </para>
1199      </listitem>
1200     </varlistentry>
1201
1202     <varlistentry>
1203      <term><parameter>*indexSelectivity</></term>
1204      <listitem>
1205       <para>
1206        Set to index selectivity
1207       </para>
1208      </listitem>
1209     </varlistentry>
1210
1211     <varlistentry>
1212      <term><parameter>*indexCorrelation</></term>
1213      <listitem>
1214       <para>
1215        Set to correlation coefficient between index scan order and
1216        underlying table's order
1217       </para>
1218      </listitem>
1219     </varlistentry>
1220    </variablelist>
1221   </para>
1222
1223   <para>
1224    Note that cost estimate functions must be written in C, not in SQL or
1225    any available procedural language, because they must access internal
1226    data structures of the planner/optimizer.
1227   </para>
1228
1229   <para>
1230    The index access costs should be computed using the parameters used by
1231    <filename>src/backend/optimizer/path/costsize.c</filename>: a sequential
1232    disk block fetch has cost <varname>seq_page_cost</>, a nonsequential fetch
1233    has cost <varname>random_page_cost</>, and the cost of processing one index
1234    row should usually be taken as <varname>cpu_index_tuple_cost</>.  In
1235    addition, an appropriate multiple of <varname>cpu_operator_cost</> should
1236    be charged for any comparison operators invoked during index processing
1237    (especially evaluation of the indexquals themselves).
1238   </para>
1239
1240   <para>
1241    The access costs should include all disk and CPU costs associated with
1242    scanning the index itself, but <emphasis>not</> the costs of retrieving or
1243    processing the parent-table rows that are identified by the index.
1244   </para>
1245
1246   <para>
1247    The <quote>start-up cost</quote> is the part of the total scan cost that
1248    must be expended before we can begin to fetch the first row.  For most
1249    indexes this can be taken as zero, but an index type with a high start-up
1250    cost might want to set it nonzero.
1251   </para>
1252
1253   <para>
1254    The <parameter>indexSelectivity</> should be set to the estimated fraction of the parent
1255    table rows that will be retrieved during the index scan.  In the case
1256    of a lossy query, this will typically be higher than the fraction of
1257    rows that actually pass the given qual conditions.
1258   </para>
1259
1260   <para>
1261    The <parameter>indexCorrelation</> should be set to the correlation (ranging between
1262    -1.0 and 1.0) between the index order and the table order.  This is used
1263    to adjust the estimate for the cost of fetching rows from the parent
1264    table.
1265   </para>
1266
1267   <para>
1268    When <parameter>loop_count</> is greater than one, the returned numbers
1269    should be averages expected for any one scan of the index.
1270   </para>
1271
1272   <procedure>
1273    <title>Cost Estimation</title>
1274    <para>
1275     A typical cost estimator will proceed as follows:
1276    </para>
1277
1278    <step>
1279     <para>
1280      Estimate and return the fraction of parent-table rows that will be visited
1281      based on the given qual conditions.  In the absence of any index-type-specific
1282      knowledge, use the standard optimizer function <function>clauselist_selectivity()</function>:
1283
1284 <programlisting>
1285 *indexSelectivity = clauselist_selectivity(root, path-&gt;indexquals,
1286                                            path-&gt;indexinfo-&gt;rel-&gt;relid,
1287                                            JOIN_INNER, NULL);
1288 </programlisting>
1289     </para>
1290    </step>
1291
1292    <step>
1293     <para>
1294      Estimate the number of index rows that will be visited during the
1295      scan.  For many index types this is the same as <parameter>indexSelectivity</> times
1296      the number of rows in the index, but it might be more.  (Note that the
1297      index's size in pages and rows is available from the
1298      <literal>path-&gt;indexinfo</> struct.)
1299     </para>
1300    </step>
1301
1302    <step>
1303     <para>
1304      Estimate the number of index pages that will be retrieved during the scan.
1305      This might be just <parameter>indexSelectivity</> times the index's size in pages.
1306     </para>
1307    </step>
1308
1309    <step>
1310     <para>
1311      Compute the index access cost.  A generic estimator might do this:
1312
1313 <programlisting>
1314 /*
1315  * Our generic assumption is that the index pages will be read
1316  * sequentially, so they cost seq_page_cost each, not random_page_cost.
1317  * Also, we charge for evaluation of the indexquals at each index row.
1318  * All the costs are assumed to be paid incrementally during the scan.
1319  */
1320 cost_qual_eval(&amp;index_qual_cost, path-&gt;indexquals, root);
1321 *indexStartupCost = index_qual_cost.startup;
1322 *indexTotalCost = seq_page_cost * numIndexPages +
1323     (cpu_index_tuple_cost + index_qual_cost.per_tuple) * numIndexTuples;
1324 </programlisting>
1325
1326      However, the above does not account for amortization of index reads
1327      across repeated index scans.
1328     </para>
1329    </step>
1330
1331    <step>
1332     <para>
1333      Estimate the index correlation.  For a simple ordered index on a single
1334      field, this can be retrieved from pg_statistic.  If the correlation
1335      is not known, the conservative estimate is zero (no correlation).
1336     </para>
1337    </step>
1338   </procedure>
1339
1340   <para>
1341    Examples of cost estimator functions can be found in
1342    <filename>src/backend/utils/adt/selfuncs.c</filename>.
1343   </para>
1344  </sect1>
1345 </chapter>