1 <!-- doc/src/sgml/indexam.sgml -->
4 <title>Index Access Method Interface Definition</title>
7 This chapter defines the interface between the core
8 <productname>PostgreSQL</productname> system and <firstterm>index access
9 methods</>, which manage individual index types. The core system
10 knows nothing about indexes beyond what is specified here, so it is
11 possible to develop entirely new index types by writing add-on code.
15 All indexes in <productname>PostgreSQL</productname> are what are known
16 technically as <firstterm>secondary indexes</>; that is, the index is
17 physically separate from the table file that it describes. Each index
18 is stored as its own physical <firstterm>relation</> and so is described
19 by an entry in the <structname>pg_class</> catalog. The contents of an
20 index are entirely under the control of its index access method. In
21 practice, all index access methods divide indexes into standard-size
22 pages so that they can use the regular storage manager and buffer manager
23 to access the index contents. (All the existing index access methods
24 furthermore use the standard page layout described in <xref
25 linkend="storage-page-layout">, and most use the same format for index
26 tuple headers; but these decisions are not forced on an access method.)
30 An index is effectively a mapping from some data key values to
31 <firstterm>tuple identifiers</>, or <acronym>TIDs</>, of row versions
32 (tuples) in the index's parent table. A TID consists of a
33 block number and an item number within that block (see <xref
34 linkend="storage-page-layout">). This is sufficient
35 information to fetch a particular row version from the table.
36 Indexes are not directly aware that under MVCC, there might be multiple
37 extant versions of the same logical row; to an index, each tuple is
38 an independent object that needs its own index entry. Thus, an
39 update of a row always creates all-new index entries for the row, even if
40 the key values did not change. (HOT tuples are an exception to this
41 statement; but indexes do not deal with those, either.) Index entries for
42 dead tuples are reclaimed (by vacuuming) when the dead tuples themselves
46 <sect1 id="index-api">
47 <title>Basic API Structure for Indexes</title>
50 Each index access method is described by a row in the
51 <link linkend="catalog-pg-am"><structname>pg_am</structname></link>
52 system catalog. The <structname>pg_am</structname> entry
53 specifies a name and a <firstterm>handler function</> for the access
54 method. These entries can be created and deleted using the
55 <xref linkend="sql-create-access-method"> and
56 <xref linkend="sql-drop-access-method"> SQL commands.
60 An index access method handler function must be declared to accept a
61 single argument of type <type>internal</> and to return the
62 pseudo-type <type>index_am_handler</>. The argument is a dummy value that
63 simply serves to prevent handler functions from being called directly from
64 SQL commands. The result of the function must be a palloc'd struct of
65 type <structname>IndexAmRoutine</structname>, which contains everything
66 that the core code needs to know to make use of the index access method.
67 The <structname>IndexAmRoutine</structname> struct, also called the access
68 method's <firstterm>API struct</>, includes fields specifying assorted
69 fixed properties of the access method, such as whether it can support
70 multicolumn indexes. More importantly, it contains pointers to support
71 functions for the access method, which do all of the real work to access
72 indexes. These support functions are plain C functions and are not
73 visible or callable at the SQL level. The support functions are described
74 in <xref linkend="index-functions">.
78 The structure <structname>IndexAmRoutine</structname> is defined thus:
80 typedef struct IndexAmRoutine
85 * Total number of strategies (operators) by which we can traverse/search
86 * this AM. Zero if AM does not have a fixed set of strategy assignments.
89 /* total number of support functions that this AM uses */
91 /* does AM support ORDER BY indexed column's value? */
93 /* does AM support ORDER BY result of an operator on indexed column? */
95 /* does AM support backward scanning? */
97 /* does AM support UNIQUE indexes? */
99 /* does AM support multi-column indexes? */
101 /* does AM require scans to have a constraint on the first index column? */
103 /* does AM handle ScalarArrayOpExpr quals? */
105 /* does AM handle IS NULL/IS NOT NULL quals? */
107 /* can index storage data type differ from column data type? */
109 /* can an index of this type be clustered on? */
111 /* does AM handle predicate locks? */
113 /* type of data stored in index, or InvalidOid if variable */
116 /* interface functions */
117 ambuild_function ambuild;
118 ambuildempty_function ambuildempty;
119 aminsert_function aminsert;
120 ambulkdelete_function ambulkdelete;
121 amvacuumcleanup_function amvacuumcleanup;
122 amcanreturn_function amcanreturn; /* can be NULL */
123 amcostestimate_function amcostestimate;
124 amoptions_function amoptions;
125 amproperty_function amproperty; /* can be NULL */
126 amvalidate_function amvalidate;
127 ambeginscan_function ambeginscan;
128 amrescan_function amrescan;
129 amgettuple_function amgettuple; /* can be NULL */
130 amgetbitmap_function amgetbitmap; /* can be NULL */
131 amendscan_function amendscan;
132 ammarkpos_function ammarkpos; /* can be NULL */
133 amrestrpos_function amrestrpos; /* can be NULL */
135 /* interface functions to support parallel index scans */
136 amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
137 aminitparallelscan_function aminitparallelscan; /* can be NULL */
138 amparallelrescan_function amparallelrescan; /* can be NULL */
144 To be useful, an index access method must also have one or more
145 <firstterm>operator families</> and
146 <firstterm>operator classes</> defined in
147 <link linkend="catalog-pg-opfamily"><structname>pg_opfamily</structname></link>,
148 <link linkend="catalog-pg-opclass"><structname>pg_opclass</structname></link>,
149 <link linkend="catalog-pg-amop"><structname>pg_amop</structname></link>, and
150 <link linkend="catalog-pg-amproc"><structname>pg_amproc</structname></link>.
151 These entries allow the planner
152 to determine what kinds of query qualifications can be used with
153 indexes of this access method. Operator families and classes are described
154 in <xref linkend="xindex">, which is prerequisite material for reading
159 An individual index is defined by a
160 <link linkend="catalog-pg-class"><structname>pg_class</structname></link>
161 entry that describes it as a physical relation, plus a
162 <link linkend="catalog-pg-index"><structname>pg_index</structname></link>
163 entry that shows the logical content of the index — that is, the set
164 of index columns it has and the semantics of those columns, as captured by
165 the associated operator classes. The index columns (key values) can be
166 either simple columns of the underlying table or expressions over the table
167 rows. The index access method normally has no interest in where the index
168 key values come from (it is always handed precomputed key values) but it
169 will be very interested in the operator class information in
170 <structname>pg_index</structname>. Both of these catalog entries can be
171 accessed as part of the <structname>Relation</> data structure that is
172 passed to all operations on the index.
176 Some of the flag fields of <structname>IndexAmRoutine</> have nonobvious
177 implications. The requirements of <structfield>amcanunique</structfield>
178 are discussed in <xref linkend="index-unique-checks">.
179 The <structfield>amcanmulticol</structfield> flag asserts that the
180 access method supports multicolumn indexes, while
181 <structfield>amoptionalkey</structfield> asserts that it allows scans
182 where no indexable restriction clause is given for the first index column.
183 When <structfield>amcanmulticol</structfield> is false,
184 <structfield>amoptionalkey</structfield> essentially says whether the
185 access method supports full-index scans without any restriction clause.
186 Access methods that support multiple index columns <emphasis>must</>
187 support scans that omit restrictions on any or all of the columns after
188 the first; however they are permitted to require some restriction to
189 appear for the first index column, and this is signaled by setting
190 <structfield>amoptionalkey</structfield> false.
191 One reason that an index AM might set
192 <structfield>amoptionalkey</structfield> false is if it doesn't index
193 null values. Since most indexable operators are
194 strict and hence cannot return true for null inputs,
195 it is at first sight attractive to not store index entries for null values:
196 they could never be returned by an index scan anyway. However, this
197 argument fails when an index scan has no restriction clause for a given
198 index column. In practice this means that
199 indexes that have <structfield>amoptionalkey</structfield> true must
200 index nulls, since the planner might decide to use such an index
201 with no scan keys at all. A related restriction is that an index
202 access method that supports multiple index columns <emphasis>must</>
203 support indexing null values in columns after the first, because the planner
204 will assume the index can be used for queries that do not restrict
205 these columns. For example, consider an index on (a,b) and a query with
206 <literal>WHERE a = 4</literal>. The system will assume the index can be
207 used to scan for rows with <literal>a = 4</literal>, which is wrong if the
208 index omits rows where <literal>b</> is null.
209 It is, however, OK to omit rows where the first indexed column is null.
210 An index access method that does index nulls may also set
211 <structfield>amsearchnulls</structfield>, indicating that it supports
212 <literal>IS NULL</> and <literal>IS NOT NULL</> clauses as search
218 <sect1 id="index-functions">
219 <title>Index Access Method Functions</title>
222 The index construction and maintenance functions that an index access
223 method must provide in <structname>IndexAmRoutine</structname> are:
229 ambuild (Relation heapRelation,
230 Relation indexRelation,
231 IndexInfo *indexInfo);
233 Build a new index. The index relation has been physically created,
234 but is empty. It must be filled in with whatever fixed data the
235 access method requires, plus entries for all tuples already existing
236 in the table. Ordinarily the <function>ambuild</> function will call
237 <function>IndexBuildHeapScan()</> to scan the table for existing tuples
238 and compute the keys that need to be inserted into the index.
239 The function must return a palloc'd struct containing statistics about
246 ambuildempty (Relation indexRelation);
248 Build an empty index, and write it to the initialization fork (<symbol>INIT_FORKNUM</symbol>)
249 of the given relation. This method is called only for unlogged indexes; the
250 empty index written to the initialization fork will be copied over the main
251 relation fork on each server restart.
257 aminsert (Relation indexRelation,
260 ItemPointer heap_tid,
261 Relation heapRelation,
262 IndexUniqueCheck checkUnique);
264 Insert a new tuple into an existing index. The <literal>values</> and
265 <literal>isnull</> arrays give the key values to be indexed, and
266 <literal>heap_tid</> is the TID to be indexed.
267 If the access method supports unique indexes (its
268 <structfield>amcanunique</> flag is true) then
269 <literal>checkUnique</> indicates the type of uniqueness check to
270 perform. This varies depending on whether the unique constraint is
271 deferrable; see <xref linkend="index-unique-checks"> for details.
272 Normally the access method only needs the <literal>heapRelation</>
273 parameter when performing uniqueness checking (since then it will have to
274 look into the heap to verify tuple liveness).
278 The function's Boolean result value is significant only when
279 <literal>checkUnique</> is <literal>UNIQUE_CHECK_PARTIAL</>.
280 In this case a TRUE result means the new entry is known unique, whereas
281 FALSE means it might be non-unique (and a deferred uniqueness check must
282 be scheduled). For other cases a constant FALSE result is recommended.
286 Some indexes might not index all tuples. If the tuple is not to be
287 indexed, <function>aminsert</> should just return without doing anything.
292 IndexBulkDeleteResult *
293 ambulkdelete (IndexVacuumInfo *info,
294 IndexBulkDeleteResult *stats,
295 IndexBulkDeleteCallback callback,
296 void *callback_state);
298 Delete tuple(s) from the index. This is a <quote>bulk delete</> operation
299 that is intended to be implemented by scanning the whole index and checking
300 each entry to see if it should be deleted.
301 The passed-in <literal>callback</> function must be called, in the style
302 <literal>callback(<replaceable>TID</>, callback_state) returns bool</literal>,
303 to determine whether any particular index entry, as identified by its
304 referenced TID, is to be deleted. Must return either NULL or a palloc'd
305 struct containing statistics about the effects of the deletion operation.
306 It is OK to return NULL if no information needs to be passed on to
307 <function>amvacuumcleanup</>.
311 Because of limited <varname>maintenance_work_mem</>,
312 <function>ambulkdelete</> might need to be called more than once when many
313 tuples are to be deleted. The <literal>stats</> argument is the result
314 of the previous call for this index (it is NULL for the first call within a
315 <command>VACUUM</> operation). This allows the AM to accumulate statistics
316 across the whole operation. Typically, <function>ambulkdelete</> will
317 modify and return the same struct if the passed <literal>stats</> is not
323 IndexBulkDeleteResult *
324 amvacuumcleanup (IndexVacuumInfo *info,
325 IndexBulkDeleteResult *stats);
327 Clean up after a <command>VACUUM</command> operation (zero or more
328 <function>ambulkdelete</> calls). This does not have to do anything
329 beyond returning index statistics, but it might perform bulk cleanup
330 such as reclaiming empty index pages. <literal>stats</> is whatever the
331 last <function>ambulkdelete</> call returned, or NULL if
332 <function>ambulkdelete</> was not called because no tuples needed to be
333 deleted. If the result is not NULL it must be a palloc'd struct.
334 The statistics it contains will be used to update <structname>pg_class</>,
335 and will be reported by <command>VACUUM</> if <literal>VERBOSE</> is given.
336 It is OK to return NULL if the index was not changed at all during the
337 <command>VACUUM</command> operation, but otherwise correct stats should
342 As of <productname>PostgreSQL</productname> 8.4,
343 <function>amvacuumcleanup</> will also be called at completion of an
344 <command>ANALYZE</> operation. In this case <literal>stats</> is always
345 NULL and any return value will be ignored. This case can be distinguished
346 by checking <literal>info->analyze_only</literal>. It is recommended
347 that the access method do nothing except post-insert cleanup in such a
348 call, and that only in an autovacuum worker process.
354 amcanreturn (Relation indexRelation, int attno);
356 Check whether the index can support <link
357 linkend="indexes-index-only-scans"><firstterm>index-only scans</></link> on
358 the given column, by returning the indexed column values for an index entry
359 in the form of an <structname>IndexTuple</structname>. The attribute number
360 is 1-based, i.e. the first column's attno is 1. Returns TRUE if supported,
361 else FALSE. If the access method does not support index-only scans at all,
362 the <structfield>amcanreturn</> field in its <structname>IndexAmRoutine</>
363 struct can be set to NULL.
369 amcostestimate (PlannerInfo *root,
372 Cost *indexStartupCost,
373 Cost *indexTotalCost,
374 Selectivity *indexSelectivity,
375 double *indexCorrelation);
377 Estimate the costs of an index scan. This function is described fully
378 in <xref linkend="index-cost-estimation">, below.
384 amoptions (ArrayType *reloptions,
387 Parse and validate the reloptions array for an index. This is called only
388 when a non-null reloptions array exists for the index.
389 <parameter>reloptions</> is a <type>text</> array containing entries of the
390 form <replaceable>name</><literal>=</><replaceable>value</>.
391 The function should construct a <type>bytea</> value, which will be copied
392 into the <structfield>rd_options</> field of the index's relcache entry.
393 The data contents of the <type>bytea</> value are open for the access
394 method to define; most of the standard access methods use struct
395 <structname>StdRdOptions</>.
396 When <parameter>validate</> is true, the function should report a suitable
397 error message if any of the options are unrecognized or have invalid
398 values; when <parameter>validate</> is false, invalid entries should be
399 silently ignored. (<parameter>validate</> is false when loading options
400 already stored in <structname>pg_catalog</>; an invalid entry could only
401 be found if the access method has changed its rules for options, and in
402 that case ignoring obsolete entries is appropriate.)
403 It is OK to return NULL if default behavior is wanted.
409 amproperty (Oid index_oid, int attno,
410 IndexAMProperty prop, const char *propname,
411 bool *res, bool *isnull);
413 The <function>amproperty</> method allows index access methods to override
414 the default behavior of <function>pg_index_column_has_property</function>
415 and related functions.
416 If the access method does not have any special behavior for index property
417 inquiries, the <structfield>amproperty</> field in
418 its <structname>IndexAmRoutine</> struct can be set to NULL.
419 Otherwise, the <function>amproperty</> method will be called with
420 <parameter>index_oid</> and <parameter>attno</> both zero for
421 <function>pg_indexam_has_property</function> calls,
422 or with <parameter>index_oid</> valid and <parameter>attno</> zero for
423 <function>pg_index_has_property</function> calls,
424 or with <parameter>index_oid</> valid and <parameter>attno</> greater than
425 zero for <function>pg_index_column_has_property</function> calls.
426 <parameter>prop</> is an enum value identifying the property being tested,
427 while <parameter>propname</> is the original property name string.
428 If the core code does not recognize the property name
429 then <parameter>prop</> is <literal>AMPROP_UNKNOWN</>.
430 Access methods can define custom property names by
431 checking <parameter>propname</> for a match (use <function>pg_strcasecmp</>
432 to match, for consistency with the core code); for names known to the core
433 code, it's better to inspect <parameter>prop</>.
434 If the <structfield>amproperty</> method returns <literal>true</> then
435 it has determined the property test result: it must set <literal>*res</>
436 to the boolean value to return, or set <literal>*isnull</>
437 to <literal>true</> to return a NULL. (Both of the referenced variables
438 are initialized to <literal>false</> before the call.)
439 If the <structfield>amproperty</> method returns <literal>false</> then
440 the core code will proceed with its normal logic for determining the
441 property test result.
445 Access methods that support ordering operators should
446 implement <literal>AMPROP_DISTANCE_ORDERABLE</> property testing, as the
447 core code does not know how to do that and will return NULL. It may
448 also be advantageous to implement <literal>AMPROP_RETURNABLE</> testing,
449 if that can be done more cheaply than by opening the index and calling
450 <structfield>amcanreturn</>, which is the core code's default behavior.
451 The default behavior should be satisfactory for all other standard
458 amvalidate (Oid opclassoid);
460 Validate the catalog entries for the specified operator class, so far as
461 the access method can reasonably do that. For example, this might include
462 testing that all required support functions are provided.
463 The <function>amvalidate</> function must return false if the opclass is
464 invalid. Problems should be reported with <function>ereport</> messages.
469 The purpose of an index, of course, is to support scans for tuples matching
470 an indexable <literal>WHERE</> condition, often called a
471 <firstterm>qualifier</> or <firstterm>scan key</>. The semantics of
472 index scanning are described more fully in <xref linkend="index-scanning">,
473 below. An index access method can support <quote>plain</> index scans,
474 <quote>bitmap</> index scans, or both. The scan-related functions that an
475 index access method must or may provide are:
481 ambeginscan (Relation indexRelation,
485 Prepare for an index scan. The <literal>nkeys</> and <literal>norderbys</>
486 parameters indicate the number of quals and ordering operators that will be
487 used in the scan; these may be useful for space allocation purposes.
488 Note that the actual values of the scan keys aren't provided yet.
489 The result must be a palloc'd struct.
490 For implementation reasons the index access method
491 <emphasis>must</> create this struct by calling
492 <function>RelationGetIndexScan()</>. In most cases
493 <function>ambeginscan</> does little beyond making that call and perhaps
495 the interesting parts of index-scan startup are in <function>amrescan</>.
501 amrescan (IndexScanDesc scan,
507 Start or restart an index scan, possibly with new scan keys. (To restart
508 using previously-passed keys, NULL is passed for <literal>keys</> and/or
509 <literal>orderbys</>.) Note that it is not allowed for
510 the number of keys or order-by operators to be larger than
511 what was passed to <function>ambeginscan</>. In practice the restart
512 feature is used when a new outer tuple is selected by a nested-loop join
513 and so a new key comparison value is needed, but the scan key structure
520 amgettuple (IndexScanDesc scan,
521 ScanDirection direction);
523 Fetch the next tuple in the given scan, moving in the given
524 direction (forward or backward in the index). Returns TRUE if a tuple was
525 obtained, FALSE if no matching tuples remain. In the TRUE case the tuple
526 TID is stored into the <literal>scan</> structure. Note that
527 <quote>success</> means only that the index contains an entry that matches
528 the scan keys, not that the tuple necessarily still exists in the heap or
529 will pass the caller's snapshot test. On success, <function>amgettuple</>
530 must also set <literal>scan->xs_recheck</> to TRUE or FALSE.
531 FALSE means it is certain that the index entry matches the scan keys.
532 TRUE means this is not certain, and the conditions represented by the
533 scan keys must be rechecked against the heap tuple after fetching it.
534 This provision supports <quote>lossy</> index operators.
535 Note that rechecking will extend only to the scan conditions; a partial
536 index predicate (if any) is never rechecked by <function>amgettuple</>
541 If the index supports <link linkend="indexes-index-only-scans">index-only
542 scans</link> (i.e., <function>amcanreturn</function> returns TRUE for it),
543 then on success the AM must also check
544 <literal>scan->xs_want_itup</>, and if that is true it must return
545 the original indexed data for the index entry, in the form of an
546 <structname>IndexTuple</> pointer stored at <literal>scan->xs_itup</>,
547 with tuple descriptor <literal>scan->xs_itupdesc</>.
548 (Management of the data referenced by the pointer is the access method's
549 responsibility. The data must remain good at least until the next
550 <function>amgettuple</>, <function>amrescan</>, or <function>amendscan</>
555 The <function>amgettuple</> function need only be provided if the access
556 method supports <quote>plain</> index scans. If it doesn't, the
557 <structfield>amgettuple</> field in its <structname>IndexAmRoutine</>
558 struct must be set to NULL.
564 amgetbitmap (IndexScanDesc scan,
567 Fetch all tuples in the given scan and add them to the caller-supplied
568 <type>TIDBitmap</type> (that is, OR the set of tuple IDs into whatever set is already
569 in the bitmap). The number of tuples fetched is returned (this might be
570 just an approximate count, for instance some AMs do not detect duplicates).
571 While inserting tuple IDs into the bitmap, <function>amgetbitmap</> can
572 indicate that rechecking of the scan conditions is required for specific
573 tuple IDs. This is analogous to the <literal>xs_recheck</> output parameter
574 of <function>amgettuple</>. Note: in the current implementation, support
575 for this feature is conflated with support for lossy storage of the bitmap
576 itself, and therefore callers recheck both the scan conditions and the
577 partial index predicate (if any) for recheckable tuples. That might not
578 always be true, however.
579 <function>amgetbitmap</> and
580 <function>amgettuple</> cannot be used in the same index scan; there
581 are other restrictions too when using <function>amgetbitmap</>, as explained
582 in <xref linkend="index-scanning">.
586 The <function>amgetbitmap</> function need only be provided if the access
587 method supports <quote>bitmap</> index scans. If it doesn't, the
588 <structfield>amgetbitmap</> field in its <structname>IndexAmRoutine</>
589 struct must be set to NULL.
595 amendscan (IndexScanDesc scan);
597 End a scan and release resources. The <literal>scan</> struct itself
598 should not be freed, but any locks or pins taken internally by the
599 access method must be released.
605 ammarkpos (IndexScanDesc scan);
607 Mark current scan position. The access method need only support one
608 remembered scan position per scan.
612 The <function>ammarkpos</> function need only be provided if the access
613 method supports ordered scans. If it doesn't,
614 the <structfield>ammarkpos</> field in its <structname>IndexAmRoutine</>
615 struct may be set to NULL.
621 amrestrpos (IndexScanDesc scan);
623 Restore the scan to the most recently marked position.
627 The <function>amrestrpos</> function need only be provided if the access
628 method supports ordered scans. If it doesn't,
629 the <structfield>amrestrpos</> field in its <structname>IndexAmRoutine</>
630 struct may be set to NULL.
634 In addition to supporting ordinary index scans, some types of index
635 may wish to support <firstterm>parallel index scans</>, which allow
636 multiple backends to cooperate in performing an index scan. The
637 index access method should arrange things so that each cooperating
638 process returns a subset of the tuples that would be performed by
639 an ordinary, non-parallel index scan, but in such a way that the
640 union of those subsets is equal to the set of tuples that would be
641 returned by an ordinary, non-parallel index scan. Furthermore, while
642 there need not be any global ordering of tuples returned by a parallel
643 scan, the ordering of that subset of tuples returned within each
644 cooperating backend must match the requested ordering. The following
645 functions may be implemented to support parallel index scans:
651 amestimateparallelscan (void);
653 Estimate and return the number of bytes of dynamic shared memory which
654 the access method will be needed to perform a parallel scan. (This number
655 is in addition to, not in lieu of, the amount of space needed for
656 AM-independent data in <structname>ParallelIndexScanDescData</>.)
660 It is not necessary to implement this function for access methods which
661 do not support parallel scans or for which the number of additional bytes
662 of storage required is zero.
668 aminitparallelscan (void *target);
670 This function will be called to initialize dynamic shared memory at the
671 beginning of a parallel scan. <parameter>target</> will point to at least
672 the number of bytes previously returned by
673 <function>amestimateparallelscan</>, and this function may use that
674 amount of space to store whatever data it wishes.
678 It is not necessary to implement this function for access methods which
679 do not support parallel scans or in cases where the shared memory space
680 required needs no initialization.
686 amparallelrescan (IndexScanDesc scan);
688 This function, if implemented, will be called when a parallel index scan
689 must be restarted. It should reset any shared state set up by
690 <function>aminitparallelscan</> such that the scan will be restarted from
696 <sect1 id="index-scanning">
697 <title>Index Scanning</title>
700 In an index scan, the index access method is responsible for regurgitating
701 the TIDs of all the tuples it has been told about that match the
702 <firstterm>scan keys</>. The access method is <emphasis>not</> involved in
703 actually fetching those tuples from the index's parent table, nor in
704 determining whether they pass the scan's time qualification test or other
709 A scan key is the internal representation of a <literal>WHERE</> clause of
710 the form <replaceable>index_key</> <replaceable>operator</>
711 <replaceable>constant</>, where the index key is one of the columns of the
712 index and the operator is one of the members of the operator family
713 associated with that index column. An index scan has zero or more scan
714 keys, which are implicitly ANDed — the returned tuples are expected
715 to satisfy all the indicated conditions.
719 The access method can report that the index is <firstterm>lossy</>, or
720 requires rechecks, for a particular query. This implies that the index
721 scan will return all the entries that pass the scan key, plus possibly
722 additional entries that do not. The core system's index-scan machinery
723 will then apply the index conditions again to the heap tuple to verify
724 whether or not it really should be selected. If the recheck option is not
725 specified, the index scan must return exactly the set of matching entries.
729 Note that it is entirely up to the access method to ensure that it
730 correctly finds all and only the entries passing all the given scan keys.
731 Also, the core system will simply hand off all the <literal>WHERE</>
732 clauses that match the index keys and operator families, without any
733 semantic analysis to determine whether they are redundant or
734 contradictory. As an example, given
735 <literal>WHERE x > 4 AND x > 14</> where <literal>x</> is a b-tree
736 indexed column, it is left to the b-tree <function>amrescan</> function
737 to realize that the first scan key is redundant and can be discarded.
738 The extent of preprocessing needed during <function>amrescan</> will
739 depend on the extent to which the index access method needs to reduce
740 the scan keys to a <quote>normalized</> form.
744 Some access methods return index entries in a well-defined order, others
745 do not. There are actually two different ways that an access method can
746 support sorted output:
751 Access methods that always return entries in the natural ordering
752 of their data (such as btree) should set
753 <structfield>amcanorder</> to true.
754 Currently, such access methods must use btree-compatible strategy
755 numbers for their equality and ordering operators.
760 Access methods that support ordering operators should set
761 <structfield>amcanorderbyop</> to true.
762 This indicates that the index is capable of returning entries in
763 an order satisfying <literal>ORDER BY</> <replaceable>index_key</>
764 <replaceable>operator</> <replaceable>constant</>. Scan modifiers
765 of that form can be passed to <function>amrescan</> as described
773 The <function>amgettuple</> function has a <literal>direction</> argument,
774 which can be either <literal>ForwardScanDirection</> (the normal case)
775 or <literal>BackwardScanDirection</>. If the first call after
776 <function>amrescan</> specifies <literal>BackwardScanDirection</>, then the
777 set of matching index entries is to be scanned back-to-front rather than in
778 the normal front-to-back direction, so <function>amgettuple</> must return
779 the last matching tuple in the index, rather than the first one as it
780 normally would. (This will only occur for access
781 methods that set <structfield>amcanorder</> to true.) After the
782 first call, <function>amgettuple</> must be prepared to advance the scan in
783 either direction from the most recently returned entry. (But if
784 <structfield>amcanbackward</> is false, all subsequent
785 calls will have the same direction as the first one.)
789 Access methods that support ordered scans must support <quote>marking</> a
790 position in a scan and later returning to the marked position. The same
791 position might be restored multiple times. However, only one position need
792 be remembered per scan; a new <function>ammarkpos</> call overrides the
793 previously marked position. An access method that does not support ordered
794 scans need not provide <function>ammarkpos</> and <function>amrestrpos</>
795 functions in <structname>IndexAmRoutine</>; set those pointers to NULL
800 Both the scan position and the mark position (if any) must be maintained
801 consistently in the face of concurrent insertions or deletions in the
802 index. It is OK if a freshly-inserted entry is not returned by a scan that
803 would have found the entry if it had existed when the scan started, or for
804 the scan to return such an entry upon rescanning or backing
805 up even though it had not been returned the first time through. Similarly,
806 a concurrent delete might or might not be reflected in the results of a scan.
807 What is important is that insertions or deletions not cause the scan to
808 miss or multiply return entries that were not themselves being inserted or
813 If the index stores the original indexed data values (and not some lossy
814 representation of them), it is useful to
815 support <link linkend="indexes-index-only-scans">index-only scans</link>, in
816 which the index returns the actual data not just the TID of the heap tuple.
817 This will only avoid I/O if the visibility map shows that the TID is on an
818 all-visible page; else the heap tuple must be visited anyway to check
819 MVCC visibility. But that is no concern of the access method's.
823 Instead of using <function>amgettuple</>, an index scan can be done with
824 <function>amgetbitmap</> to fetch all tuples in one call. This can be
825 noticeably more efficient than <function>amgettuple</> because it allows
826 avoiding lock/unlock cycles within the access method. In principle
827 <function>amgetbitmap</> should have the same effects as repeated
828 <function>amgettuple</> calls, but we impose several restrictions to
829 simplify matters. First of all, <function>amgetbitmap</> returns all
830 tuples at once and marking or restoring scan positions isn't
831 supported. Secondly, the tuples are returned in a bitmap which doesn't
832 have any specific ordering, which is why <function>amgetbitmap</> doesn't
833 take a <literal>direction</> argument. (Ordering operators will never be
834 supplied for such a scan, either.)
835 Also, there is no provision for index-only scans with
836 <function>amgetbitmap</>, since there is no way to return the contents of
838 Finally, <function>amgetbitmap</>
839 does not guarantee any locking of the returned tuples, with implications
840 spelled out in <xref linkend="index-locking">.
844 Note that it is permitted for an access method to implement only
845 <function>amgetbitmap</> and not <function>amgettuple</>, or vice versa,
846 if its internal implementation is unsuited to one API or the other.
851 <sect1 id="index-locking">
852 <title>Index Locking Considerations</title>
855 Index access methods must handle concurrent updates
856 of the index by multiple processes.
857 The core <productname>PostgreSQL</productname> system obtains
858 <literal>AccessShareLock</> on the index during an index scan, and
859 <literal>RowExclusiveLock</> when updating the index (including plain
860 <command>VACUUM</>). Since these lock types do not conflict, the access
861 method is responsible for handling any fine-grained locking it might need.
862 An exclusive lock on the index as a whole will be taken only during index
863 creation, destruction, or <command>REINDEX</>.
867 Building an index type that supports concurrent updates usually requires
868 extensive and subtle analysis of the required behavior. For the b-tree
869 and hash index types, you can read about the design decisions involved in
870 <filename>src/backend/access/nbtree/README</> and
871 <filename>src/backend/access/hash/README</>.
875 Aside from the index's own internal consistency requirements, concurrent
876 updates create issues about consistency between the parent table (the
877 <firstterm>heap</>) and the index. Because
878 <productname>PostgreSQL</productname> separates accesses
879 and updates of the heap from those of the index, there are windows in
880 which the index might be inconsistent with the heap. We handle this problem
881 with the following rules:
886 A new heap entry is made before making its index entries. (Therefore
887 a concurrent index scan is likely to fail to see the heap entry.
888 This is okay because the index reader would be uninterested in an
889 uncommitted row anyway. But see <xref linkend="index-unique-checks">.)
894 When a heap entry is to be deleted (by <command>VACUUM</>), all its
895 index entries must be removed first.
900 An index scan must maintain a pin
901 on the index page holding the item last returned by
902 <function>amgettuple</>, and <function>ambulkdelete</> cannot delete
903 entries from pages that are pinned by other backends. The need
904 for this rule is explained below.
909 Without the third rule, it is possible for an index reader to
910 see an index entry just before it is removed by <command>VACUUM</>, and
911 then to arrive at the corresponding heap entry after that was removed by
913 This creates no serious problems if that item
914 number is still unused when the reader reaches it, since an empty
915 item slot will be ignored by <function>heap_fetch()</>. But what if a
916 third backend has already re-used the item slot for something else?
917 When using an MVCC-compliant snapshot, there is no problem because
918 the new occupant of the slot is certain to be too new to pass the
919 snapshot test. However, with a non-MVCC-compliant snapshot (such as
920 <literal>SnapshotAny</>), it would be possible to accept and return
921 a row that does not in fact match the scan keys. We could defend
922 against this scenario by requiring the scan keys to be rechecked
923 against the heap row in all cases, but that is too expensive. Instead,
924 we use a pin on an index page as a proxy to indicate that the reader
925 might still be <quote>in flight</> from the index entry to the matching
926 heap entry. Making <function>ambulkdelete</> block on such a pin ensures
927 that <command>VACUUM</> cannot delete the heap entry before the reader
928 is done with it. This solution costs little in run time, and adds blocking
929 overhead only in the rare cases where there actually is a conflict.
933 This solution requires that index scans be <quote>synchronous</>: we have
934 to fetch each heap tuple immediately after scanning the corresponding index
935 entry. This is expensive for a number of reasons. An
936 <quote>asynchronous</> scan in which we collect many TIDs from the index,
937 and only visit the heap tuples sometime later, requires much less index
938 locking overhead and can allow a more efficient heap access pattern.
939 Per the above analysis, we must use the synchronous approach for
940 non-MVCC-compliant snapshots, but an asynchronous scan is workable
941 for a query using an MVCC snapshot.
945 In an <function>amgetbitmap</> index scan, the access method does not
946 keep an index pin on any of the returned tuples. Therefore
947 it is only safe to use such scans with MVCC-compliant snapshots.
951 When the <structfield>ampredlocks</> flag is not set, any scan using that
952 index access method within a serializable transaction will acquire a
953 nonblocking predicate lock on the full index. This will generate a
954 read-write conflict with the insert of any tuple into that index by a
955 concurrent serializable transaction. If certain patterns of read-write
956 conflicts are detected among a set of concurrent serializable
957 transactions, one of those transactions may be canceled to protect data
958 integrity. When the flag is set, it indicates that the index access
959 method implements finer-grained predicate locking, which will tend to
960 reduce the frequency of such transaction cancellations.
965 <sect1 id="index-unique-checks">
966 <title>Index Uniqueness Checks</title>
969 <productname>PostgreSQL</productname> enforces SQL uniqueness constraints
970 using <firstterm>unique indexes</>, which are indexes that disallow
971 multiple entries with identical keys. An access method that supports this
972 feature sets <structfield>amcanunique</> true.
973 (At present, only b-tree supports it.)
977 Because of MVCC, it is always necessary to allow duplicate entries to
978 exist physically in an index: the entries might refer to successive
979 versions of a single logical row. The behavior we actually want to
980 enforce is that no MVCC snapshot could include two rows with equal
981 index keys. This breaks down into the following cases that must be
982 checked when inserting a new row into a unique index:
987 If a conflicting valid row has been deleted by the current transaction,
988 it's okay. (In particular, since an UPDATE always deletes the old row
989 version before inserting the new version, this will allow an UPDATE on
990 a row without changing the key.)
995 If a conflicting row has been inserted by an as-yet-uncommitted
996 transaction, the would-be inserter must wait to see if that transaction
997 commits. If it rolls back then there is no conflict. If it commits
998 without deleting the conflicting row again, there is a uniqueness
999 violation. (In practice we just wait for the other transaction to
1000 end and then redo the visibility check in toto.)
1005 Similarly, if a conflicting valid row has been deleted by an
1006 as-yet-uncommitted transaction, the would-be inserter must wait
1007 for that transaction to commit or abort, and then repeat the test.
1014 Furthermore, immediately before reporting a uniqueness violation
1015 according to the above rules, the access method must recheck the
1016 liveness of the row being inserted. If it is committed dead then
1017 no violation should be reported. (This case cannot occur during the
1018 ordinary scenario of inserting a row that's just been created by
1019 the current transaction. It can happen during
1020 <command>CREATE UNIQUE INDEX CONCURRENTLY</>, however.)
1024 We require the index access method to apply these tests itself, which
1025 means that it must reach into the heap to check the commit status of
1026 any row that is shown to have a duplicate key according to the index
1027 contents. This is without a doubt ugly and non-modular, but it saves
1028 redundant work: if we did a separate probe then the index lookup for
1029 a conflicting row would be essentially repeated while finding the place to
1030 insert the new row's index entry. What's more, there is no obvious way
1031 to avoid race conditions unless the conflict check is an integral part
1032 of insertion of the new index entry.
1036 If the unique constraint is deferrable, there is additional complexity:
1037 we need to be able to insert an index entry for a new row, but defer any
1038 uniqueness-violation error until end of statement or even later. To
1039 avoid unnecessary repeat searches of the index, the index access method
1040 should do a preliminary uniqueness check during the initial insertion.
1041 If this shows that there is definitely no conflicting live tuple, we
1042 are done. Otherwise, we schedule a recheck to occur when it is time to
1043 enforce the constraint. If, at the time of the recheck, both the inserted
1044 tuple and some other tuple with the same key are live, then the error
1045 must be reported. (Note that for this purpose, <quote>live</> actually
1046 means <quote>any tuple in the index entry's HOT chain is live</>.)
1047 To implement this, the <function>aminsert</> function is passed a
1048 <literal>checkUnique</> parameter having one of the following values:
1053 <literal>UNIQUE_CHECK_NO</> indicates that no uniqueness checking
1054 should be done (this is not a unique index).
1059 <literal>UNIQUE_CHECK_YES</> indicates that this is a non-deferrable
1060 unique index, and the uniqueness check must be done immediately, as
1066 <literal>UNIQUE_CHECK_PARTIAL</> indicates that the unique
1067 constraint is deferrable. <productname>PostgreSQL</productname>
1068 will use this mode to insert each row's index entry. The access
1069 method must allow duplicate entries into the index, and report any
1070 potential duplicates by returning FALSE from <function>aminsert</>.
1071 For each row for which FALSE is returned, a deferred recheck will
1076 The access method must identify any rows which might violate the
1077 unique constraint, but it is not an error for it to report false
1078 positives. This allows the check to be done without waiting for other
1079 transactions to finish; conflicts reported here are not treated as
1080 errors and will be rechecked later, by which time they may no longer
1086 <literal>UNIQUE_CHECK_EXISTING</> indicates that this is a deferred
1087 recheck of a row that was reported as a potential uniqueness violation.
1088 Although this is implemented by calling <function>aminsert</>, the
1089 access method must <emphasis>not</> insert a new index entry in this
1090 case. The index entry is already present. Rather, the access method
1091 must check to see if there is another live index entry. If so, and
1092 if the target row is also still live, report error.
1096 It is recommended that in a <literal>UNIQUE_CHECK_EXISTING</> call,
1097 the access method further verify that the target row actually does
1098 have an existing entry in the index, and report error if not. This
1099 is a good idea because the index tuple values passed to
1100 <function>aminsert</> will have been recomputed. If the index
1101 definition involves functions that are not really immutable, we
1102 might be checking the wrong area of the index. Checking that the
1103 target row is found in the recheck verifies that we are scanning
1104 for the same tuple values as were used in the original insertion.
1112 <sect1 id="index-cost-estimation">
1113 <title>Index Cost Estimation Functions</title>
1116 The <function>amcostestimate</> function is given information describing
1117 a possible index scan, including lists of WHERE and ORDER BY clauses that
1118 have been determined to be usable with the index. It must return estimates
1119 of the cost of accessing the index and the selectivity of the WHERE
1120 clauses (that is, the fraction of parent-table rows that will be
1121 retrieved during the index scan). For simple cases, nearly all the
1122 work of the cost estimator can be done by calling standard routines
1123 in the optimizer; the point of having an <function>amcostestimate</> function is
1124 to allow index access methods to provide index-type-specific knowledge,
1125 in case it is possible to improve on the standard estimates.
1129 Each <function>amcostestimate</> function must have the signature:
1133 amcostestimate (PlannerInfo *root,
1136 Cost *indexStartupCost,
1137 Cost *indexTotalCost,
1138 Selectivity *indexSelectivity,
1139 double *indexCorrelation);
1142 The first three parameters are inputs:
1146 <term><parameter>root</></term>
1149 The planner's information about the query being processed.
1155 <term><parameter>path</></term>
1158 The index access path being considered. All fields except cost and
1159 selectivity values are valid.
1165 <term><parameter>loop_count</></term>
1168 The number of repetitions of the index scan that should be factored
1169 into the cost estimates. This will typically be greater than one when
1170 considering a parameterized scan for use in the inside of a nestloop
1171 join. Note that the cost estimates should still be for just one scan;
1172 a larger <parameter>loop_count</> means that it may be appropriate
1173 to allow for some caching effects across multiple scans.
1181 The last four parameters are pass-by-reference outputs:
1185 <term><parameter>*indexStartupCost</></term>
1188 Set to cost of index start-up processing
1194 <term><parameter>*indexTotalCost</></term>
1197 Set to total cost of index processing
1203 <term><parameter>*indexSelectivity</></term>
1206 Set to index selectivity
1212 <term><parameter>*indexCorrelation</></term>
1215 Set to correlation coefficient between index scan order and
1216 underlying table's order
1224 Note that cost estimate functions must be written in C, not in SQL or
1225 any available procedural language, because they must access internal
1226 data structures of the planner/optimizer.
1230 The index access costs should be computed using the parameters used by
1231 <filename>src/backend/optimizer/path/costsize.c</filename>: a sequential
1232 disk block fetch has cost <varname>seq_page_cost</>, a nonsequential fetch
1233 has cost <varname>random_page_cost</>, and the cost of processing one index
1234 row should usually be taken as <varname>cpu_index_tuple_cost</>. In
1235 addition, an appropriate multiple of <varname>cpu_operator_cost</> should
1236 be charged for any comparison operators invoked during index processing
1237 (especially evaluation of the indexquals themselves).
1241 The access costs should include all disk and CPU costs associated with
1242 scanning the index itself, but <emphasis>not</> the costs of retrieving or
1243 processing the parent-table rows that are identified by the index.
1247 The <quote>start-up cost</quote> is the part of the total scan cost that
1248 must be expended before we can begin to fetch the first row. For most
1249 indexes this can be taken as zero, but an index type with a high start-up
1250 cost might want to set it nonzero.
1254 The <parameter>indexSelectivity</> should be set to the estimated fraction of the parent
1255 table rows that will be retrieved during the index scan. In the case
1256 of a lossy query, this will typically be higher than the fraction of
1257 rows that actually pass the given qual conditions.
1261 The <parameter>indexCorrelation</> should be set to the correlation (ranging between
1262 -1.0 and 1.0) between the index order and the table order. This is used
1263 to adjust the estimate for the cost of fetching rows from the parent
1268 When <parameter>loop_count</> is greater than one, the returned numbers
1269 should be averages expected for any one scan of the index.
1273 <title>Cost Estimation</title>
1275 A typical cost estimator will proceed as follows:
1280 Estimate and return the fraction of parent-table rows that will be visited
1281 based on the given qual conditions. In the absence of any index-type-specific
1282 knowledge, use the standard optimizer function <function>clauselist_selectivity()</function>:
1285 *indexSelectivity = clauselist_selectivity(root, path->indexquals,
1286 path->indexinfo->rel->relid,
1294 Estimate the number of index rows that will be visited during the
1295 scan. For many index types this is the same as <parameter>indexSelectivity</> times
1296 the number of rows in the index, but it might be more. (Note that the
1297 index's size in pages and rows is available from the
1298 <literal>path->indexinfo</> struct.)
1304 Estimate the number of index pages that will be retrieved during the scan.
1305 This might be just <parameter>indexSelectivity</> times the index's size in pages.
1311 Compute the index access cost. A generic estimator might do this:
1315 * Our generic assumption is that the index pages will be read
1316 * sequentially, so they cost seq_page_cost each, not random_page_cost.
1317 * Also, we charge for evaluation of the indexquals at each index row.
1318 * All the costs are assumed to be paid incrementally during the scan.
1320 cost_qual_eval(&index_qual_cost, path->indexquals, root);
1321 *indexStartupCost = index_qual_cost.startup;
1322 *indexTotalCost = seq_page_cost * numIndexPages +
1323 (cpu_index_tuple_cost + index_qual_cost.per_tuple) * numIndexTuples;
1326 However, the above does not account for amortization of index reads
1327 across repeated index scans.
1333 Estimate the index correlation. For a simple ordered index on a single
1334 field, this can be retrieved from pg_statistic. If the correlation
1335 is not known, the conservative estimate is zero (no correlation).
1341 Examples of cost estimator functions can be found in
1342 <filename>src/backend/utils/adt/selfuncs.c</filename>.