reuse, and a clean interface.
</para>
-</sect1>
-
-<sect1 id="gist-implementation">
- <title>Implementation</title>
-
<para>
There are seven methods that an index operator class for
<acronym>GiST</acronym> must provide, and an eighth that is optional.
</variablelist>
+ <para>
+ All the GiST support methods are normally called in short-lived memory
+ contexts; that is, <varname>CurrentMemoryContext</> will get reset after
+ each tuple is processed. It is therefore not very important to worry about
+ pfree'ing everything you palloc. However, in some cases it's useful for a
+ support method to cache data across repeated calls. To do that, allocate
+ the longer-lived data in <literal>fcinfo->flinfo->fn_mcxt</>, and
+ keep a pointer to it in <literal>fcinfo->flinfo->fn_extra</>. Such
+ data will survive for the life of the index operation (e.g., a single GiST
+ index scan, index build, or index tuple insertion). Be careful to pfree
+ the previous value when replacing a <literal>fn_extra</> value, or the leak
+ will accumulate for the duration of the operation.
+ </para>
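+ <para>
+ For example, a <function>consistent</> function might cache a
+ preprocessed form of its query argument in <literal>fn_extra</>,
+ along these lines (a minimal sketch only: the
+ <structname>PreparedQuery</> type and the <function>prepare_query</>
+ helper are hypothetical stand-ins for whatever preprocessing the
+ operator class actually needs):
+<programlisting>
+PreparedQuery *pq = (PreparedQuery *) fcinfo->flinfo->fn_extra;
+
+if (pq == NULL)
+{
+    /* first call in this index operation: build the cache in fn_mcxt */
+    MemoryContext oldcxt;
+
+    oldcxt = MemoryContextSwitchTo(fcinfo->flinfo->fn_mcxt);
+    pq = prepare_query(query);          /* hypothetical preprocessing */
+    fcinfo->flinfo->fn_extra = (void *) pq;
+    MemoryContextSwitchTo(oldcxt);
+}
+</programlisting>
+ </para>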
+
+</sect1>
+
+<sect1 id="gist-implementation">
+ <title>Implementation</title>
+
<sect2 id="gist-buffering-build">
<title>GiST buffering build</title>
<para>
Building large GiST indexes by simply inserting all the tuples tends to be
slow, because if the index tuples are scattered across the index and the
index is large enough to not fit in cache, the insertions need to perform
- a lot of random I/O. PostgreSQL from version 9.2 supports a more efficient
- method to build GiST indexes based on buffering, which can dramatically
- reduce number of random I/O needed for non-ordered data sets. For
- well-ordered datasets the benefit is smaller or non-existent, because
- only a small number of pages receive new tuples at a time, and those pages
- fit in cache even if the index as whole does not.
+ a lot of random I/O. Beginning in version 9.2, PostgreSQL supports a more
+ efficient method to build GiST indexes based on buffering, which can
+ dramatically reduce the number of random I/Os needed for non-ordered data
+ sets. For well-ordered data sets the benefit is smaller or non-existent,
+ because only a small number of pages receive new tuples at a time, and
+ those pages fit in cache even if the index as a whole does not.
</para>
<para>
However, a buffering index build needs to call the <function>penalty</>
function more often, which consumes some extra CPU resources. Also, the
buffers used in the buffering build need temporary disk space, up to
- the size of the resulting index. Buffering can also infuence the quality
- of the produced index, in both positive and negative directions. That
+ the size of the resulting index. Buffering can also influence the quality
+ of the resulting index, in both positive and negative directions. That
influence depends on various factors, like the distribution of the input
- data and operator class implementation.
+ data and the operator class implementation.
</para>
<para>
- By default, the index build switches to the buffering method when the
+ By default, a GiST index build switches to the buffering method when the
index size reaches <xref linkend="guc-effective-cache-size">. It can
be manually turned on or off by the <literal>BUFFERING</literal> parameter
- to the CREATE INDEX clause. The default behavior is good for most cases,
+ to the CREATE INDEX command. The default behavior is good for most cases,
but turning buffering off might speed up the build somewhat if the input
data is ordered.
</para>
IndexUniqueCheck checkUnique = (IndexUniqueCheck) PG_GETARG_INT32(5);
#endif
IndexTuple itup;
- GISTSTATE giststate;
- MemoryContext oldCtx;
- MemoryContext insertCtx;
+ GISTSTATE *giststate;
+ MemoryContext oldCxt;
- insertCtx = createTempGistContext();
- oldCtx = MemoryContextSwitchTo(insertCtx);
+ giststate = initGISTstate(r);
- initGISTstate(&giststate, r);
+ /*
+ * We use the giststate's scan context as temp context too. This means
+ * that any memory leaked by the support functions is not reclaimed until
+ * end of insert. In most cases, we aren't going to call the support
+ * functions very many times before finishing the insert, so this seems
+ * cheaper than resetting a temp context for each function call.
+ */
+ oldCxt = MemoryContextSwitchTo(giststate->tempCxt);
- itup = gistFormTuple(&giststate, r,
+ itup = gistFormTuple(giststate, r,
values, isnull, true /* size is currently bogus */ );
itup->t_tid = *ht_ctid;
- gistdoinsert(r, itup, 0, &giststate);
+ gistdoinsert(r, itup, 0, giststate);
/* cleanup */
- freeGISTstate(&giststate);
- MemoryContextSwitchTo(oldCtx);
- MemoryContextDelete(insertCtx);
+ MemoryContextSwitchTo(oldCxt);
+ freeGISTstate(giststate);
PG_RETURN_BOOL(false);
}
}
/*
- * Fill a GISTSTATE with information about the index
+ * Create a GISTSTATE and fill it with information about the index
*/
-void
-initGISTstate(GISTSTATE *giststate, Relation index)
+GISTSTATE *
+initGISTstate(Relation index)
{
+ GISTSTATE *giststate;
+ MemoryContext scanCxt;
+ MemoryContext oldCxt;
int i;
+ /* safety check to protect fixed-size arrays in GISTSTATE */
if (index->rd_att->natts > INDEX_MAX_KEYS)
elog(ERROR, "numberOfAttributes %d > %d",
index->rd_att->natts, INDEX_MAX_KEYS);
+ /* Create the memory context that will hold the GISTSTATE */
+ scanCxt = AllocSetContextCreate(CurrentMemoryContext,
+ "GiST scan context",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+ oldCxt = MemoryContextSwitchTo(scanCxt);
+
+ /* Create and fill in the GISTSTATE */
+ giststate = (GISTSTATE *) palloc(sizeof(GISTSTATE));
+
+ giststate->scanCxt = scanCxt;
+ giststate->tempCxt = scanCxt; /* caller must change this if needed */
giststate->tupdesc = index->rd_att;
for (i = 0; i < index->rd_att->natts; i++)
{
fmgr_info_copy(&(giststate->consistentFn[i]),
index_getprocinfo(index, i + 1, GIST_CONSISTENT_PROC),
- CurrentMemoryContext);
+ scanCxt);
fmgr_info_copy(&(giststate->unionFn[i]),
index_getprocinfo(index, i + 1, GIST_UNION_PROC),
- CurrentMemoryContext);
+ scanCxt);
fmgr_info_copy(&(giststate->compressFn[i]),
index_getprocinfo(index, i + 1, GIST_COMPRESS_PROC),
- CurrentMemoryContext);
+ scanCxt);
fmgr_info_copy(&(giststate->decompressFn[i]),
index_getprocinfo(index, i + 1, GIST_DECOMPRESS_PROC),
- CurrentMemoryContext);
+ scanCxt);
fmgr_info_copy(&(giststate->penaltyFn[i]),
index_getprocinfo(index, i + 1, GIST_PENALTY_PROC),
- CurrentMemoryContext);
+ scanCxt);
fmgr_info_copy(&(giststate->picksplitFn[i]),
index_getprocinfo(index, i + 1, GIST_PICKSPLIT_PROC),
- CurrentMemoryContext);
+ scanCxt);
fmgr_info_copy(&(giststate->equalFn[i]),
index_getprocinfo(index, i + 1, GIST_EQUAL_PROC),
- CurrentMemoryContext);
+ scanCxt);
/* opclasses are not required to provide a Distance method */
if (OidIsValid(index_getprocid(index, i + 1, GIST_DISTANCE_PROC)))
fmgr_info_copy(&(giststate->distanceFn[i]),
index_getprocinfo(index, i + 1, GIST_DISTANCE_PROC),
- CurrentMemoryContext);
+ scanCxt);
else
giststate->distanceFn[i].fn_oid = InvalidOid;
if (OidIsValid(index->rd_indcollation[i]))
giststate->supportCollation[i] = index->rd_indcollation[i];
else
giststate->supportCollation[i] = DEFAULT_COLLATION_OID;
}
+
+ MemoryContextSwitchTo(oldCxt);
+
+ return giststate;
}
void
freeGISTstate(GISTSTATE *giststate)
{
- /* no work */
+ /* It's sufficient to delete the scanCxt */
+ MemoryContextDelete(giststate->scanCxt);
}
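
Taken together, the reworked API gives each index operation the following
shape. This is a minimal sketch of the intended call pattern, modeled on
the gistinsert and gistbuild changes in this patch (error paths omitted);
callers that don't need per-tuple resets can simply leave tempCxt equal
to scanCxt:

    GISTSTATE  *giststate;
    MemoryContext oldCxt;

    giststate = initGISTstate(index);   /* GISTSTATE lives in its own scanCxt */

    /* optional: substitute a separate per-tuple context for scanCxt */
    giststate->tempCxt = createTempGistContext();

    /* per tuple: run the support functions in tempCxt, then reset it */
    oldCxt = MemoryContextSwitchTo(giststate->tempCxt);
    /* ... gistFormTuple, gistdoinsert, etc. ... */
    MemoryContextSwitchTo(oldCxt);
    MemoryContextReset(giststate->tempCxt);

    /* teardown: tempCxt is not a child of scanCxt, so delete it first */
    MemoryContextDelete(giststate->tempCxt);
    freeGISTstate(giststate);           /* deletes scanCxt and the GISTSTATE */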
typedef struct
{
Relation indexrel;
- GISTSTATE giststate;
+ GISTSTATE *giststate;
GISTBuildBuffers *gfbb;
int64 indtuples; /* number of tuples indexed */
Size freespace; /* amount of free space to leave on pages */
GistBufferingMode bufferingMode;
- MemoryContext tmpCtx;
} GISTBuildState;
static void gistInitBuffering(GISTBuildState *buildstate);
RelationGetRelationName(index));
/* no locking is needed */
- initGISTstate(&buildstate.giststate, index);
+ buildstate.giststate = initGISTstate(index);
+
+ /*
+ * Create a temporary memory context that is reset once for each tuple
+ * processed. (Note: we don't bother to make this a child of the
+ * giststate's scanCxt, so we have to delete it separately at the end.)
+ */
+ buildstate.giststate->tempCxt = createTempGistContext();
/* initialize the root page */
buffer = gistNewBuffer(index);
buildstate.indtuples = 0;
buildstate.indtuplesSize = 0;
- /*
- * create a temporary memory context that is reset once for each tuple
- * processed.
- */
- buildstate.tmpCtx = createTempGistContext();
-
/*
* Do the heap scan.
*/
/* okay, all heap tuples are indexed */
MemoryContextSwitchTo(oldcxt);
- MemoryContextDelete(buildstate.tmpCtx);
+ MemoryContextDelete(buildstate.giststate->tempCxt);
- freeGISTstate(&buildstate.giststate);
+ freeGISTstate(buildstate.giststate);
/*
* Return statistics
IndexTuple itup;
MemoryContext oldCtx;
- oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+ oldCtx = MemoryContextSwitchTo(buildstate->giststate->tempCxt);
/* form an index tuple and point it at the heap tuple */
- itup = gistFormTuple(&buildstate->giststate, index, values, isnull, true);
+ itup = gistFormTuple(buildstate->giststate, index, values, isnull, true);
itup->t_tid = htup->t_self;
if (buildstate->bufferingMode == GIST_BUFFERING_ACTIVE)
* locked, we call gistdoinsert directly.
*/
gistdoinsert(index, itup, buildstate->freespace,
- &buildstate->giststate);
+ buildstate->giststate);
}
/* Update tuple count and total size. */
buildstate->indtuplesSize += IndexTupleSize(itup);
MemoryContextSwitchTo(oldCtx);
- MemoryContextReset(buildstate->tmpCtx);
+ MemoryContextReset(buildstate->giststate->tempCxt);
if (buildstate->bufferingMode == GIST_BUFFERING_ACTIVE &&
buildstate->indtuples % BUFFERING_MODE_TUPLE_SIZE_STATS_TARGET == 0)
gistProcessItup(GISTBuildState *buildstate, IndexTuple itup,
GISTBufferingInsertStack *startparent)
{
- GISTSTATE *giststate = &buildstate->giststate;
+ GISTSTATE *giststate = buildstate->giststate;
GISTBuildBuffers *gfbb = buildstate->gfbb;
Relation indexrel = buildstate->indexrel;
GISTBufferingInsertStack *path;
is_split = gistplacetopage(buildstate->indexrel,
buildstate->freespace,
- &buildstate->giststate,
+ buildstate->giststate,
buffer,
itup, ntup, oldoffnum,
InvalidBuffer,
* buffers that will eventually be inserted to them.
*/
gistRelocateBuildBuffersOnSplit(gfbb,
- &buildstate->giststate,
+ buildstate->giststate,
buildstate->indexrel,
path, buffer, splitinfo);
}
/* Free all the memory allocated during index tuple processing */
- MemoryContextReset(CurrentMemoryContext);
+ MemoryContextReset(buildstate->giststate->tempCxt);
}
}
}
MemoryContext oldCtx;
int i;
- oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+ oldCtx = MemoryContextSwitchTo(buildstate->giststate->tempCxt);
/*
* Iterate through the levels from top to bottom.
nodeBuffer->queuedForEmptying = true;
gfbb->bufferEmptyingQueue =
lcons(nodeBuffer, gfbb->bufferEmptyingQueue);
- MemoryContextSwitchTo(buildstate->tmpCtx);
+ MemoryContextSwitchTo(buildstate->giststate->tempCxt);
}
gistProcessEmptyingQueue(buildstate);
}
* Must call gistindex_keytest in tempCxt, and clean up any leftover
* junk afterward.
*/
- oldcxt = MemoryContextSwitchTo(so->tempCxt);
+ oldcxt = MemoryContextSwitchTo(so->giststate->tempCxt);
match = gistindex_keytest(scan, it, page, i, &recheck);
MemoryContextSwitchTo(oldcxt);
- MemoryContextReset(so->tempCxt);
+ MemoryContextReset(so->giststate->tempCxt);
/* Ignore tuple if it doesn't match */
if (!match)
int nkeys = PG_GETARG_INT32(1);
int norderbys = PG_GETARG_INT32(2);
IndexScanDesc scan;
+ GISTSTATE *giststate;
GISTScanOpaque so;
+ MemoryContext oldCxt;
scan = RelationGetIndexScan(r, nkeys, norderbys);
+ /* First, set up a GISTSTATE with a scan-lifespan memory context */
+ giststate = initGISTstate(scan->indexRelation);
+
+ /*
+ * Everything made below is in the scanCxt, or is a child of the scanCxt,
+ * so it'll all go away automatically in gistendscan.
+ */
+ oldCxt = MemoryContextSwitchTo(giststate->scanCxt);
+
/* initialize opaque data */
so = (GISTScanOpaque) palloc0(sizeof(GISTScanOpaqueData));
- so->queueCxt = AllocSetContextCreate(CurrentMemoryContext,
- "GiST queue context",
- ALLOCSET_DEFAULT_MINSIZE,
- ALLOCSET_DEFAULT_INITSIZE,
- ALLOCSET_DEFAULT_MAXSIZE);
- so->tempCxt = createTempGistContext();
- so->giststate = (GISTSTATE *) palloc(sizeof(GISTSTATE));
- initGISTstate(so->giststate, scan->indexRelation);
+ so->giststate = giststate;
+ giststate->tempCxt = createTempGistContext();
+ so->queue = NULL;
+ so->queueCxt = giststate->scanCxt; /* see gistrescan */
+
/* workspaces with size dependent on numberOfOrderBys: */
so->tmpTreeItem = palloc(GSTIHDRSZ + sizeof(double) * scan->numberOfOrderBys);
so->distances = palloc(sizeof(double) * scan->numberOfOrderBys);
scan->opaque = so;
+ MemoryContextSwitchTo(oldCxt);
+
PG_RETURN_POINTER(scan);
}
/* nkeys and norderbys arguments are ignored */
GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ bool first_time;
int i;
MemoryContext oldCxt;
/* rescan an existing indexscan --- reset state */
- MemoryContextReset(so->queueCxt);
- so->curTreeItem = NULL;
+
+ /*
+ * The first time through, we create the search queue in the scanCxt.
+ * Subsequent times through, we create the queue in a separate queueCxt,
+ * which is created on the second call and reset on later calls. Thus, in
+ * the common case where a scan is only rescan'd once, we just put the
+ * queue in scanCxt and don't pay the overhead of making a second memory
+ * context. If we do rescan more than once, the first RBTree is just left
+ * for dead until end of scan; this small wastage seems worth the savings
+ * in the common case.
+ */
+ if (so->queue == NULL)
+ {
+ /* first time through */
+ Assert(so->queueCxt == so->giststate->scanCxt);
+ first_time = true;
+ }
+ else if (so->queueCxt == so->giststate->scanCxt)
+ {
+ /* second time through */
+ so->queueCxt = AllocSetContextCreate(so->giststate->scanCxt,
+ "GiST queue context",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+ first_time = false;
+ }
+ else
+ {
+ /* third or later time through */
+ MemoryContextReset(so->queueCxt);
+ first_time = false;
+ }
/* create new, empty RBTree for search queue */
oldCxt = MemoryContextSwitchTo(so->queueCxt);
scan);
MemoryContextSwitchTo(oldCxt);
+ so->curTreeItem = NULL;
so->firstCall = true;
/* Update scan key, if a new one is given */
if (key && scan->numberOfKeys > 0)
{
+ /*
+ * If this isn't the first time through, preserve the fn_extra
+ * pointers, so that if the consistentFns are using them to cache
+ * data, that data is not leaked across a rescan.
+ */
+ if (!first_time)
+ {
+ for (i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey skey = scan->keyData + i;
+
+ so->giststate->consistentFn[skey->sk_attno - 1].fn_extra =
+ skey->sk_func.fn_extra;
+ }
+ }
+
memmove(scan->keyData, key,
scan->numberOfKeys * sizeof(ScanKeyData));
* Next, if any of keys is a NULL and that key is not marked with
* SK_SEARCHNULL/SK_SEARCHNOTNULL then nothing can be found (ie, we
* assume all indexable operators are strict).
+ *
+ * Note: we intentionally memcpy the FmgrInfo to sk_func rather than
+ * using fmgr_info_copy. This is so that the fn_extra field gets
+ * preserved across multiple rescans.
*/
so->qual_ok = true;
/* Update order-by key, if a new one is given */
if (orderbys && scan->numberOfOrderBys > 0)
{
+ /* As above, preserve fn_extra if not first time through */
+ if (!first_time)
+ {
+ for (i = 0; i < scan->numberOfOrderBys; i++)
+ {
+ ScanKey skey = scan->orderByData + i;
+
+ so->giststate->distanceFn[skey->sk_attno - 1].fn_extra =
+ skey->sk_func.fn_extra;
+ }
+ }
+
memmove(scan->orderByData, orderbys,
scan->numberOfOrderBys * sizeof(ScanKeyData));
* function in the form of its strategy number, which is available
* from the sk_strategy field, and its subtype from the sk_subtype
* field.
+ *
+ * See above comment about why we don't use fmgr_info_copy here.
*/
for (i = 0; i < scan->numberOfOrderBys; i++)
{
IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ /*
+ * freeGISTstate is enough to clean up everything made by gistbeginscan,
+ * as well as the queueCxt if there is a separate context for it.
+ */
freeGISTstate(so->giststate);
- pfree(so->giststate);
- MemoryContextDelete(so->queueCxt);
- MemoryContextDelete(so->tempCxt);
- pfree(so->tmpTreeItem);
- pfree(so->distances);
- pfree(so);
PG_RETURN_VOID();
}
*
* This struct retains call info for the index's opclass-specific support
* functions (per index column), plus the index's tuple descriptor.
+ *
+ * scanCxt holds the GISTSTATE itself as well as any data that lives for the
+ * lifetime of the index operation. We pass this to the support functions
+ * via fn_mcxt, so that they can store scan-lifespan data in it. The
+ * functions are invoked in tempCxt, which is typically short-lifespan
+ * (that is, it's reset after each tuple). However, tempCxt can be the same
+ * as scanCxt if we're not bothering with per-tuple context resets.
*/
typedef struct GISTSTATE
{
+ MemoryContext scanCxt; /* context for scan-lifespan data */
+ MemoryContext tempCxt; /* short-term context for calling functions */
+
+ TupleDesc tupdesc; /* index's tuple descriptor */
+
FmgrInfo consistentFn[INDEX_MAX_KEYS];
FmgrInfo unionFn[INDEX_MAX_KEYS];
FmgrInfo compressFn[INDEX_MAX_KEYS];
/* Collations to pass to the support functions */
Oid supportCollation[INDEX_MAX_KEYS];
-
- TupleDesc tupdesc;
} GISTSTATE;
GISTSTATE *giststate; /* index information, see above */
RBTree *queue; /* queue of unvisited items */
MemoryContext queueCxt; /* context holding the queue */
- MemoryContext tempCxt; /* workspace context for calling functions */
bool qual_ok; /* false if qual can never be satisfied */
bool firstCall; /* true until first gistgettuple call */
extern Datum gistbuildempty(PG_FUNCTION_ARGS);
extern Datum gistinsert(PG_FUNCTION_ARGS);
extern MemoryContext createTempGistContext(void);
-extern void initGISTstate(GISTSTATE *giststate, Relation index);
+extern GISTSTATE *initGISTstate(Relation index);
extern void freeGISTstate(GISTSTATE *giststate);
extern void gistdoinsert(Relation r,
IndexTuple itup,