... or at least, when the planner's cost estimates say it will be faster.
Leonardo Francalanci, reviewed by Itagaki Takahiro and Tom Lane
</para>
<para>
- During the cluster operation, a temporary copy of the table is created
- that contains the table data in the index order. Temporary copies of
- each index on the table are created as well. Therefore, you need free
- space on disk at least equal to the sum of the table size and the index
- sizes.
+ <command>CLUSTER</> can re-sort the table using either an indexscan
+ on the specified index, or (if the index is a b-tree) a sequential
+ scan followed by sorting. It will attempt to choose the method that
+ will be faster, based on planner cost parameters and available statistical
+ information.
</para>
<para>
- Because <command>CLUSTER</command> remembers the clustering information,
- one can cluster the tables one wants clustered manually the first time, and
- setup a timed event similar to <command>VACUUM</command> so that the tables
- are periodically reclustered.
+ When an indexscan is used, a temporary copy of the table is created that
+ contains the table data in the index order. Temporary copies of each
+ index on the table are created as well. Therefore, you need free space on
+ disk at least equal to the sum of the table size and the index sizes.
+ </para>
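+
+ <para>
+  For example (the figures are purely illustrative), clustering a 10 GB
+  table that carries 2 GB of indexes with the indexscan method calls for
+  at least 12 GB of free disk space.
+ </para>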
+
+ <para>
+ When a sequential scan and sort is used, a temporary sort file is
+ also created, so that the peak temporary space requirement is as much
+ as double the table size, plus the index sizes. This method is often
+ faster than the indexscan method, but if the disk space requirement is
+ intolerable, you can disable this choice by temporarily setting <xref
+ linkend="guc-enable-sort"> to <literal>off</>.
+ </para>
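+
+ <para>
+  In the same illustrative example, the sort method could transiently need
+  on the order of 22 GB: a 10 GB table copy, a 10 GB sort file, and 2 GB of
+  index copies.  To force the indexscan method for a single table (the
+  table and index names here are purely illustrative):
+<programlisting>
+SET enable_sort = off;
+CLUSTER employees USING employees_pkey;
+RESET enable_sort;
+</programlisting>
+ </para>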
+
+ <para>
+ It is advisable to set <xref linkend="guc-maintenance-work-mem"> to
+ a reasonably large value (but not more than the amount of RAM you can
+ dedicate to the <command>CLUSTER</> operation) before clustering.
</para>
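+
+ <para>
+  For instance (the value shown is only illustrative; choose one that fits
+  the memory you can spare):
+<programlisting>
+SET maintenance_work_mem = '1GB';
+CLUSTER employees USING employees_pkey;
+RESET maintenance_work_mem;
+</programlisting>
+ </para>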
<para>
</para>
<para>
- There is another way to cluster data. The
- <command>CLUSTER</command> command reorders the original table by
- scanning it using the index you specify. This can be slow
- on large tables because the rows are fetched from the table
- in index order, and if the table is disordered, the
- entries are on random pages, so there is one disk page
- retrieved for every row moved. (<productname>PostgreSQL</productname> has
- a cache, but the majority of a big table will not fit in the cache.)
- The other way to cluster a table is to use:
-
-<programlisting>
-CREATE TABLE <replaceable class="parameter">newtable</replaceable> AS
- SELECT * FROM <replaceable class="parameter">table</replaceable> ORDER BY <replaceable class="parameter">columnlist</replaceable>;
-</programlisting>
-
- which uses the <productname>PostgreSQL</productname> sorting code
- to produce the desired order;
- this is usually much faster than an index scan for disordered data.
- Then you drop the old table, use
- <command>ALTER TABLE ... RENAME</command>
- to rename <replaceable class="parameter">newtable</replaceable> to the
- old name, and recreate the table's indexes.
- The big disadvantage of this approach is that it does not preserve
- OIDs, constraints, foreign key relationships, granted privileges, and
- other ancillary properties of the table — all such items must be
- manually recreated. Another disadvantage is that this way requires a sort
- temporary file about the same size as the table itself, so peak disk usage
- is about three times the table size instead of twice the table size.
+ Because <command>CLUSTER</command> remembers which indexes are clustered,
+ one can manually cluster the desired tables once, then set up a periodic
+ maintenance script that executes <command>CLUSTER</> without any
+ parameters, so that those tables are periodically reclustered.
</para>
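+
+ <para>
+  For example, a nightly maintenance job might simply run (the database
+  name here is only illustrative):
+<programlisting>
+psql -c "CLUSTER;" mydatabase
+</programlisting>
+ </para>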
+
</refsect1>
<refsect1>
#include "commands/trigger.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
+#include "optimizer/planner.h"
#include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/tuplesort.h"
/*
int freeze_min_age, int freeze_table_age,
bool *pSwapToastByContent, TransactionId *pFreezeXid);
static List *get_tables_to_cluster(MemoryContext cluster_context);
-
+static void reform_and_rewrite_tuple(HeapTuple tuple,
+ TupleDesc oldTupDesc, TupleDesc newTupDesc,
+ Datum *values, bool *isnull,
+ bool newRelHasOids, RewriteState rwstate);
/*---------------------------------------------------------------------------
TransactionId OldestXmin;
TransactionId FreezeXid;
RewriteState rwstate;
+ bool use_sort;
+ Tuplesortstate *tuplesort;
/*
* Open the relations we need.
rwstate = begin_heap_rewrite(NewHeap, OldestXmin, FreezeXid, use_wal);
/*
- * Scan through the OldHeap, either in OldIndex order or sequentially, and
- * copy each tuple into the NewHeap. To ensure we see recently-dead
- * tuples that still need to be copied, we scan with SnapshotAny and use
+ * Decide whether to use an indexscan or seqscan-and-optional-sort to
+ * scan the OldHeap. We know how to use a sort to duplicate the ordering
+ * of a btree index, and will use seqscan-and-sort for that case if the
+ * planner tells us it's cheaper. Otherwise, always indexscan if an
+ * index is provided, else plain seqscan.
+ */
+ if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
+ use_sort = plan_cluster_use_sort(OIDOldHeap, OIDOldIndex);
+ else
+ use_sort = false;
+
+ /* Set up sorting if wanted */
+ if (use_sort)
+ tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex,
+ maintenance_work_mem, false);
+ else
+ tuplesort = NULL;
+
+ /*
+ * Prepare to scan the OldHeap. To ensure we see recently-dead tuples
+ * that still need to be copied, we scan with SnapshotAny and use
* HeapTupleSatisfiesVacuum for the visibility test.
*/
- if (OldIndex != NULL)
+ if (OldIndex != NULL && !use_sort)
{
heapScan = NULL;
indexScan = index_beginscan(OldHeap, OldIndex,
indexScan = NULL;
}
+ /*
+ * Scan through the OldHeap, either in OldIndex order or sequentially;
+ * copy each tuple into the NewHeap, or transiently to the tuplesort
+ * module. Note that we don't bother sorting dead tuples (they won't
+ * get to the new table anyway).
+ */
for (;;)
{
HeapTuple tuple;
- HeapTuple copiedTuple;
Buffer buf;
bool isdead;
- int i;
CHECK_FOR_INTERRUPTS();
- if (OldIndex != NULL)
+ if (indexScan != NULL)
{
tuple = index_getnext(indexScan, ForwardScanDirection);
if (tuple == NULL)
continue;
}
- /*
- * We cannot simply copy the tuple as-is, for several reasons:
- *
- * 1. We'd like to squeeze out the values of any dropped columns, both
- * to save space and to ensure we have no corner-case failures. (It's
- * possible for example that the new table hasn't got a TOAST table
- * and so is unable to store any large values of dropped cols.)
- *
- * 2. The tuple might not even be legal for the new table; this is
- * currently only known to happen as an after-effect of ALTER TABLE
- * SET WITHOUT OIDS.
- *
- * So, we must reconstruct the tuple from component Datums.
- */
- heap_deform_tuple(tuple, oldTupDesc, values, isnull);
+ if (tuplesort != NULL)
+ tuplesort_putheaptuple(tuplesort, tuple);
+ else
+ reform_and_rewrite_tuple(tuple,
+ oldTupDesc, newTupDesc,
+ values, isnull,
+ NewHeap->rd_rel->relhasoids, rwstate);
+ }
- /* Be sure to null out any dropped columns */
- for (i = 0; i < natts; i++)
+ if (indexScan != NULL)
+ index_endscan(indexScan);
+ if (heapScan != NULL)
+ heap_endscan(heapScan);
+
+ /*
+ * In scan-and-sort mode, complete the sort, then read out all live
+	 * tuples from the tuplesort and write them to the new relation.
+ */
+ if (tuplesort != NULL)
+ {
+ tuplesort_performsort(tuplesort);
+
+ for (;;)
{
- if (newTupDesc->attrs[i]->attisdropped)
- isnull[i] = true;
- }
+ HeapTuple tuple;
+ bool shouldfree;
- copiedTuple = heap_form_tuple(newTupDesc, values, isnull);
+ CHECK_FOR_INTERRUPTS();
- /* Preserve OID, if any */
- if (NewHeap->rd_rel->relhasoids)
- HeapTupleSetOid(copiedTuple, HeapTupleGetOid(tuple));
+ tuple = tuplesort_getheaptuple(tuplesort, true, &shouldfree);
+ if (tuple == NULL)
+ break;
- /* The heap rewrite module does the rest */
- rewrite_heap_tuple(rwstate, tuple, copiedTuple);
+ reform_and_rewrite_tuple(tuple,
+ oldTupDesc, newTupDesc,
+ values, isnull,
+ NewHeap->rd_rel->relhasoids, rwstate);
- heap_freetuple(copiedTuple);
- }
+ if (shouldfree)
+ heap_freetuple(tuple);
+ }
- if (OldIndex != NULL)
- index_endscan(indexScan);
- else
- heap_endscan(heapScan);
+ tuplesort_end(tuplesort);
+ }
/* Write out any remaining tuples, and fsync if needed */
end_heap_rewrite(rwstate);
return rvs;
}
+
+
+/*
+ * Reconstruct and rewrite the given tuple
+ *
+ * We cannot simply copy the tuple as-is, for several reasons:
+ *
+ * 1. We'd like to squeeze out the values of any dropped columns, both
+ * to save space and to ensure we have no corner-case failures. (It's
+ * possible for example that the new table hasn't got a TOAST table
+ * and so is unable to store any large values of dropped cols.)
+ *
+ * 2. The tuple might not even be legal for the new table; this is
+ * currently only known to happen as an after-effect of ALTER TABLE
+ * SET WITHOUT OIDS.
+ *
+ * So, we must reconstruct the tuple from component Datums.
+ */
+static void
+reform_and_rewrite_tuple(HeapTuple tuple,
+ TupleDesc oldTupDesc, TupleDesc newTupDesc,
+ Datum *values, bool *isnull,
+ bool newRelHasOids, RewriteState rwstate)
+{
+ HeapTuple copiedTuple;
+ int i;
+
+ heap_deform_tuple(tuple, oldTupDesc, values, isnull);
+
+ /* Be sure to null out any dropped columns */
+ for (i = 0; i < newTupDesc->natts; i++)
+ {
+ if (newTupDesc->attrs[i]->attisdropped)
+ isnull[i] = true;
+ }
+
+ copiedTuple = heap_form_tuple(newTupDesc, values, isnull);
+
+ /* Preserve OID, if any */
+ if (newRelHasOids)
+ HeapTupleSetOid(copiedTuple, HeapTupleGetOid(tuple));
+
+ /* The heap rewrite module does the rest */
+ rewrite_heap_tuple(rwstate, tuple, copiedTuple);
+
+ heap_freetuple(copiedTuple);
+}
* Determines and returns the cost of sorting a relation, including
* the cost of reading the input data.
*
- * If the total volume of data to sort is less than work_mem, we will do
+ * If the total volume of data to sort is less than sort_mem, we will do
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
* comparisons for t tuples.
*
- * If the total volume exceeds work_mem, we switch to a tape-style merge
+ * If the total volume exceeds sort_mem, we switch to a tape-style merge
* algorithm. There will still be about t*log2(t) tuple comparisons in
* total, but we will also need to write and read each tuple once per
* merge pass. We expect about ceil(logM(r)) merge passes where r is the
* number of initial runs formed and M is the merge order used by tuplesort.c.
- * Since the average initial run should be about twice work_mem, we have
- * disk traffic = 2 * relsize * ceil(logM(p / (2*work_mem)))
+ * Since the average initial run should be about twice sort_mem, we have
+ * disk traffic = 2 * relsize * ceil(logM(p / (2*sort_mem)))
* cpu = comparison_cost * t * log2(t)
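+ *
+ * For example (illustrative numbers only): if the input forms r = 100
+ * initial runs and the merge order is M = 6, we expect ceil(log6(100)) = 3
+ * merge passes, and hence disk traffic of roughly 6 * relsize.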
*
* If the sort is bounded (i.e., only the first k result tuples are needed)
- * and k tuples can fit into work_mem, we use a heap method that keeps only
+ * and k tuples can fit into sort_mem, we use a heap method that keeps only
* k tuples in the heap; this will require about t*log2(k) tuple comparisons.
*
* The disk traffic is assumed to be 3/4ths sequential and 1/4th random
* accesses (XXX can't we refine that guess?)
*
- * We charge two operator evals per tuple comparison, which should be in
- * the right ballpark in most cases.
+ * By default, we charge two operator evals per tuple comparison, which should
+ * be in the right ballpark in most cases. The caller can tweak this by
+ * specifying nonzero comparison_cost; typically that's used for any extra
+ * work that has to be done to prepare the inputs to the comparison operators.
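+ * (For example, plan_cluster_use_sort passes twice the evaluation cost of
+ * any index expressions, since tuplesort.c has to re-evaluate them for both
+ * inputs of each comparison.)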
*
* 'pathkeys' is a list of sort keys
* 'input_cost' is the total cost for reading the input data
* 'tuples' is the number of tuples in the relation
* 'width' is the average tuple width in bytes
+ * 'comparison_cost' is the extra cost per comparison, if any
+ * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
* 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
*
* NOTE: some callers currently pass NIL for pathkeys because they
void
cost_sort(Path *path, PlannerInfo *root,
List *pathkeys, Cost input_cost, double tuples, int width,
+ Cost comparison_cost, int sort_mem,
double limit_tuples)
{
Cost startup_cost = input_cost;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
- long work_mem_bytes = work_mem * 1024L;
+ long sort_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
if (tuples < 2.0)
tuples = 2.0;
+ /* Include the default cost-per-comparison */
+ comparison_cost += 2.0 * cpu_operator_cost;
+
/* Do we have a useful LIMIT? */
if (limit_tuples > 0 && limit_tuples < tuples)
{
output_bytes = input_bytes;
}
- if (output_bytes > work_mem_bytes)
+ if (output_bytes > sort_mem_bytes)
{
/*
* We'll have to use a disk-based sort of all the tuples
*/
double npages = ceil(input_bytes / BLCKSZ);
- double nruns = (input_bytes / work_mem_bytes) * 0.5;
- double mergeorder = tuplesort_merge_order(work_mem_bytes);
+ double nruns = (input_bytes / sort_mem_bytes) * 0.5;
+ double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs;
double npageaccesses;
/*
* CPU costs
*
- * Assume about two operator evals per tuple comparison and N log2 N
- * comparisons
+ * Assume about N log2 N comparisons
*/
- startup_cost += 2.0 * cpu_operator_cost * tuples * LOG2(tuples);
+ startup_cost += comparison_cost * tuples * LOG2(tuples);
/* Disk costs */
startup_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
- else if (tuples > 2 * output_tuples || input_bytes > work_mem_bytes)
+ else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
* factor is a bit higher than for quicksort. Tweak it so that the
* cost curve is continuous at the crossover point.
*/
- startup_cost += 2.0 * cpu_operator_cost * tuples * LOG2(2.0 * output_tuples);
+ startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
}
else
{
/* We'll use plain quicksort on all the input tuples */
- startup_cost += 2.0 * cpu_operator_cost * tuples * LOG2(tuples);
+ startup_cost += comparison_cost * tuples * LOG2(tuples);
}
/*
outer_path->total_cost,
outer_path_rows,
outer_path->parent->width,
+ 0.0,
+ work_mem,
-1.0);
startup_cost += sort_path.startup_cost;
startup_cost += (sort_path.total_cost - sort_path.startup_cost)
inner_path->total_cost,
inner_path_rows,
inner_path->parent->width,
+ 0.0,
+ work_mem,
-1.0);
startup_cost += sort_path.startup_cost;
startup_cost += (sort_path.total_cost - sort_path.startup_cost)
#include <math.h>
#include "access/skey.h"
+#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/clauses.h"
lefttree->total_cost,
lefttree->plan_rows,
lefttree->plan_width,
+ 0.0,
+ work_mem,
limit_tuples);
plan->startup_cost = sort_path.startup_cost;
plan->total_cost = sort_path.total_cost;
*/
#include "postgres.h"
+#include "miscadmin.h"
#include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
cost_sort(&sort_path, root, root->query_pathkeys,
cheapestpath->total_cost,
final_rel->rows, final_rel->width,
- limit_tuples);
+ 0.0, work_mem, limit_tuples);
}
if (compare_fractional_path_costs(sortedpath, &sort_path,
#include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
+#include "optimizer/plancat.h"
#include "optimizer/planmain.h"
#include "optimizer/planner.h"
#include "optimizer/prep.h"
/* Result of hashed agg is always unsorted */
if (target_pathkeys)
cost_sort(&hashed_p, root, target_pathkeys, hashed_p.total_cost,
- dNumGroups, path_width, limit_tuples);
+ dNumGroups, path_width,
+ 0.0, work_mem, limit_tuples);
if (sorted_path)
{
if (!pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
{
cost_sort(&sorted_p, root, root->group_pathkeys, sorted_p.total_cost,
- path_rows, path_width, -1.0);
+ path_rows, path_width,
+ 0.0, work_mem, -1.0);
current_pathkeys = root->group_pathkeys;
}
if (target_pathkeys &&
!pathkeys_contained_in(target_pathkeys, current_pathkeys))
cost_sort(&sorted_p, root, target_pathkeys, sorted_p.total_cost,
- dNumGroups, path_width, limit_tuples);
+ dNumGroups, path_width,
+ 0.0, work_mem, limit_tuples);
/*
* Now make the decision using the top-level tuple fraction. First we
*/
if (parse->sortClause)
cost_sort(&hashed_p, root, root->sort_pathkeys, hashed_p.total_cost,
- dNumDistinctRows, path_width, limit_tuples);
+ dNumDistinctRows, path_width,
+ 0.0, work_mem, limit_tuples);
/*
* Now for the GROUP case. See comments in grouping_planner about the
else
current_pathkeys = root->sort_pathkeys;
cost_sort(&sorted_p, root, current_pathkeys, sorted_p.total_cost,
- path_rows, path_width, -1.0);
+ path_rows, path_width,
+ 0.0, work_mem, -1.0);
}
cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
sorted_p.startup_cost, sorted_p.total_cost,
if (parse->sortClause &&
!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
cost_sort(&sorted_p, root, root->sort_pathkeys, sorted_p.total_cost,
- dNumDistinctRows, path_width, limit_tuples);
+ dNumDistinctRows, path_width,
+ 0.0, work_mem, limit_tuples);
/*
* Now make the decision using the top-level tuple fraction. First we
return (Expr *) result;
}
+
+
+/*
+ * plan_cluster_use_sort
+ * Use the planner to decide how CLUSTER should implement sorting
+ *
+ * tableOid is the OID of a table to be clustered on its index indexOid
+ * (which is already known to be a btree index). Decide whether it's
+ * cheaper to do an indexscan or a seqscan-plus-sort to execute the CLUSTER.
+ * Return TRUE to use sorting, FALSE to use an indexscan.
+ *
+ * Note: caller had better already hold some type of lock on the table.
+ */
+bool
+plan_cluster_use_sort(Oid tableOid, Oid indexOid)
+{
+ PlannerInfo *root;
+ Query *query;
+ PlannerGlobal *glob;
+ RangeTblEntry *rte;
+ RelOptInfo *rel;
+ IndexOptInfo *indexInfo;
+ QualCost indexExprCost;
+ Cost comparisonCost;
+ Path *seqScanPath;
+ Path seqScanAndSortPath;
+ IndexPath *indexScanPath;
+ ListCell *lc;
+
+ /* Set up mostly-dummy planner state */
+ query = makeNode(Query);
+ query->commandType = CMD_SELECT;
+
+ glob = makeNode(PlannerGlobal);
+
+ root = makeNode(PlannerInfo);
+ root->parse = query;
+ root->glob = glob;
+ root->query_level = 1;
+ root->planner_cxt = CurrentMemoryContext;
+ root->wt_param_id = -1;
+
+ /* Build a minimal RTE for the rel */
+ rte = makeNode(RangeTblEntry);
+ rte->rtekind = RTE_RELATION;
+ rte->relid = tableOid;
+ rte->inh = false;
+ rte->inFromCl = true;
+ query->rtable = list_make1(rte);
+
+ /* ... and insert it into PlannerInfo */
+ root->simple_rel_array_size = 2;
+ root->simple_rel_array = (RelOptInfo **)
+ palloc0(root->simple_rel_array_size * sizeof(RelOptInfo *));
+ root->simple_rte_array = (RangeTblEntry **)
+ palloc0(root->simple_rel_array_size * sizeof(RangeTblEntry *));
+ root->simple_rte_array[1] = rte;
+
+ /* Build RelOptInfo */
+ rel = build_simple_rel(root, 1, RELOPT_BASEREL);
+
+ /*
+ * Rather than doing all the pushups that would be needed to use
+ * set_baserel_size_estimates, just do a quick hack for rows and width.
+ */
+ rel->rows = rel->tuples;
+ rel->width = get_relation_data_width(tableOid);
+
+ root->total_table_pages = rel->pages;
+
+ /* Locate IndexOptInfo for the target index */
+ indexInfo = NULL;
+ foreach(lc, rel->indexlist)
+ {
+ indexInfo = (IndexOptInfo *) lfirst(lc);
+ if (indexInfo->indexoid == indexOid)
+ break;
+ }
+ if (lc == NULL) /* not in the list? */
+ elog(ERROR, "index %u does not belong to table %u",
+ indexOid, tableOid);
+
+ /*
+ * Determine eval cost of the index expressions, if any. We need to
+ * charge twice that amount for each tuple comparison that happens
+ * during the sort, since tuplesort.c will have to re-evaluate the
+ * index expressions each time. (XXX that's pretty inefficient...)
+ */
+ cost_qual_eval(&indexExprCost, indexInfo->indexprs, root);
+ comparisonCost = 2.0 * (indexExprCost.startup + indexExprCost.per_tuple);
+
+ /* Estimate the cost of seq scan + sort */
+ seqScanPath = create_seqscan_path(root, rel);
+ cost_sort(&seqScanAndSortPath, root, NIL,
+ seqScanPath->total_cost, rel->tuples, rel->width,
+ comparisonCost, maintenance_work_mem, -1.0);
+
+ /* Estimate the cost of index scan */
+ indexScanPath = create_index_path(root, indexInfo,
+ NIL, NIL,
+ ForwardScanDirection, NULL);
+
+ return (seqScanAndSortPath.total_cost < indexScanPath->path.total_cost);
+}
sorted_p.total_cost = input_plan->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
- input_plan->plan_rows, input_plan->plan_width, -1.0);
+ input_plan->plan_rows, input_plan->plan_width,
+ 0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
sorted_p.startup_cost, sorted_p.total_cost,
input_plan->plan_rows);
subpath->total_cost,
rel->rows,
rel->width,
+ 0.0,
+ work_mem,
-1.0);
/*
get_relation_info_hook_type get_relation_info_hook = NULL;
+static int32 get_rel_data_width(Relation rel, int32 *attr_widths);
static List *get_relation_constraints(PlannerInfo *root,
Oid relationObjectId, RelOptInfo *rel,
bool include_notnull);
* platform dependencies in the default plans which are kind
* of a headache for regression testing.
*/
- int32 tuple_width = 0;
- int i;
+ int32 tuple_width;
- for (i = 1; i <= RelationGetNumberOfAttributes(rel); i++)
- {
- Form_pg_attribute att = rel->rd_att->attrs[i - 1];
- int32 item_width;
-
- if (att->attisdropped)
- continue;
- /* This should match set_rel_width() in costsize.c */
- item_width = get_attavgwidth(RelationGetRelid(rel), i);
- if (item_width <= 0)
- {
- item_width = get_typavgwidth(att->atttypid,
- att->atttypmod);
- Assert(item_width > 0);
- }
- if (attr_widths != NULL)
- attr_widths[i] = item_width;
- tuple_width += item_width;
- }
+ tuple_width = get_rel_data_width(rel, attr_widths);
tuple_width += sizeof(HeapTupleHeaderData);
tuple_width += sizeof(ItemPointerData);
/* note: integer division is intentional here */
}
+/*
+ * get_rel_data_width
+ *
+ * Estimate the average width of (the data part of) the relation's tuples.
+ * If attr_widths isn't NULL, also store per-column width estimates into
+ * that array.
+ *
+ * Currently we ignore dropped columns. Ideally those should be included
+ * in the result, but we haven't got any way to get info about them; and
+ * since they might be mostly NULLs, treating them as zero-width is not
+ * necessarily the wrong thing anyway.
+ */
+static int32
+get_rel_data_width(Relation rel, int32 *attr_widths)
+{
+ int32 tuple_width = 0;
+ int i;
+
+ for (i = 1; i <= RelationGetNumberOfAttributes(rel); i++)
+ {
+ Form_pg_attribute att = rel->rd_att->attrs[i - 1];
+ int32 item_width;
+
+ if (att->attisdropped)
+ continue;
+ /* This should match set_rel_width() in costsize.c */
+ item_width = get_attavgwidth(RelationGetRelid(rel), i);
+ if (item_width <= 0)
+ {
+ item_width = get_typavgwidth(att->atttypid, att->atttypmod);
+ Assert(item_width > 0);
+ }
+ if (attr_widths != NULL)
+ attr_widths[i] = item_width;
+ tuple_width += item_width;
+ }
+
+ return tuple_width;
+}
+
+/*
+ * get_relation_data_width
+ *
+ * External API for get_rel_data_width
+ */
+int32
+get_relation_data_width(Oid relid)
+{
+ int32 result;
+ Relation relation;
+
+ /* As above, assume relation is already locked */
+ relation = heap_open(relid, NoLock);
+
+ result = get_rel_data_width(relation, NULL);
+
+ heap_close(relation, NoLock);
+
+ return result;
+}
+
+
/*
* get_relation_constraints
*
#include "access/genam.h"
#include "access/nbtree.h"
+#include "catalog/index.h"
#include "catalog/pg_amop.h"
#include "catalog/pg_operator.h"
#include "commands/tablespace.h"
+#include "executor/executor.h"
#include "miscadmin.h"
#include "pg_trace.h"
#include "utils/datum.h"
#define HEAP_SORT 0
#define INDEX_SORT 1
#define DATUM_SORT 2
+#define CLUSTER_SORT 3
/* GUC variables */
#ifdef TRACE_SORT
TupleDesc tupDesc;
ScanKey scanKeys; /* array of length nKeys */
+ /*
+ * These variables are specific to the CLUSTER case; they are set by
+ * tuplesort_begin_cluster. Note CLUSTER also uses tupDesc and
+ * indexScanKey.
+ */
+ IndexInfo *indexInfo; /* info about index being used for reference */
+ EState *estate; /* for evaluating index expressions */
+
/*
* These variables are specific to the IndexTuple case; they are set by
* tuplesort_begin_index_xxx and used only by the IndexTuple routines.
static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void reversedirection_heap(Tuplesortstate *state);
+static int comparetup_cluster(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state);
+static void copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup);
+static void writetup_cluster(Tuplesortstate *state, int tapenum,
+ SortTuple *stup);
+static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
+ int tapenum, unsigned int len);
static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
return state;
}
+Tuplesortstate *
+tuplesort_begin_cluster(TupleDesc tupDesc,
+ Relation indexRel,
+ int workMem, bool randomAccess)
+{
+ Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+ MemoryContext oldcontext;
+
+ Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
+
+ oldcontext = MemoryContextSwitchTo(state->sortcontext);
+
+#ifdef TRACE_SORT
+ if (trace_sort)
+ elog(LOG,
+ "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
+ RelationGetNumberOfAttributes(indexRel),
+ workMem, randomAccess ? 't' : 'f');
+#endif
+
+ state->nKeys = RelationGetNumberOfAttributes(indexRel);
+
+ TRACE_POSTGRESQL_SORT_START(CLUSTER_SORT,
+ false, /* no unique check */
+ state->nKeys,
+ workMem,
+ randomAccess);
+
+ state->comparetup = comparetup_cluster;
+ state->copytup = copytup_cluster;
+ state->writetup = writetup_cluster;
+ state->readtup = readtup_cluster;
+ state->reversedirection = reversedirection_index_btree;
+
+ state->indexInfo = BuildIndexInfo(indexRel);
+ state->indexScanKey = _bt_mkscankey_nodata(indexRel);
+
+ state->tupDesc = tupDesc; /* assume we need not copy tupDesc */
+
+ if (state->indexInfo->ii_Expressions != NULL)
+ {
+ TupleTableSlot *slot;
+ ExprContext *econtext;
+
+ /*
+ * We will need to use FormIndexDatum to evaluate the index
+ * expressions. To do that, we need an EState, as well as a
+ * TupleTableSlot to put the table tuples into. The econtext's
+ * scantuple has to point to that slot, too.
+ */
+ state->estate = CreateExecutorState();
+ slot = MakeSingleTupleTableSlot(tupDesc);
+ econtext = GetPerTupleExprContext(state->estate);
+ econtext->ecxt_scantuple = slot;
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return state;
+}
+
Tuplesortstate *
tuplesort_begin_index_btree(Relation indexRel,
bool enforceUnique,
TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, 0L);
#endif
+ /* Free any execution state created for CLUSTER case */
+ if (state->estate != NULL)
+ {
+ ExprContext *econtext = GetPerTupleExprContext(state->estate);
+
+ ExecDropSingleTupleTableSlot(econtext->ecxt_scantuple);
+ FreeExecutorState(state->estate);
+ }
+
MemoryContextSwitchTo(oldcontext);
/*
MemoryContextSwitchTo(oldcontext);
}
+/*
+ * Accept one tuple while collecting input data for sort.
+ *
+ * Note that the input data is always copied; the caller need not save it.
+ */
+void
+tuplesort_putheaptuple(Tuplesortstate *state, HeapTuple tup)
+{
+ MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ SortTuple stup;
+
+ /*
+ * Copy the given tuple into memory we control, and decrease availMem.
+ * Then call the common code.
+ */
+ COPYTUP(state, &stup, (void *) tup);
+
+ puttuple_common(state, &stup);
+
+ MemoryContextSwitchTo(oldcontext);
+}
+
/*
* Accept one index tuple while collecting input data for sort.
*
}
}
+/*
+ * Fetch the next tuple in either forward or back direction.
+ * Returns NULL if no more tuples. If *should_free is set, the
+ * caller must pfree the returned tuple when done with it.
+ */
+HeapTuple
+tuplesort_getheaptuple(Tuplesortstate *state, bool forward, bool *should_free)
+{
+ MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ SortTuple stup;
+
+ if (!tuplesort_gettuple_common(state, forward, &stup, should_free))
+ stup.tuple = NULL;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return stup.tuple;
+}
+
/*
* Fetch the next index tuple in either forward or back direction.
* Returns NULL if no more tuples. If *should_free is set, the
}
+/*
+ * Routines specialized for the CLUSTER case (HeapTuple data, with
+ * comparisons per a btree index definition)
+ */
+
+static int
+comparetup_cluster(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state)
+{
+ ScanKey scanKey = state->indexScanKey;
+ HeapTuple ltup;
+ HeapTuple rtup;
+ TupleDesc tupDesc;
+ int nkey;
+ int32 compare;
+
+ /* Allow interrupting long sorts */
+ CHECK_FOR_INTERRUPTS();
+
+ /* Compare the leading sort key, if it's simple */
+ if (state->indexInfo->ii_KeyAttrNumbers[0] != 0)
+ {
+ compare = inlineApplySortFunction(&scanKey->sk_func, scanKey->sk_flags,
+ a->datum1, a->isnull1,
+ b->datum1, b->isnull1);
+ if (compare != 0 || state->nKeys == 1)
+ return compare;
+ /* Compare additional columns the hard way */
+ scanKey++;
+ nkey = 1;
+ }
+ else
+ {
+ /* Must compare all keys the hard way */
+ nkey = 0;
+ }
+
+ /* Compare additional sort keys */
+ ltup = (HeapTuple) a->tuple;
+ rtup = (HeapTuple) b->tuple;
+
+ if (state->indexInfo->ii_Expressions == NULL)
+ {
+ /* If not expression index, just compare the proper heap attrs */
+ tupDesc = state->tupDesc;
+
+ for (; nkey < state->nKeys; nkey++, scanKey++)
+ {
+ AttrNumber attno = state->indexInfo->ii_KeyAttrNumbers[nkey];
+ Datum datum1,
+ datum2;
+ bool isnull1,
+ isnull2;
+
+ datum1 = heap_getattr(ltup, attno, tupDesc, &isnull1);
+ datum2 = heap_getattr(rtup, attno, tupDesc, &isnull2);
+
+ compare = inlineApplySortFunction(&scanKey->sk_func,
+ scanKey->sk_flags,
+ datum1, isnull1,
+ datum2, isnull2);
+ if (compare != 0)
+ return compare;
+ }
+ }
+ else
+ {
+ /*
+ * In the expression index case, compute the whole index tuple and
+ * then compare values. It would perhaps be faster to compute only as
+ * many columns as we need to compare, but that would require
+ * duplicating all the logic in FormIndexDatum.
+ */
+ Datum l_index_values[INDEX_MAX_KEYS];
+ bool l_index_isnull[INDEX_MAX_KEYS];
+ Datum r_index_values[INDEX_MAX_KEYS];
+ bool r_index_isnull[INDEX_MAX_KEYS];
+ TupleTableSlot *ecxt_scantuple;
+
+ /* Reset context each time to prevent memory leakage */
+ ResetPerTupleExprContext(state->estate);
+
+ ecxt_scantuple = GetPerTupleExprContext(state->estate)->ecxt_scantuple;
+
+ ExecStoreTuple(ltup, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ l_index_values, l_index_isnull);
+
+ ExecStoreTuple(rtup, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ r_index_values, r_index_isnull);
+
+ for (; nkey < state->nKeys; nkey++, scanKey++)
+ {
+ compare = inlineApplySortFunction(&scanKey->sk_func,
+ scanKey->sk_flags,
+ l_index_values[nkey],
+ l_index_isnull[nkey],
+ r_index_values[nkey],
+ r_index_isnull[nkey]);
+ if (compare != 0)
+ return compare;
+ }
+ }
+
+ return 0;
+}
+
+static void
+copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup)
+{
+ HeapTuple tuple = (HeapTuple) tup;
+
+ /* copy the tuple into sort storage */
+ tuple = heap_copytuple(tuple);
+ stup->tuple = (void *) tuple;
+ USEMEM(state, GetMemoryChunkSpace(tuple));
+ /* set up first-column key value, if it's a simple column */
+ if (state->indexInfo->ii_KeyAttrNumbers[0] != 0)
+ stup->datum1 = heap_getattr(tuple,
+ state->indexInfo->ii_KeyAttrNumbers[0],
+ state->tupDesc,
+ &stup->isnull1);
+}
+
+static void
+writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup)
+{
+ HeapTuple tuple = (HeapTuple) stup->tuple;
+ unsigned int tuplen = tuple->t_len + sizeof(ItemPointerData) + sizeof(int);
+
+ /* We need to store t_self, but not other fields of HeapTupleData */
+ LogicalTapeWrite(state->tapeset, tapenum,
+ &tuplen, sizeof(tuplen));
+ LogicalTapeWrite(state->tapeset, tapenum,
+ &tuple->t_self, sizeof(ItemPointerData));
+ LogicalTapeWrite(state->tapeset, tapenum,
+ tuple->t_data, tuple->t_len);
+ if (state->randomAccess) /* need trailing length word? */
+ LogicalTapeWrite(state->tapeset, tapenum,
+ &tuplen, sizeof(tuplen));
+
+ FREEMEM(state, GetMemoryChunkSpace(tuple));
+ heap_freetuple(tuple);
+}
+
+static void
+readtup_cluster(Tuplesortstate *state, SortTuple *stup,
+ int tapenum, unsigned int tuplen)
+{
+ unsigned int t_len = tuplen - sizeof(ItemPointerData) - sizeof(int);
+ HeapTuple tuple = (HeapTuple) palloc(t_len + HEAPTUPLESIZE);
+
+ USEMEM(state, GetMemoryChunkSpace(tuple));
+ /* Reconstruct the HeapTupleData header */
+ tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
+ tuple->t_len = t_len;
+ if (LogicalTapeRead(state->tapeset, tapenum,
+ &tuple->t_self,
+ sizeof(ItemPointerData)) != sizeof(ItemPointerData))
+ elog(ERROR, "unexpected end of data");
+ /* We don't currently bother to reconstruct t_tableOid */
+ tuple->t_tableOid = InvalidOid;
+ /* Read in the tuple body */
+ if (LogicalTapeRead(state->tapeset, tapenum,
+ tuple->t_data, tuple->t_len) != tuple->t_len)
+ elog(ERROR, "unexpected end of data");
+ if (state->randomAccess) /* need trailing length word? */
+ if (LogicalTapeRead(state->tapeset, tapenum, &tuplen,
+ sizeof(tuplen)) != sizeof(tuplen))
+ elog(ERROR, "unexpected end of data");
+ stup->tuple = (void *) tuple;
+ /* set up first-column key value, if it's a simple column */
+ if (state->indexInfo->ii_KeyAttrNumbers[0] != 0)
+ stup->datum1 = heap_getattr(tuple,
+ state->indexInfo->ii_KeyAttrNumbers[0],
+ state->tupDesc,
+ &stup->isnull1);
+}
+
+
/*
* Routines specialized for IndexTuple case
*
extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
extern void cost_sort(Path *path, PlannerInfo *root,
List *pathkeys, Cost input_cost, double tuples, int width,
+ Cost comparison_cost, int sort_mem,
double limit_tuples);
extern void cost_material(Path *path,
Cost input_startup_cost, Cost input_total_cost,
extern void estimate_rel_size(Relation rel, int32 *attr_widths,
BlockNumber *pages, double *tuples);
+extern int32 get_relation_data_width(Oid relid);
+
extern bool relation_excluded_by_constraints(PlannerInfo *root,
RelOptInfo *rel, RangeTblEntry *rte);
extern Expr *expression_planner(Expr *expr);
+extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
+
#endif /* PLANNER_H */
typedef struct Tuplesortstate Tuplesortstate;
/*
- * We provide two different interfaces to what is essentially the same
- * code: one for sorting HeapTuples and one for sorting IndexTuples.
- * They differ primarily in the way that the sort key information is
- * supplied. Also, the HeapTuple case actually stores MinimalTuples,
- * which means it doesn't preserve the "system columns" (tuple identity and
- * transaction visibility info). The IndexTuple case does preserve all
- * the header fields of an index entry. In the HeapTuple case we can
- * save some cycles by passing and returning the tuples in TupleTableSlots,
- * rather than forming actual HeapTuples (which'd have to be converted to
- * MinimalTuples).
+ * We provide multiple interfaces to what is essentially the same code,
+ * since different callers have different data to be sorted and want to
+ * specify the sort key information differently. There are two APIs for
+ * sorting HeapTuples and two more for sorting IndexTuples. Yet another
+ * API supports sorting bare Datums.
*
- * The IndexTuple case is itself broken into two subcases, one for btree
- * indexes and one for hash indexes; the latter variant actually sorts
- * the tuples by hash code. The API is the same except for the "begin"
- * routine.
+ * The "heap" API actually stores/sorts MinimalTuples, which means it doesn't
+ * preserve the system columns (tuple identity and transaction visibility
+ * info). The sort keys are specified by column numbers within the tuples
+ * and sort operator OIDs. We save some cycles by passing and returning the
+ * tuples in TupleTableSlots, rather than forming actual HeapTuples (which'd
+ * have to be converted to MinimalTuples). This API works well for sorts
+ * executed as parts of plan trees.
*
- * Yet another slightly different interface supports sorting bare Datums.
+ * The "cluster" API stores/sorts full HeapTuples including all visibility
+ * info. The sort keys are specified by reference to a btree index that is
+ * defined on the relation to be sorted. Note that putheaptuple/getheaptuple
+ * go with this API, not the "begin_heap" one!
+ *
+ * The "index_btree" API stores/sorts IndexTuples (preserving all their
+ * header fields). The sort keys are specified by a btree index definition.
+ *
+ * The "index_hash" API is similar to index_btree, but the tuples are
+ * actually sorted by their hash codes not the raw data.
*/
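+
+/*
+ * A rough sketch of the call sequence for the "cluster" API, as used by
+ * CLUSTER itself (see copy_heap_data in commands/cluster.c):
+ *
+ *		state = tuplesort_begin_cluster(oldTupDesc, OldIndex,
+ *										maintenance_work_mem, false);
+ *		(for each input tuple)
+ *			tuplesort_putheaptuple(state, tuple);
+ *		tuplesort_performsort(state);
+ *		(for each output tuple)
+ *			tuple = tuplesort_getheaptuple(state, true, &shouldfree);
+ *		tuplesort_end(state);
+ */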
extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
Oid *sortOperators, bool *nullsFirstFlags,
int workMem, bool randomAccess);
+extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
+ Relation indexRel,
+ int workMem, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_index_btree(Relation indexRel,
bool enforceUnique,
int workMem, bool randomAccess);
extern void tuplesort_puttupleslot(Tuplesortstate *state,
TupleTableSlot *slot);
+extern void tuplesort_putheaptuple(Tuplesortstate *state, HeapTuple tup);
extern void tuplesort_putindextuple(Tuplesortstate *state, IndexTuple tuple);
extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
bool isNull);
extern bool tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
TupleTableSlot *slot);
+extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward,
+ bool *should_free);
extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward,
bool *should_free);
extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward,