-<!-- $PostgreSQL: pgsql/doc/src/sgml/planstats.sgml,v 1.8 2007/01/31 20:56:18 momjian Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/planstats.sgml,v 1.9 2007/12/28 21:03:31 tgl Exp $ -->
<chapter id="planner-stats-details">
<title>How the Planner Uses Statistics</title>
<para>
- This chapter builds on the material covered in <xref linkend="using-explain">
- and <xref linkend="planner-stats">, and shows how the planner uses the
- system statistics to estimate the number of rows each stage in a query might
- return. This is a significant part of the planning / optimizing process,
+ This chapter builds on the material covered in <xref
+ linkend="using-explain"> and <xref linkend="planner-stats"> to show some
+ additional details about how the planner uses the
+ system statistics to estimate the number of rows each part of a query might
+ return. This is a significant part of the planning process,
providing much of the raw material for cost calculation.
</para>
<para>
- The intent of this chapter is not to document the code —
- better done in the code itself, but to present an overview of how it works.
- This will perhaps ease the learning curve for someone who subsequently
- wishes to read the code. As a consequence, the approach chosen is to analyze
- a series of incrementally more complex examples.
- </para>
-
- <para>
- The outputs and algorithms shown below are taken from version 8.0.
- The behavior of earlier (or later) versions might vary.
+ The intent of this chapter is not to document the code in detail,
+ but to present an overview of how it works.
+ This will perhaps ease the learning curve for someone who subsequently
+ wishes to read the code.
</para>
<sect1 id="row-estimation-examples">
</indexterm>
<para>
- Using examples drawn from the regression test database, let's start with a
- very simple query:
+ The examples shown below use tables in the <productname>PostgreSQL</>
+ regression test database.
+ The outputs shown are taken from version 8.3.
+ The behavior of earlier (or later) versions might vary.
+ Note also that since <command>ANALYZE</> uses random sampling
+ while producing statistics, the results will change slightly after
+ any new <command>ANALYZE</>.
+ </para>
+
+ <para>
+ Let's start with a very simple query:
+
<programlisting>
EXPLAIN SELECT * FROM tenk1;
QUERY PLAN
-------------------------------------------------------------
- Seq Scan on tenk1 (cost=0.00..445.00 rows=10000 width=244)
+ Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244)
</programlisting>
-
- How the planner determines the cardinality of <classname>tenk1</classname>
- is covered in <xref linkend="using-explain">, but is repeated here for
- completeness. The number of rows is looked up from
- <classname>pg_class</classname>:
+
+ How the planner determines the cardinality of <structname>tenk1</structname>
+ is covered in <xref linkend="planner-stats">, but is repeated here for
+ completeness. The number of pages and rows is looked up in
+ <structname>pg_class</structname>:
<programlisting>
-SELECT reltuples, relpages FROM pg_class WHERE relname = 'tenk1';
+SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
relpages | reltuples
----------+-----------
- 345 | 10000
-</programlisting>
- The planner will check the <structfield>relpages</structfield>
- estimate (this is a cheap operation) and if incorrect might scale
- <structfield>reltuples</structfield> to obtain a row estimate. In this
- case it does not, thus:
-
-<programlisting>
-rows = 10000
+ 358 | 10000
</programlisting>
+ These numbers are current as of the last <command>VACUUM</> or
+ <command>ANALYZE</> on the table. The planner then fetches the
+ actual current number of pages in the table (this is a cheap operation,
+ not requiring a table scan). If that is different from
+ <structfield>relpages</structfield> then
+ <structfield>reltuples</structfield> is scaled accordingly to
+ arrive at a current number-of-rows estimate. In this case the values
+ are correct so the rows estimate is the same as
+ <structfield>reltuples</structfield>.
</para>
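The scaling computation can be sketched in a few lines of Python (purely illustrative; the planner does this in C, and both the function name and the 400-page figure below are invented for the example):

```python
def estimate_rows(reltuples, relpages, current_pages):
    """Scale the stored tuple count from pg_class by the observed
    change in the table's physical page count."""
    if relpages == 0:
        # no stored density to scale; treat the table as empty
        return float(reltuples)
    # equivalent to (reltuples / relpages) * current_pages
    return reltuples * current_pages / relpages

# Stats match reality, so the estimate is simply reltuples:
print(estimate_rows(10000, 358, 358))      # 10000.0

# Had the table grown to (say) 400 pages since the last ANALYZE,
# the row estimate would be scaled up proportionally:
print(estimate_rows(10000, 358, 400))      # roughly 11173
```
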
-
+
<para>
- let's move on to an example with a range condition in its
+ Let's move on to an example with a range condition in its
<literal>WHERE</literal> clause:
<programlisting>
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000;
- QUERY PLAN
-------------------------------------------------------------
- Seq Scan on tenk1 (cost=0.00..470.00 rows=1031 width=244)
- Filter: (unique1 < 1000)
-</programlisting>
-
- The planner examines the <literal>WHERE</literal> clause condition:
-
-<programlisting>
-unique1 < 1000
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Bitmap Heap Scan on tenk1 (cost=24.06..394.64 rows=1007 width=244)
+ Recheck Cond: (unique1 < 1000)
+ -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0)
+ Index Cond: (unique1 < 1000)
</programlisting>
- and looks up the restriction function for the operator
- <literal><</literal> in <classname>pg_operator</classname>.
- This is held in the column <structfield>oprrest</structfield>,
- and the result in this case is <function>scalarltsel</function>.
- The <function>scalarltsel</function> function retrieves the histogram for
- <structfield>unique1</structfield> from <classname>pg_statistics</classname>
- - we can follow this by using the simpler <classname>pg_stats</classname>
+ The planner examines the <literal>WHERE</literal> clause condition
+ and looks up the selectivity function for the operator
+ <literal><</literal> in <structname>pg_operator</structname>.
+ This is held in the column <structfield>oprrest</structfield>,
+ and the entry in this case is <function>scalarltsel</function>.
+ The <function>scalarltsel</function> function retrieves the histogram for
+ <structfield>unique1</structfield> from
+ <structname>pg_statistic</structname>. For manual queries it is more

+ convenient to look in the simpler <structname>pg_stats</structname>
view:
<programlisting>
-SELECT histogram_bounds FROM pg_stats
+SELECT histogram_bounds FROM pg_stats
WHERE tablename='tenk1' AND attname='unique1';
histogram_bounds
------------------------------------------------------
- {1,970,1943,2958,3971,5069,6028,7007,7919,8982,9995}
+ {0,993,1997,3050,4040,5036,5957,7057,8029,9016,9995}
</programlisting>
- Next the fraction of the histogram occupied by <quote>< 1000</quote>
- is worked out. This is the selectivity. The histogram divides the range
- into equal frequency buckets, so all we have to do is locate the bucket
- that our value is in and count <emphasis>part</emphasis> of it and
- <emphasis>all</emphasis> of the ones before. The value 1000 is clearly in
- the second (970 - 1943) bucket, so by assuming a linear distribution of
- values inside each bucket we can calculate the selectivity as:
+ Next the fraction of the histogram occupied by <quote>< 1000</quote>
+ is worked out. This is the selectivity. The histogram divides the range
+ into equal frequency buckets, so all we have to do is locate the bucket
+ that our value is in and count <emphasis>part</emphasis> of it and
+ <emphasis>all</emphasis> of the ones before. The value 1000 is clearly in
+ the second bucket (993-1997). Assuming a linear distribution of
+ values inside each bucket, we can calculate the selectivity as:
<programlisting>
-selectivity = (1 + (1000 - bckt[2].min)/(bckt[2].max - bckt[2].min))/num_bckts
- = (1 + (1000 - 970)/(1943 - 970))/10
- = 0.1031
+selectivity = (1 + (1000 - bucket[2].min)/(bucket[2].max - bucket[2].min))/num_buckets
+ = (1 + (1000 - 993)/(1997 - 993))/10
+ = 0.100697
</programlisting>
that is, one whole bucket plus a linear fraction of the second, divided by
the number of buckets. The estimated number of rows can now be calculated as
- the product of the selectivity and the cardinality of
- <classname>tenk1</classname>:
+ the product of the selectivity and the cardinality of
+ <structname>tenk1</structname>:
<programlisting>
rows = rel_cardinality * selectivity
- = 10000 * 0.1031
- = 1031
+ = 10000 * 0.100697
+ = 1007 (rounding off)
</programlisting>
-
</para>
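The bucket arithmetic can be reproduced with a short Python sketch (an illustration only; the function name is ours, and the real logic lives in <function>scalarltsel</function> in <filename>selfuncs.c</filename>):

```python
def histogram_selectivity(bounds, value):
    """Fraction of the histogram population below `value`, assuming a
    linear distribution of values inside each bucket."""
    num_buckets = len(bounds) - 1
    if value <= bounds[0]:
        return 0.0
    if value >= bounds[-1]:
        return 1.0
    for i in range(num_buckets):
        lo, hi = bounds[i], bounds[i + 1]
        if lo <= value < hi:
            # i whole buckets below, plus a linear fraction of this one
            return (i + (value - lo) / (hi - lo)) / num_buckets

# histogram_bounds for tenk1.unique1, as shown above:
bounds = [0, 993, 1997, 3050, 4040, 5036, 5957, 7057, 8029, 9016, 9995]

selectivity = histogram_selectivity(bounds, 1000)
print(round(selectivity, 6))             # 0.100697
print(round(10000 * selectivity))        # 1007
```

The same function applied to the value 50 gives about 0.005, which matches the estimate worked out for the join example later in this section.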
<para>
- Next let's consider an example with equality condition in its
+ Next let's consider an example with an equality condition in its
<literal>WHERE</literal> clause:
<programlisting>
-EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'ATAAAA';
+EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'CRAAAA';
QUERY PLAN
----------------------------------------------------------
- Seq Scan on tenk1 (cost=0.00..470.00 rows=31 width=244)
- Filter: (stringu1 = 'ATAAAA'::name)
+ Seq Scan on tenk1 (cost=0.00..483.00 rows=30 width=244)
+ Filter: (stringu1 = 'CRAAAA'::name)
</programlisting>
- Again the planner examines the <literal>WHERE</literal> clause condition:
+ Again the planner examines the <literal>WHERE</literal> clause condition
+ and looks up the selectivity function for <literal>=</literal>, which is
+ <function>eqsel</function>. For equality estimation the histogram is
+ not useful; instead the list of <firstterm>most
+ common values</> (<acronym>MCV</acronym>s) is used to determine the
+ selectivity. Let's have a look at the MCVs, with some additional columns
+ that will be useful later:
<programlisting>
-stringu1 = 'ATAAAA'
-</programlisting>
-
- and looks up the restriction function for <literal>=</literal>, which is
- <function>eqsel</function>. This case is a bit different, as the most
- common values — <acronym>MCV</acronym>s, are used to determine the
- selectivity. Let's have a look at these, with some extra columns that will
- be useful later:
-
-<programlisting>
-SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
+SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
WHERE tablename='tenk1' AND attname='stringu1';
null_frac | 0
-n_distinct | 672
-most_common_vals | {FDAAAA,NHAAAA,ATAAAA,BGAAAA,EBAAAA,MOAAAA,NDAAAA,OWAAAA,BHAAAA,BJAAAA}
-most_common_freqs | {0.00333333,0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.00266667,0.00266667}
+n_distinct | 676
+most_common_vals | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,JOAAAA,MCAAAA,NAAAAA,WGAAAA}
+most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003}
+
</programlisting>
- The selectivity is merely the most common frequency (<acronym>MCF</acronym>)
- corresponding to the third <acronym>MCV</acronym> — 'ATAAAA':
+ Since <literal>CRAAAA</> appears in the list of MCVs, the selectivity is
+ merely the corresponding entry in the list of most common frequencies
+ (<acronym>MCF</acronym>s):
<programlisting>
selectivity = mcf[3]
= 0.003
</programlisting>
- The estimated number of rows is just the product of this with the
- cardinality of <classname>tenk1</classname> as before:
+ As before, the estimated number of rows is just the product of this with the
+ cardinality of <structname>tenk1</structname>:
<programlisting>
rows = 10000 * 0.003
= 30
</programlisting>
-
- The number displayed by <command>EXPLAIN</command> is one more than this,
- due to some post estimation checks.
</para>
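In Python terms the MCV path is just a list lookup (a sketch with our own names; <function>eqsel</function> itself is C code). The two lists are the <structname>pg_stats</structname> arrays shown above:

```python
most_common_vals  = ['EJAAAA', 'BBAAAA', 'CRAAAA', 'FCAAAA', 'FEAAAA',
                     'GSAAAA', 'JOAAAA', 'MCAAAA', 'NAAAAA', 'WGAAAA']
most_common_freqs = [0.00333333, 0.003, 0.003, 0.003, 0.003,
                     0.003, 0.003, 0.003, 0.003, 0.003]

def eqsel_mcv(value):
    """Selectivity of `stringu1 = value` when value is in the MCV list:
    simply the matching frequency entry."""
    return most_common_freqs[most_common_vals.index(value)]

selectivity = eqsel_mcv('CRAAAA')
print(selectivity)                   # 0.003
print(round(10000 * selectivity))    # 30
```
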
<para>
- Now consider the same query, but with a constant that is not in the
+ Now consider the same query, but with a constant that is not in the
<acronym>MCV</acronym> list:
<programlisting>
QUERY PLAN
----------------------------------------------------------
- Seq Scan on tenk1 (cost=0.00..470.00 rows=15 width=244)
+ Seq Scan on tenk1 (cost=0.00..483.00 rows=15 width=244)
Filter: (stringu1 = 'xxx'::name)
</programlisting>
- This is quite a different problem, how to estimate the selectivity when the
- value is <emphasis>not</emphasis> in the <acronym>MCV</acronym> list.
- The approach is to use the fact that the value is not in the list,
+ This is quite a different problem: how to estimate the selectivity when the
+ value is <emphasis>not</emphasis> in the <acronym>MCV</acronym> list.
+ The approach is to use the fact that the value is not in the list,
combined with the knowledge of the frequencies for all of the
<acronym>MCV</acronym>s:
<programlisting>
selectivity = (1 - sum(mvf))/(num_distinct - num_mcv)
- = (1 - (0.00333333 + 0.00333333 + 0.003 + 0.003 + 0.003
- + 0.003 + 0.003 + 0.003 + 0.00266667 + 0.00266667))/(672 - 10)
- = 0.001465
+ = (1 - (0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 +
+ 0.003 + 0.003 + 0.003 + 0.003))/(676 - 10)
+ = 0.0014559
</programlisting>
- That is, add up all the frequencies for the <acronym>MCV</acronym>s and
- subtract them from one — because it is <emphasis>not</emphasis> one
- of these, and divide by the <emphasis>remaining</emphasis> distinct values.
- Notice that there are no null values so we don't have to worry about those.
- The estimated number of rows is calculated as usual:
+ That is, add up all the frequencies for the <acronym>MCV</acronym>s and
+ subtract them from one, then
+ divide by the number of <emphasis>other</emphasis> distinct values.
+ This amounts to assuming that the fraction of the column that is not any
+ of the MCVs is evenly distributed among all the other distinct values.
+ Notice that there are no null values so we don't have to worry about those
+ (otherwise we'd subtract the null fraction from the numerator as well).
+ The estimated number of rows is then calculated as usual:
<programlisting>
-rows = 10000 * 0.001465
- = 15
+rows = 10000 * 0.0014559
+ = 15 (rounding off)
</programlisting>
-
 </para>
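The same estimate, as an illustrative Python sketch (the function name is ours): spread the non-null, non-MCV fraction of the rows evenly over the remaining distinct values:

```python
def eqsel_non_mcv(null_frac, n_distinct, mcv_freqs):
    """Selectivity of an equality comparison against a value that is
    not in the MCV list: the non-null, non-MCV fraction of the rows,
    divided evenly among the other distinct values."""
    remaining_fraction = 1.0 - null_frac - sum(mcv_freqs)
    remaining_distinct = n_distinct - len(mcv_freqs)
    return remaining_fraction / remaining_distinct

# stringu1: no nulls, 676 distinct values, the ten MCV frequencies above
freqs = [0.00333333, 0.003, 0.003, 0.003, 0.003,
         0.003, 0.003, 0.003, 0.003, 0.003]

selectivity = eqsel_non_mcv(0.0, 676, freqs)
print(round(selectivity, 7))         # 0.001456
print(round(10000 * selectivity))    # 15
```
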
<para>
- Let's increase the complexity to consider a case with more than one
- condition in the <literal>WHERE</literal> clause:
+ The previous example with <literal>unique1 < 1000</> was an
+ oversimplification of what <function>scalarltsel</function> really does;
+ now that we have seen an example of the use of MCVs, we can fill in some
+ more detail. The example was correct as far as it went: since
+ <structfield>unique1</> is a unique column it has no MCVs (obviously, no
+ value is any more common than any other value). For a non-unique
+ column, there will normally be both a histogram and an MCV list, and
+ <emphasis>the histogram does not include the portion of the column
+ population represented by the MCVs</>. We do things this way because
+ it allows more precise estimation. In this situation
+ <function>scalarltsel</function> directly applies the condition (e.g.,
+ <quote>< 1000</>) to each value of the MCV list, and adds up the
+ frequencies of the MCVs for which the condition is true. This gives
+ an exact estimate of the selectivity within the portion of the table
+ that is MCVs. The histogram is then used in the same way as above
+ to estimate the selectivity in the portion of the table that is not
+ MCVs, and then the two numbers are combined to estimate the overall
+ selectivity. For example, consider
<programlisting>
-EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000 AND stringu1 = 'xxx';
+EXPLAIN SELECT * FROM tenk1 WHERE stringu1 < 'IAAAAA';
- QUERY PLAN
------------------------------------------------------------
- Seq Scan on tenk1 (cost=0.00..495.00 rows=2 width=244)
- Filter: ((unique1 < 1000) AND (stringu1 = 'xxx'::name))
+ QUERY PLAN
+------------------------------------------------------------
+ Seq Scan on tenk1 (cost=0.00..483.00 rows=3077 width=244)
+ Filter: (stringu1 < 'IAAAAA'::name)
</programlisting>
- An assumption of independence is made and the selectivities of the
- individual restrictions are multiplied together:
+ We already saw the MCV information for <structfield>stringu1</>,
+ and here is its histogram:
<programlisting>
-selectivity = selectivity(unique1 < 1000) * selectivity(stringu1 = 'xxx')
- = 0.1031 * 0.001465
- = 0.00015104
+SELECT histogram_bounds FROM pg_stats
+WHERE tablename='tenk1' AND attname='stringu1';
+
+ histogram_bounds
+--------------------------------------------------------------------------------
+ {AAAAAA,CQAAAA,FRAAAA,IBAAAA,KRAAAA,NFAAAA,PSAAAA,SGAAAA,VAAAAA,XLAAAA,ZZAAAA}
</programlisting>
- The row estimates are calculated as before:
+ Checking the MCV list, we find that the condition <literal>stringu1 <
+ 'IAAAAA'</> is satisfied by the first six entries and not the last four,
+ so the selectivity within the MCV part of the population is
<programlisting>
-rows = 10000 * 0.00015104
- = 2
+selectivity = sum(relevant mcfs)
+ = 0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003
+ = 0.01833333
</programlisting>
+
+ Summing all the MCFs also tells us that the total fraction of the
+ population represented by MCVs is 0.03033333, and therefore the
+ fraction represented by the histogram is 0.96966667 (again, there
+ are no nulls, else we'd have to exclude them here). We can see
+ that the value <literal>IAAAAA</> falls nearly at the end of the
+ third histogram bucket. Using some rather cheesy assumptions
+ about the frequency of different characters, the planner arrives
+ at the estimate 0.298387 for the portion of the histogram population
+ that is less than <literal>IAAAAA</>. We then combine the estimates
+ for the MCV and non-MCV populations:
+
+<programlisting>
+selectivity = mcv_selectivity + histogram_selectivity * histogram_fraction
+ = 0.01833333 + 0.298387 * 0.96966667
+ = 0.307669
+
+rows = 10000 * 0.307669
+ = 3077 (rounding off)
+</programlisting>
+
+ In this particular example, the correction from the MCV list is fairly
+ small, because the column distribution is actually quite flat (the
+ statistics showing these particular values as being more common than
+ others are mostly due to sampling error). In a more typical case where
+ some values are significantly more common than others, this complicated
+ process gives a useful improvement in accuracy because the selectivity
+ for the most common values is found exactly.
</para>
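Combining the two pieces can be sketched as follows (illustration only; the 0.298387 histogram figure is taken as given from the planner rather than recomputed, since we do not reproduce its string-interpolation heuristics here, and Python's string comparison stands in for the <type>name</type> <literal><</literal> operator):

```python
mcv_vals  = ['EJAAAA', 'BBAAAA', 'CRAAAA', 'FCAAAA', 'FEAAAA',
             'GSAAAA', 'JOAAAA', 'MCAAAA', 'NAAAAA', 'WGAAAA']
mcv_freqs = [0.00333333, 0.003, 0.003, 0.003, 0.003,
             0.003, 0.003, 0.003, 0.003, 0.003]

# Apply the condition directly to each MCV and add up the matching
# frequencies; this part of the estimate is exact:
mcv_selectivity = sum(f for v, f in zip(mcv_vals, mcv_freqs)
                      if v < 'IAAAAA')

mcv_fraction = sum(mcv_freqs)            # 0.03033333
histogram_fraction = 1.0 - mcv_fraction  # no nulls to exclude

histogram_selectivity = 0.298387         # planner's estimate, taken as given

selectivity = mcv_selectivity + histogram_selectivity * histogram_fraction
print(round(selectivity, 6))             # 0.307669
print(round(10000 * selectivity))        # 3077
```
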
-
+
<para>
- Finally we will examine a query that includes a <literal>JOIN</literal>
- together with a <literal>WHERE</literal> clause:
+ Now let's consider a case with more than one
+ condition in the <literal>WHERE</literal> clause:
<programlisting>
-EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2
-WHERE t1.unique1 < 50 AND t1.unique2 = t2.unique2;
+EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000 AND stringu1 = 'xxx';
- QUERY PLAN
------------------------------------------------------------------------------------------
- Nested Loop (cost=0.00..346.90 rows=51 width=488)
- -> Index Scan using tenk1_unique1 on tenk1 t1 (cost=0.00..192.57 rows=51 width=244)
- Index Cond: (unique1 < 50)
- -> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.00..3.01 rows=1 width=244)
- Index Cond: ("outer".unique2 = t2.unique2)
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Bitmap Heap Scan on tenk1 (cost=23.80..396.91 rows=1 width=244)
+ Recheck Cond: (unique1 < 1000)
+ Filter: (stringu1 = 'xxx'::name)
+ -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0)
+ Index Cond: (unique1 < 1000)
</programlisting>
- The restriction on <classname>tenk1</classname>
- <quote>unique1 < 50</quote> is evaluated before the nested-loop join.
- This is handled analogously to the previous range example. The restriction
- operator for <literal><</literal> is <function>scalarlteqsel</function>
- as before, but this time the value 50 is in the first bucket of the
- <structfield>unique1</structfield> histogram:
+ The planner assumes that the two conditions are independent, so that
+ the individual selectivities of the clauses can be multiplied together:
+
+<programlisting>
+selectivity = selectivity(unique1 < 1000) * selectivity(stringu1 = 'xxx')
+ = 0.100697 * 0.0014559
+ = 0.0001466
+
+rows = 10000 * 0.0001466
+ = 1 (rounding off)
+</programlisting>
+
+ Notice that the number of rows estimated to be returned from the bitmap
+ index scan reflects only the condition used with the index; this is
+ important since it affects the cost estimate for the subsequent heap
+ fetches.
+ </para>
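The independence assumption reduces to a single multiplication; as a quick check on the numbers derived above:

```python
# With independence assumed, clause selectivities simply multiply:
sel_range = 0.100697    # unique1 < 1000, from the histogram
sel_equal = 0.0014559   # stringu1 = 'xxx', from the non-MCV estimate

selectivity = sel_range * sel_equal
print(round(selectivity, 7))         # 0.0001466
print(round(10000 * selectivity))    # 1
```
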
+
+ <para>
+ Finally we will examine a query that involves a join:
<programlisting>
-selectivity = (0 + (50 - bckt[1].min)/(bckt[1].max - bckt[1].min))/num_bckts
- = (0 + (50 - 1)/(970 - 1))/10
- = 0.005057
+EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2
+WHERE t1.unique1 < 50 AND t1.unique2 = t2.unique2;
-rows = 10000 * 0.005057
- = 51
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Nested Loop (cost=4.64..456.23 rows=50 width=488)
+ -> Bitmap Heap Scan on tenk1 t1 (cost=4.64..142.17 rows=50 width=244)
+ Recheck Cond: (unique1 < 50)
+ -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.63 rows=50 width=0)
+ Index Cond: (unique1 < 50)
+ -> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.00..6.27 rows=1 width=244)
+ Index Cond: (t2.unique2 = t1.unique2)
</programlisting>
- The restriction for the join is:
+ The restriction on <structname>tenk1</structname>,
+ <literal>unique1 < 50</literal>,
+ is evaluated before the nested-loop join.
+ This is handled analogously to the previous range example. This time the
+ value 50 falls into the first bucket of the
+ <structfield>unique1</structfield> histogram:
<programlisting>
-t2.unique2 = t1.unique2
+selectivity = (0 + (50 - bucket[1].min)/(bucket[1].max - bucket[1].min))/num_buckets
+ = (0 + (50 - 0)/(993 - 0))/10
+ = 0.005035
+
+rows = 10000 * 0.005035
+ = 50 (rounding off)
</programlisting>
- This is due to the join method being nested-loop, with
- <classname>tenk1</classname> being in the outer loop. The operator is just
- our familiar <literal>=</literal>, however the restriction function is
- obtained from the <structfield>oprjoin</structfield> column of
- <classname>pg_operator</classname> - and is <function>eqjoinsel</function>.
- Additionally we use the statistical information for both
- <classname>tenk2</classname> and <classname>tenk1</classname>:
+ The restriction for the join is <literal>t2.unique2 = t1.unique2</>.
+ The operator is just
+ our familiar <literal>=</literal>, however the selectivity function is
+ obtained from the <structfield>oprjoin</structfield> column of
+ <structname>pg_operator</structname>, and is <function>eqjoinsel</function>.
+ <function>eqjoinsel</function> looks up the statistical information for both
+ <structname>tenk2</structname> and <structname>tenk1</structname>:
<programlisting>
-SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats
-WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';
+SELECT tablename, null_frac, n_distinct, most_common_vals FROM pg_stats
+WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';
tablename | null_frac | n_distinct | most_common_vals
-----------+-----------+------------+------------------
 tenk1     | 0         | -1         |
tenk2 | 0 | -1 |
</programlisting>
- In this case there is no <acronym>MCV</acronym> information for
- <structfield>unique2</structfield> because all the values appear to be
- unique, so we can use an algorithm that relies only on the number of
+ In this case there is no <acronym>MCV</acronym> information for
+ <structfield>unique2</structfield> because all the values appear to be
+ unique, so we use an algorithm that relies only on the number of
distinct values for both relations together with their null fractions:
<programlisting>
selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
- = (1 - 0) * (1 - 0) * min(1/10000, 1/1000)
+ = (1 - 0) * (1 - 0) / max(10000, 10000)
= 0.0001
</programlisting>
- This is, subtract the null fraction from one for each of the relations,
- and divide by the maximum of the two distinct values. The number of rows
- that the join is likely to emit is calculated as the cardinality of
- Cartesian product of the two nodes in the nested-loop, multiplied by the
+ That is, subtract the null fraction from one for each of the relations,
+ and divide by the maximum of the numbers of distinct values (taking the
+ minimum of the reciprocals, as in the formula above, amounts to the same
+ thing).
+ The number of rows
+ that the join is likely to emit is calculated as the cardinality of the
+ Cartesian product of the two inputs, multiplied by the
selectivity:
<programlisting>
rows = (outer_cardinality * inner_cardinality) * selectivity
- = (51 * 10000) * 0.0001
- = 51
+ = (50 * 10000) * 0.0001
+ = 50
</programlisting>
</para>
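The no-MCV join estimate can be sketched like this (names are ours; the real <function>eqjoinsel</function> logic is in <filename>selfuncs.c</filename>). Note that <literal>n_distinct = -1</literal> in <structname>pg_stats</structname> means <quote>all values distinct</quote>, which resolves here to the 10000 rows of each table:

```python
def eqjoinsel_no_mcv(null_frac1, n_distinct1, null_frac2, n_distinct2):
    """Each non-null value on one side is assumed to match
    1/max(nd1, nd2) of the other side; min(1/nd1, 1/nd2) is the
    same quantity."""
    return ((1.0 - null_frac1) * (1.0 - null_frac2)
            / max(n_distinct1, n_distinct2))

selectivity = eqjoinsel_no_mcv(0, 10000, 0, 10000)
outer_cardinality = 50      # rows of tenk1 passing unique1 < 50
inner_cardinality = 10000   # all of tenk2

rows = round(outer_cardinality * inner_cardinality * selectivity)
print(selectivity, rows)    # 0.0001 50
```
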
<para>
- For those interested in further details, estimation of the number of rows in
- a relation is covered in
- <filename>src/backend/optimizer/util/plancat.c</filename>. The calculation
- logic for clause selectivities is in
- <filename>src/backend/optimizer/path/clausesel.c</filename>. The actual
- implementations of the operator and join restriction functions can be found
+ Had there been MCV lists for the two columns,
+ <function>eqjoinsel</function> would have used direct comparison of the MCV
+ lists to determine the join selectivity within the part of the column
+ populations represented by the MCVs. The estimate for the remainder of the
+ populations follows the same approach shown here.
+ </para>
+
+ <para>
+ Notice that we showed <literal>inner_cardinality</> as 10000, that is,
+ the unmodified size of <structname>tenk2</>. It might appear from
+ inspection of the <command>EXPLAIN</> output that the estimate of
+ join rows comes from 50 * 1, that is, the number of outer rows times
+ the estimated number of rows obtained by each inner indexscan on
+ <structname>tenk2</>. But this is not the case: the join relation size
+ is estimated before any particular join plan has been considered. If
+ everything is working well then the two ways of estimating the join
+ size will produce about the same answer, but due to roundoff error and
+ other factors they sometimes diverge significantly.
+ </para>
+
+ <para>
+ For those interested in further details, estimation of the size of
+ a table (before any <literal>WHERE</> clauses) is done in
+ <filename>src/backend/optimizer/util/plancat.c</filename>. The generic
+ logic for clause selectivities is in
+ <filename>src/backend/optimizer/path/clausesel.c</filename>. The
+ operator-specific selectivity functions are mostly found
in <filename>src/backend/utils/adt/selfuncs.c</filename>.
</para>
</sect1>
-
</chapter>