granicus.if.org Git - postgresql/blob - doc/src/sgml/xoper.sgml

   1 <!--
   2 $Header: /cvsroot/pgsql/doc/src/sgml/xoper.sgml,v 1.15 2001/10/26 21:17:03 tgl Exp $
   3 -->
   4
   5  <Chapter Id="xoper">
   6   <Title>Extending <Acronym>SQL</Acronym>: Operators</Title>
   7
   8   <Para>
   9    <ProductName>Postgres</ProductName> supports left unary,
  10    right  unary  and  binary
  11    operators.   Operators  can  be  overloaded; that is,
  12    the same operator name can be used for different operators
  13    that have different numbers and types of arguments.   If
  14    there  is  an ambiguous situation and the system cannot
  15    determine the correct operator to use, it  will  return
  16    an  error.  You may have to typecast the left and/or
  17    right operands to help it understand which operator you
  18    meant to use.
  19   </Para>
  20
  21   <Para>
  22    Every operator is <quote>syntactic sugar</quote> for a call to an
  23    underlying function that does the real work; so you must
  24    first create the underlying function before you can create
  25    the operator.  However, an operator is <emphasis>not</emphasis>
  26    merely syntactic sugar, because it carries additional information
  27    that helps the query planner optimize queries that use the
  28    operator.  Much of this chapter will be devoted to explaining
  29    that additional information.
  30   </Para>
  31
  32   <Para>
  33    Here is an example of creating an operator for adding two
  34    complex numbers.  We assume we've already created the definition
  35    of type complex.  First we need a function that does the work;
  36    then we can define the operator:
  37
  38    <ProgramListing>
  39 CREATE FUNCTION complex_add(complex, complex)
  40     RETURNS complex
  41     AS '<replaceable>PGROOT</replaceable>/tutorial/complex'
  42     LANGUAGE C;
  43
  44 CREATE OPERATOR + (
  45     leftarg = complex,
  46     rightarg = complex,
  47     procedure = complex_add,
  48     commutator = +
  49 );
  50    </ProgramListing>
  51   </Para>
  52
  53   <Para>
  54    Now we can do:
  55
  56    <ProgramListing>
  57 SELECT (a + b) AS c FROM test_complex;
  58
  59 +----------------+
  60 |c               |
  61 +----------------+
  62 |(5.2,6.05)      |
  63 +----------------+
  64 |(133.42,144.95) |
  65 +----------------+
  66    </ProgramListing>
  67   </Para>
  68
  69   <Para>
  70    We've shown how to create a binary  operator  here.  To
  71    create  unary  operators, just omit one of leftarg (for
  72    left unary) or rightarg (for right unary).  The procedure
  73    clause and the argument clauses are the only required items
  74    in CREATE OPERATOR.  The COMMUTATOR clause shown in the example
  75    is an optional hint to the query optimizer.  Further details about
  76    COMMUTATOR and other optimizer hints appear below.
  77   </Para>
  78
  79   <sect1 id="xoper-optimization">
  80    <title>Operator Optimization Information</title>
  81
  82    <note>
  83     <title>Author</title>
  84     <para>
  85      Written by Tom Lane.
  86     </para>
  87    </note>
  88
  89    <para>
  90     A <ProductName>Postgres</ProductName> operator definition can include
  91     several optional clauses that tell the system useful things about how
  92     the operator behaves.  These clauses should be provided whenever
  93     appropriate, because they can make for considerable speedups in execution
  94     of queries that use the operator.  But if you provide them, you must be
  95     sure that they are right!  Incorrect use of an optimization clause can
  96     result in backend crashes, subtly wrong output, or other Bad Things.
  97     You can always leave out an optimization clause if you are not sure
  98     about it; the only consequence is that queries might run slower than
  99     they need to.
 100    </para>
 101
 102    <para>
 103     Additional optimization clauses might be added in future versions of
 104     <ProductName>Postgres</ProductName>.  The ones described here are all
 105     the ones that release 6.5 understands.
 106    </para>
 107
 108    <sect2>
 109     <title>COMMUTATOR</title>
 110
 111     <para>
 112      The COMMUTATOR clause, if provided, names an operator that is the
 113      commutator of the operator being defined.  We say that operator A is the
 114      commutator of operator B if (x A y) equals (y B x) for all possible input
 115      values x,y.  Notice that B is also the commutator of A.  For example,
 116      operators <literal>&lt;</> and <literal>&gt;</> for a particular data type are usually each others'
 117      commutators, and operator <literal>+</> is usually commutative with itself.
 118      But operator <literal>-</> is usually not commutative with anything.
 119     </para>
 120
 121     <para>
 122      The left argument type of a commuted operator is the same as the
 123      right argument type of its commutator, and vice versa.  So the name of
 124      the commutator operator is all that <ProductName>Postgres</ProductName>
 125      needs to be given to look up the commutator, and that's all that need
 126      be provided in the COMMUTATOR clause.
 127     </para>
 128
 129     <para>
 130      When you are defining a self-commutative operator, you just do it.
 131      When you are defining a pair of commutative operators, things are
 132      a little trickier: how can the first one to be defined refer to the
 133      other one, which you haven't defined yet?  There are two solutions
 134      to this problem:
 135
 136      <itemizedlist>
 137       <listitem>
 138        <para>
 139         One way is to omit the COMMUTATOR clause in the first operator that
 140         you define, and then provide one in the second operator's definition.
 141         Since <ProductName>Postgres</ProductName> knows that commutative
 142         operators come in pairs, when it sees the second definition it will
 143         automatically go back and fill in the missing COMMUTATOR clause in
 144         the first definition.
 145        </para>
 146       </listitem>
 147
 148       <listitem>
 149        <para>
 150         The other, more straightforward way is just to include COMMUTATOR clauses
 151         in both definitions.  When <ProductName>Postgres</ProductName> processes
 152         the first definition and realizes that COMMUTATOR refers to a non-existent
 153         operator, the system will make a dummy entry for that operator in the
 154         system's pg_operator table.  This dummy entry will have valid data only
 155         for the operator name, left and right argument types, and result type,
 156         since that's all that <ProductName>Postgres</ProductName> can deduce
 157         at this point.  The first operator's catalog entry will link to this
 158         dummy entry.  Later, when you define the second operator, the system
 159         updates the dummy entry with the additional information from the second
 160         definition.  If you try to use the dummy operator before it's been filled
 161         in, you'll just get an error message.  (Note: this procedure did not work
 162         reliably in <ProductName>Postgres</ProductName> versions before 6.5,
 163         but it is now the recommended way to do things.)
 164        </para>
 165       </listitem>
 166      </itemizedlist>
 167     </para>
 168    </sect2>
 169
 170    <sect2>
 171     <title>NEGATOR</title>
 172
 173     <para>
 174      The NEGATOR clause, if provided, names an operator that is the
 175      negator of the operator being defined.  We say that operator A
 176      is the negator of operator B if both return boolean results and
 177      (x A y) equals NOT (x B y) for all possible inputs x,y.
 178      Notice that B is also the negator of A.
 179      For example, <literal>&lt;</> and <literal>&gt;=</> are a negator pair for most data types.
 180      An operator can never be validly be its own negator.
 181     </para>
 182
 183    <para>
 184     Unlike COMMUTATOR, a pair of unary operators could validly be marked
 185     as each others' negators; that would mean (A x) equals NOT (B x)
 186     for all x, or the equivalent for right-unary operators.
 187    </para>
 188
 189    <para>
 190     An operator's negator must have the same left and/or right argument types
 191     as the operator itself, so just as with COMMUTATOR, only the operator
 192     name need be given in the NEGATOR clause.
 193    </para>
 194
 195    <para>
 196     Providing NEGATOR is very helpful to the query optimizer since
 197     it allows expressions like NOT (x = y) to be simplified into
 198     x &lt;&gt; y.  This comes up more often than you might think, because
 199     NOTs can be inserted as a consequence of other rearrangements.
 200    </para>
 201
 202    <para>
 203     Pairs of negator operators can be defined using the same methods
 204     explained above for commutator pairs.
 205    </para>
 206
 207   </sect2>
 208
 209   <sect2>
 210    <title>RESTRICT</title>
 211
 212    <para>
 213     The RESTRICT clause, if provided, names a restriction selectivity
 214     estimation function for the operator (note that this is a function
 215     name, not an operator name).  RESTRICT clauses only make sense for
 216     binary operators that return boolean.  The idea behind a restriction
 217     selectivity estimator is to guess what fraction of the rows in a
 218     table will satisfy a WHERE-clause condition of the form
 219    <ProgramListing>
 220                 field OP constant
 221    </ProgramListing>
 222     for the current operator and a particular constant value.
 223     This assists the optimizer by
 224     giving it some idea of how many rows will be eliminated by WHERE
 225     clauses that have this form.  (What happens if the constant is on
 226     the left, you may be wondering?  Well, that's one of the things that
 227     COMMUTATOR is for...)
 228    </para>
 229
 230    <para>
 231     Writing new restriction selectivity estimation functions is far beyond
 232     the scope of this chapter, but fortunately you can usually just use
 233     one of the system's standard estimators for many of your own operators.
 234     These are the standard restriction estimators:
 235    <ProgramListing>
 236         eqsel           for =
 237         neqsel          for &lt;&gt;
 238         scalarltsel     for &lt; or &lt;=
 239         scalargtsel     for &gt; or &gt;=
 240    </ProgramListing>
 241     It might seem a little odd that these are the categories, but they
 242     make sense if you think about it.  <literal>=</> will typically accept only
 243     a small fraction of the rows in a table; <literal>&lt;&gt;</> will typically reject
 244     only a small fraction.  <literal>&lt;</> will accept a fraction that depends on
 245     where the given constant falls in the range of values for that table
 246     column (which, it just so happens, is information collected by
 247     <command>ANALYZE</command> and made available to the selectivity estimator).
 248     <literal>&lt;=</> will accept a slightly larger fraction than <literal>&lt;</> for the same
 249     comparison constant, but they're close enough to not be worth
 250     distinguishing, especially since we're not likely to do better than a
 251     rough guess anyhow.  Similar remarks apply to <literal>&gt;</> and <literal>&gt;=</>.
 252    </para>
 253
 254    <para>
 255     You can frequently get away with using either eqsel or neqsel for
 256     operators that have very high or very low selectivity, even if they
 257     aren't really equality or inequality.  For example, the
 258     approximate-equality geometric operators use eqsel on the assumption that
 259     they'll usually only match a small fraction of the entries in a table.
 260    </para>
 261
 262    <para>
 263     You can use scalarltsel and scalargtsel for comparisons on data types that
 264     have some sensible means of being converted into numeric scalars for
 265     range comparisons.  If possible, add the data type to those understood
 266     by the routine convert_to_scalar() in <filename>src/backend/utils/adt/selfuncs.c</filename>.
 267     (Eventually, this routine should be replaced by per-data-type functions
 268     identified through a column of the pg_type table; but that hasn't happened
 269     yet.)  If you do not do this, things will still work, but the optimizer's
 270     estimates won't be as good as they could be.
 271    </para>
 272
 273    <para>
 274     There are additional selectivity functions designed for geometric
 275     operators in <filename>src/backend/utils/adt/geo_selfuncs.c</filename>: areasel, positionsel,
 276     and contsel.  At this writing these are just stubs, but you may want
 277     to use them (or even better, improve them) anyway.
 278    </para>
 279    </sect2>
 280
 281    <sect2>
 282     <title>JOIN</title>
 283
 284     <para>
 285      The JOIN clause, if provided, names a join selectivity
 286      estimation function for the operator (note that this is a function
 287      name, not an operator name).  JOIN clauses only make sense for
 288      binary operators that return boolean.  The idea behind a join
 289      selectivity estimator is to guess what fraction of the rows in a
 290      pair of tables will satisfy a WHERE-clause condition of the form
 291      <ProgramListing>
 292                 table1.field1 OP table2.field2
 293      </ProgramListing>
 294      for the current operator.  As with the RESTRICT clause, this helps
 295      the optimizer very substantially by letting it figure out which
 296      of several possible join sequences is likely to take the least work.
 297     </para>
 298
 299     <para>
 300      As before, this chapter will make no attempt to explain how to write
 301      a join selectivity estimator function, but will just suggest that
 302      you use one of the standard estimators if one is applicable:
 303      <ProgramListing>
 304         eqjoinsel       for =
 305         neqjoinsel      for &lt;&gt;
 306         scalarltjoinsel for &lt; or &lt;=
 307         scalargtjoinsel for &gt; or &gt;=
 308         areajoinsel     for 2D area-based comparisons
 309         positionjoinsel for 2D position-based comparisons
 310         contjoinsel     for 2D containment-based comparisons
 311     </ProgramListing>
 312     </para>
 313    </sect2>
 314
 315    <sect2>
 316     <title>HASHES</title>
 317
 318     <para>
 319      The HASHES clause, if present, tells the system that it is OK to
 320      use the hash join method for a join based on this operator.  HASHES
 321      only makes sense for binary operators that return boolean, and
 322      in practice the operator had better be equality for some data type.
 323     </para>
 324
 325     <para>
 326      The assumption underlying hash join is that the join operator can
 327      only return TRUE for pairs of left and right values that hash to the
 328      same hash code.  If two values get put in different hash buckets, the
 329      join will never compare them at all, implicitly assuming that the
 330      result of the join operator must be FALSE.  So it never makes sense
 331      to specify HASHES for operators that do not represent equality.
 332     </para>
 333
 334     <para>
 335      In fact, logical equality is not good enough either; the operator
 336      had better represent pure bitwise equality, because the hash function
 337      will be computed on the memory representation of the values regardless
 338      of what the bits mean.  For example, equality of
 339      time intervals is not bitwise equality; the interval equality operator
 340      considers two time intervals equal if they have the same
 341      duration, whether or not their endpoints are identical.  What this means
 342      is that a join using <literal>=</literal> between interval fields would yield different
 343      results if implemented as a hash join than if implemented another way,
 344      because a large fraction of the pairs that should match will hash to
 345      different values and will never be compared by the hash join.  But
 346      if the optimizer chose to use a different kind of join, all the pairs
 347      that the equality operator says are equal will be found.
 348      We don't want that kind of inconsistency, so we don't mark interval
 349      equality as hashable.
 350     </para>
 351
 352     <para>
 353      There are also machine-dependent ways in which a hash join might fail
 354      to do the right thing.  For example, if your data type
 355      is a structure in which there may be uninteresting pad bits, it's unsafe
 356      to mark the equality operator HASHES.  (Unless, perhaps, you write
 357      your other operators to ensure that the unused bits are always zero.)
 358      Another example is that the FLOAT data types are unsafe for hash
 359      joins.  On machines that meet the <acronym>IEEE</> floating point standard, minus
 360      zero and plus zero are different values (different bit patterns) but
 361      they are defined to compare equal.  So, if float equality were marked
 362      HASHES, a minus zero and a plus zero would probably not be matched up
 363      by a hash join, but they would be matched up by any other join process.
 364     </para>
 365
 366     <para>
 367      The bottom line is that you should probably only use HASHES for
 368      equality operators that are (or could be) implemented by <function>memcmp()</function>.
 369     </para>
 370
 371    </sect2>
 372
 373    <sect2>
 374     <title>SORT1 and SORT2</title>
 375
 376     <para>
 377      The SORT clauses, if present, tell the system that it is permissible to use
 378      the merge join method for a join based on the current operator.
 379      Both must be specified if either is.  The current operator must be
 380      equality for some pair of data types, and the SORT1 and SORT2 clauses
 381      name the ordering operator ('<' operator) for the left and right-side
 382      data types respectively.
 383     </para>
 384
 385     <para>
 386      Merge join is based on the idea of sorting the left and righthand tables
 387      into order and then scanning them in parallel.  So, both data types must
 388      be capable of being fully ordered, and the join operator must be one
 389      that can only succeed for pairs of values that fall at the <quote>same place</>
 390      in the sort order.  In practice this means that the join operator must
 391      behave like equality.  But unlike hashjoin, where the left and right
 392      data types had better be the same (or at least bitwise equivalent),
 393      it is possible to mergejoin two
 394      distinct data types so long as they are logically compatible.  For
 395      example, the int2-versus-int4 equality operator is mergejoinable.
 396      We only need sorting operators that will bring both data types into a
 397      logically compatible sequence.
 398     </para>
 399
 400     <para>
 401      When specifying merge sort operators, the current operator and both
 402      referenced operators must return boolean; the SORT1 operator must have
 403      both input data types equal to the current operator's left argument type,
 404      and the SORT2 operator must have
 405      both input data types equal to the current operator's right argument type.
 406      (As with COMMUTATOR and NEGATOR, this means that the operator name is
 407      sufficient to specify the operator, and the system is able to make dummy
 408      operator entries if you happen to define the equality operator before
 409      the other ones.)
 410     </para>
 411
 412     <para>
 413      In practice you should only write SORT clauses for an <literal>=</> operator,
 414      and the two referenced operators should always be named <literal>&lt;</>.  Trying
 415      to use merge join with operators named anything else will result in
 416      hopeless confusion, for reasons we'll see in a moment.
 417     </para>
 418
 419     <para>
 420      There are additional restrictions on operators that you mark
 421      mergejoinable.  These restrictions are not currently checked by
 422      CREATE OPERATOR, but a merge join may fail at runtime if any are
 423      not true:
 424
 425      <itemizedlist>
 426       <listitem>
 427        <para>
 428         The mergejoinable equality operator must have a commutator
 429         (itself if the two data types are the same, or a related equality operator
 430         if they are different).
 431        </para>
 432       </listitem>
 433
 434       <listitem>
 435        <para>
 436         There must be <literal>&lt;</> and <literal>&gt;</> ordering operators having the same left and
 437         right input data types as the mergejoinable operator itself.  These
 438         operators <emphasis>must</emphasis> be named <literal>&lt;</> and <literal>&gt;</>; you do
 439         not have any choice in the matter, since there is no provision for
 440         specifying them explicitly.  Note that if the left and right data types
 441         are different, neither of these operators is the same as either
 442         SORT operator.  But they had better order the data values compatibly
 443         with the SORT operators, or mergejoin will fail to work.
 444        </para>
 445       </listitem>
 446      </itemizedlist>
 447     </para>
 448    </sect2>
 449
 450   </sect1>
 451  </Chapter>
 452
 453 <!-- Keep this comment at the end of the file
 454 Local variables:
 455 mode:sgml
 456 sgml-omittag:nil
 457 sgml-shorttag:t
 458 sgml-minimize-attributes:nil
 459 sgml-always-quote-attributes:t
 460 sgml-indent-step:1
 461 sgml-indent-data:t
 462 sgml-parent-document:nil
 463 sgml-default-dtd-file:"./reference.ced"
 464 sgml-exposed-tags:nil
 465 sgml-local-catalogs:("/usr/lib/sgml/catalog")
 466 sgml-local-ecat-files:nil
 467 End:
 468 -->