granicus.if.org Git - postgresql/blob - doc/src/sgml/pgtrgm.sgml

   1 <!-- doc/src/sgml/pgtrgm.sgml -->
   2
   3 <sect1 id="pgtrgm" xreflabel="pg_trgm">
   4  <title>pg_trgm</title>
   5
   6  <indexterm zone="pgtrgm">
   7   <primary>pg_trgm</primary>
   8  </indexterm>
   9
  10  <para>
  11   The <filename>pg_trgm</filename> module provides functions and operators
  12   for determining the similarity of
  13   alphanumeric text based on trigram matching, as
  14   well as index operator classes that support fast searching for similar
  15   strings.
  16  </para>
  17
  18  <sect2>
  19   <title>Trigram (or Trigraph) Concepts</title>
  20
  21   <para>
  22    A trigram is a group of three consecutive characters taken
  23    from a string.  We can measure the similarity of two strings by
  24    counting the number of trigrams they share.  This simple idea
  25    turns out to be very effective for measuring the similarity of
  26    words in many natural languages.
  27   </para>
  28
  29   <note>
  30    <para>
  31     <filename>pg_trgm</filename> ignores non-word characters
  32     (non-alphanumerics) when extracting trigrams from a string.
  33     Each word is considered to have two spaces
  34     prefixed and one space suffixed when determining the set
  35     of trigrams contained in the string.
  36     For example, the set of trigrams in the string
  37     <quote><literal>cat</literal></quote> is
  38     <quote><literal>  c</literal></quote>,
  39     <quote><literal> ca</literal></quote>,
  40     <quote><literal>cat</literal></quote>, and
  41     <quote><literal>at </literal></quote>.
  42     The set of trigrams in the string
  43     <quote><literal>foo|bar</literal></quote> is
  44     <quote><literal>  f</literal></quote>,
  45     <quote><literal> fo</literal></quote>,
  46     <quote><literal>foo</literal></quote>,
  47     <quote><literal>oo </literal></quote>,
  48     <quote><literal>  b</literal></quote>,
  49     <quote><literal> ba</literal></quote>,
  50     <quote><literal>bar</literal></quote>, and
  51     <quote><literal>ar </literal></quote>.
  52    </para>
  53   </note>
  54  </sect2>
  55
  56  <sect2>
  57   <title>Functions and Operators</title>
  58
  59   <para>
  60    The functions provided by the <filename>pg_trgm</filename> module
  61    are shown in <xref linkend="pgtrgm-func-table">, the operators
  62    in <xref linkend="pgtrgm-op-table">.
  63   </para>
  64
  65   <table id="pgtrgm-func-table">
  66    <title><filename>pg_trgm</filename> Functions</title>
  67    <tgroup cols="3">
  68     <thead>
  69      <row>
  70       <entry>Function</entry>
  71       <entry>Returns</entry>
  72       <entry>Description</entry>
  73      </row>
  74     </thead>
  75
  76     <tbody>
  77      <row>
  78       <entry><function>similarity(text, text)</function><indexterm><primary>similarity</primary></indexterm></entry>
  79       <entry><type>real</type></entry>
  80       <entry>
  81        Returns a number that indicates how similar the two arguments are.
  82        The range of the result is zero (indicating that the two strings are
  83        completely dissimilar) to one (indicating that the two strings are
  84        identical).
  85       </entry>
  86      </row>
  87      <row>
  88       <entry><function>show_trgm(text)</function><indexterm><primary>show_trgm</primary></indexterm></entry>
  89       <entry><type>text[]</type></entry>
  90       <entry>
  91        Returns an array of all the trigrams in the given string.
  92        (In practice this is seldom useful except for debugging.)
  93       </entry>
  94      </row>
  95      <row>
  96       <entry>
  97        <function>word_similarity(text, text)</function>
  98        <indexterm><primary>word_similarity</primary></indexterm>
  99       </entry>
 100       <entry><type>real</type></entry>
 101       <entry>
 102        Returns a number that indicates how similar the first string
 103        to the most similar word of the second string. The function searches in
 104        the second string a most similar word not a most similar substring.  The
 105        range of the result is zero (indicating that the two strings are
 106        completely dissimilar) to one (indicating that the first string is
 107        identical to one of the words of the second string).
 108       </entry>
 109      </row>
 110      <row>
 111       <entry><function>show_limit()</function><indexterm><primary>show_limit</primary></indexterm></entry>
 112       <entry><type>real</type></entry>
 113       <entry>
 114        Returns the current similarity threshold used by the <literal>%</>
 115        operator.  This sets the minimum similarity between
 116        two words for them to be considered similar enough to
 117        be misspellings of each other, for example
 118        (<emphasis>deprecated</emphasis>).
 119       </entry>
 120      </row>
 121      <row>
 122       <entry><function>set_limit(real)</function><indexterm><primary>set_limit</primary></indexterm></entry>
 123       <entry><type>real</type></entry>
 124       <entry>
 125        Sets the current similarity threshold that is used by the <literal>%</>
 126        operator.  The threshold must be between 0 and 1 (default is 0.3).
 127        Returns the same value passed in (<emphasis>deprecated</emphasis>).
 128       </entry>
 129      </row>
 130     </tbody>
 131    </tgroup>
 132   </table>
 133
 134   <table id="pgtrgm-op-table">
 135    <title><filename>pg_trgm</filename> Operators</title>
 136    <tgroup cols="3">
 137     <thead>
 138      <row>
 139       <entry>Operator</entry>
 140       <entry>Returns</entry>
 141       <entry>Description</entry>
 142      </row>
 143     </thead>
 144
 145     <tbody>
 146      <row>
 147       <entry><type>text</> <literal>%</literal> <type>text</></entry>
 148       <entry><type>boolean</type></entry>
 149       <entry>
 150        Returns <literal>true</> if its arguments have a similarity that is
 151        greater than the current similarity threshold set by
 152        <varname>pg_trgm.similarity_threshold</>.
 153       </entry>
 154      </row>
 155      <row>
 156       <entry><type>text</> <literal>&lt;%</literal> <type>text</></entry>
 157       <entry><type>boolean</type></entry>
 158       <entry>
 159        Returns <literal>true</> if its first argument has the similar word in
 160        the second argument and they have a similarity that is greater than the
 161        current word similarity threshold set by
 162        <varname>pg_trgm.word_similarity_threshold</> parameter.
 163       </entry>
 164      </row>
 165      <row>
 166       <entry><type>text</> <literal>%&gt;</literal> <type>text</></entry>
 167       <entry><type>boolean</type></entry>
 168       <entry>
 169        Commutator of the <literal>&lt;%</> operator.
 170       </entry>
 171      </row>
 172      <row>
 173       <entry><type>text</> <literal>&lt;-&gt;</literal> <type>text</></entry>
 174       <entry><type>real</type></entry>
 175       <entry>
 176        Returns the <quote>distance</> between the arguments, that is
 177        one minus the <function>similarity()</> value.
 178       </entry>
 179      </row>
 180      <row>
 181       <entry>
 182        <type>text</> <literal>&lt;&lt;-&gt;</literal> <type>text</>
 183       </entry>
 184       <entry><type>real</type></entry>
 185       <entry>
 186        Returns the <quote>distance</> between the arguments, that is
 187        one minus the <function>word_similarity()</> value.
 188       </entry>
 189      </row>
 190      <row>
 191       <entry>
 192        <type>text</> <literal>&lt;-&gt;&gt;</literal> <type>text</>
 193       </entry>
 194       <entry><type>real</type></entry>
 195       <entry>
 196        Commutator of the <literal>&lt;&lt;-&gt;</> operator.
 197       </entry>
 198      </row>
 199     </tbody>
 200    </tgroup>
 201   </table>
 202  </sect2>
 203
 204  <sect2>
 205   <title>GUC Parameters</title>
 206
 207   <variablelist>
 208    <varlistentry id="guc-pgtrgm-similarity-threshold" xreflabel="pg_trgm.similarity_threshold">
 209     <term>
 210      <varname>pg_trgm.similarity_threshold</> (<type>real</type>)
 211      <indexterm>
 212       <primary><varname>pg_trgm.similarity_threshold</> configuration parameter</primary>
 213      </indexterm>
 214     </term>
 215     <listitem>
 216      <para>
 217       Sets the current similarity threshold that is used by the <literal>%</>
 218       operator.  The threshold must be between 0 and 1 (default is 0.3).
 219      </para>
 220     </listitem>
 221    </varlistentry>
 222     <varlistentry id="guc-pgtrgm-word-similarity-threshold" xreflabel="pg_trgm.word_similarity_threshold">
 223      <term>
 224       <varname>pg_trgm.word_similarity_threshold</> (<type>real</type>)
 225       <indexterm>
 226        <primary>
 227         <varname>pg_trgm.word_similarity_threshold</> configuration parameter
 228        </primary>
 229       </indexterm>
 230      </term>
 231      <listitem>
 232       <para>
 233        Sets the current word similarity threshold that is used by
 234        <literal>&lt;%</> and <literal>%&gt;</> operators.  The threshold
 235        must be between 0 and 1 (default is 0.6).
 236       </para>
 237      </listitem>
 238     </varlistentry>
 239   </variablelist>
 240  </sect2>
 241
 242  <sect2>
 243   <title>Index Support</title>
 244
 245   <para>
 246    The <filename>pg_trgm</filename> module provides GiST and GIN index
 247    operator classes that allow you to create an index over a text column for
 248    the purpose of very fast similarity searches.  These index types support
 249    the above-described similarity operators, and additionally support
 250    trigram-based index searches for <literal>LIKE</>, <literal>ILIKE</>,
 251    <literal>~</> and <literal>~*</> queries.  (These indexes do not
 252    support equality nor simple comparison operators, so you may need a
 253    regular B-tree index too.)
 254   </para>
 255
 256   <para>
 257    Example:
 258
 259 <programlisting>
 260 CREATE TABLE test_trgm (t text);
 261 CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops);
 262 </programlisting>
 263 or
 264 <programlisting>
 265 CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);
 266 </programlisting>
 267   </para>
 268
 269   <para>
 270    At this point, you will have an index on the <structfield>t</> column that
 271    you can use for similarity searching.  A typical query is
 272 <programlisting>
 273 SELECT t, similarity(t, '<replaceable>word</>') AS sml
 274   FROM test_trgm
 275   WHERE t % '<replaceable>word</>'
 276   ORDER BY sml DESC, t;
 277 </programlisting>
 278    This will return all values in the text column that are sufficiently
 279    similar to <replaceable>word</>, sorted from best match to worst.  The
 280    index will be used to make this a fast operation even over very large data
 281    sets.
 282   </para>
 283
 284   <para>
 285    A variant of the above query is
 286 <programlisting>
 287 SELECT t, t &lt;-&gt; '<replaceable>word</>' AS dist
 288   FROM test_trgm
 289   ORDER BY dist LIMIT 10;
 290 </programlisting>
 291    This can be implemented quite efficiently by GiST indexes, but not
 292    by GIN indexes.  It will usually beat the first formulation when only
 293    a small number of the closest matches is wanted.
 294   </para>
 295
 296   <para>
 297    Also you can use an index on the <structfield>t</> column for word
 298    similarity.  For example:
 299 <programlisting>
 300 SELECT t, word_similarity('<replaceable>word</>', t) AS sml
 301   FROM test_trgm
 302   WHERE '<replaceable>word</>' &lt;% t
 303   ORDER BY sml DESC, t;
 304 </programlisting>
 305    This will return all values in the text column that have a word
 306    which sufficiently similar to <replaceable>word</>, sorted from best
 307    match to worst.  The index will be used to make this a fast operation
 308    even over very large data sets.
 309   </para>
 310
 311   <para>
 312    A variant of the above query is
 313 <programlisting>
 314 SELECT t, '<replaceable>word</>' &lt;&lt;-&gt; t AS dist
 315   FROM test_trgm
 316   ORDER BY dist LIMIT 10;
 317 </programlisting>
 318    This can be implemented quite efficiently by GiST indexes, but not
 319    by GIN indexes.
 320   </para>
 321
 322
 323   <para>
 324    Beginning in <productname>PostgreSQL</> 9.1, these index types also support
 325    index searches for <literal>LIKE</> and <literal>ILIKE</>, for example
 326 <programlisting>
 327 SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';
 328 </programlisting>
 329    The index search works by extracting trigrams from the search string
 330    and then looking these up in the index.  The more trigrams in the search
 331    string, the more effective the index search is.  Unlike B-tree based
 332    searches, the search string need not be left-anchored.
 333   </para>
 334
 335   <para>
 336    Beginning in <productname>PostgreSQL</> 9.3, these index types also support
 337    index searches for regular-expression matches
 338    (<literal>~</> and <literal>~*</> operators), for example
 339 <programlisting>
 340 SELECT * FROM test_trgm WHERE t ~ '(foo|bar)';
 341 </programlisting>
 342    The index search works by extracting trigrams from the regular expression
 343    and then looking these up in the index.  The more trigrams that can be
 344    extracted from the regular expression, the more effective the index search
 345    is.  Unlike B-tree based searches, the search string need not be
 346    left-anchored.
 347   </para>
 348
 349   <para>
 350    For both <literal>LIKE</> and regular-expression searches, keep in mind
 351    that a pattern with no extractable trigrams will degenerate to a full-index
 352    scan.
 353   </para>
 354
 355   <para>
 356    The choice between GiST and GIN indexing depends on the relative
 357    performance characteristics of GiST and GIN, which are discussed elsewhere.
 358   </para>
 359  </sect2>
 360
 361  <sect2>
 362   <title>Text Search Integration</title>
 363
 364   <para>
 365    Trigram matching is a very useful tool when used in conjunction
 366    with a full text index.  In particular it can help to recognize
 367    misspelled input words that will not be matched directly by the
 368    full text search mechanism.
 369   </para>
 370
 371   <para>
 372    The first step is to generate an auxiliary table containing all
 373    the unique words in the documents:
 374
 375 <programlisting>
 376 CREATE TABLE words AS SELECT word FROM
 377         ts_stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');
 378 </programlisting>
 379
 380    where <structname>documents</> is a table that has a text field
 381    <structfield>bodytext</> that we wish to search.  The reason for using
 382    the <literal>simple</> configuration with the <function>to_tsvector</>
 383    function, instead of using a language-specific configuration,
 384    is that we want a list of the original (unstemmed) words.
 385   </para>
 386
 387   <para>
 388    Next, create a trigram index on the word column:
 389
 390 <programlisting>
 391 CREATE INDEX words_idx ON words USING GIN (word gin_trgm_ops);
 392 </programlisting>
 393
 394    Now, a <command>SELECT</command> query similar to the previous example can
 395    be used to suggest spellings for misspelled words in user search terms.
 396    A useful extra test is to require that the selected words are also of
 397    similar length to the misspelled word.
 398   </para>
 399
 400   <note>
 401    <para>
 402     Since the <structname>words</> table has been generated as a separate,
 403     static table, it will need to be periodically regenerated so that
 404     it remains reasonably up-to-date with the document collection.
 405     Keeping it exactly current is usually unnecessary.
 406    </para>
 407   </note>
 408  </sect2>
 409
 410  <sect2>
 411   <title>References</title>
 412
 413   <para>
 414    GiST Development Site
 415    <ulink url="http://www.sai.msu.su/~megera/postgres/gist/"></ulink>
 416   </para>
 417   <para>
 418    Tsearch2 Development Site
 419    <ulink url="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/"></ulink>
 420   </para>
 421  </sect2>
 422
 423  <sect2>
 424   <title>Authors</title>
 425
 426   <para>
 427    Oleg Bartunov <email>oleg@sai.msu.su</email>, Moscow, Moscow University, Russia
 428   </para>
 429   <para>
 430    Teodor Sigaev <email>teodor@sigaev.ru</email>, Moscow, Delta-Soft Ltd.,Russia
 431   </para>
 432   <para>
 433    Alexander Korotkov <email>a.korotkov@postgrespro.ru</email>, Moscow, Postgres Professional, Russia
 434   </para>
 435   <para>
 436    Documentation: Christopher Kings-Lynne
 437   </para>
 438   <para>
 439    This module is sponsored by Delta-Soft Ltd., Moscow, Russia.
 440   </para>
 441  </sect2>
 442
 443 </sect1>