granicus.if.org Git - postgresql/blob - doc/src/sgml/textsearch.sgml

   1 <!-- doc/src/sgml/textsearch.sgml -->
   2
   3 <chapter id="textsearch">
   4  <title>Full Text Search</title>
   5
   6   <indexterm zone="textsearch">
   7    <primary>full text search</primary>
   8   </indexterm>
   9
  10   <indexterm zone="textsearch">
  11    <primary>text search</primary>
  12   </indexterm>
  13
  14  <sect1 id="textsearch-intro">
  15   <title>Introduction</title>
  16
  17   <para>
  18    Full Text Searching (or just <firstterm>text search</firstterm>) provides
  19    the capability to identify natural-language <firstterm>documents</> that
  20    satisfy a <firstterm>query</firstterm>, and optionally to sort them by
  21    relevance to the query.  The most common type of search
  22    is to find all documents containing given <firstterm>query terms</firstterm>
  23    and return them in order of their <firstterm>similarity</firstterm> to the
  24    query.  Notions of <varname>query</varname> and
  25    <varname>similarity</varname> are very flexible and depend on the specific
  26    application. The simplest search considers <varname>query</varname> as a
  27    set of words and <varname>similarity</varname> as the frequency of query
  28    words in the document.
  29   </para>
  30
  31   <para>
  32    Textual search operators have existed in databases for years.
  33    <productname>PostgreSQL</productname> has
  34    <literal>~</literal>, <literal>~*</literal>, <literal>LIKE</literal>, and
  35    <literal>ILIKE</literal> operators for textual data types, but they lack
  36    many essential properties required by modern information systems:
  37   </para>
  38
  39   <itemizedlist  spacing="compact" mark="bullet">
  40    <listitem>
  41     <para>
  42      There is no linguistic support, even for English.  Regular expressions
  43      are not sufficient because they cannot easily handle derived words, e.g.,
  44      <literal>satisfies</literal> and <literal>satisfy</literal>. You might
  45      miss documents that contain <literal>satisfies</literal>, although you
  46      probably would like to find them when searching for
  47      <literal>satisfy</literal>. It is possible to use <literal>OR</literal>
  48      to search for multiple derived forms, but this is tedious and error-prone
  49      (some words can have several thousand derivatives).
  50     </para>
  51    </listitem>
  52
  53    <listitem>
  54     <para>
  55      They provide no ordering (ranking) of search results, which makes them
  56      ineffective when thousands of matching documents are found.
  57     </para>
  58    </listitem>
  59
  60    <listitem>
  61     <para>
  62      They tend to be slow because there is no index support, so they must
  63      process all documents for every search.
  64     </para>
  65    </listitem>
  66   </itemizedlist>
  67
  68   <para>
  69    Full text indexing allows documents to be <emphasis>preprocessed</emphasis>
  70    and an index saved for later rapid searching. Preprocessing includes:
  71   </para>
  72
  73   <itemizedlist  mark="none">
  74    <listitem>
  75     <para>
  76      <emphasis>Parsing documents into <firstterm>tokens</></emphasis>. It is
  77      useful to identify various classes of tokens, e.g., numbers, words,
  78      complex words, email addresses, so that they can be processed
  79      differently.  In principle token classes depend on the specific
  80      application, but for most purposes it is adequate to use a predefined
  81      set of classes.
  82      <productname>PostgreSQL</productname> uses a <firstterm>parser</> to
  83      perform this step.  A standard parser is provided, and custom parsers
  84      can be created for specific needs.
  85     </para>
  86    </listitem>
  87
  88    <listitem>
  89     <para>
  90      <emphasis>Converting tokens into <firstterm>lexemes</></emphasis>.
  91      A lexeme is a string, just like a token, but it has been
  92      <firstterm>normalized</> so that different forms of the same word
  93      are made alike.  For example, normalization almost always includes
  94      folding upper-case letters to lower-case, and often involves removal
  95      of suffixes (such as <literal>s</> or <literal>es</> in English).
  96      This allows searches to find variant forms of the
  97      same word, without tediously entering all the possible variants.
  98      Also, this step typically eliminates <firstterm>stop words</>, which
  99      are words that are so common that they are useless for searching.
 100      (In short, then, tokens are raw fragments of the document text, while
 101      lexemes are words that are believed useful for indexing and searching.)
 102      <productname>PostgreSQL</productname> uses <firstterm>dictionaries</> to
 103      perform this step.  Various standard dictionaries are provided, and
 104      custom ones can be created for specific needs.
 105     </para>
 106    </listitem>
 107
 108    <listitem>
 109     <para>
 110      <emphasis>Storing preprocessed documents optimized for
 111      searching</emphasis>.  For example, each document can be represented
 112      as a sorted array of normalized lexemes. Along with the lexemes it is
 113      often desirable to store positional information to use for
 114      <firstterm>proximity ranking</firstterm>, so that a document that
 115      contains a more <quote>dense</> region of query words is
 116      assigned a higher rank than one with scattered query words.
 117     </para>
 118    </listitem>
 119   </itemizedlist>
 120
 121   <para>
 122    Dictionaries allow fine-grained control over how tokens are normalized.
 123    With appropriate dictionaries, you can:
 124   </para>
 125
 126   <itemizedlist  spacing="compact" mark="bullet">
 127    <listitem>
 128     <para>
 129      Define stop words that should not be indexed.
 130     </para>
 131    </listitem>
 132
 133    <listitem>
 134     <para>
 135      Map synonyms to a single word using <application>Ispell</>.
 136     </para>
 137    </listitem>
 138
 139    <listitem>
 140     <para>
 141      Map phrases to a single word using a thesaurus.
 142     </para>
 143    </listitem>
 144
 145    <listitem>
 146     <para>
 147      Map different variations of a word to a canonical form using
 148      an <application>Ispell</> dictionary.
 149     </para>
 150    </listitem>
 151
 152    <listitem>
 153     <para>
 154      Map different variations of a word to a canonical form using
 155      <application>Snowball</> stemmer rules.
 156     </para>
 157    </listitem>
 158   </itemizedlist>
 159
 160   <para>
 161    A data type <type>tsvector</type> is provided for storing preprocessed
 162    documents, along with a type <type>tsquery</type> for representing processed
 163    queries (<xref linkend="datatype-textsearch">).  There are many
 164    functions and operators available for these data types
 165    (<xref linkend="functions-textsearch">), the most important of which is
 166    the match operator <literal>@@</literal>, which we introduce in
 167    <xref linkend="textsearch-matching">.  Full text searches can be accelerated
 168    using indexes (<xref linkend="textsearch-indexes">).
 169   </para>
 170
 171
 172   <sect2 id="textsearch-document">
 173    <title>What Is a Document?</title>
 174
 175    <indexterm zone="textsearch-document">
 176     <primary>document</primary>
 177     <secondary>text search</secondary>
 178    </indexterm>
 179
 180    <para>
 181     A <firstterm>document</> is the unit of searching in a full text search
 182     system; for example, a magazine article or email message.  The text search
 183     engine must be able to parse documents and store associations of lexemes
 184     (key words) with their parent document. Later, these associations are
 185     used to search for documents that contain query words.
 186    </para>
 187
 188    <para>
 189     For searches within <productname>PostgreSQL</productname>,
 190     a document is normally a textual field within a row of a database table,
 191     or possibly a combination (concatenation) of such fields, perhaps stored
 192     in several tables or obtained dynamically. In other words, a document can
 193     be constructed from different parts for indexing and it might not be
 194     stored anywhere as a whole. For example:
 195
 196 <programlisting>
 197 SELECT title || ' ' ||  author || ' ' ||  abstract || ' ' || body AS document
 198 FROM messages
 199 WHERE mid = 12;
 200
 201 SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
 202 FROM messages m, docs d
 203 WHERE mid = did AND mid = 12;
 204 </programlisting>
 205    </para>
 206
 207    <note>
 208     <para>
 209      Actually, in these example queries, <function>coalesce</function>
 210      should be used to prevent a single <literal>NULL</literal> attribute from
 211      causing a <literal>NULL</literal> result for the whole document.
 212     </para>
 213    </note>
 214
 215    <para>
 216     Another possibility is to store the documents as simple text files in the
 217     file system. In this case, the database can be used to store the full text
 218     index and to execute searches, and some unique identifier can be used to
 219     retrieve the document from the file system.  However, retrieving files
 220     from outside the database requires superuser permissions or special
 221     function support, so this is usually less convenient than keeping all
 222     the data inside <productname>PostgreSQL</productname>.  Also, keeping
 223     everything inside the database allows easy access
 224     to document metadata to assist in indexing and display.
 225    </para>
 226
 227    <para>
 228     For text search purposes, each document must be reduced to the
 229     preprocessed <type>tsvector</> format.  Searching and ranking
 230     are performed entirely on the <type>tsvector</> representation
 231     of a document &mdash; the original text need only be retrieved
 232     when the document has been selected for display to a user.
 233     We therefore often speak of the <type>tsvector</> as being the
 234     document, but of course it is only a compact representation of
 235     the full document.
 236    </para>
 237   </sect2>
 238
 239   <sect2 id="textsearch-matching">
 240    <title>Basic Text Matching</title>
 241
 242    <para>
 243     Full text searching in <productname>PostgreSQL</productname> is based on
 244     the match operator <literal>@@</literal>, which returns
 245     <literal>true</literal> if a <type>tsvector</type>
 246     (document) matches a <type>tsquery</type> (query).
 247     It doesn't matter which data type is written first:
 248
 249 <programlisting>
 250 SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat &amp; rat'::tsquery;
 251  ?column?
 252 ----------
 253  t
 254
 255 SELECT 'fat &amp; cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 256  ?column?
 257 ----------
 258  f
 259 </programlisting>
 260    </para>
 261
 262    <para>
 263     As the above example suggests, a <type>tsquery</type> is not just raw
 264     text, any more than a <type>tsvector</type> is.  A <type>tsquery</type>
 265     contains search terms, which must be already-normalized lexemes, and
 266     may combine multiple terms using AND, OR, and NOT operators.
 267     (For details see <xref linkend="datatype-textsearch">.)  There are
 268     functions <function>to_tsquery</> and <function>plainto_tsquery</>
 269     that are helpful in converting user-written text into a proper
 270     <type>tsquery</type>, for example by normalizing words appearing in
 271     the text.  Similarly, <function>to_tsvector</> is used to parse and
 272     normalize a document string.  So in practice a text search match would
 273     look more like this:
 274
 275 <programlisting>
 276 SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat &amp; rat');
 277  ?column?
 278 ----------
 279  t
 280 </programlisting>
 281
 282     Observe that this match would not succeed if written as
 283
 284 <programlisting>
 285 SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat &amp; rat');
 286  ?column?
 287 ----------
 288  f
 289 </programlisting>
 290
 291     since here no normalization of the word <literal>rats</> will occur.
 292     The elements of a <type>tsvector</> are lexemes, which are assumed
 293     already normalized, so <literal>rats</> does not match <literal>rat</>.
 294    </para>
 295
 296    <para>
 297     The <literal>@@</literal> operator also
 298     supports <type>text</type> input, allowing explicit conversion of a text
 299     string to <type>tsvector</type> or <type>tsquery</> to be skipped
 300     in simple cases.  The variants available are:
 301
 302 <programlisting>
 303 tsvector @@ tsquery
 304 tsquery  @@ tsvector
 305 text @@ tsquery
 306 text @@ text
 307 </programlisting>
 308    </para>
 309
 310    <para>
 311     The first two of these we saw already.
 312     The form <type>text</type> <literal>@@</literal> <type>tsquery</type>
 313     is equivalent to <literal>to_tsvector(x) @@ y</literal>.
 314     The form <type>text</type> <literal>@@</literal> <type>text</type>
 315     is equivalent to <literal>to_tsvector(x) @@ plainto_tsquery(y)</literal>.
 316    </para>
 317   </sect2>
 318
 319   <sect2 id="textsearch-intro-configurations">
 320    <title>Configurations</title>
 321
 322    <para>
 323     The above are all simple text search examples.  As mentioned before, full
 324     text search functionality includes the ability to do many more things:
 325     skip indexing certain words (stop words), process synonyms, and use
 326     sophisticated parsing, e.g., parse based on more than just white space.
 327     This functionality is controlled by <firstterm>text search
 328     configurations</>.  <productname>PostgreSQL</> comes with predefined
 329     configurations for many languages, and you can easily create your own
 330     configurations.  (<application>psql</>'s <command>\dF</> command
 331     shows all available configurations.)
 332    </para>
 333
 334    <para>
 335     During installation an appropriate configuration is selected and
 336     <xref linkend="guc-default-text-search-config"> is set accordingly
 337     in <filename>postgresql.conf</>.  If you are using the same text search
 338     configuration for the entire cluster you can use the value in
 339     <filename>postgresql.conf</>.  To use different configurations
 340     throughout the cluster but the same configuration within any one database,
 341     use <command>ALTER DATABASE ... SET</>.  Otherwise, you can set
 342     <varname>default_text_search_config</varname> in each session.
 343    </para>
 344
 345    <para>
 346     Each text search function that depends on a configuration has an optional
 347     <type>regconfig</> argument, so that the configuration to use can be
 348     specified explicitly.  <varname>default_text_search_config</varname>
 349     is used only when this argument is omitted.
 350    </para>
 351
 352    <para>
 353     To make it easier to build custom text search configurations, a
 354     configuration is built up from simpler database objects.
 355     <productname>PostgreSQL</>'s text search facility provides
 356     four types of configuration-related database objects:
 357    </para>
 358
 359   <itemizedlist  spacing="compact" mark="bullet">
 360    <listitem>
 361     <para>
 362      <firstterm>Text search parsers</> break documents into tokens
 363      and classify each token (for example, as words or numbers).
 364     </para>
 365    </listitem>
 366
 367    <listitem>
 368     <para>
 369      <firstterm>Text search dictionaries</> convert tokens to normalized
 370      form and reject stop words.
 371     </para>
 372    </listitem>
 373
 374    <listitem>
 375     <para>
 376      <firstterm>Text search templates</> provide the functions underlying
 377      dictionaries.  (A dictionary simply specifies a template and a set
 378      of parameters for the template.)
 379     </para>
 380    </listitem>
 381
 382    <listitem>
 383     <para>
 384      <firstterm>Text search configurations</> select a parser and a set
 385      of dictionaries to use to normalize the tokens produced by the parser.
 386     </para>
 387    </listitem>
 388   </itemizedlist>
 389
 390    <para>
 391     Text search parsers and templates are built from low-level C functions;
 392     therefore it requires C programming ability to develop new ones, and
 393     superuser privileges to install one into a database.  (There are examples
 394     of add-on parsers and templates in the <filename>contrib/</> area of the
 395     <productname>PostgreSQL</> distribution.)  Since dictionaries and
 396     configurations just parameterize and connect together some underlying
 397     parsers and templates, no special privilege is needed to create a new
 398     dictionary or configuration.  Examples of creating custom dictionaries and
 399     configurations appear later in this chapter.
 400    </para>
 401
 402   </sect2>
 403
 404  </sect1>
 405
 406  <sect1 id="textsearch-tables">
 407   <title>Tables and Indexes</title>
 408
 409   <para>
 410    The examples in the previous section illustrated full text matching using
 411    simple constant strings.  This section shows how to search table data,
 412    optionally using indexes.
 413   </para>
 414
 415   <sect2 id="textsearch-tables-search">
 416    <title>Searching a Table</title>
 417
 418    <para>
 419     It is possible to do a full text search without an index.  A simple query
 420     to print the <structname>title</> of each row that contains the word
 421     <literal>friend</> in its <structfield>body</> field is:
 422
 423 <programlisting>
 424 SELECT title
 425 FROM pgweb
 426 WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');
 427 </programlisting>
 428
 429     This will also find related words such as <literal>friends</>
 430     and <literal>friendly</>, since all these are reduced to the same
 431     normalized lexeme.
 432    </para>
 433
 434    <para>
 435     The query above specifies that the <literal>english</> configuration
 436     is to be used to parse and normalize the strings.  Alternatively we
 437     could omit the configuration parameters:
 438
 439 <programlisting>
 440 SELECT title
 441 FROM pgweb
 442 WHERE to_tsvector(body) @@ to_tsquery('friend');
 443 </programlisting>
 444
 445     This query will use the configuration set by <xref
 446     linkend="guc-default-text-search-config">.
 447    </para>
 448
 449    <para>
 450     A more complex example is to
 451     select the ten most recent documents that contain <literal>create</> and
 452     <literal>table</> in the <structname>title</> or <structname>body</>:
 453
 454 <programlisting>
 455 SELECT title
 456 FROM pgweb
 457 WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create &amp; table')
 458 ORDER BY last_mod_date DESC
 459 LIMIT 10;
 460 </programlisting>
 461
 462     For clarity we omitted the <function>coalesce</function> function calls
 463     which would be needed to find rows that contain <literal>NULL</literal>
 464     in one of the two fields.
 465    </para>
 466
 467    <para>
 468     Although these queries will work without an index, most applications
 469     will find this approach too slow, except perhaps for occasional ad-hoc
 470     searches.  Practical use of text searching usually requires creating
 471     an index.
 472    </para>
 473
 474   </sect2>
 475
 476   <sect2 id="textsearch-tables-index">
 477    <title>Creating Indexes</title>
 478
 479    <para>
 480     We can create a <acronym>GIN</acronym> index (<xref
 481     linkend="textsearch-indexes">) to speed up text searches:
 482
 483 <programlisting>
 484 CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', body));
 485 </programlisting>
 486
 487     Notice that the 2-argument version of <function>to_tsvector</function> is
 488     used.  Only text search functions that specify a configuration name can
 489     be used in expression indexes (<xref linkend="indexes-expressional">).
 490     This is because the index contents must be unaffected by <xref
 491     linkend="guc-default-text-search-config">.  If they were affected, the
 492     index contents might be inconsistent because different entries could
 493     contain <type>tsvector</>s that were created with different text search
 494     configurations, and there would be no way to guess which was which.  It
 495     would be impossible to dump and restore such an index correctly.
 496    </para>
 497
 498    <para>
 499     Because the two-argument version of <function>to_tsvector</function> was
 500     used in the index above, only a query reference that uses the 2-argument
 501     version of <function>to_tsvector</function> with the same configuration
 502     name will use that index.  That is, <literal>WHERE
 503     to_tsvector('english', body) @@ 'a &amp; b'</> can use the index,
 504     but <literal>WHERE to_tsvector(body) @@ 'a &amp; b'</> cannot.
 505     This ensures that an index will be used only with the same configuration
 506     used to create the index entries.
 507    </para>
 508
 509   <para>
 510     It is possible to set up more complex expression indexes wherein the
 511     configuration name is specified by another column, e.g.:
 512
 513 <programlisting>
 514 CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_name, body));
 515 </programlisting>
 516
 517     where <literal>config_name</> is a column in the <literal>pgweb</>
 518     table.  This allows mixed configurations in the same index while
 519     recording which configuration was used for each index entry.  This
 520     would be useful, for example, if the document collection contained
 521     documents in different languages.  Again,
 522     queries that are meant to use the index must be phrased to match, e.g.,
 523     <literal>WHERE to_tsvector(config_name, body) @@ 'a &amp; b'</>.
 524    </para>
 525
 526    <para>
 527     Indexes can even concatenate columns:
 528
 529 <programlisting>
 530 CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', title || ' ' || body));
 531 </programlisting>
 532    </para>
 533
 534    <para>
 535     Another approach is to create a separate <type>tsvector</> column
 536     to hold the output of <function>to_tsvector</>.  This example is a
 537     concatenation of <literal>title</literal> and <literal>body</literal>,
 538     using <function>coalesce</> to ensure that one field will still be
 539     indexed when the other is <literal>NULL</>:
 540
 541 <programlisting>
 542 ALTER TABLE pgweb ADD COLUMN textsearchable_index_col tsvector;
 543 UPDATE pgweb SET textsearchable_index_col =
 544      to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));
 545 </programlisting>
 546
 547     Then we create a <acronym>GIN</acronym> index to speed up the search:
 548
 549 <programlisting>
 550 CREATE INDEX textsearch_idx ON pgweb USING gin(textsearchable_index_col);
 551 </programlisting>
 552
 553     Now we are ready to perform a fast full text search:
 554
 555 <programlisting>
 556 SELECT title
 557 FROM pgweb
 558 WHERE textsearchable_index_col @@ to_tsquery('create &amp; table')
 559 ORDER BY last_mod_date DESC
 560 LIMIT 10;
 561 </programlisting>
 562    </para>
 563
 564    <para>
 565     When using a separate column to store the <type>tsvector</>
 566     representation,
 567     it is necessary to create a trigger to keep the <type>tsvector</>
 568     column current anytime <literal>title</> or <literal>body</> changes.
 569     <xref linkend="textsearch-update-triggers"> explains how to do that.
 570    </para>
 571
 572    <para>
 573     One advantage of the separate-column approach over an expression index
 574     is that it is not necessary to explicitly specify the text search
 575     configuration in queries in order to make use of the index.  As shown
 576     in the example above, the query can depend on
 577     <varname>default_text_search_config</>.  Another advantage is that
 578     searches will be faster, since it will not be necessary to redo the
 579     <function>to_tsvector</> calls to verify index matches.  (This is more
 580     important when using a GiST index than a GIN index; see <xref
 581     linkend="textsearch-indexes">.)  The expression-index approach is
 582     simpler to set up, however, and it requires less disk space since the
 583     <type>tsvector</> representation is not stored explicitly.
 584    </para>
 585
 586   </sect2>
 587
 588  </sect1>
 589
 590  <sect1 id="textsearch-controls">
 591   <title>Controlling Text Search</title>
 592
 593   <para>
 594    To implement full text searching there must be a function to create a
 595    <type>tsvector</type> from a document and a <type>tsquery</type> from a
 596    user query. Also, we need to return results in a useful order, so we need
 597    a function that compares documents with respect to their relevance to
 598    the query. It's also important to be able to display the results nicely.
 599    <productname>PostgreSQL</productname> provides support for all of these
 600    functions.
 601   </para>
 602
 603   <sect2 id="textsearch-parsing-documents">
 604    <title>Parsing Documents</title>
 605
 606    <para>
 607     <productname>PostgreSQL</productname> provides the
 608     function <function>to_tsvector</function> for converting a document to
 609     the <type>tsvector</type> data type.
 610    </para>
 611
 612    <indexterm>
 613     <primary>to_tsvector</primary>
 614    </indexterm>
 615
 616 <synopsis>
 617 to_tsvector(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">document</replaceable> <type>text</>) returns <type>tsvector</>
 618 </synopsis>
 619
 620    <para>
 621     <function>to_tsvector</function> parses a textual document into tokens,
 622     reduces the tokens to lexemes, and returns a <type>tsvector</type> which
 623     lists the lexemes together with their positions in the document.
 624     The document is processed according to the specified or default
 625     text search configuration.
 626     Here is a simple example:
 627
 628 <screen>
 629 SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
 630                   to_tsvector
 631 -----------------------------------------------------
 632  'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
 633 </screen>
 634    </para>
 635
 636    <para>
 637     In the example above we see that the resulting <type>tsvector</type> does not
 638     contain the words <literal>a</literal>, <literal>on</literal>, or
 639     <literal>it</literal>, the word <literal>rats</literal> became
 640     <literal>rat</literal>, and the punctuation sign <literal>-</literal> was
 641     ignored.
 642    </para>
 643
 644    <para>
 645     The <function>to_tsvector</function> function internally calls a parser
 646     which breaks the document text into tokens and assigns a type to
 647     each token.  For each token, a list of
 648     dictionaries (<xref linkend="textsearch-dictionaries">) is consulted,
 649     where the list can vary depending on the token type.  The first dictionary
 650     that <firstterm>recognizes</> the token emits one or more normalized
 651     <firstterm>lexemes</firstterm> to represent the token.  For example,
 652     <literal>rats</literal> became <literal>rat</literal> because one of the
 653     dictionaries recognized that the word <literal>rats</literal> is a plural
 654     form of <literal>rat</literal>.  Some words are recognized as
 655     <firstterm>stop words</> (<xref linkend="textsearch-stopwords">), which
 656     causes them to be ignored since they occur too frequently to be useful in
 657     searching.  In our example these are
 658     <literal>a</literal>, <literal>on</literal>, and <literal>it</literal>.
 659     If no dictionary in the list recognizes the token then it is also ignored.
 660     In this example that happened to the punctuation sign <literal>-</literal>
 661     because there are in fact no dictionaries assigned for its token type
 662     (<literal>Space symbols</literal>), meaning space tokens will never be
 663     indexed. The choices of parser, dictionaries and which types of tokens to
 664     index are determined by the selected text search configuration (<xref
 665     linkend="textsearch-configuration">).  It is possible to have
 666     many different configurations in the same database, and predefined
 667     configurations are available for various languages. In our example
 668     we used the default configuration <literal>english</literal> for the
 669     English language.
 670    </para>
 671
 672    <para>
 673     The function <function>setweight</function> can be used to label the
 674     entries of a <type>tsvector</type> with a given <firstterm>weight</>,
 675     where a weight is one of the letters <literal>A</>, <literal>B</>,
 676     <literal>C</>, or <literal>D</>.
 677     This is typically used to mark entries coming from
 678     different parts of a document, such as title versus body.  Later, this
 679     information can be used for ranking of search results.
 680    </para>
 681
 682    <para>
 683     Because <function>to_tsvector</function>(<literal>NULL</literal>) will
 684     return <literal>NULL</literal>, it is recommended to use
 685     <function>coalesce</function> whenever a field might be null.
 686     Here is the recommended method for creating
 687     a <type>tsvector</type> from a structured document:
 688
 689 <programlisting>
 690 UPDATE tt SET ti =
 691     setweight(to_tsvector(coalesce(title,'')), 'A')    ||
 692     setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
 693     setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
 694     setweight(to_tsvector(coalesce(body,'')), 'D');
 695 </programlisting>
 696
 697     Here we have used <function>setweight</function> to label the source
 698     of each lexeme in the finished <type>tsvector</type>, and then merged
 699     the labeled <type>tsvector</type> values using the <type>tsvector</>
 700     concatenation operator <literal>||</>.  (<xref
 701     linkend="textsearch-manipulate-tsvector"> gives details about these
 702     operations.)
 703    </para>
 704
 705   </sect2>
 706
 707   <sect2 id="textsearch-parsing-queries">
 708    <title>Parsing Queries</title>
 709
 710    <para>
 711     <productname>PostgreSQL</productname> provides the
 712     functions <function>to_tsquery</function> and
 713     <function>plainto_tsquery</function> for converting a query to
 714     the <type>tsquery</type> data type.  <function>to_tsquery</function>
 715     offers access to more features than <function>plainto_tsquery</function>,
 716     but is less forgiving about its input.
 717    </para>
 718
 719    <indexterm>
 720     <primary>to_tsquery</primary>
 721    </indexterm>
 722
 723 <synopsis>
 724 to_tsquery(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">querytext</replaceable> <type>text</>) returns <type>tsquery</>
 725 </synopsis>
 726
 727    <para>
 728     <function>to_tsquery</function> creates a <type>tsquery</> value from
 729     <replaceable>querytext</replaceable>, which must consist of single tokens
 730     separated by the Boolean operators <literal>&amp;</literal> (AND),
 731     <literal>|</literal> (OR) and <literal>!</literal> (NOT).  These operators
 732     can be grouped using parentheses.  In other words, the input to
 733     <function>to_tsquery</function> must already follow the general rules for
 734     <type>tsquery</> input, as described in <xref
 735     linkend="datatype-textsearch">.  The difference is that while basic
 736     <type>tsquery</> input takes the tokens at face value,
 737     <function>to_tsquery</function> normalizes each token to a lexeme using
 738     the specified or default configuration, and discards any tokens that are
 739     stop words according to the configuration.  For example:
 740
 741 <screen>
 742 SELECT to_tsquery('english', 'The &amp; Fat &amp; Rats');
 743   to_tsquery
 744 ---------------
 745  'fat' &amp; 'rat'
 746 </screen>
 747
 748     As in basic <type>tsquery</> input, weight(s) can be attached to each
 749     lexeme to restrict it to match only <type>tsvector</> lexemes of those
 750     weight(s).  For example:
 751
 752 <screen>
 753 SELECT to_tsquery('english', 'Fat | Rats:AB');
 754     to_tsquery
 755 ------------------
 756  'fat' | 'rat':AB
 757 </screen>
 758
 759     Also, <literal>*</> can be attached to a lexeme to specify prefix matching:
 760
 761 <screen>
 762 SELECT to_tsquery('supern:*A &amp; star:A*B');
 763         to_tsquery
 764 --------------------------
 765  'supern':*A &amp; 'star':*AB
 766 </screen>
 767
 768     Such a lexeme will match any word in a <type>tsvector</> that begins
 769     with the given string.
 770    </para>
 771
 772    <para>
 773     <function>to_tsquery</function> can also accept single-quoted
 774     phrases.  This is primarily useful when the configuration includes a
 775     thesaurus dictionary that may trigger on such phrases.
 776     In the example below, a thesaurus contains the rule <literal>supernovae
 777     stars : sn</literal>:
 778
 779 <screen>
 780 SELECT to_tsquery('''supernovae stars'' &amp; !crab');
 781   to_tsquery
 782 ---------------
 783  'sn' &amp; !'crab'
 784 </screen>
 785
 786     Without quotes, <function>to_tsquery</function> will generate a syntax
 787     error for tokens that are not separated by an AND or OR operator.
 788    </para>
 789
 790    <indexterm>
 791     <primary>plainto_tsquery</primary>
 792    </indexterm>
 793
 794 <synopsis>
 795 plainto_tsquery(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">querytext</replaceable> <type>text</>) returns <type>tsquery</>
 796 </synopsis>
 797
 798    <para>
 799     <function>plainto_tsquery</> transforms unformatted text
 800     <replaceable>querytext</replaceable> to <type>tsquery</type>.
 801     The text is parsed and normalized much as for <function>to_tsvector</>,
 802     then the <literal>&amp;</literal> (AND) Boolean operator is inserted
 803     between surviving words.
 804    </para>
 805
 806    <para>
 807     Example:
 808
 809 <screen>
 810 SELECT plainto_tsquery('english', 'The Fat Rats');
 811  plainto_tsquery
 812 -----------------
 813  'fat' &amp; 'rat'
 814 </screen>
 815
 816     Note that <function>plainto_tsquery</> cannot
 817     recognize Boolean operators, weight labels, or prefix-match labels
 818     in its input:
 819
 820 <screen>
 821 SELECT plainto_tsquery('english', 'The Fat &amp; Rats:C');
 822    plainto_tsquery
 823 ---------------------
 824  'fat' &amp; 'rat' &amp; 'c'
 825 </screen>
 826
 827     Here, all the input punctuation was discarded as being space symbols.
 828    </para>
 829
 830   </sect2>
 831
 832   <sect2 id="textsearch-ranking">
 833    <title>Ranking Search Results</title>
 834
 835    <para>
 836     Ranking attempts to measure how relevant documents are to a particular
 837     query, so that when there are many matches the most relevant ones can be
 838     shown first.  <productname>PostgreSQL</productname> provides two
 839     predefined ranking functions, which take into account lexical, proximity,
 840     and structural information; that is, they consider how often the query
 841     terms appear in the document, how close together the terms are in the
 842     document, and how important is the part of the document where they occur.
 843     However, the concept of relevancy is vague and very application-specific.
 844     Different applications might require additional information for ranking,
 845     e.g., document modification time.  The built-in ranking functions are only
 846     examples.  You can write your own ranking functions and/or combine their
 847     results with additional factors to fit your specific needs.
 848    </para>
 849
 850    <para>
 851     The two ranking functions currently available are:
 852
 853     <variablelist>
 854
 855      <varlistentry>
 856
 857       <indexterm>
 858        <primary>ts_rank</primary>
 859       </indexterm>
 860
 861       <term>
 862 <synopsis>
 863 ts_rank(<optional> <replaceable class="PARAMETER">weights</replaceable> <type>float4[]</>, </optional> <replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>,
 864         <replaceable class="PARAMETER">query</replaceable> <type>tsquery</> <optional>, <replaceable class="PARAMETER">normalization</replaceable> <type>integer</> </optional>) returns <type>float4</>
 865 </synopsis>
 866       </term>
 867
 868       <listitem>
 869        <para>
 870         Ranks vectors based on the frequency of their matching lexems.
 871        </para>
 872       </listitem>
 873      </varlistentry>
 874
 875      <varlistentry>
 876
 877       <indexterm>
 878        <primary>ts_rank_cd</primary>
 879       </indexterm>
 880
 881       <term>
 882 <synopsis>
 883 ts_rank_cd(<optional> <replaceable class="PARAMETER">weights</replaceable> <type>float4[]</>, </optional> <replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>,
 884            <replaceable class="PARAMETER">query</replaceable> <type>tsquery</> <optional>, <replaceable class="PARAMETER">normalization</replaceable> <type>integer</> </optional>) returns <type>float4</>
 885 </synopsis>
 886       </term>
 887
 888       <listitem>
 889        <para>
 890         This function computes the <firstterm>cover density</firstterm>
 891         ranking for the given document vector and query, as described in
 892         Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three
 893         Term Queries" in the journal "Information Processing and Management",
 894         1999.
 895        </para>
 896
 897        <para>
 898         This function requires positional information in its input.
 899         Therefore it will not work on <quote>stripped</> <type>tsvector</>
 900         values &mdash; it will always return zero.
 901        </para>
 902       </listitem>
 903      </varlistentry>
 904
 905     </variablelist>
 906
 907    </para>
 908
 909    <para>
 910     For both these functions,
 911     the optional <replaceable class="PARAMETER">weights</replaceable>
 912     argument offers the ability to weigh word instances more or less
 913     heavily depending on how they are labeled.  The weight arrays specify
 914     how heavily to weigh each category of word, in the order:
 915
 916 <synopsis>
 917 {D-weight, C-weight, B-weight, A-weight}
 918 </synopsis>
 919
 920     If no <replaceable class="PARAMETER">weights</replaceable> are provided,
 921     then these defaults are used:
 922
 923 <programlisting>
 924 {0.1, 0.2, 0.4, 1.0}
 925 </programlisting>
 926
 927     Typically weights are used to mark words from special areas of the
 928     document, like the title or an initial abstract, so they can be
 929     treated with more or less importance than words in the document body.
 930    </para>
 931
 932    <para>
 933     Since a longer document has a greater chance of containing a query term
 934     it is reasonable to take into account document size, e.g., a hundred-word
 935     document with five instances of a search word is probably more relevant
 936     than a thousand-word document with five instances.  Both ranking functions
 937     take an integer <replaceable>normalization</replaceable> option that
 938     specifies whether and how a document's length should impact its rank.
 939     The integer option controls several behaviors, so it is a bit mask:
 940     you can specify one or more behaviors using
 941     <literal>|</literal> (for example, <literal>2|4</literal>).
 942
 943     <itemizedlist  spacing="compact" mark="bullet">
 944      <listitem>
 945       <para>
 946        0 (the default) ignores the document length
 947       </para>
 948      </listitem>
 949      <listitem>
 950       <para>
 951        1 divides the rank by 1 + the logarithm of the document length
 952       </para>
 953      </listitem>
 954      <listitem>
 955       <para>
 956        2 divides the rank by the document length
 957       </para>
 958      </listitem>
 959      <listitem>
 960       <para>
 961        4 divides the rank by the mean harmonic distance between extents
 962        (this is implemented only by <function>ts_rank_cd</>)
 963       </para>
 964      </listitem>
 965      <listitem>
 966       <para>
 967        8 divides the rank by the number of unique words in document
 968       </para>
 969      </listitem>
 970      <listitem>
 971       <para>
 972        16 divides the rank by 1 + the logarithm of the number
 973        of unique words in document
 974       </para>
 975      </listitem>
 976      <listitem>
 977       <para>
 978        32 divides the rank by itself + 1
 979       </para>
 980      </listitem>
 981     </itemizedlist>
 982
 983     If more than one flag bit is specified, the transformations are
 984     applied in the order listed.
 985    </para>
 986
 987    <para>
 988     It is important to note that the ranking functions do not use any global
 989     information, so it is impossible to produce a fair normalization to 1% or
 990     100% as sometimes desired.  Normalization option 32
 991     (<literal>rank/(rank+1)</literal>) can be applied to scale all ranks
 992     into the range zero to one, but of course this is just a cosmetic change;
 993     it will not affect the ordering of the search results.
 994    </para>
 995
 996    <para>
 997     Here is an example that selects only the ten highest-ranked matches:
 998
 999 <screen>
1000 SELECT title, ts_rank_cd(textsearch, query) AS rank
1001 FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
1002 WHERE query @@ textsearch
1003 ORDER BY rank DESC
1004 LIMIT 10;
1005                      title                     |   rank
1006 -----------------------------------------------+----------
1007  Neutrinos in the Sun                          |      3.1
1008  The Sudbury Neutrino Detector                 |      2.4
1009  A MACHO View of Galactic Dark Matter          |  2.01317
1010  Hot Gas and Dark Matter                       |  1.91171
1011  The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
1012  Rafting for Solar Neutrinos                   |      1.9
1013  NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
1014  Hot Gas and Dark Matter                       |   1.6123
1015  Ice Fishing for Cosmic Neutrinos              |      1.6
1016  Weak Lensing Distorts the Universe            | 0.818218
1017 </screen>
1018
1019     This is the same example using normalized ranking:
1020
1021 <screen>
1022 SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
1023 FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
1024 WHERE  query @@ textsearch
1025 ORDER BY rank DESC
1026 LIMIT 10;
1027                      title                     |        rank
1028 -----------------------------------------------+-------------------
1029  Neutrinos in the Sun                          | 0.756097569485493
1030  The Sudbury Neutrino Detector                 | 0.705882361190954
1031  A MACHO View of Galactic Dark Matter          | 0.668123210574724
1032  Hot Gas and Dark Matter                       |  0.65655958650282
1033  The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
1034  Rafting for Solar Neutrinos                   | 0.655172410958162
1035  NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
1036  Hot Gas and Dark Matter                       | 0.617195790024749
1037  Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
1038  Weak Lensing Distorts the Universe            | 0.450010798361481
1039 </screen>
1040    </para>
1041
1042    <para>
1043     Ranking can be expensive since it requires consulting the
1044     <type>tsvector</type> of each matching document, which can be I/O bound and
1045     therefore slow. Unfortunately, it is almost impossible to avoid since
1046     practical queries often result in large numbers of matches.
1047    </para>
1048
1049   </sect2>
1050
1051   <sect2 id="textsearch-headline">
1052    <title>Highlighting Results</title>
1053
1054    <para>
1055     To present search results it is ideal to show a part of each document and
1056     how it is related to the query. Usually, search engines show fragments of
1057     the document with marked search terms.  <productname>PostgreSQL</>
1058     provides a function <function>ts_headline</function> that
1059     implements this functionality.
1060    </para>
1061
1062    <indexterm>
1063     <primary>ts_headline</primary>
1064    </indexterm>
1065
1066 <synopsis>
1067 ts_headline(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">document</replaceable> <type>text</>, <replaceable class="PARAMETER">query</replaceable> <type>tsquery</> <optional>, <replaceable class="PARAMETER">options</replaceable> <type>text</> </optional>) returns <type>text</>
1068 </synopsis>
1069
1070    <para>
1071     <function>ts_headline</function> accepts a document along
1072     with a query, and returns an excerpt from
1073     the document in which terms from the query are highlighted.  The
1074     configuration to be used to parse the document can be specified by
1075     <replaceable>config</replaceable>; if <replaceable>config</replaceable>
1076     is omitted, the
1077     <varname>default_text_search_config</varname> configuration is used.
1078    </para>
1079
1080    <para>
1081     If an <replaceable>options</replaceable> string is specified it must
1082     consist of a comma-separated list of one or more
1083     <replaceable>option</><literal>=</><replaceable>value</> pairs.
1084     The available options are:
1085
1086     <itemizedlist  spacing="compact" mark="bullet">
1087      <listitem>
1088       <para>
1089        <literal>StartSel</>, <literal>StopSel</literal>: the strings with
1090        which to delimit query words appearing in the document, to distinguish
1091        them from other excerpted words.  You must double-quote these strings
1092        if they contain spaces or commas.
1093       </para>
1094      </listitem>
1095      <listitem>
1096       <para>
1097        <literal>MaxWords</>, <literal>MinWords</literal>: these numbers
1098        determine the longest and shortest headlines to output.
1099       </para>
1100      </listitem>
1101      <listitem>
1102       <para>
1103        <literal>ShortWord</literal>: words of this length or less will be
1104        dropped at the start and end of a headline. The default
1105        value of three eliminates common English articles.
1106       </para>
1107      </listitem>
1108      <listitem>
1109       <para>
1110        <literal>HighlightAll</literal>: Boolean flag;  if
1111        <literal>true</literal> the whole document will be used as the
1112        headline, ignoring the preceding three parameters.
1113       </para>
1114      </listitem>
1115      <listitem>
1116       <para>
1117        <literal>MaxFragments</literal>: maximum number of text excerpts
1118        or fragments to display.  The default value of zero selects a
1119        non-fragment-oriented headline generation method.  A value greater than
1120        zero selects fragment-based headline generation.  This method
1121        finds text fragments with as many query words as possible and
1122        stretches those fragments around the query words.  As a result
1123        query words are close to the middle of each fragment and have words on
1124        each side. Each fragment will be of at most <literal>MaxWords</> and
1125        words of length <literal>ShortWord</> or less are dropped at the start
1126        and end of each fragment. If not all query words are found in the
1127        document, then a single fragment of the first <literal>MinWords</>
1128        in the document will be displayed.
1129       </para>
1130      </listitem>
1131      <listitem>
1132       <para>
1133        <literal>FragmentDelimiter</literal>: When more than one fragment is
1134        displayed, the fragments will be separated by this string.
1135       </para>
1136      </listitem>
1137     </itemizedlist>
1138
1139     Any unspecified options receive these defaults:
1140
1141 <programlisting>
1142 StartSel=&lt;b&gt;, StopSel=&lt;/b&gt;,
1143 MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE,
1144 MaxFragments=0, FragmentDelimiter=" ... "
1145 </programlisting>
1146    </para>
1147
1148    <para>
1149     For example:
1150
1151 <screen>
1152 SELECT ts_headline('english',
1153   'The most common type of search
1154 is to find all documents containing given query terms
1155 and return them in order of their similarity to the
1156 query.',
1157   to_tsquery('query &amp; similarity'));
1158                         ts_headline
1159 ------------------------------------------------------------
1160  containing given &lt;b&gt;query&lt;/b&gt; terms
1161  and return them in order of their &lt;b&gt;similarity&lt;/b&gt; to the
1162  &lt;b&gt;query&lt;/b&gt;.
1163
1164 SELECT ts_headline('english',
1165   'The most common type of search
1166 is to find all documents containing given query terms
1167 and return them in order of their similarity to the
1168 query.',
1169   to_tsquery('query &amp; similarity'),
1170   'StartSel = &lt;, StopSel = &gt;');
1171                       ts_headline
1172 -------------------------------------------------------
1173  containing given &lt;query&gt; terms
1174  and return them in order of their &lt;similarity&gt; to the
1175  &lt;query&gt;.
1176 </screen>
1177    </para>
1178
1179    <para>
1180     <function>ts_headline</> uses the original document, not a
1181     <type>tsvector</type> summary, so it can be slow and should be used with
1182     care.  A typical mistake is to call <function>ts_headline</function> for
1183     <emphasis>every</emphasis> matching document when only ten documents are
1184     to be shown. <acronym>SQL</acronym> subqueries can help; here is an
1185     example:
1186
1187 <programlisting>
1188 SELECT id, ts_headline(body, q), rank
1189 FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank
1190       FROM apod, to_tsquery('stars') q
1191       WHERE ti @@ q
1192       ORDER BY rank DESC
1193       LIMIT 10) AS foo;
1194 </programlisting>
1195    </para>
1196
1197   </sect2>
1198
1199  </sect1>
1200
1201  <sect1 id="textsearch-features">
1202   <title>Additional Features</title>
1203
1204   <para>
1205    This section describes additional functions and operators that are
1206    useful in connection with text search.
1207   </para>
1208
1209   <sect2 id="textsearch-manipulate-tsvector">
1210    <title>Manipulating Documents</title>
1211
1212    <para>
1213     <xref linkend="textsearch-parsing-documents"> showed how raw textual
1214     documents can be converted into <type>tsvector</> values.
1215     <productname>PostgreSQL</productname> also provides functions and
1216     operators that can be used to manipulate documents that are already
1217     in <type>tsvector</> form.
1218    </para>
1219
1220    <variablelist>
1221
1222     <varlistentry>
1223
1224      <indexterm>
1225       <primary>tsvector concatenation</primary>
1226      </indexterm>
1227
1228      <term>
1229 <synopsis>
1230 <type>tsvector</> || <type>tsvector</>
1231 </synopsis>
1232      </term>
1233
1234      <listitem>
1235       <para>
1236        The <type>tsvector</> concatenation operator
1237        returns a vector which combines the lexemes and positional information
1238        of the two vectors given as arguments.  Positions and weight labels
1239        are retained during the concatenation.
1240        Positions appearing in the right-hand vector are offset by the largest
1241        position mentioned in the left-hand vector, so that the result is
1242        nearly equivalent to the result of performing <function>to_tsvector</>
1243        on the concatenation of the two original document strings.  (The
1244        equivalence is not exact, because any stop-words removed from the
1245        end of the left-hand argument will not affect the result, whereas
1246        they would have affected the positions of the lexemes in the
1247        right-hand argument if textual concatenation were used.)
1248       </para>
1249
1250       <para>
1251        One advantage of using concatenation in the vector form, rather than
1252        concatenating text before applying <function>to_tsvector</>, is that
1253        you can use different configurations to parse different sections
1254        of the document.  Also, because the <function>setweight</> function
1255        marks all lexemes of the given vector the same way, it is necessary
1256        to parse the text and do <function>setweight</> before concatenating
1257        if you want to label different parts of the document with different
1258        weights.
1259       </para>
1260      </listitem>
1261     </varlistentry>
1262
1263     <varlistentry>
1264
1265      <indexterm>
1266       <primary>setweight</primary>
1267      </indexterm>
1268
1269      <term>
1270 <synopsis>
1271 setweight(<replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>, <replaceable class="PARAMETER">weight</replaceable> <type>"char"</>) returns <type>tsvector</>
1272 </synopsis>
1273      </term>
1274
1275      <listitem>
1276       <para>
1277        <function>setweight</> returns a copy of the input vector in which every
1278        position has been labeled with the given <replaceable>weight</>, either
1279        <literal>A</literal>, <literal>B</literal>, <literal>C</literal>, or
1280        <literal>D</literal>.  (<literal>D</literal> is the default for new
1281        vectors and as such is not displayed on output.)  These labels are
1282        retained when vectors are concatenated, allowing words from different
1283        parts of a document to be weighted differently by ranking functions.
1284       </para>
1285
1286       <para>
1287        Note that weight labels apply to <emphasis>positions</>, not
1288        <emphasis>lexemes</>.  If the input vector has been stripped of
1289        positions then <function>setweight</> does nothing.
1290       </para>
1291      </listitem>
1292     </varlistentry>
1293
1294     <varlistentry>
1295      <indexterm>
1296       <primary>length(tsvector)</primary>
1297      </indexterm>
1298
1299      <term>
1300 <synopsis>
1301 length(<replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>) returns <type>integer</>
1302 </synopsis>
1303      </term>
1304
1305      <listitem>
1306       <para>
1307        Returns the number of lexemes stored in the vector.
1308       </para>
1309      </listitem>
1310     </varlistentry>
1311
1312     <varlistentry>
1313
1314      <indexterm>
1315       <primary>strip</primary>
1316      </indexterm>
1317
1318      <term>
1319 <synopsis>
1320 strip(<replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>) returns <type>tsvector</>
1321 </synopsis>
1322      </term>
1323
1324      <listitem>
1325       <para>
1326        Returns a vector which lists the same lexemes as the given vector, but
1327        which lacks any position or weight information.  While the returned
1328        vector is much less useful than an unstripped vector for relevance
1329        ranking, it will usually be much smaller.
1330       </para>
1331      </listitem>
1332
1333     </varlistentry>
1334
1335    </variablelist>
1336
1337   </sect2>
1338
1339   <sect2 id="textsearch-manipulate-tsquery">
1340    <title>Manipulating Queries</title>
1341
1342    <para>
1343     <xref linkend="textsearch-parsing-queries"> showed how raw textual
1344     queries can be converted into <type>tsquery</> values.
1345     <productname>PostgreSQL</productname> also provides functions and
1346     operators that can be used to manipulate queries that are already
1347     in <type>tsquery</> form.
1348    </para>
1349
1350    <variablelist>
1351
1352     <varlistentry>
1353
1354      <term>
1355 <synopsis>
1356 <type>tsquery</> &amp;&amp; <type>tsquery</>
1357 </synopsis>
1358      </term>
1359
1360      <listitem>
1361       <para>
1362        Returns the AND-combination of the two given queries.
1363       </para>
1364      </listitem>
1365
1366     </varlistentry>
1367
1368     <varlistentry>
1369
1370      <term>
1371 <synopsis>
1372 <type>tsquery</> || <type>tsquery</>
1373 </synopsis>
1374      </term>
1375
1376      <listitem>
1377       <para>
1378        Returns the OR-combination of the two given queries.
1379       </para>
1380      </listitem>
1381
1382     </varlistentry>
1383
1384     <varlistentry>
1385
1386      <term>
1387 <synopsis>
1388 !! <type>tsquery</>
1389 </synopsis>
1390      </term>
1391
1392      <listitem>
1393       <para>
1394        Returns the negation (NOT) of the given query.
1395       </para>
1396      </listitem>
1397
1398     </varlistentry>
1399
1400     <varlistentry>
1401
1402      <indexterm>
1403       <primary>numnode</primary>
1404      </indexterm>
1405
1406      <term>
1407 <synopsis>
1408 numnode(<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>) returns <type>integer</>
1409 </synopsis>
1410      </term>
1411
1412      <listitem>
1413       <para>
1414        Returns the number of nodes (lexemes plus operators) in a
1415        <type>tsquery</>. This function is useful
1416        to determine if the <replaceable>query</replaceable> is meaningful
1417        (returns &gt; 0), or contains only stop words (returns 0).
1418        Examples:
1419
1420 <screen>
1421 SELECT numnode(plainto_tsquery('the any'));
1422 NOTICE:  query contains only stopword(s) or doesn't contain lexeme(s), ignored
1423  numnode
1424 ---------
1425        0
1426
1427 SELECT numnode('foo &amp; bar'::tsquery);
1428  numnode
1429 ---------
1430        3
1431 </screen>
1432       </para>
1433      </listitem>
1434     </varlistentry>
1435
1436     <varlistentry>
1437
1438      <indexterm>
1439       <primary>querytree</primary>
1440      </indexterm>
1441
1442      <term>
1443 <synopsis>
1444 querytree(<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>) returns <type>text</>
1445 </synopsis>
1446      </term>
1447
1448      <listitem>
1449       <para>
1450        Returns the portion of a <type>tsquery</> that can be used for
1451        searching an index.  This function is useful for detecting
1452        unindexable queries, for example those containing only stop words
1453        or only negated terms.  For example:
1454
1455 <screen>
1456 SELECT querytree(to_tsquery('!defined'));
1457  querytree
1458 -----------
1459
1460 </screen>
1461       </para>
1462      </listitem>
1463     </varlistentry>
1464
1465    </variablelist>
1466
1467    <sect3 id="textsearch-query-rewriting">
1468     <title>Query Rewriting</title>
1469
1470     <indexterm zone="textsearch-query-rewriting">
1471      <primary>ts_rewrite</primary>
1472     </indexterm>
1473
1474     <para>
1475      The <function>ts_rewrite</function> family of functions search a
1476      given <type>tsquery</> for occurrences of a target
1477      subquery, and replace each occurrence with a
1478      substitute subquery.  In essence this operation is a
1479      <type>tsquery</>-specific version of substring replacement.
1480      A target and substitute combination can be
1481      thought of as a <firstterm>query rewrite rule</>.  A collection
1482      of such rewrite rules can be a powerful search aid.
1483      For example, you can expand the search using synonyms
1484      (e.g., <literal>new york</>, <literal>big apple</>, <literal>nyc</>,
1485      <literal>gotham</>) or narrow the search to direct the user to some hot
1486      topic.  There is some overlap in functionality between this feature
1487      and thesaurus dictionaries (<xref linkend="textsearch-thesaurus">).
1488      However, you can modify a set of rewrite rules on-the-fly without
1489      reindexing, whereas updating a thesaurus requires reindexing to be
1490      effective.
1491     </para>
1492
1493     <variablelist>
1494
1495      <varlistentry>
1496
1497       <term>
1498 <synopsis>
1499 ts_rewrite (<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">target</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">substitute</replaceable> <type>tsquery</>) returns <type>tsquery</>
1500 </synopsis>
1501       </term>
1502
1503       <listitem>
1504        <para>
1505         This form of <function>ts_rewrite</> simply applies a single
1506         rewrite rule: <replaceable class="PARAMETER">target</replaceable>
1507         is replaced by <replaceable class="PARAMETER">substitute</replaceable>
1508         wherever it appears in <replaceable
1509         class="PARAMETER">query</replaceable>.  For example:
1510
1511 <screen>
1512 SELECT ts_rewrite('a &amp; b'::tsquery, 'a'::tsquery, 'c'::tsquery);
1513  ts_rewrite
1514 ------------
1515  'b' &amp; 'c'
1516 </screen>
1517        </para>
1518       </listitem>
1519      </varlistentry>
1520
1521      <varlistentry>
1522
1523       <term>
1524 <synopsis>
1525 ts_rewrite (<replaceable class="PARAMETER">query</> <type>tsquery</>, <replaceable class="PARAMETER">select</> <type>text</>) returns <type>tsquery</>
1526 </synopsis>
1527       </term>
1528
1529       <listitem>
1530        <para>
1531         This form of <function>ts_rewrite</> accepts a starting
1532         <replaceable>query</> and a SQL <replaceable>select</> command, which
1533         is given as a text string.  The <replaceable>select</> must yield two
1534         columns of <type>tsquery</> type.  For each row of the
1535         <replaceable>select</> result, occurrences of the first column value
1536         (the target) are replaced by the second column value (the substitute)
1537         within the current <replaceable>query</> value.  For example:
1538
1539 <screen>
1540 CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery);
1541 INSERT INTO aliases VALUES('a', 'c');
1542
1543 SELECT ts_rewrite('a &amp; b'::tsquery, 'SELECT t,s FROM aliases');
1544  ts_rewrite
1545 ------------
1546  'b' &amp; 'c'
1547 </screen>
1548        </para>
1549
1550        <para>
1551         Note that when multiple rewrite rules are applied in this way,
1552         the order of application can be important; so in practice you will
1553         want the source query to <literal>ORDER BY</> some ordering key.
1554        </para>
1555       </listitem>
1556      </varlistentry>
1557
1558     </variablelist>
1559
1560     <para>
1561      Let's consider a real-life astronomical example. We'll expand query
1562      <literal>supernovae</literal> using table-driven rewriting rules:
1563
1564 <screen>
1565 CREATE TABLE aliases (t tsquery primary key, s tsquery);
1566 INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn'));
1567
1568 SELECT ts_rewrite(to_tsquery('supernovae &amp; crab'), 'SELECT * FROM aliases');
1569            ts_rewrite
1570 ---------------------------------
1571  'crab' &amp; ( 'supernova' | 'sn' )
1572 </screen>
1573
1574      We can change the rewriting rules just by updating the table:
1575
1576 <screen>
1577 UPDATE aliases
1578 SET s = to_tsquery('supernovae|sn &amp; !nebulae')
1579 WHERE t = to_tsquery('supernovae');
1580
1581 SELECT ts_rewrite(to_tsquery('supernovae &amp; crab'), 'SELECT * FROM aliases');
1582                  ts_rewrite
1583 ---------------------------------------------
1584  'crab' &amp; ( 'supernova' | 'sn' &amp; !'nebula' )
1585 </screen>
1586     </para>
1587
1588     <para>
1589      Rewriting can be slow when there are many rewriting rules, since it
1590      checks every rule for a possible match. To filter out obvious non-candidate
1591      rules we can use the containment operators for the <type>tsquery</type>
1592      type. In the example below, we select only those rules which might match
1593      the original query:
1594
1595 <screen>
1596 SELECT ts_rewrite('a &amp; b'::tsquery,
1597                   'SELECT t,s FROM aliases WHERE ''a &amp; b''::tsquery @&gt; t');
1598  ts_rewrite
1599 ------------
1600  'b' &amp; 'c'
1601 </screen>
1602     </para>
1603
1604    </sect3>
1605
1606   </sect2>
1607
1608   <sect2 id="textsearch-update-triggers">
1609    <title>Triggers for Automatic Updates</title>
1610
1611    <indexterm>
1612     <primary>trigger</primary>
1613     <secondary>for updating a derived tsvector column</secondary>
1614    </indexterm>
1615
1616    <para>
1617     When using a separate column to store the <type>tsvector</> representation
1618     of your documents, it is necessary to create a trigger to update the
1619     <type>tsvector</> column when the document content columns change.
1620     Two built-in trigger functions are available for this, or you can write
1621     your own.
1622    </para>
1623
1624 <synopsis>
1625 tsvector_update_trigger(<replaceable class="PARAMETER">tsvector_column_name</replaceable>, <replaceable class="PARAMETER">config_name</replaceable>, <replaceable class="PARAMETER">text_column_name</replaceable> <optional>, ... </optional>)
1626 tsvector_update_trigger_column(<replaceable class="PARAMETER">tsvector_column_name</replaceable>, <replaceable class="PARAMETER">config_column_name</replaceable>, <replaceable class="PARAMETER">text_column_name</replaceable> <optional>, ... </optional>)
1627 </synopsis>
1628
1629    <para>
1630     These trigger functions automatically compute a <type>tsvector</>
1631     column from one or more textual columns, under the control of
1632     parameters specified in the <command>CREATE TRIGGER</> command.
1633     An example of their use is:
1634
1635 <screen>
1636 CREATE TABLE messages (
1637     title       text,
1638     body        text,
1639     tsv         tsvector
1640 );
1641
1642 CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
1643 ON messages FOR EACH ROW EXECUTE PROCEDURE
1644 tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);
1645
1646 INSERT INTO messages VALUES('title here', 'the body text is here');
1647
1648 SELECT * FROM messages;
1649    title    |         body          |            tsv
1650 ------------+-----------------------+----------------------------
1651  title here | the body text is here | 'bodi':4 'text':5 'titl':1
1652
1653 SELECT title, body FROM messages WHERE tsv @@ to_tsquery('title &amp; body');
1654    title    |         body
1655 ------------+-----------------------
1656  title here | the body text is here
1657 </screen>
1658
1659     Having created this trigger, any change in <structfield>title</> or
1660     <structfield>body</> will automatically be reflected into
1661     <structfield>tsv</>, without the application having to worry about it.
1662    </para>
1663
1664    <para>
1665     The first trigger argument must be the name of the <type>tsvector</>
1666     column to be updated.  The second argument specifies the text search
1667     configuration to be used to perform the conversion.  For
1668     <function>tsvector_update_trigger</>, the configuration name is simply
1669     given as the second trigger argument.  It must be schema-qualified as
1670     shown above, so that the trigger behavior will not change with changes
1671     in <varname>search_path</>.  For
1672     <function>tsvector_update_trigger_column</>, the second trigger argument
1673     is the name of another table column, which must be of type
1674     <type>regconfig</>.  This allows a per-row selection of configuration
1675     to be made.  The remaining argument(s) are the names of textual columns
1676     (of type <type>text</>, <type>varchar</>, or <type>char</>).  These
1677     will be included in the document in the order given.  NULL values will
1678     be skipped (but the other columns will still be indexed).
1679    </para>
1680
1681    <para>
1682     A limitation of these built-in triggers is that they treat all the
1683     input columns alike.  To process columns differently &mdash; for
1684     example, to weight title differently from body &mdash; it is necessary
1685     to write a custom trigger.  Here is an example using
1686     <application>PL/pgSQL</application> as the trigger language:
1687
1688 <programlisting>
1689 CREATE FUNCTION messages_trigger() RETURNS trigger AS $$
1690 begin
1691   new.tsv :=
1692      setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') ||
1693      setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D');
1694   return new;
1695 end
1696 $$ LANGUAGE plpgsql;
1697
1698 CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
1699     ON messages FOR EACH ROW EXECUTE PROCEDURE messages_trigger();
1700 </programlisting>
1701    </para>
1702
1703    <para>
1704     Keep in mind that it is important to specify the configuration name
1705     explicitly when creating <type>tsvector</> values inside triggers,
1706     so that the column's contents will not be affected by changes to
1707     <varname>default_text_search_config</>.  Failure to do this is likely to
1708     lead to problems such as search results changing after a dump and reload.
1709    </para>
1710
1711   </sect2>
1712
1713   <sect2 id="textsearch-statistics">
1714    <title>Gathering Document Statistics</title>
1715
1716    <indexterm>
1717     <primary>ts_stat</primary>
1718    </indexterm>
1719
1720    <para>
1721     The function <function>ts_stat</> is useful for checking your
1722     configuration and for finding stop-word candidates.
1723    </para>
1724
1725 <synopsis>
1726 ts_stat(<replaceable class="PARAMETER">sqlquery</replaceable> <type>text</>, <optional> <replaceable class="PARAMETER">weights</replaceable> <type>text</>, </optional>
1727         OUT <replaceable class="PARAMETER">word</replaceable> <type>text</>, OUT <replaceable class="PARAMETER">ndoc</replaceable> <type>integer</>,
1728         OUT <replaceable class="PARAMETER">nentry</replaceable> <type>integer</>) returns <type>setof record</>
1729 </synopsis>
1730
1731    <para>
1732     <replaceable>sqlquery</replaceable> is a text value containing an SQL
1733     query which must return a single <type>tsvector</type> column.
1734     <function>ts_stat</> executes the query and returns statistics about
1735     each distinct lexeme (word) contained in the <type>tsvector</type>
1736     data.  The columns returned are
1737
1738     <itemizedlist  spacing="compact" mark="bullet">
1739      <listitem>
1740       <para>
1741        <replaceable>word</> <type>text</> &mdash; the value of a lexeme
1742       </para>
1743      </listitem>
1744      <listitem>
1745       <para>
1746        <replaceable>ndoc</> <type>integer</> &mdash; number of documents
1747        (<type>tsvector</>s) the word occurred in
1748       </para>
1749      </listitem>
1750      <listitem>
1751       <para>
1752        <replaceable>nentry</> <type>integer</> &mdash; total number of
1753        occurrences of the word
1754       </para>
1755      </listitem>
1756     </itemizedlist>
1757
1758     If <replaceable>weights</replaceable> is supplied, only occurrences
1759     having one of those weights are counted.
1760    </para>
1761
1762    <para>
1763     For example, to find the ten most frequent words in a document collection:
1764
1765 <programlisting>
1766 SELECT * FROM ts_stat('SELECT vector FROM apod')
1767 ORDER BY nentry DESC, ndoc DESC, word
1768 LIMIT 10;
1769 </programlisting>
1770
1771     The same, but counting only word occurrences with weight <literal>A</>
1772     or <literal>B</>:
1773
1774 <programlisting>
1775 SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
1776 ORDER BY nentry DESC, ndoc DESC, word
1777 LIMIT 10;
1778 </programlisting>
1779    </para>
1780
1781   </sect2>
1782
1783  </sect1>
1784
1785  <sect1 id="textsearch-parsers">
1786   <title>Parsers</title>
1787
1788   <para>
1789    Text search parsers are responsible for splitting raw document text
1790    into <firstterm>tokens</> and identifying each token's type, where
1791    the set of possible types is defined by the parser itself.
1792    Note that a parser does not modify the text at all &mdash; it simply
1793    identifies plausible word boundaries.  Because of this limited scope,
1794    there is less need for application-specific custom parsers than there is
1795    for custom dictionaries.  At present <productname>PostgreSQL</productname>
1796    provides just one built-in parser, which has been found to be useful for a
1797    wide range of applications.
1798   </para>
1799
1800   <para>
1801    The built-in parser is named <literal>pg_catalog.default</>.
1802    It recognizes 23 token types, shown in <xref linkend="textsearch-default-parser">.
1803   </para>
1804
1805   <table id="textsearch-default-parser">
1806    <title>Default Parser's Token Types</title>
1807    <tgroup cols="3">
1808     <thead>
1809      <row>
1810       <entry>Alias</entry>
1811       <entry>Description</entry>
1812       <entry>Example</entry>
1813      </row>
1814     </thead>
1815     <tbody>
1816      <row>
1817       <entry><literal>asciiword</></entry>
1818       <entry>Word, all ASCII letters</entry>
1819       <entry><literal>elephant</literal></entry>
1820      </row>
1821      <row>
1822       <entry><literal>word</></entry>
1823       <entry>Word, all letters</entry>
1824       <entry><literal>ma&ntilde;ana</literal></entry>
1825      </row>
1826      <row>
1827       <entry><literal>numword</></entry>
1828       <entry>Word, letters and digits</entry>
1829       <entry><literal>beta1</literal></entry>
1830      </row>
1831      <row>
1832       <entry><literal>asciihword</></entry>
1833       <entry>Hyphenated word, all ASCII</entry>
1834       <entry><literal>up-to-date</literal></entry>
1835      </row>
1836      <row>
1837       <entry><literal>hword</></entry>
1838       <entry>Hyphenated word, all letters</entry>
1839       <entry><literal>l&oacute;gico-matem&aacute;tica</literal></entry>
1840      </row>
1841      <row>
1842       <entry><literal>numhword</></entry>
1843       <entry>Hyphenated word, letters and digits</entry>
1844       <entry><literal>postgresql-beta1</literal></entry>
1845      </row>
1846      <row>
1847       <entry><literal>hword_asciipart</></entry>
1848       <entry>Hyphenated word part, all ASCII</entry>
1849       <entry><literal>postgresql</literal> in the context <literal>postgresql-beta1</literal></entry>
1850      </row>
1851      <row>
1852       <entry><literal>hword_part</></entry>
1853       <entry>Hyphenated word part, all letters</entry>
1854       <entry><literal>l&oacute;gico</literal> or <literal>matem&aacute;tica</literal>
1855        in the context <literal>l&oacute;gico-matem&aacute;tica</literal></entry>
1856      </row>
1857      <row>
1858       <entry><literal>hword_numpart</></entry>
1859       <entry>Hyphenated word part, letters and digits</entry>
1860       <entry><literal>beta1</literal> in the context
1861        <literal>postgresql-beta1</literal></entry>
1862      </row>
1863      <row>
1864       <entry><literal>email</></entry>
1865       <entry>Email address</entry>
1866       <entry><literal>foo@example.com</literal></entry>
1867      </row>
1868      <row>
1869       <entry><literal>protocol</></entry>
1870       <entry>Protocol head</entry>
1871       <entry><literal>http://</literal></entry>
1872      </row>
1873      <row>
1874       <entry><literal>url</></entry>
1875       <entry>URL</entry>
1876       <entry><literal>example.com/stuff/index.html</literal></entry>
1877      </row>
1878      <row>
1879       <entry><literal>host</></entry>
1880       <entry>Host</entry>
1881       <entry><literal>example.com</literal></entry>
1882      </row>
1883      <row>
1884       <entry><literal>url_path</></entry>
1885       <entry>URL path</entry>
1886       <entry><literal>/stuff/index.html</literal>, in the context of a URL</entry>
1887      </row>
1888      <row>
1889       <entry><literal>file</></entry>
1890       <entry>File or path name</entry>
1891       <entry><literal>/usr/local/foo.txt</literal>, if not within a URL</entry>
1892      </row>
1893      <row>
1894       <entry><literal>sfloat</></entry>
1895       <entry>Scientific notation</entry>
1896       <entry><literal>-1.234e56</literal></entry>
1897      </row>
1898      <row>
1899       <entry><literal>float</></entry>
1900       <entry>Decimal notation</entry>
1901       <entry><literal>-1.234</literal></entry>
1902      </row>
1903      <row>
1904       <entry><literal>int</></entry>
1905       <entry>Signed integer</entry>
1906       <entry><literal>-1234</literal></entry>
1907      </row>
1908      <row>
1909       <entry><literal>uint</></entry>
1910       <entry>Unsigned integer</entry>
1911       <entry><literal>1234</literal></entry>
1912      </row>
1913      <row>
1914       <entry><literal>version</></entry>
1915       <entry>Version number</entry>
1916       <entry><literal>8.3.0</literal></entry>
1917      </row>
1918      <row>
1919       <entry><literal>tag</></entry>
1920       <entry>XML tag</entry>
1921       <entry><literal>&lt;a href="dictionaries.html"&gt;</literal></entry>
1922      </row>
1923      <row>
1924       <entry><literal>entity</></entry>
1925       <entry>XML entity</entry>
1926       <entry><literal>&amp;amp;</literal></entry>
1927      </row>
1928      <row>
1929       <entry><literal>blank</></entry>
1930       <entry>Space symbols</entry>
1931       <entry>(any whitespace or punctuation not otherwise recognized)</entry>
1932      </row>
1933     </tbody>
1934    </tgroup>
1935   </table>
1936
1937   <note>
1938    <para>
1939     The parser's notion of a <quote>letter</> is determined by the database's
1940     locale setting, specifically <varname>lc_ctype</>.  Words containing
1941     only the basic ASCII letters are reported as a separate token type,
1942     since it is sometimes useful to distinguish them.  In most European
1943     languages, token types <literal>word</> and <literal>asciiword</>
1944     should be treated alike.
1945    </para>
1946
1947    <para>
1948     <literal>email</> does not support all valid email characters as
1949     defined by RFC 5322.  Specifically, the only non-alphanumeric
1950     characters supported for email user names are period, dash, and
1951     underscore.
1952    </para>
1953   </note>
1954
1955   <para>
1956    It is possible for the parser to produce overlapping tokens from the same
1957    piece of text.  As an example, a hyphenated word will be reported both
1958    as the entire word and as each component:
1959
1960 <screen>
1961 SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
1962       alias      |               description                |     token
1963 -----------------+------------------------------------------+---------------
1964  numhword        | Hyphenated word, letters and digits      | foo-bar-beta1
1965  hword_asciipart | Hyphenated word part, all ASCII          | foo
1966  blank           | Space symbols                            | -
1967  hword_asciipart | Hyphenated word part, all ASCII          | bar
1968  blank           | Space symbols                            | -
1969  hword_numpart   | Hyphenated word part, letters and digits | beta1
1970 </screen>
1971
1972    This behavior is desirable since it allows searches to work for both
1973    the whole compound word and for components.  Here is another
1974    instructive example:
1975
1976 <screen>
1977 SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
1978   alias   |  description  |            token
1979 ----------+---------------+------------------------------
1980  protocol | Protocol head | http://
1981  url      | URL           | example.com/stuff/index.html
1982  host     | Host          | example.com
1983  url_path | URL path      | /stuff/index.html
1984 </screen>
1985   </para>
1986
1987  </sect1>
1988
1989  <sect1 id="textsearch-dictionaries">
1990   <title>Dictionaries</title>
1991
1992   <para>
1993    Dictionaries are used to eliminate words that should not be considered in a
1994    search (<firstterm>stop words</>), and to <firstterm>normalize</> words so
1995    that different derived forms of the same word will match.  A successfully
1996    normalized word is called a <firstterm>lexeme</>.  Aside from
1997    improving search quality, normalization and removal of stop words reduce the
1998    size of the <type>tsvector</type> representation of a document, thereby
1999    improving performance.  Normalization does not always have linguistic meaning
2000    and usually depends on application semantics.
2001   </para>
2002
2003   <para>
2004    Some examples of normalization:
2005
2006    <itemizedlist  spacing="compact" mark="bullet">
2007
2008     <listitem>
2009      <para>
2010       Linguistic - Ispell dictionaries try to reduce input words to a
2011       normalized form; stemmer dictionaries remove word endings
2012      </para>
2013     </listitem>
2014     <listitem>
2015      <para>
2016       <acronym>URL</acronym> locations can be canonicalized to make
2017       equivalent URLs match:
2018
2019       <itemizedlist  spacing="compact" mark="bullet">
2020        <listitem>
2021         <para>
2022          http://www.pgsql.ru/db/mw/index.html
2023         </para>
2024        </listitem>
2025        <listitem>
2026         <para>
2027          http://www.pgsql.ru/db/mw/
2028         </para>
2029        </listitem>
2030        <listitem>
2031         <para>
2032          http://www.pgsql.ru/db/../db/mw/index.html
2033         </para>
2034        </listitem>
2035       </itemizedlist>
2036      </para>
2037     </listitem>
2038     <listitem>
2039      <para>
2040       Color names can be replaced by their hexadecimal values, e.g.,
2041       <literal>red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF</literal>
2042      </para>
2043     </listitem>
2044     <listitem>
2045      <para>
2046       If indexing numbers, we can
2047       remove some fractional digits to reduce the range of possible
2048       numbers, so for example <emphasis>3.14</emphasis>159265359,
2049       <emphasis>3.14</emphasis>15926, <emphasis>3.14</emphasis> will be the same
2050       after normalization if only two digits are kept after the decimal point.
2051      </para>
2052     </listitem>
2053    </itemizedlist>
2054
2055   </para>
2056
2057   <para>
2058    A dictionary is a program that accepts a token as
2059    input and returns:
2060    <itemizedlist  spacing="compact" mark="bullet">
2061     <listitem>
2062      <para>
2063       an array of lexemes if the input token is known to the dictionary
2064       (notice that one token can produce more than one lexeme)
2065      </para>
2066     </listitem>
2067     <listitem>
2068      <para>
2069       a single lexeme with the <literal>TSL_FILTER</> flag set, to replace
2070       the original token with a new token to be passed to subsequent
2071       dictionaries (a dictionary that does this is called a
2072       <firstterm>filtering dictionary</>)
2073      </para>
2074     </listitem>
2075     <listitem>
2076      <para>
2077       an empty array if the dictionary knows the token, but it is a stop word
2078      </para>
2079     </listitem>
2080     <listitem>
2081      <para>
2082       <literal>NULL</literal> if the dictionary does not recognize the input token
2083      </para>
2084     </listitem>
2085    </itemizedlist>
2086   </para>
2087
2088   <para>
2089    <productname>PostgreSQL</productname> provides predefined dictionaries for
2090    many languages.  There are also several predefined templates that can be
2091    used to create new dictionaries with custom parameters.  Each predefined
2092    dictionary template is described below.  If no existing
2093    template is suitable, it is possible to create new ones; see the
2094    <filename>contrib/</> area of the <productname>PostgreSQL</> distribution
2095    for examples.
2096   </para>
2097
2098   <para>
2099    A text search configuration binds a parser together with a set of
2100    dictionaries to process the parser's output tokens.  For each token
2101    type that the parser can return, a separate list of dictionaries is
2102    specified by the configuration.  When a token of that type is found
2103    by the parser, each dictionary in the list is consulted in turn,
2104    until some dictionary recognizes it as a known word.  If it is identified
2105    as a stop word, or if no dictionary recognizes the token, it will be
2106    discarded and not indexed or searched for.
2107    Normally, the first dictionary that returns a non-<literal>NULL</>
2108    output determines the result, and any remaining dictionaries are not
2109    consulted; but a filtering dictionary can replace the given word
2110    with a modified word, which is then passed to subsequent dictionaries.
2111   </para>
2112
2113   <para>
2114    The general rule for configuring a list of dictionaries
2115    is to place first the most narrow, most specific dictionary, then the more
2116    general dictionaries, finishing with a very general dictionary, like
2117    a <application>Snowball</> stemmer or <literal>simple</>, which
2118    recognizes everything.  For example, for an astronomy-specific search
2119    (<literal>astro_en</literal> configuration) one could bind token type
2120    <type>asciiword</type> (ASCII word) to a synonym dictionary of astronomical
2121    terms, a general English dictionary and a <application>Snowball</> English
2122    stemmer:
2123
2124 <programlisting>
2125 ALTER TEXT SEARCH CONFIGURATION astro_en
2126     ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
2127 </programlisting>
2128   </para>
2129
2130   <para>
2131    A filtering dictionary can be placed anywhere in the list, except at the
2132    end where it'd be useless.  Filtering dictionaries are useful to partially
2133    normalize words to simplify the task of later dictionaries.  For example,
2134    a filtering dictionary could be used to remove accents from accented
2135    letters, as is done by the <xref linkend="unaccent"> module.
2136   </para>
2137
2138   <sect2 id="textsearch-stopwords">
2139    <title>Stop Words</title>
2140
2141    <para>
2142     Stop words are words that are very common, appear in almost every
2143     document, and have no discrimination value. Therefore, they can be ignored
2144     in the context of full text searching. For example, every English text
2145     contains words like <literal>a</literal> and <literal>the</>, so it is
2146     useless to store them in an index.  However, stop words do affect the
2147     positions in <type>tsvector</type>, which in turn affect ranking:
2148
2149 <screen>
2150 SELECT to_tsvector('english','in the list of stop words');
2151         to_tsvector
2152 ----------------------------
2153  'list':3 'stop':5 'word':6
2154 </screen>
2155
2156     The missing positions 1,2,4 are because of stop words.  Ranks
2157     calculated for documents with and without stop words are quite different:
2158
2159 <screen>
2160 SELECT ts_rank_cd (to_tsvector('english','in the list of stop words'), to_tsquery('list &amp; stop'));
2161  ts_rank_cd
2162 ------------
2163        0.05
2164
2165 SELECT ts_rank_cd (to_tsvector('english','list stop words'), to_tsquery('list &amp; stop'));
2166  ts_rank_cd
2167 ------------
2168         0.1
2169 </screen>
2170
2171    </para>
2172
2173    <para>
2174     It is up to the specific dictionary how it treats stop words. For example,
2175     <literal>ispell</literal> dictionaries first normalize words and then
2176     look at the list of stop words, while <literal>Snowball</literal> stemmers
2177     first check the list of stop words. The reason for the different
2178     behavior is an attempt to decrease noise.
2179    </para>
2180
2181   </sect2>
2182
2183   <sect2 id="textsearch-simple-dictionary">
2184    <title>Simple Dictionary</title>
2185
2186    <para>
2187     The <literal>simple</> dictionary template operates by converting the
2188     input token to lower case and checking it against a file of stop words.
2189     If it is found in the file then an empty array is returned, causing
2190     the token to be discarded.  If not, the lower-cased form of the word
2191     is returned as the normalized lexeme.  Alternatively, the dictionary
2192     can be configured to report non-stop-words as unrecognized, allowing
2193     them to be passed on to the next dictionary in the list.
2194    </para>
2195
2196    <para>
2197     Here is an example of a dictionary definition using the <literal>simple</>
2198     template:
2199
2200 <programlisting>
2201 CREATE TEXT SEARCH DICTIONARY public.simple_dict (
2202     TEMPLATE = pg_catalog.simple,
2203     STOPWORDS = english
2204 );
2205 </programlisting>
2206
2207     Here, <literal>english</literal> is the base name of a file of stop words.
2208     The file's full name will be
2209     <filename>$SHAREDIR/tsearch_data/english.stop</>,
2210     where <literal>$SHAREDIR</> means the
2211     <productname>PostgreSQL</productname> installation's shared-data directory,
2212     often <filename>/usr/local/share/postgresql</> (use <command>pg_config
2213     --sharedir</> to determine it if you're not sure).
2214     The file format is simply a list
2215     of words, one per line.  Blank lines and trailing spaces are ignored,
2216     and upper case is folded to lower case, but no other processing is done
2217     on the file contents.
2218    </para>
2219
2220    <para>
2221     Now we can test our dictionary:
2222
2223 <screen>
2224 SELECT ts_lexize('public.simple_dict','YeS');
2225  ts_lexize
2226 -----------
2227  {yes}
2228
2229 SELECT ts_lexize('public.simple_dict','The');
2230  ts_lexize
2231 -----------
2232  {}
2233 </screen>
2234    </para>
2235
2236    <para>
2237     We can also choose to return <literal>NULL</>, instead of the lower-cased
2238     word, if it is not found in the stop words file.  This behavior is
2239     selected by setting the dictionary's <literal>Accept</> parameter to
2240     <literal>false</>.  Continuing the example:
2241
2242 <screen>
2243 ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );
2244
2245 SELECT ts_lexize('public.simple_dict','YeS');
2246  ts_lexize
2247 -----------
2248
2249
2250 SELECT ts_lexize('public.simple_dict','The');
2251  ts_lexize
2252 -----------
2253  {}
2254 </screen>
2255    </para>
2256
2257    <para>
2258     With the default setting of <literal>Accept</> = <literal>true</>,
2259     it is only useful to place a <literal>simple</> dictionary at the end
2260     of a list of dictionaries, since it will never pass on any token to
2261     a following dictionary.  Conversely, <literal>Accept</> = <literal>false</>
2262     is only useful when there is at least one following dictionary.
2263    </para>
2264
2265    <caution>
2266     <para>
2267      Most types of dictionaries rely on configuration files, such as files of
2268      stop words.  These files <emphasis>must</> be stored in UTF-8 encoding.
2269      They will be translated to the actual database encoding, if that is
2270      different, when they are read into the server.
2271     </para>
2272    </caution>
2273
2274    <caution>
2275     <para>
2276      Normally, a database session will read a dictionary configuration file
2277      only once, when it is first used within the session.  If you modify a
2278      configuration file and want to force existing sessions to pick up the
2279      new contents, issue an <command>ALTER TEXT SEARCH DICTIONARY</> command
2280      on the dictionary.  This can be a <quote>dummy</> update that doesn't
2281      actually change any parameter values.
2282     </para>
2283    </caution>
2284
2285   </sect2>
2286
2287   <sect2 id="textsearch-synonym-dictionary">
2288    <title>Synonym Dictionary</title>
2289
2290    <para>
2291     This dictionary template is used to create dictionaries that replace a
2292     word with a synonym. Phrases are not supported (use the thesaurus
2293     template (<xref linkend="textsearch-thesaurus">) for that).  A synonym
2294     dictionary can be used to overcome linguistic problems, for example, to
2295     prevent an English stemmer dictionary from reducing the word <quote>Paris</quote> to
2296     <quote>pari</quote>.  It is enough to have a <literal>Paris paris</literal> line in the
2297     synonym dictionary and put it before the <literal>english_stem</>
2298     dictionary.  For example:
2299
2300 <screen>
2301 SELECT * FROM ts_debug('english', 'Paris');
2302    alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
2303 -----------+-----------------+-------+----------------+--------------+---------
2304  asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari}
2305
2306 CREATE TEXT SEARCH DICTIONARY my_synonym (
2307     TEMPLATE = synonym,
2308     SYNONYMS = my_synonyms
2309 );
2310
2311 ALTER TEXT SEARCH CONFIGURATION english
2312     ALTER MAPPING FOR asciiword
2313     WITH my_synonym, english_stem;
2314
2315 SELECT * FROM ts_debug('english', 'Paris');
2316    alias   |   description   | token |       dictionaries        | dictionary | lexemes
2317 -----------+-----------------+-------+---------------------------+------------+---------
2318  asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
2319 </screen>
2320    </para>
2321
2322    <para>
2323     The only parameter required by the <literal>synonym</> template is
2324     <literal>SYNONYMS</>, which is the base name of its configuration file
2325     &mdash; <literal>my_synonyms</> in the above example.
2326     The file's full name will be
2327     <filename>$SHAREDIR/tsearch_data/my_synonyms.syn</>
2328     (where <literal>$SHAREDIR</> means the
2329     <productname>PostgreSQL</> installation's shared-data directory).
2330     The file format is just one line
2331     per word to be substituted, with the word followed by its synonym,
2332     separated by white space.  Blank lines and trailing spaces are ignored.
2333    </para>
2334
2335    <para>
2336     The <literal>synonym</> template also has an optional parameter
2337     <literal>CaseSensitive</>, which defaults to <literal>false</>.  When
2338     <literal>CaseSensitive</> is <literal>false</>, words in the synonym file
2339     are folded to lower case, as are input tokens.  When it is
2340     <literal>true</>, words and tokens are not folded to lower case,
2341     but are compared as-is.
2342    </para>
2343
2344    <para>
2345     An asterisk (<literal>*</literal>) can be placed at the end of a synonym
2346     in the configuration file.  This indicates that the synonym is a prefix.
2347     The asterisk is ignored when the entry is used in
2348     <function>to_tsvector()</function>, but when it is used in
2349     <function>to_tsquery()</function>, the result will be a query item with
2350     the prefix match marker (see
2351     <xref linkend="textsearch-parsing-queries">).
2352     For example, suppose we have these entries in
2353     <filename>$SHAREDIR/tsearch_data/synonym_sample.syn</>:
2354 <programlisting>
2355 postgres        pgsql
2356 postgresql      pgsql
2357 postgre pgsql
2358 gogle   googl
2359 indices index*
2360 </programlisting>
2361     Then we will get these results:
2362 <screen>
2363 mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
2364 mydb=# SELECT ts_lexize('syn','indices');
2365  ts_lexize
2366 -----------
2367  {index}
2368 (1 row)
2369
2370 mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
2371 mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
2372 mydb=# SELECT to_tsvector('tst','indices');
2373  to_tsvector
2374 -------------
2375  'index':1
2376 (1 row)
2377
2378 mydb=# SELECT to_tsquery('tst','indices');
2379  to_tsquery
2380 ------------
2381  'index':*
2382 (1 row)
2383
2384 mydb=# SELECT 'indexes are very useful'::tsvector;
2385             tsvector
2386 ---------------------------------
2387  'are' 'indexes' 'useful' 'very'
2388 (1 row)
2389
2390 mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
2391  ?column?
2392 ----------
2393  t
2394 (1 row)
2395 </screen>
2396    </para>
2397   </sect2>
2398
2399   <sect2 id="textsearch-thesaurus">
2400    <title>Thesaurus Dictionary</title>
2401
2402    <para>
2403     A thesaurus dictionary (sometimes abbreviated as <acronym>TZ</acronym>) is
2404     a collection of words that includes information about the relationships
2405     of words and phrases, i.e., broader terms (<acronym>BT</acronym>), narrower
2406     terms (<acronym>NT</acronym>), preferred terms, non-preferred terms, related
2407     terms, etc.
2408    </para>
2409
2410    <para>
2411     Basically a thesaurus dictionary replaces all non-preferred terms by one
2412     preferred term and, optionally, preserves the original terms for indexing
2413     as well.  <productname>PostgreSQL</>'s current implementation of the
2414     thesaurus dictionary is an extension of the synonym dictionary with added
2415     <firstterm>phrase</firstterm> support.  A thesaurus dictionary requires
2416     a configuration file of the following format:
2417
2418 <programlisting>
2419 # this is a comment
2420 sample word(s) : indexed word(s)
2421 more sample word(s) : more indexed word(s)
2422 ...
2423 </programlisting>
2424
2425     where  the colon (<symbol>:</symbol>) symbol acts as a delimiter between a
2426     a phrase and its replacement.
2427    </para>
2428
2429    <para>
2430     A thesaurus dictionary uses a <firstterm>subdictionary</firstterm> (which
2431     is specified in the dictionary's configuration) to normalize the input
2432     text before checking for phrase matches. It is only possible to select one
2433     subdictionary.  An error is reported if the subdictionary fails to
2434     recognize a word. In that case, you should remove the use of the word or
2435     teach the subdictionary about it.  You can place an asterisk
2436     (<symbol>*</symbol>) at the beginning of an indexed word to skip applying
2437     the subdictionary to it, but all sample words <emphasis>must</> be known
2438     to the subdictionary.
2439    </para>
2440
2441    <para>
2442     The thesaurus dictionary chooses the longest match if there are multiple
2443     phrases matching the input, and ties are broken by using the last
2444     definition.
2445    </para>
2446
2447    <para>
2448     Specific stop words recognized by the subdictionary cannot be
2449     specified;  instead use <literal>?</> to mark the location where any
2450     stop word can appear.  For example, assuming that <literal>a</> and
2451     <literal>the</> are stop words according to the subdictionary:
2452
2453 <programlisting>
2454 ? one ? two : swsw
2455 </programlisting>
2456
2457     matches <literal>a one the two</> and <literal>the one a two</>;
2458     both would be replaced by <literal>swsw</>.
2459    </para>
2460
2461    <para>
2462     Since a thesaurus dictionary has the capability to recognize phrases it
2463     must remember its state and interact with the parser. A thesaurus dictionary
2464     uses these assignments to check if it should handle the next word or stop
2465     accumulation.  The thesaurus dictionary must be configured
2466     carefully. For example, if the thesaurus dictionary is assigned to handle
2467     only the <literal>asciiword</literal> token, then a thesaurus dictionary
2468     definition like <literal>one 7</> will not work since token type
2469     <literal>uint</literal> is not assigned to the thesaurus dictionary.
2470    </para>
2471
2472    <caution>
2473     <para>
2474      Thesauruses are used during indexing so any change in the thesaurus
2475      dictionary's parameters <emphasis>requires</emphasis> reindexing.
2476      For most other dictionary types, small changes such as adding or
2477      removing stopwords does not force reindexing.
2478     </para>
2479    </caution>
2480
2481   <sect3 id="textsearch-thesaurus-config">
2482    <title>Thesaurus Configuration</title>
2483
2484    <para>
2485     To define a new thesaurus dictionary, use the <literal>thesaurus</>
2486     template.  For example:
2487
2488 <programlisting>
2489 CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
2490     TEMPLATE = thesaurus,
2491     DictFile = mythesaurus,
2492     Dictionary = pg_catalog.english_stem
2493 );
2494 </programlisting>
2495
2496     Here:
2497     <itemizedlist  spacing="compact" mark="bullet">
2498      <listitem>
2499       <para>
2500        <literal>thesaurus_simple</literal> is the new dictionary's name
2501       </para>
2502      </listitem>
2503      <listitem>
2504       <para>
2505        <literal>mythesaurus</literal> is the base name of the thesaurus
2506        configuration file.
2507        (Its full name will be <filename>$SHAREDIR/tsearch_data/mythesaurus.ths</>,
2508        where <literal>$SHAREDIR</> means the installation shared-data
2509        directory.)
2510       </para>
2511      </listitem>
2512      <listitem>
2513       <para>
2514        <literal>pg_catalog.english_stem</literal> is the subdictionary (here,
2515        a Snowball English stemmer) to use for thesaurus normalization.
2516        Notice that the subdictionary will have its own
2517        configuration (for example, stop words), which is not shown here.
2518       </para>
2519      </listitem>
2520     </itemizedlist>
2521
2522     Now it is possible to bind the thesaurus dictionary <literal>thesaurus_simple</literal>
2523     to the desired token types in a configuration, for example:
2524
2525 <programlisting>
2526 ALTER TEXT SEARCH CONFIGURATION russian
2527     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
2528     WITH thesaurus_simple;
2529 </programlisting>
2530    </para>
2531
2532   </sect3>
2533
2534   <sect3 id="textsearch-thesaurus-examples">
2535    <title>Thesaurus Example</title>
2536
2537    <para>
2538     Consider a simple astronomical thesaurus <literal>thesaurus_astro</literal>,
2539     which contains some astronomical word combinations:
2540
2541 <programlisting>
2542 supernovae stars : sn
2543 crab nebulae : crab
2544 </programlisting>
2545
2546     Below we create a dictionary and bind some token types to
2547     an astronomical thesaurus and English stemmer:
2548
2549 <programlisting>
2550 CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
2551     TEMPLATE = thesaurus,
2552     DictFile = thesaurus_astro,
2553     Dictionary = english_stem
2554 );
2555
2556 ALTER TEXT SEARCH CONFIGURATION russian
2557     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
2558     WITH thesaurus_astro, english_stem;
2559 </programlisting>
2560
2561     Now we can see how it works.
2562     <function>ts_lexize</function> is not very useful for testing a thesaurus,
2563     because it treats its input as a single token.  Instead we can use
2564     <function>plainto_tsquery</function> and <function>to_tsvector</function>
2565     which will break their input strings into multiple tokens:
2566
2567 <screen>
2568 SELECT plainto_tsquery('supernova star');
2569  plainto_tsquery
2570 -----------------
2571  'sn'
2572
2573 SELECT to_tsvector('supernova star');
2574  to_tsvector
2575 -------------
2576  'sn':1
2577 </screen>
2578
2579     In principle, one can use <function>to_tsquery</function> if you quote
2580     the argument:
2581
2582 <screen>
2583 SELECT to_tsquery('''supernova star''');
2584  to_tsquery
2585 ------------
2586  'sn'
2587 </screen>
2588
2589     Notice that <literal>supernova star</literal> matches <literal>supernovae
2590     stars</literal> in <literal>thesaurus_astro</literal> because we specified
2591     the <literal>english_stem</literal> stemmer in the thesaurus definition.
2592     The stemmer removed the <literal>e</> and <literal>s</>.
2593    </para>
2594
2595    <para>
2596     To index the original phrase as well as the substitute, just include it
2597     in the right-hand part of the definition:
2598
2599 <screen>
2600 supernovae stars : sn supernovae stars
2601
2602 SELECT plainto_tsquery('supernova star');
2603        plainto_tsquery
2604 -----------------------------
2605  'sn' &amp; 'supernova' &amp; 'star'
2606 </screen>
2607    </para>
2608
2609   </sect3>
2610
2611   </sect2>
2612
2613   <sect2 id="textsearch-ispell-dictionary">
2614    <title><application>Ispell</> Dictionary</title>
2615
2616    <para>
2617     The <application>Ispell</> dictionary template supports
2618     <firstterm>morphological dictionaries</>, which can normalize many
2619     different linguistic forms of a word into the same lexeme.  For example,
2620     an English <application>Ispell</> dictionary can match all declensions and
2621     conjugations of the search term <literal>bank</literal>, e.g.,
2622     <literal>banking</>, <literal>banked</>, <literal>banks</>,
2623     <literal>banks'</>, and <literal>bank's</>.
2624    </para>
2625
2626    <para>
2627     The standard <productname>PostgreSQL</productname> distribution does
2628     not include any <application>Ispell</> configuration files.
2629     Dictionaries for a large number of languages are available from <ulink
2630     url="http://ficus-www.cs.ucla.edu/geoff/ispell.html">Ispell</ulink>.
2631     Also, some more modern dictionary file formats are supported &mdash; <ulink
2632     url="http://en.wikipedia.org/wiki/MySpell">MySpell</ulink> (OO &lt; 2.0.1)
2633     and <ulink url="http://sourceforge.net/projects/hunspell/">Hunspell</ulink>
2634     (OO &gt;= 2.0.2).  A large list of dictionaries is available on the <ulink
2635     url="http://wiki.services.openoffice.org/wiki/Dictionaries">OpenOffice
2636     Wiki</ulink>.
2637    </para>
2638
2639    <para>
2640     To create an <application>Ispell</> dictionary, use the built-in
2641     <literal>ispell</literal> template and specify several parameters:
2642    </para>
2643
2644 <programlisting>
2645 CREATE TEXT SEARCH DICTIONARY english_ispell (
2646     TEMPLATE = ispell,
2647     DictFile = english,
2648     AffFile = english,
2649     StopWords = english
2650 );
2651 </programlisting>
2652
2653    <para>
2654     Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
2655     specify the base names of the dictionary, affixes, and stop-words files.
2656     The stop-words file has the same format explained above for the
2657     <literal>simple</> dictionary type.  The format of the other files is
2658     not specified here but is available from the above-mentioned web sites.
2659    </para>
2660
2661    <para>
2662     Ispell dictionaries usually recognize a limited set of words, so they
2663     should be followed by another broader dictionary; for
2664     example, a Snowball dictionary, which recognizes everything.
2665    </para>
2666
2667    <para>
2668     Ispell dictionaries support splitting compound words;
2669     a useful feature.
2670     Notice that the affix file should specify a special flag using the
2671     <literal>compoundwords controlled</literal> statement that marks dictionary
2672     words that can participate in compound formation:
2673
2674 <programlisting>
2675 compoundwords  controlled z
2676 </programlisting>
2677
2678     Here are some examples for the Norwegian language:
2679
2680 <programlisting>
2681 SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
2682    {over,buljong,terning,pakk,mester,assistent}
2683 SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
2684    {sjokoladefabrikk,sjokolade,fabrikk}
2685 </programlisting>
2686    </para>
2687
2688    <note>
2689     <para>
2690      <application>MySpell</> does not support compound words.
2691      <application>Hunspell</> has sophisticated support for compound words. At
2692      present, <productname>PostgreSQL</productname> implements only the basic
2693      compound word operations of Hunspell.
2694     </para>
2695    </note>
2696
2697   </sect2>
2698
2699   <sect2 id="textsearch-snowball-dictionary">
2700    <title><application>Snowball</> Dictionary</title>
2701
2702    <para>
2703     The <application>Snowball</> dictionary template is based on a project
2704     by Martin Porter, inventor of the popular Porter's stemming algorithm
2705     for the English language.  Snowball now provides stemming algorithms for
2706     many languages (see the <ulink url="http://snowball.tartarus.org">Snowball
2707     site</ulink> for more information).  Each algorithm understands how to
2708     reduce common variant forms of words to a base, or stem, spelling within
2709     its language.  A Snowball dictionary requires a <literal>language</>
2710     parameter to identify which stemmer to use, and optionally can specify a
2711     <literal>stopword</> file name that gives a list of words to eliminate.
2712     (<productname>PostgreSQL</productname>'s standard stopword lists are also
2713     provided by the Snowball project.)
2714     For example, there is a built-in definition equivalent to
2715
2716 <programlisting>
2717 CREATE TEXT SEARCH DICTIONARY english_stem (
2718     TEMPLATE = snowball,
2719     Language = english,
2720     StopWords = english
2721 );
2722 </programlisting>
2723
2724     The stopword file format is the same as already explained.
2725    </para>
2726
2727    <para>
2728     A <application>Snowball</> dictionary recognizes everything, whether
2729     or not it is able to simplify the word, so it should be placed
2730     at the end of the dictionary list. It is useless to have it
2731     before any other dictionary because a token will never pass through it to
2732     the next dictionary.
2733    </para>
2734
2735   </sect2>
2736
2737  </sect1>
2738
2739  <sect1 id="textsearch-configuration">
2740   <title>Configuration Example</title>
2741
2742    <para>
2743     A text search configuration specifies all options necessary to transform a
2744     document into a <type>tsvector</type>: the parser to use to break text
2745     into tokens, and the dictionaries to use to transform each token into a
2746     lexeme.  Every call of
2747     <function>to_tsvector</function> or <function>to_tsquery</function>
2748     needs a text search configuration to perform its processing.
2749     The configuration parameter
2750     <xref linkend="guc-default-text-search-config">
2751     specifies the name of the default configuration, which is the
2752     one used by text search functions if an explicit configuration
2753     parameter is omitted.
2754     It can be set in <filename>postgresql.conf</filename>, or set for an
2755     individual session using the <command>SET</> command.
2756    </para>
2757
2758    <para>
2759     Several predefined text search configurations are available, and
2760     you can create custom configurations easily.  To facilitate management
2761     of text search objects, a set of <acronym>SQL</acronym> commands
2762     is available, and there are several <application>psql</application> commands that display information
2763     about text search objects (<xref linkend="textsearch-psql">).
2764    </para>
2765
2766    <para>
2767     As an example we will create a configuration
2768     <literal>pg</literal>, starting by duplicating the built-in
2769     <literal>english</> configuration:
2770
2771 <programlisting>
2772 CREATE TEXT SEARCH CONFIGURATION public.pg ( COPY = pg_catalog.english );
2773 </programlisting>
2774    </para>
2775
2776    <para>
2777     We will use a PostgreSQL-specific synonym list
2778     and store it in <filename>$SHAREDIR/tsearch_data/pg_dict.syn</filename>.
2779     The file contents look like:
2780
2781 <programlisting>
2782 postgres    pg
2783 pgsql       pg
2784 postgresql  pg
2785 </programlisting>
2786
2787     We define the synonym dictionary like this:
2788
2789 <programlisting>
2790 CREATE TEXT SEARCH DICTIONARY pg_dict (
2791     TEMPLATE = synonym,
2792     SYNONYMS = pg_dict
2793 );
2794 </programlisting>
2795
2796     Next we register the <productname>Ispell</> dictionary
2797     <literal>english_ispell</literal>, which has its own configuration files:
2798
2799 <programlisting>
2800 CREATE TEXT SEARCH DICTIONARY english_ispell (
2801     TEMPLATE = ispell,
2802     DictFile = english,
2803     AffFile = english,
2804     StopWords = english
2805 );
2806 </programlisting>
2807
2808     Now we can set up the mappings for words in configuration
2809     <literal>pg</>:
2810
2811 <programlisting>
2812 ALTER TEXT SEARCH CONFIGURATION pg
2813     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
2814                       word, hword, hword_part
2815     WITH pg_dict, english_ispell, english_stem;
2816 </programlisting>
2817
2818     We choose not to index or search some token types that the built-in
2819     configuration does handle:
2820
2821 <programlisting>
2822 ALTER TEXT SEARCH CONFIGURATION pg
2823     DROP MAPPING FOR email, url, url_path, sfloat, float;
2824 </programlisting>
2825    </para>
2826
2827    <para>
2828     Now we can test our configuration:
2829
2830 <programlisting>
2831 SELECT * FROM ts_debug('public.pg', '
2832 PostgreSQL, the highly scalable, SQL compliant, open source object-relational
2833 database management system, is now undergoing beta testing of the next
2834 version of our software.
2835 ');
2836 </programlisting>
2837    </para>
2838
2839    <para>
2840     The next step is to set the session to use the new configuration, which was
2841     created in the <literal>public</> schema:
2842
2843 <screen>
2844 =&gt; \dF
2845    List of text search configurations
2846  Schema  | Name | Description
2847 ---------+------+-------------
2848  public  | pg   |
2849
2850 SET default_text_search_config = 'public.pg';
2851 SET
2852
2853 SHOW default_text_search_config;
2854  default_text_search_config
2855 ----------------------------
2856  public.pg
2857 </screen>
2858   </para>
2859
2860  </sect1>
2861
2862  <sect1 id="textsearch-debugging">
2863   <title>Testing and Debugging Text Search</title>
2864
2865   <para>
2866    The behavior of a custom text search configuration can easily become
2867    confusing.  The functions described
2868    in this section are useful for testing text search objects.  You can
2869    test a complete configuration, or test parsers and dictionaries separately.
2870   </para>
2871
2872   <sect2 id="textsearch-configuration-testing">
2873    <title>Configuration Testing</title>
2874
2875   <para>
2876    The function <function>ts_debug</function> allows easy testing of a
2877    text search configuration.
2878   </para>
2879
2880   <indexterm>
2881    <primary>ts_debug</primary>
2882   </indexterm>
2883
2884 <synopsis>
2885 ts_debug(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">document</replaceable> <type>text</>,
2886          OUT <replaceable class="PARAMETER">alias</> <type>text</>,
2887          OUT <replaceable class="PARAMETER">description</> <type>text</>,
2888          OUT <replaceable class="PARAMETER">token</> <type>text</>,
2889          OUT <replaceable class="PARAMETER">dictionaries</> <type>regdictionary[]</>,
2890          OUT <replaceable class="PARAMETER">dictionary</> <type>regdictionary</>,
2891          OUT <replaceable class="PARAMETER">lexemes</> <type>text[]</>)
2892          returns setof record
2893 </synopsis>
2894
2895   <para>
2896    <function>ts_debug</> displays information about every token of
2897    <replaceable class="PARAMETER">document</replaceable> as produced by the
2898    parser and processed by the configured dictionaries.  It uses the
2899    configuration specified by <replaceable
2900    class="PARAMETER">config</replaceable>,
2901    or <varname>default_text_search_config</varname> if that argument is
2902    omitted.
2903   </para>
2904
2905   <para>
2906    <function>ts_debug</> returns one row for each token identified in the text
2907    by the parser.  The columns returned are
2908
2909     <itemizedlist  spacing="compact" mark="bullet">
2910      <listitem>
2911       <para>
2912        <replaceable>alias</> <type>text</> &mdash; short name of the token type
2913       </para>
2914      </listitem>
2915      <listitem>
2916       <para>
2917        <replaceable>description</> <type>text</> &mdash; description of the
2918        token type
2919       </para>
2920      </listitem>
2921      <listitem>
2922       <para>
2923        <replaceable>token</> <type>text</> &mdash; text of the token
2924       </para>
2925      </listitem>
2926      <listitem>
2927       <para>
2928        <replaceable>dictionaries</> <type>regdictionary[]</> &mdash; the
2929        dictionaries selected by the configuration for this token type
2930       </para>
2931      </listitem>
2932      <listitem>
2933       <para>
2934        <replaceable>dictionary</> <type>regdictionary</> &mdash; the dictionary
2935        that recognized the token, or <literal>NULL</> if none did
2936       </para>
2937      </listitem>
2938      <listitem>
2939       <para>
2940        <replaceable>lexemes</> <type>text[]</> &mdash; the lexeme(s) produced
2941        by the dictionary that recognized the token, or <literal>NULL</> if
2942        none did; an empty array (<literal>{}</>) means it was recognized as a
2943        stop word
2944       </para>
2945      </listitem>
2946     </itemizedlist>
2947   </para>
2948
2949   <para>
2950    Here is a simple example:
2951
2952 <screen>
2953 SELECT * FROM ts_debug('english','a fat  cat sat on a mat - it ate a fat rats');
2954    alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
2955 -----------+-----------------+-------+----------------+--------------+---------
2956  asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
2957  blank     | Space symbols   |       | {}             |              |
2958  asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
2959  blank     | Space symbols   |       | {}             |              |
2960  asciiword | Word, all ASCII | cat   | {english_stem} | english_stem | {cat}
2961  blank     | Space symbols   |       | {}             |              |
2962  asciiword | Word, all ASCII | sat   | {english_stem} | english_stem | {sat}
2963  blank     | Space symbols   |       | {}             |              |
2964  asciiword | Word, all ASCII | on    | {english_stem} | english_stem | {}
2965  blank     | Space symbols   |       | {}             |              |
2966  asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
2967  blank     | Space symbols   |       | {}             |              |
2968  asciiword | Word, all ASCII | mat   | {english_stem} | english_stem | {mat}
2969  blank     | Space symbols   |       | {}             |              |
2970  blank     | Space symbols   | -     | {}             |              |
2971  asciiword | Word, all ASCII | it    | {english_stem} | english_stem | {}
2972  blank     | Space symbols   |       | {}             |              |
2973  asciiword | Word, all ASCII | ate   | {english_stem} | english_stem | {ate}
2974  blank     | Space symbols   |       | {}             |              |
2975  asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
2976  blank     | Space symbols   |       | {}             |              |
2977  asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
2978  blank     | Space symbols   |       | {}             |              |
2979  asciiword | Word, all ASCII | rats  | {english_stem} | english_stem | {rat}
2980 </screen>
2981   </para>
2982
2983   <para>
2984    For a more extensive demonstration, we
2985    first create a <literal>public.english</literal> configuration and
2986    Ispell dictionary for the English language:
2987   </para>
2988
2989 <programlisting>
2990 CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );
2991
2992 CREATE TEXT SEARCH DICTIONARY english_ispell (
2993     TEMPLATE = ispell,
2994     DictFile = english,
2995     AffFile = english,
2996     StopWords = english
2997 );
2998
2999 ALTER TEXT SEARCH CONFIGURATION public.english
3000    ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;
3001 </programlisting>
3002
3003 <screen>
3004 SELECT * FROM ts_debug('public.english','The Brightest supernovaes');
3005    alias   |   description   |    token    |         dictionaries          |   dictionary   |   lexemes
3006 -----------+-----------------+-------------+-------------------------------+----------------+-------------
3007  asciiword | Word, all ASCII | The         | {english_ispell,english_stem} | english_ispell | {}
3008  blank     | Space symbols   |             | {}                            |                |
3009  asciiword | Word, all ASCII | Brightest   | {english_ispell,english_stem} | english_ispell | {bright}
3010  blank     | Space symbols   |             | {}                            |                |
3011  asciiword | Word, all ASCII | supernovaes | {english_ispell,english_stem} | english_stem   | {supernova}
3012 </screen>
3013
3014   <para>
3015    In this example, the word <literal>Brightest</> was recognized by the
3016    parser as an <literal>ASCII word</literal> (alias <literal>asciiword</literal>).
3017    For this token type the dictionary list is
3018    <literal>english_ispell</> and
3019    <literal>english_stem</literal>. The word was recognized by
3020    <literal>english_ispell</literal>, which reduced it to the noun
3021    <literal>bright</literal>. The word <literal>supernovaes</literal> is
3022    unknown to the <literal>english_ispell</literal> dictionary so it
3023    was passed to the next dictionary, and, fortunately, was recognized (in
3024    fact, <literal>english_stem</literal> is a Snowball dictionary which
3025    recognizes everything; that is why it was placed at the end of the
3026    dictionary list).
3027   </para>
3028
3029   <para>
3030    The word <literal>The</literal> was recognized by the
3031    <literal>english_ispell</literal> dictionary as a stop word (<xref
3032    linkend="textsearch-stopwords">) and will not be indexed.
3033    The spaces are discarded too, since the configuration provides no
3034    dictionaries at all for them.
3035   </para>
3036
3037   <para>
3038    You can reduce the width of the output by explicitly specifying which columns
3039    you want to see:
3040
3041 <screen>
3042 SELECT alias, token, dictionary, lexemes
3043 FROM ts_debug('public.english','The Brightest supernovaes');
3044    alias   |    token    |   dictionary   |   lexemes
3045 -----------+-------------+----------------+-------------
3046  asciiword | The         | english_ispell | {}
3047  blank     |             |                |
3048  asciiword | Brightest   | english_ispell | {bright}
3049  blank     |             |                |
3050  asciiword | supernovaes | english_stem   | {supernova}
3051 </screen>
3052   </para>
3053
3054   </sect2>
3055
3056   <sect2 id="textsearch-parser-testing">
3057    <title>Parser Testing</title>
3058
3059   <para>
3060    The following functions allow direct testing of a text search parser.
3061   </para>
3062
3063   <indexterm>
3064    <primary>ts_parse</primary>
3065   </indexterm>
3066
3067 <synopsis>
3068 ts_parse(<replaceable class="PARAMETER">parser_name</replaceable> <type>text</>, <replaceable class="PARAMETER">document</replaceable> <type>text</>,
3069          OUT <replaceable class="PARAMETER">tokid</> <type>integer</>, OUT <replaceable class="PARAMETER">token</> <type>text</>) returns <type>setof record</>
3070 ts_parse(<replaceable class="PARAMETER">parser_oid</replaceable> <type>oid</>, <replaceable class="PARAMETER">document</replaceable> <type>text</>,
3071          OUT <replaceable class="PARAMETER">tokid</> <type>integer</>, OUT <replaceable class="PARAMETER">token</> <type>text</>) returns <type>setof record</>
3072 </synopsis>
3073
3074   <para>
3075    <function>ts_parse</> parses the given <replaceable>document</replaceable>
3076    and returns a series of records, one for each token produced by
3077    parsing. Each record includes a <varname>tokid</varname> showing the
3078    assigned token type and a <varname>token</varname> which is the text of the
3079    token.  For example:
3080
3081 <screen>
3082 SELECT * FROM ts_parse('default', '123 - a number');
3083  tokid | token
3084 -------+--------
3085     22 | 123
3086     12 |
3087     12 | -
3088      1 | a
3089     12 |
3090      1 | number
3091 </screen>
3092   </para>
3093
3094   <indexterm>
3095    <primary>ts_token_type</primary>
3096   </indexterm>
3097
3098 <synopsis>
3099 ts_token_type(<replaceable class="PARAMETER">parser_name</> <type>text</>, OUT <replaceable class="PARAMETER">tokid</> <type>integer</>,
3100               OUT <replaceable class="PARAMETER">alias</> <type>text</>, OUT <replaceable class="PARAMETER">description</> <type>text</>) returns <type>setof record</>
3101 ts_token_type(<replaceable class="PARAMETER">parser_oid</> <type>oid</>, OUT <replaceable class="PARAMETER">tokid</> <type>integer</>,
3102               OUT <replaceable class="PARAMETER">alias</> <type>text</>, OUT <replaceable class="PARAMETER">description</> <type>text</>) returns <type>setof record</>
3103 </synopsis>
3104
3105   <para>
3106    <function>ts_token_type</> returns a table which describes each type of
3107    token the specified parser can recognize.  For each token type, the table
3108    gives the integer <varname>tokid</varname> that the parser uses to label a
3109    token of that type, the <varname>alias</varname> that names the token type
3110    in configuration commands, and a short <varname>description</varname>.  For
3111    example:
3112
3113 <screen>
3114 SELECT * FROM ts_token_type('default');
3115  tokid |      alias      |               description
3116 -------+-----------------+------------------------------------------
3117      1 | asciiword       | Word, all ASCII
3118      2 | word            | Word, all letters
3119      3 | numword         | Word, letters and digits
3120      4 | email           | Email address
3121      5 | url             | URL
3122      6 | host            | Host
3123      7 | sfloat          | Scientific notation
3124      8 | version         | Version number
3125      9 | hword_numpart   | Hyphenated word part, letters and digits
3126     10 | hword_part      | Hyphenated word part, all letters
3127     11 | hword_asciipart | Hyphenated word part, all ASCII
3128     12 | blank           | Space symbols
3129     13 | tag             | XML tag
3130     14 | protocol        | Protocol head
3131     15 | numhword        | Hyphenated word, letters and digits
3132     16 | asciihword      | Hyphenated word, all ASCII
3133     17 | hword           | Hyphenated word, all letters
3134     18 | url_path        | URL path
3135     19 | file            | File or path name
3136     20 | float           | Decimal notation
3137     21 | int             | Signed integer
3138     22 | uint            | Unsigned integer
3139     23 | entity          | XML entity
3140 </screen>
3141    </para>
3142
3143   </sect2>
3144
3145   <sect2 id="textsearch-dictionary-testing">
3146    <title>Dictionary Testing</title>
3147
3148    <para>
3149     The <function>ts_lexize</> function facilitates dictionary testing.
3150    </para>
3151
3152    <indexterm>
3153     <primary>ts_lexize</primary>
3154    </indexterm>
3155
3156 <synopsis>
3157 ts_lexize(<replaceable class="PARAMETER">dict</replaceable> <type>regdictionary</>, <replaceable class="PARAMETER">token</replaceable> <type>text</>) returns <type>text[]</>
3158 </synopsis>
3159
3160    <para>
3161     <function>ts_lexize</> returns an array of lexemes if the input
3162     <replaceable>token</replaceable> is known to the dictionary,
3163     or an empty array if the token
3164     is known to the dictionary but it is a stop word, or
3165     <literal>NULL</literal> if it is an unknown word.
3166    </para>
3167
3168    <para>
3169     Examples:
3170
3171 <screen>
3172 SELECT ts_lexize('english_stem', 'stars');
3173  ts_lexize
3174 -----------
3175  {star}
3176
3177 SELECT ts_lexize('english_stem', 'a');
3178  ts_lexize
3179 -----------
3180  {}
3181 </screen>
3182    </para>
3183
3184    <note>
3185     <para>
3186      The <function>ts_lexize</function> function expects a single
3187      <emphasis>token</emphasis>, not text. Here is a case
3188      where this can be confusing:
3189
3190 <screen>
3191 SELECT ts_lexize('thesaurus_astro','supernovae stars') is null;
3192  ?column?
3193 ----------
3194  t
3195 </screen>
3196
3197      The thesaurus dictionary <literal>thesaurus_astro</literal> does know the
3198      phrase <literal>supernovae stars</literal>, but <function>ts_lexize</>
3199      fails since it does not parse the input text but treats it as a single
3200      token. Use <function>plainto_tsquery</> or <function>to_tsvector</> to
3201      test thesaurus dictionaries, for example:
3202
3203 <screen>
3204 SELECT plainto_tsquery('supernovae stars');
3205  plainto_tsquery
3206 -----------------
3207  'sn'
3208 </screen>
3209     </para>
3210    </note>
3211
3212   </sect2>
3213
3214  </sect1>
3215
3216  <sect1 id="textsearch-indexes">
3217   <title>GiST and GIN Index Types</title>
3218
3219   <indexterm zone="textsearch-indexes">
3220    <primary>text search</primary>
3221    <secondary>indexes</secondary>
3222   </indexterm>
3223
3224   <para>
3225    There are two kinds of indexes that can be used to speed up full text
3226    searches.
3227    Note that indexes are not mandatory for full text searching, but in
3228    cases where a column is searched on a regular basis, an index is
3229    usually desirable.
3230
3231    <variablelist>
3232
3233     <varlistentry>
3234
3235      <indexterm zone="textsearch-indexes">
3236       <primary>index</primary>
3237       <secondary>GiST</secondary>
3238       <tertiary>text search</tertiary>
3239      </indexterm>
3240
3241      <term>
3242 <synopsis>
3243 CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING gist(<replaceable>column</replaceable>);
3244 </synopsis>
3245      </term>
3246
3247      <listitem>
3248       <para>
3249        Creates a GiST (Generalized Search Tree)-based index.
3250        The <replaceable>column</replaceable> can be of <type>tsvector</> or
3251        <type>tsquery</> type.
3252       </para>
3253      </listitem>
3254     </varlistentry>
3255
3256     <varlistentry>
3257
3258      <indexterm zone="textsearch-indexes">
3259       <primary>index</primary>
3260       <secondary>GIN</secondary>
3261       <tertiary>text search</tertiary>
3262      </indexterm>
3263
3264      <term>
3265 <synopsis>
3266 CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING gin(<replaceable>column</replaceable>);
3267 </synopsis>
3268      </term>
3269
3270      <listitem>
3271       <para>
3272        Creates a GIN (Generalized Inverted Index)-based index.
3273        The <replaceable>column</replaceable> must be of <type>tsvector</> type.
3274       </para>
3275      </listitem>
3276     </varlistentry>
3277
3278    </variablelist>
3279   </para>
3280
3281   <para>
3282    There are substantial performance differences between the two index types,
3283    so it is important to understand their characteristics.
3284   </para>
3285
3286   <para>
3287    A GiST index is <firstterm>lossy</firstterm>, meaning that the index
3288    may produce false matches, and it is necessary
3289    to check the actual table row to eliminate such false matches.
3290    (<productname>PostgreSQL</productname> does this automatically when needed.)
3291    GiST indexes are lossy because each document is represented in the
3292    index by a fixed-length signature. The signature is generated by hashing
3293    each word into a single bit in an n-bit string, with all these bits OR-ed
3294    together to produce an n-bit document signature.  When two words hash to
3295    the same bit position there will be a false match.  If all words in
3296    the query have matches (real or false) then the table row must be
3297    retrieved to see if the match is correct.
3298   </para>
3299
3300   <para>
3301    Lossiness causes performance degradation due to unnecessary fetches of table
3302    records that turn out to be false matches.  Since random access to table
3303    records is slow, this limits the usefulness of GiST indexes.  The
3304    likelihood of false matches depends on several factors, in particular the
3305    number of unique words, so using dictionaries to reduce this number is
3306    recommended.
3307   </para>
3308
3309   <para>
3310    GIN indexes are not lossy for standard queries, but their performance
3311    depends logarithmically on the number of unique words.
3312    (However, GIN indexes store only the words (lexemes) of <type>tsvector</>
3313    values, and not their weight labels.  Thus a table row recheck is needed
3314    when using a query that involves weights.)
3315   </para>
3316
3317   <para>
3318    In choosing which index type to use, GiST or GIN, consider these
3319    performance differences:
3320
3321    <itemizedlist  spacing="compact" mark="bullet">
3322     <listitem>
3323      <para>
3324       GIN index lookups are about three times faster than GiST
3325      </para>
3326     </listitem>
3327     <listitem>
3328      <para>
3329       GIN indexes take about three times longer to build than GiST
3330      </para>
3331     </listitem>
3332     <listitem>
3333      <para>
3334       GIN indexes are moderately slower to update than GiST indexes, but
3335       about 10 times slower if fast-update support was disabled
3336       (see <xref linkend="gin-fast-update"> for details)
3337      </para>
3338     </listitem>
3339     <listitem>
3340      <para>
3341       GIN indexes are two-to-three times larger than GiST indexes
3342      </para>
3343     </listitem>
3344    </itemizedlist>
3345   </para>
3346
3347   <para>
3348    As a rule of thumb, <acronym>GIN</acronym> indexes are best for static data
3349    because lookups are faster.  For dynamic data, GiST indexes are
3350    faster to update.  Specifically, <acronym>GiST</acronym> indexes are very
3351    good for dynamic data and fast if the number of unique words (lexemes) is
3352    under 100,000, while <acronym>GIN</acronym> indexes will handle 100,000+
3353    lexemes better but are slower to update.
3354   </para>
3355
3356   <para>
3357    Note that <acronym>GIN</acronym> index build time can often be improved
3358    by increasing <xref linkend="guc-maintenance-work-mem">, while
3359    <acronym>GiST</acronym> index build time is not sensitive to that
3360    parameter.
3361   </para>
3362
3363   <para>
3364    Partitioning of big collections and the proper use of GiST and GIN indexes
3365    allows the implementation of very fast searches with online update.
3366    Partitioning can be done at the database level using table inheritance,
3367    or by distributing documents over
3368    servers and collecting search results using the <xref linkend="dblink">
3369    module. The latter is possible because ranking functions use
3370    only local information.
3371   </para>
3372
3373  </sect1>
3374
3375  <sect1 id="textsearch-psql">
3376   <title><application>psql</> Support</title>
3377
3378   <para>
3379    Information about text search configuration objects can be obtained
3380    in <application>psql</application> using a set of commands:
3381 <synopsis>
3382 \dF{d,p,t}<optional>+</optional> <optional>PATTERN</optional>
3383 </synopsis>
3384    An optional <literal>+</literal> produces more details.
3385   </para>
3386
3387   <para>
3388    The optional parameter <literal>PATTERN</literal> can be the name of
3389    a text search object, optionally schema-qualified.  If
3390    <literal>PATTERN</literal> is omitted then information about all
3391    visible objects will be displayed.  <literal>PATTERN</literal> can be a
3392    regular expression and can provide <emphasis>separate</emphasis> patterns
3393    for the schema and object names.  The following examples illustrate this:
3394
3395 <screen>
3396 =&gt; \dF *fulltext*
3397        List of text search configurations
3398  Schema |  Name        | Description
3399 --------+--------------+-------------
3400  public | fulltext_cfg |
3401 </screen>
3402
3403 <screen>
3404 =&gt; \dF *.fulltext*
3405        List of text search configurations
3406  Schema   |  Name        | Description
3407 ----------+----------------------------
3408  fulltext | fulltext_cfg |
3409  public   | fulltext_cfg |
3410 </screen>
3411
3412    The available commands are:
3413   </para>
3414
3415   <variablelist>
3416    <varlistentry>
3417     <term><synopsis>\dF<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
3418     <listitem>
3419      <para>
3420       List text search configurations (add <literal>+</> for more detail).
3421 <screen>
3422 =&gt; \dF russian
3423             List of text search configurations
3424    Schema   |  Name   |            Description
3425 ------------+---------+------------------------------------
3426  pg_catalog | russian | configuration for russian language
3427
3428 =&gt; \dF+ russian
3429 Text search configuration "pg_catalog.russian"
3430 Parser: "pg_catalog.default"
3431       Token      | Dictionaries
3432 -----------------+--------------
3433  asciihword      | english_stem
3434  asciiword       | english_stem
3435  email           | simple
3436  file            | simple
3437  float           | simple
3438  host            | simple
3439  hword           | russian_stem
3440  hword_asciipart | english_stem
3441  hword_numpart   | simple
3442  hword_part      | russian_stem
3443  int             | simple
3444  numhword        | simple
3445  numword         | simple
3446  sfloat          | simple
3447  uint            | simple
3448  url             | simple
3449  url_path        | simple
3450  version         | simple
3451  word            | russian_stem
3452 </screen>
3453      </para>
3454     </listitem>
3455    </varlistentry>
3456
3457    <varlistentry>
3458     <term><synopsis>\dFd<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
3459     <listitem>
3460      <para>
3461       List text search dictionaries (add <literal>+</> for more detail).
3462 <screen>
3463 =&gt; \dFd
3464                             List of text search dictionaries
3465    Schema   |      Name       |                        Description
3466 ------------+-----------------+-----------------------------------------------------------
3467  pg_catalog | danish_stem     | snowball stemmer for danish language
3468  pg_catalog | dutch_stem      | snowball stemmer for dutch language
3469  pg_catalog | english_stem    | snowball stemmer for english language
3470  pg_catalog | finnish_stem    | snowball stemmer for finnish language
3471  pg_catalog | french_stem     | snowball stemmer for french language
3472  pg_catalog | german_stem     | snowball stemmer for german language
3473  pg_catalog | hungarian_stem  | snowball stemmer for hungarian language
3474  pg_catalog | italian_stem    | snowball stemmer for italian language
3475  pg_catalog | norwegian_stem  | snowball stemmer for norwegian language
3476  pg_catalog | portuguese_stem | snowball stemmer for portuguese language
3477  pg_catalog | romanian_stem   | snowball stemmer for romanian language
3478  pg_catalog | russian_stem    | snowball stemmer for russian language
3479  pg_catalog | simple          | simple dictionary: just lower case and check for stopword
3480  pg_catalog | spanish_stem    | snowball stemmer for spanish language
3481  pg_catalog | swedish_stem    | snowball stemmer for swedish language
3482  pg_catalog | turkish_stem    | snowball stemmer for turkish language
3483 </screen>
3484      </para>
3485     </listitem>
3486    </varlistentry>
3487
3488    <varlistentry>
3489    <term><synopsis>\dFp<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
3490     <listitem>
3491      <para>
3492       List text search parsers (add <literal>+</> for more detail).
3493 <screen>
3494 =&gt; \dFp
3495         List of text search parsers
3496    Schema   |  Name   |     Description
3497 ------------+---------+---------------------
3498  pg_catalog | default | default word parser
3499 =&gt; \dFp+
3500     Text search parser "pg_catalog.default"
3501      Method      |    Function    | Description
3502 -----------------+----------------+-------------
3503  Start parse     | prsd_start     |
3504  Get next token  | prsd_nexttoken |
3505  End parse       | prsd_end       |
3506  Get headline    | prsd_headline  |
3507  Get token types | prsd_lextype   |
3508
3509         Token types for parser "pg_catalog.default"
3510    Token name    |               Description
3511 -----------------+------------------------------------------
3512  asciihword      | Hyphenated word, all ASCII
3513  asciiword       | Word, all ASCII
3514  blank           | Space symbols
3515  email           | Email address
3516  entity          | XML entity
3517  file            | File or path name
3518  float           | Decimal notation
3519  host            | Host
3520  hword           | Hyphenated word, all letters
3521  hword_asciipart | Hyphenated word part, all ASCII
3522  hword_numpart   | Hyphenated word part, letters and digits
3523  hword_part      | Hyphenated word part, all letters
3524  int             | Signed integer
3525  numhword        | Hyphenated word, letters and digits
3526  numword         | Word, letters and digits
3527  protocol        | Protocol head
3528  sfloat          | Scientific notation
3529  tag             | XML tag
3530  uint            | Unsigned integer
3531  url             | URL
3532  url_path        | URL path
3533  version         | Version number
3534  word            | Word, all letters
3535 (23 rows)
3536 </screen>
3537      </para>
3538     </listitem>
3539    </varlistentry>
3540
3541    <varlistentry>
3542    <term><synopsis>\dFt<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
3543     <listitem>
3544      <para>
3545       List text search templates (add <literal>+</> for more detail).
3546 <screen>
3547 =&gt; \dFt
3548                            List of text search templates
3549    Schema   |   Name    |                        Description
3550 ------------+-----------+-----------------------------------------------------------
3551  pg_catalog | ispell    | ispell dictionary
3552  pg_catalog | simple    | simple dictionary: just lower case and check for stopword
3553  pg_catalog | snowball  | snowball stemmer
3554  pg_catalog | synonym   | synonym dictionary: replace word by its synonym
3555  pg_catalog | thesaurus | thesaurus dictionary: phrase by phrase substitution
3556 </screen>
3557      </para>
3558     </listitem>
3559    </varlistentry>
3560   </variablelist>
3561
3562  </sect1>
3563
3564  <sect1 id="textsearch-limitations">
3565   <title>Limitations</title>
3566
3567   <para>
3568    The current limitations of <productname>PostgreSQL</productname>'s
3569    text search features are:
3570    <itemizedlist  spacing="compact" mark="bullet">
3571     <listitem>
3572      <para>The length of each lexeme must be less than 2K bytes</para>
3573     </listitem>
3574     <listitem>
3575      <para>The length of a <type>tsvector</type> (lexemes + positions) must be
3576      less than 1 megabyte</para>
3577     </listitem>
3578     <listitem>
3579      <!-- TODO: number of lexemes in what?  This is unclear -->
3580      <para>The number of lexemes must be less than
3581      2<superscript>64</superscript></para>
3582     </listitem>
3583     <listitem>
3584      <para>Position values in <type>tsvector</> must be greater than 0 and
3585      no more than 16,383</para>
3586     </listitem>
3587     <listitem>
3588      <para>No more than 256 positions per lexeme</para>
3589     </listitem>
3590     <listitem>
3591      <para>The number of nodes (lexemes + operators) in a <type>tsquery</type>
3592      must be less than 32,768</para>
3593     </listitem>
3594    </itemizedlist>
3595   </para>
3596
3597   <para>
3598    For comparison, the <productname>PostgreSQL</productname> 8.1 documentation
3599    contained 10,441 unique words, a total of 335,420 words, and the most
3600    frequent word <quote>postgresql</> was mentioned 6,127 times in 655
3601    documents.
3602   </para>
3603
3604    <!-- TODO we need to put a date on these numbers? -->
3605   <para>
3606    Another example &mdash; the <productname>PostgreSQL</productname> mailing
3607    list archives contained 910,989 unique words with 57,491,343 lexemes in
3608    461,020 messages.
3609   </para>
3610
3611  </sect1>
3612
3613  <sect1 id="textsearch-migration">
3614   <title>Migration from Pre-8.3 Text Search</title>
3615
3616   <para>
3617    Applications that use the <xref linkend="tsearch2">
3618    module for text searching will need some adjustments to work
3619    with the
3620    built-in features:
3621   </para>
3622
3623   <itemizedlist>
3624    <listitem>
3625     <para>
3626      Some functions have been renamed or had small adjustments in their
3627      argument lists, and all of them are now in the <literal>pg_catalog</>
3628      schema, whereas in a previous installation they would have been in
3629      <literal>public</> or another non-system schema.  There is a new
3630      version of <application>tsearch2</>
3631      that provides a compatibility layer to solve most problems in this
3632      area.
3633     </para>
3634    </listitem>
3635
3636    <listitem>
3637     <para>
3638      The old <application>tsearch2</> functions and other objects
3639      <emphasis>must</> be suppressed when loading <application>pg_dump</>
3640      output from a pre-8.3 database.  While many of them won't load anyway,
3641      a few will and then cause problems.  One simple way to deal with this
3642      is to load the new <application>tsearch2</> module before restoring
3643      the dump; then it will block the old objects from being loaded.
3644     </para>
3645    </listitem>
3646
3647    <listitem>
3648     <para>
3649      Text search configuration setup is completely different now.
3650      Instead of manually inserting rows into configuration tables,
3651      search is configured through the specialized SQL commands shown
3652      earlier in this chapter.  There is no automated
3653      support for converting an existing custom configuration for 8.3;
3654      you're on your own here.
3655     </para>
3656    </listitem>
3657
3658    <listitem>
3659     <para>
3660      Most types of dictionaries rely on some outside-the-database
3661      configuration files.  These are largely compatible with pre-8.3
3662      usage, but note the following differences:
3663
3664      <itemizedlist  spacing="compact" mark="bullet">
3665       <listitem>
3666        <para>
3667         Configuration files now must be placed in a single specified
3668         directory (<filename>$SHAREDIR/tsearch_data</>), and must have
3669         a specific extension depending on the type of file, as noted
3670         previously in the descriptions of the various dictionary types.
3671         This restriction was added to forestall security problems.
3672        </para>
3673       </listitem>
3674
3675       <listitem>
3676        <para>
3677         Configuration files must be encoded in UTF-8 encoding,
3678         regardless of what database encoding is used.
3679        </para>
3680       </listitem>
3681
3682       <listitem>
3683        <para>
3684         In thesaurus configuration files, stop words must be marked with
3685         <literal>?</>.
3686        </para>
3687       </listitem>
3688      </itemizedlist>
3689     </para>
3690    </listitem>
3691
3692   </itemizedlist>
3693
3694  </sect1>
3695
3696 </chapter>