granicus.if.org Git - postgresql/blob - doc/src/sgml/textsearch.sgml

   1 <!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.49 2009/04/14 00:49:56 tgl Exp $ -->
   2
   3 <chapter id="textsearch">
   4  <title id="textsearch-title">Full Text Search</title>
   5
   6   <indexterm zone="textsearch">
   7    <primary>full text search</primary>
   8   </indexterm>
   9
  10   <indexterm zone="textsearch">
  11    <primary>text search</primary>
  12   </indexterm>
  13
  14  <sect1 id="textsearch-intro">
  15   <title>Introduction</title>
  16
  17   <para>
  18    Full Text Searching (or just <firstterm>text search</firstterm>) provides
  19    the capability to identify natural-language <firstterm>documents</> that
  20    satisfy a <firstterm>query</firstterm>, and optionally to sort them by
  21    relevance to the query.  The most common type of search
  22    is to find all documents containing given <firstterm>query terms</firstterm>
  23    and return them in order of their <firstterm>similarity</firstterm> to the
  24    query.  Notions of <varname>query</varname> and
  25    <varname>similarity</varname> are very flexible and depend on the specific
  26    application. The simplest search considers <varname>query</varname> as a
  27    set of words and <varname>similarity</varname> as the frequency of query
  28    words in the document.
  29   </para>
  30
  31   <para>
  32    Textual search operators have existed in databases for years.
  33    <productname>PostgreSQL</productname> has
  34    <literal>~</literal>, <literal>~*</literal>, <literal>LIKE</literal>, and
  35    <literal>ILIKE</literal> operators for textual data types, but they lack
  36    many essential properties required by modern information systems:
  37   </para>
  38
  39   <itemizedlist  spacing="compact" mark="bullet">
  40    <listitem>
  41     <para>
  42      There is no linguistic support, even for English.  Regular expressions
  43      are not sufficient because they cannot easily handle derived words, e.g.,
  44      <literal>satisfies</literal> and <literal>satisfy</literal>. You might
  45      miss documents that contain <literal>satisfies</literal>, although you
  46      probably would like to find them when searching for
  47      <literal>satisfy</literal>. It is possible to use <literal>OR</literal>
  48      to search for multiple derived forms, but this is tedious and error-prone
  49      (some words can have several thousand derivatives).
  50     </para>
  51    </listitem>
  52
  53    <listitem>
  54     <para>
  55      They provide no ordering (ranking) of search results, which makes them
  56      ineffective when thousands of matching documents are found.
  57     </para>
  58    </listitem>
  59
  60    <listitem>
  61     <para>
  62      They tend to be slow because there is no index support, so they must
  63      process all documents for every search.
  64     </para>
  65    </listitem>
  66   </itemizedlist>
  67
  68   <para>
  69    Full text indexing allows documents to be <emphasis>preprocessed</emphasis>
  70    and an index saved for later rapid searching. Preprocessing includes:
  71   </para>
  72
  73   <itemizedlist  mark="none">
  74    <listitem>
  75     <para>
  76      <emphasis>Parsing documents into <firstterm>tokens</></emphasis>. It is
  77      useful to identify various classes of tokens, e.g. numbers, words,
  78      complex words, email addresses, so that they can be processed
  79      differently.  In principle token classes depend on the specific
  80      application, but for most purposes it is adequate to use a predefined
  81      set of classes.
  82      <productname>PostgreSQL</productname> uses a <firstterm>parser</> to
  83      perform this step.  A standard parser is provided, and custom parsers
  84      can be created for specific needs.
  85     </para>
  86    </listitem>
  87
  88    <listitem>
  89     <para>
  90      <emphasis>Converting tokens into <firstterm>lexemes</></emphasis>.
  91      A lexeme is a string, just like a token, but it has been
  92      <firstterm>normalized</> so that different forms of the same word
  93      are made alike.  For example, normalization almost always includes
  94      folding upper-case letters to lower-case, and often involves removal
  95      of suffixes (such as <literal>s</> or <literal>es</> in English).
  96      This allows searches to find variant forms of the
  97      same word, without tediously entering all the possible variants.
  98      Also, this step typically eliminates <firstterm>stop words</>, which
  99      are words that are so common that they are useless for searching.
 100      (In short, then, tokens are raw fragments of the document text, while
 101      lexemes are words that are believed useful for indexing and searching.)
 102      <productname>PostgreSQL</productname> uses <firstterm>dictionaries</> to
 103      perform this step.  Various standard dictionaries are provided, and
 104      custom ones can be created for specific needs.
 105     </para>
 106    </listitem>
 107
 108    <listitem>
 109     <para>
 110      <emphasis>Storing preprocessed documents optimized for
 111      searching</emphasis>.  For example, each document can be represented
 112      as a sorted array of normalized lexemes. Along with the lexemes it is
 113      often desirable to store positional information to use for
 114      <firstterm>proximity ranking</firstterm>, so that a document that
 115      contains a more <quote>dense</> region of query words is
 116      assigned a higher rank than one with scattered query words.
 117     </para>
 118    </listitem>
 119   </itemizedlist>
 120
 121   <para>
 122    Dictionaries allow fine-grained control over how tokens are normalized.
 123    With appropriate dictionaries, you can:
 124   </para>
 125
 126   <itemizedlist  spacing="compact" mark="bullet">
 127    <listitem>
 128     <para>
 129      Define stop words that should not be indexed.
 130     </para>
 131    </listitem>
 132
 133    <listitem>
 134     <para>
 135      Map synonyms to a single word using <application>Ispell</>.
 136     </para>
 137    </listitem>
 138
 139    <listitem>
 140     <para>
 141      Map phrases to a single word using a thesaurus.
 142     </para>
 143    </listitem>
 144
 145    <listitem>
 146     <para>
 147      Map different variations of a word to a canonical form using
 148      an <application>Ispell</> dictionary.
 149     </para>
 150    </listitem>
 151
 152    <listitem>
 153     <para>
 154      Map different variations of a word to a canonical form using
 155      <application>Snowball</> stemmer rules.
 156     </para>
 157    </listitem>
 158   </itemizedlist>
 159
 160   <para>
 161    A data type <type>tsvector</type> is provided for storing preprocessed
 162    documents, along with a type <type>tsquery</type> for representing processed
 163    queries (<xref linkend="datatype-textsearch">).  There are many
 164    functions and operators available for these data types
 165    (<xref linkend="functions-textsearch">), the most important of which is
 166    the match operator <literal>@@</literal>, which we introduce in
 167    <xref linkend="textsearch-matching">.  Full text searches can be accelerated
 168    using indexes (<xref linkend="textsearch-indexes">).
 169   </para>
 170
 171
 172   <sect2 id="textsearch-document">
 173    <title>What Is a Document?</title>
 174
 175    <indexterm zone="textsearch-document">
 176     <primary>document</primary>
 177     <secondary>text search</secondary>
 178    </indexterm>
 179
 180    <para>
 181     A <firstterm>document</> is the unit of searching in a full text search
 182     system; for example, a magazine article or email message.  The text search
 183     engine must be able to parse documents and store associations of lexemes
 184     (key words) with their parent document. Later, these associations are
 185     used to search for documents that contain query words.
 186    </para>
 187
 188    <para>
 189     For searches within <productname>PostgreSQL</productname>,
 190     a document is normally a textual field within a row of a database table,
 191     or possibly a combination (concatenation) of such fields, perhaps stored
 192     in several tables or obtained dynamically. In other words, a document can
 193     be constructed from different parts for indexing and it might not be
 194     stored anywhere as a whole. For example:
 195
 196 <programlisting>
 197 SELECT title || ' ' ||  author || ' ' ||  abstract || ' ' || body AS document
 198 FROM messages
 199 WHERE mid = 12;
 200
 201 SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
 202 FROM messages m, docs d
 203 WHERE mid = did AND mid = 12;
 204 </programlisting>
 205    </para>
 206
 207    <note>
 208     <para>
 209      Actually, in these example queries, <function>coalesce</function>
 210      should be used to prevent a single <literal>NULL</literal> attribute from
 211      causing a <literal>NULL</literal> result for the whole document.
 212     </para>
 213    </note>
 214
 215    <para>
 216     Another possibility is to store the documents as simple text files in the
 217     file system. In this case, the database can be used to store the full text
 218     index and to execute searches, and some unique identifier can be used to
 219     retrieve the document from the file system.  However, retrieving files
 220     from outside the database requires superuser permissions or special
 221     function support, so this is usually less convenient than keeping all
 222     the data inside <productname>PostgreSQL</productname>.  Also, keeping
 223     everything inside the database allows easy access
 224     to document metadata to assist in indexing and display.
 225    </para>
 226
 227    <para>
 228     For text search purposes, each document must be reduced to the
 229     preprocessed <type>tsvector</> format.  Searching and ranking
 230     are performed entirely on the <type>tsvector</> representation
 231     of a document &mdash; the original text need only be retrieved
 232     when the document has been selected for display to a user.
 233     We therefore often speak of the <type>tsvector</> as being the
 234     document, but of course it is only a compact representation of
 235     the full document.
 236    </para>
 237   </sect2>
 238
 239   <sect2 id="textsearch-matching">
 240    <title>Basic Text Matching</title>
 241
 242    <para>
 243     Full text searching in <productname>PostgreSQL</productname> is based on
 244     the match operator <literal>@@</literal>, which returns
 245     <literal>true</literal> if a <type>tsvector</type>
 246     (document) matches a <type>tsquery</type> (query).
 247     It doesn't matter which data type is written first:
 248
 249 <programlisting>
 250 SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat &amp; rat'::tsquery;
 251  ?column?
 252 ----------
 253  t
 254
 255 SELECT 'fat &amp; cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 256  ?column?
 257 ----------
 258  f
 259 </programlisting>
 260    </para>
 261
 262    <para>
 263     As the above example suggests, a <type>tsquery</type> is not just raw
 264     text, any more than a <type>tsvector</type> is.  A <type>tsquery</type>
 265     contains search terms, which must be already-normalized lexemes, and
 266     may combine multiple terms using AND, OR, and NOT operators.
 267     (For details see <xref linkend="datatype-textsearch">.)  There are
 268     functions <function>to_tsquery</> and <function>plainto_tsquery</>
 269     that are helpful in converting user-written text into a proper
 270     <type>tsquery</type>, for example by normalizing words appearing in
 271     the text.  Similarly, <function>to_tsvector</> is used to parse and
 272     normalize a document string.  So in practice a text search match would
 273     look more like this:
 274
 275 <programlisting>
 276 SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat &amp; rat');
 277  ?column?
 278 ----------
 279  t
 280 </programlisting>
 281
 282     Observe that this match would not succeed if written as
 283
 284 <programlisting>
 285 SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat &amp; rat');
 286  ?column?
 287 ----------
 288  f
 289 </programlisting>
 290
 291     since here no normalization of the word <literal>rats</> will occur.
 292     The elements of a <type>tsvector</> are lexemes, which are assumed
 293     already normalized, so <literal>rats</> does not match <literal>rat</>.
 294    </para>
 295
 296    <para>
 297     The <literal>@@</literal> operator also
 298     supports <type>text</type> input, allowing explicit conversion of a text
 299     string to <type>tsvector</type> or <type>tsquery</> to be skipped
 300     in simple cases.  The variants available are:
 301
 302 <programlisting>
 303 tsvector @@ tsquery
 304 tsquery  @@ tsvector
 305 text @@ tsquery
 306 text @@ text
 307 </programlisting>
 308    </para>
 309
 310    <para>
 311     The first two of these we saw already.
 312     The form <type>text</type> <literal>@@</literal> <type>tsquery</type>
 313     is equivalent to <literal>to_tsvector(x) @@ y</literal>.
 314     The form <type>text</type> <literal>@@</literal> <type>text</type>
 315     is equivalent to <literal>to_tsvector(x) @@ plainto_tsquery(y)</literal>.
 316    </para>
 317   </sect2>
 318
 319   <sect2 id="textsearch-intro-configurations">
 320    <title>Configurations</title>
 321
 322    <para>
 323     The above are all simple text search examples.  As mentioned before, full
 324     text search functionality includes the ability to do many more things:
 325     skip indexing certain words (stop words), process synonyms, and use
 326     sophisticated parsing, e.g. parse based on more than just white space.
 327     This functionality is controlled by <firstterm>text search
 328     configurations</>.  <productname>PostgreSQL</> comes with predefined
 329     configurations for many languages, and you can easily create your own
 330     configurations.  (<application>psql</>'s <command>\dF</> command
 331     shows all available configurations.)
 332    </para>
 333
 334    <para>
 335     During installation an appropriate configuration is selected and
 336     <xref linkend="guc-default-text-search-config"> is set accordingly
 337     in <filename>postgresql.conf</>.  If you are using the same text search
 338     configuration for the entire cluster you can use the value in
 339     <filename>postgresql.conf</>.  To use different configurations
 340     throughout the cluster but the same configuration within any one database,
 341     use <command>ALTER DATABASE ... SET</>.  Otherwise, you can set
 342     <varname>default_text_search_config</varname> in each session.
 343    </para>
 344
 345    <para>
 346     Each text search function that depends on a configuration has an optional
 347     <type>regconfig</> argument, so that the configuration to use can be
 348     specified explicitly.  <varname>default_text_search_config</varname>
 349     is used only when this argument is omitted.
 350    </para>
 351
 352    <para>
 353     To make it easier to build custom text search configurations, a
 354     configuration is built up from simpler database objects.
 355     <productname>PostgreSQL</>'s text search facility provides
 356     four types of configuration-related database objects:
 357    </para>
 358
 359   <itemizedlist  spacing="compact" mark="bullet">
 360    <listitem>
 361     <para>
 362      <firstterm>Text search parsers</> break documents into tokens
 363      and classify each token (for example, as words or numbers).
 364     </para>
 365    </listitem>
 366
 367    <listitem>
 368     <para>
 369      <firstterm>Text search dictionaries</> convert tokens to normalized
 370      form and reject stop words.
 371     </para>
 372    </listitem>
 373
 374    <listitem>
 375     <para>
 376      <firstterm>Text search templates</> provide the functions underlying
 377      dictionaries.  (A dictionary simply specifies a template and a set
 378      of parameters for the template.)
 379     </para>
 380    </listitem>
 381
 382    <listitem>
 383     <para>
 384      <firstterm>Text search configurations</> select a parser and a set
 385      of dictionaries to use to normalize the tokens produced by the parser.
 386     </para>
 387    </listitem>
 388   </itemizedlist>
 389
 390    <para>
 391     Text search parsers and templates are built from low-level C functions;
 392     therefore it requires C programming ability to develop new ones, and
 393     superuser privileges to install one into a database.  (There are examples
 394     of add-on parsers and templates in the <filename>contrib/</> area of the
 395     <productname>PostgreSQL</> distribution.)  Since dictionaries and
 396     configurations just parameterize and connect together some underlying
 397     parsers and templates, no special privilege is needed to create a new
 398     dictionary or configuration.  Examples of creating custom dictionaries and
 399     configurations appear later in this chapter.
 400    </para>
 401
 402   </sect2>
 403
 404  </sect1>
 405
 406  <sect1 id="textsearch-tables">
 407   <title>Tables and Indexes</title>
 408
 409   <para>
 410    The examples in the previous section illustrated full text matching using
 411    simple constant strings.  This section shows how to search table data,
 412    optionally using indexes.
 413   </para>
 414
 415   <sect2 id="textsearch-tables-search">
 416    <title>Searching a Table</title>
 417
 418    <para>
 419     It is possible to do full text search with no index.  A simple query
 420     to print the <structname>title</> of each row that contains the word
 421     <literal>friend</> in its <structfield>body</> field is:
 422
 423 <programlisting>
 424 SELECT title
 425 FROM pgweb
 426 WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');
 427 </programlisting>
 428
 429     This will also find related words such as <literal>friends</>
 430     and <literal>friendly</>, since all these are reduced to the same
 431     normalized lexeme.
 432    </para>
 433
 434    <para>
 435     The query above specifies that the <literal>english</> configuration
 436     is to be used to parse and normalize the strings.  Alternatively we
 437     could omit the configuration parameters:
 438
 439 <programlisting>
 440 SELECT title
 441 FROM pgweb
 442 WHERE to_tsvector(body) @@ to_tsquery('friend');
 443 </programlisting>
 444
 445     This query will use the configuration set by <xref
 446     linkend="guc-default-text-search-config">.
 447    </para>
 448
 449    <para>
 450     A more complex example is to
 451     select the ten most recent documents that contain <literal>create</> and
 452     <literal>table</> in the <structname>title</> or <structname>body</>:
 453
 454 <programlisting>
 455 SELECT title
 456 FROM pgweb
 457 WHERE to_tsvector(title || body) @@ to_tsquery('create &amp; table')
 458 ORDER BY last_mod_date DESC LIMIT 10;
 459 </programlisting>
 460
 461     For clarity we omitted the <function>coalesce</function> function
 462     which would be needed to search rows that contain <literal>NULL</literal>
 463     in one of the two fields.
 464    </para>
 465
 466    <para>
 467     Although these queries will work without an index, most applications
 468     will find this approach too slow, except perhaps for occasional ad-hoc
 469     searches.  Practical use of text searching usually requires creating
 470     an index.
 471    </para>
 472
 473   </sect2>
 474
 475   <sect2 id="textsearch-tables-index">
 476    <title>Creating Indexes</title>
 477
 478    <para>
 479     We can create a <acronym>GIN</acronym> index (<xref
 480     linkend="textsearch-indexes">) to speed up text searches:
 481
 482 <programlisting>
 483 CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', body));
 484 </programlisting>
 485
 486     Notice that the 2-argument version of <function>to_tsvector</function> is
 487     used.  Only text search functions that specify a configuration name can
 488     be used in expression indexes (<xref linkend="indexes-expressional">).
 489     This is because the index contents must be unaffected by <xref
 490     linkend="guc-default-text-search-config">.  If they were affected, the
 491     index contents might be inconsistent because different entries could
 492     contain <type>tsvector</>s that were created with different text search
 493     configurations, and there would be no way to guess which was which.  It
 494     would be impossible to dump and restore such an index correctly.
 495    </para>
 496
 497    <para>
 498     Because the two-argument version of <function>to_tsvector</function> was
 499     used in the index above, only a query reference that uses the 2-argument
 500     version of <function>to_tsvector</function> with the same configuration
 501     name will use that index.  That is, <literal>WHERE
 502     to_tsvector('english', body) @@ 'a &amp; b'</> can use the index,
 503     but <literal>WHERE to_tsvector(body) @@ 'a &amp; b'</> cannot.
 504     This ensures that an index will be used only with the same configuration
 505     used to create the index entries.
 506    </para>
 507
 508   <para>
 509     It is possible to set up more complex expression indexes wherein the
 510     configuration name is specified by another column, e.g.:
 511
 512 <programlisting>
 513 CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_name, body));
 514 </programlisting>
 515
 516     where <literal>config_name</> is a column in the <literal>pgweb</>
 517     table.  This allows mixed configurations in the same index while
 518     recording which configuration was used for each index entry.  This
 519     would be useful, for example, if the document collection contained
 520     documents in different languages.  Again,
 521     queries that are to use the index must be phrased to match, e.g.
 522     <literal>WHERE to_tsvector(config_name, body) @@ 'a &amp; b'</>.
 523    </para>
 524
 525    <para>
 526     Indexes can even concatenate columns:
 527
 528 <programlisting>
 529 CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', title || body));
 530 </programlisting>
 531    </para>
 532
 533    <para>
 534     Another approach is to create a separate <type>tsvector</> column
 535     to hold the output of <function>to_tsvector</>.  This example is a
 536     concatenation of <literal>title</literal> and <literal>body</literal>,
 537     using <function>coalesce</> to ensure that one field will still be
 538     indexed when the other is <literal>NULL</>:
 539
 540 <programlisting>
 541 ALTER TABLE pgweb ADD COLUMN textsearchable_index_col tsvector;
 542 UPDATE pgweb SET textsearchable_index_col =
 543      to_tsvector('english', coalesce(title,'') || coalesce(body,''));
 544 </programlisting>
 545
 546     Then we create a <acronym>GIN</acronym> index to speed up the search:
 547
 548 <programlisting>
 549 CREATE INDEX textsearch_idx ON pgweb USING gin(textsearchable_index_col);
 550 </programlisting>
 551
 552     Now we are ready to perform a fast full text search:
 553
 554 <programlisting>
 555 SELECT title
 556 FROM pgweb
 557 WHERE textsearchable_index_col @@ to_tsquery('create &amp; table')
 558 ORDER BY last_mod_date DESC LIMIT 10;
 559 </programlisting>
 560    </para>
 561
 562    <para>
 563     When using a separate column to store the <type>tsvector</>
 564     representation,
 565     it is necessary to create a trigger to keep the <type>tsvector</>
 566     column current anytime <literal>title</> or <literal>body</> changes.
 567     <xref linkend="textsearch-update-triggers"> explains how to do that.
 568    </para>
 569
 570    <para>
 571     One advantage of the separate-column approach over an expression index
 572     is that it is not necessary to explicitly specify the text search
 573     configuration in queries in order to make use of the index.  As shown
 574     in the example above, the query can depend on
 575     <varname>default_text_search_config</>.  Another advantage is that
 576     searches will be faster, since it will not be necessary to redo the
 577     <function>to_tsvector</> calls to verify index matches.  (This is more
 578     important when using a GiST index than a GIN index; see <xref
 579     linkend="textsearch-indexes">.)  The expression-index approach is
 580     simpler to set up, however, and it requires less disk space since the
 581     <type>tsvector</> representation is not stored explicitly.
 582    </para>
 583
 584   </sect2>
 585
 586  </sect1>
 587
 588  <sect1 id="textsearch-controls">
 589   <title>Controlling Text Search</title>
 590
 591   <para>
 592    To implement full text searching there must be a function to create a
 593    <type>tsvector</type> from a document and a <type>tsquery</type> from a
 594    user query. Also, we need to return results in a useful order, so we need
 595    a function that compares documents with respect to their relevance to
 596    the query. It's also important to be able to display the results nicely.
 597    <productname>PostgreSQL</productname> provides support for all of these
 598    functions.
 599   </para>
 600
 601   <sect2 id="textsearch-parsing-documents">
 602    <title>Parsing Documents</title>
 603
 604    <para>
 605     <productname>PostgreSQL</productname> provides the
 606     function <function>to_tsvector</function> for converting a document to
 607     the <type>tsvector</type> data type.
 608    </para>
 609
 610    <indexterm>
 611     <primary>to_tsvector</primary>
 612    </indexterm>
 613
 614    <synopsis>
 615     to_tsvector(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">document</replaceable> <type>text</>) returns <type>tsvector</>
 616    </synopsis>
 617
 618    <para>
 619     <function>to_tsvector</function> parses a textual document into tokens,
 620     reduces the tokens to lexemes, and returns a <type>tsvector</type> which
 621     lists the lexemes together with their positions in the document.
 622     The document is processed according to the specified or default
 623     text search configuration.
 624     Here is a simple example:
 625
 626 <programlisting>
 627 SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
 628                   to_tsvector
 629 -----------------------------------------------------
 630  'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
 631 </programlisting>
 632    </para>
 633
 634    <para>
 635     In the example above we see that the resulting <type>tsvector</type> does not
 636     contain the words <literal>a</literal>, <literal>on</literal>, or
 637     <literal>it</literal>, the word <literal>rats</literal> became
 638     <literal>rat</literal>, and the punctuation sign <literal>-</literal> was
 639     ignored.
 640    </para>
 641
 642    <para>
 643     The <function>to_tsvector</function> function internally calls a parser
 644     which breaks the document text into tokens and assigns a type to
 645     each token.  For each token, a list of
 646     dictionaries (<xref linkend="textsearch-dictionaries">) is consulted,
 647     where the list can vary depending on the token type.  The first dictionary
 648     that <firstterm>recognizes</> the token emits one or more normalized
 649     <firstterm>lexemes</firstterm> to represent the token.  For example,
 650     <literal>rats</literal> became <literal>rat</literal> because one of the
 651     dictionaries recognized that the word <literal>rats</literal> is a plural
 652     form of <literal>rat</literal>.  Some words are recognized as
 653     <firstterm>stop words</> (<xref linkend="textsearch-stopwords">), which
 654     causes them to be ignored since they occur too frequently to be useful in
 655     searching.  In our example these are
 656     <literal>a</literal>, <literal>on</literal>, and <literal>it</literal>.
 657     If no dictionary in the list recognizes the token then it is also ignored.
 658     In this example that happened to the punctuation sign <literal>-</literal>
 659     because there are in fact no dictionaries assigned for its token type
 660     (<literal>Space symbols</literal>), meaning space tokens will never be
 661     indexed. The choices of parser, dictionaries and which types of tokens to
 662     index are determined by the selected text search configuration (<xref
 663     linkend="textsearch-configuration">).  It is possible to have
 664     many different configurations in the same database, and predefined
 665     configurations are available for various languages. In our example
 666     we used the default configuration <literal>english</literal> for the
 667     English language.
 668    </para>
 669
 670    <para>
 671     The function <function>setweight</function> can be used to label the
 672     entries of a <type>tsvector</type> with a given <firstterm>weight</>,
 673     where a weight is one of the letters <literal>A</>, <literal>B</>,
 674     <literal>C</>, or <literal>D</>.
 675     This is typically used to mark entries coming from
 676     different parts of a document, such as title versus body.  Later, this
 677     information can be used for ranking of search results.
 678    </para>
 679
 680    <para>
 681     Because <function>to_tsvector</function>(<literal>NULL</literal>) will
 682     return <literal>NULL</literal>, it is recommended to use
 683     <function>coalesce</function> whenever a field might be null.
 684     Here is the recommended method for creating
 685     a <type>tsvector</type> from a structured document:
 686
 687 <programlisting>
 688 UPDATE tt SET ti =
 689     setweight(to_tsvector(coalesce(title,'')), 'A')    ||
 690     setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
 691     setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
 692     setweight(to_tsvector(coalesce(body,'')), 'D');
 693 </programlisting>
 694
 695     Here we have used <function>setweight</function> to label the source
 696     of each lexeme in the finished <type>tsvector</type>, and then merged
 697     the labeled <type>tsvector</type> values using the <type>tsvector</>
 698     concatenation operator <literal>||</>.  (<xref
 699     linkend="textsearch-manipulate-tsvector"> gives details about these
 700     operations.)
 701    </para>
 702
 703   </sect2>
 704
 705   <sect2 id="textsearch-parsing-queries">
 706    <title>Parsing Queries</title>
 707
 708    <para>
 709     <productname>PostgreSQL</productname> provides the
 710     functions <function>to_tsquery</function> and
 711     <function>plainto_tsquery</function> for converting a query to
 712     the <type>tsquery</type> data type.  <function>to_tsquery</function>
 713     offers access to more features than <function>plainto_tsquery</function>,
 714     but is less forgiving about its input.
 715    </para>
 716
 717    <indexterm>
 718     <primary>to_tsquery</primary>
 719    </indexterm>
 720
 721    <synopsis>
 722     to_tsquery(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">querytext</replaceable> <type>text</>) returns <type>tsquery</>
 723    </synopsis>
 724
 725    <para>
 726     <function>to_tsquery</function> creates a <type>tsquery</> value from
 727     <replaceable>querytext</replaceable>, which must consist of single tokens
 728     separated by the Boolean operators <literal>&amp;</literal> (AND),
 729     <literal>|</literal> (OR) and <literal>!</literal> (NOT).  These operators
 730     can be grouped using parentheses.  In other words, the input to
 731     <function>to_tsquery</function> must already follow the general rules for
 732     <type>tsquery</> input, as described in <xref
 733     linkend="datatype-textsearch">.  The difference is that while basic
 734     <type>tsquery</> input takes the tokens at face value,
 735     <function>to_tsquery</function> normalizes each token to a lexeme using
 736     the specified or default configuration, and discards any tokens that are
 737     stop words according to the configuration.  For example:
 738
 739 <programlisting>
 740 SELECT to_tsquery('english', 'The &amp; Fat &amp; Rats');
 741   to_tsquery
 742 ---------------
 743  'fat' &amp; 'rat'
 744 </programlisting>
 745
 746     As in basic <type>tsquery</> input, weight(s) can be attached to each
 747     lexeme to restrict it to match only <type>tsvector</> lexemes of those
 748     weight(s).  For example:
 749
 750 <programlisting>
 751 SELECT to_tsquery('english', 'Fat | Rats:AB');
 752     to_tsquery
 753 ------------------
 754  'fat' | 'rat':AB
 755 </programlisting>
 756
 757     Also, <literal>*</> can be attached to a lexeme to specify prefix matching:
 758
 759 <programlisting>
 760 SELECT to_tsquery('supern:*A &amp; star:A*B');
 761         to_tsquery
 762 --------------------------
 763  'supern':*A &amp; 'star':*AB
 764 </programlisting>
 765
 766     Such a lexeme will match any word in a <type>tsvector</> that begins
 767     with the given string.
 768    </para>
 769
 770    <para>
 771     <function>to_tsquery</function> can also accept single-quoted
 772     phrases.  This is primarily useful when the configuration includes a
 773     thesaurus dictionary that may trigger on such phrases.
 774     In the example below, a thesaurus contains the rule <literal>supernovae
 775     stars : sn</literal>:
 776
 777 <programlisting>
 778 SELECT to_tsquery('''supernovae stars'' &amp; !crab');
 779   to_tsquery
 780 ---------------
 781  'sn' &amp; !'crab'
 782 </programlisting>
 783
 784     Without quotes, <function>to_tsquery</function> will generate a syntax
 785     error for tokens that are not separated by an AND or OR operator.
 786    </para>
 787
 788    <indexterm>
 789     <primary>plainto_tsquery</primary>
 790    </indexterm>
 791
 792    <synopsis>
 793     plainto_tsquery(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">querytext</replaceable> <type>text</>) returns <type>tsquery</>
 794    </synopsis>
 795
 796    <para>
 797     <function>plainto_tsquery</> transforms unformatted text
 798     <replaceable>querytext</replaceable> to <type>tsquery</type>.
 799     The text is parsed and normalized much as for <function>to_tsvector</>,
 800     then the <literal>&amp;</literal> (AND) Boolean operator is inserted
 801     between surviving words.
 802    </para>
 803
 804    <para>
 805     Example:
 806
 807 <programlisting>
 808  SELECT plainto_tsquery('english', 'The Fat Rats');
 809  plainto_tsquery
 810 -----------------
 811  'fat' &amp; 'rat'
 812 </programlisting>
 813
 814     Note that <function>plainto_tsquery</> cannot
 815     recognize Boolean operators, weight labels, or prefix-match labels
 816     in its input:
 817
 818 <programlisting>
 819 SELECT plainto_tsquery('english', 'The Fat &amp; Rats:C');
 820    plainto_tsquery
 821 ---------------------
 822  'fat' &amp; 'rat' &amp; 'c'
 823 </programlisting>
 824
 825     Here, all the input punctuation was discarded as being space symbols.
 826    </para>
 827
 828   </sect2>
 829
 830   <sect2 id="textsearch-ranking">
 831    <title>Ranking Search Results</title>
 832
 833    <para>
 834     Ranking attempts to measure how relevant documents are to a particular
 835     query, so that when there are many matches the most relevant ones can be
 836     shown first.  <productname>PostgreSQL</productname> provides two
 837     predefined ranking functions, which take into account lexical, proximity,
 838     and structural information; that is, they consider how often the query
 839     terms appear in the document, how close together the terms are in the
 840     document, and how important is the part of the document where they occur.
 841     However, the concept of relevancy is vague and very application-specific.
 842     Different applications might require additional information for ranking,
 843     e.g. document modification time.  The built-in ranking functions are only
 844     examples.  You can write your own ranking functions and/or combine their
 845     results with additional factors to fit your specific needs.
 846    </para>
 847
 848    <para>
 849     The two ranking functions currently available are:
 850
 851     <variablelist>
 852
 853      <varlistentry>
 854
 855       <indexterm>
 856        <primary>ts_rank</primary>
 857       </indexterm>
 858
 859       <term>
 860        <synopsis>
 861         ts_rank(<optional> <replaceable class="PARAMETER">weights</replaceable> <type>float4[]</>, </optional> <replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>, <replaceable class="PARAMETER">query</replaceable> <type>tsquery</> <optional>, <replaceable class="PARAMETER">normalization</replaceable> <type>integer</> </optional>) returns <type>float4</>
 862        </synopsis>
 863       </term>
 864
 865       <listitem>
 866        <para>
 867         Standard ranking function.<!-- TODO document this better -->
 868        </para>
 869       </listitem>
 870      </varlistentry>
 871
 872      <varlistentry>
 873
 874       <indexterm>
 875        <primary>ts_rank_cd</primary>
 876       </indexterm>
 877
 878       <term>
 879        <synopsis>
 880         ts_rank_cd(<optional> <replaceable class="PARAMETER">weights</replaceable> <type>float4[]</>, </optional> <replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>, <replaceable class="PARAMETER">query</replaceable> <type>tsquery</> <optional>, <replaceable class="PARAMETER">normalization</replaceable> <type>integer</> </optional>) returns <type>float4</>
 881        </synopsis>
 882       </term>
 883
 884       <listitem>
 885        <para>
 886         This function computes the <firstterm>cover density</firstterm>
 887         ranking for the given document vector and query, as described in
 888         Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three
 889         Term Queries" in the journal "Information Processing and Management",
 890         1999.
 891        </para>
 892
 893        <para>
 894         This function requires positional information in its input.
 895         Therefore it will not work on <quote>stripped</> <type>tsvector</>
 896         values &mdash; it will always return zero.
 897        </para>
 898       </listitem>
 899      </varlistentry>
 900
 901     </variablelist>
 902
 903    </para>
 904
 905    <para>
 906     For both these functions,
 907     the optional <replaceable class="PARAMETER">weights</replaceable>
 908     argument offers the ability to weigh word instances more or less
 909     heavily depending on how they are labeled.  The weight arrays specify
 910     how heavily to weigh each category of word, in the order:
 911
 912 <programlisting>
 913 {D-weight, C-weight, B-weight, A-weight}
 914 </programlisting>
 915
 916     If no <replaceable class="PARAMETER">weights</replaceable> are provided,
 917     then these defaults are used:
 918
 919 <programlisting>
 920 {0.1, 0.2, 0.4, 1.0}
 921 </programlisting>
 922
 923     Typically weights are used to mark words from special areas of the
 924     document, like the title or an initial abstract, so that they can be
 925     treated as more or less important than words in the document body.
 926    </para>
 927
 928    <para>
 929     Since a longer document has a greater chance of containing a query term
 930     it is reasonable to take into account document size, e.g. a hundred-word
 931     document with five instances of a search word is probably more relevant
 932     than a thousand-word document with five instances.  Both ranking functions
 933     take an integer <replaceable>normalization</replaceable> option that
 934     specifies whether and how a document's length should impact its rank.
 935     The integer option controls several behaviors, so it is a bit mask:
 936     you can specify one or more behaviors using
 937     <literal>|</literal> (for example, <literal>2|4</literal>).
 938
 939     <itemizedlist  spacing="compact" mark="bullet">
 940      <listitem>
 941       <para>
 942        0 (the default) ignores the document length
 943       </para>
 944      </listitem>
 945      <listitem>
 946       <para>
 947        1 divides the rank by 1 + the logarithm of the document length
 948       </para>
 949      </listitem>
 950      <listitem>
 951       <para>
 952        2 divides the rank by the document length
 953       </para>
 954      </listitem>
 955      <listitem>
 956       <para>
 957        4 divides the rank by the mean harmonic distance between extents
 958        (this is implemented only by <function>ts_rank_cd</>)
 959       </para>
 960      </listitem>
 961      <listitem>
 962       <para>
 963        8 divides the rank by the number of unique words in document
 964       </para>
 965      </listitem>
 966      <listitem>
 967       <para>
 968        16 divides the rank by 1 + the logarithm of the number
 969        of unique words in document
 970       </para>
 971      </listitem>
 972      <listitem>
 973       <para>
 974        32 divides the rank by itself + 1
 975       </para>
 976      </listitem>
 977     </itemizedlist>
 978
 979     If more than one flag bit is specified, the transformations are
 980     applied in the order listed.
 981    </para>
 982
 983    <para>
 984     It is important to note that the ranking functions do not use any global
 985     information, so it is impossible to produce a fair normalization to 1% or
 986     100% as sometimes desired.  Normalization option 32
 987     (<literal>rank/(rank+1)</literal>) can be applied to scale all ranks
 988     into the range zero to one, but of course this is just a cosmetic change;
 989     it will not affect the ordering of the search results.
 990    </para>
 991
 992    <para>
 993     Here is an example that selects only the ten highest-ranked matches:
 994
 995 <programlisting>
 996 SELECT title, ts_rank_cd(textsearch, query) AS rank
 997 FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
 998 WHERE query @@ textsearch
 999 ORDER BY rank DESC LIMIT 10;
1000                      title                     |   rank
1001 -----------------------------------------------+----------
1002  Neutrinos in the Sun                          |      3.1
1003  The Sudbury Neutrino Detector                 |      2.4
1004  A MACHO View of Galactic Dark Matter          |  2.01317
1005  Hot Gas and Dark Matter                       |  1.91171
1006  The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
1007  Rafting for Solar Neutrinos                   |      1.9
1008  NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
1009  Hot Gas and Dark Matter                       |   1.6123
1010  Ice Fishing for Cosmic Neutrinos              |      1.6
1011  Weak Lensing Distorts the Universe            | 0.818218
1012 </programlisting>
1013
1014     This is the same example using normalized ranking:
1015
1016 <programlisting>
1017 SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
1018 FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
1019 WHERE  query @@ textsearch
1020 ORDER BY rank DESC LIMIT 10;
1021                      title                     |        rank
1022 -----------------------------------------------+-------------------
1023  Neutrinos in the Sun                          | 0.756097569485493
1024  The Sudbury Neutrino Detector                 | 0.705882361190954
1025  A MACHO View of Galactic Dark Matter          | 0.668123210574724
1026  Hot Gas and Dark Matter                       |  0.65655958650282
1027  The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
1028  Rafting for Solar Neutrinos                   | 0.655172410958162
1029  NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
1030  Hot Gas and Dark Matter                       | 0.617195790024749
1031  Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
1032  Weak Lensing Distorts the Universe            | 0.450010798361481
1033 </programlisting>
1034    </para>
1035
1036    <para>
1037     Ranking can be expensive since it requires consulting the
1038     <type>tsvector</type> of each matching document, which can be I/O bound and
1039     therefore slow. Unfortunately, it is almost impossible to avoid since
1040     practical queries often result in large numbers of matches.
1041    </para>
1042
1043   </sect2>
1044
1045   <sect2 id="textsearch-headline">
1046    <title>Highlighting Results</title>
1047
1048    <para>
1049     To present search results it is ideal to show a part of each document and
1050     how it is related to the query. Usually, search engines show fragments of
1051     the document with marked search terms.  <productname>PostgreSQL</>
1052     provides a function <function>ts_headline</function> that
1053     implements this functionality.
1054    </para>
1055
1056    <indexterm>
1057     <primary>ts_headline</primary>
1058    </indexterm>
1059
1060    <synopsis>
1061     ts_headline(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">document</replaceable> <type>text</>, <replaceable class="PARAMETER">query</replaceable> <type>tsquery</> <optional>, <replaceable class="PARAMETER">options</replaceable> <type>text</> </optional>) returns <type>text</>
1062    </synopsis>
1063
1064    <para>
1065     <function>ts_headline</function> accepts a document along
1066     with a query, and returns an excerpt from
1067     the document in which terms from the query are highlighted.  The
1068     configuration to be used to parse the document can be specified by
1069     <replaceable>config</replaceable>; if <replaceable>config</replaceable>
1070     is omitted, the
1071     <varname>default_text_search_config</varname> configuration is used.
1072    </para>
1073
1074    <para>
1075     If an <replaceable>options</replaceable> string is specified it must
1076     consist of a comma-separated list of one or more
1077     <replaceable>option</><literal>=</><replaceable>value</> pairs.
1078     The available options are:
1079
1080     <itemizedlist  spacing="compact" mark="bullet">
1081      <listitem>
1082       <para>
1083        <literal>StartSel</>, <literal>StopSel</literal>: the strings with which
1084        query words appearing in the document should be delimited to distinguish
1085        them from other excerpted words.  You must double-quote these strings
1086        if they contain spaces or commas.
1087       </para>
1088      </listitem>
1089      <listitem >
1090       <para>
1091        <literal>MaxWords</>, <literal>MinWords</literal>: these numbers
1092        determine the longest and shortest headlines to output.
1093       </para>
1094      </listitem>
1095      <listitem>
1096       <para>
1097        <literal>ShortWord</literal>: words of this length or less will be
1098        dropped at the start and end of a headline. The default
1099        value of three eliminates common English articles.
1100       </para>
1101      </listitem>
1102      <listitem>
1103       <para>
1104        <literal>HighlightAll</literal>: Boolean flag;  if
1105        <literal>true</literal> the whole document will be used as the
1106        headline, ignoring the preceding three parameters.
1107       </para>
1108      </listitem>
1109      <listitem>
1110       <para>
1111        <literal>MaxFragments</literal>: maximum number of text excerpts
1112        or fragments to display.  The default value of zero selects a
1113        non-fragment-oriented headline generation method.  A value greater than
1114        zero selects fragment-based headline generation.  This method
1115        finds text fragments with as many query words as possible and
1116        stretches those fragments around the query words.  As a result
1117        query words are close to the middle of each fragment and have words on
1118        each side. Each fragment will be of at most <literal>MaxWords</> and
1119        words of length <literal>ShortWord</> or less are dropped at the start
1120        and end of each fragment. If not all query words are found in the
1121        document, then a single fragment of the first <literal>MinWords</>
1122        in the document will be displayed.
1123       </para>
1124      </listitem>
1125      <listitem>
1126       <para>
1127        <literal>FragmentDelimiter</literal>: When more than one fragment is
1128        displayed, the fragments will be separated by this string.
1129       </para>
1130      </listitem>
1131     </itemizedlist>
1132
1133     Any unspecified options receive these defaults:
1134
1135 <programlisting>
1136 StartSel=&lt;b&gt;, StopSel=&lt;/b&gt;,
1137 MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE,
1138 MaxFragments=0, FragmentDelimiter=" ... "
1139 </programlisting>
1140    </para>
1141
1142    <para>
1143     For example:
1144
1145 <programlisting>
1146 SELECT ts_headline('english',
1147   'The most common type of search
1148 is to find all documents containing given query terms
1149 and return them in order of their similarity to the
1150 query.',
1151   to_tsquery('query &amp; similarity'));
1152                         ts_headline
1153 ------------------------------------------------------------
1154  containing given &lt;b&gt;query&lt;/b&gt; terms
1155  and return them in order of their &lt;b&gt;similarity&lt;/b&gt; to the
1156  &lt;b&gt;query&lt;/b&gt;.
1157
1158 SELECT ts_headline('english',
1159   'The most common type of search
1160 is to find all documents containing given query terms
1161 and return them in order of their similarity to the
1162 query.',
1163   to_tsquery('query &amp; similarity'),
1164   'StartSel = &lt;, StopSel = &gt;');
1165                       ts_headline
1166 -------------------------------------------------------
1167  containing given &lt;query&gt; terms
1168  and return them in order of their &lt;similarity&gt; to the
1169  &lt;query&gt;.
1170 </programlisting>
1171    </para>
1172
1173    <para>
1174     <function>ts_headline</> uses the original document, not a
1175     <type>tsvector</type> summary, so it can be slow and should be used with
1176     care.  A typical mistake is to call <function>ts_headline</function> for
1177     <emphasis>every</emphasis> matching document when only ten documents are
1178     to be shown. <acronym>SQL</acronym> subqueries can help; here is an
1179     example:
1180
1181 <programlisting>
1182 SELECT id, ts_headline(body, q), rank
1183 FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank
1184       FROM apod, to_tsquery('stars') q
1185       WHERE ti @@ q
1186       ORDER BY rank DESC LIMIT 10) AS foo;
1187 </programlisting>
1188    </para>
1189
1190   </sect2>
1191
1192  </sect1>
1193
1194  <sect1 id="textsearch-features">
1195   <title>Additional Features</title>
1196
1197   <para>
1198    This section describes additional functions and operators that are
1199    useful in connection with text search.
1200   </para>
1201
1202   <sect2 id="textsearch-manipulate-tsvector">
1203    <title>Manipulating Documents</title>
1204
1205    <para>
1206     <xref linkend="textsearch-parsing-documents"> showed how raw textual
1207     documents can be converted into <type>tsvector</> values.
1208     <productname>PostgreSQL</productname> also provides functions and
1209     operators that can be used to manipulate documents that are already
1210     in <type>tsvector</> form.
1211    </para>
1212
1213    <variablelist>
1214
1215     <varlistentry>
1216
1217      <indexterm>
1218       <primary>tsvector concatenation</primary>
1219      </indexterm>
1220
1221      <term>
1222       <synopsis>
1223        <type>tsvector</> || <type>tsvector</>
1224       </synopsis>
1225      </term>
1226
1227      <listitem>
1228       <para>
1229        The <type>tsvector</> concatenation operator
1230        returns a vector which combines the lexemes and positional information
1231        of the two vectors given as arguments.  Positions and weight labels
1232        are retained during the concatenation.
1233        Positions appearing in the right-hand vector are offset by the largest
1234        position mentioned in the left-hand vector, so that the result is
1235        nearly equivalent to the result of performing <function>to_tsvector</>
1236        on the concatenation of the two original document strings.  (The
1237        equivalence is not exact, because any stop-words removed from the
1238        end of the left-hand argument will not affect the result, whereas
1239        they would have affected the positions of the lexemes in the
1240        right-hand argument if textual concatenation were used.)
1241       </para>
1242
1243       <para>
1244        One advantage of using concatenation in the vector form, rather than
1245        concatenating text before applying <function>to_tsvector</>, is that
1246        you can use different configurations to parse different sections
1247        of the document.  Also, because the <function>setweight</> function
1248        marks all lexemes of the given vector the same way, it is necessary
1249        to parse the text and do <function>setweight</> before concatenating
1250        if you want to label different parts of the document with different
1251        weights.
1252       </para>
1253      </listitem>
1254     </varlistentry>
1255
1256     <varlistentry>
1257
1258      <indexterm>
1259       <primary>setweight</primary>
1260      </indexterm>
1261
1262      <term>
1263       <synopsis>
1264        setweight(<replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>, <replaceable class="PARAMETER">weight</replaceable> <type>"char"</>) returns <type>tsvector</>
1265       </synopsis>
1266      </term>
1267
1268      <listitem>
1269       <para>
1270        This function returns a copy of the input vector in which every
1271        position has been labeled with the given <replaceable>weight</>, either
1272        <literal>A</literal>, <literal>B</literal>, <literal>C</literal>, or
1273        <literal>D</literal>.  (<literal>D</literal> is the default for new
1274        vectors and as such is not displayed on output.)  These labels are
1275        retained when vectors are concatenated, allowing words from different
1276        parts of a document to be weighted differently by ranking functions.
1277       </para>
1278
1279       <para>
1280        Note that weight labels apply to <emphasis>positions</>, not
1281        <emphasis>lexemes</>.  If the input vector has been stripped of
1282        positions then <function>setweight</> does nothing.
1283       </para>
1284      </listitem>
1285     </varlistentry>
1286
1287     <varlistentry>
1288      <indexterm>
1289       <primary>length(tsvector)</primary>
1290      </indexterm>
1291
1292      <term>
1293       <synopsis>
1294        length(<replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>) returns <type>integer</>
1295       </synopsis>
1296      </term>
1297
1298      <listitem>
1299       <para>
1300        Returns the number of lexemes stored in the vector.
1301       </para>
1302      </listitem>
1303     </varlistentry>
1304
1305     <varlistentry>
1306
1307      <indexterm>
1308       <primary>strip</primary>
1309      </indexterm>
1310
1311      <term>
1312       <synopsis>
1313        strip(<replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>) returns <type>tsvector</>
1314       </synopsis>
1315      </term>
1316
1317      <listitem>
1318       <para>
1319        Returns a vector which lists the same lexemes as the given vector, but
1320        which lacks any position or weight information.  While the returned
1321        vector is much less useful than an unstripped vector for relevance
1322        ranking, it will usually be much smaller.
1323       </para>
1324      </listitem>
1325
1326     </varlistentry>
1327
1328    </variablelist>
1329
1330   </sect2>
1331
1332   <sect2 id="textsearch-manipulate-tsquery">
1333    <title>Manipulating Queries</title>
1334
1335    <para>
1336     <xref linkend="textsearch-parsing-queries"> showed how raw textual
1337     queries can be converted into <type>tsquery</> values.
1338     <productname>PostgreSQL</productname> also provides functions and
1339     operators that can be used to manipulate queries that are already
1340     in <type>tsquery</> form.
1341    </para>
1342
1343    <variablelist>
1344
1345     <varlistentry>
1346
1347      <term>
1348       <synopsis>
1349        <type>tsquery</> &amp;&amp; <type>tsquery</>
1350       </synopsis>
1351      </term>
1352
1353      <listitem>
1354       <para>
1355        Returns the AND-combination of the two given queries.
1356       </para>
1357      </listitem>
1358
1359     </varlistentry>
1360
1361     <varlistentry>
1362
1363      <term>
1364       <synopsis>
1365        <type>tsquery</> || <type>tsquery</>
1366       </synopsis>
1367      </term>
1368
1369      <listitem>
1370       <para>
1371        Returns the OR-combination of the two given queries.
1372       </para>
1373      </listitem>
1374
1375     </varlistentry>
1376
1377     <varlistentry>
1378
1379      <term>
1380       <synopsis>
1381        !! <type>tsquery</>
1382       </synopsis>
1383      </term>
1384
1385      <listitem>
1386       <para>
1387        Returns the negation (NOT) of the given query.
1388       </para>
1389      </listitem>
1390
1391     </varlistentry>
1392
1393     <varlistentry>
1394
1395      <indexterm>
1396       <primary>numnode</primary>
1397      </indexterm>
1398
1399      <term>
1400       <synopsis>
1401        numnode(<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>) returns <type>integer</>
1402       </synopsis>
1403      </term>
1404
1405      <listitem>
1406       <para>
1407        Returns the number of nodes (lexemes plus operators) in a
1408        <type>tsquery</>. This function is useful
1409        to determine if the <replaceable>query</replaceable> is meaningful
1410        (returns &gt; 0), or contains only stop words (returns 0).
1411        Examples:
1412
1413 <programlisting>
1414 SELECT numnode(plainto_tsquery('the any'));
1415 NOTICE:  query contains only stopword(s) or doesn't contain lexeme(s), ignored
1416  numnode
1417 ---------
1418        0
1419
1420 SELECT numnode('foo &amp; bar'::tsquery);
1421  numnode
1422 ---------
1423        3
1424 </programlisting>
1425       </para>
1426      </listitem>
1427     </varlistentry>
1428
1429     <varlistentry>
1430
1431      <indexterm>
1432       <primary>querytree</primary>
1433      </indexterm>
1434
1435      <term>
1436       <synopsis>
1437        querytree(<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>) returns <type>text</>
1438       </synopsis>
1439      </term>
1440
1441      <listitem>
1442       <para>
1443        Returns the portion of a <type>tsquery</> that can be used for
1444        searching an index.  This function is useful for detecting
1445        unindexable queries, for example those containing only stop words
1446        or only negated terms.  For example:
1447
1448 <programlisting>
1449 SELECT querytree(to_tsquery('!defined'));
1450  querytree
1451 -----------
1452
1453 </programlisting>
1454       </para>
1455      </listitem>
1456     </varlistentry>
1457
1458    </variablelist>
1459
1460    <sect3 id="textsearch-query-rewriting">
1461     <title>Query Rewriting</title>
1462
1463     <indexterm zone="textsearch-query-rewriting">
1464      <primary>ts_rewrite</primary>
1465     </indexterm>
1466
1467     <para>
1468      The <function>ts_rewrite</function> family of functions search a
1469      given <type>tsquery</> for occurrences of a target
1470      subquery, and replace each occurrence with another
1471      substitute subquery.  In essence this operation is a
1472      <type>tsquery</>-specific version of substring replacement.
1473      A target and substitute combination can be
1474      thought of as a <firstterm>query rewrite rule</>.  A collection
1475      of such rewrite rules can be a powerful search aid.
1476      For example, you can expand the search using synonyms
1477      (e.g., <literal>new york</>, <literal>big apple</>, <literal>nyc</>,
1478      <literal>gotham</>) or narrow the search to direct the user to some hot
1479      topic.  There is some overlap in functionality between this feature
1480      and thesaurus dictionaries (<xref linkend="textsearch-thesaurus">).
1481      However, you can modify a set of rewrite rules on-the-fly without
1482      reindexing, whereas updating a thesaurus requires reindexing to be
1483      effective.
1484     </para>
1485
1486     <variablelist>
1487
1488      <varlistentry>
1489
1490       <term>
1491        <synopsis>
1492         ts_rewrite (<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">target</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">substitute</replaceable> <type>tsquery</>) returns <type>tsquery</>
1493        </synopsis>
1494       </term>
1495
1496       <listitem>
1497        <para>
1498         This form of <function>ts_rewrite</> simply applies a single
1499         rewrite rule: <replaceable class="PARAMETER">target</replaceable>
1500         is replaced by <replaceable class="PARAMETER">substitute</replaceable>
1501         wherever it appears in <replaceable
1502         class="PARAMETER">query</replaceable>.  For example:
1503
1504 <programlisting>
1505 SELECT ts_rewrite('a &amp; b'::tsquery, 'a'::tsquery, 'c'::tsquery);
1506  ts_rewrite
1507 ------------
1508  'b' &amp; 'c'
1509 </programlisting>
1510        </para>
1511       </listitem>
1512      </varlistentry>
1513
1514      <varlistentry>
1515
1516       <term>
1517        <synopsis>
1518         ts_rewrite (<replaceable class="PARAMETER">query</> <type>tsquery</>, <replaceable class="PARAMETER">select</> <type>text</>) returns <type>tsquery</>
1519        </synopsis>
1520       </term>
1521
1522       <listitem>
1523        <para>
1524         This form of <function>ts_rewrite</> accepts a starting
1525         <replaceable>query</> and a SQL <replaceable>select</> command, which
1526         is given as a text string.  The <replaceable>select</> must yield two
1527         columns of <type>tsquery</> type.  For each row of the
1528         <replaceable>select</> result, occurrences of the first column value
1529         (the target) are replaced by the second column value (the substitute)
1530         within the current <replaceable>query</> value.  For example:
1531
1532 <programlisting>
1533 CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery);
1534 INSERT INTO aliases VALUES('a', 'c');
1535
1536 SELECT ts_rewrite('a &amp; b'::tsquery, 'SELECT t,s FROM aliases');
1537  ts_rewrite
1538 ------------
1539  'b' &amp; 'c'
1540 </programlisting>
1541        </para>
1542
1543        <para>
1544         Note that when multiple rewrite rules are applied in this way,
1545         the order of application can be important; so in practice you will
1546         want the source query to <literal>ORDER BY</> some ordering key.
1547        </para>
1548       </listitem>
1549      </varlistentry>
1550
1551     </variablelist>
1552
1553     <para>
1554      Let's consider a real-life astronomical example. We'll expand query
1555      <literal>supernovae</literal> using table-driven rewriting rules:
1556
1557 <programlisting>
1558 CREATE TABLE aliases (t tsquery primary key, s tsquery);
1559 INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn'));
1560
1561 SELECT ts_rewrite(to_tsquery('supernovae &amp; crab'), 'SELECT * FROM aliases');
1562            ts_rewrite
1563 ---------------------------------
1564  'crab' &amp; ( 'supernova' | 'sn' )
1565 </programlisting>
1566
1567      We can change the rewriting rules just by updating the table:
1568
1569 <programlisting>
1570 UPDATE aliases SET s = to_tsquery('supernovae|sn &amp; !nebulae') WHERE t = to_tsquery('supernovae');
1571
1572 SELECT ts_rewrite(to_tsquery('supernovae &amp; crab'), 'SELECT * FROM aliases');
1573                  ts_rewrite
1574 ---------------------------------------------
1575  'crab' &amp; ( 'supernova' | 'sn' &amp; !'nebula' )
1576 </programlisting>
1577     </para>
1578
1579     <para>
1580      Rewriting can be slow when there are many rewriting rules, since it
1581      checks every rule for a possible hit. To filter out obvious non-candidate
1582      rules we can use the containment operators for the <type>tsquery</type>
1583      type. In the example below, we select only those rules which might match
1584      the original query:
1585
1586 <programlisting>
1587 SELECT ts_rewrite('a &amp; b'::tsquery,
1588                   'SELECT t,s FROM aliases WHERE ''a &amp; b''::tsquery @&gt; t');
1589  ts_rewrite
1590 ------------
1591  'b' &amp; 'c'
1592 </programlisting>
1593     </para>
1594
1595    </sect3>
1596
1597   </sect2>
1598
1599   <sect2 id="textsearch-update-triggers">
1600    <title>Triggers for Automatic Updates</title>
1601
1602    <indexterm>
1603     <primary>trigger</primary>
1604     <secondary>for updating a derived tsvector column</secondary>
1605    </indexterm>
1606
1607    <para>
1608     When using a separate column to store the <type>tsvector</> representation
1609     of your documents, it is necessary to create a trigger to update the
1610     <type>tsvector</> column when the document content columns change.
1611     Two built-in trigger functions are available for this, or you can write
1612     your own.
1613    </para>
1614
1615    <synopsis>
1616     tsvector_update_trigger(<replaceable class="PARAMETER">tsvector_column_name</replaceable>, <replaceable class="PARAMETER">config_name</replaceable>, <replaceable class="PARAMETER">text_column_name</replaceable> <optional>, ... </optional>)
1617     tsvector_update_trigger_column(<replaceable class="PARAMETER">tsvector_column_name</replaceable>, <replaceable class="PARAMETER">config_column_name</replaceable>, <replaceable class="PARAMETER">text_column_name</replaceable> <optional>, ... </optional>)
1618    </synopsis>
1619
1620    <para>
1621     These trigger functions automatically compute a <type>tsvector</>
1622     column from one or more textual columns, under the control of
1623     parameters specified in the <command>CREATE TRIGGER</> command.
1624     An example of their use is:
1625
1626 <programlisting>
1627 CREATE TABLE messages (
1628     title       text,
1629     body        text,
1630     tsv         tsvector
1631 );
1632
1633 CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
1634 ON messages FOR EACH ROW EXECUTE PROCEDURE
1635 tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);
1636
1637 INSERT INTO messages VALUES('title here', 'the body text is here');
1638
1639 SELECT * FROM messages;
1640    title    |         body          |            tsv
1641 ------------+-----------------------+----------------------------
1642  title here | the body text is here | 'bodi':4 'text':5 'titl':1
1643
1644 SELECT title, body FROM messages WHERE tsv @@ to_tsquery('title &amp; body');
1645    title    |         body
1646 ------------+-----------------------
1647  title here | the body text is here
1648 </programlisting>
1649
1650     Having created this trigger, any change in <structfield>title</> or
1651     <structfield>body</> will automatically be reflected into
1652     <structfield>tsv</>, without the application having to worry about it.
1653    </para>
1654
1655    <para>
1656     The first trigger argument must be the name of the <type>tsvector</>
1657     column to be updated.  The second argument specifies the text search
1658     configuration to be used to perform the conversion.  For
1659     <function>tsvector_update_trigger</>, the configuration name is simply
1660     given as the second trigger argument.  It must be schema-qualified as
1661     shown above, so that the trigger behavior will not change with changes
1662     in <varname>search_path</>.  For
1663     <function>tsvector_update_trigger_column</>, the second trigger argument
1664     is the name of another table column, which must be of type
1665     <type>regconfig</>.  This allows a per-row selection of configuration
1666     to be made.  The remaining argument(s) are the names of textual columns
1667     (of type <type>text</>, <type>varchar</>, or <type>char</>).  These
1668     will be included in the document in the order given.  NULL values will
1669     be skipped (but the other columns will still be indexed).
1670    </para>
1671
1672    <para>
1673     A limitation of the built-in triggers is that they treat all the
1674     input columns alike.  To process columns differently &mdash; for
1675     example, to weight title differently from body &mdash; it is necessary
1676     to write a custom trigger.  Here is an example using
1677     <application>PL/pgSQL</application> as the trigger language:
1678
1679 <programlisting>
1680 CREATE FUNCTION messages_trigger() RETURNS trigger AS $$
1681 begin
1682   new.tsv :=
1683      setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') ||
1684      setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D');
1685   return new;
1686 end
1687 $$ LANGUAGE plpgsql;
1688
1689 CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
1690 ON messages FOR EACH ROW EXECUTE PROCEDURE messages_trigger();
1691 </programlisting>
1692    </para>
1693
1694    <para>
1695     Keep in mind that it is important to specify the configuration name
1696     explicitly when creating <type>tsvector</> values inside triggers,
1697     so that the column's contents will not be affected by changes to
1698     <varname>default_text_search_config</>.  Failure to do this is likely to
1699     lead to problems such as search results changing after a dump and reload.
1700    </para>
1701
1702   </sect2>
1703
1704   <sect2 id="textsearch-statistics">
1705    <title>Gathering Document Statistics</title>
1706
1707    <indexterm>
1708     <primary>ts_stat</primary>
1709    </indexterm>
1710
1711    <para>
1712     The function <function>ts_stat</> is useful for checking your
1713     configuration and for finding stop-word candidates.
1714    </para>
1715
1716    <synopsis>
1717     ts_stat(<replaceable class="PARAMETER">sqlquery</replaceable> <type>text</>, <optional> <replaceable class="PARAMETER">weights</replaceable> <type>text</>, </optional> OUT <replaceable class="PARAMETER">word</replaceable> <type>text</>, OUT <replaceable class="PARAMETER">ndoc</replaceable> <type>integer</>, OUT <replaceable class="PARAMETER">nentry</replaceable> <type>integer</>) returns <type>setof record</>
1718    </synopsis>
1719
1720    <para>
1721     <replaceable>sqlquery</replaceable> is a text value containing a SQL
1722     query which must return a single <type>tsvector</type> column.
1723     <function>ts_stat</> executes the query and returns statistics about
1724     each distinct lexeme (word) contained in the <type>tsvector</type>
1725     data.  The columns returned are
1726
1727     <itemizedlist  spacing="compact" mark="bullet">
1728      <listitem>
1729       <para>
1730        <replaceable>word</> <type>text</> &mdash; the value of a lexeme
1731       </para>
1732      </listitem>
1733      <listitem>
1734       <para>
1735        <replaceable>ndoc</> <type>integer</> &mdash; number of documents
1736        (<type>tsvector</>s) the word occurred in
1737       </para>
1738      </listitem>
1739      <listitem>
1740       <para>
1741        <replaceable>nentry</> <type>integer</> &mdash; total number of
1742        occurrences of the word
1743       </para>
1744      </listitem>
1745     </itemizedlist>
1746
1747     If <replaceable>weights</replaceable> is supplied, only occurrences
1748     having one of those weights are counted.
1749    </para>
1750
1751    <para>
1752     For example, to find the ten most frequent words in a document collection:
1753
1754 <programlisting>
1755 SELECT * FROM ts_stat('SELECT vector FROM apod')
1756 ORDER BY nentry DESC, ndoc DESC, word
1757 LIMIT 10;
1758 </programlisting>
1759
1760     The same, but counting only word occurrences with weight <literal>A</>
1761     or <literal>B</>:
1762
1763 <programlisting>
1764 SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
1765 ORDER BY nentry DESC, ndoc DESC, word
1766 LIMIT 10;
1767 </programlisting>
1768    </para>
1769
1770   </sect2>
1771
1772  </sect1>
1773
1774  <sect1 id="textsearch-parsers">
1775   <title>Parsers</title>
1776
1777   <para>
1778    Text search parsers are responsible for splitting raw document text
1779    into <firstterm>tokens</> and identifying each token's type, where
1780    the set of possible types is defined by the parser itself.
1781    Note that a parser does not modify the text at all &mdash; it simply
1782    identifies plausible word boundaries.  Because of this limited scope,
1783    there is less need for application-specific custom parsers than there is
1784    for custom dictionaries.  At present <productname>PostgreSQL</productname>
1785    provides just one built-in parser, which has been found to be useful for a
1786    wide range of applications.
1787   </para>
1788
1789   <para>
1790    The built-in parser is named <literal>pg_catalog.default</>.
1791    It recognizes 23 token types:
1792   </para>
1793
1794   <table id="textsearch-default-parser">
1795    <title>Default Parser's Token Types</title>
1796    <tgroup cols="3">
1797     <thead>
1798      <row>
1799       <entry>Alias</entry>
1800       <entry>Description</entry>
1801       <entry>Example</entry>
1802      </row>
1803     </thead>
1804     <tbody>
1805      <row>
1806       <entry><literal>asciiword</></entry>
1807       <entry>Word, all ASCII letters</entry>
1808       <entry><literal>elephant</literal></entry>
1809      </row>
1810      <row>
1811       <entry><literal>word</></entry>
1812       <entry>Word, all letters</entry>
1813       <entry><literal>ma&ntilde;ana</literal></entry>
1814      </row>
1815      <row>
1816       <entry><literal>numword</></entry>
1817       <entry>Word, letters and digits</entry>
1818       <entry><literal>beta1</literal></entry>
1819      </row>
1820      <row>
1821       <entry><literal>asciihword</></entry>
1822       <entry>Hyphenated word, all ASCII</entry>
1823       <entry><literal>up-to-date</literal></entry>
1824      </row>
1825      <row>
1826       <entry><literal>hword</></entry>
1827       <entry>Hyphenated word, all letters</entry>
1828       <entry><literal>l&oacute;gico-matem&aacute;tica</literal></entry>
1829      </row>
1830      <row>
1831       <entry><literal>numhword</></entry>
1832       <entry>Hyphenated word, letters and digits</entry>
1833       <entry><literal>postgresql-beta1</literal></entry>
1834      </row>
1835      <row>
1836       <entry><literal>hword_asciipart</></entry>
1837       <entry>Hyphenated word part, all ASCII</entry>
1838       <entry><literal>postgresql</literal> in the context <literal>postgresql-beta1</literal></entry>
1839      </row>
1840      <row>
1841       <entry><literal>hword_part</></entry>
1842       <entry>Hyphenated word part, all letters</entry>
1843       <entry><literal>l&oacute;gico</literal> or <literal>matem&aacute;tica</literal>
1844        in the context <literal>l&oacute;gico-matem&aacute;tica</literal></entry>
1845      </row>
1846      <row>
1847       <entry><literal>hword_numpart</></entry>
1848       <entry>Hyphenated word part, letters and digits</entry>
1849       <entry><literal>beta1</literal> in the context
1850        <literal>postgresql-beta1</literal></entry>
1851      </row>
1852      <row>
1853       <entry><literal>email</></entry>
1854       <entry>Email address</entry>
1855       <entry><literal>foo@example.com</literal></entry>
1856      </row>
1857      <row>
1858       <entry><literal>protocol</></entry>
1859       <entry>Protocol head</entry>
1860       <entry><literal>http://</literal></entry>
1861      </row>
1862      <row>
1863       <entry><literal>url</></entry>
1864       <entry>URL</entry>
1865       <entry><literal>example.com/stuff/index.html</literal></entry>
1866      </row>
1867      <row>
1868       <entry><literal>host</></entry>
1869       <entry>Host</entry>
1870       <entry><literal>example.com</literal></entry>
1871      </row>
1872      <row>
1873       <entry><literal>url_path</></entry>
1874       <entry>URL path</entry>
1875       <entry><literal>/stuff/index.html</literal>, in the context of a URL</entry>
1876      </row>
1877      <row>
1878       <entry><literal>file</></entry>
1879       <entry>File or path name</entry>
1880       <entry><literal>/usr/local/foo.txt</literal>, if not within a URL</entry>
1881      </row>
1882      <row>
1883       <entry><literal>sfloat</></entry>
1884       <entry>Scientific notation</entry>
1885       <entry><literal>-1.234e56</literal></entry>
1886      </row>
1887      <row>
1888       <entry><literal>float</></entry>
1889       <entry>Decimal notation</entry>
1890       <entry><literal>-1.234</literal></entry>
1891      </row>
1892      <row>
1893       <entry><literal>int</></entry>
1894       <entry>Signed integer</entry>
1895       <entry><literal>-1234</literal></entry>
1896      </row>
1897      <row>
1898       <entry><literal>uint</></entry>
1899       <entry>Unsigned integer</entry>
1900       <entry><literal>1234</literal></entry>
1901      </row>
1902      <row>
1903       <entry><literal>version</></entry>
1904       <entry>Version number</entry>
1905       <entry><literal>8.3.0</literal></entry>
1906      </row>
1907      <row>
1908       <entry><literal>tag</></entry>
1909       <entry>XML tag</entry>
1910       <entry><literal>&lt;a href="dictionaries.html"&gt;</literal></entry>
1911      </row>
1912      <row>
1913       <entry><literal>entity</></entry>
1914       <entry>XML entity</entry>
1915       <entry><literal>&amp;amp;</literal></entry>
1916      </row>
1917      <row>
1918       <entry><literal>blank</></entry>
1919       <entry>Space symbols</entry>
1920       <entry>(any whitespace or punctuation not otherwise recognized)</entry>
1921      </row>
1922     </tbody>
1923    </tgroup>
1924   </table>
1925
1926   <note>
1927    <para>
1928     The parser's notion of a <quote>letter</> is determined by the database's
1929     locale setting, specifically <varname>lc_ctype</>.  Words containing
1930     only the basic ASCII letters are reported as a separate token type,
1931     since it is sometimes useful to distinguish them.  In most European
1932     languages, token types <literal>word</> and <literal>asciiword</>
1933     should always be treated alike.
1934    </para>
1935   </note>
1936
1937   <para>
1938    It is possible for the parser to produce overlapping tokens from the same
1939    piece of text.  As an example, a hyphenated word will be reported both
1940    as the entire word and as each component:
1941
1942 <programlisting>
1943 SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
1944       alias      |               description                |     token
1945 -----------------+------------------------------------------+---------------
1946  numhword        | Hyphenated word, letters and digits      | foo-bar-beta1
1947  hword_asciipart | Hyphenated word part, all ASCII          | foo
1948  blank           | Space symbols                            | -
1949  hword_asciipart | Hyphenated word part, all ASCII          | bar
1950  blank           | Space symbols                            | -
1951  hword_numpart   | Hyphenated word part, letters and digits | beta1
1952 </programlisting>
1953
1954    This behavior is desirable since it allows searches to work for both
1955    the whole compound word and for components.  Here is another
1956    instructive example:
1957
1958 <programlisting>
1959 SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
1960   alias   |  description  |            token
1961 ----------+---------------+------------------------------
1962  protocol | Protocol head | http://
1963  url      | URL           | example.com/stuff/index.html
1964  host     | Host          | example.com
1965  url_path | URL path      | /stuff/index.html
1966 </programlisting>
1967   </para>
1968
1969  </sect1>
1970
1971  <sect1 id="textsearch-dictionaries">
1972   <title>Dictionaries</title>
1973
1974   <para>
1975    Dictionaries are used to eliminate words that should not be considered in a
1976    search (<firstterm>stop words</>), and to <firstterm>normalize</> words so
1977    that different derived forms of the same word will match.  A successfully
1978    normalized word is called a <firstterm>lexeme</>.  Aside from
1979    improving search quality, normalization and removal of stop words reduce the
1980    size of the <type>tsvector</type> representation of a document, thereby
1981    improving performance.  Normalization does not always have linguistic meaning
1982    and usually depends on application semantics.
1983   </para>
1984
1985   <para>
1986    Some examples of normalization:
1987
1988    <itemizedlist  spacing="compact" mark="bullet">
1989
1990     <listitem>
1991      <para>
1992       Linguistic - Ispell dictionaries try to reduce input words to a
1993       normalized form; stemmer dictionaries remove word endings
1994      </para>
1995     </listitem>
1996     <listitem>
1997      <para>
1998       <acronym>URL</acronym> locations can be canonicalized to make
1999       equivalent URLs match:
2000
2001       <itemizedlist  spacing="compact" mark="bullet">
2002        <listitem>
2003         <para>
2004          http://www.pgsql.ru/db/mw/index.html
2005         </para>
2006        </listitem>
2007        <listitem>
2008         <para>
2009          http://www.pgsql.ru/db/mw/
2010         </para>
2011        </listitem>
2012        <listitem>
2013         <para>
2014          http://www.pgsql.ru/db/../db/mw/index.html
2015         </para>
2016        </listitem>
2017       </itemizedlist>
2018      </para>
2019     </listitem>
2020     <listitem>
2021      <para>
2022       Color names can be replaced by their hexadecimal values, e.g.,
2023       <literal>red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF</literal>
2024      </para>
2025     </listitem>
2026     <listitem>
2027      <para>
2028       If indexing numbers, we can
2029       remove some fractional digits to reduce the range of possible
2030       numbers, so for example <emphasis>3.14</emphasis>159265359,
2031       <emphasis>3.14</emphasis>15926, <emphasis>3.14</emphasis> will be the same
2032       after normalization if only two digits are kept after the decimal point.
2033      </para>
2034     </listitem>
2035    </itemizedlist>
2036
2037   </para>
2038
2039   <para>
2040    A dictionary is a program that accepts a token as
2041    input and returns:
2042    <itemizedlist  spacing="compact" mark="bullet">
2043     <listitem>
2044      <para>
2045       an array of lexemes if the input token is known to the dictionary
2046       (notice that one token can produce more than one lexeme)
2047      </para>
2048     </listitem>
2049     <listitem>
2050      <para>
2051       an empty array if the dictionary knows the token, but it is a stop word
2052      </para>
2053     </listitem>
2054     <listitem>
2055      <para>
2056       <literal>NULL</literal> if the dictionary does not recognize the input token
2057      </para>
2058     </listitem>
2059    </itemizedlist>
2060   </para>
2061
2062   <para>
2063    <productname>PostgreSQL</productname> provides predefined dictionaries for
2064    many languages.  There are also several predefined templates that can be
2065    used to create new dictionaries with custom parameters.  Each predefined
2066    dictionary template is described below.  If no existing
2067    template is suitable, it is possible to create new ones; see the
2068    <filename>contrib/</> area of the <productname>PostgreSQL</> distribution
2069    for examples.
2070   </para>
2071
2072   <para>
2073    A text search configuration binds a parser together with a set of
2074    dictionaries to process the parser's output tokens.  For each token
2075    type that the parser can return, a separate list of dictionaries is
2076    specified by the configuration.  When a token of that type is found
2077    by the parser, each dictionary in the list is consulted in turn,
2078    until some dictionary recognizes it as a known word.  If it is identified
2079    as a stop word, or if no dictionary recognizes the token, it will be
2080    discarded and not indexed or searched for.
2081    The general rule for configuring a list of dictionaries
2082    is to place first the most narrow, most specific dictionary, then the more
2083    general dictionaries, finishing with a very general dictionary, like
2084    a <application>Snowball</> stemmer or <literal>simple</>, which
2085    recognizes everything.  For example, for an astronomy-specific search
2086    (<literal>astro_en</literal> configuration) one could bind token type
2087    <type>asciiword</type> (ASCII word) to a synonym dictionary of astronomical
2088    terms, a general English dictionary and a <application>Snowball</> English
2089    stemmer:
2090
2091 <programlisting>
2092 ALTER TEXT SEARCH CONFIGURATION astro_en
2093     ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
2094 </programlisting>
2095   </para>
2096
2097   <sect2 id="textsearch-stopwords">
2098    <title>Stop Words</title>
2099
2100    <para>
2101     Stop words are words that are very common, appear in almost every
2102     document, and have no discrimination value. Therefore, they can be ignored
2103     in the context of full text searching. For example, every English text
2104     contains words like <literal>a</literal> and <literal>the</>, so it is
2105     useless to store them in an index.  However, stop words do affect the
2106     positions in <type>tsvector</type>, which in turn affect ranking:
2107
2108 <programlisting>
2109 SELECT to_tsvector('english','in the list of stop words');
2110         to_tsvector
2111 ----------------------------
2112  'list':3 'stop':5 'word':6
2113 </programlisting>
2114
2115     The missing positions 1,2,4 are because of stop words.  Ranks
2116     calculated for documents with and without stop words are quite different:
2117
2118 <programlisting>
2119 SELECT ts_rank_cd (to_tsvector('english','in the list of stop words'), to_tsquery('list &amp; stop'));
2120  ts_rank_cd
2121 ------------
2122        0.05
2123
2124 SELECT ts_rank_cd (to_tsvector('english','list stop words'), to_tsquery('list &amp; stop'));
2125  ts_rank_cd
2126 ------------
2127         0.1
2128 </programlisting>
2129
2130    </para>
2131
2132    <para>
2133     It is up to the specific dictionary how it treats stop words. For example,
2134     <literal>ispell</literal> dictionaries first normalize words and then
2135     look at the list of stop words, while <literal>Snowball</literal> stemmers
2136     first check the list of stop words. The reason for the different
2137     behavior is an attempt to decrease noise.
2138    </para>
2139
2140   </sect2>
2141
2142   <sect2 id="textsearch-simple-dictionary">
2143    <title>Simple Dictionary</title>
2144
2145    <para>
2146     The <literal>simple</> dictionary template operates by converting the
2147     input token to lower case and checking it against a file of stop words.
2148     If it is found in the file then an empty array is returned, causing
2149     the token to be discarded.  If not, the lower-cased form of the word
2150     is returned as the normalized lexeme.  Alternatively, the dictionary
2151     can be configured to report non-stop-words as unrecognized, allowing
2152     them to be passed on to the next dictionary in the list.
2153    </para>
2154
2155    <para>
2156     Here is an example of a dictionary definition using the <literal>simple</>
2157     template:
2158
2159 <programlisting>
2160 CREATE TEXT SEARCH DICTIONARY public.simple_dict (
2161     TEMPLATE = pg_catalog.simple,
2162     STOPWORDS = english
2163 );
2164 </programlisting>
2165
2166     Here, <literal>english</literal> is the base name of a file of stop words.
2167     The file's full name will be
2168     <filename>$SHAREDIR/tsearch_data/english.stop</>,
2169     where <literal>$SHAREDIR</> means the
2170     <productname>PostgreSQL</productname> installation's shared-data directory,
2171     often <filename>/usr/local/share/postgresql</> (use <command>pg_config
2172     --sharedir</> to determine it if you're not sure).
2173     The file format is simply a list
2174     of words, one per line.  Blank lines and trailing spaces are ignored,
2175     and upper case is folded to lower case, but no other processing is done
2176     on the file contents.
2177    </para>
2178
2179    <para>
2180     Now we can test our dictionary:
2181
2182 <programlisting>
2183 SELECT ts_lexize('public.simple_dict','YeS');
2184  ts_lexize
2185 -----------
2186  {yes}
2187
2188 SELECT ts_lexize('public.simple_dict','The');
2189  ts_lexize
2190 -----------
2191  {}
2192 </programlisting>
2193    </para>
2194
2195    <para>
2196     We can also choose to return <literal>NULL</>, instead of the lower-cased
2197     word, if it is not found in the stop words file.  This behavior is
2198     selected by setting the dictionary's <literal>Accept</> parameter to
2199     <literal>false</>.  Continuing the example:
2200
2201 <programlisting>
2202 ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );
2203
2204 SELECT ts_lexize('public.simple_dict','YeS');
2205  ts_lexize
2206 -----------
2207
2208
2209 SELECT ts_lexize('public.simple_dict','The');
2210  ts_lexize
2211 -----------
2212  {}
2213 </programlisting>
2214    </para>
2215
2216    <para>
2217     With the default setting of <literal>Accept</> = <literal>true</>,
2218     it is only useful to place a <literal>simple</> dictionary at the end
2219     of a list of dictionaries, since it will never pass on any token to
2220     a following dictionary.  Conversely, <literal>Accept</> = <literal>false</>
2221     is only useful when there is at least one following dictionary.
2222    </para>
2223
2224    <caution>
2225     <para>
2226      Most types of dictionaries rely on configuration files, such as files of
2227      stop words.  These files <emphasis>must</> be stored in UTF-8 encoding.
2228      They will be translated to the actual database encoding, if that is
2229      different, when they are read into the server.
2230     </para>
2231    </caution>
2232
2233    <caution>
2234     <para>
2235      Normally, a database session will read a dictionary configuration file
2236      only once, when it is first used within the session.  If you modify a
2237      configuration file and want to force existing sessions to pick up the
2238      new contents, issue an <command>ALTER TEXT SEARCH DICTIONARY</> command
2239      on the dictionary.  This can be a <quote>dummy</> update that doesn't
2240      actually change any parameter values.
2241     </para>
2242    </caution>
2243
2244   </sect2>
2245
2246   <sect2 id="textsearch-synonym-dictionary">
2247    <title>Synonym Dictionary</title>
2248
2249    <para>
2250     This dictionary template is used to create dictionaries that replace a
2251     word with a synonym. Phrases are not supported (use the thesaurus
2252     template (<xref linkend="textsearch-thesaurus">) for that).  A synonym
2253     dictionary can be used to overcome linguistic problems, for example, to
2254     prevent an English stemmer dictionary from reducing the word 'Paris' to
2255     'pari'.  It is enough to have a <literal>Paris paris</literal> line in the
2256     synonym dictionary and put it before the <literal>english_stem</>
2257     dictionary.  For example:
2258
2259 <programlisting>
2260 SELECT * FROM ts_debug('english', 'Paris');
2261    alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
2262 -----------+-----------------+-------+----------------+--------------+---------
2263  asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari}
2264
2265 CREATE TEXT SEARCH DICTIONARY my_synonym (
2266     TEMPLATE = synonym,
2267     SYNONYMS = my_synonyms
2268 );
2269
2270 ALTER TEXT SEARCH CONFIGURATION english
2271     ALTER MAPPING FOR asciiword WITH my_synonym, english_stem;
2272
2273 SELECT * FROM ts_debug('english', 'Paris');
2274    alias   |   description   | token |       dictionaries        | dictionary | lexemes
2275 -----------+-----------------+-------+---------------------------+------------+---------
2276  asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
2277 </programlisting>
2278    </para>
2279
2280    <para>
2281     The only parameter required by the <literal>synonym</> template is
2282     <literal>SYNONYMS</>, which is the base name of its configuration file
2283     &mdash; <literal>my_synonyms</> in the above example.
2284     The file's full name will be
2285     <filename>$SHAREDIR/tsearch_data/my_synonyms.syn</>
2286     (where <literal>$SHAREDIR</> means the
2287     <productname>PostgreSQL</> installation's shared-data directory).
2288     The file format is just one line
2289     per word to be substituted, with the word followed by its synonym,
2290     separated by white space.  Blank lines and trailing spaces are ignored.
2291    </para>
2292
2293    <para>
2294     The <literal>synonym</> template also has an optional parameter
2295     <literal>CaseSensitive</>, which defaults to <literal>false</>.  When
2296     <literal>CaseSensitive</> is <literal>false</>, words in the synonym file
2297     are folded to lower case, as are input tokens.  When it is
2298     <literal>true</>, words and tokens are not folded to lower case,
2299     but are compared as-is.
2300    </para>
2301   </sect2>
2302
2303   <sect2 id="textsearch-thesaurus">
2304    <title>Thesaurus Dictionary</title>
2305
2306    <para>
2307     A thesaurus dictionary (sometimes abbreviated as <acronym>TZ</acronym>) is
2308     a collection of words that includes information about the relationships
2309     of words and phrases, i.e., broader terms (<acronym>BT</acronym>), narrower
2310     terms (<acronym>NT</acronym>), preferred terms, non-preferred terms, related
2311     terms, etc.
2312    </para>
2313
2314    <para>
2315     Basically a thesaurus dictionary replaces all non-preferred terms by one
2316     preferred term and, optionally, preserves the original terms for indexing
2317     as well.  <productname>PostgreSQL</>'s current implementation of the
2318     thesaurus dictionary is an extension of the synonym dictionary with added
2319     <firstterm>phrase</firstterm> support.  A thesaurus dictionary requires
2320     a configuration file of the following format:
2321
2322 <programlisting>
2323 # this is a comment
2324 sample word(s) : indexed word(s)
2325 more sample word(s) : more indexed word(s)
2326 ...
2327 </programlisting>
2328
2329     where  the colon (<symbol>:</symbol>) symbol acts as a delimiter between a
2330     a phrase and its replacement.
2331    </para>
2332
2333    <para>
2334     A thesaurus dictionary uses a <firstterm>subdictionary</firstterm> (which
2335     is specified in the dictionary's configuration) to normalize the input
2336     text before checking for phrase matches. It is only possible to select one
2337     subdictionary.  An error is reported if the subdictionary fails to
2338     recognize a word. In that case, you should remove the use of the word or
2339     teach the subdictionary about it.  You can place an asterisk
2340     (<symbol>*</symbol>) at the beginning of an indexed word to skip applying
2341     the subdictionary to it, but all sample words <emphasis>must</> be known
2342     to the subdictionary.
2343    </para>
2344
2345    <para>
2346     The thesaurus dictionary chooses the longest match if there are multiple
2347     phrases matching the input, and ties are broken by using the last
2348     definition.
2349    </para>
2350
2351    <para>
2352     Specific stop words recognized by the subdictionary cannot be
2353     specified;  instead use <literal>?</> to mark the location where any
2354     stop word can appear.  For example, assuming that <literal>a</> and
2355     <literal>the</> are stop words according to the subdictionary:
2356
2357 <programlisting>
2358 ? one ? two : swsw
2359 </programlisting>
2360
2361     matches <literal>a one the two</> and <literal>the one a two</>;
2362     both would be replaced by <literal>swsw</>.
2363    </para>
2364
2365    <para>
2366     Since a thesaurus dictionary has the capability to recognize phrases it
2367     must remember its state and interact with the parser. A thesaurus dictionary
2368     uses these assignments to check if it should handle the next word or stop
2369     accumulation.  The thesaurus dictionary must be configured
2370     carefully. For example, if the thesaurus dictionary is assigned to handle
2371     only the <literal>asciiword</literal> token, then a thesaurus dictionary
2372     definition like <literal>one 7</> will not work since token type
2373     <literal>uint</literal> is not assigned to the thesaurus dictionary.
2374    </para>
2375
2376    <caution>
2377     <para>
2378      Thesauruses are used during indexing so any change in the thesaurus
2379      dictionary's parameters <emphasis>requires</emphasis> reindexing.
2380      For most other dictionary types, small changes such as adding or
2381      removing stopwords does not force reindexing.
2382     </para>
2383    </caution>
2384
2385   <sect3 id="textsearch-thesaurus-config">
2386    <title>Thesaurus Configuration</title>
2387
2388    <para>
2389     To define a new thesaurus dictionary, use the <literal>thesaurus</>
2390     template.  For example:
2391
2392 <programlisting>
2393 CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
2394     TEMPLATE = thesaurus,
2395     DictFile = mythesaurus,
2396     Dictionary = pg_catalog.english_stem
2397 );
2398 </programlisting>
2399
2400     Here:
2401     <itemizedlist  spacing="compact" mark="bullet">
2402      <listitem>
2403       <para>
2404        <literal>thesaurus_simple</literal> is the new dictionary's name
2405       </para>
2406      </listitem>
2407      <listitem>
2408       <para>
2409        <literal>mythesaurus</literal> is the base name of the thesaurus
2410        configuration file.
2411        (Its full name will be <filename>$SHAREDIR/tsearch_data/mythesaurus.ths</>,
2412        where <literal>$SHAREDIR</> means the installation shared-data
2413        directory.)
2414       </para>
2415      </listitem>
2416      <listitem>
2417       <para>
2418        <literal>pg_catalog.english_stem</literal> is the subdictionary (here,
2419        a Snowball English stemmer) to use for thesaurus normalization.
2420        Notice that the subdictionary will have its own
2421        configuration (for example, stop words), which is not shown here.
2422       </para>
2423      </listitem>
2424     </itemizedlist>
2425
2426     Now it is possible to bind the thesaurus dictionary <literal>thesaurus_simple</literal>
2427     to the desired token types in a configuration, for example:
2428
2429 <programlisting>
2430 ALTER TEXT SEARCH CONFIGURATION russian
2431     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart WITH thesaurus_simple;
2432 </programlisting>
2433    </para>
2434
2435   </sect3>
2436
2437   <sect3 id="textsearch-thesaurus-examples">
2438    <title>Thesaurus Example</title>
2439
2440    <para>
2441     Consider a simple astronomical thesaurus <literal>thesaurus_astro</literal>,
2442     which contains some astronomical word combinations:
2443
2444 <programlisting>
2445 supernovae stars : sn
2446 crab nebulae : crab
2447 </programlisting>
2448
2449     Below we create a dictionary and bind some token types to
2450     an astronomical thesaurus and English stemmer:
2451
2452 <programlisting>
2453 CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
2454     TEMPLATE = thesaurus,
2455     DictFile = thesaurus_astro,
2456     Dictionary = english_stem
2457 );
2458
2459 ALTER TEXT SEARCH CONFIGURATION russian
2460     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart WITH thesaurus_astro, english_stem;
2461 </programlisting>
2462
2463     Now we can see how it works.
2464     <function>ts_lexize</function> is not very useful for testing a thesaurus,
2465     because it treats its input as a single token.  Instead we can use
2466     <function>plainto_tsquery</function> and <function>to_tsvector</function>
2467     which will break their input strings into multiple tokens:
2468
2469 <programlisting>
2470 SELECT plainto_tsquery('supernova star');
2471  plainto_tsquery
2472 -----------------
2473  'sn'
2474
2475 SELECT to_tsvector('supernova star');
2476  to_tsvector
2477 -------------
2478  'sn':1
2479 </programlisting>
2480
2481     In principle, one can use <function>to_tsquery</function> if you quote
2482     the argument:
2483
2484 <programlisting>
2485 SELECT to_tsquery('''supernova star''');
2486  to_tsquery
2487 ------------
2488  'sn'
2489 </programlisting>
2490
2491     Notice that <literal>supernova star</literal> matches <literal>supernovae
2492     stars</literal> in <literal>thesaurus_astro</literal> because we specified
2493     the <literal>english_stem</literal> stemmer in the thesaurus definition.
2494     The stemmer removed the <literal>e</> and <literal>s</>.
2495    </para>
2496
2497    <para>
2498     To index the original phrase as well as the substitute, just include it
2499     in the right-hand part of the definition:
2500
2501 <programlisting>
2502 supernovae stars : sn supernovae stars
2503
2504 SELECT plainto_tsquery('supernova star');
2505        plainto_tsquery
2506 -----------------------------
2507  'sn' &amp; 'supernova' &amp; 'star'
2508 </programlisting>
2509    </para>
2510
2511   </sect3>
2512
2513   </sect2>
2514
2515   <sect2 id="textsearch-ispell-dictionary">
2516    <title><application>Ispell</> Dictionary</title>
2517
2518    <para>
2519     The <application>Ispell</> dictionary template supports
2520     <firstterm>morphological dictionaries</>, which can normalize many
2521     different linguistic forms of a word into the same lexeme.  For example,
2522     an English <application>Ispell</> dictionary can match all declensions and
2523     conjugations of the search term <literal>bank</literal>, e.g.
2524     <literal>banking</>, <literal>banked</>, <literal>banks</>,
2525     <literal>banks'</>, and <literal>bank's</>.
2526    </para>
2527
2528    <para>
2529     The standard <productname>PostgreSQL</productname> distribution does
2530     not include any <application>Ispell</> configuration files.
2531     Dictionaries for a large number of languages are available from <ulink
2532     url="http://ficus-www.cs.ucla.edu/geoff/ispell.html">Ispell</ulink>.
2533     Also, some more modern dictionary file formats are supported &mdash; <ulink
2534     url="http://en.wikipedia.org/wiki/MySpell">MySpell</ulink> (OO &lt; 2.0.1)
2535     and <ulink url="http://sourceforge.net/projects/hunspell">Hunspell</ulink>
2536     (OO &gt;= 2.0.2).  A large list of dictionaries is available on the <ulink
2537     url="http://wiki.services.openoffice.org/wiki/Dictionaries">OpenOffice
2538     Wiki</ulink>.
2539    </para>
2540
2541    <para>
2542     To create an <application>Ispell</> dictionary, use the built-in
2543     <literal>ispell</literal> template and specify several parameters:
2544    </para>
2545
2546 <programlisting>
2547 CREATE TEXT SEARCH DICTIONARY english_ispell (
2548     TEMPLATE = ispell,
2549     DictFile = english,
2550     AffFile = english,
2551     StopWords = english
2552 );
2553 </programlisting>
2554
2555    <para>
2556     Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
2557     specify the base names of the dictionary, affixes, and stop-words files.
2558     The stop-words file has the same format explained above for the
2559     <literal>simple</> dictionary type.  The format of the other files is
2560     not specified here but is available from the above-mentioned web sites.
2561    </para>
2562
2563    <para>
2564     Ispell dictionaries usually recognize a limited set of words, so they
2565     should be followed by another broader dictionary; for
2566     example, a Snowball dictionary, which recognizes everything.
2567    </para>
2568
2569    <para>
2570     Ispell dictionaries support splitting compound words.
2571     This is a nice feature and
2572     <productname>PostgreSQL</productname> supports it.
2573     Notice that the affix file should specify a special flag using the
2574     <literal>compoundwords controlled</literal> statement that marks dictionary
2575     words that can participate in compound formation:
2576
2577 <programlisting>
2578 compoundwords  controlled z
2579 </programlisting>
2580
2581     Here are some examples for the Norwegian language:
2582
2583 <programlisting>
2584 SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
2585    {over,buljong,terning,pakk,mester,assistent}
2586 SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
2587    {sjokoladefabrikk,sjokolade,fabrikk}
2588 </programlisting>
2589    </para>
2590
2591    <note>
2592     <para>
2593      <application>MySpell</> does not support compound words.
2594      <application>Hunspell</> has sophisticated support for compound words. At
2595      present, <productname>PostgreSQL</productname> implements only the basic
2596      compound word operations of Hunspell.
2597     </para>
2598    </note>
2599
2600   </sect2>
2601
2602   <sect2 id="textsearch-snowball-dictionary">
2603    <title><application>Snowball</> Dictionary</title>
2604
2605    <para>
2606     The <application>Snowball</> dictionary template is based on the project
2607     of Martin Porter, inventor of the popular Porter's stemming algorithm
2608     for the English language.  Snowball now provides stemming algorithms for
2609     many languages (see the <ulink url="http://snowball.tartarus.org">Snowball
2610     site</ulink> for more information).  Each algorithm understands how to
2611     reduce common variant forms of words to a base, or stem, spelling within
2612     its language.  A Snowball dictionary requires a <literal>language</>
2613     parameter to identify which stemmer to use, and optionally can specify a
2614     <literal>stopword</> file name that gives a list of words to eliminate.
2615     (<productname>PostgreSQL</productname>'s standard stopword lists are also
2616     provided by the Snowball project.)
2617     For example, there is a built-in definition equivalent to
2618
2619 <programlisting>
2620 CREATE TEXT SEARCH DICTIONARY english_stem (
2621     TEMPLATE = snowball,
2622     Language = english,
2623     StopWords = english
2624 );
2625 </programlisting>
2626
2627     The stopword file format is the same as already explained.
2628    </para>
2629
2630    <para>
2631     A <application>Snowball</> dictionary recognizes everything, whether
2632     or not it is able to simplify the word, so it should be placed
2633     at the end of the dictionary list. It is useless to have it
2634     before any other dictionary because a token will never pass through it to
2635     the next dictionary.
2636    </para>
2637
2638   </sect2>
2639
2640  </sect1>
2641
2642  <sect1 id="textsearch-configuration">
2643   <title>Configuration Example</title>
2644
2645    <para>
2646     A text search configuration specifies all options necessary to transform a
2647     document into a <type>tsvector</type>: the parser to use to break text
2648     into tokens, and the dictionaries to use to transform each token into a
2649     lexeme.  Every call of
2650     <function>to_tsvector</function> or <function>to_tsquery</function>
2651     needs a text search configuration to perform its processing.
2652     The configuration parameter
2653     <xref linkend="guc-default-text-search-config">
2654     specifies the name of the default configuration, which is the
2655     one used by text search functions if an explicit configuration
2656     parameter is omitted.
2657     It can be set in <filename>postgresql.conf</filename>, or set for an
2658     individual session using the <command>SET</> command.
2659    </para>
2660
2661    <para>
2662     Several predefined text search configurations are available, and
2663     you can create custom configurations easily.  To facilitate management
2664     of text search objects, a set of <acronym>SQL</acronym> commands
2665     is available, and there are several <application>psql</application> commands that display information
2666     about text search objects (<xref linkend="textsearch-psql">).
2667    </para>
2668
2669    <para>
2670     As an example, we will create a configuration
2671     <literal>pg</literal>, starting from a duplicate of the built-in
2672     <literal>english</> configuration.
2673
2674 <programlisting>
2675 CREATE TEXT SEARCH CONFIGURATION public.pg ( COPY = pg_catalog.english );
2676 </programlisting>
2677    </para>
2678
2679    <para>
2680     We will use a PostgreSQL-specific synonym list
2681     and store it in <filename>$SHAREDIR/tsearch_data/pg_dict.syn</filename>.
2682     The file contents look like:
2683
2684 <programlisting>
2685 postgres    pg
2686 pgsql       pg
2687 postgresql  pg
2688 </programlisting>
2689
2690     We define the synonym dictionary like this:
2691
2692 <programlisting>
2693 CREATE TEXT SEARCH DICTIONARY pg_dict (
2694     TEMPLATE = synonym,
2695     SYNONYMS = pg_dict
2696 );
2697 </programlisting>
2698
2699     Next we register the <productname>Ispell</> dictionary
2700     <literal>english_ispell</literal>, which has its own configuration files:
2701
2702 <programlisting>
2703 CREATE TEXT SEARCH DICTIONARY english_ispell (
2704     TEMPLATE = ispell,
2705     DictFile = english,
2706     AffFile = english,
2707     StopWords = english
2708 );
2709 </programlisting>
2710
2711     Now we can set up the mappings for words in configuration
2712     <literal>pg</>:
2713
2714 <programlisting>
2715 ALTER TEXT SEARCH CONFIGURATION pg
2716     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
2717                       word, hword, hword_part
2718     WITH pg_dict, english_ispell, english_stem;
2719 </programlisting>
2720
2721     We choose not to index or search some token types that the built-in
2722     configuration does handle:
2723
2724 <programlisting>
2725 ALTER TEXT SEARCH CONFIGURATION pg
2726     DROP MAPPING FOR email, url, url_path, sfloat, float;
2727 </programlisting>
2728    </para>
2729
2730    <para>
2731     Now we can test our configuration:
2732
2733 <programlisting>
2734 SELECT * FROM ts_debug('public.pg', '
2735 PostgreSQL, the highly scalable, SQL compliant, open source object-relational
2736 database management system, is now undergoing beta testing of the next
2737 version of our software.
2738 ');
2739 </programlisting>
2740    </para>
2741
2742    <para>
2743     The next step is to set the session to use the new configuration, which was
2744     created in the <literal>public</> schema:
2745
2746 <programlisting>
2747 =&gt; \dF
2748    List of text search configurations
2749  Schema  | Name | Description
2750 ---------+------+-------------
2751  public  | pg   |
2752
2753 SET default_text_search_config = 'public.pg';
2754 SET
2755
2756 SHOW default_text_search_config;
2757  default_text_search_config
2758 ----------------------------
2759  public.pg
2760 </programlisting>
2761   </para>
2762
2763  </sect1>
2764
2765  <sect1 id="textsearch-debugging">
2766   <title>Testing and Debugging Text Search</title>
2767
2768   <para>
2769    The behavior of a custom text search configuration can easily become
2770    complicated enough to be confusing or undesirable.  The functions described
2771    in this section are useful for testing text search objects.  You can
2772    test a complete configuration, or test parsers and dictionaries separately.
2773   </para>
2774
2775   <sect2 id="textsearch-configuration-testing">
2776    <title>Configuration Testing</title>
2777
2778   <para>
2779    The function <function>ts_debug</function> allows easy testing of a
2780    text search configuration.
2781   </para>
2782
2783   <indexterm>
2784    <primary>ts_debug</primary>
2785   </indexterm>
2786
2787   <synopsis>
2788    ts_debug(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">document</replaceable> <type>text</>,
2789             OUT <replaceable class="PARAMETER">alias</> <type>text</>,
2790             OUT <replaceable class="PARAMETER">description</> <type>text</>,
2791             OUT <replaceable class="PARAMETER">token</> <type>text</>,
2792             OUT <replaceable class="PARAMETER">dictionaries</> <type>regdictionary[]</>,
2793             OUT <replaceable class="PARAMETER">dictionary</> <type>regdictionary</>,
2794             OUT <replaceable class="PARAMETER">lexemes</> <type>text[]</>)
2795             returns setof record
2796   </synopsis>
2797
2798   <para>
2799    <function>ts_debug</> displays information about every token of
2800    <replaceable class="PARAMETER">document</replaceable> as produced by the
2801    parser and processed by the configured dictionaries.  It uses the
2802    configuration specified by <replaceable
2803    class="PARAMETER">config</replaceable>,
2804    or <varname>default_text_search_config</varname> if that argument is
2805    omitted.
2806   </para>
2807
2808   <para>
2809    <function>ts_debug</> returns one row for each token identified in the text
2810    by the parser.  The columns returned are
2811
2812     <itemizedlist  spacing="compact" mark="bullet">
2813      <listitem>
2814       <para>
2815        <replaceable>alias</> <type>text</> &mdash; short name of the token type
2816       </para>
2817      </listitem>
2818      <listitem>
2819       <para>
2820        <replaceable>description</> <type>text</> &mdash; description of the
2821        token type
2822       </para>
2823      </listitem>
2824      <listitem>
2825       <para>
2826        <replaceable>token</> <type>text</> &mdash; text of the token
2827       </para>
2828      </listitem>
2829      <listitem>
2830       <para>
2831        <replaceable>dictionaries</> <type>regdictionary[]</> &mdash; the
2832        dictionaries selected by the configuration for this token type
2833       </para>
2834      </listitem>
2835      <listitem>
2836       <para>
2837        <replaceable>dictionary</> <type>regdictionary</> &mdash; the dictionary
2838        that recognized the token, or <literal>NULL</> if none did
2839       </para>
2840      </listitem>
2841      <listitem>
2842       <para>
2843        <replaceable>lexemes</> <type>text[]</> &mdash; the lexeme(s) produced
2844        by the dictionary that recognized the token, or <literal>NULL</> if
2845        none did; an empty array (<literal>{}</>) means it was recognized as a
2846        stop word
2847       </para>
2848      </listitem>
2849     </itemizedlist>
2850   </para>
2851
2852   <para>
2853    Here is a simple example:
2854
2855 <programlisting>
2856 SELECT * FROM ts_debug('english','a fat  cat sat on a mat - it ate a fat rats');
2857    alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
2858 -----------+-----------------+-------+----------------+--------------+---------
2859  asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
2860  blank     | Space symbols   |       | {}             |              |
2861  asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
2862  blank     | Space symbols   |       | {}             |              |
2863  asciiword | Word, all ASCII | cat   | {english_stem} | english_stem | {cat}
2864  blank     | Space symbols   |       | {}             |              |
2865  asciiword | Word, all ASCII | sat   | {english_stem} | english_stem | {sat}
2866  blank     | Space symbols   |       | {}             |              |
2867  asciiword | Word, all ASCII | on    | {english_stem} | english_stem | {}
2868  blank     | Space symbols   |       | {}             |              |
2869  asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
2870  blank     | Space symbols   |       | {}             |              |
2871  asciiword | Word, all ASCII | mat   | {english_stem} | english_stem | {mat}
2872  blank     | Space symbols   |       | {}             |              |
2873  blank     | Space symbols   | -     | {}             |              |
2874  asciiword | Word, all ASCII | it    | {english_stem} | english_stem | {}
2875  blank     | Space symbols   |       | {}             |              |
2876  asciiword | Word, all ASCII | ate   | {english_stem} | english_stem | {ate}
2877  blank     | Space symbols   |       | {}             |              |
2878  asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
2879  blank     | Space symbols   |       | {}             |              |
2880  asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
2881  blank     | Space symbols   |       | {}             |              |
2882  asciiword | Word, all ASCII | rats  | {english_stem} | english_stem | {rat}
2883 </programlisting>
2884   </para>
2885
2886   <para>
2887    For a more extensive demonstration, we
2888    first create a <literal>public.english</literal> configuration and
2889    Ispell dictionary for the English language:
2890   </para>
2891
2892 <programlisting>
2893 CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );
2894
2895 CREATE TEXT SEARCH DICTIONARY english_ispell (
2896     TEMPLATE = ispell,
2897     DictFile = english,
2898     AffFile = english,
2899     StopWords = english
2900 );
2901
2902 ALTER TEXT SEARCH CONFIGURATION public.english
2903    ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;
2904 </programlisting>
2905
2906 <programlisting>
2907 SELECT * FROM ts_debug('public.english','The Brightest supernovaes');
2908    alias   |   description   |    token    |         dictionaries          |   dictionary   |   lexemes
2909 -----------+-----------------+-------------+-------------------------------+----------------+-------------
2910  asciiword | Word, all ASCII | The         | {english_ispell,english_stem} | english_ispell | {}
2911  blank     | Space symbols   |             | {}                            |                |
2912  asciiword | Word, all ASCII | Brightest   | {english_ispell,english_stem} | english_ispell | {bright}
2913  blank     | Space symbols   |             | {}                            |                |
2914  asciiword | Word, all ASCII | supernovaes | {english_ispell,english_stem} | english_stem   | {supernova}
2915 </programlisting>
2916
2917   <para>
2918    In this example, the word <literal>Brightest</> was recognized by the
2919    parser as an <literal>ASCII word</literal> (alias <literal>asciiword</literal>).
2920    For this token type the dictionary list is
2921    <literal>english_ispell</> and
2922    <literal>english_stem</literal>. The word was recognized by
2923    <literal>english_ispell</literal>, which reduced it to the noun
2924    <literal>bright</literal>. The word <literal>supernovaes</literal> is
2925    unknown to the <literal>english_ispell</literal> dictionary so it
2926    was passed to the next dictionary, and, fortunately, was recognized (in
2927    fact, <literal>english_stem</literal> is a Snowball dictionary which
2928    recognizes everything; that is why it was placed at the end of the
2929    dictionary list).
2930   </para>
2931
2932   <para>
2933    The word <literal>The</literal> was recognized by the
2934    <literal>english_ispell</literal> dictionary as a stop word (<xref
2935    linkend="textsearch-stopwords">) and will not be indexed.
2936    The spaces are discarded too, since the configuration provides no
2937    dictionaries at all for them.
2938   </para>
2939
2940   <para>
2941    You can reduce the volume of output by explicitly specifying which columns
2942    you want to see:
2943
2944 <programlisting>
2945 SELECT alias, token, dictionary, lexemes
2946 FROM ts_debug('public.english','The Brightest supernovaes');
2947    alias   |    token    |   dictionary   |   lexemes
2948 -----------+-------------+----------------+-------------
2949  asciiword | The         | english_ispell | {}
2950  blank     |             |                |
2951  asciiword | Brightest   | english_ispell | {bright}
2952  blank     |             |                |
2953  asciiword | supernovaes | english_stem   | {supernova}
2954 </programlisting>
2955   </para>
2956
2957   </sect2>
2958
2959   <sect2 id="textsearch-parser-testing">
2960    <title>Parser Testing</title>
2961
2962   <para>
2963    The following functions allow direct testing of a text search parser.
2964   </para>
2965
2966   <indexterm>
2967    <primary>ts_parse</primary>
2968   </indexterm>
2969
2970   <synopsis>
2971    ts_parse(<replaceable class="PARAMETER">parser_name</replaceable> <type>text</>, <replaceable class="PARAMETER">document</replaceable> <type>text</>, OUT <replaceable class="PARAMETER">tokid</> <type>integer</>, OUT <replaceable class="PARAMETER">token</> <type>text</>) returns <type>setof record</>
2972    ts_parse(<replaceable class="PARAMETER">parser_oid</replaceable> <type>oid</>, <replaceable class="PARAMETER">document</replaceable> <type>text</>, OUT <replaceable class="PARAMETER">tokid</> <type>integer</>, OUT <replaceable class="PARAMETER">token</> <type>text</>) returns <type>setof record</>
2973   </synopsis>
2974
2975   <para>
2976    <function>ts_parse</> parses the given <replaceable>document</replaceable>
2977    and returns a series of records, one for each token produced by
2978    parsing. Each record includes a <varname>tokid</varname> showing the
2979    assigned token type and a <varname>token</varname> which is the text of the
2980    token.  For example:
2981
2982 <programlisting>
2983 SELECT * FROM ts_parse('default', '123 - a number');
2984  tokid | token
2985 -------+--------
2986     22 | 123
2987     12 |
2988     12 | -
2989      1 | a
2990     12 |
2991      1 | number
2992 </programlisting>
2993   </para>
2994
2995   <indexterm>
2996    <primary>ts_token_type</primary>
2997   </indexterm>
2998
2999   <synopsis>
3000    ts_token_type(<replaceable class="PARAMETER">parser_name</> <type>text</>, OUT <replaceable class="PARAMETER">tokid</> <type>integer</>, OUT <replaceable class="PARAMETER">alias</> <type>text</>, OUT <replaceable class="PARAMETER">description</> <type>text</>) returns <type>setof record</>
3001    ts_token_type(<replaceable class="PARAMETER">parser_oid</> <type>oid</>, OUT <replaceable class="PARAMETER">tokid</> <type>integer</>, OUT <replaceable class="PARAMETER">alias</> <type>text</>, OUT <replaceable class="PARAMETER">description</> <type>text</>) returns <type>setof record</>
3002   </synopsis>
3003
3004   <para>
3005    <function>ts_token_type</> returns a table which describes each type of
3006    token the specified parser can recognize.  For each token type, the table
3007    gives the integer <varname>tokid</varname> that the parser uses to label a
3008    token of that type, the <varname>alias</varname> that names the token type
3009    in configuration commands, and a short <varname>description</varname>.  For
3010    example:
3011
3012 <programlisting>
3013 SELECT * FROM ts_token_type('default');
3014  tokid |      alias      |               description
3015 -------+-----------------+------------------------------------------
3016      1 | asciiword       | Word, all ASCII
3017      2 | word            | Word, all letters
3018      3 | numword         | Word, letters and digits
3019      4 | email           | Email address
3020      5 | url             | URL
3021      6 | host            | Host
3022      7 | sfloat          | Scientific notation
3023      8 | version         | Version number
3024      9 | hword_numpart   | Hyphenated word part, letters and digits
3025     10 | hword_part      | Hyphenated word part, all letters
3026     11 | hword_asciipart | Hyphenated word part, all ASCII
3027     12 | blank           | Space symbols
3028     13 | tag             | XML tag
3029     14 | protocol        | Protocol head
3030     15 | numhword        | Hyphenated word, letters and digits
3031     16 | asciihword      | Hyphenated word, all ASCII
3032     17 | hword           | Hyphenated word, all letters
3033     18 | url_path        | URL path
3034     19 | file            | File or path name
3035     20 | float           | Decimal notation
3036     21 | int             | Signed integer
3037     22 | uint            | Unsigned integer
3038     23 | entity          | XML entity
3039 </programlisting>
3040    </para>
3041
3042   </sect2>
3043
3044   <sect2 id="textsearch-dictionary-testing">
3045    <title>Dictionary Testing</title>
3046
3047    <para>
3048     The <function>ts_lexize</> function facilitates dictionary testing.
3049    </para>
3050
3051    <indexterm>
3052     <primary>ts_lexize</primary>
3053    </indexterm>
3054
3055    <synopsis>
3056     ts_lexize(<replaceable class="PARAMETER">dict</replaceable> <type>regdictionary</>, <replaceable class="PARAMETER">token</replaceable> <type>text</>) returns <type>text[]</>
3057    </synopsis>
3058
3059    <para>
3060     <function>ts_lexize</> returns an array of lexemes if the input
3061     <replaceable>token</replaceable> is known to the dictionary,
3062     or an empty array if the token
3063     is known to the dictionary but it is a stop word, or
3064     <literal>NULL</literal> if it is an unknown word.
3065    </para>
3066
3067    <para>
3068     Examples:
3069
3070 <programlisting>
3071 SELECT ts_lexize('english_stem', 'stars');
3072  ts_lexize
3073 -----------
3074  {star}
3075
3076 SELECT ts_lexize('english_stem', 'a');
3077  ts_lexize
3078 -----------
3079  {}
3080 </programlisting>
3081    </para>
3082
3083    <note>
3084     <para>
3085      The <function>ts_lexize</function> function expects a single
3086      <emphasis>token</emphasis>, not text. Here is a case
3087      where this can be confusing:
3088
3089 <programlisting>
3090 SELECT ts_lexize('thesaurus_astro','supernovae stars') is null;
3091  ?column?
3092 ----------
3093  t
3094 </programlisting>
3095
3096      The thesaurus dictionary <literal>thesaurus_astro</literal> does know the
3097      phrase <literal>supernovae stars</literal>, but <function>ts_lexize</>
3098      fails since it does not parse the input text but treats it as a single
3099      token. Use <function>plainto_tsquery</> or <function>to_tsvector</> to
3100      test thesaurus dictionaries, for example:
3101
3102 <programlisting>
3103 SELECT plainto_tsquery('supernovae stars');
3104  plainto_tsquery
3105 -----------------
3106  'sn'
3107 </programlisting>
3108     </para>
3109    </note>
3110
3111   </sect2>
3112
3113  </sect1>
3114
3115  <sect1 id="textsearch-indexes">
3116   <title>GiST and GIN Index Types</title>
3117
3118   <indexterm zone="textsearch-indexes">
3119    <primary>text search</primary>
3120    <secondary>indexes</secondary>
3121   </indexterm>
3122
3123   <para>
3124    There are two kinds of indexes that can be used to speed up full text
3125    searches.
3126    Note that indexes are not mandatory for full text searching, but in
3127    cases where a column is searched on a regular basis, an index will
3128    usually be desirable.
3129
3130    <variablelist>
3131
3132     <varlistentry>
3133
3134      <indexterm zone="textsearch-indexes">
3135       <primary>index</primary>
3136       <secondary>GiST</secondary>
3137       <tertiary>text search</tertiary>
3138      </indexterm>
3139
3140      <term>
3141       <synopsis>
3142        CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING gist(<replaceable>column</replaceable>);
3143       </synopsis>
3144      </term>
3145
3146      <listitem>
3147       <para>
3148        Creates a GiST (Generalized Search Tree)-based index.
3149        The <replaceable>column</replaceable> can be of <type>tsvector</> or
3150        <type>tsquery</> type.
3151       </para>
3152      </listitem>
3153     </varlistentry>
3154
3155     <varlistentry>
3156
3157      <indexterm zone="textsearch-indexes">
3158       <primary>index</primary>
3159       <secondary>GIN</secondary>
3160       <tertiary>text search</tertiary>
3161      </indexterm>
3162
3163      <term>
3164       <synopsis>
3165        CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING gin(<replaceable>column</replaceable>);
3166       </synopsis>
3167      </term>
3168
3169      <listitem>
3170       <para>
3171        Creates a GIN (Generalized Inverted Index)-based index.
3172        The <replaceable>column</replaceable> must be of <type>tsvector</> type.
3173       </para>
3174      </listitem>
3175     </varlistentry>
3176
3177    </variablelist>
3178   </para>
3179
3180   <para>
3181    There are substantial performance differences between the two index types,
3182    so it is important to understand which to use.
3183   </para>
3184
3185   <para>
3186    A GiST index is <firstterm>lossy</firstterm>, meaning that the index
3187    may produce false matches, and it is necessary
3188    to check the actual table row to eliminate such false matches.
3189    (<productname>PostgreSQL</productname> does this automatically when needed.)
3190    GiST indexes are lossy because each document is represented in the
3191    index by a fixed-length signature. The signature is generated by hashing
3192    each word into a random bit in an n-bit string, with all these bits OR-ed
3193    together to produce an n-bit document signature.  When two words hash to
3194    the same bit position there will be a false match.  If all words in
3195    the query have matches (real or false) then the table row must be
3196    retrieved to see if the match is correct.
3197   </para>
3198
3199   <para>
3200    Lossiness causes performance degradation due to useless fetches of table
3201    records that turn out to be false matches.  Since random access to table
3202    records is slow, this limits the usefulness of GiST indexes.  The
3203    likelihood of false matches depends on several factors, in particular the
3204    number of unique words, so using dictionaries to reduce this number is
3205    recommended.
3206   </para>
3207
3208   <para>
3209    GIN indexes are not lossy for standard queries, but their performance
3210    depends logarithmically on the number of unique words.
3211    (However, GIN indexes store only the words (lexemes) of <type>tsvector</>
3212    values, and not their weight labels.  Thus a table row recheck is needed
3213    when using a query that involves weights.)
3214   </para>
3215
3216   <para>
3217    In choosing which index type to use, GiST or GIN, consider these
3218    performance differences:
3219
3220    <itemizedlist  spacing="compact" mark="bullet">
3221     <listitem>
3222      <para>
3223       GIN index lookups are about three times faster than GiST
3224      </para>
3225     </listitem>
3226     <listitem>
3227      <para>
3228       GIN indexes take about three times longer to build than GiST
3229      </para>
3230     </listitem>
3231     <listitem>
3232      <para>
3233       GIN indexes are moderately slower to update than GiST indexes, but
3234       about 10 times slower if fast-update support was disabled
3235       (see <xref linkend="gin-fast-update"> for details)
3236      </para>
3237     </listitem>
3238     <listitem>
3239      <para>
3240       GIN indexes are two-to-three times larger than GiST indexes
3241      </para>
3242     </listitem>
3243    </itemizedlist>
3244   </para>
3245
3246   <para>
3247    As a rule of thumb, <acronym>GIN</acronym> indexes are best for static data
3248    because lookups are faster.  For dynamic data, GiST indexes are
3249    faster to update.  Specifically, <acronym>GiST</acronym> indexes are very
3250    good for dynamic data and fast if the number of unique words (lexemes) is
3251    under 100,000, while <acronym>GIN</acronym> indexes will handle 100,000+
3252    lexemes better but are slower to update.
3253   </para>
3254
3255   <para>
3256    Note that <acronym>GIN</acronym> index build time can often be improved
3257    by increasing <xref linkend="guc-maintenance-work-mem">, while
3258    <acronym>GiST</acronym> index build time is not sensitive to that
3259    parameter.
3260   </para>
3261
3262   <para>
3263    Partitioning of big collections and the proper use of GiST and GIN indexes
3264    allows the implementation of very fast searches with online update.
3265    Partitioning can be done at the database level using table inheritance,
3266    or by distributing documents over
3267    servers and collecting search results using the <filename>contrib/dblink</>
3268    extension module. The latter is possible because ranking functions use
3269    only local information.
3270   </para>
3271
3272  </sect1>
3273
3274  <sect1 id="textsearch-psql">
3275   <title><application>psql</> Support</title>
3276
3277   <para>
3278    Information about text search configuration objects can be obtained
3279    in <application>psql</application> using a set of commands:
3280    <synopsis>
3281    \dF{d,p,t}<optional>+</optional> <optional>PATTERN</optional>
3282    </synopsis>
3283    An optional <literal>+</literal> produces more details.
3284   </para>
3285
3286   <para>
3287    The optional parameter <literal>PATTERN</literal> should be the name of
3288    a text search object, optionally schema-qualified.  If
3289    <literal>PATTERN</literal> is omitted then information about all
3290    visible objects will be displayed.  <literal>PATTERN</literal> can be a
3291    regular expression and can provide <emphasis>separate</emphasis> patterns
3292    for the schema and object names.  The following examples illustrate this:
3293
3294 <programlisting>
3295 =&gt; \dF *fulltext*
3296        List of text search configurations
3297  Schema |  Name        | Description
3298 --------+--------------+-------------
3299  public | fulltext_cfg |
3300 </programlisting>
3301
3302 <programlisting>
3303 =&gt; \dF *.fulltext*
3304        List of text search configurations
3305  Schema   |  Name        | Description
3306 ----------+----------------------------
3307  fulltext | fulltext_cfg |
3308  public   | fulltext_cfg |
3309 </programlisting>
3310
3311    The available commands are:
3312   </para>
3313
3314   <variablelist>
3315
3316    <varlistentry>
3317     <term><synopsis>\dF<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
3318
3319     <listitem>
3320      <para>
3321       List text search configurations (add <literal>+</> for more detail).
3322      </para>
3323
3324      <para>
3325
3326 <programlisting>
3327 =&gt; \dF russian
3328             List of text search configurations
3329    Schema   |  Name   |            Description
3330 ------------+---------+------------------------------------
3331  pg_catalog | russian | configuration for russian language
3332
3333 =&gt; \dF+ russian
3334 Text search configuration "pg_catalog.russian"
3335 Parser: "pg_catalog.default"
3336       Token      | Dictionaries
3337 -----------------+--------------
3338  asciihword      | english_stem
3339  asciiword       | english_stem
3340  email           | simple
3341  file            | simple
3342  float           | simple
3343  host            | simple
3344  hword           | russian_stem
3345  hword_asciipart | english_stem
3346  hword_numpart   | simple
3347  hword_part      | russian_stem
3348  int             | simple
3349  numhword        | simple
3350  numword         | simple
3351  sfloat          | simple
3352  uint            | simple
3353  url             | simple
3354  url_path        | simple
3355  version         | simple
3356  word            | russian_stem
3357 </programlisting>
3358      </para>
3359     </listitem>
3360    </varlistentry>
3361
3362    <varlistentry>
3363     <term><synopsis>\dFd<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
3364     <listitem>
3365      <para>
3366       List text search dictionaries (add <literal>+</> for more detail).
3367      </para>
3368
3369      <para>
3370 <programlisting>
3371 =&gt; \dFd
3372                             List of text search dictionaries
3373    Schema   |      Name       |                        Description
3374 ------------+-----------------+-----------------------------------------------------------
3375  pg_catalog | danish_stem     | snowball stemmer for danish language
3376  pg_catalog | dutch_stem      | snowball stemmer for dutch language
3377  pg_catalog | english_stem    | snowball stemmer for english language
3378  pg_catalog | finnish_stem    | snowball stemmer for finnish language
3379  pg_catalog | french_stem     | snowball stemmer for french language
3380  pg_catalog | german_stem     | snowball stemmer for german language
3381  pg_catalog | hungarian_stem  | snowball stemmer for hungarian language
3382  pg_catalog | italian_stem    | snowball stemmer for italian language
3383  pg_catalog | norwegian_stem  | snowball stemmer for norwegian language
3384  pg_catalog | portuguese_stem | snowball stemmer for portuguese language
3385  pg_catalog | romanian_stem   | snowball stemmer for romanian language
3386  pg_catalog | russian_stem    | snowball stemmer for russian language
3387  pg_catalog | simple          | simple dictionary: just lower case and check for stopword
3388  pg_catalog | spanish_stem    | snowball stemmer for spanish language
3389  pg_catalog | swedish_stem    | snowball stemmer for swedish language
3390  pg_catalog | turkish_stem    | snowball stemmer for turkish language
3391 </programlisting>
3392      </para>
3393     </listitem>
3394    </varlistentry>
3395
3396    <varlistentry>
3397
3398    <term><synopsis>\dFp<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
3399     <listitem>
3400      <para>
3401       List text search parsers (add <literal>+</> for more detail).
3402      </para>
3403
3404      <para>
3405 <programlisting>
3406 =&gt; \dFp
3407         List of text search parsers
3408    Schema   |  Name   |     Description
3409 ------------+---------+---------------------
3410  pg_catalog | default | default word parser
3411 =&gt; \dFp+
3412     Text search parser "pg_catalog.default"
3413      Method      |    Function    | Description
3414 -----------------+----------------+-------------
3415  Start parse     | prsd_start     |
3416  Get next token  | prsd_nexttoken |
3417  End parse       | prsd_end       |
3418  Get headline    | prsd_headline  |
3419  Get token types | prsd_lextype   |
3420
3421         Token types for parser "pg_catalog.default"
3422    Token name    |               Description
3423 -----------------+------------------------------------------
3424  asciihword      | Hyphenated word, all ASCII
3425  asciiword       | Word, all ASCII
3426  blank           | Space symbols
3427  email           | Email address
3428  entity          | XML entity
3429  file            | File or path name
3430  float           | Decimal notation
3431  host            | Host
3432  hword           | Hyphenated word, all letters
3433  hword_asciipart | Hyphenated word part, all ASCII
3434  hword_numpart   | Hyphenated word part, letters and digits
3435  hword_part      | Hyphenated word part, all letters
3436  int             | Signed integer
3437  numhword        | Hyphenated word, letters and digits
3438  numword         | Word, letters and digits
3439  protocol        | Protocol head
3440  sfloat          | Scientific notation
3441  tag             | XML tag
3442  uint            | Unsigned integer
3443  url             | URL
3444  url_path        | URL path
3445  version         | Version number
3446  word            | Word, all letters
3447 (23 rows)
3448 </programlisting>
3449      </para>
3450     </listitem>
3451    </varlistentry>
3452
3453    <varlistentry>
3454
3455    <term><synopsis>\dFt<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
3456     <listitem>
3457      <para>
3458       List text search templates (add <literal>+</> for more detail).
3459      </para>
3460
3461      <para>
3462 <programlisting>
3463 =&gt; \dFt
3464                            List of text search templates
3465    Schema   |   Name    |                        Description
3466 ------------+-----------+-----------------------------------------------------------
3467  pg_catalog | ispell    | ispell dictionary
3468  pg_catalog | simple    | simple dictionary: just lower case and check for stopword
3469  pg_catalog | snowball  | snowball stemmer
3470  pg_catalog | synonym   | synonym dictionary: replace word by its synonym
3471  pg_catalog | thesaurus | thesaurus dictionary: phrase by phrase substitution
3472 </programlisting>
3473      </para>
3474     </listitem>
3475    </varlistentry>
3476
3477   </variablelist>
3478
3479  </sect1>
3480
3481  <sect1 id="textsearch-limitations">
3482   <title>Limitations</title>
3483
3484   <para>
3485    The current limitations of <productname>PostgreSQL</productname>'s
3486    text search features are:
3487    <itemizedlist  spacing="compact" mark="bullet">
3488     <listitem>
3489      <para>The length of each lexeme must be less than 2K bytes</para>
3490     </listitem>
3491     <listitem>
3492      <para>The length of a <type>tsvector</type> (lexemes + positions) must be
3493      less than 1 megabyte</para>
3494     </listitem>
3495     <listitem>
3496      <!-- TODO: number of lexemes in what?  This is unclear -->
3497      <para>The number of lexemes must be less than
3498      2<superscript>64</superscript></para>
3499     </listitem>
3500     <listitem>
3501      <para>Position values in <type>tsvector</> must be greater than 0 and
3502      no more than 16,383</para>
3503     </listitem>
3504     <listitem>
3505      <para>No more than 256 positions per lexeme</para>
3506     </listitem>
3507     <listitem>
3508      <para>The number of nodes (lexemes + operators) in a <type>tsquery</type>
3509      must be less than 32,768</para>
3510     </listitem>
3511    </itemizedlist>
3512   </para>
3513
3514   <para>
3515    For comparison, the <productname>PostgreSQL</productname> 8.1 documentation
3516    contained 10,441 unique words, a total of 335,420 words, and the most
3517    frequent word <quote>postgresql</> was mentioned 6,127 times in 655
3518    documents.
3519   </para>
3520
3521    <!-- TODO we need to put a date on these numbers? -->
3522   <para>
3523    Another example &mdash; the <productname>PostgreSQL</productname> mailing
3524    list archives contained 910,989 unique words with 57,491,343 lexemes in
3525    461,020 messages.
3526   </para>
3527
3528  </sect1>
3529
3530  <sect1 id="textsearch-migration">
3531   <title>Migration from Pre-8.3 Text Search</title>
3532
3533   <para>
3534    Applications that used the <filename>contrib/tsearch2</> add-on module
3535    for text searching will need some adjustments to work with the
3536    built-in features:
3537   </para>
3538
3539   <itemizedlist>
3540    <listitem>
3541     <para>
3542      Some functions have been renamed or had small adjustments in their
3543      argument lists, and all of them are now in the <literal>pg_catalog</>
3544      schema, whereas in a previous installation they would have been in
3545      <literal>public</> or another non-system schema.  There is a new
3546      version of <filename>contrib/tsearch2</> (see <xref linkend="tsearch2">)
3547      that provides a compatibility layer to solve most problems in this
3548      area.
3549     </para>
3550    </listitem>
3551
3552    <listitem>
3553     <para>
3554      The old <filename>contrib/tsearch2</> functions and other objects
3555      <emphasis>must</> be suppressed when loading <application>pg_dump</>
3556      output from a pre-8.3 database.  While many of them won't load anyway,
3557      a few will and then cause problems.  One simple way to deal with this
3558      is to load the new <filename>contrib/tsearch2</> module before restoring
3559      the dump; then it will block the old objects from being loaded.
3560     </para>
3561    </listitem>
3562
3563    <listitem>
3564     <para>
3565      Text search configuration setup is completely different now.
3566      Instead of manually inserting rows into configuration tables,
3567      search is configured through the specialized SQL commands shown
3568      earlier in this chapter.  There is not currently any automated
3569      support for converting an existing custom configuration for 8.3;
3570      you're on your own here.
3571     </para>
3572    </listitem>
3573
3574    <listitem>
3575     <para>
3576      Most types of dictionaries rely on some outside-the-database
3577      configuration files.  These are largely compatible with pre-8.3
3578      usage, but note the following differences:
3579
3580      <itemizedlist  spacing="compact" mark="bullet">
3581       <listitem>
3582        <para>
3583         Configuration files now must be placed in a single specified
3584         directory (<filename>$SHAREDIR/tsearch_data</>), and must have
3585         a specific extension depending on the type of file, as noted
3586         previously in the descriptions of the various dictionary types.
3587         This restriction was added to forestall security problems.
3588        </para>
3589       </listitem>
3590
3591       <listitem>
3592        <para>
3593         Configuration files must be encoded in UTF-8 encoding,
3594         regardless of what database encoding is used.
3595        </para>
3596       </listitem>
3597
3598       <listitem>
3599        <para>
3600         In thesaurus configuration files, stop words must be marked with
3601         <literal>?</>.
3602        </para>
3603       </listitem>
3604      </itemizedlist>
3605     </para>
3606    </listitem>
3607
3608   </itemizedlist>
3609
3610  </sect1>
3611
3612 </chapter>