Add description of new features

author Teodor Sigaev <teodor@sigaev.ru>

Tue, 31 Oct 2006 16:23:05 +0000 (16:23 +0000)

committer Teodor Sigaev <teodor@sigaev.ru>

Tue, 31 Oct 2006 16:23:05 +0000 (16:23 +0000)
diff --git a/contrib/tsearch2/docs/tsearch-V2-intro.html b/contrib/tsearch2/docs/tsearch-V2-intro.html

index b9cb80574e3cb31928656721f48311929651c1cf..8b2514e5bec0ecb3996b8deba508ec27a8d371ab 100644 (file)
--- a/contrib/tsearch2/docs/tsearch-V2-intro.html
+++ b/contrib/tsearch2/docs/tsearch-V2-intro.html
@@ -427,9 +427,9 @@ concatenation also works with NULL fields.</strong></p>
  <p>We need to create the index on the column idxFTI. Keep in mind
  that the database will update the index when some action is taken.
  In this case we _need_ the index (The whole point of Full Text
-INDEXINGi ;-)), so don't worry about any indexing overhead. We will
-create an index based on the gist function. GiST is an index
-structure for Generalized Search Tree.</p>
+INDEXING ;-)), so don't worry about any indexing overhead. We will
+create an index based on the gist or gin function. GiST is an index
+structure for Generalized Search Tree; GIN is an inverted index (see <a href="tsearch2-ref.html#indexes">The tsearch2 Reference: Indexes</a>).</p>
  <pre>
          CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
          VACUUM FULL ANALYZE;
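Note on the hunk above: the revised paragraph mentions GIN as an alternative to GiST, but the example that follows still only shows the GiST form. A minimal sketch of the GIN variant, assuming PostgreSQL 8.2 or later with tsearch2 installed and the same tblMessages table and idxFTI column; the index name idxFTI_gin_idx is hypothetical and not part of the commit:

```sql
-- Sketch only: GIN counterpart of the GiST index shown in the patched document.
-- tblMessages and idxFTI come from the surrounding text; the index name is made up.
CREATE INDEX idxFTI_gin_idx ON tblMessages USING gin(idxFTI);

-- Refresh planner statistics after building the index.
ANALYZE tblMessages;
```

Which access method suits a given table depends on its update pattern; the Indexes section added to tsearch2-ref.html later in this commit describes the trade-off.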
diff --git a/contrib/tsearch2/docs/tsearch2-guide.html b/contrib/tsearch2/docs/tsearch2-guide.html

index 5540e5d323c6632f69369282814ba8e042769caf..d2d764580c7b554c5f0f5ebc124df1903cf5a763 100644 (file)
--- a/contrib/tsearch2/docs/tsearch2-guide.html
+++ b/contrib/tsearch2/docs/tsearch2-guide.html
@@ -1,7 +1,6 @@
  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
  <html>
  <head>
-<link type="text/css" rel="stylesheet" href="/~megera/postgres/gist/tsearch/tsearch.css">
  <title>tsearch2 guide</title>
  </head>
  <body>
@@ -9,16 +8,13 @@
  
  <p align=center>
  Brandon Craig Rhodes<br>30 June 2003
+<br>Updated for the 8.2 release by Oleg Bartunov, October 2006
  <p>
  This Guide introduces the reader to the PostgreSQL tsearch2 module,
  version&nbsp;2.
  More formal descriptions of the module's types and functions
  are provided in the <a href="tsearch2-ref.html">tsearch2 Reference</a>,
  which is a companion to this document.
-You can retrieve a beta copy of the tsearch2 module from the
-<a href="http://www.sai.msu.su/~megera/postgres/gist/">GiST for PostgreSQL</a>
-page &mdash; look under the section entitled <i>Development History</i>
-for the current version.
  <p>
  First we will examine the <tt>tsvector</tt> and <tt>tsquery</tt> types
  and how they are used to search documents;
@@ -32,15 +28,40 @@ you should be able to run the examples here exactly as they are typed.
  <hr>
  <h2>Table of Contents</h2>
  <blockquote>
+<a href="#intro">Introduction to FTS with tsearch2</a><br>
  <a href="#vectors_queries">Vectors and Queries</a><br>
  <a href="#simple_search">A Simple Search Engine</a><br>
  <a href="#weights">Ranking and Position Weights</a><br>
  <a href="#casting">Casting Vectors and Queries</a><br>
  <a href="#parsing_lexing">Parsing and Lexing</a><br>
+<a href="#ref">Additional information</a>
  </blockquote>
  
  <hr>
  
+
+<h2><a name="intro">Introduction to FTS with tsearch2</a></h2>
+The purpose of FTS is to
+find <b>documents</b> that satisfy a <b>query</b> and optionally return 
+them in some <b>order</b>. 
+The most common case: find documents containing all query terms and return them in order 
+of their similarity to the query. A document in a database can be 
+any text attribute, or a combination of text attributes from one or many tables
+(using joins).
+Text search operators have existed for years; in PostgreSQL they are
+<tt><b>~, ~*, LIKE, ILIKE</b></tt>, but they lack linguistic support,
+tend to be slow, and have no relevance ranking. The idea behind tsearch2 
+is rather simple: preprocess the document at index time to save time at the search stage.
+Preprocessing includes
+<ul>
+<li>parsing the document into words
+<li>linguistic processing: normalizing words to obtain lexemes
+<li>storing the document in a form optimized for searching
+</ul>
+Tsearch2, in a nutshell, provides an FTS operator (contains) for two new data types 
+that represent the document and the query: <tt>tsquery  @@ tsvector</tt>.
+
+<P>
  <h2><a name=vectors_queries>Vectors and Queries</a></h2>
  
  <blockquote>
@@ -79,6 +100,8 @@ Preparing your document index involves two steps:
   on the <tt>tsvector</tt> column of a table,
   which implements a form of the Berkeley
   <a href="http://gist.cs.berkeley.edu/"><i>Generalized Search Tree</i></a>.
+ Since PostgreSQL 8.2, tsearch2 also supports the <a href="http://www.sigaev.ru/gin/">GIN</a> index,
+ an inverted index of the kind commonly used in search engines. It makes tsearch2 more scalable.
  </ul>
  Once your documents are indexed,
  performing a search involves:
@@ -251,7 +274,7 @@ and give you an error to prevent this mistake:
  
  <pre>
  =# <b>SELECT to_tsquery('the')</b>
-NOTICE:  Query contains only stopword(s) or doesn't contain lexeme(s), ignored
+NOTICE:  Query contains only stopword(s) or doesn't contain lexem(s), ignored
   to_tsquery 
  ------------
   
@@ -483,8 +506,8 @@ The <tt>rank()</tt> function existed in older versions of OpenFTS,
  and has the feature that you can assign different weights
  to words from different sections of your document.
  The <tt>rank_cd()</tt> uses a recent technique for weighting results
-but does not allow different weight to be given
-to different sections of your document.
+and also allows different weights to be given
+to different sections of your document (since 8.2).
  <p>
  Both ranking functions allow you to specify,
  as an optional last argument,
@@ -511,9 +534,6 @@ for details
  see the <a href="tsearch2-ref.html#ranking">section on ranking</a>
  in the Reference.
  <p>
-The <tt>rank()</tt> function offers more flexibility
-because it pays attention to the <i>weights</i>
-with which you have labelled lexeme positions.
  Currently tsearch2 supports four different weight labels:
  <tt>'D'</tt>, the default weight;
  and <tt>'A'</tt>, <tt>'B'</tt>, and <tt>'C'</tt>.
@@ -730,7 +750,7 @@ The main problem is that the apostrophe and backslash
  are important <i>both</i> to PostgreSQL when it is interpreting a string,
  <i>and</i> to the <tt>tsvector</tt> conversion function.
  You may want to review section
-<a href="http://www.postgresql.org/docs/view.php?version=7.3&idoc=0&file=sql-syntax.html#SQL-SYNTAX-STRINGS">1.1.2.1,
+<a href="http://www.postgresql.org/docs/current/static/sql-syntax.html#SQL-SYNTAX-STRINGS">
  &ldquo;String Constants&rdquo;</a>
  in the PostgreSQL documentation before proceeding.
  <p>
@@ -1051,6 +1071,14 @@ using the same scheme to determine the dictionary for each token,
  with the difference that the query parser recognizes as special
  the boolean operators that separate query words.
  
+
+<h2><a name="ref">Additional information</a></h2>
+More information about tsearch2 is available from the
+<a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2">tsearch2</a> page.
+The
+<a href="http://www.sai.msu.su/~megera/wiki/Tsearch2">tsearch2 wiki</a> pages are also worth checking.
+
+
  </body>
  </html>
  
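Note on the tsearch2-guide.html changes above: the new introduction sums up tsearch2 as a "contains" operator between a query and a document, written tsquery @@ tsvector. A minimal sketch of that operator, assuming a database with the tsearch2 module installed and its stock 'default' configuration; the sample sentence and the column alias are illustrative only and do not appear in the commit:

```sql
-- Sketch only: assumes tsearch2 is installed and the 'default' configuration exists.
-- to_tsvector() parses the document and normalizes its words to lexemes;
-- to_tsquery() does the same for the query terms joined by boolean operators.
SELECT to_tsvector('default', 'a fat cat sat on a mat')
       @@ to_tsquery('default', 'cat & mat') AS matches;
```

In a real table the to_tsvector() result would normally be stored in a tsvector column and indexed, so that the preprocessing cost is paid at index time rather than at search time.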
diff --git a/contrib/tsearch2/docs/tsearch2-ref.html b/contrib/tsearch2/docs/tsearch2-ref.html

index 85401e83e7e93e143d960703f94784748c78c751..7edcc55a9b875eead1a783a0794f73fa86d9b517 100644 (file)
--- a/contrib/tsearch2/docs/tsearch2-ref.html
+++ b/contrib/tsearch2/docs/tsearch2-ref.html
@@ -1,53 +1,74 @@
-<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>
-<link type="text/css" rel="stylesheet" href="tsearch2-ref_files/tsearch.txt"><title>tsearch2 reference</title></head>
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
+<html><head>
+
+<title>tsearch2 reference</title></head>
  
  <body>
  <h1 align="center">The tsearch2 Reference</h1>
  
  <p align="center">
  Brandon Craig Rhodes<br>30 June 2003 (edited by Oleg Bartunov, 2 Aug 2003).
-</p><p>
+<br>Massive update for 8.2 release by Oleg Bartunov, October 2006
+</p>
+<p>
  This Reference documents the user types and functions
  of the tsearch2 module for PostgreSQL.
  An introduction to the module is provided
-by the <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html">tsearch2 Guide</a>,
+by the <a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html">tsearch2 Guide</a>,
  a companion document to this one.
-You can retrieve a beta copy of the tsearch2 module from the
-<a href="http://www.sai.msu.su/%7Emegera/postgres/gist/">GiST for PostgreSQL</a>
-page -- look under the section entitled <i>Development History</i>
-for the current version.
+</p>
+
+<h2>Table of Contents</h2>
+<blockquote>
+<a href="#vq">Vectors and Queries</a><br>
+<a href="#vqo">Vector Operations</a><br>
+<a href="#qo">Query Operations</a><br>
+<a href="#fts">Full Text Search Operator</a><br>
+<a href="#configurations">Configurations</a><br>
+<a href="#testing">Testing</a><br>
+<a href="#parsers">Parsers</a><br>
+<a href="#dictionaries">Dictionaries</a><br>
+<a href="#ranking">Ranking</a><br>
+<a href="#headlines">Headlines</a><br>
+<a href="#indexes">Indexes</a><br>
+<a href="#tz">Thesaurus dictionary</a><br>
+</blockquote>
+
+
  
-</p><h2><a name="vq">Vectors and Queries</a></h2>
  
-<a name="vq">Vectors and queries both store lexemes,
+<h2><a name="vq">Vectors and Queries</a></h2>
+
+Vectors and queries both store lexemes,
  but for different purposes.
  A <tt>tsvector</tt> stores the lexemes
  of the words that are parsed out of a document,
  and can also remember the position of each word.
  A <tt>tsquery</tt> specifies a boolean condition among lexemes.
-</a><p>
-<a name="vq">Any of the following functions with a <tt><i>configuration</i></tt> argument
+<p>
+Any of the following functions with a <tt><i>configuration</i></tt> argument
  can use either an integer <tt>id</tt> or textual <tt>ts_name</tt>
  to select a configuration;
  if the option is omitted, then the current configuration is used.
  For more information on the current configuration,
  read the next section on Configurations.
+</p>
  
-</a></p><h3><a name="vq">Vector Operations</a></h3>
+<h3><a name="vqo">Vector Operations</a></h3>
  
  <dl><dt>
-<a name="vq"> <tt>to_tsvector( <em>[</em><i>configuration</i>,<em>]</em>
- <i>document</i> TEXT) RETURNS tsvector</tt>
-</a></dt><dd>
-<a name="vq"> Parses a document into tokens,
+<tt>to_tsvector( <em>[</em><i>configuration</i>,<em>]</em>
+ <i>document</i> TEXT) RETURNS TSVECTOR</tt>
+</dt><dd>
+ Parses a document into tokens,
   reduces the tokens to lexemes,
   and returns a <tt>tsvector</tt> which lists the lexemes
   together with their positions in the document.
   For the best description of this process,
- see the section on </a><a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#ps">Parsing and Stemming</a>
+ see the section on <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#ps">Parsing and Stemming</a>
   in the accompanying tsearch2 Guide.
  </dd><dt>
- <tt>strip(<i>vector</i> tsvector) RETURNS tsvector</tt>
+ <tt>strip(<i>vector</i> TSVECTOR) RETURNS TSVECTOR</tt>
  </dt><dd>
   Return a vector which lists the same lexemes
   as the given <tt><i>vector</i></tt>,
@@ -56,10 +77,10 @@ read the next section on Configurations.
   While the returned vector is thus useless for relevance ranking,
   it will usually be much smaller.
  </dd><dt>
- <tt>setweight(<i>vector</i> tsvector, <i>letter</i>) RETURNS tsvector</tt>
+ <tt>setweight(<i>vector</i> TSVECTOR, <i>letter</i>) RETURNS TSVECTOR</tt>
  </dt><dd>
   This function returns a copy of the input vector
- in which every location has been labelled
+ in which every location has been labeled
   with either the <tt><i>letter</i></tt>
   <tt>'A'</tt>, <tt>'B'</tt>, or <tt>'C'</tt>,
   or the default label <tt>'D'</tt>
@@ -68,11 +89,11 @@ read the next section on Configurations.
   These labels are retained when vectors are concatenated,
   allowing words from different parts of a document
   to be weighted differently by ranking functions.
-</dd><dt>
- <tt><i>vector1</i> || <i>vector2</i></tt>
-</dt><dt class="br">
- <tt>concat(<i>vector1</i> tsvector, <i>vector2</i> tsvector)
- RETURNS tsvector</tt>
+</dd>
+<dt>
+ <tt><i>vector1</i> || <i>vector2</i></tt><BR>
+ <tt>concat(<i>vector1</i> TSVECTOR, <i>vector2</i> TSVECTOR)
+ RETURNS TSVECTOR</tt>
  </dt><dd>
   Returns a vector which combines the lexemes and position information
   in the two vectors given as arguments.
@@ -95,27 +116,81 @@ read the next section on Configurations.
   to the <tt>rank()</tt> function
   that assigns different weights to positions with different labels.
  </dd><dt>
- <tt>tsvector_size(<i>vector</i> tsvector) RETURNS INT4</tt>
+ <tt>length(<i>vector</i> TSVECTOR) RETURNS INT4</tt>
  </dt><dd>
   Returns the number of lexemes stored in the vector.
  </dd><dt>
- <tt><i>text</i>::tsvector RETURNS tsvector</tt>
+ <tt><i>text</i>::TSVECTOR RETURNS TSVECTOR</tt>
  </dt><dd>
   Directly casting text to a <tt>tsvector</tt>
   allows you to directly inject lexemes into a vector,
   with whatever positions and position weights you choose to specify.
   The <tt><i>text</i></tt> should be formatted
   like the vector would be printed by the output of a <tt>SELECT</tt>.
- See the <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#casting">Casting</a>
+ See the <a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#casting">Casting</a>
   section in the Guide for details.
-</dd></dl>
+</dd><dt>
+ <tt>tsearch2(<i>vector_column_name</i>[, (<i>my_filter_name</i> | <i>text_column_name1</i>) [...] ], <i>text_column_nameN</i>)</tt> 
+ </dt><dd>
+<tt>tsearch2()</tt> is a trigger used to automatically update <i>vector_column_name</i>; <i>my_filter_name</i>
+is the name of a function used to preprocess <i>text_column_name</i>. Many
+functions and text columns can be specified in a <tt>tsearch2()</tt> trigger. 
+The following rule is used: 
+a function is applied to all subsequent text columns until the next function occurs.
+For example, the function <tt>dropatsymbol</tt> replaces all occurrences of the <tt>@</tt>
+sign with a space.
+<pre>
+CREATE FUNCTION dropatsymbol(text) RETURNS text 
+AS 'select replace($1, ''@'', '' '');'
+LANGUAGE SQL;
+
+CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT 
+ON tblMessages FOR EACH ROW EXECUTE PROCEDURE 
+tsearch2(tsvector_column,dropatsymbol, strMessage);
+</pre>
+</dd>
  
-<h3>Query Operations</h3>
+<dt>
+<tt>stat(<i>sqlquery</i> text [, <i>weight</i> text]) RETURNS SETOF statinfo</tt>
+</dt><dd>
+Here <tt>statinfo</tt> is a type, defined as 
+<tt>
+CREATE TYPE statinfo as (<i>word</i> text, <i>ndoc</i> int4, <i>nentry</i> int4)
+</tt> and <i>sqlquery</i> is a query that returns a <tt>tsvector</tt> column.
+<P>
+This returns statistics about the <tt>tsvector</tt> column <i>vector</i>: for each <i>word</i>, the number of documents <i>ndoc</i> 
+containing it and its total number of occurrences <i>nentry</i> in the collection. 
+It is useful for checking how good your configuration is and
+for finding stop-word candidates. For example, to find the 10 most frequent words:
+<pre>
+=# select * from stat('select vector from apod') order by ndoc desc, nentry desc,word limit 10;
+</pre>
+Optionally, one can specify <i>weight</i> to obtain statistics about words with a specific weight.
+<pre>
+=# select * from stat('select vector from apod','a') order by ndoc desc, nentry desc,word limit 10;
+</pre>
  
-<dl><dt>
- <tt>to_tsquery( <em>[</em><i>configuration</i>,<em>]</em>
- <i>querytext</i> text) RETURNS tsvector</tt>
+</dd>
+<dt>
+<tt>TSVECTOR &lt; TSVECTOR</tt><BR>
+<tt>TSVECTOR &lt;= TSVECTOR</tt><BR>
+<tt>TSVECTOR = TSVECTOR</tt><BR>
+<tt>TSVECTOR >= TSVECTOR</tt><BR>
+<tt>TSVECTOR > TSVECTOR</tt>
  </dt><dd>
+All btree operations are defined for the <tt>tsvector</tt> type. <tt>tsvectors</tt> are compared 
+with each other using lexicographical order.
+</dd>
+</dl>
+
+<h3><a name="qo">Query Operations</a></h3>
+
+<dl>
+<dt>
+ <tt>to_tsquery( <em>[</em><i>configuration</i>,<em>]</em>
+ <i>querytext</i> text) RETURNS TSQUERY</tt>
+</dt>
+<dd>
   Parses a query,
   which should be single words separated by the boolean operators
   "<tt>&amp;</tt>"&nbsp;and,
@@ -123,14 +198,27 @@ read the next section on Configurations.
   and&nbsp;"<tt>!</tt>"&nbsp;not,
   which can be grouped using parenthesis.
   Each word is reduced to a lexeme using the current
- or specified configuration.
-
+ or specified configuration. 
+ A weight class can be assigned to each lexeme entry
+ to restrict the search region
+ (see <tt>setweight</tt> for explanation), for example 
+ "<tt>fat:a &amp; rats</tt>".
+</dd>
+<dt>
+ <tt>plainto_tsquery( <em>[</em><i>configuration</i>,<em>]</em>
+ <i>querytext</i> text) RETURNS TSQUERY</tt>
+</dt>
+<dd>
+Transforms unformatted text into a tsquery. It is the same as <tt>to_tsquery</tt>, 
+but assumes the "<tt>&amp;</tt>" boolean operator between words and does not 
+recognize weight classes.
  </dd><dt>
- <tt>querytree(<i>query</i> tsquery) RETURNS text</tt>
+
+ <tt>querytree(<i>query</i> TSQUERY) RETURNS text</tt>
  </dt><dd>
- This might return a textual representation of the given query.
+This returns the query that is actually used when searching a GiST index.
  </dd><dt>
- <tt><i>text</i>::tsquery RETURNS tsquery</tt>
+ <tt><i>text</i>::TSQUERY RETURNS TSQUERY</tt>
  </dt><dd>
   Directly casting text to a <tt>tsquery</tt>
   allows you to directly inject lexemes into a query,
@@ -139,7 +227,117 @@ read the next section on Configurations.
   like the query would be printed by the output of a <tt>SELECT</tt>.
   See the <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#casting">Casting</a>
   section in the Guide for details.
-</dd></dl>
+</dd>
+<dt>
+ <tt>numnode(<i>query</i> TSQUERY) RETURNS INTEGER</tt>
+</dt><dd>
+This returns the number of nodes in the query tree.
+</dd><dt>
+ <tt>TSQUERY && TSQUERY RETURNS TSQUERY</tt>
+</dt><dd>
+AND-ed TSQUERY
+</dd><dt>
+ <tt>TSQUERY || TSQUERY RETURNS TSQUERY</tt>
+</dt> <dd>
+ OR-ed TSQUERY
+</dd><dt>
+ <tt>!! TSQUERY RETURNS TSQUERY</tt>
+</dt> <dd>
+ negation of TSQUERY
+</dd>
+<dt>
+<tt>TSQUERY &lt; TSQUERY</tt><BR>
+<tt>TSQUERY &lt;= TSQUERY</tt><BR>
+<tt>TSQUERY = TSQUERY</tt><BR>
+<tt>TSQUERY >= TSQUERY</tt><BR>
+<tt>TSQUERY > TSQUERY</tt>
+</dt><dd>
+All btree operations are defined for the <tt>tsquery</tt> type. <tt>tsqueries</tt> are compared 
+with each other using lexicographical order.
+</dd>
+</dl>
+
+<h3>Query rewriting</h3>
+Query rewriting is a set of functions and operators for the tsquery type. 
+It lets you control a search at query time without reindexing (unlike a thesaurus), for example,
+to expand a search using synonyms (new york, big apple, nyc, gotham).
+<P>
+The <tt><b>rewrite()</b></tt> function changes the original <i>query</i> by replacing <i>target</i> with <i>sample</i>.
+There are three ways to use the <tt>rewrite()</tt> function. Note that the arguments of <tt>rewrite()</tt> 
+can be column names of type <tt>tsquery</tt>.
+<pre>
+create table rw (q TSQUERY, t TSQUERY, s TSQUERY);
+insert into rw values('a & b','a', 'c');
+</pre>
+<dl>
+<dt> <tt>rewrite (<i>query</i> TSQUERY, <i>target</i> TSQUERY, <i>sample</i> TSQUERY) RETURNS TSQUERY</tt>
+</dt>
+<dd>
+<pre>
+=# select rewrite('a & b'::TSQUERY, 'a'::TSQUERY, 'c'::TSQUERY);
+  rewrite
+  -----------
+   'c' & 'b'
+</pre>
+</dd>
+<dt> <tt>rewrite (ARRAY[<i>query</i> TSQUERY, <i>target</i> TSQUERY, <i>sample</i> TSQUERY])  RETURNS TSQUERY</tt>
+</dt>
+<dd>
+<pre>
+=# select rewrite(ARRAY['a & b'::TSQUERY, t,s]) from rw;
+  rewrite
+  -----------
+   'c' & 'b'
+</pre>
+</dd>
+<dt> <tt>rewrite (<i>query</i> TSQUERY,'select <i>target</i> ,<i>sample</i> from test'::text)  RETURNS TSQUERY</tt>
+</dt>
+<dd>
+<pre>
+=# select rewrite('a & b'::TSQUERY, 'select t,s from rw'::text);
+  rewrite
+  -----------
+   'c' & 'b'
+</pre>
+</dd>
+</dl>
+Two operators are defined for the <tt>tsquery</tt> type:
+<dl>
+<dt><tt>TSQUERY @ TSQUERY</tt></dt>
+<dd>
+  Returns <tt>TRUE</tt> if the right argument might be contained in the left argument.
+ </dd>
+ <dt><tt>TSQUERY ~ TSQUERY</tt></dt>
+ <dd> 
+  Returns <tt>TRUE</tt> if the left argument might be contained in the right argument.
+ </dd>
+</dl>                    
+To speed up these operators, one can use a GiST index with the <tt>gist_tp_tsquery_ops</tt> opclass.
+<pre>
+create index qq on test_tsquery using gist (keyword gist_tp_tsquery_ops);
+</pre>
+
+<h2><a name="fts">Full Text Search operator</a></h2>
+
+<dl><dt>
+<tt>TSQUERY @@ TSVECTOR</tt><br>
+<tt>TSVECTOR @@ TSQUERY</tt>
+</dt>
+<dd>
+Returns <tt>TRUE</tt> if <tt>TSQUERY</tt> is contained in <tt>TSVECTOR</tt> and 
+<tt>FALSE</tt> otherwise.
+<pre>
+=# select 'cat & rat':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
+ ?column?
+ ----------
+  t
+=# select 'fat & cow':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
+ ?column?
+ ----------
+  f
+</pre>
+</dd>
+</dl>
  
  <h2><a name="configurations">Configurations</a></h2>
  
@@ -147,7 +345,7 @@ A configuration specifies all of the equipment necessary
  to transform a document into a <tt>tsvector</tt>:
  the parser that breaks its text into tokens,
  and the dictionaries which then transform each token into a lexeme.
-Every call to <tt>to_tsvector()</tt> (described above)
+Every call to <tt>to_tsvector()</tt> or <tt>to_tsquery()</tt> (described above)
  uses a configuration to perform its processing.
  Three configurations come with tsearch2:
  
@@ -157,7 +355,10 @@ Three configurations come with tsearch2:
   and the <i>simple</i> dictionary for all others.
  </li><li><b>default_russian</b> -- Indexes words and numbers,
   using the <i>en_stem</i> English Snowball stemmer for Latin-alphabet words
- and the <i>ru_stem</i> Russian Snowball dictionary for all others.
+ and the <i>ru_stem</i> Russian Snowball dictionary for all others. It is the default
+ for the <tt>ru_RU.KOI8-R</tt> locale.
+</li><li><b>utf8_russian</b> -- the same as <b>default_russian</b>, but 
+for the <tt>ru_RU.UTF-8</tt> locale.
  </li><li><b>simple</b> -- Processes both words and numbers
   with the <i>simple</i> dictionary,
   which neither discards any stop words nor alters them.
@@ -239,7 +440,8 @@ Here:
  </li><li>description - human readable name of tok_type 
  </li><li>token - parser's token 
  </li><li>dict_name - dictionary used for the token 
-</li><li>tsvector - final result</li></ul>
+</li><li>tsvector - final result</li>
+</ul>
  
  
  <h2><a name="parsers">Parsers</a></h2>
@@ -300,20 +502,40 @@ the current parser is used when this argument is omitted.
  
  <h2><a name="dictionaries">Dictionaries</a></h2>
  
-Dictionaries take textual tokens as input,
-usually those produced by a parser,
-and return lexemes which are usually some reduced form of the token.
+A dictionary is a program that accepts lexeme(s) as input, usually those produced by a parser, 
+and returns:
+<ul>
+<li>an array of lexeme(s) if the input lexeme is known to the dictionary
+<li>an empty array if the dictionary knows the lexeme, but it is a stop word
+<li>NULL if the dictionary does not recognize the input lexeme
+</ul>
+Dictionaries are usually used for normalization of words (ispell and stemmer dictionaries),
+but see, for example, the <tt>intdict</tt> dictionary (available from the
+<a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">Tsearch2</a> home page), 
+which controls indexing of integers.
+
+<P>
  Among the dictionaries which come installed with tsearch2 are:
  
  <ul>
  <li><b>simple</b> simply folds uppercase letters to lowercase
   before returning the word.
-</li><li><b>en_stem</b> runs an English Snowball stemmer on each word
+</li>
+<li><b>ispell_template</b> - template for ispell dictionaries.
+</li>
+<li><b>en_stem</b> runs an English Snowball stemmer on each word
   that attempts to reduce the various forms of a verb or noun
   to a single recognizable form.
-</li><li><b>ru_stem</b> runs a Russian Snowball stemmer on each word.
-</li></ul>
-
+</li><li><b>ru_stem_koi8</b>, <b>ru_stem_utf8</b> run a Russian Snowball stemmer on each word.
+</li>
+<li><b>synonym</b> - simple lexeme-to-lexeme replacement
+</li>
+<li><b>thesaurus_template</b> - template for the <a href="#tz">thesaurus dictionary</a>, which performs 
+phrase-to-phrase replacement
+</li>
+</ul>
+
+<P>
  Each dictionary is defined by an entry in the <tt>pg_ts_dict</tt> table:
  
  <pre>CREATE TABLE pg_ts_dict (
@@ -332,6 +554,12 @@ it specifies a file from which stop words should be read.
  The <tt>dict_comment</tt> is a human-readable description of the dictionary.
  The other fields are internal function identifiers
  useful only to developers trying to implement their own dictionaries.
+
+<blockquote>
+<b>WARNING:</b> Data files used by dictionaries should be in the <tt>server_encoding</tt> to 
+avoid possible problems!
+</blockquote>
+
  <p>
  The argument named <tt><i>dictionary</i></tt>
  in each of the following functions
@@ -355,6 +583,27 @@ if omitted then the current dictionary is used.
   from which an inflected form could arise.
  </dd></dl>
  
+<h3>Using dictionary templates</h3>
+Templates are used to define new dictionaries, for example:
+<pre>
+INSERT INTO pg_ts_dict
+               (SELECT 'en_ispell', dict_init,
+                       'DictFile="/usr/local/share/dicts/ispell/english.dict",'
+                       'AffFile="/usr/local/share/dicts/ispell/english.aff",'
+                       'StopFile="/usr/local/share/dicts/english.stop"',
+                       dict_lexize
+               FROM pg_ts_dict
+               WHERE dict_name = 'ispell_template');
+</pre>
+
+<h3>Working with stop words</h3>
+Ispell and snowball stemmers treat stop words differently:
+<ul>
+<li>ispell normalizes the word and then looks up the normalized form in the stop-word file
+<li>the snowball stemmer first looks up the word in the stop-word file and then does its job. 
+The reason is to minimize possible 'noise'.
+</ul>
+
  <h2><a name="ranking">Ranking</a></h2>
  
  Ranking attempts to measure how relevant documents are to particular queries
@@ -364,26 +613,18 @@ Note that this information is only available in unstripped vectors --
  ranking functions will only return a useful result
  for a <tt>tsvector</tt> which still has position information!
  <p>
-Both of these ranking functions
-take an integer <i>normalization</i> option
-that specifies whether a document's length should impact its rank.
-This is often desirable,
-since a hundred-word document with five instances of a search word
-is probably more relevant than a thousand-word document with five instances.
-The option can have the values:
-
-</p><ul>
-<li><tt>0</tt> (the default) ignores document length.
-</li><li><tt>1</tt> divides the rank by the logarithm of the length.
-</li><li><tt>2</tt> divides the rank by the length itself.
-</li></ul>
+Note that the ranking functions supplied are just examples and 
+do not belong to the tsearch2 core; you can
+write your own ranking function and/or combine additional
+factors to fit your specific needs.
+</p>
  
  The two ranking functions currently available are:
  
  <dl><dt>
   <tt>CREATE FUNCTION rank(<br>
    <em>[</em> <i>weights</i> float4[], <em>]</em>
-  <i>vector</i> tsvector, <i>query</i> tsquery,
+  <i>vector</i> TSVECTOR, <i>query</i> TSQUERY,
    <em>[</em> <i>normalization</i> int4 <em>]</em><br>
    ) RETURNS float4</tt>
  </dt><dd>
@@ -399,8 +640,8 @@ The two ranking functions currently available are:
   and make them more or less important than words in the document body.
  </dd><dt>
   <tt>CREATE FUNCTION rank_cd(<br>
-  <em>[</em> <i>K</i> int4, <em>]</em>
-  <i>vector</i> tsvector, <i>query</i> tsquery,
+  <em>[</em> <i>weights</i> float4[], <em>]</em>
+  <i>vector</i> TSVECTOR, <i>query</i> TSQUERY,
    <em>[</em> <i>normalization</i> int4 <em>]</em><br>
    ) RETURNS float4</tt>
  </dt><dd>
@@ -409,20 +650,51 @@ The two ranking functions currently available are:
   as described in Clarke, Cormack, and Tudhope's
   "<a href="http://citeseer.nj.nec.com/clarke00relevance.html">Relevance Ranking for One to Three Term Queries</a>"
   in the 1999 <i>Information Processing and Management</i>.
- The value <i>K</i> is one of the values from their formula,
- and defaults to&nbsp;<i>K</i>=4.
- The examples in their paper <i>K</i>=16;
- we can roughly describe the term
- as stating how far apart two search terms can fall
- before the formula begins penalizing them for lack of proximity.
-</dd></dl>
+</dd>
+<dt>
+ <tt>CREATE FUNCTION get_covers(vector TSVECTOR, query TSQUERY) RETURNS text</tt>
+ </dt>
+ <dd>
+ Returns <tt>extents</tt>, which are the shortest non-nested sequences of words that satisfy a query. 
+ Extents (covers) are used in the <tt>rank_cd</tt> algorithm for fast calculation of proximity ranking.
+ In the example below there are two extents: <tt><b>{1</b>...<b>}1</b> and <b>{2</b> ...<b>}2</b></tt>.
+ <pre>
+=# select get_covers('1:1,2,10 2:4'::tsvector,'1&amp; 2');
+get_covers
+----------------------
+1 {1 1 {2 2 }1 1 }2
+</pre>
+ </dd>
+
+</dl>
+
+<p>
+Both ranking functions (<tt>rank()</tt> and <tt>rank_cd()</tt>)
+take an integer <i>normalization</i> option
+that specifies whether a document's length should impact its rank.
+This is often desirable,
+since a hundred-word document with five instances of a search word
+is probably more relevant than a thousand-word document with five instances.
+The option can take the following values, which can be combined using "|" (for example, 2|4) to
+take several factors into account:
+
+</p>
+<ul>
+<li><tt>0</tt> (the default) ignores document length.</li>
+<li><tt>1</tt> divides the rank by 1 + the logarithm of the length.</li>
+<li><tt>2</tt> divides the rank by the length itself.</li>
+<li><tt>4</tt> divides the rank by the mean harmonic distance between extents.</li>
+<li><tt>8</tt> divides the rank by the number of unique words in the document.</li>
+<li><tt>16</tt> divides the rank by 1 + the logarithm of the number of unique words in the document.
+</li>
+</ul>
  
  <h2><a name="headlines">Headlines</a></h2>
  
  <dl><dt>
   <tt>CREATE FUNCTION headline(<br>
    <em>[</em> <i>id</i> int4, <em>|</em> <i>ts_name</i> text, <em>]</em>
-  <i>document</i> text, <i>query</i> tsquery,
+  <i>document</i> text, <i>query</i> TSQUERY,
    <em>[</em> <i>options</i> text <em>]</em><br>
    ) RETURNS text</tt>
  </dt><dd>
@@ -448,10 +720,123 @@ The two ranking functions currently available are:
     with a word which has this many characters or less.
     The default value of <tt>3</tt> should eliminate most English
     conjunctions and articles.
+  </li><li><tt>HighlightAll</tt> -- 
+   boolean flag; if TRUE, the whole document will be highlighted.
   </li></ul>
   Any unspecified options receive these defaults:
- <pre>StartSel=&lt;b&gt;, StopSel=&lt;/b&gt;, MaxWords=35, MinWords=15, ShortWord=3
+ <pre>StartSel=&lt;b&gt;, StopSel=&lt;/b&gt;, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE
   </pre>
  </dd></dl>
  
+
+<h2><a name="indexes">Indexes</a></h2>
+Tsearch2 supports indexed access to tsvector in order to further speed up FTS. Note that indexes are not mandatory for FTS!
+<ul>
+<li> RD-Tree (Russian Doll Tree, matryoshka), based on GiST (Generalized Search Tree) 
+<pre>    
+    =# create index fts_idx on apod using gist(fts);
+</pre>    
+<li>GIN - Generalized Inverted Index 
+<pre>       
+        =# create index fts_idx on apod using gin(fts);
+</pre>
+</ul>  
+A <b>GiST</b> index is very good for online updates, but is not as scalable as a <b>GIN</b> index,
+which, in turn, is not good for updates. Both indexes support concurrency and recovery.
+
+<h2><a name="tz">Thesaurus dictionary</a></h2>
+
+<P>
+A thesaurus is a collection of words that includes information about the relationships between words and phrases, 
+i.e., broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms, related terms, etc.</p>
+<p>Basically, a thesaurus dictionary replaces all non-preferred terms with one preferred term and, optionally, 
+preserves the original terms for indexing. The thesaurus is used during indexing, so any change to the thesaurus requires reindexing.
+Tsearch2's <tt>thesaurus</tt> dictionary (TZ) is an extension of the <tt>synonym</tt> dictionary 
+with <b>phrase</b> support. A thesaurus is a plain file in the following format: 
+<pre>
+# this is a comment 
+sample word(s) : indexed word(s)
+...............................
+</pre>
+<ul>
+<li>The <strong>colon</strong> (:) symbol is used as a delimiter.</li>
+<li>Use an asterisk (<b>*</b>) at the beginning of an <tt>indexed word</tt> to skip the subdictionary.
+The <tt>sample words</tt> must still be known.</li>
+<li>The thesaurus dictionary looks for the longest match.</li></ul>
+<P>
+TZ uses a <strong>subdictionary</strong> (which should be defined in the tsearch2 configuration) 
+to normalize the thesaurus text. Only <strong>one dictionary</strong> can be defined for this purpose. 
+Note that the subdictionary produces an error if it cannot recognize a word. 
+In that case, you should either remove the definition line containing this word or teach the subdictionary to recognize it. 
+</p>
+<p>Stop-words recognized by the subdictionary are replaced by a 'stop-word placeholder', i.e., 
+only their position is important.
+To break possible ties, the thesaurus applies the last definition. For example, consider 
+thesaurus rules (with the simple subdictionary) with the pattern 'swsw' 
+('s' designates a stop-word and 'w' a known word): </p>
+<pre>
+a one the two : swsw
+the one a two : swsw2
+</pre>
+<p>The words 'a' and 'the' are stop-words defined in the configuration of the subdictionary. 
+The thesaurus considers the texts 'the one the two' and 'that one then two' to be equal and will use the definition 
+'swsw2'.</p>
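+<p>The placeholder and tie-breaking rules above can be illustrated with a small Python sketch. It is not the tsearch2 implementation: the function names, the stop-word set (which assumes that 'that' and 'then' are also stop-words of the subdictionary), lowercasing as the only normalization, and the pass-through of unmatched words are all assumptions made for illustration. The asterisk syntax is not handled.</p>

```python
# Illustrative sketch only; names and behaviour are assumptions, not tsearch2 internals.
STOP = None  # stands in for the 'stop-word placeholder': only its position matters


def normalize(words, stop_words):
    """Lowercase each word and replace stop-words with the placeholder."""
    return tuple(STOP if w.lower() in stop_words else w.lower() for w in words)


def parse_thesaurus(text, stop_words):
    """Parse 'sample word(s) : indexed word(s)' lines into a rule table."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip blank lines, comments and lines without a delimiter
        sample, _, indexed = line.partition(":")
        # A later rule with the same normalized pattern overwrites the earlier
        # one, so the last definition wins.
        rules[normalize(sample.split(), stop_words)] = indexed.split()
    return rules


def apply_thesaurus(words, rules, stop_words):
    """Replace the longest matching phrase at each position."""
    norm = normalize(words, stop_words)
    longest = max((len(key) for key in rules), default=0)
    out, i = [], 0
    while i < len(norm):
        # Try the widest window first so the longest match is preferred.
        for n in range(min(longest, len(norm) - i), 0, -1):
            replacement = rules.get(norm[i:i + n])
            if replacement is not None:
                out.extend(replacement)
                i += n
                break
        else:
            # No rule matched here: drop stop-words, pass other words through.
            if norm[i] is not STOP:
                out.append(norm[i])
            i += 1
    return out


# Hypothetical stop-word list for the example.
STOP_WORDS = {"a", "the", "that", "then"}
RULES = parse_thesaurus("a one the two : swsw\nthe one a two : swsw2", STOP_WORDS)
```

+<p>In this sketch both rules normalize to the same pattern, so the second one replaces the first, and <tt>apply_thesaurus</tt> maps 'the one the two' and 'that one then two' to 'swsw2'.</p>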
+<p>Like a normal dictionary, it should be assigned to specific lexeme types. 
+Since TZ can recognize phrases, it must remember its state and interact with the parser. 
+TZ uses these assignments to check whether it should handle the next word or stop accumulation. 
+The compiler of a thesaurus should take care to configure it properly to avoid confusion. 
+For example, if TZ is assigned to handle only the <tt>lword</tt> lexeme type, then a TZ definition like 
+' one 1:11' will not work, since the lexeme type <tt>digit</tt> is not assigned to the TZ.</p>
+
+<h3>Configuration</h3>
+
+<dl><dt>tsearch2</dt><dd></dd></dl><p>tsearch2 comes with a thesaurus template, which can be used to define a new dictionary: </p>
+<pre class="real">INSERT INTO pg_ts_dict
+               (SELECT 'tz_simple', dict_init,
+                        'DictFile="/path/to/tz_simple.txt",'
+                        'Dictionary="en_stem"',
+                       dict_lexize
+                FROM pg_ts_dict
+                WHERE dict_name = 'thesaurus_template');
+
+</pre>
+<p>Here: </p>
+<ul>
+<li><tt>tz_simple</tt> is the dictionary name.</li>
+<li><tt>DictFile="/path/to/tz_simple.txt"</tt> is the location of the thesaurus file.</li>
+<li><tt>Dictionary="en_stem"</tt> defines the dictionary (the Snowball English stemmer) to use for thesaurus normalization. Note that the <em>en_stem</em> dictionary has its own configuration (stop-words, for example).</li>
+</ul>
+<p>Now it is possible to use <tt>tz_simple</tt> in pg_ts_cfgmap, for example: </p>
+<pre>
+update pg_ts_cfgmap set dict_name='{tz_simple,en_stem}' where ts_name = 'default_russian' and 
+tok_alias in ('lhword', 'lword', 'lpart_hword');
+</pre>
+<h3>Examples</h3>
+<p>tz_simple: </p>
+<pre>
+one : 1
+two : 2
+one two : 12
+the one : 1
+one 1 : 11
+</pre>
+<p>To see how the thesaurus works, one can use the <tt>to_tsvector</tt>, <tt>to_tsquery</tt> or <tt>plainto_tsquery</tt> functions: </p><pre class="real">=# select plainto_tsquery('default_russian',' one day is oneday');
+    plainto_tsquery
+------------------------
+ '1' &amp; 'day' &amp; 'oneday'
+
+=# select plainto_tsquery('default_russian','one two day is oneday');
+     plainto_tsquery
+-------------------------
+ '12' &amp; 'day' &amp; 'oneday'
+
+=# select plainto_tsquery('default_russian','the one');
+NOTICE:  Thesaurus: word 'the' is recognized as stop-word, assign any stop-word (rule 3)
+ plainto_tsquery
+-----------------
+ '1'
+</pre>
+
+Additional information about the thesaurus dictionary is available on the
+<a href="http://www.sai.msu.su/~megera/wiki/Thesaurus_dictionary">Wiki</a> page.
  </body></html>