From fcc6756341a03350854545c71b5384e804de1209 Mon Sep 17 00:00:00 2001 From: Tom Lane Date: Tue, 4 Sep 2007 03:46:36 +0000 Subject: [PATCH] Sync examples of psql \dF output with current CVS HEAD behavior. Random other wordsmithing. --- doc/src/sgml/textsearch.sgml | 327 ++++++++++++++++++++--------------- 1 file changed, 185 insertions(+), 142 deletions(-) diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml index b165011bc8..72da3aae25 100644 --- a/doc/src/sgml/textsearch.sgml +++ b/doc/src/sgml/textsearch.sgml @@ -1,7 +1,15 @@ + + + Full Text Search - Full Text Search + + full text search + + + text search + Introduction @@ -67,43 +75,52 @@ Parsing documents into lexemes. It is - useful to identify various lexemes, e.g. digits, words, complex words, - email addresses, so they can be processed differently. In principle - lexemes depend on the specific application but for an ordinary search it - is useful to have a predefined list of lexemes. + useful to identify various classes of lexemes, e.g. digits, words, + complex words, email addresses, so that they can be processed + differently. In principle lexeme classes depend on the specific + application but for an ordinary search it is useful to have a predefined + set of classes. + PostgreSQL uses a parser to + perform this step. A standard parser is provided, and custom parsers + can be created for specific needs. - Dictionaries allow the conversion of lexemes into - a normalized form so it is not necessary to enter - search words in a specific form. + Converting lexemes into normalized + form. This allows searches to find variant forms of the + same word, without tediously entering all the possible variants. + Also, this step typically eliminates stop words, which + are words that are so common that they are useless for searching. + PostgreSQL uses dictionaries to + perform this step. Various standard dictionaries are provided, and + custom ones can be created for specific needs. 
- Store preprocessed documents optimized for - searching. For example, represent each document as a sorted array - of lexemes. Along with lexemes it is desirable to store positional - information to use for proximity ranking, so that - a document which contains a more "dense" region of query words is + Storing preprocessed documents optimized for + searching. For example, each document can be represented + as a sorted array of normalized lexemes. Along with the lexemes it is + desirable to store positional information to use for proximity + ranking, so that a document which contains a more + dense region of query words is assigned a higher rank than one with scattered query words. - Dictionaries allow fine-grained control over how lexemes are created. With - dictionaries you can: + Dictionaries allow fine-grained control over how lexemes are normalized. + With dictionaries you can: - Define "stop words" that should not be indexed. + Define stop words that should not be indexed. @@ -135,13 +152,12 @@ - A data type (), tsvector - is provided, for storing preprocessed documents, - along with a type tsquery for representing textual - queries. Also, a full text search operator @@ is defined - for these data types (). Full text - searches can be accelerated using indexes (). + A data type tsvector is provided for storing preprocessed + documents, along with a type tsquery for representing processed + queries (). Also, a full text search + operator @@ is defined for these data types (). Full text searches can be accelerated + using indexes (). @@ -154,20 +170,20 @@ - A document can be a simple text file stored in the file system. The full - text indexing engine can parse text files and store associations of lexemes - (words) with their parent document. Later, these associations are used to - search for documents which contain query words. 
In this case, the database - can be used to store the full text index and for executing searches, and - some unique identifier can be used to retrieve the document from the file - system. + A document is the unit of searching in a full text search + system; for example, a magazine article or email message. The text search + engine must be able to parse documents and store associations of lexemes + (key words) with their parent document. Later, these associations are + used to search for documents which contain query words. - A document can also be any textual database attribute or a combination - (concatenation), which in turn can be stored in various tables or obtained - dynamically. In other words, a document can be constructed from different - parts for indexing and it might not exist as a whole. For example: + For searches within PostgreSQL, + a document is normally a textual field within a row of a database table, + or possibly a combination (concatenation) of such fields, perhaps stored + in several tables or obtained dynamically. In other words, a document can + be constructed from different parts for indexing and it might not be + stored anywhere as a whole. For example: SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document @@ -184,10 +200,20 @@ WHERE mid = did AND mid = 12; Actually, in the previous example queries, COALESCE - should be used to prevent a NULL attribute from causing - a NULL result. + should be used to prevent a single NULL attribute from + causing a NULL result for the whole document. + + + Another possibility is to store the documents as simple text files in the + file system. In this case, the database can be used to store the full text + index and to execute searches, and some unique identifier can be used to + retrieve the document from the file system. 
However, retrieving files + from outside the database requires superuser permissions or special + function support, so this is usually less convenient than keeping all + the data inside PostgreSQL. + @@ -261,8 +287,9 @@ SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::t was set accordingly in postgresql.conf. If you are using the same text search configuration for the entire cluster you can use the value in - postgresql.conf. If using different configurations but - the same text search configuration for an entire database, + postgresql.conf. If using different configurations + throughout the cluster but + the same text search configuration for any one database, use ALTER DATABASE ... SET. If not, you must set default_text_search_config in each session. Many functions also take an optional configuration name. @@ -555,7 +582,7 @@ UPDATE tt SET ti= - ts_parse(parser, document TEXT) returns SETOF tokenout + ts_parse(parser, document text, OUT tokid integer, OUT token text) returns SETOF RECORD @@ -588,7 +615,7 @@ SELECT * FROM ts_parse('default','123 - a number'); - ts_token_type(parser ) returns SETOF tokentype + ts_token_type(parser, OUT tokid integer, OUT alias text, OUT description text) returns SETOF RECORD @@ -1107,20 +1134,20 @@ SELECT ts_lexize('english_stem', 'stars'); (1 row) - Also, the ts_debug function () - can be used for this. + Also, the ts_debug function () is helpful for testing. Stop Words - Stop words are words which are very common, appear in almost - every document, and have no discrimination value. Therefore, they can be ignored - in the context of full text searching. For example, every English text contains - words like a although it is useless to store them in an index. - However, stop words do affect the positions in tsvector, - which in turn, do affect ranking: + Stop words are words which are very common, appear in almost every + document, and have no discrimination value. 
Therefore, they can be ignored + in the context of full text searching. For example, every English text + contains words like a and the, so it is + useless to store them in an index. However, stop words do affect the + positions in tsvector, which in turn affect ranking: SELECT to_tsvector('english','in the list of stop words'); @@ -1542,11 +1569,15 @@ SELECT ts_lexize('norwegian_ispell','sjokoladefabrikk'); The Snowball dictionary template is based on the project of Martin Porter, inventor of the popular Porter's stemming algorithm - for the English language and now supported in many languages (see the Snowball site for more - information). The Snowball project supplies a large number of stemmers for - many languages. A Snowball dictionary requires a language parameter to - identify which stemmer to use, and optionally can specify a stopword file name. + for the English language. Snowball now provides stemming algorithms for + many languages (see the Snowball + site for more information). Each algorithm understands how to + reduce common variant forms of words to a base, or stem, spelling within + its language. A Snowball dictionary requires a language parameter to + identify which stemmer to use, and optionally can specify a stopword file + name that gives a list of words to eliminate. + (PostgreSQL's standard stopword lists are also + provided by the Snowball project.) For example, there is a built-in definition equivalent to @@ -1782,7 +1813,7 @@ version of our software: PostgreSQL 8.3. => \dF - List of fulltext configurations + List of text search configurations Schema | Name | Description ---------+------+------------- public | pg | @@ -2053,24 +2084,24 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a'); Information about full text searching objects can be obtained - in psql using a set of commands: + in psql using a set of commands: - \dF{,d,p}+ PATTERN + \dF{d,p,t}+ PATTERN An optional + produces more details. 
The optional parameter PATTERN should be the name of - a full text searching object, optionally schema-qualified. If + a text searching object, optionally schema-qualified. If PATTERN is not specified then information about all - visible objects will be displayed. PATTERN can be a - regular expression and can apply separately to schema - names and object names. The following examples illustrate this: + visible objects will be displayed. PATTERN can be a + regular expression and can provide separate patterns + for the schema and object names. The following examples illustrate this: => \dF *fulltext* - List of fulltext configurations + List of text search configurations Schema | Name | Description --------+--------------+------------- public | fulltext_cfg | @@ -2078,7 +2109,7 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a'); => \dF *.fulltext* - List of fulltext configurations + List of text search configurations Schema | Name | Description ----------+---------------------------- fulltext | fulltext_cfg | @@ -2093,46 +2124,42 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a'); - List full text searching configurations (add "+" for more detail) - - - By default (without PATTERN), information about - all visible full text configurations will be - displayed. + List text searching configurations (add + for more detail). 
+ => \dF russian - List of fulltext configurations - Schema | Name | Description -------------+---------+----------------------------------- - pg_catalog | russian | default configuration for Russian + List of text search configurations + Schema | Name | Description +------------+---------+------------------------------------ + pg_catalog | russian | configuration for russian language => \dF+ russian - Configuration "pg_catalog.russian" - Parser name: "pg_catalog.default" - Token | Dictionaries ---------------+------------------------- - email | pg_catalog.simple - file | pg_catalog.simple - float | pg_catalog.simple - host | pg_catalog.simple - hword | pg_catalog.russian_stem - int | pg_catalog.simple - lhword | public.tz_simple - lpart_hword | public.tz_simple - lword | public.tz_simple - nlhword | pg_catalog.russian_stem - nlpart_hword | pg_catalog.russian_stem - nlword | pg_catalog.russian_stem - part_hword | pg_catalog.simple - sfloat | pg_catalog.simple - uint | pg_catalog.simple - uri | pg_catalog.simple - url | pg_catalog.simple - version | pg_catalog.simple - word | pg_catalog.russian_stem +Text search configuration "pg_catalog.russian" +Parser: "pg_catalog.default" + Token | Dictionaries +--------------+-------------- + email | simple + file | simple + float | simple + host | simple + hword | russian_stem + int | simple + lhword | english_stem + lpart_hword | english_stem + lword | english_stem + nlhword | russian_stem + nlpart_hword | russian_stem + nlword | russian_stem + part_hword | russian_stem + sfloat | simple + uint | simple + uri | simple + url | simple + version | simple + word | russian_stem @@ -2142,35 +2169,31 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a'); \dFd[+] [PATTERN] - List full text dictionaries (add "+" for more detail). - - - By default (without PATTERN), information about - all visible dictionaries will be displayed. + List text search dictionaries (add + for more detail). 
=> \dFd - List of fulltext dictionaries - Schema | Name | Description -------------+------------+----------------------------------------------------------- - pg_catalog | danish | Snowball stemmer for danish language - pg_catalog | dutch | Snowball stemmer for dutch language - pg_catalog | english | Snowball stemmer for english language - pg_catalog | finnish | Snowball stemmer for finnish language - pg_catalog | french | Snowball stemmer for french language - pg_catalog | german | Snowball stemmer for german language - pg_catalog | hungarian | Snowball stemmer for hungarian language - pg_catalog | italian | Snowball stemmer for italian language - pg_catalog | norwegian | Snowball stemmer for norwegian language - pg_catalog | portuguese | Snowball stemmer for portuguese language - pg_catalog | romanian | Snowball stemmer for romanian language - pg_catalog | russian | Snowball stemmer for russian language - pg_catalog | simple | simple dictionary: just lower case and check for stopword - pg_catalog | spanish | Snowball stemmer for spanish language - pg_catalog | swedish | Snowball stemmer for swedish language - pg_catalog | turkish | Snowball stemmer for turkish language + List of text search dictionaries + Schema | Name | Description +------------+-----------------+----------------------------------------------------------- + pg_catalog | danish_stem | snowball stemmer for danish language + pg_catalog | dutch_stem | snowball stemmer for dutch language + pg_catalog | english_stem | snowball stemmer for english language + pg_catalog | finnish_stem | snowball stemmer for finnish language + pg_catalog | french_stem | snowball stemmer for french language + pg_catalog | german_stem | snowball stemmer for german language + pg_catalog | hungarian_stem | snowball stemmer for hungarian language + pg_catalog | italian_stem | snowball stemmer for italian language + pg_catalog | norwegian_stem | snowball stemmer for norwegian language + pg_catalog | portuguese_stem | snowball 
stemmer for portuguese language + pg_catalog | romanian_stem | snowball stemmer for romanian language + pg_catalog | russian_stem | snowball stemmer for russian language + pg_catalog | simple | simple dictionary: just lower case and check for stopword + pg_catalog | spanish_stem | snowball stemmer for spanish language + pg_catalog | swedish_stem | snowball stemmer for swedish language + pg_catalog | turkish_stem | snowball stemmer for turkish language @@ -2181,32 +2204,28 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a'); \dFp[+] [PATTERN] - List full text parsers (add "+" for more detail) - - - By default (without PATTERN), information about - all visible full text parsers will be displayed. + List text search parsers (add + for more detail). + - => \dFp - List of fulltext parsers - Schema | Name | Description +=> \dFp + List of text search parsers + Schema | Name | Description ------------+---------+--------------------- pg_catalog | default | default word parser - (1 row) => \dFp+ - Fulltext parser "pg_catalog.default" - Method | Function | Description --------------------+---------------------------+------------- - Start parse | pg_catalog.prsd_start | - Get next token | pg_catalog.prsd_nexttoken | - End parse | pg_catalog.prsd_end | - Get headline | pg_catalog.prsd_headline | - Get lexeme's type | pg_catalog.prsd_lextype | - - Token's types for parser "pg_catalog.default" - Token name | Description + Text search parser "pg_catalog.default" + Method | Function | Description +------------------+----------------+------------- + Start parse | prsd_start | + Get next token | prsd_nexttoken | + End parse | prsd_end | + Get headline | prsd_headline | + Get lexeme types | prsd_lextype | + + Token types for parser "pg_catalog.default" + Token name | Description --------------+----------------------------------- blank | Space symbols email | Email @@ -2237,6 +2256,30 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ 
to_tsquery('supernovae:a'); + + + \dFt[+] [PATTERN] + + + List text search templates (add + for more detail). + + + + +=> \dFt + List of text search templates + Schema | Name | Description +------------+-----------+----------------------------------------------------------- + pg_catalog | ispell | ispell dictionary + pg_catalog | simple | simple dictionary: just lower case and check for stopword + pg_catalog | snowball | snowball stemmer + pg_catalog | synonym | synonym dictionary: replace word by its synonym + pg_catalog | thesaurus | thesaurus dictionary: phrase by phrase substitution + + + + + @@ -2261,7 +2304,7 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a'); - ts_debug type defined as: + ts_debug's result type is defined as: CREATE TYPE ts_debug AS ( @@ -2297,7 +2340,7 @@ ALTER TEXT SEARCH CONFIGURATION public.english SELECT * FROM ts_debug('public.english','The Brightest supernovaes'); - Alias | Description | Token | Dicts list | Lexized token + Alias | Description | Token | Dictionaries | Lexized token -------+---------------+-------------+---------------------------------------+--------------------------------- lword | Latin word | The | {public.english_ispell,pg_catalog.english_stem} | public.english_ispell: {} blank | Space symbols | | | -- 2.40.0