<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<title>tsearch-v2-intro</title>
<h2>Tsearch2 - Introduction</h2>
<p>An <a href="
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html">
[Online version]</a> of this document is available.</p>
<p>The tsearch2 module is available as an extension to the
PostgreSQL database to provide Full Text Indexing. This document
is an introduction to installing, configuring, using and
maintaining the database with the tsearch2 module activated.</p>
<p>Please note that the tsearch2 module is fully incompatible with
the old tsearch module, which is deprecated in 7.4 and will be
removed in version 8.0.</p>
<h3>USING TSEARCH2 AND POSTGRESQL FOR A WEB BASED SEARCH
ENGINE</h3>
<p>This documentation is provided as a short guide on how to
quickly get up and running with tsearch2 and PostgreSQL, for those
who want to implement a search engine based on full text indexing.
It is not meant to be a complete in-depth guide into the full ins
and outs of the contrib/tsearch2 module, and is primarily aimed at
beginners who want to speed up searching of large text fields, or
those migrating from other database systems such as MS-SQL.</p>
<p>The README.tsearch2 file included in the contrib/tsearch2
directory contains a brief overview and history behind tsearch.
This can also be found online <a href="
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">[right
here]</a>.</p>
<p>Further in-depth documentation, such as a full function
reference and user guide, can be found online at the <a href="
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/">[tsearch
documentation home]</a>.</p>
<h3>ACKNOWLEDGEMENTS</h3>
<p>Robert John Shepherd originally wrote this documentation for the
previous version of the tsearch module (v1) included with the
PostgreSQL release. I took his documentation and updated it to
reflect the tsearch2 modifications.</p>
<p>Robert's original acknowledgements:</p>
<p>"Thanks to Oleg Bartunov for taking the time to answer many of
my questions regarding this module, and also to Teodor Sigaev for
clearing up the process of making your own dictionaries. Plus of
course a big thanks to the pair of them for writing this module in
the first place."</p>
<p>I would also like to extend my thanks to the developers, and to
Oleg Bartunov for all of his direction and help with the new
features of tsearch2.</p>
<p>MS-SQL provides a full text indexing (FTI) system which enables
the fast searching of text based fields, very useful for websites
(and other applications) that require a result set based on key
words. PostgreSQL ships with a contributed module called tsearch2,
which implements a special type of index that can also be used for
full text indexing. Furthermore, unlike MS' offering, which
requires regular incremental rebuilds of the text indexes
themselves, tsearch2 indexes are always up-to-date and keeping them
so induces very little overhead.</p>
<p>Before we get into the details, it is recommended that you have
installed and tested PostgreSQL, are reasonably familiar with
databases and the SQL query language, and also understand the
basics of connecting to PostgreSQL from the local shell. This
document isn't intended for the complete PostgreSQL newbie, but
anyone with a reasonable grasp of the basics should be able to
follow it.</p>
<p>Starting with PostgreSQL version 7.4, tsearch2 is included in
the contrib directory of the PostgreSQL sources. contrib/tsearch2
is where you will find everything needed to install and use
tsearch2. Please note that tsearch2 will also work with PostgreSQL
version 7.3.x, but it is not the module included with that source
distribution; you will have to download the module separately and
install it in the same fashion.</p>
<p>I installed the tsearch2 module to a PostgreSQL 7.3 database
from the contrib directory without squashing the original (old)
tsearch module. What I did was move the module's source
directory into the contrib tree under the name tsearchV2.</p>
<p>Step one is to download the tsearch V2 module: <a href="
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">[http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/]</a>
(check Development History for the latest stable version!)</p>
tar -zxvf tsearch-v2.tar.gz
mv tsearch2 $PGSQL_SRC/contrib/
cd $PGSQL_SRC/contrib/tsearch2
<p>If you are installing with PostgreSQL version 7.4 or higher, you
can skip those steps and just change to the contrib/tsearch2
directory in the source tree and continue from there.</p>
<p>As of May 9, 2004 there is a source patch available for
tsearch2. The patch provides changes to the pg_ts_ configuration
tables to allow for easy dump and restore of a database containing
tsearch2. The patch is available here: <a href="
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/regprocedure_7.4.patch.gz">
[http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/regprocedure_7.4.patch.gz]</a></p>
<p>To apply this patch, download the mentioned file and place it in
your PostgreSQL source tree ($PGSQL_SRC). This patch is not
required for tsearch2 to work. I would, however, highly recommend
it, as it makes the backup and restore procedures very simple.</p>

gunzip regprocedure_7.4.patch.gz
patch -b -p1 < regprocedure_7.4.patch
<p>If you have a working version of tsearch2 in your database, you
do not need to re-install the tsearch2 module. Just apply the patch
and run make; this patch only affects the tsearch2.sql file. You
can also run the SQL script found <a href="
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/regprocedure_update.sql">
[right here]</a>. This script makes the modifications found in
the patch, and updates the fields from the existing data. From this
point on, you can dump and restore the database in a normal
fashion. Without this patch, you must follow the instructions later
in this document for backup and restore.</p>
<p>This patch is only needed for tsearch2 in PostgreSQL versions
7.3.x and 7.4.x. The patch has been applied to the sources for
later versions.</p>
<p>When you have your source tree for tsearch2 ready, you can
continue with the regular building and installation process.</p>
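<p>The build commands themselves are the standard ones for a
contrib module; a typical invocation (assuming $PGSQL_SRC is your
configured PostgreSQL source tree, and that your platform uses
"make" rather than "gmake") would be:</p>

cd $PGSQL_SRC/contrib/tsearch2
make
make install

<p>Run the install step as a user with permission to write to the
PostgreSQL installation directories.</p>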
<p>That is pretty much all you have to do, unless of course you get
errors. If you do get those, you had better check with the
mailing lists over at <a href="
http://www.postgresql.org">http://www.postgresql.org</a> or <a href="
http://openfts.sourceforge.net/">http://openfts.sourceforge.net/</a>,
since it's never failed for me.</p>
<p>If you ever need to revert this patch and go back to the
unpatched version of tsearch2, it is simple if you followed the
above patch command. The -b option creates a backup of the original
file, so we can just copy it back:</p>

cd $PGSQL_SRC/contrib/tsearch2
cp tsearch.sql.in.orig tsearch.sql.in

<p>If you need the patched version again, just follow the patch
instructions again.</p>
<p>The directory in contrib/ and the directory from the archive
are both called tsearch2. Tsearch2 is completely incompatible with
the previous version of tsearch. This means that both versions can
be installed into a single database, and migration to the new
version may be done gradually.</p>
<p>NOTE: the previous version of tsearch found in the
contrib/tsearch directory is deprecated. Although it is still
available and included within PostgreSQL version 7.4, it will be
removed in version 8.0.</p>
<h3>ADDING TSEARCH2 FUNCTIONALITY TO A DATABASE</h3>
<p>We should create a database to use as an example for the
remainder of this file. We can call the database "ftstest". You can
create it from the command line like this:</p>
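<p>Using the standard createdb utility that ships with PostgreSQL
(assuming your shell user may create databases):</p>

createdb ftstest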
<p>If you thought installation was easy, this next bit is even
easier. Change to the $PGSQL_SRC/contrib/tsearch2 directory and
run:</p>

psql ftstest < tsearch2.sql
<p>The file "tsearch2.sql" holds all the wonderful little goodies
you need to do full text indexing. It defines numerous functions
and operators, and creates the needed tables in the database. Four
new tables are created by running the tsearch2.sql file:
pg_ts_dict, pg_ts_parser, pg_ts_cfg and pg_ts_cfgmap.</p>
<p>You can check out the tables if you like:</p>

 Schema |     Name     | Type  |  Owner
--------+--------------+-------+----------
 public | pg_ts_cfg    | table | kopciuch
 public | pg_ts_cfgmap | table | kopciuch
 public | pg_ts_dict   | table | kopciuch
 public | pg_ts_parser | table | kopciuch
<p>You may need to grant permissions on the pg_ts_dict,
pg_ts_parser, pg_ts_cfg and pg_ts_cfgmap tables to let
non-superusers work with tsearch2. GRANT SELECT should be enough
for search-only access.</p>
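<p>For example, to give read access on the configuration tables to
all users (a sketch; substitute a specific role for PUBLIC if you
want tighter control):</p>

GRANT SELECT ON pg_ts_dict, pg_ts_parser, pg_ts_cfg, pg_ts_cfgmap TO PUBLIC;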
<h3>TYPES AND FUNCTIONS PROVIDED BY TSEARCH2</h3>
<p>The first thing we can do is try out some of the types that are
provided for us. Let's look at the tsvector type:</p>
SELECT 'Our first string used today'::tsvector;
               tsvector
---------------------------------------
 'Our' 'used' 'first' 'today' 'string'
<p>The results are the words used within our string. Notice they
are not in any particular order. The tsvector type returns a string
of space separated words.</p>
SELECT 'Our first string used today first string'::tsvector;
                   tsvector
-----------------------------------------------
 'Our' 'used' 'first' 'today' 'string'
<p>Notice the result string contains each unique word only once
('first' and 'string' appear once each in the tsvector value),
which of course makes sense if you are searching the full text:
you only need to know each unique word in the text.</p>
<p>Those examples were just casting a text field to the type
tsvector. Let's check out one of the new functions created by the
tsearch2 module.</p>
<p>The function to_tsvector has 3 possible signatures:</p>

to_tsvector(oid, text);
to_tsvector(text, text);
to_tsvector(text);
<p>We will use the second method, with two text arguments. The
overloaded methods provide us with a way to specify the
configuration used to break the searchable text up into words (the
stemming process). Right now we will specify the 'default'
configuration. See the section on TSEARCH2 CONFIGURATION to learn
more about this.</p>
SELECT to_tsvector('default',
'Our first string used today first string');
                 to_tsvector
--------------------------------------------
 'use':4 'first':2,6 'today':5 'string':3,7
<p>The result returned from this function is of type tsvector. The
results came about by this reasoning: all of the words in the text
passed in are stemmed, or dropped because they are stop words
defined in our configuration. Each lower case morphed word is
returned with all of its positions in the text.</p>
<p>In this case the word "Our" is a stop word in the default
configuration. That means it will not be included in the result.
The word "first" is found at positions 2 and 6 (although "Our" is a
stop word, its position is maintained). The word positions are
maintained exactly as in the original string. The word "used" is
morphed to the word "use" based on the default configuration for
word stemming, and is found at position 4. The rest of the results
follow the same logic. Just a reminder: the words in the output do
not appear in any particular order (e.g. 'use':4 appears
first).</p>
<p>If you want to view the output of the tsvector fields without
their positions, you can do so with the function
"strip(tsvector)":</p>

SELECT strip(to_tsvector('default',
'Our first string used today first string'));
             strip
--------------------------------
 'use' 'first' 'today' 'string'
<p>If you wish to know the number of unique words returned in the
tsvector, you can do so by using the function
"length(tsvector)":</p>

SELECT length(to_tsvector('default',
'Our first string used today first string'));
 length
--------
      4
(1 row)
<p>Let's take a look at the function to_tsquery. It also has 3
signatures, which follow the same rationale as the to_tsvector
function:</p>

to_tsquery(oid, text);
to_tsquery(text, text);
to_tsquery(text);
<p>Let's try using the function with a single word:</p>

SELECT to_tsquery('default', 'word');
 to_tsquery
------------
 'word'
(1 row)

<p>I call the function the same way I would the to_tsvector
function, specifying the 'default' configuration for morphing, and
the result is the stemmed output 'word'.</p>
<p>Let's attempt to use the function with a string of multiple
words:</p>

SELECT to_tsquery('default', 'this is many words');

<p>The function cannot accept a space separated string. The
intention of the to_tsquery function is to return a value of type
"tsquery" used for searching a tsvector field. What we need to do
is search for one to many words with some kind of logic (for now a
simple OR):</p>

SELECT to_tsquery('default', 'searching|sentence');
      to_tsquery
----------------------
 'search' | 'sentenc'
<p>Notice that the words are separated by the boolean logic "OR";
the text can contain the boolean operators &,|,!,() with their
usual meanings.</p>
<p>You cannot use words defined as stop words in your
configuration. The function will not fail; you will just get no
result, and a NOTICE like this:</p>

SELECT to_tsquery('default', 'a|is&not|!the');
NOTICE: Query contains only stopword(s)
or doesn't contain lexeme(s), ignored
<p>That is a beginning to using the types and functions defined in
the tsearch2 module. There are many more functions that I have
not touched on. You can read through the tsearch2.sql file built
when compiling to get more familiar with what is included.</p>
<h3>INDEXING FIELDS IN A TABLE</h3>
<p>The next stage is to add a full text index to an existing table.
In this example we already have a table defined as follows:</p>

CREATE TABLE tblMessages
(
    intIndex   int4,
    strTopic   varchar(100),
    strMessage varchar(2000)
);
<p>We are assuming there are several rows with some kind of data in
them. Any data will do; just do several inserts with test strings
for a topic and a message. Here is some test data I inserted (yes,
I know it's completely useless stuff, but it will serve our
purpose right now):</p>
INSERT INTO tblMessages
VALUES ('1', 'Testing Topic', 'Testing message data input');
INSERT INTO tblMessages
VALUES ('2', 'Movie', 'Breakfast at Tiffany\'s');
INSERT INTO tblMessages
VALUES ('3', 'Famous Author', 'Stephen King');
INSERT INTO tblMessages
VALUES ('4', 'Political Topic',
'Nelson Mandella is released from prison');
INSERT INTO tblMessages
VALUES ('5', 'Nursery rhyme phrase',
'Little jack horner sat in a corner');
INSERT INTO tblMessages
VALUES ('6', 'Gettysburg address quotation',
'Four score and seven years ago'
' our fathers brought forth on this'
' continent a new nation, conceived in'
' liberty and dedicated to the proposition'
' that all men are created equal');
INSERT INTO tblMessages
VALUES ('7', 'Classic Rock Bands',
'Led Zeppelin Grateful Dead and The Sex Pistols');
INSERT INTO tblMessages
VALUES ('8', 'My birth address',
'18 Sommervile road, Regina, Saskatchewan');
INSERT INTO tblMessages
VALUES ('9', 'Joke', 'knock knock : who\'s there?'
' I will not finish this joke');
INSERT INTO tblMessages
VALUES ('10', 'Computer information',
'My computer is a pentium III 400 mHz'
' with 192 megabytes of RAM');
<p>The next stage is to create a special text index which we will
use for FTI, so we can search our table of messages for words or a
phrase. We do this using the SQL command:</p>

ALTER TABLE tblMessages ADD COLUMN idxFTI tsvector;

<p>Note that unlike traditional indexes, this is actually a new
field in the same table, which is then used (through the magic of
the tsearch2 operators and functions) by a special index we will
create in a moment.</p>
<p>The general rule for the initial insertion of data follows these
steps:</p>

1. update the table with the tsvector data
2. vacuum full analyze
3. create the index
4. vacuum full analyze

<p>The data is first updated into the table; the vacuum full
analyze will reclaim unused space. The index is created on the
table after the data has been inserted: having the index created
prior to the update will slow down the process. It can be done in
that manner, this order is just more efficient. After the index has
been created on the table, vacuum full analyze is run again to
update postgres's statistics (ie having the index take effect).</p>
UPDATE tblMessages SET idxFTI=to_tsvector('default', strMessage);

<p>Note that this only inserts the field strMessage as a tsvector,
so if you want to also add strTopic to the information stored, you
should instead do the following, which effectively concatenates the
two fields into one before being inserted into the table:</p>

UPDATE tblMessages
SET idxFTI=to_tsvector('default',coalesce(strTopic,'') ||' '|| coalesce(strMessage,''));

<p><strong>Using the coalesce function makes sure this
concatenation also works with NULL fields.</strong></p>
<p>We need to create the index on the column idxFTI. Keep in mind
that the database will update the index when some action is taken.
In this case we _need_ the index (the whole point of Full Text
INDEXING), so don't worry about any indexing overhead. We will
create an index based on the gist or gin function. GiST is a
Generalized Search Tree index structure; GIN is an inverted index
(see <a href="tsearch2-ref.html#indexes">The tsearch2 Reference:
Indexes</a>).</p>
CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
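<p>If you prefer the inverted-index approach mentioned above, and
your PostgreSQL version provides the GIN access method, the
equivalent statement would be:</p>

CREATE INDEX idxFTI_idx ON tblMessages USING gin(idxFTI);

<p>GIN indexes are generally faster to search but slower to update
than GiST, which makes them a reasonable choice for mostly-static
data.</p>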
<p>After you have converted all of your data and indexed the
column, you can select some rows to see what actually happened. I
will not display output here, but you can play around yourselves
and see what happened.</p>
<p>The last thing to do is set up a trigger so every time a row in
this table is changed, the text index is automatically updated.
This is easily done using:</p>

CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, strMessage);
<p>Or if you are indexing both strMessage and strTopic you should
instead use:</p>

CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE
tsearch2(idxFTI, strTopic, strMessage);

<p>Before you ask, the tsearch2 function accepts multiple fields as
arguments, so there is no need to concatenate the two into one like
in the previous example.</p>
<p>If you want to do something specific with columns, you may write
your very own trigger function using plpgsql or another procedural
language (but not SQL, unfortunately) and use it instead of the
<em>tsearch2</em> trigger.</p>
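<p>As a sketch of such a custom trigger function (the function name
here is hypothetical, and the quoting style assumes a pre-8.0
server without dollar quoting):</p>

CREATE FUNCTION fti_custom_trigger() RETURNS trigger AS '
BEGIN
  -- build the tsvector ourselves instead of using the tsearch2() trigger
  NEW.idxFTI := to_tsvector(''default'',
      coalesce(NEW.strTopic,'''') || '' '' || coalesce(NEW.strMessage,''''));
  RETURN NEW;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE fti_custom_trigger();

<p>Remember that the plpgsql language may first need to be added to
the database (with createlang plpgsql ftstest).</p>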
<p>You could however call other stored procedures from within the
tsearch2 function. Let's say we want to create a function to remove
certain characters (like the @ symbol) from all text:</p>

CREATE FUNCTION dropatsymbol(text)
RETURNS text AS 'select replace($1, \'@\', \' \');' LANGUAGE SQL;

<p>Now we can use this function within the tsearch2 function on the
trigger:</p>

DROP TRIGGER tsvectorupdate ON tblmessages;
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, dropatsymbol, strMessage);
INSERT INTO tblmessages VALUES (69, 'Attempt for dropatsymbol', 'Test@test.com');
<p>If at this point you receive an error stating "ERROR: Can't find
tsearch config by locale", do not worry. You have done nothing
wrong, and tsearch2 is not broken. All that has happened here is
that tsearch2 is set up to use a configuration based on the locale
of the server. All you have to do is change your default
configuration, or add a new one for your specific locale. See the
section on TSEARCH2 CONFIGURATION.</p>
SELECT * FROM tblmessages WHERE intindex = 69;
 intindex |         strtopic         |  strmessage   |        idxfti
----------+--------------------------+---------------+-----------------------
       69 | Attempt for dropatsymbol | Test@test.com | 'test':1 'test.com':2

<p>Notice that the string content was passed through the stored
procedure dropatsymbol. The '@' character was replaced with a
single space, and the output from the procedure was then stored
in the tsvector column.</p>
<p>This could be useful for removing other characters from indexed
text, or for any kind of preprocessing needed on the text prior to
insertion into the index.</p>
<h3>QUERYING A TABLE</h3>
<p>There are some examples in the README.tsearch2 file for querying
a table. One major difference between tsearch and tsearch2 is that
the operator ## is no longer available; only the operator @@ is
defined, using the type tsvector on one side and tsquery on the
other.</p>
<p>Let's search the indexed data for the word "Test". I indexed
based on the concatenation of the strTopic and strMessage
fields:</p>

SELECT intindex, strtopic FROM tblmessages
WHERE idxfti @@ 'test'::tsquery;
 intindex |   strtopic
----------+---------------
        1 | Testing Topic
(1 row)
<p>The only result that matched was the row with a topic "Testing
Topic". Notice that the word I searched for was all lowercase.
Let's see what happens when I query for uppercase "Test":</p>

SELECT intindex, strtopic FROM tblmessages
WHERE idxfti @@ 'Test'::tsquery;
 intindex | strtopic
----------+----------
(0 rows)
<p>We get zero rows returned. The reason is that when the text
was inserted, it was morphed to my default configuration (because
of the call to to_tsvector in the UPDATE statement). If there had
been no morphing done, and the tsvector field(s) contained the word
'Test', a match would have been found.</p>
<p>Most likely the best way to query the field is to use the
to_tsquery function on the right hand side of the @@ operator, like
this:</p>

SELECT intindex, strtopic FROM tblmessages
WHERE idxfti @@ to_tsquery('default', 'Test | Zeppelin');
 intindex |      strtopic
----------+--------------------
        1 | Testing Topic
        7 | Classic Rock Bands
(2 rows)
<p>That query searched for all instances of "Test" OR "Zeppelin".
It returned two rows: the "Testing Topic" row and the "Classic
Rock Bands" row. The to_tsquery function performed the correct
morphology on the parameters, and searched the tsvector field
correctly.</p>
<p>The last example here relates to searching for a phrase, for
example "minority report". This poses a problem with regard to
tsearch2, as it doesn't index phrases, only words. But there is a
way around this which doesn't appear to have a significant impact
on query time, and that is to use a query such as the following:</p>

SELECT intindex, strTopic FROM tblmessages
WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
AND strMessage ~* '.*men are created equal.*';
 intindex |           strtopic
----------+------------------------------
        6 | Gettysburg address quotation
(1 row)

SELECT intindex, strTopic FROM tblmessages
WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
AND strMessage ~* '.*something that does not exist.*';
 intindex | strtopic
----------+----------
(0 rows)
<p>Of course, if you're indexing both strTopic and strMessage and
want to search for this phrase in both, then you will have to get
out the brackets and extend this query a little more.</p>
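<p>One way to extend it (a sketch, reusing the phrase and columns
from the example above) is to apply the regular expression test to
both columns:</p>

SELECT intindex, strTopic FROM tblmessages
WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
AND (strTopic ~* '.*men are created equal.*'
     OR strMessage ~* '.*men are created equal.*');

<p>The tsquery condition narrows the candidate rows cheaply via the
index; the regular expressions then confirm the exact phrase.</p>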
<h3>TSEARCH2 CONFIGURATION</h3>
<p>Some words such as "and", "the", and "who" are automatically not
indexed, since they belong to a pre-existing dictionary of "stop
words" which tsearch2 does not index. If someone needs to search
for "The Who" in your database, they are going to have a tough time
coming up with any results, since both words are ignored in the
indexes. But there is a solution.</p>
<p>Let's say we want to add a word to the stop word list for
english stemming. We could edit the file
'/usr/local/pgsql/share/english.stop' and add a word to the list.
I edited mine to exclude my name from indexing:</p>

- Edit /usr/local/pgsql/share/english.stop
- Add 'andy' to the list
<p>When you connect to the database, the dict_init procedure is run
during initialization, and in my configuration it will read the
stop words from the file I just edited. If you were connected to
the database while editing the stop words, you will need to end the
current session and re-connect. When you re-connect to the
database, 'andy' is no longer indexed:</p>

SELECT to_tsvector('default', 'Andy');
 to_tsvector
-------------

(1 row)

<p>Originally I would get the result:</p>

SELECT to_tsvector('default', 'Andy');
 to_tsvector
-------------
 'andi':1
(1 row)
<p>But since I added it as a stop word, it is now ignored by the
indexing. The stop word was added to the dictionary "en_stem". If I
were to use a different configuration, such as 'simple', the
results would be different. There are no stop words for the simple
dictionary; it will just convert to lower case and index every
word:</p>

SELECT to_tsvector('simple', 'Andy andy The the in out');
             to_tsvector
-------------------------------------
 'in':5 'out':6 'the':3,4 'andy':1,2
<p>All this talk about which configuration to use is leading us
into the actual configuration of tsearch2. In the examples in this
document the configuration has always been specified when using the
tsearch2 functions:</p>

SELECT to_tsvector('default', 'Testing the default config');
SELECT to_tsvector('simple', 'Example of simple Config');
<p>The pg_ts_cfg table holds each configuration you can use with
the tsearch2 functions. As you can see, the ts_name column contains
the 'default' configurations, which are based on a locale, and the
'simple' configuration, which is not based on any locale:</p>

SELECT * from pg_ts_cfg;
     ts_name     | prs_name |    locale
-----------------+----------+--------------
 default         | default  | C
 default_russian | default  | ru_RU.KOI8-R
 simple          | default  |
(3 rows)
<p>Each row in the pg_ts_cfg table contains the name of the
tsearch2 configuration, the name of the parser to use, and the
locale mapped to the configuration. There is only one parser to
choose from in the pg_ts_parser table, called 'default'. More
parsers could be written, but for our needs we will use the
default.</p>
<p>There are 3 configurations installed by tsearch2 initially. If
your locale is set to 'en_US' for example (like my laptop), then as
you can see there is currently no configuration mapped to that
locale. You can either set up a new configuration or just use one
that already exists. If I do not specify which configuration to use
in the to_tsvector function, I receive the following error:</p>
SELECT to_tsvector('learning tsearch is like going to school');
ERROR: Can't find tsearch config by locale
<p>We will create a new configuration for use with the server
locale 'en_US'. The first step is to add a new configuration into
the pg_ts_cfg table. We will call the configuration
'default_english', with the default parser and the locale
'en_US':</p>

INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
VALUES ('default_english', 'default', 'en_US');
<p>We have only declared that there is a configuration called
'default_english'. We still need to set up how 'default_english'
will work. The next step is creating a new dictionary to use. The
configuration of dictionaries is completely different in tsearch2:
in the prior versions, to make changes you would have to re-compile
your changes into tsearch.so. All of the configuration has now been
moved into the system tables created by executing the SQL code from
tsearch2.sql.</p>
<p>Let's take a first look at the pg_ts_dict table:</p>
ftstest=# \d pg_ts_dict
          Table "public.pg_ts_dict"
     Column      |     Type     | Modifiers
-----------------+--------------+-----------
 dict_name       | text         | not null
 dict_init       | regprocedure |
 dict_initoption | text         |
 dict_lexize     | regprocedure | not null
 dict_comment    | text         |
Indexes: pg_ts_dict_pkey primary key btree (dict_name)
<p>The dict_name column is the name of the dictionary, for example
'simple', 'en_stem' or 'ru_stem'. The dict_init column is a text
representation of a stored procedure to run for initialization of
that dictionary, for example 'snb_en_init(text)' or
'snb_ru_init(text)'. The initial configuration of tsearch2 had the
dict_init and dict_lexize columns as type oid; the patch mentioned
in the installation notes changes these types to regprocedure. The
data inserted or updated can still be the oid of the stored
procedure, the representation is just different; this makes the
backup and restore procedures much easier for tsearch2. The
dict_initoption column holds options passed to the init function of
the stored procedure. In the cases of 'en_stem' or 'ru_stem' it is
a path to a stop word file for that dictionary, for example
'/usr/local/pgsql/share/english.stop'. This is however dictated by
the dictionary; ISpell dictionaries may require different options.
The dict_lexize column is the stored procedure of the function used
to lexize, for example 'snb_lexize(internal, internal, integer)'.
The dict_comment column is just a comment.</p>
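<p>A quick way to see these columns for the stemming dictionaries
just described (a read-only query, safe to run as is):</p>

SELECT dict_name, dict_init, dict_initoption
FROM pg_ts_dict
WHERE dict_name IN ('en_stem', 'ru_stem');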
<p>Next we will configure the use of a new dictionary based on
ISpell. We will assume you have ISpell installed on your machine
(in /usr/local/lib).</p>
<p>There has been some confusion in the past as to which files are
used from ISpell. ISpell operates using a hash file. This is a
binary file created by the ISpell command line utility "buildhash".
This utility accepts a file containing the words from the
dictionary, plus the affixes file, and its output is the hash file.
The default installation of ISpell installs the english hash file
english.hash, which is the exact same file as american.hash. ISpell
uses this as the fallback dictionary.</p>
<p>This hash file is not what tsearch2 requires as the ISpell
interface. The file(s) needed are those used to create the hash.
Tsearch2 uses the dictionary words for morphology, so it is the
word listing that is needed, not spellchecking. Regardless, these
files are included in the ISpell sources, and you can use them to
integrate into tsearch2. This is not complicated, but it is not
very obvious to begin with. The tsearch2 ISpell interface needs
only the listing of dictionary words; it will parse and load those
words, and use the ISpell dictionary for lexeme processing.</p>
<p>I found the ISpell make system to be very finicky; their
documentation actually states this to be the case. So I just did
things the command line way. In the ISpell source tree under
languages/english there are several files. For a complete
description, please read the ISpell README. Basically, for the
english dictionary there is the option to create the small, medium,
large and extra large dictionaries. The medium dictionary is
recommended. If the make system is configured correctly, it would
build and install the english.hash file from the medium size
dictionary. Since we are only concerned with the dictionary word
listing, it can be created from the languages/english directory
with the following command:</p>

sort -u -t/ +0f -1 +0 -T /usr/tmp -o english.med english.0 english.1
<p>This will create a file called english.med. You can copy this
file to wherever you like; I placed mine in /usr/local/lib so it
coincides with the ISpell hash files. You can now add the tsearch2
configuration entry for the ISpell english dictionary. We will also
continue to use the english stop word file that was installed for
the en_stem dictionary (you could use a different one if you like).
The ISpell configuration is based on the "ispell_template"
dictionary installed by default with tsearch2. We will use the
stored procedures from the row where dict_name =
'ispell_template'.</p>
INSERT INTO pg_ts_dict
    (SELECT 'en_ispell',
            dict_init,
            'DictFile="/usr/local/lib/english.med",'
            'AffFile="/usr/local/lib/english.aff",'
            'StopFile="/usr/local/pgsql/share/contrib/english.stop"',
            dict_lexize
       FROM pg_ts_dict
      WHERE dict_name = 'ispell_template');
<p>Now that we have a dictionary, we can specify its use in a query
to get a lexeme. For this we will use the lexize function, which
takes the name of the dictionary to use as an argument, just as the
other tsearch2 functions do. You will need to stop your psql
session and start it again for this modification to take
effect.</p>
SELECT lexize('en_ispell', 'program');
<p>If you want to always use the ISpell english dictionary you
have installed, you can configure tsearch2 to always use a specific
dictionary:</p>
SELECT set_curdict('en_ispell');
<p>Lexize is meant to turn a word into a lexeme. It is possible to
receive more than one lexeme for a single word.</p>
SELECT lexize('en_ispell', 'conditionally');
           lexize
-----------------------------
 {conditionally,conditional}
<p>The lexize function is not meant to take a full string as an
argument to return lexemes for. If you pass in an entire sentence,
it attempts to find that entire sentence in the dictionary. Since
the dictionary contains only words, you will receive an empty
result:</p>
SELECT lexize('en_ispell', 'This is a sentence to lexize');
<p>If you parse a lexeme from a word not in the dictionary, then
you will receive an empty result. This makes sense because the word
"tsearch" is not in the english dictionary. You can create your own
additions to the dictionary if you like. This may be useful for
scientific or technical glossaries that need to be indexed.</p>
SELECT lexize('en_ispell', 'tsearch');
 lexize
--------

(1 row)
<p>This is not to say that tsearch will be ignored when adding text
information to the tsvector index column. This will be explained in
greater detail with the table pg_ts_cfgmap.</p>
<p>Next we need to set up the configuration for mapping the
dictionary use to the lexeme parsings. This is done by altering the
pg_ts_cfgmap table. We will insert several rows, specifying to use
the new dictionary we installed and configured for lexizing within
tsearch2. There are several types of lexemes for which we want to
force the use of the ISpell dictionary.</p>
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
VALUES ('default_english', 'lhword', '{en_ispell,en_stem}');
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
VALUES ('default_english', 'lpart_hword', '{en_ispell,en_stem}');
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
VALUES ('default_english', 'lword', '{en_ispell,en_stem}');
<p>We have just inserted 3 records into the configuration mapping,
specifying that the lexeme types "lhword", "lpart_hword" and
"lword" are to be stemmed using the 'en_ispell' dictionary we added
into pg_ts_dict, when using the configuration 'default_english'
which we added to pg_ts_cfg.</p>
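<p>To double-check the mapping, the three rows can be read back
from pg_ts_cfgmap. This is a quick verification sketch against the
same catalog table and columns used in the INSERTs above:</p>

```sql
-- Verify the three ISpell-backed mappings just inserted
SELECT ts_name, tok_alias, dict_name
  FROM pg_ts_cfgmap
 WHERE ts_name = 'default_english'
   AND tok_alias IN ('lword', 'lhword', 'lpart_hword');
```

<p>Each of the three rows should show '{en_ispell,en_stem}' in its
dict_name column.</p>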
<p>There are several other lexeme types used that we do not need to
specify as using the ISpell dictionary. We can simply insert values
using the 'simple' stemming process dictionary.</p>
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'url', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'host', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'sfloat', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'uri', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'int', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'float', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'email', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'word', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'hword', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'nlword', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'nlpart_hword', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'part_hword', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'nlhword', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'file', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'uint', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'version', '{simple}');
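<p>Note that if a 'default_english' row for one of these token
types already exists (for example, left over from an earlier
configuration), the INSERT will fail on a duplicate primary key. In
that case an UPDATE is the safer form; a sketch, assuming the same
key columns as above:</p>

```sql
-- Change an existing mapping instead of inserting a duplicate row
UPDATE pg_ts_cfgmap
   SET dict_name = '{simple}'
 WHERE ts_name = 'default_english'
   AND tok_alias = 'url';
```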
<p>Our addition of a configuration for 'default_english' is now
complete. We have successfully created a new tsearch2
configuration. At the same time we have also set the new
configuration to be our default for the en_US locale.</p>
SELECT to_tsvector('default_english',
       'learning tsearch is like going to school');
                   to_tsvector
--------------------------------------------------
 'go':5 'like':4 'learn':1 'school':7 'tsearch':2
SELECT to_tsvector('learning tsearch is like going to school');
                   to_tsvector
--------------------------------------------------
 'go':5 'like':4 'learn':1 'school':7 'tsearch':2
<p>Notice here that words like "tsearch" are still parsed and
indexed in the tsvector column. There is a lexeme returned for the
word because in the configuration mapping table, we specify words
to be looked up in the 'en_ispell' dictionary first, with the
'en_stem' dictionary as a fallback. Therefore a lexeme is not
returned from en_ispell, but is returned from en_stem, and added to
the tsvector.</p>
SELECT to_tsvector('learning tsearch is like going to computer school');
                                to_tsvector
---------------------------------------------------------------------------
 'go':5 'like':4 'learn':1 'school':8 'compute':7 'tsearch':2 'computer':7
<p>Notice in this last example I added the word "computer" to the
text to be converted into a tsvector. Because we have set up our
default configuration to use the ISpell english dictionary, the
words are lexized, and computer returns 2 lexemes at the same
position. 'compute':7 and 'computer':7 are now both indexed for the
word.</p>
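<p>Having both lexemes indexed means a search can match either
form. As a quick sketch using the to_tsquery function and the @@
match operator that tsearch2 provides, a query for 'computer'
should match the vector above:</p>

```sql
-- Both the stemmed and the unstemmed lexeme are searchable
SELECT to_tsvector('learning tsearch is like going to computer school')
       @@ to_tsquery('computer');
```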
<p>You can create additional dictionary lists, or use the extra
large dictionary from ISpell. You can read through the ISpell
documents and source tree to make modifications as you see
fit.</p>
<p>If you already have a configuration set for the locale, and you
are changing it to your new dictionary configuration, you will have
to set the old locale to NULL. If we were using the 'C' locale, we
would do this:</p>
UPDATE pg_ts_cfg SET locale=NULL WHERE locale = 'C';
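<p>After clearing the old entry, the new configuration can take
over that locale. A sketch, assuming the pg_ts_cfg columns used
above (ts_name, locale):</p>

```sql
-- Point the 'C' locale at the new configuration instead
UPDATE pg_ts_cfg
   SET locale = 'C'
 WHERE ts_name = 'default_english';
```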
<p>That about wraps up the configuration of tsearch2. There is much
more you can do with the tables provided. This was just an
introduction to get things working rather quickly.</p>
<h3>ADDING NEW DICTIONARIES TO TSEARCH2</h3>
<p>To aid in the addition of new dictionaries to the tsearch2
module, you can use another module in combination with tsearch2.
The gendict module is included in the tsearch2 distribution and is
available in the gendict/ subdirectory.</p>
<p>I will not go into detail about installation and instructions on
how to use gendict to its fullest extent right now. You can read
the README.gendict; it has all of the instructions and information
you will need.</p>
<h3>BACKING UP AND RESTORING DATABASES THAT FEATURE TSEARCH2</h3>
<p><strong>Never rely on anyone else's instructions to back up and
restore a database system; always develop and understand your own
methodology, and test it numerous times before you need to do it
for real.</strong></p>
<p>The backup and restore procedure has changed over time. This is
not meant to be the bible for tsearch2 back up and restore. Please
read all sections so you have a complete understanding of some
backup and restore issues. Please test your own procedures, and do
not rely on these instructions solely.</p>
<p>If you come across any issues in your own procedures, please
feel free to bring the question up on the Open-FTS and PostgreSQL
mailing lists.</p>
<h3>ORIGINAL BACKUP PROCEDURES</h3>
<p>Originally, tsearch2 had problems when using the pg_dump and/or
pg_dumpall utilities. The problem lies in the original use of OIDs
for column types. Since OIDs are not consistent across pg_dumps,
when you reload the data values into the pg_ts_dict table, for
example, those OIDs no longer point to anything. You would then end
up trying to use a "broken" tsearch2 configuration.</p>
<p>The solution was to back up and restore a database using the
tsearch2 module in small unique parts, and then load them in the
correct order. You would have to edit the schema and remove the
tsearch stored procedure references in the sql file. You would have
to load your global objects, then the tsearch2 objects. You had to
re-create the tsearch module before restoring your schema so no
conflicts would arise. Then you could restore your data (all
schemas, and types needed for the data, were now available).</p>
<p><strong>The original backup instructions were as
follows:</strong></p>
<p>1) Backup any global database objects such as users and groups
(this step is usually only necessary when you will be restoring to
a fresh system)</p>
pg_dumpall -g > GLOBALobjects.sql
<p>2) Backup the full database schema using pg_dump</p>
pg_dump -s DATABASE > DATABASEschema.sql
<p>3) Backup the full database using pg_dump</p>
pg_dump -Fc DATABASE > DATABASEdata.tar
<p><strong>The original restore procedures were as
follows:</strong></p>
<p>1) Create the blank database</p>
createdb DATABASE
<p>2) Restore any global database objects such as users and groups
(this step is usually only necessary when you will be restoring to
a fresh system)</p>
psql DATABASE < GLOBALobjects.sql
<p>3) Create the tsearch2 objects, functions and operators</p>
psql DATABASE < tsearch2.sql
<p>4) Edit the backed up database schema and delete all SQL
commands which create tsearch2 related functions, operators and
data types, BUT NOT fields in table definitions that specify
tsvector types. If you're not sure what these are, they are the
ones listed in tsearch2.sql. Then restore the edited schema to the
database:</p>
psql DATABASE < DATABASEschema.sql
<p>5) Restore the data for the database</p>
pg_restore -N -a -d DATABASE DATABASEdata.tar
<p>If you get any errors in step 4, it will most likely be because
you forgot to remove an object that was created in tsearch2.sql.
Any errors in step 5 mean the database schema was probably restored
incorrectly.</p>
<p><strong>Issues with this procedure</strong></p>
<p>As I mentioned before, it is vital that you test your own backup
and restore procedures. These procedures were originally adopted
from this document's original author, Robert John Shepherd. They
make use of the pg_dump custom archive functionality. I am not that
familiar with the formatting output of pg_dump and using
pg_restore; I have always had the luxury of using text files
(everything is DATABASE.sql).</p>
<p>One issue not foreseen in the case of using a binary dump is
when you have added more than the default tsearch2 configurations.
Upon reload of the data, it will fail due to duplicate primary
keys. If you load the tsearch2 module, and then delete the data
loaded by tsearch2 into the configuration tables, the data will
restore. The configurations are still incorrect, because you cannot
remove the data using OID references from the custom archive.</p>
<p>It would be very simple to fix this problem if the data were not
in an archive format. I do believe all of your data would have been
restored properly, and you could get things working fairly easily.
All one would have to do is create the configurations as in the
tsearch2.sql file, and then create your custom configurations
afterward.</p>
<p>I have read in the pg_dump man page that if the tar archive
format is used, it is possible to limit which data is restored
using pg_restore. If anyone has more experience with pg_dump
archives and pg_restore, please feel free to test and contribute
your procedure(s).</p>
<h3>CURRENT BACKUP AND RESTORE PROCEDURES</h3>
<p>Currently, a backup and restore of a database using the tsearch2
module can be quite simple, if you have applied the patch mentioned
in the installation instructions prior to tsearch2 installation.
This patch removes the use of the oid column. The text
representation for the stored procedures used is dumped with the
data, and the restoration of the data works seamlessly.</p>
<p>1) To back up the database</p>
pg_dump DATABASE > DATABASE.sql
<p>2) To restore the database</p>
psql -d DATABASE -f DATABASE.sql
<p>This procedure is now like any normal backup and restore
procedure. I cannot say whether this has been proven using the
pg_dump archive and restoring with pg_restore; in theory there
should be no problems with any format after the patch is
applied.</p>
<p>This restoration procedure should never be an issue with the
patch applied to version 8.0 of PostgreSQL. Only versions 7.3 and
7.4 are affected. You can avoid any trouble by applying the patch
prior to installation, or by running the SQL script provided
against the live database before backup and restore is done.</p>