1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
4 <title>tsearch2 guide</title>
7 <h1 align=center>The tsearch2 Guide</h1>
10 Brandon Craig Rhodes<br>30 June 2003
<br>Updated to 8.2 release by Oleg Bartunov, October 2006
13 This Guide introduces the reader to the PostgreSQL tsearch2 module,
15 More formal descriptions of the module's types and functions
16 are provided in the <a href="tsearch2-ref.html">tsearch2 Reference</a>,
17 which is a companion to this document.
19 First we will examine the <tt>tsvector</tt> and <tt>tsquery</tt> types
20 and how they are used to search documents;
21 next, we will use them to build a simple search engine in SQL;
22 and finally, we will study the internals of document conversion
23 and how you might tune the internals to accommodate various searching needs.
25 Once you have tsearch2 working with PostgreSQL,
26 you should be able to run the examples here exactly as they are typed.
29 <h2>Table of Contents</h2>
31 <a href="#intro">Introduction to FTS with tsearch2</a><br>
32 <a href="#vectors_queries">Vectors and Queries</a><br>
33 <a href="#simple_search">A Simple Search Engine</a><br>
34 <a href="#weights">Ranking and Position Weights</a><br>
35 <a href="#casting">Casting Vectors and Queries</a><br>
36 <a href="#parsing_lexing">Parsing and Lexing</a><br>
37 <a href="#ref">Additional information</a>
43 <h2><a name="intro">Introduction to FTS with tsearch2</a></h2>
The purpose of FTS is to
find the <b>documents</b> that satisfy a <b>query</b>,
and optionally to return them in some <b>order</b>.
The most common case: find the documents containing all the query terms,
and return them in order of their similarity to the query. A document in a database can be
any text attribute, or a combination of text attributes from one or more tables.
Text search operators have existed in PostgreSQL for years
(<tt><b>~, ~*, LIKE, ILIKE</b></tt>), but they lack linguistic support,
tend to be slow, and provide no relevance ranking. The idea behind tsearch2
is rather simple: preprocess each document at index time to save time at the search stage.
Preprocessing includes:
<li>parsing the document into words
<li>linguistic processing: normalizing words to obtain lexemes
<li>storing the document in a form optimized for searching
Tsearch2, in a nutshell, provides an FTS operator ("contains") for two new data types,
which represent the query and the document: <tt>tsquery @@ tsvector</tt>.
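For example (a sketch, assuming the <tt>default</tt> configuration that we will select in the examples below), the operator can be applied directly to the output of the conversion functions:

<pre>
=# <b>SELECT to_tsquery('forks') @@ to_tsvector('The path forks here');</b>
 ?column?
----------
 t
</pre>

Both functions reduce <i>forks</i> to the lexeme <i>fork</i>, so the query matches the document.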
65 <h2><a name=vectors_queries>Vectors and Queries</a></h2>
68 <i>This section introduces
69 the two data types upon which tsearch2 search engines are based,
70 and illustrates their interaction using the simplest possible case.
71 The complex examples we present later on
72 are merely variations and elaborations of this basic mechanism.</i>
75 The tsearch2 module allows you to index documents by the words they contain,
76 and then perform very efficient searches
77 for documents that contain a given combination of words.
78 Preparing your document index involves two steps:
80 <li><b>Making a list of the words each document contains.</b>
81 You must reduce each document to a <tt>tsvector</tt>
82 which lists each word that appears in the document.
83 This process offers many options,
84 because there is no requirement
85 that you must copy words into the vector
86 exactly as they appear in the document.
88 many developers omit frequent and content-free <b>stop words</b>
89 like <i>the</i> to reduce the size of their index;
90 others reduce different forms of the same word
91 (<i>forked</i>, <i>forking</i>, <i>forks</i>)
92 to a common form (<i>fork</i>)
93 to make search results independent of tense and case.
94 Because words are very often stored in a modified form,
95 we use the special term <b>lexemes</b>
96 for the word forms we actually store in the vector.
97 <li><b>Creating an index of the documents by lexeme.</b>
98 This is managed automatically by tsearch2
when you create a <tt>gist()</tt> index
100 on the <tt>tsvector</tt> column of a table,
101 which implements a form of the Berkeley
102 <a href="http://gist.cs.berkeley.edu/"><i>Generalized Search Tree</i></a>.
Since PostgreSQL 8.2, tsearch2 also supports the <a href="http://www.sigaev.ru/gin/">GIN</a> index,
an inverted index of the kind commonly used in search engines; it adds scalability to tsearch2.
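Choosing GIN instead of GiST is only a matter of the index method named in <tt>CREATE INDEX</tt>; this sketch uses hypothetical table and index names:

<pre>
=# <b>CREATE TABLE docs ( vector tsvector );</b>
=# <b>CREATE INDEX docs_gin_index ON docs USING gin(vector);</b>
</pre>

GIN indexes are generally faster to search but slower to build and update than GiST, so they suit large and mostly static document collections.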
106 Once your documents are indexed,
107 performing a search involves:
109 <li><b>Reducing the search terms to lexemes.</b>
110 You must express each search you want to perform
111 as a <tt>tsquery</tt> specifying a boolean combination of lexemes.
112 Note that tsearch2 only finds <i>exact</i> matches
113 between the lexemes in your query and the ones in each vector —
114 even capitalization counts as a difference
115 (which is why all lexemes are usually kept lowercase).
116 So you must process search words the same way you processed document words;
117 if <i>forking</i> became <i>fork</i> in the document's <tt>tsvector</tt>,
118 then the search term <i>forking</i> must also become <i>fork</i>
119 or the search will not find the document.
120 <li><b>Retrieving the documents that match the query.</b>
121 Running a <tt>SELECT</tt> ... <tt>WHERE</tt>
122 <tt><i>query</i></tt> <tt>@@</tt> <tt><i>vector</i></tt>
123 on the table with the <tt><i>vector</i></tt> column
124 will return the documents that match your query.
125 <li><b>Presenting your results.</b>
126 This final stage offers as many options
127 as turning documents into vectors.
128 You can order documents by how well they matched the search terms;
129 create a headline for each document
130 showing some of the phrases in which it uses the search terms;
131 and restrict the number of results retrieved.
132 You will of course want some way to identify each document,
133 so the user can ask for the full text of the ones he wants to read.
135 And beyond deciding upon rules for turning documents into vectors
136 and for presenting search results to users,
137 you have to decide <i>where</i> to perform these operations —
138 whether one database server
139 will parse documents, perform searches, and prepare search results,
140 or whether to spread the load of these operations across several machines.
141 These are complicated design issues
142 which we will explore later;
143 in this section and the next,
144 we will illustrate what can be accomplished
145 using a single database server.
147 The <tt>default</tt> tsearch2 configuration,
148 which we will learn more about later,
149 provides a good example of a process for reducing documents to vectors:
152 =# <b>SELECT set_curcfg('default')</b>
153 =# <b>SELECT to_tsvector('The air smells of sea water.')</b>
155 -------------------------------------
156 'air':2 'sea':5 'smell':3 'water':6
160 Note the complex relationship between this document and its vector.
161 The vector lists only words from the document —
162 spaces and punctuation have disappeared.
163 Common words like <i>the</i> and <i>of</i> have been eliminated.
164 The <i>-s</i> that makes <i>smells</i> a plural has been removed,
165 leaving a lexeme that represents the word in its simplest form.
167 though the vector remembers the positions in which each word appeared,
168 it does not store the lexemes in that order.
170 Keeping word positions in your vectors is optional, by the way.
171 The positions are necessary for the tsearch2 ranking functions,
172 which you can use to prioritize documents
173 based on how often each document uses the search terms
174 and whether they appear in close proximity.
175 But if you do not perform ranking,
176 or use your own process that ignores the word positions stored in the vector,
177 then you can save space by stripping them from your vectors:
180 =# <b>SELECT strip(to_tsvector('The air smells of sea water.'))</b>
182 -----------------------------
183 'air' 'sea' 'smell' 'water'
187 Now that we have a procedure for creating vectors,
188 we can build an indexed table of vectors very simply:
191 =# <b>CREATE TABLE vectors ( vector tsvector )</b>
192 =# <b>CREATE INDEX vector_index ON vectors USING gist(vector)</b>
193 =# <b>INSERT INTO vectors VALUES (to_tsvector('The path forks here'))</b>
194 =# <b>INSERT INTO vectors VALUES (to_tsvector('A crawl leads west'))</b>
195 =# <b>INSERT INTO vectors VALUES (to_tsvector('The left fork leads northeast'))</b>
196 =# <b>SELECT * FROM vectors</b>
198 ------------------------------------------
200 'lead':3 'west':4 'crawl':2
201 'fork':3 'lead':4 'left':2 'northeast':5
205 Now we can search this collection of document vectors
206 using the <tt>@@</tt> operator and a <tt>tsquery</tt>
207 that specifies the combination of lexemes we are looking for.
208 Note that while vectors simply list lexemes,
209 queries always combine them with the operators
210 ‘<tt>&</tt>’ and,
211 ‘<tt>|</tt>’ or,
212 and ‘<tt>!</tt>’ not,
213 plus parentheses for grouping.
214 Some examples of the query syntax:
217 <td>‘find documents with the word <i>forks</i> in them’<br>
220 <td>‘... with both <i>forks</i> and <i>leads</i>’<br>
221 <td><tt>'forks & leads'</tt>
223 <td>‘... with either <i>forks</i> or <i>leads</i>’<br>
224 <td><tt>'forks | leads'</tt>
226 <td>‘... with either <i>forks</i> or <i>leads</i>,
227 but without <i>crawl</i>’<br>
228 <td><tt>'(forks|leads) & !crawl'</tt>
Tsearch2 provides a <tt>to_tsquery()</tt> function for creating queries,
which uses the same process as <tt>to_tsvector()</tt>
to reduce words to lexemes.
For instance, it will remove the <i>-s</i> from the plurals in the last example above:
238 =# <b>SELECT to_tsquery('(leads|forks) & !crawl')</b>
240 --------------------------------
241 ( 'lead' | 'fork' ) & !'crawl'
This is critically important because the search operator <tt>@@</tt>
247 only finds <i>exact</i> matches
248 between the words in a query and the words in a vector;
249 if the document vector lists the lexeme <i>fork</i>
250 but the query looks for the plural form <i>forks</i>,
251 the query would not match that document.
252 Thanks to the symmetry between our process
253 for producing vectors and queries, however,
254 the above searches return correct results:
257 =# <b>SELECT * FROM vectors WHERE vector @@ to_tsquery('(leads|forks) & !crawl')</b>
259 ------------------------------------------
261 'fork':3 'lead':4 'left':2 'northeast':5
265 You may want to try the other queries shown above,
266 and perhaps invent some of your own.
268 You should not include stop words in a query,
269 since you cannot search for words you have discarded.
270 If you throw out the word <i>the</i> when building vectors, for example,
271 your index will obviously not know which documents included it.
The <tt>to_tsquery()</tt> function will automatically detect this
and give you a notice to prevent this mistake:
276 =# <b>SELECT to_tsquery('the')</b>
277 NOTICE: Query contains only stopword(s) or doesn't contain lexem(s), ignored
But if you ever build vectors and queries using your own routines,
285 a possibility we will discuss later,
286 then you will need to enforce this rule yourself.
289 Now that you understand how vectors and queries work together,
290 you are prepared to tackle many additional topics:
291 how to distribute searching across many servers;
292 how to customize the process
293 by which tsearch2 turns documents and queries into lexemes,
294 or use a process of your own;
295 and how to sort and display search results to your users.
296 But before discussing these detailed questions,
297 we will build a simple search engine
298 to see how easily its basic features work together.
301 <h2><a name=simple_search>A Simple Search Engine</a></h2>
304 In this section we build a simple search engine out of SQL functions
305 that use the vector and query types described in the previous section.
306 While this example is simpler
307 than a search engine that has to interface with the outside world,
308 it will illustrate the basic principles of building a search engine,
309 and better prepare you for developing your own.
311 Building a search engine involves only a few improvements
312 upon the rudimentary vector searches described in the last section.
314 <li>Because the user wants to read documents, not vectors,
315 you must provide some way
316 for the full text of each document to be accessed —
317 either by storing the entire text of each document in the database,
318 or storing an identifier
319 like a URL, file name, or document routing number
320 that lets you fetch the document from other storage.
321 <li>You can make it easier for user interface code to refer to each document
322 by providing a unique identifier for each document,
323 perhaps with a <tt>SERIAL</tt> column.
324 <li>Search results should be ordered by relevance.
325 If you leave word positions in your vectors,
326 you can either have PostgreSQL <tt>ORDER</tt> your results
327 <tt>BY</tt> a ranking function,
328 or you can fetch the vectors yourself and perform your own sort.
329 If you choose to ignore word positions or strip them from your vectors,
330 you will have to determine relevance yourself,
331 using either the full text of the document
332 or other information about each document you may possess.
333 <li>For each document returned by a search,
334 you will usually want to display a summary called a <i>headline</i>
335 that shows short excerpts
336 illustrating how the document uses the query words.
337 Headlines are usually generated from the full text of the document,
338 not from position information in the <tt>tsvector</tt>,
339 since excerpts lacking stop words, punctuation, and suffixes
340 would not be comprehensible.
341 If you store the full text of each document in the database,
342 headlines can be generated very simply by a tsearch2 function.
343 If you store your documents elsewhere,
344 then you will either have to transmit each document to the database
345 every time you want to run the headline function on it,
346 or use your own headline code outside of the database.
349 We can easily construct a simple search engine
350 that accomplishes these goals.
351 First we build a table that, for each document,
352 stores a unique identifier, the full text of the document,
353 and its <tt>tsvector</tt>:
356 =# <b>CREATE TABLE docs ( id SERIAL, doc TEXT, vector tsvector )</b>
357 =# <b>CREATE INDEX docs_index ON docs USING gist(vector);</b>
360 Note that although searches will still work
361 on tables where you have neglected
362 to create a <tt>gist()</tt> index over your vectors,
363 they will run much more slowly
364 since they will have to compare the query
365 against every document vector in the table.
367 Because the table we have created
368 stores each document in two different ways —
369 both as text and as a vector —
370 our <tt>INSERT</tt> statements must provide the document in both forms.
371 While more advanced PostgreSQL programmers
372 might accomplish this with a database trigger or rule,
373 for this simple example we will use a small SQL function:
376 =# <b>CREATE FUNCTION insdoc(text) RETURNS void LANGUAGE sql AS
377 'INSERT INTO docs (doc, vector) VALUES ($1, to_tsvector($1));'</b>
380 Now, by calling <tt>insdoc()</tt> several times,
381 we can populate our table with documents:
384 =# <b>SELECT insdoc('A low crawl over cobbles leads inward to the west.')</b>
385 =# <b>SELECT insdoc('The canyon runs into a mass of boulders -- dead end.')</b>
386 =# <b>SELECT insdoc('You are crawling over cobbles in a low passage.')</b>
387 =# <b>SELECT insdoc('Cavernous passages lead east, north, and south.')</b>
388 =# <b>SELECT insdoc('To the east a low wide crawl slants up.')</b>
389 =# <b>SELECT insdoc('You are in the south side chamber.')</b>
390 =# <b>SELECT insdoc('The passage here is blocked by a recent cave-in.')</b>
391 =# <b>SELECT insdoc('You are in a splendid chamber thirty feet high.')</b>
394 Now we can build a search function.
395 Its <tt>SELECT</tt> statement is based upon
396 the same <tt>@@</tt> operation illustrated in the previous section.
397 But instead of returning matching vectors,
398 we return for each document
399 its <tt>SERIAL</tt> identifier, so the user can retrieve it later;
400 a headline that illustrates its use of the search terms;
401 and a ranking with which we also order the results.
402 Our search operation can be coded as a single <tt>SELECT</tt> statement
403 returning its own kind of table row,
404 which we call a <tt>finddoc_t</tt>:
407 =# <b>CREATE TYPE finddoc_t AS (id INTEGER, headline TEXT, rank REAL)</b>
408 =# <b>CREATE FUNCTION finddoc(text) RETURNS SETOF finddoc_t LANGUAGE sql AS '
409 SELECT id, headline(doc, q), rank(vector, q)
410 FROM docs, to_tsquery($1) AS q
411 WHERE vector @@ q ORDER BY rank(vector, q) DESC'</b>
414 This function is a rather satisfactory search engine.
415 Here is one example search,
416 after which the user fetches the top-ranking document itself;
417 with similar commands you can try queries of your own:
420 =# <b>SELECT * FROM finddoc('passage|crawl')</b>
422 ----+-------------------------------------------------------+------
423 3 | <b>crawling</b> over cobbles in a low <b>passage</b>. | 0.19
424 1 | <b>crawl</b> over cobbles leads inward to the west. | 0.1
425 4 | <b>passages</b> lead east, north, and south. | 0.1
426 5 | <b>crawl</b> slants up. | 0.1
427 7 | <b>passage</b> here is blocked by a recent cave-in. | 0.1
429 =# <b>SELECT doc FROM docs WHERE id = 3</b>
431 -------------------------------------------------
432 You are crawling over cobbles in a low passage.
436 While by default the <tt>headline()</tt> function
437 surrounds matching words with <tt><b></tt> and <tt></b></tt>
438 in order to distinguish them from the surrounding text,
439 you can provide options that change its behavior;
440 consult the tsearch2 Reference for more details about
441 <a href="tsearch2-ref.html#headlines">Headline Functions</a>.
443 Though a search may match hundreds or thousands of documents,
444 you will usually present only ten or twenty results to the user at a time.
445 This can be most easily accomplished
446 by limiting your query with a <tt>LIMIT</tt>
447 and an <tt>OFFSET</tt> clause —
448 to display results ten at a time, for example,
you would generate your first page of results
450 with <tt>LIMIT</tt> <tt>10</tt> <tt>OFFSET</tt> <tt>0</tt>,
your second page with <tt>LIMIT</tt> <tt>10</tt> <tt>OFFSET</tt> <tt>10</tt>,
and your third with <tt>LIMIT</tt> <tt>10</tt> <tt>OFFSET</tt> <tt>20</tt>, and so on.
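For example, with the <tt>finddoc()</tt> function that we defined earlier, the second page of a ten-per-page display could be fetched like this:

<pre>
=# <b>SELECT * FROM finddoc('passage|crawl') LIMIT 10 OFFSET 10;</b>
</pre>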
456 There are two problems with this approach, however.
458 The first problem is the strain of running the query over again
459 for every page of results the user views.
460 For small document collections or lightly loaded servers,
461 this may not be a problem;
462 but the impact can be high
463 when a search must repeatedly rank and sort
464 the same ten thousand results
465 on an already busy server.
466 So instead of selecting only one page of results,
467 you will probably use <tt>LIMIT</tt> and <tt>OFFSET</tt>
468 to return a few dozen or few hundred results,
469 which you can cache and display to the user one page at a time.
470 Whether a result cache rewards your effort
471 will depend principally on the behavior of your users —
472 how often they even view the second page of results, for instance.
474 The second issue solved by caching involves consistency.
475 If the database is changing while the user browses their results,
476 then documents might appear and disappear as they page through them.
477 In some cases the user might even miss a particular result —
478 perhaps the one they were looking for —
if, say, its rank improves from 31st to 30th
480 after they load results 21–30 but before they view results 31–40.
481 While many databases are static or infrequently updated,
482 and will not present this problem,
users searching very dynamic document collections
484 might benefit from the stable results that caches yield.
487 Having seen the features of a search engine
488 implemented entirely within the database,
489 we will learn about some specific tsearch2 features.
490 First we will look in more detail at document ranking.
493 <h2><a name=weights>Ranking and Position Weights</a></h2>
496 When we built our simple search engine,
497 we used the </i><tt>rank()</tt><i> function to order our results.
498 Here we describe tsearch2 ranking in more detail.
501 There are two functions with which tsearch2 can rank search results.
502 They both use the lexeme positions listed in the <tt>tsvector</tt>,
503 so you cannot rank vectors
504 from which these have been removed with <tt>strip()</tt>.
505 The <tt>rank()</tt> function existed in older versions of OpenFTS,
506 and has the feature that you can assign different weights
507 to words from different sections of your document.
The <tt>rank_cd()</tt> function uses a more recent technique for weighting results,
and since 8.2 it also allows different weights to be given
to different sections of your document.
512 Both ranking functions allow you to specify,
513 as an optional last argument,
514 whether you want their results <i>normalized</i> —
515 whether the rank returned should be adjusted for document length.
516 Specifying a last argument of <tt>0</tt> (zero) makes no adjustment;
517 <tt>1</tt> (one) divides the document rank
518 by the logarithm of the document length;
519 and <tt>2</tt> divides it by the plain length.
520 In all of these examples we omit this optional argument,
521 which is the same as specifying zero —
522 we are making no adjustment for document length.
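As a sketch (the column aliases here are our own invention), you can compare the unadjusted and length-adjusted ranks side by side on the <tt>docs</tt> table from our search engine:

<pre>
=# <b>SELECT id, rank(vector, q, 0) AS raw_rank, rank(vector, q, 2) AS by_length
     FROM docs, to_tsquery('crawl') AS q WHERE vector @@ q;</b>
</pre>

With normalization <tt>2</tt>, longer documents receive proportionally lower ranks than with the unadjusted measurement.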
The <tt>rank_cd()</tt> function uses an experimental measurement
called <i>cover density ranking</i>, which rewards documents
that make frequent use of search terms
appearing close together in the document.
528 You can read about the algorithm in more detail
529 in Clarke et al.,
530 “<a href="http://citeseer.nj.nec.com/clarke00relevance.html"
531 >Relevance Ranking for One to Three Term Queries</a>.”
532 An optional first argument allows you to tune their formula;
see the <a href="tsearch2-ref.html#ranking">section on ranking</a> in the tsearch2 Reference.
537 Currently tsearch2 supports four different weight labels:
538 <tt>'D'</tt>, the default weight;
539 and <tt>'A'</tt>, <tt>'B'</tt>, and <tt>'C'</tt>.
540 All vectors created with <tt>to_tsvector()</tt>
541 assign the weight <tt>'D'</tt> to each position,
542 which as the default is not displayed when you print a vector out.
544 If you want positions with weights other than <tt>'D'</tt>,
545 you have two options:
either you can author a vector directly through the <tt>::tsvector</tt> cast,
as described in the following section,
549 which lets you give each position whichever weight you want;
550 or you can pass a vector through the <tt>setweight()</tt> function
551 which sets all of its position weights to a single value.
552 An example of the latter:
556 =# <b>SELECT vector FROM docs WHERE id = 3</b>
558 ----------------------------------------
559 'low':8 'cobbl':5 'crawl':3 'passag':9
561 =# <b>SELECT setweight(vector, 'A') FROM docs WHERE id = 3</b>
563 --------------------------------------------
564 'low':8A 'cobbl':5A 'crawl':3A 'passag':9A
Merely changing all of the weights in a vector is not very useful,
since the result is still that every word has the same weight.
572 But if we parse different parts of a document separately,
573 giving each section its own weight,
574 and then concatenate the vectors of each part into a single vector,
575 the result can be very useful.
576 We can construct a simple example
577 in which document titles are given greater weight
than text in the body of the document:
582 =# <b>CREATE TABLE tdocs ( id SERIAL, title TEXT, doc TEXT, vector tsvector )</b>
583 =# <b>CREATE INDEX tdocs_index ON tdocs USING gist(vector);</b>
584 =# <b>CREATE FUNCTION instdoc(text, text) RETURNS void LANGUAGE sql AS
585 'INSERT INTO tdocs (title, doc, vector)
586 VALUES ($1, $2, setweight(to_tsvector($1), ''A'') || to_tsvector($2));'</b>
590 Now words from a document title will be weighted differently
591 than those in the main text
592 if we provide the title and body as separate arguments:
596 =# <b>SELECT instdoc('Spendid Chamber',
597 'The walls are frozen rivers of orange stone.')</b>
602 =# <b>SELECT vector FROM tdocs</b>
604 ------------------------------------------------------------------------------
605 'wall':4 'orang':9 'river':7 'stone':10 'frozen':6 'chamber':2A 'spendid':1A
Note that, although the need for it is unusual,
611 you can constrain search terms
612 to only match words from certain sections
613 by following them with a colon
614 and a list of the sections in which the word can occur;
615 by default this list is <tt>'ABCD'</tt>
616 so that search terms match words from all sections.
As an example, here we search for a word generally,
and then look only for specific weights:
623 =# <b>SELECT title, doc FROM tdocs WHERE vector @@ to_tsquery('spendid')</b>
625 -----------------+----------------------------------------------
626 Spendid Chamber | The walls are frozen rivers of orange stone.
628 =# <b>SELECT title, doc FROM tdocs WHERE vector @@ to_tsquery('spendid:A')</b>
630 -----------------+----------------------------------------------
631 Spendid Chamber | The walls are frozen rivers of orange stone.
633 =# <b>SELECT title, doc FROM tdocs WHERE vector @@ to_tsquery('spendid:D')</b>
643 Our examples so far use tsearch2 to parse our documents into vectors.
644 When your application needs absolute control over vector content,
645 you will want to use direct type casting,
646 which is described in the next section.
649 <h2><a name=casting>Casting Vectors and Queries</a></h2>
652 While tsearch2 has powerful and flexible ways
653 to process documents and turn them into document vectors,
654 you will sometimes want to parse documents on your own
655 and place the results directly in vectors.
656 Here we show you how.
659 In the preceding examples,
660 we used the <tt>to_tsvector()</tt> function
661 when we needed a document's text reduced to a document vector.
662 We saw that the function stripped whitespace and punctuation,
663 eliminated common words,
664 and altered suffixes to reduce words to a common form.
665 While these operations are often desirable,
666 and while in the sections below
667 we will gain precise control over this process,
668 there are occasions on which
669 you want to avoid the changes that <tt>to_tsvector()</tt> makes to text
670 and specify explicitly the words that you want in your vectors.
671 Or you may want to create queries directly
672 rather than through <tt>to_tsquery()</tt>.
For example, you may have already developed your own routine
676 for reducing your documents to searchable lexemes,
677 and do not want your carefully generated terms altered
678 by passing them through <tt>to_tsvector()</tt>.
679 Or you might be developing and debugging parsing routines of your own
680 that you are not ready to load into the database.
In either case, you will find that direct insertion is easily accomplished
if you follow a few simple rules.
685 Vectors are created directly
686 when you cast a string of whitespace separated lexemes
687 to the <tt>tsvector</tt> type:
691 =# <b>select 'the only exit is the way you came in'::tsvector</b>
693 --------------------------------------------------
694 'in' 'is' 'the' 'way' 'you' 'came' 'exit' 'only'
699 Notice that the conversion interpreted the string
700 simply as a list of lexemes to be included in the vector.
701 Their order was lost,
702 as was the number of times each lexeme appeared.
703 You must keep in mind that directly creating vectors with casting
704 is <i>not</i> an alternate means of parsing;
705 it is a way of directly entering lexemes into a vector <i>without</i> parsing.
707 Queries can also be created through casting,
708 if you separate lexemes with boolean operators
709 rather than with whitespace.
710 When creating your own vectors and queries,
711 remember that the search operator <tt>@@</tt>
finds only <i>exact</i> matches between query lexemes and vector lexemes;
if they are not exactly the same string,
they will not be considered a match.
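For example, casting a string containing boolean operators produces a query without any linguistic processing:

<pre>
=# <b>select 'fork & lead'::tsquery;</b>
     tsquery
-----------------
 'fork' & 'lead'
</pre>

Because no processing occurs, a query cast from the string <tt>'forks'</tt> would <i>not</i> match a vector containing the lexeme <i>fork</i>.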
717 To include lexeme positions in your vector,
718 write the positions exactly the way tsearch2 displays them
719 when it prints vectors:
720 by following each lexeme with a colon
721 and a comma-separated list of integer positions.
722 If you list a lexeme more than once,
723 then all the positions listed for it are combined into a single list.
For example, here are two ways of writing the same vector,
726 depending on whether you mention ‘<tt>the</tt>’ twice
727 or combine its positions into a list yourself:
731 =# <b>select 'the:1 only:2 exit:3 is:4 the:5 way:6 you:7 came:8 in:9'::tsvector</b>
733 --------------------------------------------------------------------
734 'in':9 'is':4 'the':1,5 'way':6 'you':7 'came':8 'exit':3 'only':2
736 =# <b>select 'the:1,5 only:2 exit:3 is:4 way:6 you:7 came:8 in:9'::tsvector</b>
738 --------------------------------------------------------------------
739 'in':9 'is':4 'the':1,5 'way':6 'you':7 'came':8 'exit':3 'only':2
744 Things can get slightly tricky
if you want to include apostrophes, backslashes, or spaces in your lexemes
747 (wanting to include either of the latter would be unusual,
748 but they can be included if you follow the rules).
749 The main problem is that the apostrophe and backslash
750 are important <i>both</i> to PostgreSQL when it is interpreting a string,
751 <i>and</i> to the <tt>tsvector</tt> conversion function.
752 You may want to review section
753 <a href="http://www.postgresql.org/docs/current/static/sql-syntax.html#SQL-SYNTAX-STRINGS">
754 “String Constants”</a>
755 in the PostgreSQL documentation before proceeding.
757 When you cast strings directly into vectors:
759 <li>The string is interpreted as a whitespace-separated list of lexemes,
760 any of which can be suffixed with a colon and a list of positions.
761 <li>A lexeme can be quoted by preceding it with an apostrophe,
762 in which case it runs until the next apostrophe;
763 otherwise a lexeme ends with the first whitespace or colon encountered.
764 <li>Any character preceded by a backslash,
765 including whitespace, the apostrophe, the colon, and the backslash itself,
766 loses its normal meaning and is treated as a letter.
767 Backslashes are effective
768 both inside and outside of apostrophe-quoted lexemes.
769 <li>A lexeme can be suffixed with a list of positions
770 by appending a colon and a comma-separated list of integers,
771 each of which can itself be followed by a letter
772 to designate a position weight
(position weights are <a href="#weights">described above</a>).
776 Here are some example strings,
777 showing the lexeme you want to insert
together with the string that the <tt>::tsvector</tt> operator needs to receive,
and how you would type that string at the PostgreSQL prompt:
784 <td><i>For the lexeme...</i>
785 <td><i>you need the string...</i>
786 <td><i>which you can type as:</i>
790 <td><tt>'nugget'</tt>
794 <td><tt>'won''t'</tt>
798 <td><tt>'pinin'''</tt>
802 <td><tt>'\\''bout'</tt>
804 <td><tt>white mist</tt>
805 <td><tt>white\ mist</tt>
806 <td><tt>'white\\ mist'</tt>
808 <td align=right><tt><i>or:</i></tt>
809 <td><tt>'white mist'</tt>
810 <td><tt>'''white mist'''</tt>
812 <td><tt>won't budge</tt>
813 <td><tt>won\'t\ budge</tt>
814 <td><tt>'won\\''t\\ budge'</tt>
816 <td align=right><tt><i>or:</i></tt>
817 <td><tt>'won\'t budge'</tt>
818 <td><tt>'''won\\''t budge'''</tt>
820 <td><tt>back\slashed</tt>
821 <td><tt>back\\slashed</tt>
822 <td><tt>'back\\\\slashed'</tt>
Remember to use the extra layer of quoting shown at the right
826 only when typing in strings as part of a PostgreSQL query.
827 If you are providing strings through a library
828 that automatically quotes them
829 or provides them in binary form to PostgreSQL,
830 then you can use the strings in the middle instead —
831 suitably quoted in the language you are using, of course.
833 Position weights are <a href="#weights">described below</a>
834 and can be written exactly as they will be displayed
835 when you select a weighted vector:
838 =# <b>select 'weighty:1,3A trivial:2B,4'::tsvector</b>
840 -------------------------------
841 'trivial':2B,4 'weighty':1,3A
Note that if you are composing SQL queries
in a scripting language like Perl or Python
that itself treats quotes and backslashes as special,
then you may have yet another quoting layer to deal with
on top of the two layers already shown above.
In such cases you may want to write a function
that performs the necessary quoting for you.
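As a sketch of such a function (our own illustration, assuming the pre-8.2, non-standard-conforming string syntax used throughout this guide), here is a Python helper that turns a tsvector-escaped string into a PostgreSQL string literal by doubling backslashes and apostrophes:

```python
def sql_quote(s):
    """Wrap a string in single quotes as an old-style PostgreSQL
    string literal: backslashes are doubled and embedded
    apostrophes are doubled.  Illustrative only; a real application
    should prefer its database driver's parameter binding.
    """
    return "'" + s.replace("\\", "\\\\").replace("'", "''") + "'"

print(sql_quote("won\\'t\\ budge"))  # 'won\\''t\\ budge'
```

Applied to the middle-column strings from the table above, this produces exactly the right-column forms.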
855 Having seen how to create vectors of your own,
856 it is time to learn how the native tsearch2 parser
857 reduces documents to vectors.
860 <h2><a name=parsing_lexing>Parsing and Lexing</a></h2>
The previous section described how you can bypass the parser provided by tsearch2
865 and populate your table of documents
866 with vectors of your own devising.
867 But for those interested in the native tsearch2 facilities,
868 we present here an overview of how it goes about
869 reducing documents to vectors.
The <tt>to_tsvector()</tt> function reduces documents to vectors in two stages.
874 First, a <i>parser</i> breaks the input document
875 into short sequences of text called <i>tokens</i>.
876 Each token is usually a word, space, or piece of punctuation,
877 though some parsers return larger and more exotic items
878 like HTML tags as single tokens.
Each token returned by the parser
is either discarded
or passed to a <i>dictionary</i> that converts it into a lexeme.
882 The resulting lexemes are collected into a vector and returned.
884 The choice of which parser and dictionaries <tt>to_tsvector()</tt> should use
885 is controlled by your choice of <i>configuration</i>.
886 The tsearch2 module comes with several configurations,
887 and you can define more of your own;
888 in fact the creation of a new configuration is illustrated below,
889 in the section on position weights.
891 To learn about parsing in more detail,
892 we will study this example:
895 =# <b>select to_tsvector('default',
896 'The walls extend upward for well over 100 feet.')</b>
898 ----------------------------------------------------------
899 '100':8 'feet':9 'wall':2 'well':6 'extend':3 'upward':4
903 Unlike the <tt>to_tsvector()</tt> calls used in the above examples,
904 this one specifies the <tt>'default'</tt> configuration explicitly.
905 When we called <tt>to_tsvector()</tt> in earlier examples
906 with only one argument,
907 it used the <i>current</i> configuration,
908 which is chosen automatically based on your <tt>LOCALE</tt>
909 if that locale is mentioned in the <tt>pg_ts_cfg</tt> table
910 (which is shown under the first bullet in the description below).
911 If your locale is not listed in the table,
912 your attempts to use the current configuration will return:
915 ERROR: Can't find tsearch2 config by locale
918 You can always change the current configuration manually
919 by calling the <tt>set_curcfg()</tt> function
920 described in the section on
921 <a href="tsearch2-ref.html#configurations">Configurations</a>
924 Each configuration serves as an index into two different tables:
925 in <tt>pg_ts_cfg</tt> it determines
926 which parser will break our text into tokens,
927 and in <tt>pg_ts_cfgmap</tt>
928 it directs each token to a dictionary for processing.
929 The steps in detail are:
933 <p>First, our text is parsed,
934 using the parser listed for our configuration in the <tt>pg_ts_cfg</tt> table.
935 We are using the <tt>'default'</tt> configuration,
936 so the table tells us to use the <tt>'default'</tt> parser:
939 =# <b>SELECT * FROM pg_ts_cfg WHERE ts_name = 'default'</b>
940 ts_name | prs_name | locale
941 ---------+----------+--------
942 default | default | C
946 So our text will be parsed as though we had called:
949 =# <b>select * from parse('default',
950 'The walls extend upward for well over 100 feet.')</b>
953 This breaks the text into a list of tokens
954 which are each labelled with an integer type:
956 The<sub>1</sub>♦<sub>12</sub
957 >walls<sub>1</sub>♦<sub>12</sub
958 >extend<sub>1</sub>♦<sub>12</sub
959 >upward<sub>1</sub>♦<sub>12</sub
960 >for<sub>1</sub>♦<sub>12</sub
961 >well<sub>1</sub>♦<sub>12</sub
962 >over<sub>1</sub>♦<sub>12</sub
963 >100<sub>22</sub>♦<sub>12</sub
964 >feet<sub>1</sub>.<sub>12</sub>
966 Each word has been assigned type 1;
967 each space (represented here by a diamond) and the period, type 12;
968 and the number one hundred, type 22.
969 We can retrieve the alias for each type
970 through the <tt>token_type</tt> function:
973 =# <b>select * from token_type('default')
974 where tokid = 1 or tokid = 12 or tokid = 22</b>
975 tokid | alias | descr
976 -------+-------+------------------
977 1 | lword | Latin word
978 12 | blank | Space symbols
979 22 | uint | Unsigned integer
985 Next, the tokens are assigned to dictionaries
986 by looking up their type aliases in <tt>pg_ts_cfgmap</tt>
987 to determine which dictionary should process each token.
988 Since we are using the <tt>'default'</tt> configuration:
991 =# <b>select * from pg_ts_cfgmap where ts_name = 'default' and
992 (tok_alias = 'lword' or tok_alias = 'blank' or tok_alias = 'uint')</b>
993 ts_name | tok_alias | dict_name
994 ---------+-----------+-----------
995 default | lword | {en_stem}
996 default | uint | {simple}
1000 Since this map provides no dictionary for <tt>blank</tt> tokens,
1001 the spaces and period are simply discarded,
1002 leaving nine tokens,
1003 which are then numbered by their position:
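This discard-and-number step can be modelled in a few lines of Python (a toy illustration with the example's tokens and configuration map hard-coded; this is not tsearch2's actual implementation):

```python
# Toy model of the walkthrough above: tokens as (text, type_alias)
# pairs from the parser, and a map from token type to dictionary.
tokens = [
    ("The", "lword"), (" ", "blank"), ("walls", "lword"), (" ", "blank"),
    ("extend", "lword"), (" ", "blank"), ("upward", "lword"), (" ", "blank"),
    ("for", "lword"), (" ", "blank"), ("well", "lword"), (" ", "blank"),
    ("over", "lword"), (" ", "blank"), ("100", "uint"), (" ", "blank"),
    ("feet", "lword"), (".", "blank"),
]

# No entry for 'blank', so those tokens are simply discarded.
cfgmap = {"lword": "en_stem", "uint": "simple"}

# Discard unmapped tokens, then number the survivors by position.
numbered = [
    (pos, text, cfgmap[alias])
    for pos, (text, alias) in enumerate(
        ((t, a) for t, a in tokens if a in cfgmap), start=1
    )
]
for pos, text, dictionary in numbered:
    print(pos, text, dictionary)
```

Nine tokens survive, numbered 1 through 9, each tagged with the dictionary that will process it next.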
1016 Finally, the words are reduced to lexemes by their respective dictionaries.
1017 The <tt>100</tt> is submitted to the <tt>simple</tt> dictionary,
1018 which returns tokens unaltered except for making them lowercase:
1021 =# <b>select lexize('simple', '100')</b>
The other words are submitted to <tt>en_stem</tt>,
which reduces each English word to a linguistic stem
and then discards stems that appear in its list of stop words;
1031 you can see the list of stop words
1032 in the file whose path is in the <tt>dict_initoption</tt> field
1033 of the <tt>pg_ts_dict</tt> table entry for <tt>en_stem</tt>.
1034 The first three words of our text illustrate respectively
1035 an <tt>en_stem</tt> stop word,
1036 a word which <tt>en_stem</tt> alters by stemming,
1037 and a word which <tt>en_stem</tt> leaves alone:
1040 =# <b>select lexize('en_stem', 'The')</b>
1045 =# <b>select lexize('en_stem', 'walls')</b>
1050 =# <b>select lexize('en_stem', 'extend')</b>
Once <tt>en_stem</tt> has discarded the stop words and stemmed the rest,
the remaining lexemes and their positions are collected into a vector,
which is precisely the result of the example that began this section.
1069 Query words are stemmed by the <tt>to_tsquery()</tt> function
1070 using the same scheme to determine the dictionary for each token,
1071 with the difference that the query parser recognizes as special
1072 the boolean operators that separate query words.
1075 <h2><a name="ref">Additional information</a></h2>
More information about tsearch2 is available from the
<a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2">tsearch2</a> page.
It is also worth checking the
<a href="http://www.sai.msu.su/~megera/wiki/Tsearch2">tsearch2 wiki</a> pages.