granicus.if.org Git - postgresql/blob - contrib/tsearch2/docs/tsearch2-guide.html

   1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
   2 <html>
   3 <head>
   4 <title>tsearch2 guide</title>
   5 </head>
   6 <body>
   7 <h1 align=center>The tsearch2 Guide</h1>
   8
   9 <p align=center>
  10 Brandon Craig Rhodes<br>30 June 2003
  11 <br>Updated to 8.2 release by Oleg Bartunov, October 2006</br>
  12 <p>
  13 This Guide introduces the reader to the PostgreSQL tsearch2 module,
  14 version&nbsp;2.
  15 More formal descriptions of the module's types and functions
  16 are provided in the <a href="tsearch2-ref.html">tsearch2 Reference</a>,
  17 which is a companion to this document.
  18 <p>
  19 First we will examine the <tt>tsvector</tt> and <tt>tsquery</tt> types
  20 and how they are used to search documents;
  21 next, we will use them to build a simple search engine in&nbsp;SQL;
  22 and finally, we will study the internals of document conversion
  23 and how you might tune the internals to accommodate various searching needs.
  24 <p>
  25 Once you have tsearch2 working with PostgreSQL,
  26 you should be able to run the examples here exactly as they are typed.
  27 <p>
  28 <hr>
  29 <h2>Table of Contents</h2>
  30 <blockquote>
  31 <a href="#intro">Introduction to FTS with tsearch2</a><br>
  32 <a href="#vectors_queries">Vectors and Queries</a><br>
  33 <a href="#simple_search">A Simple Search Engine</a><br>
  34 <a href="#weights">Ranking and Position Weights</a><br>
  35 <a href="#casting">Casting Vectors and Queries</a><br>
  36 <a href="#parsing_lexing">Parsing and Lexing</a><br>
  37 <a href="#ref">Additional information</a>
  38 </blockquote>
  39
  40 <hr>
  41
  42
  43 <h2><a name="intro">Introduction to FTS with tsearch2</a></h2>
  44 The purpose of FTS is to
  45 find <b>documents</b>, which satisfy <b>query</b> and optionally return
  46 them in some <b>order</b>.
  47 Most common case: Find documents containing all query terms and return them in order
  48 of their similarity to the query. Document in database can be
  49 any text attribute, or combination of text attributes from one or many tables
  50 (using joins).
  51 Text search operators existed for years, in PostgreSQL they are
  52 <tt><b>~,~*, LIKE, ILIKE</b></tt>, but they lack linguistic support,
  53 tends to be slow and have no relevance ranking. The idea behind tsearch2 is
  54 is rather simple - preprocess document at index time to save time at search stage.
  55 Preprocessing includes
  56 <ul>
  57 <li>document parsing onto words
  58 <li>linguistic - normalize words to obtain lexemes
  59 <li>store document in optimized for searching way
  60 </ul>
  61 Tsearch2, in a nutshell, provides FTS operator (contains) for two new data types,
  62 which represent document and query - <tt>tsquery  @@ tsvector</tt>.
  63
  64 <P>
  65 <h2><a name=vectors_queries>Vectors and Queries</a></h2>
  66
  67 <blockquote>
  68 <i>This section introduces
  69 the two data types upon which tsearch2 search engines are based,
  70 and illustrates their interaction using the simplest possible case.
  71 The complex examples we present later on
  72 are merely variations and elaborations of this basic mechanism.</i>
  73 </blockquote>
  74 <p>
  75 The tsearch2 module allows you to index documents by the words they contain,
  76 and then perform very efficient searches
  77 for documents that contain a given combination of words.
  78 Preparing your document index involves two steps:
  79 <ul>
  80 <li><b>Making a list of the words each document contains.</b>
  81  You must reduce each document to a <tt>tsvector</tt>
  82  which lists each word that appears in the document.
  83  This process offers many options,
  84  because there is no requirement
  85  that you must copy words into the vector
  86  exactly as they appear in the document.
  87  For example,
  88  many developers omit frequent and content-free <b>stop words</b>
  89  like <i>the</i> to reduce the size of their index;
  90  others reduce different forms of the same word
  91  (<i>forked</i>, <i>forking</i>, <i>forks</i>)
  92  to a common form (<i>fork</i>)
  93  to make search results independent of tense and case.
  94  Because words are very often stored in a modified form,
  95  we use the special term <b>lexemes</b>
  96  for the word forms we actually store in the vector.
  97 <li><b>Creating an index of the documents by lexeme.</b>
  98  This is managed automatically by tsearch2
  99  when you creat a <tt>gist()</tt> index
 100  on the <tt>tsvector</tt> column of a table,
 101  which implements a form of the Berkeley
 102  <a href="http://gist.cs.berkeley.edu/"><i>Generalized Search Tree</i></a>.
 103  Since PostgreSQL 8.2 tsearch2 supports <a href="http://www.sigaev.ru/gin/">Gin</a> index,
 104  which is an inverted index, commonly used in search engines. It adds scalability to tsearch2.
 105 </ul>
 106 Once your documents are indexed,
 107 performing a search involves:
 108 <ul>
 109 <li><b>Reducing the search terms to lexemes.</b>
 110  You must express each search you want to perform
 111  as a <tt>tsquery</tt> specifying a boolean combination of lexemes.
 112  Note that tsearch2 only finds <i>exact</i> matches
 113  between the lexemes in your query and the ones in each vector &mdash;
 114  even capitalization counts as a difference
 115  (which is why all lexemes are usually kept lowercase).
 116  So you must process search words the same way you processed document words;
 117  if <i>forking</i> became <i>fork</i> in the document's <tt>tsvector</tt>,
 118  then the search term <i>forking</i> must also become <i>fork</i>
 119  or the search will not find the document.
 120 <li><b>Retrieving the documents that match the query.</b>
 121  Running a <tt>SELECT</tt> ... <tt>WHERE</tt>
 122  <tt><i>query</i></tt> <tt>@@</tt> <tt><i>vector</i></tt>
 123  on the table with the <tt><i>vector</i></tt> column
 124  will return the documents that match your query.
 125 <li><b>Presenting your results.</b>
 126  This final stage offers as many options
 127  as turning documents into vectors.
 128  You can order documents by how well they matched the search terms;
 129  create a headline for each document
 130  showing some of the phrases in which it uses the search terms;
 131  and restrict the number of results retrieved.
 132  You will of course want some way to identify each document,
 133  so the user can ask for the full text of the ones he wants to read.
 134 </ul>
 135 And beyond deciding upon rules for turning documents into vectors
 136 and for presenting search results to users,
 137 you have to decide <i>where</i> to perform these operations &mdash;
 138 whether one database server
 139 will parse documents, perform searches, and prepare search results,
 140 or whether to spread the load of these operations across several machines.
 141 These are complicated design issues
 142 which we will explore later;
 143 in this section and the next,
 144 we will illustrate what can be accomplished
 145 using a single database server.
 146 <p>
 147 The <tt>default</tt> tsearch2 configuration,
 148 which we will learn more about later,
 149 provides a good example of a process for reducing documents to vectors:
 150
 151 <pre>
 152 =# <b>SELECT set_curcfg('default')</b>
 153 =# <b>SELECT to_tsvector('The air smells of sea water.')</b>
 154              to_tsvector
 155 -------------------------------------
 156  'air':2 'sea':5 'smell':3 'water':6
 157 (1 row)
 158 </pre>
 159
 160 Note the complex relationship between this document and its vector.
 161 The vector lists only words from the document &mdash;
 162 spaces and punctuation have disappeared.
 163 Common words like <i>the</i> and <i>of</i> have been eliminated.
 164 The&nbsp;<i>-s</i> that makes <i>smells</i> a plural has been removed,
 165 leaving a lexeme that represents the word in its simplest form.
 166 And finally,
 167 though the vector remembers the positions in which each word appeared,
 168 it does not store the lexemes in that order.
 169 <p>
 170 Keeping word positions in your vectors is optional, by the way.
 171 The positions are necessary for the tsearch2 ranking functions,
 172 which you can use to prioritize documents
 173 based on how often each document uses the search terms
 174 and whether they appear in close proximity.
 175 But if you do not perform ranking,
 176 or use your own process that ignores the word positions stored in the vector,
 177 then you can save space by stripping them from your vectors:
 178
 179 <pre>
 180 =# <b>SELECT strip(to_tsvector('The air smells of sea water.'))</b>
 181             strip
 182 -----------------------------
 183  'air' 'sea' 'smell' 'water'
 184 (1 row)
 185 </pre>
 186
 187 Now that we have a procedure for creating vectors,
 188 we can build an indexed table of vectors very simply:
 189
 190 <pre>
 191 =# <b>CREATE TABLE vectors ( vector tsvector )</b>
 192 =# <b>CREATE INDEX vector_index ON vectors USING gist(vector)</b>
 193 =# <b>INSERT INTO vectors VALUES (to_tsvector('The path forks here'))</b>
 194 =# <b>INSERT INTO vectors VALUES (to_tsvector('A crawl leads west'))</b>
 195 =# <b>INSERT INTO vectors VALUES (to_tsvector('The left fork leads northeast'))</b>
 196 =# <b>SELECT * FROM vectors</b>
 197                   vector
 198 ------------------------------------------
 199  'fork':3 'path':2
 200  'lead':3 'west':4 'crawl':2
 201  'fork':3 'lead':4 'left':2 'northeast':5
 202 (3 rows)
 203 </pre>
 204
 205 Now we can search this collection of document vectors
 206 using the <tt>@@</tt> operator and a <tt>tsquery</tt>
 207 that specifies the combination of lexemes we are looking for.
 208 Note that while vectors simply list lexemes,
 209 queries always combine them with the operators
 210 &lsquo;<tt>&amp;</tt>&rsquo;&nbsp;and,
 211 &lsquo;<tt>|</tt>&rsquo;&nbsp;or,
 212 and &nbsp;&lsquo;<tt>!</tt>&rsquo;&nbsp;not,
 213 plus parentheses for grouping.
 214 Some examples of the query syntax:
 215 <table align=center>
 216 <tr>
 217  <td>&lsquo;find documents with the word <i>forks</i> in them&rsquo;<br>
 218  <td><tt>'forks'</tt>
 219 <tr>
 220  <td>&lsquo;... with both <i>forks</i> and <i>leads</i>&rsquo;<br>
 221  <td><tt>'forks & leads'</tt>
 222 <tr>
 223  <td>&lsquo;... with either <i>forks</i> or <i>leads</i>&rsquo;<br>
 224  <td><tt>'forks | leads'</tt>
 225 <tr>
 226  <td>&lsquo;... with either <i>forks</i> or <i>leads</i>,
 227   but without <i>crawl</i>&rsquo;<br>
 228  <td><tt>'(forks|leads) & !crawl'</tt>
 229 </table>
 230 The tsearch2 module
 231 provides a <tt>to_tsquery()</tt> function for creating queries
 232 that uses the same process as <tt>to_tsvector()</tt> uses
 233 to reduce words to lexemes.
 234 For instance,
 235 it will remove the&nbsp;<i>-s</i> from the plurals in the last example above:
 236
 237 <pre>
 238 =# <b>SELECT to_tsquery('(leads|forks) & !crawl')</b>
 239            to_tsquery
 240 --------------------------------
 241  ( 'lead' | 'fork' ) & !'crawl'
 242 (1 row)
 243 </pre>
 244
 245 Again,
 246 this is critically important because the search operator <tt>@@</tt>
 247 only finds <i>exact</i> matches
 248 between the words in a query and the words in a vector;
 249 if the document vector lists the lexeme <i>fork</i>
 250 but the query looks for the plural form <i>forks</i>,
 251 the query would not match that document.
 252 Thanks to the symmetry between our process
 253 for producing vectors and queries, however,
 254 the above searches return correct results:
 255
 256 <pre>
 257 =# <b>SELECT * FROM vectors WHERE vector @@ to_tsquery('(leads|forks) & !crawl')</b>
 258                   vector
 259 ------------------------------------------
 260  'fork':3 'path':2
 261  'fork':3 'lead':4 'left':2 'northeast':5
 262 (2 rows)
 263 </pre>
 264
 265 You may want to try the other queries shown above,
 266 and perhaps invent some of your own.
 267 <p>
 268 You should not include stop words in a query,
 269 since you cannot search for words you have discarded.
 270 If you throw out the word <i>the</i> when building vectors, for example,
 271 your index will obviously not know which documents included it.
 272 The <tt>to_tsquery()</tt> function will automatically detect this
 273 and give you an error to prevent this mistake:
 274
 275 <pre>
 276 =# <b>SELECT to_tsquery('the')</b>
 277 NOTICE:  Query contains only stopword(s) or doesn't contain lexem(s), ignored
 278  to_tsquery
 279 ------------
 280
 281 (1 row)
 282 </pre>
 283
 284 But if you every build vectors and queries using your own routines,
 285 a possibility we will discuss later,
 286 then you will need to enforce this rule yourself.
 287
 288 <blockquote><i>
 289 Now that you understand how vectors and queries work together,
 290 you are prepared to tackle many additional topics:
 291 how to distribute searching across many servers;
 292 how to customize the process
 293 by which tsearch2 turns documents and queries into lexemes,
 294 or use a process of your own;
 295 and how to sort and display search results to your users.
 296 But before discussing these detailed questions,
 297 we will build a simple search engine
 298 to see how easily its basic features work together.
 299 </i></blockquote>
 300
 301 <h2><a name=simple_search>A Simple Search Engine</a></h2>
 302
 303 <blockquote><i>
 304 In this section we build a simple search engine out of SQL functions
 305 that use the vector and query types described in the previous section.
 306 While this example is simpler
 307 than a search engine that has to interface with the outside world,
 308 it will illustrate the basic principles of building a search engine,
 309 and better prepare you for developing your own.
 310 </i></blockquote>
 311 Building a search engine involves only a few improvements
 312 upon the rudimentary vector searches described in the last section.
 313 <ul>
 314 <li>Because the user wants to read documents, not vectors,
 315  you must provide some way
 316  for the full text of each document to be accessed &mdash;
 317  either by storing the entire text of each document in the database,
 318  or storing an identifier
 319  like a URL, file name, or document routing number
 320  that lets you fetch the document from other storage.
 321 <li>You can make it easier for user interface code to refer to each document
 322  by providing a unique identifier for each document,
 323  perhaps with a <tt>SERIAL</tt> column.
 324 <li>Search results should be ordered by relevance.
 325  If you leave word positions in your vectors,
 326  you can either have PostgreSQL <tt>ORDER</tt> your results
 327  <tt>BY</tt> a ranking function,
 328  or you can fetch the vectors yourself and perform your own sort.
 329  If you choose to ignore word positions or strip them from your vectors,
 330  you will have to determine relevance yourself,
 331  using either the full text of the document
 332  or other information about each document you may possess.
 333 <li>For each document returned by a search,
 334  you will usually want to display a summary called a <i>headline</i>
 335  that shows short excerpts
 336  illustrating how the document uses the query words.
 337  Headlines are usually generated from the full text of the document,
 338  not from position information in the <tt>tsvector</tt>,
 339  since excerpts lacking stop words, punctuation, and suffixes
 340  would not be comprehensible.
 341  If you store the full text of each document in the database,
 342  headlines can be generated very simply by a tsearch2 function.
 343  If you store your documents elsewhere,
 344  then you will either have to transmit each document to the database
 345  every time you want to run the headline function on it,
 346  or use your own headline code outside of the database.
 347 </ul>
 348 <p>
 349 We can easily construct a simple search engine
 350 that accomplishes these goals.
 351 First we build a table that, for each document,
 352 stores a unique identifier, the full text of the document,
 353 and its <tt>tsvector</tt>:
 354
 355 <pre>
 356 =# <b>CREATE TABLE docs ( id SERIAL, doc TEXT, vector tsvector )</b>
 357 =# <b>CREATE INDEX docs_index ON docs USING gist(vector);</b>
 358 </pre>
 359
 360 Note that although searches will still work
 361 on tables where you have neglected
 362 to create a <tt>gist()</tt> index over your vectors,
 363 they will run much more slowly
 364 since they will have to compare the query
 365 against every document vector in the table.
 366 <p>
 367 Because the table we have created
 368 stores each document in two different ways &mdash;
 369 both as text and as a vector &mdash;
 370 our <tt>INSERT</tt> statements must provide the document in both forms.
 371 While more advanced PostgreSQL programmers
 372 might accomplish this with a database trigger or rule,
 373 for this simple example we will use a small SQL function:
 374
 375 <pre>
 376 =# <b>CREATE FUNCTION insdoc(text) RETURNS void LANGUAGE sql AS
 377   'INSERT INTO docs (doc, vector) VALUES ($1, to_tsvector($1));'</b>
 378 </pre>
 379
 380 Now, by calling <tt>insdoc()</tt> several times,
 381 we can populate our table with documents:
 382
 383 <pre>
 384 =# <b>SELECT insdoc('A low crawl over cobbles leads inward to the west.')</b>
 385 =# <b>SELECT insdoc('The canyon runs into a mass of boulders -- dead end.')</b>
 386 =# <b>SELECT insdoc('You are crawling over cobbles in a low passage.')</b>
 387 =# <b>SELECT insdoc('Cavernous passages lead east, north, and south.')</b>
 388 =# <b>SELECT insdoc('To the east a low wide crawl slants up.')</b>
 389 =# <b>SELECT insdoc('You are in the south side chamber.')</b>
 390 =# <b>SELECT insdoc('The passage here is blocked by a recent cave-in.')</b>
 391 =# <b>SELECT insdoc('You are in a splendid chamber thirty feet high.')</b>
 392 </pre>
 393
 394 Now we can build a search function.
 395 Its <tt>SELECT</tt> statement is based upon
 396 the same <tt>@@</tt> operation illustrated in the previous section.
 397 But instead of returning matching vectors,
 398 we return for each document
 399 its <tt>SERIAL</tt> identifier, so the user can retrieve it later;
 400 a headline that illustrates its use of the search terms;
 401 and a ranking with which we also order the results.
 402 Our search operation can be coded as a single <tt>SELECT</tt> statement
 403 returning its own kind of table row,
 404 which we call a&nbsp;<tt>finddoc_t</tt>:
 405
 406 <pre>
 407 =# <b>CREATE TYPE finddoc_t AS (id INTEGER, headline TEXT, rank REAL)</b>
 408 =# <b>CREATE FUNCTION finddoc(text) RETURNS SETOF finddoc_t LANGUAGE sql AS '
 409    SELECT id, headline(doc, q), rank(vector, q)
 410      FROM docs, to_tsquery($1) AS q
 411      WHERE vector @@ q ORDER BY rank(vector, q) DESC'</b>
 412 </pre>
 413
 414 This function is a rather satisfactory search engine.
 415 Here is one example search,
 416 after which the user fetches the top-ranking document itself;
 417 with similar commands you can try queries of your own:
 418
 419 <pre>
 420 =# <b>SELECT * FROM finddoc('passage|crawl')</b>
 421  id |                       headline                        | rank
 422 ----+-------------------------------------------------------+------
 423   3 | &lt;b&gt;crawling&lt;/b&gt; over cobbles in a low &lt;b&gt;passage&lt;/b&gt;. | 0.19
 424   1 | &lt;b&gt;crawl&lt;/b&gt; over cobbles leads inward to the west.   |  0.1
 425   4 | &lt;b&gt;passages&lt;/b&gt; lead east, north, and south.          |  0.1
 426   5 | &lt;b&gt;crawl&lt;/b&gt; slants up.                               |  0.1
 427   7 | &lt;b&gt;passage&lt;/b&gt; here is blocked by a recent  cave-in.  |  0.1
 428 (5 rows)
 429 =# <b>SELECT doc FROM docs WHERE id = 3</b>
 430                        doc
 431 -------------------------------------------------
 432  You are crawling over cobbles in a low passage.
 433 (1 row)
 434 </pre>
 435
 436 While by default the <tt>headline()</tt> function
 437 surrounds matching words with <tt>&lt;b&gt;</tt> and <tt>&lt;/b&gt;</tt>
 438 in order to distinguish them from the surrounding text,
 439 you can provide options that change its behavior;
 440 consult the tsearch2 Reference for more details about
 441 <a href="tsearch2-ref.html#headlines">Headline Functions</a>.
 442 <p>
 443 Though a search may match hundreds or thousands of documents,
 444 you will usually present only ten or twenty results to the user at a time.
 445 This can be most easily accomplished
 446 by limiting your query with a <tt>LIMIT</tt>
 447 and an <tt>OFFSET</tt> clause &mdash;
 448 to display results ten at a time, for example,
 449 your would generate your first page of results
 450 with <tt>LIMIT</tt> <tt>10</tt> <tt>OFFSET</tt> <tt>0</tt>,
 451 your second page
 452 with <tt>LIMIT</tt> <tt>10</tt> <tt>OFFSET</tt> <tt>10</tt>,
 453 your third page
 454 with <tt>LIMIT</tt> <tt>10</tt> <tt>OFFSET</tt> <tt>20</tt>,
 455 and so forth.
 456 There are two problems with this approach, however.
 457 <p>
 458 The first problem is the strain of running the query over again
 459 for every page of results the user views.
 460 For small document collections or lightly loaded servers,
 461 this may not be a problem;
 462 but the impact can be high
 463 when a search must repeatedly rank and sort
 464 the same ten thousand results
 465 on an already busy server.
 466 So instead of selecting only one page of results,
 467 you will probably use <tt>LIMIT</tt> and <tt>OFFSET</tt>
 468 to return a few dozen or few hundred results,
 469 which you can cache and display to the user one page at a time.
 470 Whether a result cache rewards your effort
 471 will depend principally on the behavior of your users &mdash;
 472 how often they even view the second page of results, for instance.
 473 <p>
 474 The second issue solved by caching involves consistency.
 475 If the database is changing while the user browses their results,
 476 then documents might appear and disappear as they page through them.
 477 In some cases the user might even miss a particular result &mdash;
 478 perhaps the one they were looking for &mdash;
 479 if, say, its rank improves from 31th to 30th
 480 after they load results 21&ndash;30 but before they view results 31&ndash;40.
 481 While many databases are static or infrequently updated,
 482 and will not present this problem,
 483 users searching very dymanic document collections
 484 might benefit from the stable results that caches yield.
 485
 486 <blockquote><i>
 487 Having seen the features of a search engine
 488 implemented entirely within the database,
 489 we will learn about some specific tsearch2 features.
 490 First we will look in more detail at document ranking.
 491 </i></blockquote>
 492
 493 <h2><a name=weights>Ranking and Position Weights</a></h2>
 494
 495 <blockquote><i>
 496 When we built our simple search engine,
 497 we used the </i><tt>rank()</tt><i> function to order our results.
 498 Here we describe tsearch2 ranking in more detail.
 499 </i></blockquote>
 500
 501 There are two functions with which tsearch2 can rank search results.
 502 They both use the lexeme positions listed in the <tt>tsvector</tt>,
 503 so you cannot rank vectors
 504 from which these have been removed with <tt>strip()</tt>.
 505 The <tt>rank()</tt> function existed in older versions of OpenFTS,
 506 and has the feature that you can assign different weights
 507 to words from different sections of your document.
 508 The <tt>rank_cd()</tt> uses a recent technique for weighting results
 509 and also allows  different weight to be given
 510 to different sections of your document (since 8.2).
 511 <p>
 512 Both ranking functions allow you to specify,
 513 as an optional last argument,
 514 whether you want their results <i>normalized</i> &mdash;
 515 whether the rank returned should be adjusted for document length.
 516 Specifying a last argument of <tt>0</tt> (zero) makes no adjustment;
 517 <tt>1</tt> (one) divides the document rank
 518 by the logarithm of the document length;
 519 and <tt>2</tt> divides it by the plain length.
 520 In all of these examples we omit this optional argument,
 521 which is the same as specifying zero &mdash;
 522 we are making no adjustment for document length.
 523 <p>
 524 The <tt>rank_cd()</tt> function uses an experimental measurement
 525 called <i>cover density ranking</i> that rewards documents
 526 when they make frequent use of the search terms
 527 that are close together in the document.
 528 You can read about the algorithm in more detail
 529 in Clarke&nbsp;et&nbsp;al.,
 530  &ldquo;<a href="http://citeseer.nj.nec.com/clarke00relevance.html"
 531 >Relevance Ranking for One to Three Term Queries</a>.&rdquo;
 532 An optional first argument allows you to tune their formula;
 533 for details
 534 see the <a href="tsearch2-ref.html#ranking">section on ranking</a>
 535 in the Reference.
 536 <p>
 537 Currently tsearch2 supports four different weight labels:
 538 <tt>'D'</tt>, the default weight;
 539 and <tt>'A'</tt>, <tt>'B'</tt>, and <tt>'C'</tt>.
 540 All vectors created with <tt>to_tsvector()</tt>
 541 assign the weight <tt>'D'</tt> to each position,
 542 which as the default is not displayed when you print a vector out.
 543 <p>
 544 If you want positions with weights other than <tt>'D'</tt>,
 545 you have two options:
 546 either you can author a vector directly through the <tt>::tsvector</tt>
 547 casting operation,
 548 as described in the following section,
 549 which lets you give each position whichever weight you want;
 550 or you can pass a vector through the <tt>setweight()</tt> function
 551 which sets all of its position weights to a single value.
 552 An example of the latter:
 553
 554
 555 <pre>
 556 =# <b>SELECT vector FROM docs WHERE id = 3</b>
 557                  vector
 558 ----------------------------------------
 559  'low':8 'cobbl':5 'crawl':3 'passag':9
 560 (1 row)
 561 =# <b>SELECT setweight(vector, 'A') FROM docs WHERE id = 3</b>
 562                  setweight
 563 --------------------------------------------
 564  'low':8A 'cobbl':5A 'crawl':3A 'passag':9A
 565 (1 row)
 566 </pre>
 567
 568
 569 Merely changing all of the weights in a vector is not very useful,
 570 of course,
 571 since this results still in all words having the same weight.
 572 But if we parse different parts of a document separately,
 573 giving each section its own weight,
 574 and then concatenate the vectors of each part into a single vector,
 575 the result can be very useful.
 576 We can construct a simple example
 577 in which document titles are given greater weight
 578 that text in the body of the document:
 579
 580
 581 <pre>
 582 =# <b>CREATE TABLE tdocs ( id SERIAL, title TEXT, doc TEXT, vector tsvector )</b>
 583 =# <b>CREATE INDEX tdocs_index ON tdocs USING gist(vector);</b>
 584 =# <b>CREATE FUNCTION instdoc(text, text) RETURNS void LANGUAGE sql AS
 585   'INSERT INTO tdocs (title, doc, vector)
 586    VALUES ($1, $2, setweight(to_tsvector($1), ''A'') || to_tsvector($2));'</b>
 587 </pre>
 588
 589
 590 Now words from a document title will be weighted differently
 591 than those in the main text
 592 if we provide the title and body as separate arguments:
 593
 594
 595 <pre>
 596 =# <b>SELECT instdoc('Spendid Chamber',
 597  'The walls are frozen rivers of orange stone.')</b>
 598  instdoc
 599 ---------
 600
 601 (1 row)
 602 =# <b>SELECT vector FROM tdocs</b>
 603                                     vector
 604 ------------------------------------------------------------------------------
 605  'wall':4 'orang':9 'river':7 'stone':10 'frozen':6 'chamber':2A 'spendid':1A
 606 (1 row)
 607 </pre>
 608
 609
 610 Note that although the necessity is unusual,
 611 you can constrain search terms
 612 to only match words from certain sections
 613 by following them with a colon
 614 and a list of the sections in which the word can occur;
 615 by default this list is <tt>'ABCD'</tt>
 616 so that search terms match words from all sections.
 617 For example,
 618 here we search for a word both generally,
 619 and then looking only for specific weights:
 620
 621
 622 <pre>
 623 =# <b>SELECT title, doc FROM tdocs WHERE vector @@ to_tsquery('spendid')</b>
 624       title      |                     doc
 625 -----------------+----------------------------------------------
 626  Spendid Chamber | The walls are frozen rivers of orange stone.
 627 (1 row)
 628 =# <b>SELECT title, doc FROM tdocs WHERE vector @@ to_tsquery('spendid:A')</b>
 629       title      |                     doc
 630 -----------------+----------------------------------------------
 631  Spendid Chamber | The walls are frozen rivers of orange stone.
 632 (1 row)
 633 =# <b>SELECT title, doc FROM tdocs WHERE vector @@ to_tsquery('spendid:D')</b>
 634  title | doc
 635 -------+-----
 636 (0 rows)
 637 </pre>
 638
 639
 640
 641
 642 <blockquote><i>
 643 Our examples so far use tsearch2 to parse our documents into vectors.
 644 When your application needs absolute control over vector content,
 645 you will want to use direct type casting,
 646 which is described in the next section.
 647 </i></blockquote>
 648
 649 <h2><a name=casting>Casting Vectors and Queries</a></h2>
 650
 651 <blockquote><i>
 652 While tsearch2 has powerful and flexible ways
 653 to process documents and turn them into document vectors,
 654 you will sometimes want to parse documents on your own
 655 and place the results directly in vectors.
 656 Here we show you how.
 657 </i></blockquote>
 658
 659 In the preceding examples,
 660 we used the <tt>to_tsvector()</tt> function
 661 when we needed a document's text reduced to a document vector.
 662 We saw that the function stripped whitespace and punctuation,
 663 eliminated common words,
 664 and altered suffixes to reduce words to a common form.
 665 While these operations are often desirable,
 666 and while in the sections below
 667 we will gain precise control over this process,
 668 there are occasions on which
 669 you want to avoid the changes that <tt>to_tsvector()</tt> makes to text
 670 and specify explicitly the words that you want in your vectors.
 671 Or you may want to create queries directly
 672 rather than through <tt>to_tsquery()</tt>.
 673 <p>
 674 For example,
 675 you may have already developed your own routine
 676 for reducing your documents to searchable lexemes,
 677 and do not want your carefully generated terms altered
 678 by passing them through <tt>to_tsvector()</tt>.
 679 Or you might be developing and debugging parsing routines of your own
 680 that you are not ready to load into the database.
 681 In either case,
 682 you will find that direct insertion is easily accomplished
 683 if you simply follow some simple rules.
 684 <p>
 685 Vectors are created directly
 686 when you cast a string of whitespace separated lexemes
 687 to the <tt>tsvector</tt> type:
 688
 689
 690 <pre>
 691 =# <b>select 'the only exit is the way you came in'::tsvector</b>
 692                      tsvector
 693 --------------------------------------------------
 694  'in' 'is' 'the' 'way' 'you' 'came' 'exit' 'only'
 695 (1 row)
 696 </pre>
 697
 698
 699 Notice that the conversion interpreted the string
 700 simply as a list of lexemes to be included in the vector.
 701 Their order was lost,
 702 as was the number of times each lexeme appeared.
 703 You must keep in mind that directly creating vectors with casting
 704 is <i>not</i> an alternate means of parsing;
 705 it is a way of directly entering lexemes into a vector <i>without</i> parsing.
 706 <p>
 707 Queries can also be created through casting,
 708 if you separate lexemes with boolean operators
 709 rather than with whitespace.
 710 When creating your own vectors and queries,
 711 remember that the search operator <tt>@@</tt>
 712 finds only <i>exact</i> matches between query lexemes and vector lexemes
 713 &mdash;
 714 if they are not exactly the same string,
 715 they will not be considered a match.
 716 <p>
 717 To include lexeme positions in your vector,
 718 write the positions exactly the way tsearch2 displays them
 719 when it prints vectors:
 720 by following each lexeme with a colon
 721 and a comma-separated list of integer positions.
 722 If you list a lexeme more than once,
 723 then all the positions listed for it are combined into a single list.
 724 For example,
 725 here are two ways of writing the same vector,
 726 depending on whether you mention &lsquo;<tt>the</tt>&rsquo; twice
 727 or combine its positions into a list yourself:
 728
 729
 730 <pre>
 731 =# <b>select 'the:1 only:2 exit:3 is:4 the:5 way:6 you:7 came:8 in:9'::tsvector</b>
 732                               tsvector
 733 --------------------------------------------------------------------
 734  'in':9 'is':4 'the':1,5 'way':6 'you':7 'came':8 'exit':3 'only':2
 735 (1 row)
 736 =# <b>select 'the:1,5 only:2 exit:3 is:4 way:6 you:7 came:8 in:9'::tsvector</b>
 737                               tsvector
 738 --------------------------------------------------------------------
 739  'in':9 'is':4 'the':1,5 'way':6 'you':7 'came':8 'exit':3 'only':2
 740 (1 row)
 741 </pre>
 742
 743
 744 Things can get slightly tricky
 745 if you want to include apostrophes, backslashes, or spaces
 746 inside your lexemes
 747 (wanting to include either of the latter would be unusual,
 748 but they can be included if you follow the rules).
 749 The main problem is that the apostrophe and backslash
 750 are important <i>both</i> to PostgreSQL when it is interpreting a string,
 751 <i>and</i> to the <tt>tsvector</tt> conversion function.
 752 You may want to review section
 753 <a href="http://www.postgresql.org/docs/current/static/sql-syntax.html#SQL-SYNTAX-STRINGS">
 754 &ldquo;String Constants&rdquo;</a>
 755 in the PostgreSQL documentation before proceeding.
 756 <p>
 757 When you cast strings directly into vectors:
 758 <ul>
 759 <li>The string is interpreted as a whitespace-separated list of lexemes,
 760  any of which can be suffixed with a colon and a list of positions.
 761 <li>A lexeme can be quoted by preceding it with an apostrophe,
 762  in which case it runs until the next apostrophe;
 763  otherwise a lexeme ends with the first whitespace or colon encountered.
 764 <li>Any character preceded by a backslash,
 765  including whitespace, the apostrophe, the colon, and the backslash itself,
 766  loses its normal meaning and is treated as a letter.
 767  Backslashes are effective
 768  both inside and outside of apostrophe-quoted lexemes.
 769 <li>A lexeme can be suffixed with a list of positions
 770  by appending a colon and a comma-separated list of integers,
 771  each of which can itself be followed by a letter
 772  to designate a position weight
 773  (position weights are <a href="#weights">described below</a>).
 774 </ul>
 775
 776 Here are some example strings,
 777 showing the lexeme you want to insert
 778 together with the string that the <tt>::tsvector</tt> operator
 779 needs to see,
 780 and how you would type that string at the PostgreSQL prompt:
 781
 782 <table align=center>
 783 <tr>
 784 <td><i>For the lexeme...</i>
 785 <td><i>you need the string...</i>
 786 <td><i>which you can type as:</i>
 787 <tr>
 788 <td><tt>nugget</tt>
 789 <td><tt>nugget</tt>
 790 <td><tt>'nugget'</tt>
 791 <tr>
 792 <td><tt>won't</tt>
 793 <td><tt>won't</tt>
 794 <td><tt>'won''t'</tt>
 795 <tr>
 796 <td><tt>pinin'</tt>
 797 <td><tt>pinin'</tt>
 798 <td><tt>'pinin'''</tt>
 799 <tr>
 800 <td><tt>'bout</tt>
 801 <td><tt>\'bout</tt>
 802 <td><tt>'\\''bout'</tt>
 803 <tr>
 804 <td><tt>white mist</tt>
 805 <td><tt>white\ mist</tt>
 806 <td><tt>'white\\ mist'</tt>
 807 <tr>
 808 <td align=right><tt><i>or:</i></tt>
 809 <td><tt>'white mist'</tt>
 810 <td><tt>'''white mist'''</tt>
 811 <tr>
 812 <td><tt>won't budge</tt>
 813 <td><tt>won\'t\ budge</tt>
 814 <td><tt>'won\\''t\\ budge'</tt>
 815 <tr>
 816 <td align=right><tt><i>or:</i></tt>
 817 <td><tt>'won\'t budge'</tt>
 818 <td><tt>'''won\\''t budge'''</tt>
 819 <tr>
 820 <td><tt>back\slashed</tt>
 821 <td><tt>back\\slashed</tt>
 822 <td><tt>'back\\\\slashed'</tt>
 823 </table>
 824
 825 Remember to use the quoted quoting shown at the right
 826 only when typing in strings as part of a PostgreSQL query.
 827 If you are providing strings through a library
 828 that automatically quotes them
 829 or provides them in binary form to PostgreSQL,
 830 then you can use the strings in the middle instead &mdash;
 831 suitably quoted in the language you are using, of course.
 832 <p>
 833 Position weights are <a href="#weights">described below</a>
 834 and can be written exactly as they will be displayed
 835 when you select a weighted vector:
 836
 837 <pre>
 838 =# <b>select 'weighty:1,3A trivial:2B,4'::tsvector</b>
 839            tsvector
 840 -------------------------------
 841  'trivial':2B,4 'weighty':1,3A
 842 (1 row)
 843 </pre>
 844
 845 <p>
 846 Note that if you are composing SQL queries
 847 in a scripting language like Perl or Python,
 848 that itself considers quotes and backslashes special,
 849 then you may have another quoting layer to deal with
 850 on top of the two layers already shown above.
 851 In such cases you may want to write a function
 852 that performs the necessary quoting for you.
 853
 854 <blockquote><i>
 855 Having seen how to create vectors of your own,
 856 it is time to learn how the native tsearch2 parser
 857 reduces documents to vectors.
 858 </i></blockquote>
 859
 860 <h2><a name=parsing_lexing>Parsing and Lexing</a></h2>
 861
 862 <blockquote><i>
 863 The previous section
 864 described how you can bypass the parser provided by tsearch2
 865 and populate your table of documents
 866 with vectors of your own devising.
 867 But for those interested in the native tsearch2 facilities,
 868 we present here an overview of how it goes about
 869 reducing documents to vectors.
 870 </i></blockquote>
 871
 872 The <tt>to_tsvector()</tt> function reduces documents to vectors
 873 in two stages.
 874 First, a <i>parser</i> breaks the input document
 875 into short sequences of text called <i>tokens</i>.
 876 Each token is usually a word, space, or piece of punctuation,
 877 though some parsers return larger and more exotic items
 878 like HTML tags as single tokens.
 879 Each token returned by the parser
 880 is either discarded
 881 or passed to a <i>dictionary</i> that converts it into a lexeme.
 882 The resulting lexemes are collected into a vector and returned.
 883 <p>
 884 The choice of which parser and dictionaries <tt>to_tsvector()</tt> should use
 885 is controlled by your choice of <i>configuration</i>.
 886 The tsearch2 module comes with several configurations,
 887 and you can define more of your own;
 888 in fact the creation of a new configuration is illustrated below,
 889 in the section on position weights.
 890 <p>
 891 To learn about parsing in more detail,
 892 we will study this example:
 893
 894 <pre>
 895 =# <b>select to_tsvector('default',
 896      'The walls extend upward for well over 100 feet.')</b>
 897                        to_tsvector
 898 ----------------------------------------------------------
 899  '100':8 'feet':9 'wall':2 'well':6 'extend':3 'upward':4
 900 (1 row)
 901 </pre>
 902
 903 Unlike the <tt>to_tsvector()</tt> calls used in the above examples,
 904 this one specifies the <tt>'default'</tt> configuration explicitly.
 905 When we called <tt>to_tsvector()</tt> in earlier examples
 906 with only one argument,
 907 it used the <i>current</i> configuration,
 908 which is chosen automatically based on your <tt>LOCALE</tt>
 909 if that locale is mentioned in the <tt>pg_ts_cfg</tt> table
 910 (which is shown under the first bullet in the description below).
 911 If your locale is not listed in the table,
 912 your attempts to use the current configuration will return:
 913
 914 <pre>
 915 ERROR:  Can't find tsearch2 config by locale
 916 </pre>
 917
 918 You can always change the current configuration manually
 919 by calling the <tt>set_curcfg()</tt> function
 920 described in the section on
 921 <a href="tsearch2-ref.html#configurations">Configurations</a>
 922 in the Reference.
 923 <p>
 924 Each configuration serves as an index into two different tables:
 925 in <tt>pg_ts_cfg</tt> it determines
 926 which parser will break our text into tokens,
 927 and in <tt>pg_ts_cfgmap</tt>
 928 it directs each token to a dictionary for processing.
 929 The steps in detail are:
 930
 931 <ul>
 932 <li class=big>
 933 <p>First, our text is parsed,
 934 using the parser listed for our configuration in the <tt>pg_ts_cfg</tt> table.
 935 We are using the <tt>'default'</tt> configuration,
 936 so the table tells us to use the <tt>'default'</tt> parser:
 937
 938 <pre>
 939 =# <b>SELECT * FROM pg_ts_cfg WHERE ts_name = 'default'</b>
 940  ts_name | prs_name | locale
 941 ---------+----------+--------
 942  default | default  | C
 943 (1 row)
 944 </pre>
 945
 946 So our text will be parsed as though we had called:
 947
 948 <pre>
 949 =# <b>select * from parse('default',
 950      'The walls extend upward for well over 100 feet.')</b>
 951 </pre>
 952
 953 This breaks the text into a list of tokens
 954 which are each labelled with an integer type:
 955 <p align=center>
 956 The<sub>1</sub>&diams;<sub>12</sub
 957 >walls<sub>1</sub>&diams;<sub>12</sub
 958 >extend<sub>1</sub>&diams;<sub>12</sub
 959 >upward<sub>1</sub>&diams;<sub>12</sub
 960 >for<sub>1</sub>&diams;<sub>12</sub
 961 >well<sub>1</sub>&diams;<sub>12</sub
 962 >over<sub>1</sub>&diams;<sub>12</sub
 963 >100<sub>22</sub>&diams;<sub>12</sub
 964 >feet<sub>1</sub>.<sub>12</sub>
 965 <p>
 966 Each word has been assigned type&nbsp;1;
 967 each space (represented here by a diamond) and the period, type&nbsp;12;
 968 and the number one hundred, type&nbsp;22.
 969 We can retrieve the alias for each type
 970 through the <tt>token_type</tt> function:
 971
 972 <pre>
 973 =# <b>select * from token_type('default')
 974      where tokid = 1 or tokid = 12 or tokid = 22</b>
 975  tokid | alias |      descr
 976 -------+-------+------------------
 977      1 | lword | Latin word
 978     12 | blank | Space symbols
 979     22 | uint  | Unsigned integer
 980 (3 rows)
 981 </pre>
 982
 983
 984 <li class=big>
 985 Next, the tokens are assigned to dictionaries
 986 by looking up their type aliases in <tt>pg_ts_cfgmap</tt>
 987 to determine which dictionary should process each token.
 988 Since we are using the <tt>'default'</tt> configuration:
 989
 990 <pre>
 991 =# <b>select * from pg_ts_cfgmap where ts_name = 'default' and
 992       (tok_alias = 'lword' or tok_alias = 'blank' or tok_alias = 'uint')</b>
 993  ts_name | tok_alias | dict_name
 994 ---------+-----------+-----------
 995  default | lword     | {en_stem}
 996  default | uint      | {simple}
 997 (2 rows)
 998 </pre>
 999
1000 Since this map provides no dictionary for <tt>blank</tt> tokens,
1001 the spaces and period are simply discarded,
1002 leaving nine tokens,
1003 which are then numbered by their position:
1004 <p align=center>
1005 The<sup>1</sup>
1006 walls<sup>2</sup>
1007 extend<sup>3</sup>
1008 upward<sup>4</sup>
1009 for<sup>5</sup>
1010 well<sup>6</sup>
1011 over<sup>7</sup>
1012 100<sup>8</sup>
1013 feet<sup>9</sup>
1014
1015 <li class=big>
1016 Finally, the words are reduced to lexemes by their respective dictionaries.
1017 The <tt>100</tt> is submitted to the <tt>simple</tt> dictionary,
1018 which returns tokens unaltered except for making them lowercase:
1019
1020 <pre>
1021 =# <b>select lexize('simple', '100')</b>
1022  lexize
1023 --------
1024  {100}
1025 (1 row)
1026 </pre>
1027
1028 The other words are submitted to <tt>en_stem</tt>
1029 which reduces each English word to a linguistic stem,
1030 and then discards stems which belong to its list of stop words;
1031 you can see the list of stop words
1032 in the file whose path is in the <tt>dict_initoption</tt> field
1033 of the <tt>pg_ts_dict</tt> table entry for <tt>en_stem</tt>.
1034 The first three words of our text illustrate respectively
1035 an <tt>en_stem</tt> stop word,
1036 a word which <tt>en_stem</tt> alters by stemming,
1037 and a word which <tt>en_stem</tt> leaves alone:
1038
1039 <pre>
1040 =# <b>select lexize('en_stem', 'The')</b>
1041  lexize
1042 --------
1043  {}
1044 (1 row)
1045 =# <b>select lexize('en_stem', 'walls')</b>
1046  lexize
1047 --------
1048  {wall}
1049 (1 row)
1050 =# <b>select lexize('en_stem', 'extend')</b>
1051   lexize
1052 ----------
1053  {extend}
1054 (1 row)
1055 </pre>
1056
1057 Once <tt>en_stem</tt> is done discarding stop words and stemming the rest,
1058 we are left with:
1059 <p align=center>
1060 wall<sup>2</sup>
1061 extend<sup>3</sup>
1062 upward<sup>4</sup>
1063 well<sup>6</sup>
1064 100<sup>8</sup>
1065 feet<sup>9</sup>
1066 <p>
1067 Which is precisely the result of the example that began this section.
1068 </ul>
1069 Query words are stemmed by the <tt>to_tsquery()</tt> function
1070 using the same scheme to determine the dictionary for each token,
1071 with the difference that the query parser recognizes as special
1072 the boolean operators that separate query words.
1073
1074
1075 <h2><a name="ref">Additional information</a></h2>
1076 More information about tsearch2 is available from
1077 <a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2">tsearch2</a> page.
1078 Also, it's worth to check
1079 <a href="http://www.sai.msu.su/~megera/wiki/Tsearch2">tsearch2 wiki</a> pages.
1080
1081
1082 </body>
1083 </html>
1084
1085