1 <!-- doc/src/sgml/pgtrgm.sgml -->
3 <sect1 id="pgtrgm" xreflabel="pg_trgm">
6 <indexterm zone="pgtrgm">
7 <primary>pg_trgm</primary>
11 The <filename>pg_trgm</filename> module provides functions and operators
12 for determining the similarity of
13 alphanumeric text based on trigram matching, as
14 well as index operator classes that support fast searching for similar
19 <title>Trigram (or Trigraph) Concepts</title>
22 A trigram is a group of three consecutive characters taken
23 from a string. We can measure the similarity of two strings by
24 counting the number of trigrams they share. This simple idea
25 turns out to be very effective for measuring the similarity of
26 words in many natural languages.
31 <filename>pg_trgm</filename> ignores non-word characters
32 (non-alphanumerics) when extracting trigrams from a string.
33 Each word is considered to have two spaces
34 prefixed and one space suffixed when determining the set
35 of trigrams contained in the string.
36 For example, the set of trigrams in the string
37 <quote><literal>cat</literal></quote> is
38 <quote><literal> c</literal></quote>,
39 <quote><literal> ca</literal></quote>,
40 <quote><literal>cat</literal></quote>, and
41 <quote><literal>at </literal></quote>.
42 The set of trigrams in the string
43 <quote><literal>foo|bar</literal></quote> is
44 <quote><literal> f</literal></quote>,
45 <quote><literal> fo</literal></quote>,
46 <quote><literal>foo</literal></quote>,
47 <quote><literal>oo </literal></quote>,
48 <quote><literal> b</literal></quote>,
49 <quote><literal> ba</literal></quote>,
50 <quote><literal>bar</literal></quote>, and
51 <quote><literal>ar </literal></quote>.
57 <title>Functions and Operators</title>
60 The functions provided by the <filename>pg_trgm</filename> module
61 are shown in <xref linkend="pgtrgm-func-table">, the operators
62 in <xref linkend="pgtrgm-op-table">.
65 <table id="pgtrgm-func-table">
66 <title><filename>pg_trgm</filename> Functions</title>
70 <entry>Function</entry>
71 <entry>Returns</entry>
72 <entry>Description</entry>
78 <entry><function>similarity(text, text)</function><indexterm><primary>similarity</primary></indexterm></entry>
79 <entry><type>real</type></entry>
81 Returns a number that indicates how similar the two arguments are.
82 The range of the result is zero (indicating that the two strings are
83 completely dissimilar) to one (indicating that the two strings are
88 <entry><function>show_trgm(text)</function><indexterm><primary>show_trgm</primary></indexterm></entry>
89 <entry><type>text[]</type></entry>
91 Returns an array of all the trigrams in the given string.
92 (In practice this is seldom useful except for debugging.)
97 <function>word_similarity(text, text)</function>
98 <indexterm><primary>word_similarity</primary></indexterm>
100 <entry><type>real</type></entry>
102 Returns a number that indicates how similar the first string
103 to the most similar word of the second string. The function searches in
104 the second string a most similar word not a most similar substring. The
105 range of the result is zero (indicating that the two strings are
106 completely dissimilar) to one (indicating that the first string is
107 identical to one of the words of the second string).
111 <entry><function>show_limit()</function><indexterm><primary>show_limit</primary></indexterm></entry>
112 <entry><type>real</type></entry>
114 Returns the current similarity threshold used by the <literal>%</>
115 operator. This sets the minimum similarity between
116 two words for them to be considered similar enough to
117 be misspellings of each other, for example
118 (<emphasis>deprecated</emphasis>).
122 <entry><function>set_limit(real)</function><indexterm><primary>set_limit</primary></indexterm></entry>
123 <entry><type>real</type></entry>
125 Sets the current similarity threshold that is used by the <literal>%</>
126 operator. The threshold must be between 0 and 1 (default is 0.3).
127 Returns the same value passed in (<emphasis>deprecated</emphasis>).
134 <table id="pgtrgm-op-table">
135 <title><filename>pg_trgm</filename> Operators</title>
139 <entry>Operator</entry>
140 <entry>Returns</entry>
141 <entry>Description</entry>
147 <entry><type>text</> <literal>%</literal> <type>text</></entry>
148 <entry><type>boolean</type></entry>
150 Returns <literal>true</> if its arguments have a similarity that is
151 greater than the current similarity threshold set by
152 <varname>pg_trgm.similarity_threshold</>.
156 <entry><type>text</> <literal><%</literal> <type>text</></entry>
157 <entry><type>boolean</type></entry>
159 Returns <literal>true</> if its first argument has the similar word in
160 the second argument and they have a similarity that is greater than the
161 current word similarity threshold set by
162 <varname>pg_trgm.word_similarity_threshold</> parameter.
166 <entry><type>text</> <literal>%></literal> <type>text</></entry>
167 <entry><type>boolean</type></entry>
169 Commutator of the <literal><%</> operator.
173 <entry><type>text</> <literal><-></literal> <type>text</></entry>
174 <entry><type>real</type></entry>
176 Returns the <quote>distance</> between the arguments, that is
177 one minus the <function>similarity()</> value.
182 <type>text</> <literal><<-></literal> <type>text</>
184 <entry><type>real</type></entry>
186 Returns the <quote>distance</> between the arguments, that is
187 one minus the <function>word_similarity()</> value.
192 <type>text</> <literal><->></literal> <type>text</>
194 <entry><type>real</type></entry>
196 Commutator of the <literal><<-></> operator.
205 <title>GUC Parameters</title>
208 <varlistentry id="guc-pgtrgm-similarity-threshold" xreflabel="pg_trgm.similarity_threshold">
210 <varname>pg_trgm.similarity_threshold</> (<type>real</type>)
212 <primary><varname>pg_trgm.similarity_threshold</> configuration parameter</primary>
217 Sets the current similarity threshold that is used by the <literal>%</>
218 operator. The threshold must be between 0 and 1 (default is 0.3).
222 <varlistentry id="guc-pgtrgm-word-similarity-threshold" xreflabel="pg_trgm.word_similarity_threshold">
224 <varname>pg_trgm.word_similarity_threshold</> (<type>real</type>)
227 <varname>pg_trgm.word_similarity_threshold</> configuration parameter
233 Sets the current word similarity threshold that is used by
234 <literal><%</> and <literal>%></> operators. The threshold
235 must be between 0 and 1 (default is 0.6).
243 <title>Index Support</title>
246 The <filename>pg_trgm</filename> module provides GiST and GIN index
247 operator classes that allow you to create an index over a text column for
248 the purpose of very fast similarity searches. These index types support
249 the above-described similarity operators, and additionally support
250 trigram-based index searches for <literal>LIKE</>, <literal>ILIKE</>,
251 <literal>~</> and <literal>~*</> queries. (These indexes do not
252 support equality nor simple comparison operators, so you may need a
253 regular B-tree index too.)
260 CREATE TABLE test_trgm (t text);
261 CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops);
265 CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);
270 At this point, you will have an index on the <structfield>t</> column that
271 you can use for similarity searching. A typical query is
273 SELECT t, similarity(t, '<replaceable>word</>') AS sml
275 WHERE t % '<replaceable>word</>'
276 ORDER BY sml DESC, t;
278 This will return all values in the text column that are sufficiently
279 similar to <replaceable>word</>, sorted from best match to worst. The
280 index will be used to make this a fast operation even over very large data
285 A variant of the above query is
287 SELECT t, t <-> '<replaceable>word</>' AS dist
289 ORDER BY dist LIMIT 10;
291 This can be implemented quite efficiently by GiST indexes, but not
292 by GIN indexes. It will usually beat the first formulation when only
293 a small number of the closest matches is wanted.
297 Also you can use an index on the <structfield>t</> column for word
298 similarity. For example:
300 SELECT t, word_similarity('<replaceable>word</>', t) AS sml
302 WHERE '<replaceable>word</>' <% t
303 ORDER BY sml DESC, t;
305 This will return all values in the text column that have a word
306 which sufficiently similar to <replaceable>word</>, sorted from best
307 match to worst. The index will be used to make this a fast operation
308 even over very large data sets.
312 A variant of the above query is
314 SELECT t, '<replaceable>word</>' <<-> t AS dist
316 ORDER BY dist LIMIT 10;
318 This can be implemented quite efficiently by GiST indexes, but not
324 Beginning in <productname>PostgreSQL</> 9.1, these index types also support
325 index searches for <literal>LIKE</> and <literal>ILIKE</>, for example
327 SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';
329 The index search works by extracting trigrams from the search string
330 and then looking these up in the index. The more trigrams in the search
331 string, the more effective the index search is. Unlike B-tree based
332 searches, the search string need not be left-anchored.
336 Beginning in <productname>PostgreSQL</> 9.3, these index types also support
337 index searches for regular-expression matches
338 (<literal>~</> and <literal>~*</> operators), for example
340 SELECT * FROM test_trgm WHERE t ~ '(foo|bar)';
342 The index search works by extracting trigrams from the regular expression
343 and then looking these up in the index. The more trigrams that can be
344 extracted from the regular expression, the more effective the index search
345 is. Unlike B-tree based searches, the search string need not be
350 For both <literal>LIKE</> and regular-expression searches, keep in mind
351 that a pattern with no extractable trigrams will degenerate to a full-index
356 The choice between GiST and GIN indexing depends on the relative
357 performance characteristics of GiST and GIN, which are discussed elsewhere.
362 <title>Text Search Integration</title>
365 Trigram matching is a very useful tool when used in conjunction
366 with a full text index. In particular it can help to recognize
367 misspelled input words that will not be matched directly by the
368 full text search mechanism.
372 The first step is to generate an auxiliary table containing all
373 the unique words in the documents:
376 CREATE TABLE words AS SELECT word FROM
377 ts_stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');
380 where <structname>documents</> is a table that has a text field
381 <structfield>bodytext</> that we wish to search. The reason for using
382 the <literal>simple</> configuration with the <function>to_tsvector</>
383 function, instead of using a language-specific configuration,
384 is that we want a list of the original (unstemmed) words.
388 Next, create a trigram index on the word column:
391 CREATE INDEX words_idx ON words USING GIN (word gin_trgm_ops);
394 Now, a <command>SELECT</command> query similar to the previous example can
395 be used to suggest spellings for misspelled words in user search terms.
396 A useful extra test is to require that the selected words are also of
397 similar length to the misspelled word.
402 Since the <structname>words</> table has been generated as a separate,
403 static table, it will need to be periodically regenerated so that
404 it remains reasonably up-to-date with the document collection.
405 Keeping it exactly current is usually unnecessary.
411 <title>References</title>
414 GiST Development Site
415 <ulink url="http://www.sai.msu.su/~megera/postgres/gist/"></ulink>
418 Tsearch2 Development Site
419 <ulink url="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/"></ulink>
424 <title>Authors</title>
427 Oleg Bartunov <email>oleg@sai.msu.su</email>, Moscow, Moscow University, Russia
430 Teodor Sigaev <email>teodor@sigaev.ru</email>, Moscow, Delta-Soft Ltd.,Russia
433 Alexander Korotkov <email>a.korotkov@postgrespro.ru</email>, Moscow, Postgres Professional, Russia
436 Documentation: Christopher Kings-Lynne
439 This module is sponsored by Delta-Soft Ltd., Moscow, Russia.