-<!-- $PostgreSQL: pgsql/doc/src/sgml/config.sgml,v 1.85 2006/09/08 15:55:52 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/config.sgml,v 1.86 2006/09/14 11:16:27 teodor Exp $ -->
<chapter Id="runtime-config">
<title>Server Configuration</title>
</para>
</listitem>
</varlistentry>
-
+
+ <varlistentry id="guc-gin-fuzzy-search-limit" xreflabel="gin_fuzzy_search_limit">
+ <term><varname>gin_fuzzy_search_limit</varname> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>gin_fuzzy_search_limit</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Soft upper limit on the size of the result set returned by a GIN index scan. For more
+ information see <xref linkend="gin-tips">.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
</sect1>
-<!-- $PostgreSQL: pgsql/doc/src/sgml/filelist.sgml,v 1.46 2006/09/05 03:09:56 momjian Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/filelist.sgml,v 1.47 2006/09/14 11:16:27 teodor Exp $ -->
<!entity history SYSTEM "history.sgml">
<!entity info SYSTEM "info.sgml">
<!entity catalogs SYSTEM "catalogs.sgml">
<!entity geqo SYSTEM "geqo.sgml">
<!entity gist SYSTEM "gist.sgml">
+<!entity gin SYSTEM "gin.sgml">
<!entity planstats SYSTEM "planstats.sgml">
<!entity indexam SYSTEM "indexam.sgml">
<!entity nls SYSTEM "nls.sgml">
-<!-- $PostgreSQL: pgsql/doc/src/sgml/geqo.sgml,v 1.36 2006/03/10 19:10:48 momjian Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/geqo.sgml,v 1.37 2006/09/14 11:16:27 teodor Exp $ -->
<chapter id="geqo">
<chapterinfo>
methods</firstterm> (e.g., nested loop, hash join, merge join in
<productname>PostgreSQL</productname>) to process individual joins
and a diversity of <firstterm>indexes</firstterm> (e.g.,
- B-tree, hash, GiST in <productname>PostgreSQL</productname>) as access
- paths for relations.
+ B-tree, hash, GiST and GIN in <productname>PostgreSQL</productname>) as
+ access paths for relations.
</para>
<para>
--- /dev/null
+<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.1 2006/09/14 11:16:27 teodor Exp $ -->
+
+<chapter id="GIN">
+<title>GIN Indexes</title>
+
+ <indexterm>
+ <primary>index</primary>
+ <secondary>GIN</secondary>
+ </indexterm>
+
+<sect1 id="gin-intro">
+ <title>Introduction</title>
+
+ <para>
+ <acronym>GIN</acronym> stands for Generalized Inverted Index. It is
+ an index structure storing a set of (key, posting list) pairs, where
+ a <quote>posting list</quote> is the set of rows in which the key occurs. A
+ single row may contain many keys.
+ </para>
+
+ <para>
+ It is generalized in the sense that a <acronym>GIN</acronym> index
+ does not need to be aware of the operation that it accelerates.
+ Instead, it uses custom strategies defined for particular data types.
+ </para>
+
+ <para>
+ One advantage of <acronym>GIN</acronym> is that it allows the development
+ of custom data types with the appropriate access methods, by
+ an expert in the domain of the data type, rather than a database expert.
+ This is much the same advantage as using <acronym>GiST</acronym>.
+ </para>
+
+ <para>
+ The <acronym>GIN</acronym>
+ implementation in <productname>PostgreSQL</productname> is primarily
+ maintained by Teodor Sigaev and Oleg Bartunov, and there is more
+ information on their
+ <ulink url="http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gin">website</ulink>.
+ </para>
+
+</sect1>
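The (key, posting list) structure described in the introduction can be illustrated with a small model. This is only a sketch in Python, not PostgreSQL code: the function name `build_inverted_index` and the sample rows are hypothetical, and row ids stand in for heap pointers.

```python
from collections import defaultdict


def build_inverted_index(rows):
    """Map each key to its posting list: the sorted ids of rows containing it.

    `rows` maps a row id to the list of keys found in that row.
    """
    postings = defaultdict(set)
    for row_id, keys in rows.items():
        for key in keys:
            postings[key].add(row_id)
    # Keys are kept in sorted order, loosely mirroring a B-tree over keys.
    return {key: sorted(ids) for key, ids in sorted(postings.items())}


# Three rows, each holding an integer array; one row contributes several keys.
rows = {1: [10, 20], 2: [20, 30], 3: [10, 20, 30]}
index = build_inverted_index(rows)
```

Each row appears in as many posting lists as it has distinct keys, which is why a single table row can cause several index insertions.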
+
+<sect1 id="gin-extensibility">
+ <title>Extensibility</title>
+
+ <para>
+ The <acronym>GIN</acronym> interface has a high level of abstraction,
+ requiring the access method implementer to only implement the semantics of
+ the data type being accessed. The <acronym>GIN</acronym> layer itself
+ takes care of concurrency, logging and searching the tree structure.
+ </para>
+
+ <para>
+ All it takes to get a <acronym>GIN</acronym> access method working
+ is to implement four user-defined methods, which define the behavior of
+ keys in the tree. In short, <acronym>GIN</acronym> combines extensibility
+ with generality, code reuse, and a clean interface.
+ </para>
+
+</sect1>
+
+<sect1 id="gin-implementation">
+ <title>Implementation</title>
+
+ <para>
+ Internally, <acronym>GIN</acronym> consists of a B-tree index constructed
+ over keys, where each key is an element of the indexed value
+ (an element of an array, for example) and where each tuple in a leaf page is
+ either a pointer to a B-tree over heap pointers (PT, posting tree), or a
+ list of heap pointers (PL, posting list) if the list is small enough.
+ </para>
+
+ <para>
+ There are four methods that an index operator class for
+ <acronym>GIN</acronym> must provide (prototypes are in pseudocode):
+ </para>
+
+ <variablelist>
+ <varlistentry>
+ <term>int compare( Datum a, Datum b )</term>
+ <listitem>
+ <para>
+ Compares keys (not indexed values!) and returns an integer less than
+ zero, zero, or greater than zero, indicating whether the first key is
+ less than, equal to, or greater than the second.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Datum* extractValue(Datum inputValue, uint32 *nkeys)</term>
+ <listitem>
+ <para>
+ Returns an array of keys extracted from the value to be indexed. The number
+ of returned keys must be stored into nkeys.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Datum* extractQuery(Datum query, uint32 *nkeys,
+ StrategyNumber n)</term>
+ <listitem>
+ <para>
+ Returns an array of keys extracted from the query to be executed; the number
+ of returned keys must be stored into nkeys. n is the
+ strategy number of the operation (see <xref linkend="xindex-strategies">).
+ Depending on n, the query may be of a different type.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>bool consistent( bool check[], StrategyNumber n, Datum query)</term>
+ <listitem>
+ <para>
+ Returns TRUE if the indexed value satisfies the query qualifier with strategy n
+ (or may satisfy it, if the operator is marked RECHECK in the operator class).
+ Each element of the check array is TRUE if the indexed value contains the
+ corresponding query key: if (check[i] == TRUE) the i-th key of
+ the query is present in the indexed value.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+</sect1>
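As a rough illustration of how the four methods fit together, here is a Python model for integer arrays. It is a sketch under assumptions, not the C implementation: the strategy numbers follow the array strategy table in this patch (overlap 1, contains 2, is contained by 3, equal 4), the helper `candidate` is hypothetical, and returning the key count alongside the keys stands in for the nkeys output argument.

```python
OVERLAP, CONTAINS, CONTAINED_BY, EQUAL = 1, 2, 3, 4


def compare(a, b):
    """Three-way comparison of two keys: negative, zero, or positive."""
    return (a > b) - (a < b)


def extract_value(value):
    """Return the distinct keys of an indexed array and their count."""
    keys = sorted(set(value))
    return keys, len(keys)


def extract_query(query, strategy):
    """Return the distinct keys of a query array and their count."""
    keys = sorted(set(query))
    return keys, len(keys)


def consistent(check, strategy, query):
    """Decide from check[] whether the value is a (possible) match."""
    if strategy in (OVERLAP, CONTAINED_BY):
        # One shared key is enough; "contained by" still needs a recheck.
        return any(check)
    if strategy in (CONTAINS, EQUAL):
        # Every query key must be present; "equal" still needs a recheck.
        return all(check)
    raise ValueError("unknown strategy number")


def candidate(value, strategy, query):
    """Hypothetical driver: build check[] by key equality, then ask consistent."""
    value_keys, _ = extract_value(value)
    query_keys, _ = extract_query(query, strategy)
    check = [any(compare(q, v) == 0 for v in value_keys) for q in query_keys]
    return consistent(check, strategy, query)
```

For example, `candidate([1, 2, 3], CONTAINS, [1, 3])` is true because both query keys are present, while `candidate([1, 2, 3], EQUAL, [1, 2])` is also true at the index level and would only be rejected by a later recheck against the actual row.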
+
+<sect1 id="gin-tips">
+<title>GIN Tips and Tricks</title>
+
+ <variablelist>
+ <varlistentry>
+ <term>Create vs insert</term>
+ <listitem>
+ <para>
+ In most cases, insertion into a <acronym>GIN</acronym> index is slow because
+ many GIN keys may be inserted for each table row. So, when loading data
+ in bulk, it may be useful to drop the index and recreate it
+ after the data is loaded into the table.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>gin_fuzzy_search_limit</term>
+ <listitem>
+ <para>
+ The primary goal of developing <acronym>GIN</acronym> indexes was
+ to support highly scalable full-text search in
+ <productname>PostgreSQL</productname>, and a full-text search often
+ returns a very large set of results. Since reading
+ tuples from the disk and sorting them can take a lot of time, this is
+ unacceptable in production. (Note that the index search itself is very
+ fast.)
+ </para>
+ <para>
+ Such queries usually contain very frequent words, so the results are not
+ very helpful. To facilitate execution of such queries,
+ <acronym>GIN</acronym> has a configurable soft upper limit on the size
+ of the returned set, determined by the
+ <varname>gin_fuzzy_search_limit</varname> configuration parameter. It is set to 0 by
+ default (no limit).
+ </para>
+ <para>
+ If a non-zero search limit is set, then the returned set is a subset of
+ the whole result set, chosen at random.
+ </para>
+ <para>
+ "Soft" means that the actual number of returned results could differ
+ slightly from the specified limit, depending on the query and the quality
+ of the system's random number generator.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+</sect1>
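The idea of a soft limit with a randomly chosen subset can be sketched as follows. This Python model is an assumption made for illustration and is not the algorithm PostgreSQL uses: the function `apply_fuzzy_limit` is hypothetical, and it keeps each match independently with probability limit / total, so the result size only approximates the limit.

```python
import random


def apply_fuzzy_limit(matches, limit, rng=None):
    """Return a random subset of matches whose size is roughly `limit`.

    A limit of 0 means no limit, mirroring the default described above.
    """
    if rng is None:
        rng = random.Random()
    if limit <= 0 or len(matches) <= limit:
        return list(matches)
    keep_probability = limit / len(matches)
    # Each match is kept independently, so the final count is not exact.
    return [m for m in matches if rng.random() < keep_probability]


all_rows = list(range(10000))
subset = apply_fuzzy_limit(all_rows, 1000, random.Random(42))
```

Because the selection is probabilistic, the subset size varies around the requested limit, which is the sense in which the limit is soft.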
+
+<sect1 id="gin-limit">
+ <title>Limitations</title>
+
+ <para>
+ <acronym>GIN</acronym> doesn't support full index scans because they would be
+ extremely inefficient: since there are often many keys per value,
+ each heap pointer would be returned several times.
+ </para>
+
+ <para>
+ When extractQuery returns zero keys, <acronym>GIN</acronym> will
+ emit an error. The meaning of an empty query differs between operator
+ classes and strategies (for example, any array contains the empty array,
+ but no array overlaps the empty one), so <acronym>GIN</acronym> cannot
+ choose a reasonable answer.
+ </para>
+
+ <para>
+ <acronym>GIN</acronym> searches keys only by equality matching. This may
+ be improved in the future.
+ </para>
+</sect1>
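Two of these limitations, the error on an empty query and equality-only key matching, can be shown with a small Python model. It is a hedged sketch rather than PostgreSQL behavior in detail: the function `search_contains`, the sample index, and the use of ValueError are all assumptions made for illustration.

```python
def search_contains(index, query_keys):
    """Return ids of rows containing every query key.

    `index` maps a key to its posting list (a sorted list of row ids).
    """
    if not query_keys:
        # No keys means no posting list to start from; answering would
        # require scanning the whole index, which is not supported.
        raise ValueError("query yields no keys")
    result = None
    for key in query_keys:
        # Keys are looked up by equality only; no range or prefix matching.
        posting = set(index.get(key, ()))
        result = posting if result is None else result & posting
    return sorted(result)


index = {10: [1, 3], 20: [1, 2, 3], 30: [2, 3]}
```

Calling `search_contains(index, [10, 30])` intersects two posting lists, while `search_contains(index, [])` raises an error instead of guessing what an empty query should match.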
+<sect1 id="gin-examples">
+ <title>Examples</title>
+
+ <para>
+ The <productname>PostgreSQL</productname> source distribution includes
+ <acronym>GIN</acronym> classes for one-dimensional arrays of all internal
+ types. The following
+ <filename>contrib</> modules also contain <acronym>GIN</acronym>
+ operator classes:
+ </para>
+
+ <variablelist>
+ <varlistentry>
+ <term>intarray</term>
+ <listitem>
+ <para>Enhanced support for int4[]</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>tsearch2</term>
+ <listitem>
+ <para>Support for inverted text indexing. This is much faster for very
+ large, mostly-static sets of documents.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+</sect1>
+
+</chapter>
-<!-- $PostgreSQL: pgsql/doc/src/sgml/indices.sgml,v 1.61 2006/09/13 23:42:26 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/indices.sgml,v 1.62 2006/09/14 11:16:27 teodor Exp $ -->
<chapter id="indexes">
<title id="indexes-title">Indexes</title>
<para>
<productname>PostgreSQL</productname> provides several index types:
- B-tree, Hash, and GiST. Each index type uses a different
+ B-tree, Hash, GIN and GiST. Each index type uses a different
algorithm that is best suited to different types of queries.
By default, the <command>CREATE INDEX</command> command will create a
B-tree index, which fits the most common situations.
classes are available in the <literal>contrib</> collection or as separate
projects. For more information see <xref linkend="GiST">.
</para>
+ <para>
+ <indexterm>
+ <primary>index</primary>
+ <secondary>GIN</secondary>
+ </indexterm>
+ <indexterm>
+ <primary>GIN</primary>
+ <see>index</see>
+ </indexterm>
+ GIN is an inverted index that can handle values containing more
+ than one key, arrays for example. Like GiST, GIN can support
+ many different user-defined indexing strategies, and the particular
+ operators with which a GIN index can be used vary depending on the
+ indexing strategy.
+ As an example, the standard distribution of
+ <productname>PostgreSQL</productname> includes GIN operator classes
+ for one-dimensional arrays, which support indexed
+ queries using these operators:
+
+ <simplelist>
+ <member><literal>&lt;@</literal></member>
+ <member><literal>@&gt;</literal></member>
+ <member><literal>=</literal></member>
+ <member><literal>&amp;&amp;</literal></member>
+ </simplelist>
+
+ (See <xref linkend="functions-array"> for the meaning of
+ these operators.)
+ Other GIN operator classes are available in the <literal>contrib</>
+ tsearch2 and intarray modules. For more information see <xref linkend="GIN">.
+ </para>
</sect1>
-<!-- $PostgreSQL: pgsql/doc/src/sgml/mvcc.sgml,v 2.58 2006/09/03 01:59:09 momjian Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/mvcc.sgml,v 2.59 2006/09/14 11:16:27 teodor Exp $ -->
<chapter id="mvcc">
<title>Concurrency Control</title>
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term>
+ <acronym>GIN</acronym> indexes
+ </term>
+ <listitem>
+ <para>
+ Short-term share/exclusive page-level locks are used for
+ read/write access. Locks are released immediately after each
+ index row is fetched or inserted. However, note that a GIN index
+ usually requires several inserts for each table row.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
applications; since they also have more features than hash
indexes, they are the recommended index type for concurrent
applications that need to index scalar data. When dealing with
- non-scalar data, B-trees are not useful, and GiST indexes should
+ non-scalar data, B-trees are not useful, and GiST or GIN indexes should
be used instead.
</para>
</sect1>
<!--
-$PostgreSQL: pgsql/doc/src/sgml/ref/create_opclass.sgml,v 1.15 2006/09/10 17:36:52 tgl Exp $
+$PostgreSQL: pgsql/doc/src/sgml/ref/create_opclass.sgml,v 1.16 2006/09/14 11:16:27 teodor Exp $
PostgreSQL documentation
-->
<para>
The data type actually stored in the index. Normally this is
the same as the column data type, but some index methods
- (only GiST at this writing) allow it to be different. The
+ (GIN and GiST for now) allow it to be different. The
<literal>STORAGE</> clause must be omitted unless the index
method allows a different type to be used.
</para>
-<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.45 2006/09/05 03:09:56 momjian Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.46 2006/09/14 11:16:27 teodor Exp $ -->
<sect1 id="xindex">
<title>Interfacing Extensions To Indexes</title>
</tgroup>
</table>
+ <para>
+ GIN indexes are similar to GiST indexes in flexibility: they don't have a fixed set
+ of strategies. Instead, the <quote>consistency</> support routine
+ interprets the strategy numbers according to the operator class
+ definition. As an example, the strategies of the operator class over arrays
+ are shown in <xref linkend="xindex-gin-array-strat-table">.
+ </para>
+
+ <table tocentry="1" id="xindex-gin-array-strat-table">
+ <title>GIN Array Strategies</title>
+ <tgroup cols="2">
+ <thead>
+ <row>
+ <entry>Operation</entry>
+ <entry>Strategy Number</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>overlap</entry>
+ <entry>1</entry>
+ </row>
+ <row>
+ <entry>contains</entry>
+ <entry>2</entry>
+ </row>
+ <row>
+ <entry>is contained by</entry>
+ <entry>3</entry>
+ </row>
+ <row>
+ <entry>equal</entry>
+ <entry>4</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
<para>
Note that all strategy operators return Boolean values. In
practice, all operators defined as index method strategies must
</thead>
<tbody>
<row>
- <entry>consistent</entry>
+ <entry>consistent - determine whether key satisfies the
+ query qualifier</entry>
<entry>1</entry>
</row>
<row>
- <entry>union</entry>
+ <entry>union - compute union of a set of given keys</entry>
<entry>2</entry>
</row>
<row>
- <entry>compress</entry>
+ <entry>compress - compute a compressed representation of a key or value
+ to be indexed</entry>
<entry>3</entry>
</row>
<row>
- <entry>decompress</entry>
+ <entry>decompress - compute a decompressed representation of a
+ compressed key </entry>
<entry>4</entry>
</row>
<row>
- <entry>penalty</entry>
+ <entry>penalty - compute penalty for inserting new key into subtree
+ with given subtree's key</entry>
<entry>5</entry>
</row>
<row>
- <entry>picksplit</entry>
+ <entry>picksplit - determine which entries of a page are to be moved
+ to the new page and compute the union keys for resulting pages </entry>
<entry>6</entry>
</row>
<row>
- <entry>equal</entry>
+ <entry>equal - compare two keys and return true if they are equal
+ </entry>
<entry>7</entry>
</row>
</tbody>
</tgroup>
</table>
+ <para>
+ GIN indexes require four support functions,
+ shown in <xref linkend="xindex-gin-support-table">.
+ </para>
+
+ <table tocentry="1" id="xindex-gin-support-table">
+ <title>GIN Support Functions</title>
+ <tgroup cols="2">
+ <thead>
+ <row>
+ <entry>Function</entry>
+ <entry>Support Number</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>
+ compare - Compare two keys and return an integer less than zero, zero, or
+ greater than zero, indicating whether the first key is less than, equal to,
+ or greater than the second.
+ </entry>
+ <entry>1</entry>
+ </row>
+ <row>
+ <entry>extractValue - extract keys from value to be indexed</entry>
+ <entry>2</entry>
+ </row>
+ <row>
+ <entry>extractQuery - extract keys from query</entry>
+ <entry>3</entry>
+ </row>
+ <row>
+ <entry>consistent - determine whether value matches the
+ query</entry>
+ <entry>4</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
<para>
Unlike strategy operators, support functions return whichever data
type the particular index method expects; for example in the case